    R Package Structure

    Happy Boxing Day! While you’re dealing with your physical packages, we’re looking into R packages!

    The dataset this week comes from “Historical Trends in R Package Structure and Interdependency on CRAN” by Mark Padgham and Noam Ross. In that paper, they use the {pkgstats} R package to analyze the structure of R packages over time, using an archive of all packages on CRAN as of 2022-11-22. We’ve provided CSV versions of two of the datasets from that paper.

    The paper focuses on package characteristics over time. It might be interesting to look at the distribution of similar features (such as lines of code) across packages.
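
As a minimal sketch of that idea, the snippet below summarizes the distribution of `loc_R` (lines of code in the /R directory) across packages. It runs on a made-up stand-in tibble, `toy_cran`, whose package names and values are invented for illustration; with the real data you would use `cran_20221122` (loaded as described under The Data) instead.

```r
library(dplyr)

# Hypothetical stand-in for cran_20221122; names and values are invented.
toy_cran <- tibble::tibble(
  package = c("pkgA", "pkgB", "pkgC", "pkgD", "pkgE"),
  loc_R   = c(120, 450, 80, 15000, 900)
)

# One-row summary of the lines-of-code distribution across packages.
loc_summary <- toy_cran |>
  summarise(
    n_packages = n(),
    loc_min    = min(loc_R),
    loc_median = median(loc_R),
    loc_mean   = mean(loc_R),
    loc_max    = max(loc_R)
  )

print(loc_summary)
```

A few very large packages can dominate the mean, so comparing the mean with the median is a quick first check before plotting.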

    If you’re unfamiliar with some of the terminology in this dataset, you might find the R Packages book by Hadley Wickham and Jennifer Bryan helpful.

    The Data

    # Option 1: tidytuesdayR package 
    ## install.packages("tidytuesdayR")
    
    tuesdata <- tidytuesdayR::tt_load('2023-12-26')
    ## OR
    tuesdata <- tidytuesdayR::tt_load(2023, week = 52)
    
    cran_20221122 <- tuesdata$cran_20221122
    external_calls <- tuesdata$external_calls
    internal_calls <- tuesdata$internal_calls
    
    # Option 2: Read directly from GitHub
    
    cran_20221122 <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2023/2023-12-26/cran_20221122.csv')
    external_calls <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2023/2023-12-26/external_calls.csv')
    internal_calls <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2023/2023-12-26/internal_calls.csv')

    If you would like to dive deeper, you can download the larger dataset with this code:

    cran_all_20221122 <- readr::read_rds("https://zenodo.org/records/7414296/files/pkgstats-CRAN-all.Rds?download=1")

    How to Participate

    • Explore the data, watching out for interesting relationships. We would like to emphasize that you should not draw conclusions about causation in the data. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our suggestion is to use the data provided to practice your data tidying and plotting techniques, and to consider for yourself what nuances might underlie these relationships.
    • Create a visualization, a model, a shiny app, or some other piece of data-science-related output, using R or another programming language.
    • Share your output and the code used to generate it on social media with the #TidyTuesday hashtag.

    Data Dictionary

    cran_20221122.csv

    variable class description
    package character The name of the package
    version character The package version
    date double The release date of that version of the package
    license character License information
    files_R double Number of files in the /R directory, where numbers are recursively counted in all sub-directories
    files_src double Number of files in the /src directory, where numbers are recursively counted in all sub-directories
    files_inst double Number of files in the /inst/include directory, where numbers are recursively counted in all sub-directories
    files_vignettes double Number of files in the /vignettes directory, where numbers are recursively counted in all sub-directories
    files_tests double Number of files in the /tests directory, where numbers are recursively counted in all sub-directories
    loc_R double Total lines of code across all files in the /R directory
    loc_src double Total lines of code across all files in the /src directory
    loc_inst double Total lines of code across all files in the /inst/include directory
    loc_vignettes double Total lines of code across all files in the /vignettes directory
    loc_tests double Total lines of code across all files in the /tests directory
    blank_lines_R double Total number of blank lines across all files in the /R directory
    blank_lines_src double Total number of blank lines across all files in the /src directory
    blank_lines_inst double Total number of blank lines across all files in the /inst directory
    blank_lines_vignettes double Total number of blank lines across all files in the /vignettes directory
    blank_lines_tests double Total number of blank lines across all files in the /tests directory
    comment_lines_R double Total number of comment lines across all files in the /R directory
    comment_lines_src double Total number of comment lines across all files in the /src directory
    comment_lines_inst double Total number of comment lines across all files in the /inst directory
    comment_lines_vignettes double Total number of comment lines across all files in the /vignettes directory
    comment_lines_tests double Total number of comment lines across all files in the /tests directory
    rel_space double Measure of relative white space across all files in the /R, /src, and /inst directories
    rel_space_R double Measure of relative white space across all files in the /R directory
    rel_space_src double Measure of relative white space across all files in the /src directory
    rel_space_inst double Measure of relative white space across all files in the /inst directory
    rel_space_vignettes double Measure of relative white space across all files in the /vignettes directory
    rel_space_tests double Measure of relative white space across all files in the /tests directory
    indentation double The number of spaces used to indent code, with values of -1 indicating indentation with tab characters
    nexpr double The median number of nested expressions per line of code, counting only those lines which have any expressions
    num_vignettes double Number of vignettes
    num_demos double Number of demos
    num_data_files double Number of data files
    data_size_total double Total size of all package data
    data_size_median double Median size of package data files
    translations character List of translations where package includes translations files, given as a comma-separated list of (spoken) language codes
    urls character Package URL(s)
    bugs character URL for BugReports
    desc_n_aut double Number of contributors with role of author
    desc_n_ctb double Number of contributors with role of contributor
    desc_n_fnd double Number of contributors with role of funder
    desc_n_rev double Number of contributors with role of reviewer
    desc_n_ths double Number of contributors with role of thesis advisor
    desc_n_trl double Number of contributors with role of translator (relating to translation between computer languages, not spoken languages)
    depends character Comma-separated character entries for all depends packages
    imports character Comma-separated character entries for all imports packages
    suggests character Comma-separated character entries for all suggests packages
    enhances character Comma-separated character entries for all enhances packages
    linking_to character Comma-separated character entries for all linking_to packages
    n_fns_r double Number of functions in R
    n_fns_r_exported double Number of exported R functions
    n_fns_r_not_exported double Number of non-exported R functions
    n_fns_src double Number of functions (or objects) in other computer languages, including functions in both src and inst/include directories
    n_fns_per_file_r double Number of functions (or objects) per individual file in /R
    n_fns_per_file_src double Number of functions (or objects) per individual file in source directories other than /R
    npars_exported_mn double Mean number of parameters per exported R function
    npars_exported_md double Median number of parameters per exported R function
    loc_per_fn_r_mn double Mean lines of code per function in /R
    loc_per_fn_r_md double Median lines of code per function in /R
    loc_per_fn_r_exp_mn double Mean lines of code per exported function in /R
    loc_per_fn_r_exp_md double Median lines of code per exported function in /R
    loc_per_fn_r_not_exp_mn double Mean lines of code per non-exported function in /R
    loc_per_fn_r_not_exp_md double Median lines of code per non-exported function in /R
    loc_per_fn_src_mn double Mean lines of code per function in other source directories
    loc_per_fn_src_md double Median lines of code per function in other source directories
    languages character Programming languages used in the package
    doclines_per_fn_exp_mn double Mean lines of documentation per exported function in /R
    doclines_per_fn_exp_md double Median lines of documentation per exported function in /R
    doclines_per_fn_not_exp_mn double Mean lines of documentation per non-exported function in /R
    doclines_per_fn_not_exp_md double Median lines of documentation per non-exported function in /R
    doclines_per_fn_src_mn double Mean lines of documentation per function in other source directories
    doclines_per_fn_src_md double Median lines of documentation per function in other source directories
    docchars_per_par_exp_mn double Mean number of documentation characters per parameter of exported R functions
    docchars_per_par_exp_md double Median number of documentation characters per parameter of exported R functions
    n_edges double Number of edges connecting functions (and other objects) across all languages in package code
    n_edges_r double Number of edges connecting R functions (and other objects)
    n_edges_src double Number of edges connecting functions (and other objects) in other languages
    n_clusters double Number of distinct clusters in package network
    centrality_dir_mn double Mean centrality of all network edges, calculated from directed representation of network
    centrality_dir_md double Median centrality of all network edges, calculated from directed representation of network
    centrality_dir_mn_no0 double Mean centrality of all network edges, calculated from directed representation of network, excluding edges with centrality of zero
    centrality_dir_md_no0 double Median centrality of all network edges, calculated from directed representation of network, excluding edges with centrality of zero
    centrality_undir_mn double Mean centrality of all network edges, calculated from undirected representation of network
    centrality_undir_md double Median centrality of all network edges, calculated from undirected representation of network
    centrality_undir_mn_no0 double Mean centrality of all network edges, calculated from undirected representation of network, excluding edges with centrality of zero
    centrality_undir_md_no0 double Median centrality of all network edges, calculated from undirected representation of network, excluding edges with centrality of zero
    num_terminal_edges_dir double Number of terminal edges, calculated from directed representation of network
    num_terminal_edges_undir double Number of terminal edges, calculated from undirected representation of network
    node_degree_mn double Mean node degree
    node_degree_md double Median node degree
    node_degree_max double Maximum node degree
    cpl_instability_pkg double Coupling instability, a measure of the extent to which packages depend on external functionality without other packages in turn depending on them
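
Columns such as depends, imports, suggests, enhances, and linking_to hold comma-separated lists, so they usually need reshaping to one row per package and dependency before counting. The sketch below shows one way to do that for `imports`. It uses a made-up tibble, `toy_cran`, and assumes entries are separated by commas with optional surrounding whitespace; check the real column before relying on that assumption.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for cran_20221122; names and values are invented.
toy_cran <- tibble::tibble(
  package = c("pkgA", "pkgB", "pkgC"),
  imports = c("dplyr, tidyr", "dplyr", NA)
)

# One row per (package, imported package) pair.
imports_long <- toy_cran |>
  select(package, imports) |>
  separate_longer_delim(imports, delim = ",") |>
  mutate(imports = trimws(imports)) |>  # drop whitespace left after the comma
  filter(!is.na(imports))               # packages with no imports

# How many packages import each dependency.
import_counts <- imports_long |>
  count(imports, sort = TRUE, name = "n_packages")

print(import_counts)
```

The same pattern should apply to the other dependency columns by swapping the column name.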

    external_calls.csv

    variable class description
    package_from character The package that makes the call
    package_to character The package that the source package calls
    n_total double The total number of calls from package_from to package_to
    n_unique double The number of unique calls from package_from to package_to
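
Because each row is one (package_from, package_to) pair, a natural first question is which packages are called most often. The sketch below aggregates by `package_to` on a made-up tibble, `toy_external_calls`, whose values are invented; the column names match the dictionary above, and the output names `n_callers` and `total_calls` are illustrative choices.

```r
library(dplyr)

# Hypothetical stand-in for external_calls; names and values are invented.
toy_external_calls <- tibble::tibble(
  package_from = c("pkgA", "pkgA", "pkgB", "pkgC"),
  package_to   = c("base", "stats", "base", "base"),
  n_total      = c(10, 4, 7, 3),
  n_unique     = c(5, 2, 3, 1)
)

# For each called package: how many packages call it, and how many calls in total.
most_called <- toy_external_calls |>
  group_by(package_to) |>
  summarise(
    n_callers   = n_distinct(package_from),
    total_calls = sum(n_total),
    .groups = "drop"
  ) |>
  arrange(desc(total_calls))

print(most_called)
```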

    internal_calls.csv

    variable class description
    package character The package being evaluated
    n_total double The total number of calls from functions in one file to functions in another file within the package
    n_unique double The number of unique calls from functions in one file to functions in another file within the package

    Cleaning Script

    The authors provided mostly [clean data](https://zenodo.org/records/7414296). We chose one of their datasets, lightly cleaned the data, and saved it as a CSV. We also split the external_calls column of that dataset into two files, one for calls to functions in other files in the same package (internal_calls.csv) and one for calls to functions in other packages (external_calls.csv).

    library(tidyverse)
    library(here)
    library(fs)
    
    working_dir <- here::here("data", "2023", "2023-12-26")
    
    cran_20221122_url <- "https://zenodo.org/records/7414296/files/pkgstats-CRAN-current.Rds?download=1"
    cran_20221122 <- readr::read_rds(cran_20221122_url) |>
      dplyr::ungroup() |>
      dplyr::mutate(
        dplyr::across(
          c(translations, depends:linking_to, languages, external_calls),
          \(x) {
            dplyr::na_if(x, "NA") |> 
              dplyr::na_if("")
          }
        )
      )
    dplyr::glimpse(cran_20221122)
    
    
    calls_20221122 <- cran_20221122 |>
      dplyr::select(package_from = package, external_calls) |>
      tidyr::separate_longer_delim(
        external_calls,
        ","
      ) |>
      # In at least one case, an extra "L:" prefix was picked up from a 1:10-style
      # range.
      dplyr::mutate(
        external_calls = stringr::str_remove(external_calls, "^L:")
      ) |> 
      tidyr::separate_wider_delim(
        external_calls,
        ":",
        names = c("package_to", "n_total", "n_unique")
      )
    
    cran_20221122$external_calls <- NULL
    
    external_calls <- calls_20221122 |> 
      dplyr::filter(package_from != package_to)
    internal_calls <- calls_20221122 |> 
      dplyr::filter(package_from == package_to) |> 
      dplyr::select(package = package_from, n_total, n_unique)
    
    readr::write_csv(
      cran_20221122,
      fs::path(working_dir, "cran_20221122.csv")
    )
    readr::write_csv(
      external_calls,
      fs::path(working_dir, "external_calls.csv")
    )
    readr::write_csv(
      internal_calls,
      fs::path(working_dir, "internal_calls.csv")
    )
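
To make the reshaping step easier to follow, here is a small sketch of the same split on a made-up input. The tibble `toy_cran` and its values are invented, and the `package:n_total:n_unique` layout of each comma-separated entry is inferred from the column names used in the script above. The `as.integer()` conversion is an addition for illustration: `separate_wider_delim()` returns character columns, and in the script above the counts only become numeric when the CSV is read back in.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the raw data; names and values are invented.
toy_cran <- tibble::tibble(
  package        = c("pkgA", "pkgB"),
  external_calls = c("base:10:4,pkgA:3:2,stats:2:1", "base:5:5")
)

toy_calls <- toy_cran |>
  select(package_from = package, external_calls) |>
  # One row per called package.
  separate_longer_delim(external_calls, delim = ",") |>
  # Split "package:n_total:n_unique" into three columns.
  separate_wider_delim(
    external_calls,
    delim = ":",
    names = c("package_to", "n_total", "n_unique")
  ) |>
  # Added for illustration: convert the counts from character to integer.
  mutate(across(c(n_total, n_unique), as.integer))

# Calls to other packages versus calls within the same package.
toy_external <- toy_calls |>
  filter(package_from != package_to)
toy_internal <- toy_calls |>
  filter(package_from == package_to) |>
  select(package = package_from, n_total, n_unique)

print(toy_external)
print(toy_internal)
```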