    Worldwide Bureaucracy Indicators

    This week we’re looking at the Worldwide Bureaucracy Indicators (WWBI) dataset from the World Bank.

    The Worldwide Bureaucracy Indicators (WWBI) database is a unique cross-national dataset on public sector employment and wages that aims to fill an information gap, thereby helping researchers, development practitioners, and policymakers gain a better understanding of the personnel dimensions of state capability, the footprint of the public sector within the overall labor market, and the fiscal implications of the public sector wage bill. The dataset is derived from administrative data and household surveys, thereby complementing existing, expert perception-based approaches.

    The World Bank introduced the dataset with a series of four blog posts:

    • blog1
    • blog2
    • blog3
    • blog4

    Can you replicate the figures in the blogs? Can you display any of the data more clearly than in the blogs?

    The Data

    # Option 1: tidytuesdayR package 
    ## install.packages("tidytuesdayR")
    
    tuesdata <- tidytuesdayR::tt_load('2024-04-30')
    ## OR
    tuesdata <- tidytuesdayR::tt_load(2024, week = 18)
    
    wwbi_data <- tuesdata$wwbi_data
    wwbi_series <- tuesdata$wwbi_series
    wwbi_country <- tuesdata$wwbi_country
    
    
    # Option 2: Read directly from GitHub
    
    wwbi_data <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-04-30/wwbi_data.csv')
    wwbi_series <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-04-30/wwbi_series.csv')
    wwbi_country <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-04-30/wwbi_country.csv')

    How to Participate

    • Explore the data, watching out for interesting relationships. We would like to emphasize that you should not draw conclusions about causation in the data. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our suggestion is to use the data provided to practice your data tidying and plotting techniques, and to consider for yourself what nuances might underlie these relationships.
    • Create a visualization, a model, a shiny app, or some other piece of data-science-related output, using R or another programming language.
    • Share your output and the code used to generate it on social media with the #TidyTuesday hashtag.
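
    As one way to start exploring, the sketch below summarises a single indicator by year. The rows are made up to mimic the columns of wwbi_data, and the codes "IND.ONE" and "IND.TWO" are hypothetical placeholders, not real WWBI indicator codes.

```r
library(dplyr)

# Made-up rows shaped like wwbi_data (not real WWBI values).
toy_data <- tibble::tribble(
  ~country_code, ~indicator_code, ~year, ~value,
  "AAA",         "IND.ONE",       2010,  0.20,
  "AAA",         "IND.ONE",       2015,  0.30,
  "BBB",         "IND.ONE",       2010,  0.40,
  "BBB",         "IND.ONE",       2015,  0.60,
  "BBB",         "IND.TWO",       2015,  9.00
)

# Keep one indicator, then count countries and take the median per year.
yearly_summary <- toy_data |>
  filter(indicator_code == "IND.ONE") |>
  group_by(year) |>
  summarise(
    n_countries = n_distinct(country_code),
    median_value = median(value),
    .groups = "drop"
  )

print(yearly_summary)
```

    Replacing toy_data with wwbi_data and "IND.ONE" with a code taken from wwbi_series should give the same kind of summary on the real data. It describes the data only and says nothing about causation.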

    Data Dictionary

    wwbi_data.csv

    | variable | class | description |
    |---|---|---|
    | country_code | character | 3-letter ISO 3166-1 code |
    | indicator_code | character | code identifying the indicator of bureaucracy |
    | year | numeric | year of the data |
    | value | numeric | numeric value of the data |
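
    Because wwbi_data stores one row per country, indicator, and year, comparing two indicators side by side usually means reshaping to a wide layout first. A minimal sketch with made-up rows and hypothetical indicator codes:

```r
library(tidyr)

# Made-up rows shaped like wwbi_data (not real WWBI values or codes).
toy_data <- tibble::tribble(
  ~country_code, ~indicator_code, ~year, ~value,
  "AAA",         "IND.ONE",       2010,  0.20,
  "AAA",         "IND.TWO",       2010,  8.50,
  "BBB",         "IND.ONE",       2010,  0.40
)

# One column per indicator; combinations that were never reported become NA.
toy_wide <- toy_data |>
  pivot_wider(names_from = indicator_code, values_from = value)

print(toy_wide)
```

    Here country "BBB" has no "IND.TWO" row, so that cell is NA in the wide table.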

    wwbi_series.csv

    | variable | class | description |
    |---|---|---|
    | indicator_code | character | code identifying the indicator of bureaucracy |
    | indicator_name | character | name of the indicator |
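
    Since wwbi_series maps each indicator_code to a readable indicator_name, one plausible workflow is to search the names for a keyword and then pull the matching rows from wwbi_data. The sketch below uses made-up tables; the codes, names, and the keyword "wage" are illustrative assumptions.

```r
library(dplyr)

# Made-up lookup table shaped like wwbi_series (hypothetical codes and names).
toy_series <- tibble::tribble(
  ~indicator_code, ~indicator_name,
  "IND.ONE",       "Hypothetical public sector employment share",
  "IND.TWO",       "Hypothetical wage bill as a share of GDP"
)

# Made-up rows shaped like wwbi_data.
toy_data <- tibble::tribble(
  ~country_code, ~indicator_code, ~year, ~value,
  "AAA",         "IND.ONE",       2010,  0.20,
  "AAA",         "IND.TWO",       2010,  8.50,
  "BBB",         "IND.ONE",       2010,  0.40
)

# Find indicator codes whose name mentions "wage" (case-insensitive).
wage_codes <- toy_series |>
  filter(stringr::str_detect(
    indicator_name,
    stringr::regex("wage", ignore_case = TRUE)
  )) |>
  pull(indicator_code)

# Keep matching observations and attach the readable name.
wage_rows <- toy_data |>
  filter(indicator_code %in% wage_codes) |>
  left_join(toy_series, by = "indicator_code")

print(wage_rows)
```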

    wwbi_country.csv

    | variable | class | description |
    |---|---|---|
    | country_code | character | 3-letter ISO 3166-1 code |
    | short_name | character | short or common name for the country |
    | table_name | character | more alphabetically sortable name of the country |
    | long_name | character | full name of the country |
    | x2_alpha_code | character | 2-letter ISO 3166-1 code |
    | currency_unit | character | currency unit |
    | special_notes | character | special notes |
    | region | character | region |
    | income_group | character | low, lower middle, upper middle, or high income |
    | wb_2_code | character | alternate 2-letter code |
    | national_accounts_base_year | integer | national accounts base year |
    | national_accounts_reference_year | integer | national accounts reference year |
    | sna_price_valuation | character | UN System of National Accounts price valuation |
    | lending_category | character | International Development Association (IDA), International Bank for Reconstruction and Development (IBRD), a blend, or neither |
    | other_groups | character | Heavily Indebted Poor Countries initiative (HIPC), or countries classified as the “Euro area” |
    | system_of_national_accounts | integer | which System of National Accounts methodology the country uses (1968, 1993, or 2008 version) |
    | balance_of_payments_manual_in_use | character | the version of the Balance of Payments Manual used by the country |
    | external_debt_reporting_status | character | estimate, preliminary, or actual |
    | system_of_trade | character | Under the general system, imports include goods imported for domestic consumption and imports into bonded warehouses and free trade zones. Under the special system, imports comprise goods imported for domestic consumption (including transformation and repair) and withdrawals for domestic consumption from bonded warehouses and free trade zones. Goods transported through a country en route to another are excluded. |
    | government_accounting_concept | character | government accounting concept |
    | imf_data_dissemination_standard | character | International Monetary Fund data-dissemination standard: Special Data Dissemination Standard (SDDS, 1996, created for countries that have or seek to have access to international markets), SDDS Plus (2012, the highest tier of data standards, intended for systemically important economies), or enhanced GDDS (e-GDDS, 2015, encouraging participants to emphasize data publication) |
    | latest_household_survey | character | which household survey was most recently administered |
    | source_of_most_recent_income_and_expenditure_data | character | which survey serves as the basis for income and expenditure data |
    | vital_registration_complete | logical | whether the vital registration is complete |
    | latest_agricultural_census | integer | year of latest agricultural census |
    | latest_industrial_data | integer | year of latest industrial data |
    | latest_trade_data | integer | year of latest trade data |
    | latest_population_census_year | integer | year of latest population census |
    | latest_population_census_notes | character | notes about latest population census |
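
    The three files share keys: wwbi_data links to wwbi_country through country_code. The sketch below shows one way to attach country metadata and summarise by income_group, using made-up rows. The country codes, names, and income labels are illustrative assumptions, and the real files may spell the income groups differently.

```r
library(dplyr)

# Made-up metadata shaped like a few columns of wwbi_country.
toy_country <- tibble::tribble(
  ~country_code, ~short_name,  ~income_group,
  "AAA",         "Country A",  "High income",
  "BBB",         "Country B",  "Low income",
  "CCC",         "Country C",  "Low income"
)

# Made-up observations shaped like wwbi_data; "ZZZ" has no metadata row.
toy_data <- tibble::tribble(
  ~country_code, ~indicator_code, ~year, ~value,
  "AAA",         "IND.ONE",       2015,  0.30,
  "BBB",         "IND.ONE",       2015,  0.10,
  "CCC",         "IND.ONE",       2015,  0.20,
  "ZZZ",         "IND.ONE",       2015,  0.90
)

# A left join keeps every observation, even when no metadata matches.
joined <- toy_data |>
  left_join(toy_country, by = "country_code")

# Codes without metadata are worth inspecting before they are dropped.
unmatched <- joined |>
  filter(is.na(income_group)) |>
  distinct(country_code)

# Mean value per income group, ignoring unmatched codes.
by_income <- joined |>
  filter(!is.na(income_group)) |>
  group_by(income_group) |>
  summarise(mean_value = mean(value), .groups = "drop")

print(unmatched)
print(by_income)
```

    A left join is used here so that codes missing from the country table stay visible rather than disappearing silently.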

    Cleaning Script

    library(tidyverse)
    library(janitor)
    library(here)
    library(fs)
    library(withr)
    
    working_dir <- here::here("data", "2024", "2024-04-30")
    
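    # The World Bank publishes the WWBI tables as one zipped bundle of CSVs.
    url <- "https://databank.worldbank.org/data/download/WWBI_CSV.zip"
    
    # Download to a temporary file; mode = "wb" keeps the zip intact on Windows.
    file_path <- withr::local_tempfile(fileext = ".zip")
    download.file(url, file_path, mode = "wb")
    
    # Unpack into a temporary directory managed by withr.
    extract_dir <- withr::local_tempdir("csvs")
    unzip(file_path, exdir = extract_dir)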
    
    wwbi_country <- readr::read_csv(
      fs::path(extract_dir, "WWBICountry.csv")
    ) |> 
      janitor::clean_names() |> 
      janitor::remove_empty("cols") |> 
      dplyr::mutate(
        # Several columns are years, make them integers
        national_accounts_reference_year = as.integer(national_accounts_reference_year),
        latest_industrial_data = as.integer(latest_industrial_data),
        latest_trade_data = as.integer(latest_trade_data),
        latest_population_census_year = as.integer(stringr::str_extract(
          latest_population_census,
          "^\\d{4}"
        )),
        latest_agricultural_census = as.integer(stringr::str_extract(
          latest_agricultural_census,
          "^\\d{4}"
        )),
        national_accounts_base_year = as.integer(stringr::str_extract(
          national_accounts_base_year,
          "^\\d{4}"
        )),
        system_of_national_accounts = as.integer(stringr::str_extract(
          system_of_national_accounts,
          "\\d{4}"
        )),
        latest_population_census_notes = stringr::str_remove(
          latest_population_census,
          "^\\d{4}\\.?\\s*"
        ),
        latest_population_census_notes = dplyr::na_if(
          latest_population_census_notes,
          ""
        ),
        # vital_registration_complete is either "yes" or "NA"
        vital_registration_complete = !is.na(vital_registration_complete) 
      ) |> 
      dplyr::select(-"latest_population_census")
    
    wwbi_series <- readr::read_csv(
      fs::path(extract_dir, "WWBISeries.csv"),
      col_types = paste(rep("c", 21), collapse = "")
    ) |> 
      janitor::clean_names() |> 
      janitor::remove_empty("cols") |> 
      dplyr::rename(indicator_code = "series_code")
    
    wwbi_data <- readr::read_csv(
      fs::path(extract_dir, "WWBIData.csv"),
      col_types = paste(c(rep("c", 4), rep("d", 21), "c"), collapse = "")
    ) |> 
      janitor::clean_names() |> 
      # indicator_name and country_name are redundant.
      dplyr::select(-"indicator_name", -"country_name") |> 
      janitor::remove_empty("cols") |> 
      tidyr::pivot_longer(
        cols = -c(country_code, indicator_code),
        names_to = "year",
        names_transform = ~ as.integer(stringr::str_remove(.x, "x")),
        values_to = "value"
      ) |> 
      dplyr::filter(!is.na(value))
    
    readr::write_csv(
      wwbi_data,
      fs::path(working_dir, "wwbi_data.csv")
    )
    readr::write_csv(
      wwbi_series,
      fs::path(working_dir, "wwbi_series.csv")
    )
    readr::write_csv(
      wwbi_country,
      fs::path(working_dir, "wwbi_country.csv")
    )
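
    The script splits the raw latest_population_census field into a year column and a notes column using two regular expressions. The sketch below applies the same patterns to made-up strings; the example values are assumptions about what the raw field might look like, not entries copied from the World Bank file.

```r
library(stringr)

# Made-up raw values: a bare year, a year followed by notes, notes only, missing.
raw_census <- c(
  "2011",
  "2020. Population figures are estimates.",
  "Rolling census",
  NA
)

# A leading four-digit run becomes the year; anything else yields NA.
census_year <- as.integer(str_extract(raw_census, "^\\d{4}"))

# Strip the leading year, an optional period, and whitespace to leave the notes.
census_notes <- str_remove(raw_census, "^\\d{4}\\.?\\s*")

# Entries that held only a year are now empty strings; store them as NA.
census_notes <- dplyr::na_if(census_notes, "")

print(data.frame(raw_census, census_year, census_notes))
```

    Values without a leading year keep their full text in the notes column and get NA for the year.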