TidyTuesday

    Leap Day

    Happy Leap Day! This week’s data comes from the February 29 article on Wikipedia.

    February 29 is a leap day (or “leap year day”), an intercalary date added periodically to create leap years in the Julian and Gregorian calendars.

    One event that’s missing from Wikipedia’s list: R version 1.0 was released on February 29, 2000.

    Which cohort of leap day births is most represented in Wikipedia’s data? Are any years surprisingly underrepresented compared to nearby years? What other patterns can you find in the data?

    The Data

    # Option 1: tidytuesdayR package 
    ## install.packages("tidytuesdayR")
    
    tuesdata <- tidytuesdayR::tt_load('2024-02-27')
    ## OR
    tuesdata <- tidytuesdayR::tt_load(2024, week = 9)
    
    events <- tuesdata$events
    births <- tuesdata$births
    deaths <- tuesdata$deaths
    
    # Option 2: Read directly from GitHub
    
    events <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-02-27/events.csv')
    births <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-02-27/births.csv')
    deaths <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-02-27/deaths.csv')
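One possible starting point for the cohort questions above is to count rows per birth year and then list the leap years in the covered range that have no births at all. This is a minimal sketch on made-up rows: `births_demo` and its placeholder names are assumptions for illustration, not values from the dataset. Swap in the real `births` tibble to run it on the data. The leap-year test uses the Gregorian rule, so treat it as a rough guide for early years.

```r
library(dplyr)

# Illustrative rows only (hypothetical); the real `births` tibble shares
# the `year_birth` and `person` columns.
births_demo <- tibble::tibble(
  year_birth = c(1792L, 1960L, 1976L, 1976L, 1980L),
  person = c("Person A", "Person B", "Person C", "Person D", "Person E")
)

# Number of listed people per birth year, most represented first.
births_by_year <- births_demo |>
  count(year_birth, name = "n_people", sort = TRUE)

# Leap years within the covered range that have no listed births.
# Gregorian rule: divisible by 4, except centuries not divisible by 400.
all_years <- seq(min(births_demo$year_birth), max(births_demo$year_birth))
is_leap <- (all_years %% 4 == 0 & all_years %% 100 != 0) | all_years %% 400 == 0
missing_leap_years <- setdiff(all_years[is_leap], births_demo$year_birth)

births_by_year
head(missing_leap_years)
```

Comparing `births_by_year` against `missing_leap_years` gives a quick view of which cohorts are over- or underrepresented relative to their neighbors.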

    How to Participate

    • Explore the data, watching out for interesting relationships. We would like to emphasize that you should not draw conclusions about causation in the data. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our suggestion is to use the data provided to practice your data tidying and plotting techniques, and to consider for yourself what nuances might underlie these relationships.
    • Create a visualization, a model, a Shiny app, or some other piece of data-science-related output, using R or another programming language.
    • Share your output and the code used to generate it on social media with the #TidyTuesday hashtag.

    Data Dictionary

    events.csv

    variable     class      description
    year         integer    Year of the event.
    event        character  A short, free-text description of the event.

    births.csv

    variable     class      description
    year_birth   integer    Year in which this person was born.
    person       character  The name of the person.
    description  character  A short description of the person.
    year_death   integer    Year in which this person died.

    deaths.csv

    variable     class      description
    year_death   integer    Year in which this person died.
    person       character  The name of the person.
    description  character  A short description of the person.
    year_birth   integer    Year in which this person was born.
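Because both `year_birth` and `year_death` are recorded, the columns can be combined into an approximate age at death. The sketch below assumes the column layout from the data dictionary and uses made-up rows (`deaths_demo` and its placeholder values are hypothetical). Only years are stored, so the result can be off by one year, and rows with a missing year yield `NA`.

```r
library(dplyr)

# Illustrative rows only (hypothetical), shaped like deaths.csv.
deaths_demo <- tibble::tibble(
  year_death = c(1868L, 2012L, 1940L),
  person = c("Person A", "Person B", "Person C"),
  description = c("composer", "singer", NA),
  year_birth = c(1792L, 1945L, NA)
)

# Approximate age at death; stays NA where the birth year is unknown.
deaths_with_age <- deaths_demo |>
  mutate(approx_age = year_death - year_birth)

# Summarise while ignoring rows with a missing birth year.
age_summary <- deaths_with_age |>
  summarise(
    median_age = median(approx_age, na.rm = TRUE),
    n_missing_birth = sum(is.na(year_birth))
  )

age_summary
```

The same subtraction works on births.csv, where a missing `year_death` produces `NA` in the same way.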

    Cleaning Script

    library(tidyverse)
    library(rlang)
    library(rvest)
    library(here)
    
    working_dir <- here::here("data", "2024", "2024-02-27")
    
    feb29 <- "https://en.wikipedia.org/wiki/February_29"
    
    # Read the HTML once so we don't have to keep hitting it.
    feb29_html <- rvest::read_html(feb29)
    
    # Find the headers. We'll use these to figure out which bullets are "inside"
    # each header, since nothing "contains" them to make it easy.
    h2s <- feb29_html |> 
      rvest::html_elements("h2") |> 
      rvest::html_text2() |> 
      stringr::str_remove("\\[edit\\]")
    
    # We'll get all bullets that are after each header. We can then subtract out
    # later lists to figure out what's under a particular header.
    bullets_after_headers <- purrr::map(
      h2s,
      \(this_header) {
        this_selector <- glue::glue("h2:contains('{this_header}') ~ ul > li")
        feb29_html |> 
          rvest::html_elements(this_selector) |> 
          rvest::html_text2() |> 
          # Remove footnotes.
          stringr::str_remove_all("\\[\\d+\\]")
      }
    ) |> 
      rlang::set_names(h2s)
    
    # Subtract subsequent bullets from each set.
    bullets_in_headers <- purrr::map2(
      bullets_after_headers[-length(h2s)],
      bullets_after_headers[-1],
      setdiff
    )
    
    # The three sets we care about (Events, Births, Deaths) each have their own
    # format.
    events <- tibble::tibble(events = bullets_in_headers[["Events"]]) |> 
      tidyr::separate_wider_regex(
        "events",
        patterns = c(
          year = "^\\d+",
          " – ",
          event = ".*"
        )
      )
    births <- tibble::tibble(births = bullets_in_headers[["Births"]]) |> 
      tidyr::separate_wider_regex(
        "births",
        patterns = c(
          year_birth = "^\\d+",
          " – ",
          person = ".*"
        )
      ) |> 
      tidyr::separate_wider_regex(
        "person",
        patterns = c(
          person = "[^(]*",
          "\\(d\\. ",
          "(?:February 29, )*",
          year_death = "\\d+",
          "\\)\\.?"
        ),
        too_few = "align_start"
      ) |> 
      tidyr::separate_wider_regex(
        "person",
        patterns = c(
          person = "[^,]*",
          ", ",
          description = ".*"
        ),
        too_few = "align_start"
      )
    
    deaths <- tibble::tibble(deaths = bullets_in_headers[["Deaths"]]) |> 
      tidyr::separate_wider_regex(
        "deaths",
        patterns = c(
          year_death = "^\\d+",
          " – ",
          person = ".*"
        )
      ) |> 
      tidyr::separate_wider_regex(
        "person",
        patterns = c(
          person = "[^(]*",
          "\\(b\\. ",
          "(?:February 29, )*",
          year_birth = "\\d+",
          "\\)\\.?"
        ),
        too_few = "align_start"
      ) |> 
      tidyr::separate_wider_regex(
        "person",
        patterns = c(
          person = "[^,]*",
          ", ",
          description = ".*"
        ),
        too_few = "align_start"
      )
    
    readr::write_csv(
      events,
      fs::path(working_dir, "events.csv")
    )
    readr::write_csv(
      births,
      fs::path(working_dir, "births.csv")
    )
    readr::write_csv(
      deaths,
      fs::path(working_dir, "deaths.csv")
    )
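The two core moves in the script above, subtracting later bullets with `setdiff()` and splitting each bullet with `tidyr::separate_wider_regex()`, can be tried without touching Wikipedia. The following is a minimal sketch on toy strings: `after_births`, `after_deaths`, and the placeholder people are assumptions for illustration, standing in for two consecutive entries of `bullets_after_headers`.

```r
library(tidyr)

# Toy stand-ins (hypothetical): the bullets found after the "Births" header
# also include everything found after the later "Deaths" header.
after_births <- c(
  "1904 – Person A, example composer (d. 1957)",
  "1976 – Person B, example rapper",
  "1868 – Person C, example painter (b. 1792)"
)
after_deaths <- c("1868 – Person C, example painter (b. 1792)")

# Keep only the bullets that belong to "Births" itself.
in_births <- setdiff(after_births, after_deaths)

births_demo <- tibble::tibble(births = in_births) |>
  # Split "year – rest"; unnamed pattern pieces are matched but dropped.
  separate_wider_regex(
    "births",
    patterns = c(year_birth = "^\\d+", " – ", person = ".*")
  ) |>
  # Pull out an optional "(d. YYYY)" suffix; rows without one keep the
  # whole text in `person` and get NA for `year_death`.
  separate_wider_regex(
    "person",
    patterns = c(person = "[^(]*", "\\(d\\. ", year_death = "\\d+", "\\)\\.?"),
    too_few = "align_start"
  )

births_demo
```

Two things to note when adapting this: the extracted columns are character vectors until converted (for example with `as.integer()`), and `person` keeps a trailing space whenever a parenthetical follows it, so trimming with `trimws()` may be worthwhile before further splitting.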