TidyTuesday
    • About TidyTuesday
    • Datasets
      • 2025
      • 2024
      • 2023
      • 2022
      • 2021
      • 2020
      • 2019
      • 2018
    • Useful links

    On this page

    • Women of 2020
      • Get the data here
      • Data Dictionary
    • women.csv
      • Cleaning Script

    Women of 2020

    The data this week comes from the BBC by way of Joshua Feldman.

    The BBC has revealed its list of 100 inspiring and influential women from around the world for 2020.

    This year 100 Women is highlighting those who are leading change and making a difference during these turbulent times.

    The list includes Sanna Marin, who leads Finland’s all-female coalition government, Michelle Yeoh, star of the new Avatar and Marvel films and Sarah Gilbert, who heads the Oxford University research into a coronavirus vaccine, as well as Jane Fonda, a climate activist and actress.

    And in an extraordinary year - when countless women around the world have made sacrifices to help others - one name on the 100 Women list has been left blank as a tribute.

    Get the data here

    # Get the Data
    
    # Read in with tidytuesdayR package 
    # Install from CRAN via: install.packages("tidytuesdayR")
    # This loads the readme and all the datasets for the week of interest
    
    # Either ISO-8601 date or year/week works!
    
    tuesdata <- tidytuesdayR::tt_load('2020-12-08')
    tuesdata <- tidytuesdayR::tt_load(2020, week = 50)
    
    women <- tuesdata$women
    
    # Or read in the data manually
    
    women <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-12-08/women.csv')

    Data Dictionary

    women.csv

    variable class description
    name character Name of woman
    img character Link to headshot
    category character Category for award
    country character Country of residence
    role character Role/Career
    description character Description of the woman and their achievements

    Cleaning Script

    # Load packages
    
    library(rvest)
    library(tidyverse)
    
    # Load web page
    
    bbc_women <- html("https://www.bbc.co.uk/news/world-55042935")
    
    # Save even and odd indices for data extraction later
    
    odd_index <- seq(1,200,2)
    even_index <- seq(2,200,2)
    
    # Extract name
    
    name <- bbc_women %>% 
      html_nodes("article h4") %>% 
      html_text()
    
    # Extract image
    
    img <- bbc_women %>% 
      html_nodes(".card__header") %>% 
      html_nodes("img") %>% 
      html_attr("src")
    
    img <- img[odd_index]
    
    # Extract category
    
    category <- bbc_women %>% 
      html_nodes("article .card") %>% 
      str_extract("card category--[A-Z][a-z]+") %>% 
      str_remove_all("card category--")
    
    # Extract country & role
    
    country_role <- bbc_women %>% 
      html_nodes(".card__header__strapline__location") %>% 
      html_text()
    
    country <- country_role[odd_index]
    role <- country_role[even_index]
    
    # Extract description
    
    description <- bbc_women %>% 
      html_nodes(".first_paragraph") %>% 
      html_text()
    
    # Finalise data frame
    
    df <- data.frame(
      name,
      img,
      category,
      country,
      role,
      description
    )
    
    # Export
    
    write.csv(df, "data.csv", row.names=FALSE)