TidyTuesday
    • About TidyTuesday
    • Datasets
      • 2025
      • 2024
      • 2023
      • 2022
      • 2021
      • 2020
      • 2019
      • 2018
    • Useful links

    On this page

    • The Languages of Africa
      • The Data
      • How to Participate
        • PydyTuesday: A Posit collaboration with TidyTuesday
      • Data Dictionary
        • africa.csv
      • Cleaning Script

    The Languages of Africa

    This week we’re exploring data about popular languages spoken on the African continent. The dataset this week comes from the Languages of Africa page on Wikipedia.

    The number of languages natively spoken in Africa is variously estimated (depending on the delineation of language vs. dialect) at between 1,250 and 2,100 and by some counts at over 3,000.

    The dataset is rich with information on the number of languages spoken across the continent. Some of the questions that could be thought of include:

    • Which country in Africa has the largest number of spoken languages?
    • Which family of languages has the highest density of speakers?
    • Are there any languages that cut across multiple countries?

    Can’t wait to see the kind of visualisations that can be created!

    Thank you to Robert Muwanga, Data Enthusiast for curating this week’s dataset.

    The Data

    # Using R
    # Option 1: tidytuesdayR R package 
    ## install.packages("tidytuesdayR")
    
    tuesdata <- tidytuesdayR::tt_load('2026-01-13')
    ## OR
    tuesdata <- tidytuesdayR::tt_load(2026, week = 2)
    
    africa <- tuesdata$africa
    
    # Option 2: Read directly from GitHub
    
    africa <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2026/2026-01-13/africa.csv')
    # Using Python
    # Option 1: pydytuesday python library
    ## pip install pydytuesday
    
    import pydytuesday
    
    # Download files from the week, which you can then read in locally
    pydytuesday.get_date('2026-01-13')
    
    # Option 2: Read directly from GitHub and assign to an object
    
    africa = pandas.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2026/2026-01-13/africa.csv')
    # Using Julia
    # Option 1: TidierTuesday.jl library
    ## Pkg.add(url="https://github.com/TidierOrg/TidierTuesday.jl")
    
    using TidierTuesday
    
    # Download files from the week, which you can then read in locally
    download_dataset('2026-01-13')
    
    # Option 2: Read directly from GitHub and assign to an object with TidierFiles
    
    africa = read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2026/2026-01-13/africa.csv")
    
    # Option 3: Read directly from Github and assign without Tidier dependencies
    africa = CSV.read("https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2026/2026-01-13/africa.csv", DataFrame)

    How to Participate

    • Explore the data, watching out for interesting relationships. We would like to emphasize that you should not draw conclusions about causation in the data. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our suggestion is to use the data provided to practice your data tidying and plotting techniques, and to consider for yourself what nuances might underlie these relationships.
    • Create a visualization, a model, a Quarto report, a shiny app, or some other piece of data-science-related output, using R, Python, or another programming language.
    • Share your output and the code used to generate it on social media with the #TidyTuesday hashtag.
    • Submit your own dataset!

    PydyTuesday: A Posit collaboration with TidyTuesday

    • Exploring the TidyTuesday data in Python? Posit has some extra resources for you! Have you tried making a Quarto dashboard? Find videos and other resources in Posit’s PydyTuesday repo.
    • Share your work with the world using the hashtags #TidyTuesday and #PydyTuesday so that Posit has the chance to highlight your work, too!
    • Deploy or share your work however you want! If you’d like a super easy way to publish your work, give Connect Cloud a try.

    Data Dictionary

    africa.csv

    variable class description
    language character Name of popular African language.
    family character Group of languages with similar ancestry, having similar vocabulary, phonetics and grammar.
    native_speakers integer Number of known native speakers of the language.
    country character Country where this language is spoken.

    Cleaning Script

    #########################################################################
    # Author: Muwanga Robert                                                #
    # Date: 24 December 2025                                                #
    # License : Creative Commons (CC-BY)                                    #
    # Purpose: Scrapes data from Wikipedia on popular languages in Africa.  #
    #########################################################################
    
    require(tidyr)
    require(stringr)
    require(dplyr)
    require(purrr)
    require(rvest)
    
    # Let's get a list of African countries that we shall use in our data cleaning
    # process 
    
    african_countries <- 
      rvest::read_html('https://www.worldometers.info/geography/how-many-countries-in-africa/') |> 
      rvest::html_table() |> 
      purrr::pluck(1) |> 
      pull('Country') %>% 
      c(., 'Ivory Coast', 'Cape Verde')
    
    # Let's extract the table of interest from Wikipedia
    dataset <- 
      rvest::read_html("https://en.wikipedia.org/wiki/Languages_of_Africa") |> 
      rvest::html_table() |> 
      purrr::pluck(5) |> 
      dplyr::select(
        language = Language,
        family = Family,
        native_speakers = `Native speakers within Africa (L1)`, 
        country = `Official status per country`
      ) 
    
    # Clean up the family and native speaker columns, and extract the countries
    # from the country column
    dt <- dataset |> 
      mutate(
        family = str_split_i(family, " ", 1), 
        native_speakers = str_split_i(native_speakers, " ", 1),
        country = str_extract_all(
          country, str_c(african_countries, collapse = "|"), simplify = TRUE))
    
    # From the country column, we "expand" the cells that have vectors, creating 
    # a "wider" tibble
    countries <- 
      dataset |> 
      select(country) |> 
      map(.f = function(x) str_extract_all(
        x, str_c(
          african_countries, collapse = "|"), 
        simplify = TRUE)) |> 
      as.data.frame()
    
    # We binding columns between the two datasets, make the dataset longer, and 
    # remove rows that have blank entries in the country column
    africa <- bind_cols(dt, countries) |> 
      select(-country) |> 
      pivot_longer(
        cols = contains("country"),
        names_to = 'country_index', values_to = 'country') |> 
      mutate(
        country = na_if(country, "")) |> 
      drop_na()
    
    # We then clean up the native_speakers column and prepare our final dataset 
    # by removing the unwanted country_index column
    africa <- africa |> 
      mutate(native_speakers = str_split_i(native_speakers, pattern = '\\[|-', i = 1), 
             native_speakers = str_remove_all(native_speakers, pattern = ','), 
             native_speakers = as.integer(native_speakers)) |> 
      filter(!is.na(native_speakers)) |> 
      select(-country_index)