    Stack Overflow Annual Developer Survey 2024

    This week’s dataset is derived from the 2024 Stack Overflow Annual Developer Survey. Conducted in May 2024, the survey gathered responses from over 65,000 developers across seven key sections:

    1. Basic information
    2. Education, work, and career
    3. Tech and tech culture
    4. Stack Overflow community
    5. Artificial Intelligence (AI)
    6. Professional Developer Series - Not part of the main survey
    7. Thoughts on Survey

    The dataset provided for this analysis focuses exclusively on the single-response questions from the main survey sections. Each categorical response in the survey has been integer-coded, with corresponding labels available in the crosswalk file.

    What can you learn about developer demographics? How do developers engage with Stack Overflow? What do they think about AI?

    Thank you to Havisha Khurana for curating this week’s dataset!

    The Data

    # Option 1: tidytuesdayR package 
    ## install.packages("tidytuesdayR")
    
    tuesdata <- tidytuesdayR::tt_load('2024-09-03')
    ## OR
    tuesdata <- tidytuesdayR::tt_load(2024, week = 36)
    
    qname_levels_single_response_crosswalk <- tuesdata$qname_levels_single_response_crosswalk
    stackoverflow_survey_questions <- tuesdata$stackoverflow_survey_questions
    stackoverflow_survey_single_response <- tuesdata$stackoverflow_survey_single_response
    
    # Option 2: Read directly from GitHub
    
    qname_levels_single_response_crosswalk <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-09-03/qname_levels_single_response_crosswalk.csv')
    stackoverflow_survey_questions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-09-03/stackoverflow_survey_questions.csv')
    stackoverflow_survey_single_response <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-09-03/stackoverflow_survey_single_response.csv')
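
    The single-response answers are stored as integers (see the data dictionary below), so labels need to be joined back on from the crosswalk before tabulating or plotting. The following is a minimal sketch, assuming the objects loaded above; the choice of remote_work as the example column is arbitrary.

    library(dplyr)
    
    # Level/label pairs for one integer-coded question (here: remote_work)
    remote_work_labels <- qname_levels_single_response_crosswalk %>%
      filter(qname == "remote_work") %>%
      select(level, label)
    
    # Count responses and attach the human-readable labels
    stackoverflow_survey_single_response %>%
      count(remote_work) %>%
      left_join(remote_work_labels, by = c("remote_work" = "level")) %>%
      arrange(desc(n))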

    How to Participate

    • Explore the data, watching out for interesting relationships. We would like to emphasize that you should not draw conclusions about causation in the data. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our suggestion is to use the data provided to practice your data tidying and plotting techniques, and to consider for yourself what nuances might underlie these relationships.
    • Create a visualization, a model, a shiny app, or some other piece of data-science-related output, using R or another programming language.
    • Share your output and the code used to generate it on social media with the #TidyTuesday hashtag.
    • Submit your own dataset!

    Data Dictionary

    qname_levels_single_response_crosswalk.csv

    variable  class      description
    qname     character  Categorical Question/Column Name in main data
    level     integer    Integer index associated with each column response
    label     character  Label associated with integer index

    stackoverflow_survey_questions.csv

    variable  class      description
    qname     character  Categorical Question/Column Name in main data
    question  character  Text of the question as it was presented to the respondent

    stackoverflow_survey_single_response.csv

    variable               class      description
    response_id            double     Respondent ID
    main_branch            integer    Professional coding level of the respondent
    age                    integer    Age
    remote_work            integer    Current work situation
    ed_level               integer    Highest education level completed
    years_code             integer    Years the respondent has coded in total; more than 50 years is coded as 51
    years_code_pro         integer    Years the respondent has coded professionally; more than 50 years is coded as 51
    dev_type               integer    Best current-job description
    org_size               integer    People in the organization
    purchase_influence     integer    Level of influence in purchasing new technology at their organization
    buildvs_buy            integer    How much customization was needed in the most recent tool recommendation
    country                character  Country in which the respondent lives
    currency               character  Currency of the country
    comp_total             double     Total compensation
    so_visit_freq          integer    Stack Overflow visiting frequency
    so_account             integer    Stack Overflow account status
    so_part_freq           integer    Stack Overflow participation frequency
    so_comm                integer    Whether the respondent considers themselves a member of the Stack Overflow community
    ai_select              integer    Use of AI in the development process
    ai_sent                integer    Stance on using AI tools as part of the development workflow
    ai_acc                 integer    Trust in the accuracy of AI as part of the development workflow
    ai_complex             integer    How well the respondent believes the AI tools they use in development workflows handle complex tasks
    ai_threat              integer    Belief that AI is a threat to their current job
    survey_length          integer    Feeling about the length of this year's survey
    survey_ease            integer    Ease of completing this survey
    converted_comp_yearly  double     Converted yearly compensation
    r_used                 integer    Flag for whether the respondent used R in the previous year
    r_want_to_use          integer    Flag for whether the respondent wants to use R in the next year
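
    The qname values above match the qname column in stackoverflow_survey_questions, so the full wording behind any integer-coded column can be looked up directly. A minimal sketch, assuming the objects loaded in The Data section (the qnames chosen here are just examples):

    library(dplyr)
    
    # Retrieve the full question text for a few integer-coded columns
    stackoverflow_survey_questions %>%
      filter(qname %in% c("ai_threat", "so_comm", "buildvs_buy")) %>%
      select(qname, question)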

    Cleaning Script

    ## Load libraries
    
    library(janitor)
    library(dplyr)
    library(purrr)
    library(tidyr)
    
    ## Data available in a zip file on https://survey.stackoverflow.co/
    ## Downloading the zip file and storing in a temp location
    ## Then extracting the results
    
    temp <- tempfile()
    download.file("https://cdn.sanity.io/files/jo7n4k8s/production/262f04c41d99fea692e0125c342e446782233fe4.zip/stack-overflow-developer-survey-2024.zip",temp, mode = "wb")
    unzip_file <- unzip(temp, "survey_results_public.csv", exdir = tempdir())
    stackoverflow_survey <- readr::read_csv(unzip_file)
    unzip_file <- unzip(temp, "survey_results_schema.csv", exdir = tempdir())
    stackoverflow_survey_questions <- readr::read_csv(unzip_file)
    unlink(temp)
    unlink(file.path(tempdir(), "survey_results_public.csv"))
    unlink(file.path(tempdir(), "survey_results_schema.csv"))
    rm(temp, unzip_file)
    
    ## Survey Description
    
    # The Stackoverflow annual survey has 7 sections:
    
    # 1. Basic information
    # 2. Education, work, and career
    # 3. Tech and tech culture
    # 4. Stack Overflow community
    # 5. AI
    # 6. Professional Developer Series - not part of the main survey and was shown to only professional developers
    # 7. Thoughts on Survey
    
    # Within each section, there are single-response and multiple-response questions. The original file is >150 MB.
    # To subset the data file so it is < 20 MB, I am taking the following steps:
    # Only keeping single-response questions.
    # Saving response labels in a crosswalk file, and encoding each column value as an integer.
    # Adding binary variables flagging whether the respondent used R in the past year or wants to use it in the next year.
    
    ## Step 1: Remove the professional developer series questions and the Check question used to gauge attention
    
    stackoverflow_survey <- stackoverflow_survey %>%
        select(-c(Check, TBranch:JobSatPoints_11, JobSat))
    
    ## Step 2: Separate single-response questions
    
    # Select columns with multi-response questions,
    # flagging those whose first 1000 responses contain a semi-colon (;)
    
    multi_response <- colnames(stackoverflow_survey)[grepl(";",stackoverflow_survey[1:1000,])]
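    # Note: grepl() coerces the data frame via as.character(), producing one long
    # string per column, so this returns one TRUE/FALSE per column (not per row)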
    #57 columns are multi-response
    
    # The remaining are single response
    single_response <- setdiff(colnames(stackoverflow_survey),
                               multi_response)
    
    ## Step 3: Create a data.frame for single-response questions
    
    ## Create flags for whether someone used R in the past year or wants to use it in the next year
    
    r_used <- stackoverflow_survey %>%
      select(ResponseId, LanguageHaveWorkedWith) %>%
      separate_rows(LanguageHaveWorkedWith, sep = ";") %>%
      filter(LanguageHaveWorkedWith == "R") %>%
      pull(ResponseId)
    
    r_want_to_use <- stackoverflow_survey %>%
      select(ResponseId, LanguageWantToWorkWith) %>%
      separate_rows(LanguageWantToWorkWith, sep = ";") %>%
      filter(LanguageWantToWorkWith == "R") %>%
      pull(ResponseId)
    
    stackoverflow_survey_single_response <- stackoverflow_survey %>%
        
        # add flags
        mutate(
          RUsed = case_when(
            is.na(LanguageHaveWorkedWith) ~ NA_integer_,
            ResponseId %in% r_used ~ 1L,
            TRUE ~ 0L
          ),
          RWantToUse = case_when(
            is.na(LanguageWantToWorkWith) ~ NA_integer_,
            ResponseId %in% r_want_to_use ~ 1L,
            TRUE ~ 0L
          )
        ) %>%
        select(ResponseId, all_of(single_response), RUsed, RWantToUse) %>%
        clean_names() %>%
        # remove rows in which all columns other than response_id are NA
        filter(!if_all(-response_id, is.na)) %>%
        
        # Changing years coding and years coding professionally to integer
        mutate(
            across(c(years_code, years_code_pro), ~case_when(
                .x == "Less than 1 year" ~ "0",
                .x == "More than 50 years" ~ "51",
                TRUE ~ .x
            )),
            across(c(years_code, years_code_pro), as.integer),
            
            # Changing character variables to factor
            across(where(is.character), 
                   ~case_when(
                       .x == "Other (please specify):" ~ "Other",
                       TRUE ~ .x
                   )),
            across(where(is.character),as.factor),
            # Changing country and currency back to character
            across(c(country, currency), as.character)
            )
    
    ## Step 4: Create a data-dictionary file which saves each factor level for character variables, so they can be stored as integers in the main dataset.
    
    # Select only factor columns
    factor_var <- stackoverflow_survey_single_response %>%
        select(where(is.factor)) 
    
    # Record how many levels are associated with each column
    column_n_levels <- map_int(colnames(factor_var), 
                               ~length(levels(factor_var[,.x] %>% pull)))
    
    # Create a tibble with variable name and level
    qname_levels_single_response_crosswalk <- tibble(
        qname = rep(colnames(factor_var), column_n_levels) 
    ) %>%
        group_by(qname) %>%
        mutate(level = row_number()) %>%
        ungroup() %>%
        mutate(
            label = unlist(map(colnames(factor_var), ~levels(factor_var[,.x] %>% pull))) 
        )
    
    # Change factor variables to integer in main survey response
    
    stackoverflow_survey_single_response <- stackoverflow_survey_single_response %>%
        mutate(across(where(is.factor), as.integer))
    
    ### Ready to export - stackoverflow_survey_single_response and qname_levels_single_response_crosswalk files
    
    ## To use the dataset, re-label the data using the code below:
    
    # survey_relabel <- stackoverflow_survey_single_response
    # relabel_columns <- unique(qname_levels_single_response_crosswalk$qname)
    # 
    # for (col_name in relabel_columns) {
    #   survey_relabel[[col_name]] <- stackoverflow_survey_single_response %>%
    #     select(all_of(col_name)) %>%
    #     rename("level" = col_name) %>%
    #     left_join(qname_levels_single_response_crosswalk %>% filter(qname == col_name), 
    #               by = c("level")) %>%
    #     pull(label)
    # }
    
    stackoverflow_survey_questions <- stackoverflow_survey_questions %>%
        filter(qname %in% single_response) %>%
        select(qname, question) %>%
        mutate(qname = make_clean_names(qname))