TidyTuesday
    • About TidyTuesday
    • Datasets
      • 2025
      • 2024
      • 2023
      • 2022
      • 2021
      • 2020
      • 2019
      • 2018
    • Useful links

    On this page

    • Australian Pets
      • Get the data here
      • Data Dictionary
    • animal_outcomes.csv
    • animal_complaints.csv
    • brisbane_complaints.csv
      • Cleaning Script

    Pets

    Australian Pets

    The data this week comes from the RSPCA, Townsville Animal Complaints and Brisbane Open Data - Animal Complaints.

    h/t to Georgios Karamanis for suggesting part of this data.

    The RSPCA is Australia’s oldest, largest and most trusted animal welfare organisation. With this privileged position comes great responsibility. This year we received1 124,146 animals into our animal shelters and adoption centres across the country.

    With a great deal of effort from RSPCAs all over the country, adoption and reclaiming rates nationally have been increasing over time and significant improvements in the outcomes for cats and dogs (including kittens and puppies) have been achieved. This can be attributed to the introduction of new approaches and programs to increase the number of animals adopted and reunited with their owners.

    For general data on the states/regions of Australian with population as of Dec 2019 - Wikipedia.

    There’s a remarkable amount of possible data cleaning/aggregation here. The animal_outcomes dataset is pretty much ready to go, although you could pivot it longer to stack the regions. If you want to go really far down the rabbit hole, check out the cleaning script and see if you can improve upon it or approach things a different way. Note the data came from PDFs and I had to do a lot of manual spot-checks. Some fun dplyr::update_rows() which saved me a LOT of time.

    The brisbane_complaints dataset has an interesting structure. No dates were reported within a dataset, so we had to add them from the file names within a purrr call. This cleaning step is pretty realistic! I left the data_range messy so you can further clean it up.

    Note that there is state-level data from the RSPCA, but only two cities for the animal complaints. Up to you if you want to extrapolate between the data, but I’m not confident it will be meaningful.

    The RSPCA Report (lots of graphs to recreate) - RSPCA Report.

    Journal article - A Retrospective Analysis of Complaints to RSPCA Queensland, Australia, about Dog Welfare.

    Get the data here

    # Get the Data
    
    # Read in with tidytuesdayR package 
    # Install from CRAN via: install.packages("tidytuesdayR")
    # This loads the readme and all the datasets for the week of interest
    
    # Either ISO-8601 date or year/week works!
    
    tuesdata <- tidytuesdayR::tt_load('2020-07-21')
    tuesdata <- tidytuesdayR::tt_load(2020, week = 30)
    
    animal_outcomes <- tuesdata$animal_outcomes
    
    # Or read in the data manually
    
    animal_outcomes <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-07-21/animal_outcomes.csv')
    animal_complaints <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-07-21/animal_complaints.csv')
    brisbane_complaints <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-07-21/brisbane_complaints.csv')

    Data Dictionary

    animal_outcomes.csv

    variable class description
    year double Full year
    animal_type character Animal type (horse, wildlife, dog, cat, etc)
    outcome character Animal outcome - euthanized, released, rehomed, etc
    ACT double ACT - Australian Capital Territory
    NSW double New South Wales
    NT double Northern Territory
    QLD double Queensland
    SA double South Australia
    TAS double Tasmania
    VIC double Victoria
    WA double Western Australian
    Total double Australian Total

    animal_complaints.csv

    variable class description
    Animal Type character Animal Type
    Complaint Type character Complaint type
    Date Received character Date received (Month - year)
    Suburb character Suburb/region
    Electoral Division character Electoral Division

    brisbane_complaints.csv

    variable class description
    nature character Nature of complaints (animal)
    animal_type character Animal type
    category character Category of complaint
    suburb character Suburb where reported
    date_range character Date range (typically 1 quarter + year)
    responsible_office character Responsible office for the complaint
    city character City (Brisbane)

    Cleaning Script

    
    library(tidyverse)
    library(pdftools)
    library(rvest)
    library(glue)
    
    # Brisbane data for complaints:
    
    # Source: https://www.data.brisbane.qld.gov.au/data/dataset/96bec69c-6170-4ef0-93f1-eda279149b97
    
    all_complaints <- list.files("2020/2020-07-21/raw_csv") %>% 
      map_dfr(.f = function(x){
        read_csv(file = paste0("2020/2020-07-21/raw_csv/", x)) %>% 
          mutate(date_range = x)
      })
    
    brisbane_complaints <- all_complaints %>% 
      mutate(date_range = str_remove(date_range, "animal-compliance-|cars-bis-open-data-animal-related-complaints-")) %>% 
      set_names(nm= c("nature", "animal_type", "category", "suburb", "date_range", "responsible_office")) %>% 
      mutate(city = "Brisbane")
    
    brisbane_complaints %>% write_csv("2020/2020-07-21/brisbane_complaints.csv")
    
    library(tidyverse)
    library(pdftools)
    library(rvest)
    library(glue)
    
    # Get all urls for PDFs ---------------------------------------------------
    
    raw_pca <- "https://www.rspca.org.au/what-we-do/our-role-caring-animals/annual-statistics" %>%
      read_html()
    
    all_urls <- raw_pca %>%
      html_node("body > div:nth-child(5) > div.node.node-full-page > div.paragraphs-items.paragraphs-items-field-shared-sections.paragraphs-items-field-shared-sections-full.paragraphs-items-full > div > div > div > div > div > div:nth-child(6)") %>%
      html_nodes("p > a") %>%
      html_attr("href") %>%
      enframe() %>%
      arrange(desc(name)) %>%
      mutate(name = c(1999:2015, 2017)) %>%
      add_row(name = 2018, value = "https://www.rspca.org.au/sites/default/files/RSPCA%20Australia%20Annual%20Statistics%20final%202018-2019.pdf") %>%
      set_names(nm = c("year", "url"))
    
    # Download all the files
    save_rspca <- function(year, url){
      download.file(url, destfile = glue::glue("rspca_{year}.pdf"))
    }
    
    # Iterate across the files and save
    all_urls %>% 
      pwalk(save_rspca)
    
    ### Function to get data
    get_animals <- function(year, url) {
      url_rspca <- url
      
      # Read PDF as character strings
      raw_pdf <- pdftools::pdf_text(url_rspca)
      
      # Get table number by page
      page_num_df <- raw_pdf %>%
        enframe() %>%
        mutate(value = str_trim(value) %>%
                 word(1, 2))
      
      # Add logic for getting the right table number
      page_num_pull <- page_num_df %>%
        filter(str_detect(value, if_else(year < 2007, "Table 3", "Table 4"))) %>%
        pull(name)
      
      # More logic because someone in 2011 decided to further wrap tables :facepalm:
      page_num <- if_else(year == 2011, list(5), list(page_num_pull))[[1]]
      
      # Tables can span up to 2 additional pages
      # Combine these pages and split into rows by new lines
      raw_text <- raw_pdf[c(page_num:(page_num + 2))] %>%
        str_split("\n") %>%
        flatten_chr()
      
      # Find and limit to start/end of table
      table_start <- stringr::str_which(raw_text, "Dogs")
      table_end <- stringr::str_which(raw_text, "Total animals|Total received|Total Animals")
      
      table_end <- table_end[[length(table_end)]]
      
      # Trim table of extra whitespace and limit to right rows
      table_trimmed <- raw_text[c(table_start:table_end)] %>% str_trim()
      
      # squish table together and replace excess whitespace with a '|'
      squished_table <- str_replace_all(table_trimmed, "\\s{2,}", "|") %>%
        str_remove_all(",")
      
      raw_df <- enframe(squished_table)
      
      # separate the columns
      animal_df <- raw_df %>%
        separate(
          value,
          into = c(
            "outcome",
            "ACT",
            "NSW",
            "NT",
            "QLD",
            "SA",
            "TAS",
            "VIC",
            "WA",
            "Total"
          ),
          sep = "\\|"
        ) %>%
        # convert columns to double
        mutate(across(c(ACT:Total), as.double)) %>%
        mutate(
          year = year,
          # Clean up the outcome column
          outcome = str_to_title(outcome) %>% str_remove("[:digit:]+")
        ) %>%
        mutate(
          # create animal_type column
          animal_type = if_else(
            str_detect(outcome, paste(c("Dogs", "Cats", "Horses", "Livestock", "Other animals", "Other Animals", "Wildlife"), collapse = "|")),
            outcome,
            NA_character_
          ),
          animal_type = str_to_title(animal_type)
        ) %>%
        # fill down the animal type column
        fill(animal_type)
      
      
      animal_df
    }
    
    # Get all the data --------------------------------------------------------
    
    # Note you could do some of this in bulk with purrr, but since I had to
    # do so much extra cleaning outside of the core function I decided to 
    # just do it manually and merge at the end
    
    data_1999 <- get_animals(1999, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics1999-2000.pdf")
    data_2000 <- get_animals(2000, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202000%20-%202001.pdf")
    data_2001 <- get_animals(2001, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202001%20-%202002.pdf")
    data_2002 <- get_animals(2002, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202002-2003.pdf")
    data_2003 <- get_animals(2003, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202003%20-%202004.pdf") %>%
      filter(!is.na(Total))
    data_2004 <- get_animals(2004, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202004%20-%202005.pdf") %>%
      filter(!is.na(Total))
    data_2005 <- get_animals(2005, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202005%20-%202006.pdf") %>%
      rows_update(
        tibble(name = 5, NT = NA, QLD = 456, SA = 97, TAS = 79, VIC = 472, WA = 133, Total = 1951)
      ) %>%
      rows_update(
        tibble(name = 10, SA = NA, TAS = 8, VIC = 100, WA = 0, Total = 1730)
      ) %>%
      rows_update(
        tibble(name = 7, NT = NA, QLD = 47, SA = 451, TAS = 127, VIC = 0, WA = 0, Total = 1015)
      ) %>%
      rows_update(
        tibble(
          name = c(18, 19), NT = c(NA, NA), QLD = c(474, 21), SA = c(96, 0),
          TAS = c(52, 300), VIC = c(218, 829), WA = c(52, 0), Total = c(1336, 1168)
        )
      ) %>%
      rows_update(
        tibble(
          name = c(60:66), NT = c(rep(NA, 7)),
          QLD = c(54, 694, 80, 0, 645, 97, 1570),
          SA = c(10, 108, 2, 0, 222, NA, 342),
          TAS = c(5, 95, 1, 0, 11, 2, 114),
          VIC = c(47, 329, 76, 0, 373, 20, 845),
          WA = c(0, 6, 0, 0, 1, 0, 7),
          Total = c(162, 1594, 224, 260, 2507, 481, 5228)
        )
      ) %>%
      filter(!is.na(Total))
    
    data_2006 <- get_animals(2006, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202006%20-%202007.pdf") %>%
      rows_update(
        tibble(name = 7, TAS = NA, VIC = 153, WA = 5, Total = 2087)
      ) %>%
      rows_update(
        tibble(name = 32, NT = NA, QLD = 30, SA = 5, TAS = 7, VIC = 42, WA = 0, Total = 95)
      )
    
    # Note decided to use data_2006_upd instead of overwriting
    # One of the columns got dropped in parsing step, so I used it to
    # update rows in place
    
    data_2006_upd <- data_2006 %>%
      rows_update(
        data_2006 %>%
          filter(name %in% c(44:50, 53:60)) %>%
          select(-ACT) %>%
          set_names(nm = c(
            "name", "outcome", "ACT", "NSW", "NT", "QLD",
            "SA", "TAS", "VIC", "WA", "year", "animal_type"
          )) %>%
          mutate(Total = c(
            1893, 142, 1502, 6035, 1318, 10890, 11051,
            167, 1631, 169, 186, 2291, 227, 4671, 5228
          ), .after = WA) %>%
          mutate(
            outcome = c(
              "Released", "In Stock", "Transferred", "Euthanased", "Other", "Total", "Last Year's Total",
              "Reclaimed", "Rehomed", "In Stock", "Transferred", "Euthanased", "Other", "Total", "Last Year's Total"
            ),
            animal_type = c(rep("Wildlife", 7), rep("Other Animals", 8))
          )
      ) %>%
      filter(!is.na(Total))
    
    data_2007 <- get_animals(2007, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20national%20Statistics%202007-2008.pdf") %>%
      filter(!is.na(Total))
    
    data_2008 <- get_animals(2008, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202008%20-%202009.pdf") %>%
      rows_update(
        tibble(
          name = c(2, 3, 8, 16),
          WA = c(621, 1045, 248, 958),
          Total = c(22896, 19236, 22085, 19666)
        )
      ) %>%
      rows_update(
        tibble(
          name = 19,
          NT = 212, QLD = 11045, SA = 2712, TAS = 1974, VIC = 9801, WA = 99, Total = 39495
        )
      ) %>%
      filter(!is.na(Total))
    
    data_2009 <- get_animals(2009, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202009%20-%202010.pdf") %>%
      filter(!is.na(Total))
    
    data_2010 <- get_animals(2010, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202010%20-%202011.pdf") %>%
      filter(!is.na(Total))
    
    data_2011 <- get_animals(2011, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202011-2012.pdf") %>%
      filter(!is.na(Total))
    
    data_2012 <- get_animals(2012, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202012%20-%202013.pdf") %>%
      filter(!is.na(Total))
    
    data_2013 <- get_animals(2013, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA_Australia-Annual_Statistics_2013-2014.pdf") %>%
      filter(!is.na(Total))
    
    data_2014 <- get_animals(2014, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA_Australia-Annual_Statistics_2014-2015.pdf") %>%
      filter(!is.na(Total))
    
    data_2015 <- get_animals(2015, "https://www.rspca.org.au/sites/default/files/RSPCA%20Australia%20Annual%20Statistics%202015-2016%20.pdf") %>%
      rows_update(
        tibble(
          name = 62, ACT = 581, NSW = 1361, NT = 34, QLD = 720, SA = 449, TAS = 191, VIC = 716, WA = 38, Total = 4090
        )
      ) %>%
      filter(!is.na(Total))
    
    data_2016 <- get_animals(2016, "https://www.rspca.org.au/sites/default/files/RSPCA%20Australia%20Annual%20Statistics%20final%202016-2017.pdf") %>% 
      filter(!is.na(Total))
    
    data_2017 <- get_animals(2017, "https://www.rspca.org.au/sites/default/files/RSPCA%20Australia%20Annual%20Statistics%202017-2018.pdf") %>%
      filter(!is.na(Total))
    
    data_2018 <- get_animals(2018, "https://www.rspca.org.au/sites/default/files/RSPCA%20Australia%20Annual%20Statistics%20final%202018-2019.pdf") %>%
      filter(!is.na(Total))
    
    # Combine all the cleaned data by year!
    # note that 2016 was not reported. :sad:
    # I would recommend imputing if you want
    
    comb_data <- bind_rows(
      list(
        data_1999, data_2000, data_2001, data_2002, data_2003, data_2004, data_2005,
        data_2006_upd, data_2007, data_2008, data_2009, data_2010, data_2011, 
        data_2012, data_2013, data_2014, data_2015, data_2016, data_2017, data_2018
      )
    )
    
    # clean the final data
    clean_data <- comb_data %>%
      filter(outcome %in% c(
        "Euthanased", "Other", "Rehomed", "Transferred",
        "Reclaimed", "Currently In Care", "In Stock",
        "Released", "In Our Care"
      )) %>%
      mutate(
        outcome = case_when(
          outcome == "In Our Care" ~ "Currently In Care",
          outcome == "Euthanased" ~ "Euthanized",
          TRUE ~ outcome
        )
      ) %>%
      select(year, animal_type, outcome, ACT:Total)
    
    # Sanity checking
    clean_data %>% ggplot(aes(x = year, y = Total, color = outcome)) +
      geom_line() +
      facet_wrap(~animal_type, scales = "free_y")
    
    # Sanity checking
    clean_data %>%
      filter(outcome %in% c("Euthanized", "Rehomed"), animal_type != "Wildlife") %>%
      ggplot(aes(x = year, y = Total, fill = outcome)) +
      geom_col(position = "fill") +
      facet_wrap(~animal_type)
    
    # Sanity checking
    clean_data %>%
      count(animal_type, sort = T)
    
    # Sanity checking Wildlife being an odd number
    # It's because first 3 years only had 3 outcomes
    
    clean_data %>%
      filter(animal_type == "Wildlife") %>%
      count(year)
    
    clean_data %>% 
      write_csv("2020/2020-07-21/animal_outcomes.csv")