Films Based on Video Games
The dataset this week comes from the Wikipedia article List of films based on video games. It covers theatrical releases, direct-to-video productions, television films, short films, and documentaries, both live-action and animated, that are based on or inspired by a video game franchise, spanning from the early 1990s to upcoming releases. Each row is a film, with box office figures, critic scores, budgets, and release dates where available.
Some questions worth exploring:
- Which video game franchise has generated the most film adaptations, and which has earned the most at the box office?
- Which video game publishers have the most film adaptations, and how have they performed at the box office?
- Do audiences and critics agree? Compare CinemaScore grades against Rotten Tomatoes scores.
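Comparing letter grades with percentage scores requires a numeric encoding of the grades first. The sketch below shows one possible approach and is not part of the dataset's own tooling: the 13-step F to A+ ordering, the helper names `cinema_score_rank` and `mean_rt_by_letter`, and the sample rows are all assumptions made for illustration. It also normalizes the Unicode minus sign that may appear in grades such as “B−”.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional

# Assumed ordering of CinemaScore grades, worst to best.
GRADES = ["F", "D-", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"]
GRADE_RANK = {grade: rank for rank, grade in enumerate(GRADES)}


def cinema_score_rank(grade: Optional[str]) -> Optional[int]:
    """Map a grade such as "B-" to an ordinal rank (0 = F, 12 = A+)."""
    if grade is None:
        return None
    # The data may use a Unicode minus (U+2212) or an en dash instead of "-".
    normalized = grade.strip().replace("\u2212", "-").replace("\u2013", "-")
    return GRADE_RANK.get(normalized)  # None for "NA" or unrecognized strings


def mean_rt_by_letter(rows: Iterable[dict]) -> Dict[str, float]:
    """Average Rotten Tomatoes score per grade letter, skipping missing values."""
    scores = defaultdict(list)
    for row in rows:
        rt = row.get("rotten_tomatoes")
        if cinema_score_rank(row.get("cinema_score")) is None or rt is None:
            continue
        letter = row["cinema_score"].strip()[0]
        scores[letter].append(rt)
    return {letter: sum(vals) / len(vals) for letter, vals in scores.items()}


# Hypothetical rows, not taken from game_films.csv.
sample = [
    {"cinema_score": "A\u2212", "rotten_tomatoes": 60.0},
    {"cinema_score": "A", "rotten_tomatoes": 80.0},
    {"cinema_score": "C+", "rotten_tomatoes": 20.0},
    {"cinema_score": "NA", "rotten_tomatoes": 55.0},
]
print(mean_rt_by_letter(sample))
```

An ordinal rank keeps the grade order without claiming that the steps between grades are equal in size, so a rank correlation is likely a safer summary than a linear one.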
Thank you to Georgios Karamanis for curating this week’s dataset.
The Data
# Using R
# Option 1: tidytuesdayR R package
## install.packages("tidytuesdayR")
tuesdata <- tidytuesdayR::tt_load('2026-06-09')
## OR
tuesdata <- tidytuesdayR::tt_load(2026, week = 23)
game_films <- tuesdata$game_films
# Option 2: Read directly from GitHub
game_films <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2026/2026-06-09/game_films.csv')
# Using Python
# Option 1: pydytuesday python library
## pip install pydytuesday
import pydytuesday
# Download files from the week, which you can then read in locally
pydytuesday.get_date('2026-06-09')
# Option 2: Read directly from GitHub and assign to an object
import pandas
game_films = pandas.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2026/2026-06-09/game_films.csv')
# Using Julia
# Option 1: TidierTuesday.jl library
## Pkg.add(url="https://github.com/TidierOrg/TidierTuesday.jl")
using TidierTuesday
# Download datasets for the week, and load them as a NamedTuple of DataFrames
data = tt_load("2026-06-09")
# Option 2: Read directly from GitHub and assign to an object with TidierFiles
using TidierFiles
game_films = read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2026/2026-06-09/game_films.csv")
# Option 3: Read directly from GitHub and assign without Tidier dependencies
using CSV, DataFrames, Downloads
# Download to a temporary file first, then parse it
game_films = CSV.read(Downloads.download("https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2026/2026-06-09/game_films.csv"), DataFrame)
How to Participate
- Explore the data, watching out for interesting relationships. We would like to emphasize that you should not draw conclusions about causation in the data. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our suggestion is to use the data provided to practice your data tidying and plotting techniques, and to consider for yourself what nuances might underlie these relationships.
- Create a visualization, a model, a Quarto report, a Shiny app, or some other piece of data-science-related output, using R, Python, or another programming language.
- Share your output and the code used to generate it on social media with the #TidyTuesday hashtag.
- Submit your own dataset!
PydyTuesday: A Posit collaboration with TidyTuesday
- Exploring the TidyTuesday data in Python? Posit has some extra resources for you! Have you tried making a Quarto dashboard? Find videos and other resources in Posit’s PydyTuesday repo.
- Share your work with the world using the hashtags #TidyTuesday and #PydyTuesday so that Posit has the chance to highlight your work, too!
- Deploy or share your work however you want! If you’d like a super easy way to publish your work, give Connect Cloud a try.
Data Dictionary
game_films.csv
| variable | class | description |
|---|---|---|
| category | character | Top-level Wikipedia section the film belongs to (e.g. “Theatrical releases”, “Direct-to-video”, “Television films”). |
| subcategory | character | Second-level Wikipedia section within the category (e.g. “English”, “Japanese”, “Animation”). NA for films with no subsection. |
| title | character | Title of the film or production. |
| director | character | Director(s) of the film. Multiple directors are separated by “ \| ”. |
| release_date | date | Parsed release date. For films with regional releases, this is the earliest date. For year-only or month-year entries, the date is set to the first day of that period. TBA and undated entries are NA. |
| release_date_raw | character | Original release date string as it appeared on Wikipedia before parsing. |
| air_date_raw | character | Original air date string for television productions, as it appeared on Wikipedia. NA for non-television entries. |
| worldwide_box_office_currency | character | Currency symbol of the worldwide box office figure (e.g. “$”, “¥”). |
| worldwide_box_office | double | Worldwide box office gross in the original currency units. See worldwide_box_office_currency. |
| rotten_tomatoes | double | Rotten Tomatoes critic score as a percentage (0–100). |
| metacritic | double | Metacritic score out of 100. |
| cinema_score | character | CinemaScore audience grade (e.g. “A+”, “B−”). |
| distributor | character | Film distributor(s). |
| original_game_publisher | character | Publisher of the video game the film is based on. |
| budget_currency | character | Currency symbol of the budget figures (e.g. “$”, “¥”). |
| budget_low | double | Lower bound of the reported production budget in the original currency units. Equal to budget_high for single-value budgets. |
| budget_high | double | Upper bound of the reported production budget in the original currency units. Equal to budget_low for single-value budgets. |
| domestic_box_office | character | Domestic box office gross as reported on Wikipedia (documentary sections). |
| subject | character | Subject or topic of the documentary (documentary sections only). |
| network | character | Broadcasting network for television productions. |
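A few of these columns need light handling before analysis: `director` can pack several names into one string, budgets arrive as a low/high pair, and money columns are only comparable within a single currency. The following is a minimal sketch under those assumptions. The helper names and the sample row are hypothetical, and treating “$” as one currency is a simplification.

```python
from typing import List, Optional


def split_directors(director: Optional[str]) -> List[str]:
    """Split a director string on the "|" separator and trim each name."""
    if director is None or director.strip() in ("", "NA"):
        return []
    return [name.strip() for name in director.split("|") if name.strip()]


def budget_midpoint(low: Optional[float], high: Optional[float]) -> Optional[float]:
    """Midpoint of the reported budget range; equals the budget when low == high."""
    if low is None or high is None:
        return None
    return (low + high) / 2


def gross_in_currency(row: dict, symbol: str = "$") -> Optional[float]:
    """Return the worldwide gross only when it is reported in the given currency."""
    if row.get("worldwide_box_office_currency") != symbol:
        return None
    return row.get("worldwide_box_office")


# Hypothetical row, not taken from game_films.csv.
row = {
    "director": "Jane Doe | John Roe",
    "budget_low": 100e6,
    "budget_high": 150e6,
    "worldwide_box_office_currency": "¥",
    "worldwide_box_office": 2.5e9,
}
print(split_directors(row["director"]))
print(budget_midpoint(row["budget_low"], row["budget_high"]))
print(gross_in_currency(row))  # None: the gross is in yen, not dollars
```

Filtering by currency symbol before summing or ranking avoids mixing yen and dollar figures, which are stored in their original units.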
Cleaning Script
library(rvest)
library(tidyverse) # tidyverse 2.0.0+ also attaches lubridate, which provides parse_date_time()
url <- "https://en.wikipedia.org/wiki/List_of_films_based_on_video_games"
page <- read_html(url)
# Parse a money string such as "$100–150 million" into a number in base units.
# `fn` picks which extracted number to keep (e.g. first or last) when a range is given.
parse_scaled <- function(x, fn) {
  mult <- case_when(
    str_detect(x, regex("billion", ignore_case = TRUE)) ~ 1e9,
    str_detect(x, regex("million", ignore_case = TRUE)) ~ 1e6,
    .default = 1
  )
  nums <- str_extract_all(x, "[0-9][0-9,]*(?:\\.[0-9]+)?")
  map_dbl(nums, ~ {
    v <- parse_number(.x)
    if (length(v) == 0) NA_real_ else fn(v)
  }) * mult
}
# Clean one scraped wikitable and tag it with its section headings.
clean_film_table <- function(df, category, subcategory) {
  df |>
    janitor::clean_names() |>
    rename(any_of(c(cinema_score = "cinema_score_1"))) |>
    # Drop Wikipedia footnote markers such as "[1]"
    mutate(across(where(is.character), ~ str_remove_all(., "\\[.*?\\]"))) |>
    (\(d) if ("release_date" %in% names(d))
      d |> mutate(
        release_date_raw = as.character(release_date),
        release_date = release_date_raw |>
          str_remove("\\s*\\(.*") |>    # strip regional suffixes: "... (JP)..."
          str_remove("\\s*[–—-].*") |>  # strip date ranges: "... – end date"
          str_trim() |>
          parse_date_time(orders = c("mdy", "my", "Y"), quiet = TRUE) |>
          as.Date()
      ) |>
        relocate(release_date_raw, .after = release_date)
    else d)() |>
    # Capture the leading currency symbol before the values are parsed to numbers
    mutate(across(any_of(c("worldwide_box_office", "budget")),
                  ~ str_extract(., "^[£$€¥₹]"), .names = "{.col}_currency")) |>
    mutate(across(any_of("worldwide_box_office"), parse_number)) |>
    (\(d) if ("budget" %in% names(d))
      d |>
        mutate(budget_low = parse_scaled(budget, first), budget_high = parse_scaled(budget, last)) |>
        select(-budget)
    else d)() |>
    mutate(across(any_of("rotten_tomatoes"), ~ parse_number(str_remove(., "%")))) |>
    mutate(across(any_of("metacritic"), ~ parse_number(str_remove(., "/100")))) |>
    rename(any_of(c(director = "direction", air_date_raw = "original_air_date_s"))) |>
    mutate(category = category, subcategory = subcategory) |>
    relocate(category, subcategory) |>
    relocate(any_of("worldwide_box_office_currency"), .before = any_of("worldwide_box_office")) |>
    relocate(any_of("budget_currency"), .before = any_of("budget_low"))
}
# Heading text, preferring the .mw-headline span and falling back to the whole node.
get_heading_text <- function(node) {
  text <- html_text2(html_element(node, ".mw-headline"))
  if (is.na(text) || nchar(str_trim(text)) == 0) text <- html_text2(node)
  str_trim(text)
}
game_films <- local({
  # Walk headings and tables in document order so each table inherits its section
  nodes <- xml2::xml_find_all(page, ".//h2 | .//h3 | .//table[contains(@class,'wikitable')]")
  current_h2 <- NA_character_
  current_h3 <- NA_character_
  raw_tables <- list()
  categories <- character()
  subcategories <- character()
  for (i in seq_along(nodes)) {
    node <- nodes[[i]]
    tag <- html_name(node)
    if (tag == "h2") {
      current_h2 <- get_heading_text(node)
      current_h3 <- NA_character_
    } else if (tag == "h3") {
      current_h3 <- get_heading_text(node)
    } else if (tag == "table") {
      # Turn <hr> separators inside cells into " | " before parsing the table
      tbl <- tryCatch(
        node |>
          as.character() |>
          str_replace_all("<hr\\s*/?\\s*>\\s*", " | ") |>
          read_html() |>
          html_element("table") |>
          html_table(),
        error = function(e) NULL
      )
      if (!is.null(tbl) && nrow(tbl) > 0) {
        raw_tables <- c(raw_tables, list(tbl))
        categories <- c(categories, current_h2)
        subcategories <- c(subcategories, current_h3)
      }
    }
  }
  pmap(list(raw_tables, categories, subcategories), clean_film_table) |>
    bind_rows() |>
    relocate(any_of("air_date_raw"), .after = any_of("release_date_raw"))
})
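For readers following along in Python, the budget and release date parsing in the script above can be approximated as below. This is a rough port for illustration, not the code that produced the dataset. The `pick` parameter is an assumed stand-in for passing `first` or `last` as `fn`, the three `strptime` formats are a narrower substitute for lubridate's `mdy`, `my`, and `Y` orders, and the example strings are made up.

```python
import re
from datetime import date, datetime
from typing import Optional

_NUMBER = re.compile(r"[0-9][0-9,]*(?:\.[0-9]+)?")
# Assumed formats: "December 22, 1993", "June 2026", "1994".
_DATE_FORMATS = ("%B %d, %Y", "%B %Y", "%Y")


def parse_scaled(text: str, pick: str = "first") -> Optional[float]:
    """Extract the first or last number and scale it by million/billion."""
    lowered = text.lower()
    if "billion" in lowered:
        mult = 1e9
    elif "million" in lowered:
        mult = 1e6
    else:
        mult = 1.0
    nums = [float(m.replace(",", "")) for m in _NUMBER.findall(text)]
    if not nums:
        return None
    value = nums[0] if pick == "first" else nums[-1]
    return value * mult


def parse_release_date(raw: str) -> Optional[date]:
    """Keep the earliest date; partial dates fall on the first day of the period."""
    text = re.sub(r"\s*\(.*", "", raw, count=1)       # drop regional suffix "(JP)..."
    text = re.sub(r"\s*[–—-].*", "", text, count=1)   # drop the end of a date range
    text = text.strip()
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None  # e.g. "TBA" or an empty string


# Made-up inputs for illustration.
print(parse_scaled("$100–150 million", "first"), parse_scaled("$100–150 million", "last"))
print(parse_release_date("December 22, 1993 (JP)"), parse_release_date("June 2026"))
```

As in the R version, a single-value budget yields the same number for both picks, which is why `budget_low` and `budget_high` are equal for those films.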