Welcome to the RuPaul-Predict-a-Looza (and winner)!
Build a predictive model that accurately predicts the winner and loser of each RuPaul’s Drag Race episode before that new episode airs.
Teams with the highest proportion of correct predictions over the season (i.e. lowest average 0-1 loss). The prize: bragging rights and the ability to add one policy question on the next Data for Progress survey.
Models can use any publicly-available information: e.g. blog posts, Twitter sentiment, Instagram followers. To get you started, we at Data for Progress provide datasets here:
We will be updating datasets throughout the current season (season 11).
Models cannot use non-public information. They can, however, create their own datasets based on publicly-available information.
Each week (starting February 28), each team must submit their predictions by 5:00 PM EST each day an episode airs (including the premiere on February 28). Each team must submit only one prediction (the winner and the loser) for each episode. If the episode has more than one winner (or more than one loser), a team is considered correct if they predict one of the winners (or losers). If the episode has no winner (or no loser), a team is considered correct regardless of their prediction. If a team does not submit for an episode, it counts as an incorrect prediction. If a team submits more than once for an episode, only the most recent (before the deadline) will count. We also require you to describe (at a high level) your modeling strategy and input variables (i.e. features/covariates).
Predictions should be submitted here.
At the end of the competition, we ask that you also share your code and data for educational purposes. The goal of the competition is to encourage engagement with data science.
Sidenote: We will also compare teams’ loser predictions to Corey Johnson’s (speaker of the New York City Council). Corey will submit a weekly guess as to which queen will be eliminated in the episode.
Questions? Email rupaulpredictalooza@gmail.com
Following advice from Bayesian Data Analysis, we “start with a simple model that uses only some of the available information”. Teams can start here, or somewhere else entirely! This is only one example approach. You can use a different modeling approach and/or add variables. Code for this notebook is here.
We wrangle data (get data into R in a useful form) in the tidyverse, a collection of R packages that share a common philosophy. We conduct analysis in Stan, open-source software to do fully Bayesian inference.
library(tidyverse)
library(knitr)
library(googlesheets)
library(rstan)
knitr::dep_prev()
We begin by data wrangling, getting data into R in a useful form. We import and join datasets. We standardize continuous variables (see this paper).
gs_auth()
rpdr_data <- "1Sotvl3o7J_ckKUg5sRiZTqNQn3hPqhepBSeOpMTK15Q" %>% gs_key
all_episodes <- rpdr_data %>% gs_read("all_episodes")
all_contestants <- rpdr_data %>% gs_read("all_contestants")
all_rankings <- rpdr_data %>% gs_read("all_rankings")
#--------------------- add next episode to predict ---------------------#
all_episodes_next <- add_row(all_episodes, season_number = 11, episode_number = 13,
episode_airdate = NA, episode_title = NA,
episode_type = "Competition", episode_maxi_challenge_type = NA) %>%
arrange(season_number, episode_number) %>% rowid_to_column("t")
season_11_contestants <- (all_contestants %>% filter(season_number == 11))$contestant_id
eliminated_contestants <- (all_rankings %>% filter(is.element(episode_placement,c('ELIM','Eliminated'))))$contestant_id
next_contestants <- setdiff(season_11_contestants, eliminated_contestants)
all_rankings_next <- add_row(all_rankings, season_number = 11, episode_number = 13,
contestant_id = next_contestants, episode_placement = NA)
#-----------------------------------------------------------------------#
wrangled <- all_rankings_next %>%
left_join(all_contestants, by = c("season_number","contestant_id")) %>%
left_join(all_episodes_next, by=c("season_number", "episode_number")) %>%
mutate(placement = case_when(is.element(episode_placement,c('WIN','Winner')) ~ 1,
is.element(episode_placement,c('ELIM','Eliminated')) ~ -1,
TRUE ~ 0)) %>%
group_by(t) %>% mutate(num_winners = sum(placement == 1),
num_losers = sum(placement == -1)) %>%
arrange(desc(placement), .by_group = TRUE) %>% ungroup() %>% # within episode: winner first, loser last
filter(is.element(episode_type,c('Competition','Finale'))) %>%
group_by(contestant_id) %>% mutate(past_wins = cumsum(placement == 1) - (placement == 1)) %>%
ungroup() %>%
mutate(z.past_wins = (past_wins - mean(past_wins))/(2*sd(past_wins))) %>%
mutate(z.age = (age - mean(age))/(2*sd(age))) %>%
filter((num_winners == 1 & num_losers == 1) | t == max(t)) %>% # use data on typical episodes
filter(!is.element(episode_placement,c('Guest','Miss C')) | t == max(t)) %>% # use data on typical contestants
select(season_number, episode_number, t, contestant_id, contestant_name, # identifiers
z.age, z.past_wins, # x variables
placement, num_winners, num_losers) # episode outcomes
# renumber episodes skipping the atypical ones:
wrangled$t <- as.numeric(as.factor(wrangled$t))
next_t = max(wrangled$t)
Our wrangled data look like this:
season_number | episode_number | t | contestant_id | contestant_name | z.age | z.past_wins | placement | num_winners | num_losers |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 2 | Nina Flowers | 0.46 | -0.36 | 1 | 1 | 1 |
1 | 1 | 1 | 9 | BeBe Zahara Benet | -0.10 | -0.36 | 0 | 1 | 1 |
1 | 1 | 1 | 3 | Rebecca Glasscock | -0.28 | -0.36 | 0 | 1 | 1 |
1 | 1 | 1 | 1 | Shannel | 0.00 | -0.36 | 0 | 1 | 1 |
1 | 1 | 1 | 4 | Ongina | -0.28 | -0.36 | 0 | 1 | 1 |
1 | 1 | 1 | 8 | Jade | -0.37 | -0.36 | 0 | 1 | 1 |
1 | 1 | 1 | 6 | Akashia | -0.46 | -0.36 | 0 | 1 | 1 |
1 | 1 | 1 | 7 | Tammie Brown | 0.00 | -0.36 | 0 | 1 | 1 |
1 | 1 | 1 | 5 | Victoria Porkchop Parker | 0.92 | -0.36 | -1 | 1 | 1 |
… | |||||||||
… | |||||||||
… |
season_number | episode_number | t | contestant_id | contestant_name | z.age | z.past_wins | placement | num_winners | num_losers |
---|---|---|---|---|---|---|---|---|---|
11 | 13 | 97 | 137 | A’keria Chanel Davenport | 0.09 | 0.83 | 0 | 0 | 0 |
11 | 13 | 97 | 138 | Yvie Oddly | -0.46 | 0.24 | 0 | 0 | 0 |
11 | 13 | 97 | 139 | Silky Nutmeg Ganache | -0.10 | 0.83 | 0 | 0 | 0 |
11 | 13 | 97 | 140 | Brooke Lynn Hytes | 0.27 | 1.43 | 0 | 0 | 0 |
Now that we wrangled our data, we can model. Let \(i\) index the contestants and \(t\) index the episodes. We fit a multilevel, conditional logistic regression model with age, past wins, and a coefficient for each contestant. Here is our model in mathematical notation:
\[\eta_{it} = \beta_\text{age} \text{age}_i + \beta_\text{past_wins} \text{past_wins}_{it} + \alpha_i \ \ \text{ for } i \text{ playing in episode } t\]
\[P[i^* \text{ wins episode }t] = \frac{\exp(\eta_{i^*t})}{\sum_i \exp(\eta_{it})}\]
\[P[i_* \text{ loses episode }t] = \frac{\exp(-\eta_{i_*t})}{\sum_{i \ne i^*} \exp(-\eta_{it})}\] \[\alpha_i \sim N(0, \sigma)\]
We have prior knowledge from large corpuses of logistic regressions to rule out unreasonable values for coefficients. We express this by putting a Student’s t prior on \(\beta_\text{age}\) and \(\beta_\text{past_wins}\). This is a reasonable default prior when coefficients should be close to zero but have some chance of being large. We put a half-Normal prior on \(\sigma\). We follow the recommendations here.
data {
int N; // number of contestant-episode observations
int I; // number of contestants
int T; // number of episodes
int<lower=1,upper=I> ii[N]; // contestant for each observation
int num_contestants[T]; // number of contestants for each episode
vector[N] age;
vector[N] past_wins;
}
parameters {
real beta_age;
real beta_past_wins;
vector[I] alpha_contestant_raw;
real<lower=0> sigma;
}
transformed parameters {
vector[I] alpha_contestant = alpha_contestant_raw * sigma;
}
model {
vector[N] eta = beta_age * age + beta_past_wins * past_wins + alpha_contestant[ii];
beta_age ~ student_t(6,0,2.5);
beta_past_wins ~ student_t(6,0,2.5);
alpha_contestant_raw ~ normal(0,1);
sigma ~ normal(0,1);
{ int pos;
pos = 1;
for (t in 1:T) {
vector[num_contestants[t]] eta_t = segment(eta, pos, num_contestants[t]);
target += eta_t[1] - log_sum_exp(eta_t);
target += -1*eta_t[rows(eta_t)] - log_sum_exp(-1*eta_t[2:rows(eta_t)]); // remove winner (listed first)
pos = pos + num_contestants[t];
}}
}
fit_model <- function(df) {
standata <- list(
N = nrow(df),
I = max(df$contestant_id),
T = max(df$t),
ii = df$contestant_id,
num_contestants = (df %>% group_by(t) %>% summarise(n = n()))$n,
age = df$z.age,
past_wins = df$z.past_wins
)
sampling(object = model, data = standata, chains = 4, iter = 2000, control = list(adapt_delta = 0.99))
}
data_so_far <- wrangled %>% filter(t < next_t)
fit <- fit_model(df = data_so_far)
print(fit, pars = c('beta_age','beta_past_wins','sigma'))
## Inference for Stan model: 99f86f9a801588e1014aedc5c86c67f1.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta_age -0.05 0.00 0.22 -0.48 -0.19 -0.04 0.09 0.38 2658 1.00
## beta_past_wins 0.18 0.02 0.40 -0.70 -0.07 0.23 0.47 0.83 348 1.01
## sigma 0.66 0.02 0.37 0.04 0.38 0.65 0.91 1.42 240 1.01
##
## Samples were drawn using NUTS(diag_e) at Sat May 18 13:56:22 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
stan_plot(fit, pars = c('beta_age','beta_past_wins','sigma')) + geom_vline(xintercept=0, size = 2)
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)
To estimate the predictive performance for new episodes we use leave-future-out cross-validation:
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
L = 30 # always use at least L past episodes of data
correct = c()
random_correct = c()
# this will take a long time!
for (t_current in (L+1):(next_t - 1)) {
data_current <- wrangled %>% filter(t < t_current) # leave-future-out
fit_current <- fit_model(df = data_current)
newdata <- wrangled %>% filter(t == t_current)
s <- as.data.frame(fit_current) # simulations from the posterior
eta_s <- matrix(NA, nrow = nrow(s), ncol = nrow(newdata))
for (n in 1:nrow(newdata)) {
i = newdata$contestant_id[n]
if (is.element(i,data_current$contestant_id)) { # we have seen this contestant before
alpha_contestant_s <- s[,c(paste('alpha_contestant[',i,']',sep=''))]
} else { # this is a new contestant
alpha_contestant_s <- rnorm(n = nrow(s), mean = 0, sd = s$sigma)
}
eta_s[,n] <- s$beta_age * newdata$z.age[n] + s$beta_past_wins * newdata$z.past_wins[n] + alpha_contestant_s
}
winner_s <- apply(eta_s, MARGIN = 1, FUN = which.max)
pred_winner <- Mode(winner_s)
loser_s <- apply(eta_s, MARGIN = 1, FUN = which.min)
pred_loser <- Mode(loser_s)
correct <- c(correct, (newdata$placement[pred_winner] == 1) &
(newdata$placement[pred_loser] == -1))
random_correct = c(random_correct, (1/nrow(newdata))*(1/(nrow(newdata)-1)))
}
We get 6% accuracy with our model. A random guess gets 2% accuracy. Can your model do better?
Sidenote: Computing leave-future-out cross-validation above took a long time. For an approximate leave-future-out cross-validation that is much faster, see this paper.
Let’s now use our model to predict the winner and loser of the next episode:
newdata <- wrangled %>% filter(t == next_t)
s <- as.data.frame(fit) # simulations from the posterior
eta_s <- matrix(NA, nrow = nrow(s), ncol = nrow(newdata))
for (n in 1:nrow(newdata)) {
i = newdata$contestant_id[n]
if (is.element(i,data_so_far$contestant_id)) { # we have seen this contestant before
alpha_contestant_s <- s[,c(paste('alpha_contestant[',i,']',sep=''))]
} else { # this is a new contestant
alpha_contestant_s <- rnorm(n = nrow(s), mean = 0, sd = s$sigma)
}
eta_s[,n] <- s$beta_age * newdata$z.age[n] + s$beta_past_wins * newdata$z.past_wins[n] + alpha_contestant_s
}
winner_s <- apply(eta_s, MARGIN = 1, FUN = which.max)
pred_winner <- Mode(winner_s)
loser_s <- apply(eta_s, MARGIN = 1, FUN = which.min)
pred_loser <- Mode(loser_s)
We predict Brooke Lynn Hytes will win and Yvie Oddly will lose.
To submit, we go here: