Modify/Impute Certain Rows with Group Statistic

In this article, I’ll show how to fill in a variable’s missing values, based on some computation done with respect to groupings.

The famous example here is to impute the Age variable of the Titanic dataset. Here the missing values are replaced with non-missing average age based on the Pclass attribute. First I’ll show the classic approach with a for loop. Second I’ll show the data.table way.

# Read in the titanic_train.csv file
library(data.table)
library(tidyverse)
titanic = fread(file_path)

Let’s first find average age of each passenger class

avg_age = titanic[!is.na(Age), .(avg = round(mean(Age))), by = .(Pclass)][order(Pclass)]
avg_age
##    Pclass avg
## 1:      1  38
## 2:      2  30
## 3:      3  25

The classic way

Run through each example, check whether Age is missing, and if so then change it with the average age within the group it belongs to.

# Create function
impute_age <- function(age, passenger_class){
    imputed_age <- age
    for (i in seq_along(age)){
        if (is.na(age[i])){
            if (passenger_class[i] == 1){
                imputed_age[i] <- 38
            }else if (passenger_class[i] == 2){
                imputed_age[i] <- 30
            }else{
                imputed_age[i] <- 25
            }
        }
    }
    return(imputed_age)
}

titanic[, age_imputed := impute_age(Age, Pclass)]

The more efficient way

This can be ok if you have few classes to check, but what if there were many more? Going through an if statement one-by-one could be very tedious.

We’ll first merge the average age table with the main table based on Pclass variable. This will create a new column called avg that has the average age for the corresponding class.

Then, we can just filter out the missing values and assign them the average age of their corresponding class.

titanic = merge(titanic, avg_age, by = 'Pclass')
titanic[, age_imputed2 := Age]
titanic[is.na(Age), age_imputed2 := avg]

# To check whether the imputed ages are the same using dplyr:
titanic %>% summarise(ages_not_equal= sum(age_imputed != age_imputed2))
##   ages_not_equal
## 1              0
# To check whether the imputed ages are the same using data.table:
titanic[is.na(Age), .(ages_not_equal = age_imputed!=age_imputed2)][,.(ages_not_equal= sum(ages_not_equal))]
##    ages_not_equal
## 1:              0

So as you can see it is much easier to deal with imputing (or modifying) a variable this way if you are filling it in with values based on another variable’s classes.

Thank you folks, and see you next time!

Ilyas Ustun
Jan 29, 2018
Detroit, MI

comments powered by Disqus