Converting Factors to Numbers in R

Take care when converting factors to numeric format in R, even when the factor values look numeric. Here, I’ll explain the proper process and the reason behind its seemingly awkward steps.

The dataset used here is the UCI Student Performance data set. We will work with the students’ math scores.

library(data.table)
library(tidyverse)

# file_path must already hold the directory containing the downloaded data, ending in a slash
x = fread(paste0(file_path, 'student-mat.csv'), stringsAsFactors = TRUE)
str(x)
## Classes 'data.table' and 'data.frame':   395 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Although G1, G2, and G3 all hold numeric grades, the first two are stored as character data in the file, so they can be read in as character columns (and hence as factors here) and then need to be converted. Since the values all look like integers, let’s try the as.integer() and as.numeric() functions for the conversion.

data.frame(x$G1, as.numeric(x$G1), as.integer(x$G1)) %>% head(10)
##    x.G1 as.numeric.x.G1. as.integer.x.G1.
## 1     5                5                5
## 2     5                5                5
## 3     7                7                7
## 4    15               15               15
## 5     6                6                6
## 6    15               15               15
## 7    12               12               12
## 8     6                6                6
## 9    16               16               16
## 10   14               14               14

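With G1 stored as a factor, the result is unexpected: a value displayed as 5 becomes 13 after conversion, 7 becomes 15, and so on. (The output above does not show this, because G1 had already been read as an integer when it was captured, so all three columns agree.) Why does a factor behave this way? To find out, we’ll check the levels of the factor variable.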

levels(x$G1)
## NULL

Now we can see what is behind this phenomenon. levels() returns NULL above only because G1 was read as an integer when this output was captured; for a factor, it lists the distinct values as strings in sorted order. In that list, 5 is the 13th level and 7 is the 15th. Internally, R stores a factor as integer codes that index into its levels, so as.numeric() and as.integer() return those codes (13 and 15) rather than the labels (5 and 7). So, before we start blaming R, we need to know this fact. And no, R is not guilty here :)
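The same behaviour can be reproduced without the data file. The sketch below uses a small made-up factor `g` (the values are an assumption for illustration, not taken from the dataset) to show that direct conversion returns positions in the level list rather than the labels.

```r
# Toy factor standing in for a grade column that was read as text
# (hypothetical values, not the real G1 column)
g <- factor(c("5", "5", "7", "15", "6"))

# Levels are the distinct values sorted as strings, so "15" sorts before "5"
levels(g)
## [1] "15" "5"  "6"  "7"

# as.integer() returns each element's position in levels(g), not its label
codes <- as.integer(g)
codes
## [1] 2 2 4 1 3
```

Here the label 5 comes back as 2 and 15 comes back as 1, which is the same kind of mismatch described above.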

To get the expected results from R, we need to first convert factors to character format, and then to numeric format. This will get rid of the aforementioned issue.

data.frame(fac_G1 = x$G1, 
                num_G1 = as.numeric(as.character(x$G1)),
                int_G1 = as.integer(as.character(x$G1)) ) %>% 
    head(10)
##    fac_G1 num_G1 int_G1
## 1       5      5      5
## 2       5      5      5
## 3       7      7      7
## 4      15     15     15
## 5       6      6      6
## 6      15     15     15
## 7      12     12     12
## 8       6      6      6
## 9      16     16     16
## 10     14     14     14

Now we got it right.
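As a side note, R’s help page for factor suggests as.numeric(levels(f))[f] as a slightly more efficient way to do the same conversion, since each distinct label is converted only once. Below is a minimal sketch of that idiom; the helper name `factor_to_numeric` and the toy factor `g` are made up for illustration.

```r
# Hypothetical helper: convert a factor whose labels are numbers to numeric
factor_to_numeric <- function(f) {
  stopifnot(is.factor(f))
  # Convert each distinct label once, then index by the factor's integer codes
  as.numeric(levels(f))[f]
}

# Toy factor with made-up values
g <- factor(c("5", "5", "7", "15", "6"))
factor_to_numeric(g)
## [1]  5  5  7 15  6

# Same result as the two-step conversion through character
identical(factor_to_numeric(g), as.numeric(as.character(g)))
## [1] TRUE
```

For a column of a few hundred rows, as here, the difference in speed is unlikely to matter, so either form should be fine.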
