2. Fips Data from US Census Bureau

Previously, I took the fips data from the maps package and joined it with polygon data, which made it ready for geographical mapping.

I have also found fips data from the US Census Bureau. Here we will explore this dataset, clean it, and see whether it is more useful than the fips data in the maps package.

This dataset, as we’ll see shortly, is more convoluted and requires more steps to prepare. Nevertheless, I’ll still do it.

library(stringr)
library(knitr)
library(tidyverse)
library(data.table)
library(DT)

Data Cleaning

county.fips2 = fread(str_c(files_dir, 'fips_code_state_county.csv'), colClasses = "character")
county.fips2[, V5 := NULL]
setnames(county.fips2, c('state_alpha', 'state_fips', 'county_fips', 'county'))

# Concatenate state and county codes into one fips column (unite_() is deprecated in tidyr)
county.fips2 = unite(county.fips2, col = 'fips', state_fips, county_fips, sep = '', remove = TRUE)

state_names = fread(str_c(files_dir, 'state_names.csv'))
county.fips2 = merge(county.fips2, state_names, by = 'state_alpha')

setnames(county.fips2, 'state_name', 'state')
county.fips2 = county.fips2[, c(1,4:5,3,2)]

county_clean = county.fips2 %>% 
    select(county, state) %>%
    map_df(.f = str_to_lower) %>%
    map_df(.f = str_replace_all, pattern = '[[:punct:]]', replacement = '') %>%
    map_df(.f = str_replace_all, pattern = '[[:space:]]', replacement = '')

county.fips2$county = county_clean$county
county.fips2$state = county_clean$state

county.fips2[, fips := as.integer(fips)]

# Check for duplicates
county.fips2[duplicated(county.fips2$fips), ]
## Empty data.table (0 rows) of 5 cols: state_alpha,state,state_ansi,county,fips

After importing the fips data, I did some basic cleaning by removing punctuation and spaces. There are no duplicated entries in this dataset. Before anything else, I’d like to see whether the number of entries and the fips values match the previous fips data that we obtained from the maps package.
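
The name cleaning described above can be sketched on a small vector. This is a minimal illustration, not the exact pipeline used in this post: the county names are toy inputs, normalize_name is a hypothetical helper, and base R's tolower and gsub stand in for str_to_lower and str_replace_all so that the snippet runs without extra packages.

```r
# Toy county names; the real names come from the Census Bureau file.
raw_names <- c("St. Mary's County", "Prince George's County", "De Kalb County")

# Hypothetical helper: lowercase, then drop punctuation and whitespace.
normalize_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  gsub("[[:space:]]", "", x)
}

clean_names <- normalize_name(raw_names)
print(clean_names)
```

Normalizing both sides the same way is what makes a later join on state and county names possible, since small differences in spelling such as "De Kalb" versus "DeKalb" disappear.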

# county.fips (from the maps package) has 3075 rows

county.fips2[, .N]
## [1] 3235
county.fips2[, uniqueN(state_alpha)]
## [1] 57

The number of rows differs, but that is because county.fips2, obtained from the Census Bureau, also covers Alaska, Hawaii, and the overseas territories. The county.fips from the maps package contains only the 48 contiguous states and the District of Columbia; it has neither Alaska nor Hawaii. This might be a good reason to use the Census Bureau data for people who need these other states.

The following are the states and territories present in the fips data obtained from the Census Bureau, but not available in the one obtained from the maps package:

  - AK: Alaska
  - HI: Hawaii
  - MP: Northern Mariana Islands
  - PR: Puerto Rico
  - AS: American Samoa
  - VI: US Virgin Islands
  - GU: Guam
  - UM: U.S. Minor Outlying Islands

What if we exclude these and look only at the contiguous United States?

non_contiguous = c('AK', 'HI', 'MP', 'PR', 'AS', 'VI', 'GU', 'UM')

# Number of counties in the 48 contiguous states and the District of Columbia
county.fips2[!state_alpha %in% non_contiguous, .N]
## [1] 3109

OK! So the dataset obtained from the US Census Bureau has 34 more counties (3109 versus 3075). This data is likely to be more up-to-date, which might be another good reason to use it instead of the one in the maps package.
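
To see which counties account for such a difference, the two sets of codes can be compared directly. The sketch below uses made-up vectors rather than the real tables: fips_maps and fips_census are hypothetical stand-ins, and the assumption is that the maps package stores fips as integers while the Census file is read as character.

```r
# Hypothetical FIPS vectors standing in for the two sources.
fips_maps   <- c(1001L, 1003L, 51019L)                 # integer codes (assumed)
fips_census <- c("01001", "01003", "51019", "51515")   # character, as read from a CSV

# Converting to integer drops leading zeros, so both sides become comparable.
census_int <- as.integer(fips_census)

# Codes present in the Census vector but missing from the maps vector.
only_in_census <- setdiff(census_int, fips_maps)
print(only_in_census)

# Pad back to the five-character form when a code needs to be displayed.
padded <- sprintf("%05d", only_in_census)
print(padded)
print(sprintf("%05d", fips_maps[1]))
```

The same setdiff call with the arguments swapped would show codes that exist only in the maps data, which is a quick way to spot renamed or retired counties.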

Check for Suffixes

Let’s now check for some suspected suffixes or prefixes and see whether they are present.

d_city = county.fips2[county %>% str_detect('city'), .SD]
d_city  %>% datatable()
d_borough = county.fips2[county %>% str_detect('borough'), .SD]
d_borough %>% datatable()
d_county = county.fips2[county %>% str_detect('county'), .SD]
d_county %>% datatable()
d_parish = county.fips2[county %>% str_detect('parish'), .SD]
d_parish %>% datatable()
d_muni = county.fips2[county %>% str_detect('municip'), .SD]
d_muni %>% datatable()
d_main = county.fips2[county %>% str_detect('main'), .SD]
d_main
## Empty data.table (0 rows) of 5 cols: state_alpha,state,state_ansi,county,fips
  1. Checking for these suffixes reveals a new one, borough, which is mostly used in Alaska, with one match in Florida and one in New Hampshire. The ones in Alaska seem redundant, but the other two states will be investigated.
  2. Many counties in Virginia have the suffix city. In other states, city is part of the county name and should not be discarded.
  3. Almost all the counties have county in their names, which will be discarded.
  4. The counties in Louisiana have parish in their names, which will also be discarded.
  5. Counties in overseas territories such as Puerto Rico have municipio in their names, and some in Alaska have municipality. These will be discarded.
  6. There are no counties with main.
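
The per-suffix checks above can also be summarised in a single table of matches per state. This is only a sketch on toy rows, with base R's grepl standing in for str_detect; the data frame toy and the object names are made up for illustration.

```r
# Toy rows in the cleaned format (lowercase, no spaces or punctuation).
toy <- data.frame(
  state  = c("alaska", "alaska", "virginia", "virginia", "louisiana", "florida"),
  county = c("kodiakislandborough", "anchoragemunicipality", "alexandriacity",
             "jamescitycounty", "acadiaparish", "hillsboroughcounty"),
  stringsAsFactors = FALSE
)
suffixes <- c("city", "borough", "county", "parish", "municip")

# One logical column per pattern: TRUE where the pattern occurs in the name.
hits <- sapply(suffixes, function(p) grepl(p, toy$county, fixed = TRUE))

# Sum the hits within each state to see where each pattern is concentrated.
hits_by_state <- rowsum(hits * 1L, group = toy$state)
print(hits_by_state)
```

A table like this makes it easier to tell a state-wide naming convention, which is a candidate for removal, from an isolated match that is probably part of the name.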

Investigation results:

  1. borough in Florida and New Hampshire is part of the name.
  2. city is part of the name for James City and Charles City in Virginia; the rest can be discarded.

# Remove "county" and "parish" from the ends of the county names
# (update by reference so the column stays a plain character vector)
county.fips2[, county := str_replace_all(county, '(county|parish)$', '')]

# Clean "borough", "censusarea" and the connective "and" in Alaska names
county.fips2[state == 'alaska',
             county := str_replace_all(county, 'borough|censusarea|and', '')]

# Clean "municipio" and "municipality"
# (matching only 'municip' would leave 'io' or 'ality' behind)
county.fips2[, county := str_replace_all(county, 'municipio|municipality', '')]

# Clean "city": strip the suffix from Virginia names,
# keeping it for James City and Charles City, where it is part of the name
county.fips2[state == 'virginia' & !county %in% c('jamescity', 'charlescity'),
             county := str_replace_all(county, 'city$', '')]

Finally, the county.fips2 dataset is clean. There might be a few more issues, but those are left to the user who is interested in this dataset.
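
One such remaining issue is easy to check for: once both county and city are stripped, a county and a city with the same base name in one state end up with identical cleaned names. The sketch below shows one way to flag these, assuming toy rows and illustrative fips values rather than the real table.

```r
# Toy rows after suffix removal; fips values are illustrative only.
toy <- data.frame(
  state  = c("virginia", "virginia", "virginia", "maryland"),
  county = c("richmond", "richmond", "fairfax", "baltimore"),
  fips   = c(51159L, 51760L, 51059L, 24005L),
  stringsAsFactors = FALSE
)

# Build a state|county key and keep every row whose key occurs more than once.
key <- paste(toy$state, toy$county, sep = "|")
collisions <- toy[key %in% key[duplicated(key)], ]
print(collisions)
```

Rows flagged this way still have distinct fips codes, so joins on fips are unaffected; only joins on the cleaned state and county names would need extra care.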
