2. Fips Data from US Census Bureau

Data Cleaning
Check for Suffixes

Previously, I took the fips data from the maps package and joined it with the polygon data, which left us ready to perform geographical mapping.

I have also found fips data published by the US Census Bureau. Here we will explore this dataset, clean it, and see whether it offers any advantage over the fips data in the maps package.

As we’ll see shortly, this dataset is messier and takes more steps to prepare. Nevertheless, I’ll go through them.

library(stringr)
library(knitr)
library(tidyverse)
library(data.table)
library(DT)

Data Cleaning

# Read every column as character so leading zeros in the codes survive
county.fips2 = fread(str_c(files_dir, 'fips_code_state_county.csv'), colClasses = "character")
county.fips2[, V5 := NULL]
setnames(county.fips2, c('state_alpha', 'state_fips', 'county_fips', 'county'))

# Concatenate the state and county codes into one fips code (unite_() is deprecated in tidyr)
county.fips2 = unite(data = county.fips2, col = 'fips', state_fips, county_fips, sep = '', remove = TRUE)

# Attach the full state names
state_names = fread(str_c(files_dir, 'state_names.csv'))
county.fips2 = merge(county.fips2, state_names, by = 'state_alpha')

setnames(county.fips2, 'state_name', 'state')
# Reorder the columns: state_alpha, state, state_ansi, county, fips
county.fips2 = county.fips2[, c(1,4:5,3,2)]

county_clean = county.fips2 %>% 
    select(county, state) %>%
    map_df(.f = str_to_lower) %>%
    map_df(.f = str_replace_all, pattern = '[[:punct:]]', replacement = '') %>%
    map_df(.f = str_replace_all, pattern = '[[:space:]]', replacement = '')

county.fips2$county = county_clean$county
county.fips2$state = county_clean$state

county.fips2[, fips := as.integer(fips)]

# Check for duplicates
county.fips2[duplicated(county.fips2$fips), ]

## Empty data.table (0 rows) of 5 cols: state_alpha,state,state_ansi,county,fips

After importing the fips data, I did some basic cleaning by removing punctuation and whitespace. There are no duplicated fips codes in this dataset. Before anything else, I’d like to see whether the number of entries and the fips values match the fips data that we obtained earlier from the maps package.
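
To make the cleaning step concrete, here is a minimal sketch of the same normalization on a few sample names. It uses base R (`tolower` and `gsub`) instead of stringr so that it runs without extra packages; the helper name `normalize_name` and the sample names are assumptions for illustration, not part of the Census file.

```r
# Hypothetical sample names, not read from the Census file
raw_names <- c("St. Louis city", "Prince George's County", "De Kalb County")

# Hypothetical helper mirroring the three cleaning steps used above
normalize_name <- function(x) {
  x <- tolower(x)                    # same role as str_to_lower()
  x <- gsub("[[:punct:]]", "", x)    # drop punctuation
  gsub("[[:space:]]", "", x)         # drop whitespace
}

clean_names <- normalize_name(raw_names)
print(clean_names)
```

Squeezing names down like this makes joins on county names less sensitive to spelling variants, at the cost of readability.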

# county.fips (maps package) has 3075 rows

county.fips2[, .N]

## [1] 3235

county.fips2[, .(unique(state_alpha))][, .N]

## [1] 57

The number of rows is different, but that’s because county.fips2, obtained from the Census Bureau, also covers Alaska, Hawaii, and the overseas territories. The county.fips table from the maps package contains only the 48 contiguous states and the District of Columbia; it has neither Alaska nor Hawaii. This might be a good reason to use the Census Bureau data if you need these other areas.

The following states and territories are present in the fips data obtained from the Census Bureau, but not in the data obtained from the maps package:

AK: Alaska
HI: Hawaii
MP: Northern Mariana Islands
PR: Puerto Rico
AS: American Samoa
VI: US Virgin Islands
GU: Guam
UM: U.S. Minor Outlying Islands
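
As a rough cross-check of the count of 57, the sketch below rebuilds the set of codes from the `state.abb` vector that ships with base R. It assumes that the 57 codes are the 50 states, DC, and the six territories listed above, and that the maps data covers the 48 contiguous states plus DC; neither vector is read from the actual files.

```r
# Assumption: 50 states + DC + six territories make up the Census codes
territories  <- c("AS", "GU", "MP", "PR", "UM", "VI")
census_codes <- c(state.abb, "DC", territories)   # state.abb ships with base R

# Assumption: maps covers the 48 contiguous states plus DC
maps_codes <- c(setdiff(state.abb, c("AK", "HI")), "DC")

# Codes present in the Census set but absent from the maps set
missing_from_maps <- setdiff(census_codes, maps_codes)

print(length(census_codes))
print(missing_from_maps)
```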

What if we exclude these and look only at the contiguous USA?

non_contiguous = c('AK', 'HI', 'MP', 'PR', 'AS', 'VI', 'GU', 'UM')

# Number of counties in the 48 contiguous states and DC
county.fips2[!state_alpha %in% non_contiguous, .N]

## [1] 3109

OK! So there are 34 more counties (3109 versus 3075) in the dataset obtained from the US Census Bureau. This data is likely to be more up to date, which might be another good reason to use it instead of the one in the maps package.
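
To see which counties account for such a difference, one option is to keep the rows whose fips code has no match in the other table. The sketch below does this in base R on two tiny made-up tables; `census_toy` and `maps_toy` are hypothetical stand-ins for county.fips2 and county.fips, and their rows are illustrative only.

```r
# Hypothetical toy tables standing in for county.fips2 and county.fips
census_toy <- data.frame(
  fips   = c(1001L, 1003L, 8014L),
  county = c("autauga", "baldwin", "broomfield"),
  stringsAsFactors = FALSE
)
maps_toy <- data.frame(fips = c(1001L, 1003L))

# Keep the rows whose fips code does not appear in the other table
extra <- census_toy[!census_toy$fips %in% maps_toy$fips, ]
print(extra)
```

Running the same filter on the real tables would list the 34 extra counties rather than just counting them.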

Check for Suffixes

Let’s now check for some suspected suffixes or prefixes and see whether they are present.

d_city = county.fips2[county %>% str_detect('city'), .SD]
d_city  %>% datatable()

	state_alpha	state	state_ansi	county	fips
1	AK	alaska	2	juneaucityandborough	2110
2	AK	alaska	2	sitkacityandborough	2220
3	AK	alaska	2	wrangellcityandborough	2275
4	AK	alaska	2	yakutatcityandborough	2282
5	MD	maryland	24	baltimorecity	24510
6	MO	missouri	29	stlouiscity	29510
7	NV	nevada	32	carsoncity	32510
8	VA	virginia	51	charlescitycounty	51036
9	VA	virginia	51	jamescitycounty	51095
10	VA	virginia	51	alexandriacity	51510

Showing 1 to 10 of 48 entries

d_borough = county.fips2[county %>% str_detect('borough'), .SD]
d_borough %>% datatable()

	state_alpha	state	state_ansi	county	fips
1	AK	alaska	2	aleutianseastborough	2013
2	AK	alaska	2	bristolbayborough	2060
3	AK	alaska	2	denaliborough	2068
4	AK	alaska	2	fairbanksnorthstarborough	2090
5	AK	alaska	2	hainesborough	2100
6	AK	alaska	2	juneaucityandborough	2110
7	AK	alaska	2	kenaipeninsulaborough	2122
8	AK	alaska	2	ketchikangatewayborough	2130
9	AK	alaska	2	kodiakislandborough	2150
10	AK	alaska	2	lakeandpeninsulaborough	2164

Showing 1 to 10 of 18 entries

d_county = county.fips2[county %>% str_detect('county'), .SD]
d_county %>% datatable()

	state_alpha	state	state_ansi	county	fips
1	AL	alabama	1	autaugacounty	1001
2	AL	alabama	1	baldwincounty	1003
3	AL	alabama	1	barbourcounty	1005
4	AL	alabama	1	bibbcounty	1007
5	AL	alabama	1	blountcounty	1009
6	AL	alabama	1	bullockcounty	1011
7	AL	alabama	1	butlercounty	1013
8	AL	alabama	1	calhouncounty	1015
9	AL	alabama	1	chamberscounty	1017
10	AL	alabama	1	cherokeecounty	1019

Showing 1 to 10 of 3,007 entries

d_parish = county.fips2[county %>% str_detect('parish'), .SD]
d_parish %>% datatable()

	state_alpha	state	state_ansi	county	fips
1	LA	louisiana	22	acadiaparish	22001
2	LA	louisiana	22	allenparish	22003
3	LA	louisiana	22	ascensionparish	22005
4	LA	louisiana	22	assumptionparish	22007
5	LA	louisiana	22	avoyellesparish	22009
6	LA	louisiana	22	beauregardparish	22011
7	LA	louisiana	22	bienvilleparish	22013
8	LA	louisiana	22	bossierparish	22015
9	LA	louisiana	22	caddoparish	22017
10	LA	louisiana	22	calcasieuparish	22019

Showing 1 to 10 of 64 entries

d_muni = county.fips2[county %>% str_detect('municip'), .SD]
d_muni %>% datatable()

	state_alpha	state	state_ansi	county	fips
1	AK	alaska	2	anchoragemunicipality	2020
2	AK	alaska	2	skagwaymunicipality	2230
3	MP	northernmarianaislands	69	northernislandsmunicipality	69085
4	MP	northernmarianaislands	69	rotamunicipality	69100
5	MP	northernmarianaislands	69	saipanmunicipality	69110
6	MP	northernmarianaislands	69	tinianmunicipality	69120
7	PR	puertorico	72	adjuntasmunicipio	72001
8	PR	puertorico	72	aguadamunicipio	72003
9	PR	puertorico	72	aguadillamunicipio	72005
10	PR	puertorico	72	aguasbuenasmunicipio	72007

Showing 1 to 10 of 84 entries

d_main = county.fips2[county %>% str_detect('main'), .SD]
d_main

## Empty data.table (0 rows) of 5 cols: state_alpha,state,state_ansi,county,fips
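
The same checks can also be run in one pass to get a count per pattern. This is only a sketch in base R on a handful of made-up names; `toy_counties` is a hypothetical vector, not the real county column.

```r
# Hypothetical cleaned names standing in for county.fips2$county
toy_counties <- c("autaugacounty", "acadiaparish", "baltimorecity",
                  "charlescitycounty", "denaliborough", "hillsboroughcounty")

patterns <- c("city", "borough", "county", "parish", "municip", "main")

# Count how many names contain each pattern as a literal substring
hits <- sapply(patterns, function(p) sum(grepl(p, toy_counties, fixed = TRUE)))
print(hits)
```

A count of zero, as for main here, corresponds to the empty data.table returned by the last check.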

Checking for these suffixes reveals a new one: borough, used mostly in Alaska, with one match in Florida and one in New Hampshire. The ones in Alaska seem redundant, but the other two states will be investigated.
Many entries in Virginia have the suffix city. In the other states, city is part of the county name and should not be discarded.
Almost all the counties have county in their names, which will be discarded.
The counties in Louisiana have parish in their names, which will also be discarded.
County equivalents in Puerto Rico have municipio in their names, and some in Alaska and the Northern Mariana Islands have municipality. These will be discarded.
There are no counties with main.

Investigation results:
1. borough in Florida and New Hampshire is part of the name.
2. city is part of the name for James City and Charles City in Virginia; the rest can be discarded.
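
One way to express the second rule is a small helper that strips a trailing city unless the name is on a keep list. This is a sketch under assumptions: `strip_city` is a hypothetical helper, the county suffix is taken to be removed already, and the sample names are illustrative.

```r
# Hypothetical helper: strip a trailing "city" unless the name is protected
strip_city <- function(county, keep = c("jamescity", "charlescity")) {
  ifelse(county %in% keep, county, sub("city$", "", county))
}

va_names <- c("alexandriacity", "jamescity", "charlescity", "fairfax")
va_clean <- strip_city(va_names)
print(va_clean)
```

Anchoring the pattern with `$` limits the removal to the end of the name, so a city that appears in the middle of a name is left untouched.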

# Remove "county" and "parish" from the county names
county.fips2[, county := str_replace_all(county, 'county|parish', '')]

# Clean the Alaska suffixes. The patterns are anchored to the end of the name,
# so an "and" inside a name (e.g. kodiakisland) is left alone; a bare 'and'
# pattern would turn it into "kodiakisl".
county.fips2[state == 'alaska',
             county := county %>%
                 str_replace('andborough$', '') %>%
                 str_replace('borough$', '') %>%
                 str_replace('censusarea$', '')]

# Clean "municipality" and "municipio"
# (the bare pattern 'municip' would leave "ality" and "io" behind)
county.fips2[, county := str_replace_all(county, 'municipality|municipio', '')]

# Clean "city" in Virginia, keeping James City and Charles City intact
keep_city = c('jamescity', 'charlescity')
county.fips2[state == 'virginia' & !county %in% keep_city,
             county := str_replace_all(county, 'city', '')]

Finally, the county.fips2 dataset is clean. There might be a few more issues, but those are left to the reader who is interested in this dataset.