Load in a sample of five counties.
library(countyfipsR)
library(fuzzyjoin)
library(dplyr)
head(countyfips_sample)
#> # A tibble: 5 × 7
#> fips county_name state_name state_abb state_fips county_fips county_ns
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 13145 Harris County Georgia GA 13 145 326700
#> 2 78030 St. Thomas Island U.S. Virgi… VI 78 030 2378250
#> 3 38053 Mckenzie County North Dako… ND 38 053 1035299
#> 4 48045 Briscoe County Texas TX 48 045 1383808
#> 5 19147 Palo Alto County Iowa IA 19 147 465262
Perhaps one wants to merge in land and water areas by county; however, you don’t have any numerical identifies to do this with. All you have is an unclean text document with (bad) county names as the identifier. The ‘county_name’ variable is correct in some instances, missing parts of the name (county or a period), or entirely misspelled.
badmerge <- tribble(
~"county_name", ~"state_name", ~"aland", ~"awater",
"Harris", "Georgia", 41201240346, 23711353,
"St Thomas Island", "U.S. Virgin Islands", 81108880, 717920922,
"Mckenzie County", "North Dakota", 7148712746, 260119355,
"Briscoe County", "Texas", 2330991073, 4068657,
"Pal Alto County", "Iowa", 1460400553, 14428727,
"Mineral County", "Montana", 3158799812, 9897218
)
The fuzzyjoin
package is perfect for this situation.
The stringdist_left_join
command does the work. As an
aside, all the typical join methods are available by modifying the
‘inner’ portion of the command. You can also use a ‘mode’ option within
the command to specify the join method.
The choice of method is well beyond this article; however, they are
all the options available from the stringdist
package.
Documentation can be found here and more on
the various methods (theory mostly) can be found here.
method = "jw"
implements Jaro-Winkler distance, which
appears to be (to my untrained eye) the best character based matching
method.
joined <- countyfips_sample |>
stringdist_inner_join(
badmerge,
by = "county_name", #match based on county_name
method = "jw", #use jw distance metric
max_dist = 0.5,
distance_col = "dist"
) |>
group_by(county_name.x) |>
slice_min(order_by=dist, n=1)
joined |> select(fips, county_name.x, county_name.y, dist)
#> # A tibble: 5 × 4
#> # Groups: county_name.x [5]
#> fips county_name.x county_name.y dist
#> <chr> <chr> <chr> <dbl>
#> 1 48045 Briscoe County Briscoe County 0
#> 2 13145 Harris County Harris 0.179
#> 3 38053 Mckenzie County Mckenzie County 0
#> 4 19147 Palo Alto County Pal Alto County 0.110
#> 5 78030 St. Thomas Island St Thomas Island 0.0196
Here, stringdist_inner_join
is attempting to match the
‘county_name’ string in ‘countyfips_sample’ with the same variable in
the ‘badmerge’ data. It’s doing so with the Jaro-Winkler distance method
with a max distance of 0.5 (JW = 0 for no matched string and JW = 1 for
exact matches; the choice of max distance is relatively arbitrary) and
saves that distance into column ‘dist’. Then, data are grouped by
counties we want to merge to and we select the group-row with the
smallest value of dist (or what stringdist_inner_join
believes to be the best match).
In this example, stringdist_inner_join
does accurately
match the malformed data to our clean countyfipsR
data. If
you want to see the other potential matches, exclude the
slice_min
options to see the full range of matching
attempts.
joined |>
select(-c(county_name.y, state_name.y)) |>
rename(
county_name = county_name.x,
state_name = state_name.x
) |>
select(fips, county_name, state_fips, county_fips, county_ns, aland, awater)
#> # A tibble: 5 × 7
#> # Groups: county_name [5]
#> fips county_name state_fips county_fips county_ns aland awater
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 48045 Briscoe County 48 045 1383808 2330991073 4068657
#> 2 13145 Harris County 13 145 326700 41201240346 23711353
#> 3 38053 Mckenzie County 38 053 1035299 7148712746 260119355
#> 4 19147 Palo Alto County 19 147 465262 1460400553 14428727
#> 5 78030 St. Thomas Island 78 030 2378250 81108880 717920922
Et voila, cleaned and merged data.