Fuzzy Merging • countyfipsR

Load in a sample of five counties.

library(countyfipsR)
library(fuzzyjoin)
library(dplyr)

head(countyfips_sample)
#> # A tibble: 5 × 7
#>   fips  county_name       state_name  state_abb state_fips county_fips county_ns
#>   <chr> <chr>             <chr>       <chr>     <chr>      <chr>           <dbl>
#> 1 13145 Harris County     Georgia     GA        13         145            326700
#> 2 78030 St. Thomas Island U.S. Virgi… VI        78         030           2378250
#> 3 38053 Mckenzie County   North Dako… ND        38         053           1035299
#> 4 48045 Briscoe County    Texas       TX        48         045           1383808
#> 5 19147 Palo Alto County  Iowa        IA        19         147            465262

Perhaps one wants to merge in land and water areas by county; however, you don’t have any numerical identifies to do this with. All you have is an unclean text document with (bad) county names as the identifier. The ‘county_name’ variable is correct in some instances, missing parts of the name (county or a period), or entirely misspelled.


badmerge <- tribble(
  ~"county_name",        ~"state_name",         ~"aland",        ~"awater",
  "Harris",              "Georgia",             41201240346,     23711353,
  "St Thomas Island",   "U.S. Virgin Islands",  81108880,        717920922,
  "Mckenzie County",     "North Dakota",        7148712746,      260119355,
  "Briscoe County",      "Texas",               2330991073,      4068657,
  "Pal Alto County",     "Iowa",                1460400553,      14428727,
  "Mineral County",      "Montana",             3158799812,      9897218
)

The fuzzyjoin package is perfect for this situation.

The stringdist_left_join command does the work. As an aside, all the typical join methods are available by modifying the ‘inner’ portion of the command. You can also use a ‘mode’ option within the command to specify the join method.

The choice of method is well beyond this article; however, they are all the options available from the stringdist package. Documentation can be found here and more on the various methods (theory mostly) can be found here. method = "jw" implements Jaro-Winkler distance, which appears to be (to my untrained eye) the best character based matching method.


joined <- countyfips_sample |>
  stringdist_inner_join(
    badmerge, 
    by = "county_name",   #match based on county_name
    method = "jw",        #use jw distance metric
    max_dist = 0.5, 
    distance_col = "dist"
  ) |>
  group_by(county_name.x) |>
  slice_min(order_by=dist, n=1)

joined |> select(fips, county_name.x, county_name.y, dist)
#> # A tibble: 5 × 4
#> # Groups:   county_name.x [5]
#>   fips  county_name.x     county_name.y      dist
#>   <chr> <chr>             <chr>             <dbl>
#> 1 48045 Briscoe County    Briscoe County   0     
#> 2 13145 Harris County     Harris           0.179 
#> 3 38053 Mckenzie County   Mckenzie County  0     
#> 4 19147 Palo Alto County  Pal Alto County  0.110 
#> 5 78030 St. Thomas Island St Thomas Island 0.0196

Here, stringdist_inner_join is attempting to match the ‘county_name’ string in ‘countyfips_sample’ with the same variable in the ‘badmerge’ data. It’s doing so with the Jaro-Winkler distance method with a max distance of 0.5 (JW = 0 for no matched string and JW = 1 for exact matches; the choice of max distance is relatively arbitrary) and saves that distance into column ‘dist’. Then, data are grouped by counties we want to merge to and we select the group-row with the smallest value of dist (or what stringdist_inner_join believes to be the best match).

In this example, stringdist_inner_join does accurately match the malformed data to our clean countyfipsR data. If you want to see the other potential matches, exclude the slice_min options to see the full range of matching attempts.

joined |> 
  select(-c(county_name.y, state_name.y)) |> 
  rename(
    county_name = county_name.x, 
    state_name = state_name.x
    ) |>
  select(fips, county_name, state_fips, county_fips, county_ns, aland, awater)
#> # A tibble: 5 × 7
#> # Groups:   county_name [5]
#>   fips  county_name       state_fips county_fips county_ns       aland    awater
#>   <chr> <chr>             <chr>      <chr>           <dbl>       <dbl>     <dbl>
#> 1 48045 Briscoe County    48         045           1383808  2330991073   4068657
#> 2 13145 Harris County     13         145            326700 41201240346  23711353
#> 3 38053 Mckenzie County   38         053           1035299  7148712746 260119355
#> 4 19147 Palo Alto County  19         147            465262  1460400553  14428727
#> 5 78030 St. Thomas Island 78         030           2378250    81108880 717920922

Et voila, cleaned and merged data.