11.1 RecordLinkage

When joining data sets, you ideally have a unique id, or key, to join on. Sometimes that key does not exist, and you have to join on names or other information. Unfortunately, names are often spelled differently in the different data sets. Fortunately, the package RecordLinkage can help you identify near matches and deduplicate names.

11.1.1 Example data

Code
library(ff) ## keep this
library(RecordLinkage)
d = data.frame(name = c('Alexander Ovechkin', 
                        'Alex Ovechkin', 
                        'Alikandre Ohvechken', 
                        'Sydney Crosby', 
                        'Sidnee Krawsby'), 
                dob = c('1985-09-17', 
                        '1985-09-17', 
                        '1985-09-17', 
                        '1987-08-07',
                        '1987-08-07'))
d
                 name        dob
1  Alexander Ovechkin 1985-09-17
2       Alex Ovechkin 1985-09-17
3 Alikandre Ohvechken 1985-09-17
4       Sydney Crosby 1987-08-07
5      Sidnee Krawsby 1987-08-07

11.1.2 Compare/Dedup

Phonetic match

Code
dd1 = compare.dedup(d, 
                    phonetic = 'name', 
                    strcmp   = FALSE)

dd1$pairs
   id1 id2 name dob is_match
1    1   2    0   1       NA
2    1   3    1   1       NA
3    1   4    0   0       NA
4    1   5    0   0       NA
5    2   3    0   1       NA
6    2   4    0   0       NA
7    2   5    0   0       NA
8    3   4    0   0       NA
9    3   5    0   0       NA
10   4   5    1   1       NA

String match

Code
dd2 = compare.dedup(d, 
                    phonetic = FALSE , 
                    strcmp   = 'name', 
                    blockfld = 'dob')
dd2$pairs
  id1 id2      name dob is_match
1   1   2 0.8829060   1       NA
2   1   3 0.8615984   1       NA
3   2   3 0.7497240   1       NA
4   4   5 0.8005495   1       NA

Because blockfld = 'dob' is used, if dob doesn’t match, the pair doesn’t even get a string match rating and the pair is removed from the list. So there are only 4 rows here.

Merge both results:

Code
dd  = merge(dd1$pairs, 
            dd2$pairs, 
            by   = c('id1', 'id2'), 
            all  = T, 
            sort = F, 
            suffixes = c('.pho', '.str')) ## phonetic and string
dd
   id1 id2 name.pho dob.pho is_match.pho  name.str dob.str is_match.str
1    1   2        0       1           NA 0.8829060       1           NA
2    1   3        1       1           NA 0.8615984       1           NA
3    2   3        0       1           NA 0.7497240       1           NA
4    4   5        1       1           NA 0.8005495       1           NA
5    3   5        0       0           NA        NA      NA           NA
6    2   4        0       0           NA        NA      NA           NA
7    1   4        0       0           NA        NA      NA           NA
8    1   5        0       0           NA        NA      NA           NA
9    2   5        0       0           NA        NA      NA           NA
10   3   4        0       0           NA        NA      NA           NA

11.1.3 Attach names to IDs

Since id1 and id2 aren’t super helpful, let’s add names to this data frame. And get rid of NAs and a couple columns we don’t need.

Code
dd$name1 = d$name[dd$id1]
dd$name2 = d$name[dd$id2]

dd = dd %>% 
  filter(!is.na(name.str), 
         !is.na(name.pho)) %>%
  select(-is_match.pho, 
         -is_match.str) %>%
  arrange(desc(name.str)) 

head(dd,20)
  id1 id2 name.pho dob.pho  name.str dob.str              name1
1   1   2        0       1 0.8829060       1 Alexander Ovechkin
2   1   3        1       1 0.8615984       1 Alexander Ovechkin
3   4   5        1       1 0.8005495       1      Sydney Crosby
4   2   3        0       1 0.7497240       1      Alex Ovechkin
                name2
1       Alex Ovechkin
2 Alikandre Ohvechken
3      Sidnee Krawsby
4 Alikandre Ohvechken