11.1 RecordLinkage
When joining data sets, you ideally have a unique id, or key, to join on. Sometimes that key does not exist, and you have to join on names or other information. Unfortunately, names are often spelled differently in the different data sets. Fortunately, the package RecordLinkage
can help you identify near matches and deduplicate names.
11.1.1 Example data
Code
name dob
1 Alexander Ovechkin 1985-09-17
2 Alex Ovechkin 1985-09-17
3 Alikandre Ohvechken 1985-09-17
4 Sydney Crosby 1987-08-07
5 Sidnee Krawsby 1987-08-07
11.1.2 Compare/Dedup
Phonetic match
id1 id2 name dob is_match
1 1 2 0 1 NA
2 1 3 1 1 NA
3 1 4 0 0 NA
4 1 5 0 0 NA
5 2 3 0 1 NA
6 2 4 0 0 NA
7 2 5 0 0 NA
8 3 4 0 0 NA
9 3 5 0 0 NA
10 4 5 1 1 NA
String match
id1 id2 name dob is_match
1 1 2 0.8829060 1 NA
2 1 3 0.8615984 1 NA
3 2 3 0.7497240 1 NA
4 4 5 0.8005495 1 NA
Because blockfld = 'dob'
is used, if dob
doesn’t match, the pair doesn’t even get a string match rating and the pair is removed from the list. So there are only 4 rows here.
Merge both results:
Code
id1 id2 name.pho dob.pho is_match.pho name.str dob.str is_match.str
1 1 2 0 1 NA 0.8829060 1 NA
2 1 3 1 1 NA 0.8615984 1 NA
3 2 3 0 1 NA 0.7497240 1 NA
4 4 5 1 1 NA 0.8005495 1 NA
5 3 5 0 0 NA NA NA NA
6 2 4 0 0 NA NA NA NA
7 1 4 0 0 NA NA NA NA
8 1 5 0 0 NA NA NA NA
9 2 5 0 0 NA NA NA NA
10 3 4 0 0 NA NA NA NA
11.1.3 Attach names to IDs
Since id1
and id2
aren’t super helpful, let’s add names to this data frame. And get rid of NAs and a couple columns we don’t need.
Code
id1 id2 name.pho dob.pho name.str dob.str name1
1 1 2 0 1 0.8829060 1 Alexander Ovechkin
2 1 3 1 1 0.8615984 1 Alexander Ovechkin
3 4 5 1 1 0.8005495 1 Sydney Crosby
4 2 3 0 1 0.7497240 1 Alex Ovechkin
name2
1 Alex Ovechkin
2 Alikandre Ohvechken
3 Sidnee Krawsby
4 Alikandre Ohvechken