Name Matching Package

rMatching

Installation

You can install the development version of rMatching from GitHub with:

devtools::install_github("TerminologyStandardization/rMatching")

Build-In Tables

Below The basic steps how to match two data sets by company names.

The package contains three build-in datasets:

table_source: The source table with company names
table_target: The target table with company names
table_matches: A table in which source and target table are already matched

table_source

id	name	iso3	city	address	size
291C5CB8	ASM INTERNATIONAL NV	NLD	ALMERE	VERSTERKERSTRAAT 8	large
097A6454	TELEFONAKTIEBOLAGET LM ERICS	SWE	STOCKHOLM	TORSHAMNSGATAN 21, KISTA	small
0CA8A1F4	NOVO NORDISK A/S	DNK	BAGSVAERD	NOVO ALLE 1	small

table_target

id	name	iso3	city	address	size
40D62BF9	VOLKSWAGEN AG	DEU	WOLFSBURG	BRIEFFACH 1849	middle
18162F6F	DAIMLER AG	DEU	STUTTGART	MERCEDESSTRASSE 120	small
47F0DB5C	BAYERISCHE MOTOREN WERKE AG	DEU	MUENCHEN	PETUELRING 130	middle

table_matches

id_s	id_t	name_s	name_t	iso3_s	iso3_t	city_s	city_t	address_s	address_t	match
291C5CB8	1147DBEB	ASM INTERNATIONAL NV	ASM INTERNATIONAL NV	NLD	NLD	ALMERE	ALMERE	VERSTERKERSTRAAT 8	VERSTERKERSTRAAT 8	1
0CA8A1F4	BACB9C1F	NOVO NORDISK A/S	NOVO NORDISK A/S	DNK	DNK	BAGSVAERD	BAGSVAERD	NOVO ALLE 1	NOVO ALLE	1
80DC386E	C201D476	KONINKLIJKE PHILIPS NV	KONINKLIJKE PHILIPS N.V.	NLD	NLD	AMSTERDAM	AMSTERDAM	PHILIPS CENTER, AMSTELPLEIN 2	AMSTELPLEIN 2	1

Name Matching Pipeline

Step 1: Prepare Tables

Tables have to be prepared with the function: prep_tables()

Inputs:

.source: The Source Table - This is the table containing the names you want to match. These names will be compared to those in the target table to identify potential matches.
.target: The Target Table - This table contains the names to which the source table names will be matched. The function aims to find corresponding names in the target table for each name in the source table.
.fstd: Standardization Function - This is a user-defined or built-in function used to standardize the names in both source and target tables. By default, the function uses the built-in standardize_str function for this purpose.
.dir: Data Storage Directory - This parameter specifies the directory where the processed data will be stored, allowing for easy access and review of the intermediate data files.
.return: Return Tables as List - This is a boolean flag that determines whether the processed source and target tables should be returned as a list. If set to TRUE, the function will return the tables; otherwise, it will only store them in the specified directory.
.verbose: Verbose Output - This boolean flag controls the display of additional information during the function’s execution. If set to TRUE, the function will print extra details, which can be useful for troubleshooting and understanding the matching process.

The relevant tables are stored in the directory (.dir)

type	file
Source Table	sdata.fst
Source Table	sorig.fst
Groups	tdata.fst
Groups	torig.fst

Step 2: Match Data

The Matching can be now performed with the function match_data()

Inputs:

.dir Data Storage Directory - This parameter specifies the directory in which the processed data will be stored. It enables easy access and review of the intermediate data files created during the matching process.
.cols Named Vector for Columns - This parameter is a named vector containing the columns to be considered for matching. The names in the vector should be either “e” or “exact” for an exact comparison of the strings, or “f” or “fuzzy” for a fuzzy matching approach as specified by the .method argument. Names in the vector can be either quoted or unquoted, while the values must be quoted.
.range Character Range - This parameter defines the range of characters to be considered when matching names. For example, if a name in the source table has 10 characters and .range is set to 5, any name within the 5-15 character range in the target table will be considered for matching.
.weights Column Weights - This named vector assigns weights to each column used for matching. If not specified, all columns will be assigned equal weights by default.
.max_match Maximum Matches - This parameter determines the maximum number of matches to be returned for each record in the source table.
.allow_mult Multiple Match Permission - This boolean parameter indicates whether multiple matches are allowed for each record in the target table. If you want a 1-1 match, set it to FALSE; for a 1-n match, set it to TRUE.
.method Matching Method - This parameter specifies the method to be used for matching records. It can be one of the following: “osa”, “lv”, “dl”, “hamming”, “lcs”, “qgram”, “cosine”, “jaccard”, “jw”, or “soundex”. For more information on these methods, refer to the stringdist-metrics documentation in the {stringdist} package.
.workers Parallel Workers - This parameter defines the number of workers to be used for parallelization during the matching process.
.mat_size Maximum Matrix Size - This parameter sets the maximum size of the similarity matrix created during the fuzzy matching process. If you are dealing with a large number of names to match, the matrix can become very large. Adjust this value to avoid overloading your system’s memory.
.verbose Verbose Output - This boolean flag determines whether additional information is displayed during the execution of the function. If set to TRUE, the function will output extra details, which can be helpful for troubleshooting and gaining insights into the matching process.

The output looks the following:

id_s	id_t	score	rank_old	rank_new	name_s	name_t	iso3_s	iso3_t	city_s	city_t	address_s	address_t	size_s	size_t
000F8750	E48EB751	1.0000000	1	1	NTR HOLDING A/S	NTR HOLDING A/S	DNK	DNK	COPENHAGEN	COPENHAGEN	BREDGADE 30	SANKT ANNAE PLADS 13 3	middle	middle
000F8750	1749518D	0.6686869	2	2	NTR HOLDING A/S	NEWCAP HOLDING A/S	DNK	DNK	COPENHAGEN	KOBENHAVN K	BREDGADE 30	BREDGADE 30	middle	middle
000F8750	6A2ED2E8	0.5789474	4	3	NTR HOLDING A/S	LOYAL SOLUTIONS A/S	DNK	DNK	COPENHAGEN	COPENHAGEN	BREDGADE 30	ROBERT JACOBSENS VEJ 68	middle	middle
002FCAB5	FF136D09	1.0000000	1	1	VIROGATES A/S	VIROGATES A/S	DNK	DNK	BIRKEROD	BIRKEROD	BANEVAENGET 13	BLOKKEN 45	large	large
002FCAB5	45C4AE7D	0.3942308	3	2	VIROGATES A/S	PHOTOCAT A/S	DNK	DNK	BIRKEROD	ROSKILDE	BANEVAENGET 13	LANGEBJERG 4	large	large
0051857E	9E944051	0.7600000	1	1	GAUMONT SA	GAUMONT	FRA	FRA	NEUILLY-SUR-SEINE	NEUILLY-SUR-SEINE	30, AVENUE CHARLES DE GAULLE	30, AVENUE CHARLES DE GAULLE	middle	middle

The output contains the following relevant columns: id_s: The ID of the source table id_t: The ID of the target table score: similarity score rank_old: Unadjusted Rank rank_new: Adjusted Rank (only relevant if .allow_mult = FALSE, than the rank will consider the best match of a name considering all the other matches)

Note _t stands for the column in the target table and _s for the columns in the source table

The matching is now already stored, so if you just want to change for example the weights, the calculation will be almost instant.

But if you want to change the columns, the function has to cache another matching

Deduplicating Matches

The Outputs of the matching tables is still not unique. We can easily deduplicate them by setting rank_new ==

we can quickly check that we have unique 1-1 matches with the function filter_dups() which return s a dataframe with all duplicates

[1] 0

[1] 0

[1] 0

Matching Comparison

Let us now see how good the matching library performs. We calculated 3 different matches which we stored in match1, match2, and match3.

First we load the pre-stored matches and combine all the matches to a single table