Unstandardized Accounting Terminology
  • Home
  • Terms & Concepts
  • Database Matching
  • Name Matching Package

On this page

  • rMatching
    • Installation
    • Build-In Tables
      • table_source
      • table_target
      • table_matches
  • Name Matching Pipeline
    • Step 1: Prepare Tables
    • Step 2: Match Data
  • Deduplicating Matches
  • Matching Comparison

Name Matching Package

rMatching

Installation

You can install the development version of rMatching from GitHub with:

devtools::install_github("TerminologyStandardization/rMatching")

Build-In Tables

Below The basic steps how to match two data sets by company names.

The package contains three build-in datasets:

  • table_source: The source table with company names

  • table_target: The target table with company names

  • table_matches: A table in which source and target table are already matched

table_source

id name iso3 city address size
291C5CB8 ASM INTERNATIONAL NV NLD ALMERE VERSTERKERSTRAAT 8 large
097A6454 TELEFONAKTIEBOLAGET LM ERICS SWE STOCKHOLM TORSHAMNSGATAN 21, KISTA small
0CA8A1F4 NOVO NORDISK A/S DNK BAGSVAERD NOVO ALLE 1 small

table_target

id name iso3 city address size
40D62BF9 VOLKSWAGEN AG DEU WOLFSBURG BRIEFFACH 1849 middle
18162F6F DAIMLER AG DEU STUTTGART MERCEDESSTRASSE 120 small
47F0DB5C BAYERISCHE MOTOREN WERKE AG DEU MUENCHEN PETUELRING 130 middle

table_matches

id_s id_t name_s name_t iso3_s iso3_t city_s city_t address_s address_t match
291C5CB8 1147DBEB ASM INTERNATIONAL NV ASM INTERNATIONAL NV NLD NLD ALMERE ALMERE VERSTERKERSTRAAT 8 VERSTERKERSTRAAT 8 1
0CA8A1F4 BACB9C1F NOVO NORDISK A/S NOVO NORDISK A/S DNK DNK BAGSVAERD BAGSVAERD NOVO ALLE 1 NOVO ALLE 1
80DC386E C201D476 KONINKLIJKE PHILIPS NV KONINKLIJKE PHILIPS N.V. NLD NLD AMSTERDAM AMSTERDAM PHILIPS CENTER, AMSTELPLEIN 2 AMSTELPLEIN 2 1

Name Matching Pipeline

Step 1: Prepare Tables

Tables have to be prepared with the function: prep_tables()

Inputs:

  1. .source: The Source Table - This is the table containing the names you want to match. These names will be compared to those in the target table to identify potential matches.

  2. .target: The Target Table - This table contains the names to which the source table names will be matched. The function aims to find corresponding names in the target table for each name in the source table.

  3. .fstd: Standardization Function - This is a user-defined or built-in function used to standardize the names in both source and target tables. By default, the function uses the built-in standardize_str function for this purpose.

  4. .dir: Data Storage Directory - This parameter specifies the directory where the processed data will be stored, allowing for easy access and review of the intermediate data files.

  5. .return: Return Tables as List - This is a boolean flag that determines whether the processed source and target tables should be returned as a list. If set to TRUE, the function will return the tables; otherwise, it will only store them in the specified directory.

  6. .verbose: Verbose Output - This boolean flag controls the display of additional information during the function’s execution. If set to TRUE, the function will print extra details, which can be useful for troubleshooting and understanding the matching process.

The relevant tables are stored in the directory (.dir)

type file
Source Table sdata.fst
Source Table sorig.fst
Groups tdata.fst
Groups torig.fst

Step 2: Match Data

The Matching can be now performed with the function match_data()

Inputs:

  1. .dir Data Storage Directory - This parameter specifies the directory in which the processed data will be stored. It enables easy access and review of the intermediate data files created during the matching process.

  2. .cols Named Vector for Columns - This parameter is a named vector containing the columns to be considered for matching. The names in the vector should be either “e” or “exact” for an exact comparison of the strings, or “f” or “fuzzy” for a fuzzy matching approach as specified by the .method argument. Names in the vector can be either quoted or unquoted, while the values must be quoted.

  3. .range Character Range - This parameter defines the range of characters to be considered when matching names. For example, if a name in the source table has 10 characters and .range is set to 5, any name within the 5-15 character range in the target table will be considered for matching.

  4. .weights Column Weights - This named vector assigns weights to each column used for matching. If not specified, all columns will be assigned equal weights by default.

  5. .max_match Maximum Matches - This parameter determines the maximum number of matches to be returned for each record in the source table.

  6. .allow_mult Multiple Match Permission - This boolean parameter indicates whether multiple matches are allowed for each record in the target table. If you want a 1-1 match, set it to FALSE; for a 1-n match, set it to TRUE.

  7. .method Matching Method - This parameter specifies the method to be used for matching records. It can be one of the following: “osa”, “lv”, “dl”, “hamming”, “lcs”, “qgram”, “cosine”, “jaccard”, “jw”, or “soundex”. For more information on these methods, refer to the stringdist-metrics documentation in the {stringdist} package.

  8. .workers Parallel Workers - This parameter defines the number of workers to be used for parallelization during the matching process.

  9. .mat_size Maximum Matrix Size - This parameter sets the maximum size of the similarity matrix created during the fuzzy matching process. If you are dealing with a large number of names to match, the matrix can become very large. Adjust this value to avoid overloading your system’s memory.

  10. .verbose Verbose Output - This boolean flag determines whether additional information is displayed during the execution of the function. If set to TRUE, the function will output extra details, which can be helpful for troubleshooting and gaining insights into the matching process.

The output looks the following:

id_s id_t score rank_old rank_new name_s name_t iso3_s iso3_t city_s city_t address_s address_t size_s size_t
000F8750 E48EB751 1.0000000 1 1 NTR HOLDING A/S NTR HOLDING A/S DNK DNK COPENHAGEN COPENHAGEN BREDGADE 30 SANKT ANNAE PLADS 13 3 middle middle
000F8750 1749518D 0.6686869 2 2 NTR HOLDING A/S NEWCAP HOLDING A/S DNK DNK COPENHAGEN KOBENHAVN K BREDGADE 30 BREDGADE 30 middle middle
000F8750 6A2ED2E8 0.5789474 4 3 NTR HOLDING A/S LOYAL SOLUTIONS A/S DNK DNK COPENHAGEN COPENHAGEN BREDGADE 30 ROBERT JACOBSENS VEJ 68 middle middle
002FCAB5 FF136D09 1.0000000 1 1 VIROGATES A/S VIROGATES A/S DNK DNK BIRKEROD BIRKEROD BANEVAENGET 13 BLOKKEN 45 large large
002FCAB5 45C4AE7D 0.3942308 3 2 VIROGATES A/S PHOTOCAT A/S DNK DNK BIRKEROD ROSKILDE BANEVAENGET 13 LANGEBJERG 4 large large
0051857E 9E944051 0.7600000 1 1 GAUMONT SA GAUMONT FRA FRA NEUILLY-SUR-SEINE NEUILLY-SUR-SEINE 30, AVENUE CHARLES DE GAULLE 30, AVENUE CHARLES DE GAULLE middle middle

The output contains the following relevant columns: id_s: The ID of the source table id_t: The ID of the target table score: similarity score rank_old: Unadjusted Rank rank_new: Adjusted Rank (only relevant if .allow_mult = FALSE, than the rank will consider the best match of a name considering all the other matches)

Note _t stands for the column in the target table and _s for the columns in the source table

The matching is now already stored, so if you just want to change for example the weights, the calculation will be almost instant.

But if you want to change the columns, the function has to cache another matching

Deduplicating Matches

The Outputs of the matching tables is still not unique. We can easily deduplicate them by setting rank_new ==

we can quickly check that we have unique 1-1 matches with the function filter_dups() which return s a dataframe with all duplicates

[1] 0
[1] 0
[1] 0

Matching Comparison

Let us now see how good the matching library performs. We calculated 3 different matches which we stored in match1, match2, and match3.

First we load the pre-stored matches and combine all the matches to a single table

 

© 2025 | Supplementary materials for JAE submission