Designing an Entity Matching and Data Standardization System

Architecture patterns for building systems that reconcile messy real-world names with canonical records

Engineering Team

Overview

Real-world data is messy. Organizations routinely receive information containing entity names that don't match their canonical records. A company name might appear as "IBM", "I.B.M.", "International Business Machines", or "Int'l Business Machines Corp" across different sources. Educational institutions, hospitals, government agencies, and countless other entities suffer from the same name variation problem.

This article explores how to design an entity matching system, with particular focus on the string matching algorithms that power fuzzy search capabilities.

The Problem

Entity matching appears straightforward until you encounter real-world variations:

Variation Type | Example
Abbreviation | "International Business Machines" → "IBM"
Punctuation | "I.B.M." vs "IBM"
Word reordering | "University of California" vs "California University"
Typos | "Machiens" vs "Machines"
OCR errors | "1BM" vs "IBM" (digit/letter confusion)
Partial names | "Stanford" vs "Stanford University"

A robust system must handle all these while minimizing false positives and false negatives.

String Matching Algorithms

The core of any entity matching system is its string similarity algorithms. Different algorithms excel at different types of variations.

Levenshtein Distance (Edit Distance)

The Levenshtein distance counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another.

For strings a and b with lengths |a| and |b|, define d(i, j) as the distance between the first i characters of a and the first j characters of b:

d(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0 \\ \min \begin{cases} d(i-1, j) + 1 & \text{(deletion)} \\ d(i, j-1) + 1 & \text{(insertion)} \\ d(i-1, j-1) + \mathbf{1}_{a_i \neq b_j} & \text{(substitution)} \end{cases} & \text{otherwise} \end{cases}

Example: "kitten" → "sitting"

        s   i   t   t   i   n   g
    0   1   2   3   4   5   6   7
k   1   1   2   3   4   5   6   7
i   2   2   1   2   3   4   5   6
t   3   3   2   1   2   3   4   5
t   4   4   3   2   1   2   3   4
e   5   5   4   3   2   2   3   4
n   6   6   5   4   3   3   2   3

Distance = 3 (substitute k→s, substitute e→i, insert g)

Normalized similarity:

\text{sim}(a, b) = 1 - \frac{d(a, b)}{\max(|a|, |b|)}
def levenshtein_distance(s1, s2):
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            dp[i][j] = min(
                dp[i-1][j] + 1,      # deletion
                dp[i][j-1] + 1,      # insertion
                dp[i-1][j-1] + cost  # substitution
            )

    return dp[m][n]
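
As a quick check of the function and the normalization formula above (the levenshtein_similarity wrapper is an illustrative helper, not part of any library):

def levenshtein_similarity(s1, s2):
    # Convert the edit distance into a 0-1 similarity score
    longest = max(len(s1), len(s2))
    return (1 - levenshtein_distance(s1, s2) / longest) if longest > 0 else 1.0

print(levenshtein_distance("kitten", "sitting"))              # 3
print(round(levenshtein_similarity("kitten", "sitting"), 3))  # 0.571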

Best for: Typos, small character-level errors.

Jaro-Winkler Similarity

Jaro-Winkler is designed for short strings like names. It considers character matches and transpositions.

Jaro similarity for strings s₁ and s₂:

\text{jaro}(s_1, s_2) = \begin{cases} 0 & \text{if } m = 0 \\ \frac{1}{3} \left( \frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} \right) & \text{otherwise} \end{cases}

where:

  • m = number of matching characters (characters count as matching only within a window of ⌊max(|s₁|, |s₂|)/2⌋ − 1 positions)
  • t = half the number of matching characters that appear in a different order (transpositions)

Jaro-Winkler boosts similarity for strings with common prefixes:

\text{jaro\_winkler}(s_1, s_2) = \text{jaro}(s_1, s_2) + \ell \cdot p \cdot (1 - \text{jaro}(s_1, s_2))

where:

  • ℓ = length of common prefix (up to 4 characters)
  • p = scaling factor (typically 0.1)

Example: "MARTHA" vs "MARHTA"

  • Matches: M, A, R, T, H, A (m = 6)
  • Transpositions: T/H are transposed (t = 1)
  • Jaro = (1/3)(6/6 + 6/6 + (6 − 1)/6) = 0.944
  • Common prefix "MAR" (ℓ = 3)
  • Jaro-Winkler = 0.944 + 3 × 0.1 × (1 − 0.944) = 0.961
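
A minimal Python sketch of both steps, following the definitions above (the matching window, transposition count, and prefix cap are implemented as described; not optimized for production use):

def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match if they are equal and within the search window
    window = max(max(len1, len2) // 2 - 1, 0)
    s1_matches = [False] * len1
    s2_matches = [False] * len2
    matches = 0

    for i in range(len1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not s2_matches[j] and s1[i] == s2[j]:
                s1_matches[i] = s2_matches[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order, halved
    transpositions, j = 0, 0
    for i in range(len1):
        if s1_matches[i]:
            while not s2_matches[j]:
                j += 1
            if s1[i] != s2[j]:
                transpositions += 1
            j += 1
    transpositions //= 2

    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    score = jaro(s1, s2)
    # Boost for a shared prefix, capped at 4 characters
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == 4:
            break
        prefix += 1
    return score + prefix * p * (1 - score)

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961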

Best for: Personal names, short entity names.

N-gram Similarity (Jaccard)

N-grams split strings into overlapping substrings of length n. Comparison uses set operations.

For string "hello" with n=2 (bigrams):

\text{ngrams}(\text{"hello"}, 2) = \{\text{he}, \text{el}, \text{ll}, \text{lo}\}

Jaccard similarity between two n-gram sets:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}

Example: "night" vs "nacht"

String | Bigrams
night | {ni, ig, gh, ht}
nacht | {na, ac, ch, ht}
  • Intersection: {ht}
  • Union: {ni, ig, gh, ht, na, ac, ch}
  • Jaccard = 1/7 = 0.143

Dice coefficient (alternative, often better for short strings):

\text{Dice}(A, B) = \frac{2|A \cap B|}{|A| + |B|}
def ngram_similarity(s1, s2, n=2):
    def get_ngrams(s, n):
        return set(s[i:i+n] for i in range(len(s) - n + 1))

    ng1 = get_ngrams(s1.lower(), n)
    ng2 = get_ngrams(s2.lower(), n)

    intersection = len(ng1 & ng2)
    union = len(ng1 | ng2)

    return intersection / union if union > 0 else 0
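
The Dice variant changes only the final ratio; a sketch mirroring ngram_similarity above (dice_similarity is an illustrative name):

def dice_similarity(s1, s2, n=2):
    def get_ngrams(s, n):
        return set(s[i:i+n] for i in range(len(s) - n + 1))

    ng1 = get_ngrams(s1.lower(), n)
    ng2 = get_ngrams(s2.lower(), n)

    total = len(ng1) + len(ng2)
    # Dice = 2 * |intersection| / (|A| + |B|)
    return 2 * len(ng1 & ng2) / total if total > 0 else 0

print(round(dice_similarity("night", "nacht"), 2))  # 0.25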

Best for: Longer strings, partial matches, word reordering.

Phonetic Algorithms (Soundex)

Phonetic algorithms encode strings by pronunciation, matching words that sound alike despite different spellings.

Soundex Encoding Table:

Code | Letters
1 | B, F, P, V
2 | C, G, J, K, Q, S, X, Z
3 | D, T
4 | L
5 | M, N
6 | R
(drop) | A, E, I, O, U, H, W, Y

Algorithm: Keep first letter → encode consonants → remove vowels → remove consecutive duplicates → pad/truncate to 4 chars.

Example:

Name | Process | Soundex
Robert | R-O-B-E-R-T → R-1-6-3 | R163
Rupert | R-U-P-E-R-T → R-1-6-3 | R163
Smith | S-M-I-T-H → S-5-3 | S530
Smyth | S-M-Y-T-H → S-5-3 | S530
def soundex(name):
    if not name:
        return ""

    name = name.upper()

    mapping = {
        'B': '1', 'F': '1', 'P': '1', 'V': '1',
        'C': '2', 'G': '2', 'J': '2', 'K': '2',
        'Q': '2', 'S': '2', 'X': '2', 'Z': '2',
        'D': '3', 'T': '3',
        'L': '4',
        'M': '5', 'N': '5',
        'R': '6'
    }

    # Keep the first letter; track its code so an adjacent duplicate is not repeated
    code = name[0]
    prev = mapping.get(name[0], '')

    for char in name[1:]:
        if char in ('H', 'W'):
            # H and W are dropped and do not separate duplicate codes
            continue
        digit = mapping.get(char, '')  # vowels and Y map to '' and reset the duplicate check
        if digit and digit != prev:
            code += digit
        prev = digit

    return (code + '000')[:4]
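
Applied to the names in the table above:

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Smith"), soundex("Smyth"))    # S530 S530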

Best for: Names with spelling variations, phonetic typos.

Token-Based Matching

For multi-word entity names, tokenize first and compare token sets.

Token Jaccard:

\text{TokenJaccard}(s_1, s_2) = J(\text{tokens}(s_1), \text{tokens}(s_2))

Example: "University of California, Berkeley" vs "Berkeley, University of California"

  • Tokens: {university, of, california, berkeley} = identical sets
  • Similarity = 1.0

TF-IDF weighted matching assigns higher weight to distinctive tokens:

\text{tfidf}(t, d, D) = \text{tf}(t, d) \times \log\frac{|D|}{|\{d \in D : t \in d\}|}

Common words like "University" or "Corporation" get lower weights; distinctive words drive the match.

import re

def normalize(s):
    # Simple normalizer (an assumed helper): lowercase, strip punctuation, collapse whitespace
    return " ".join(re.sub(r"[^\w\s]", " ", s.lower()).split())

def token_similarity(s1, s2):
    tokens1 = set(normalize(s1).split())
    tokens2 = set(normalize(s2).split())

    intersection = len(tokens1 & tokens2)
    union = len(tokens1 | tokens2)

    return intersection / union if union > 0 else 0
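
One way to apply the TF-IDF weighting described above is a cosine-style score over IDF-weighted tokens. A sketch, reusing the normalize helper from above; doc_freq and n_docs are assumed to be precomputed statistics over the reference names:

import math

def tfidf_similarity(s1, s2, doc_freq, n_docs):
    # doc_freq: token -> number of reference names containing it (assumed precomputed)
    def weights(tokens):
        # +1 avoids division by zero for tokens never seen in the reference set
        return {t: math.log(n_docs / (1 + doc_freq.get(t, 0))) for t in tokens}

    t1 = set(normalize(s1).split())
    t2 = set(normalize(s2).split())
    w1, w2 = weights(t1), weights(t2)

    shared = t1 & t2
    numerator = sum(w1[t] * w2[t] for t in shared)
    norm1 = math.sqrt(sum(w * w for w in w1.values()))
    norm2 = math.sqrt(sum(w * w for w in w2.values()))
    return numerator / (norm1 * norm2) if norm1 and norm2 else 0.0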

Best for: Long names, word reordering, missing/extra words.

System Architecture

The matching system applies algorithms in layers, from fastest/strictest to slowest/fuzziest:

def match_entity(query, reference_db):
    query = normalize(query)

    # Layer 1: Exact match (O(1) with hash index)
    if match := reference_db.exact_match(query):
        return Result(match, confidence=1.0)

    if match := reference_db.alias_match(query):
        return Result(match, confidence=0.95)

    # Layer 2: Fuzzy matching
    candidates = []

    # Phonetic matching (fast, good for typos)
    phonetic_code = soundex(query)
    for match in reference_db.by_soundex(phonetic_code):
        score = jaro_winkler(query, match.name)
        candidates.append((match, score, "phonetic"))

    # Token matching (good for reordering)
    query_tokens = set(query.split())
    for match in reference_db.by_tokens(query_tokens):
        score = token_similarity(query, match.name)
        candidates.append((match, score, "token"))

    # Edit distance (expensive, last resort)
    for match in reference_db.top_by_ngram(query, limit=100):
        score = 1 - levenshtein_distance(query, match.name) / max(len(query), len(match.name))
        if score > 0.6:
            candidates.append((match, score, "edit"))

    # Rank and return
    candidates.sort(key=lambda x: -x[1])
    if candidates and candidates[0][1] > 0.85:
        return Result(candidates[0][0], candidates[0][1])

    return AmbiguousResult(candidates[:5]) if candidates else NoMatch()

Choosing the Right Algorithm

Variation Type | Best Algorithm | Reason
Typos (1-2 chars) | Levenshtein | Directly measures edit operations
Phonetic errors | Soundex / Metaphone | Matches by pronunciation
Name transposition | Jaro-Winkler | Handles transposed characters
Word reordering | Token Jaccard | Order-independent
Abbreviations | Token + Alias table | "Univ" → "University" via lookup
Mixed variations | Ensemble scoring | Combine multiple algorithms
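
For the ensemble case in the last row, a rough sketch that blends the measures defined earlier; the 0.4/0.3/0.3 weights are arbitrary placeholders rather than tuned values:

def ensemble_score(query, candidate):
    # Weighted blend of the similarity functions defined above
    scores = {
        "jaro_winkler": jaro_winkler(query, candidate),
        "token": token_similarity(query, candidate),
        "ngram": ngram_similarity(query, candidate),
    }
    weights = {"jaro_winkler": 0.4, "token": 0.3, "ngram": 0.3}
    return sum(weights[k] * scores[k] for k in scores)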

The Value

A well-designed entity matching system transforms data quality from an ongoing burden into a solved problem. The combination of exact matching for speed, phonetic matching for typos, and token matching for reordering handles the vast majority of real-world variations automatically.

Alias accumulation creates a virtuous cycle: each resolved variation can be stored as an alias and becomes an exact match for future queries. The system learns the particular variations common in your data sources, becoming more accurate over time.