Designing an Entity Matching and Data Standardization System

Architecture patterns for building systems that reconcile messy real-world names with canonical records

Engineering Team

Overview

Real-world data is messy. Organizations routinely receive information containing entity names that don't match their canonical records. A company name might appear as "IBM", "I.B.M.", "International Business Machines", or "Int'l Business Machines Corp" across different sources. Educational institutions, hospitals, government agencies, and countless other entities suffer from the same name variation problem.

This article explores how to design an entity matching system, with particular focus on the string matching algorithms that power fuzzy search capabilities.

The Problem

Entity matching appears straightforward until you encounter real-world variations:

Variation Type | Example
Abbreviation | "International Business Machines" → "IBM"
Punctuation | "I.B.M." vs "IBM"
Word reordering | "University of California" vs "California University"
Typos | "Machiens" vs "Machines"
OCR errors | "1BM" vs "IBM" (digit/letter confusion)
Partial names | "Stanford" vs "Stanford University"

A robust system must handle all these while minimizing false positives and false negatives.

String Matching Algorithms

The core of any entity matching system is its string similarity algorithms. Different algorithms excel at different types of variations.

Levenshtein Distance (Edit Distance)

The Levenshtein distance counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another.

For strings a and b with lengths |a| and |b|, define d(i, j) as the distance between the first i characters of a and the first j characters of b:

d(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0 \\ \min \begin{cases} d(i-1, j) + 1 & \text{(deletion)} \\ d(i, j-1) + 1 & \text{(insertion)} \\ d(i-1, j-1) + \mathbf{1}_{a_i \neq b_j} & \text{(substitution)} \end{cases} & \text{otherwise} \end{cases}

Example: "kitten" → "sitting"

        s   i   t   t   i   n   g
    0   1   2   3   4   5   6   7
k   1   1   2   3   4   5   6   7
i   2   2   1   2   3   4   5   6
t   3   3   2   1   2   3   4   5
t   4   4   3   2   1   2   3   4
e   5   5   4   3   2   2   3   4
n   6   6   5   4   3   3   2   3

Distance = 3 (substitute k→s, substitute e→i, insert g)

Normalized similarity:

\text{sim}(a, b) = 1 - \frac{d(a, b)}{\max(|a|, |b|)}
def levenshtein_distance(s1, s2):
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            dp[i][j] = min(
                dp[i-1][j] + 1,      # deletion
                dp[i][j-1] + 1,      # insertion
                dp[i-1][j-1] + cost  # substitution
            )

    return dp[m][n]
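
As a quick check of the function and the normalization formula above (the levenshtein_similarity wrapper is an illustrative helper, not part of any library):

def levenshtein_similarity(s1, s2):
    # Convert the edit distance into a 0-1 similarity score
    longest = max(len(s1), len(s2))
    return (1 - levenshtein_distance(s1, s2) / longest) if longest > 0 else 1.0

print(levenshtein_distance("kitten", "sitting"))              # 3
print(round(levenshtein_similarity("kitten", "sitting"), 3))  # 0.571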

Best for: Typos, small character-level errors.

Jaro-Winkler Similarity

Jaro-Winkler is designed for short strings like names. It considers character matches and transpositions.

Jaro similarity for strings s₁ and s₂:

\text{jaro}(s_1, s_2) = \begin{cases} 0 & \text{if } m = 0 \\ \frac{1}{3} \left( \frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} \right) & \text{otherwise} \end{cases}

where:

  • m = number of matching characters (characters count as matching only within a window of ⌊max(|s₁|, |s₂|)/2⌋ − 1 positions)
  • t = half the number of matching characters that appear in a different order (transpositions)

Jaro-Winkler boosts similarity for strings with common prefixes:

\text{jaro\_winkler}(s_1, s_2) = \text{jaro}(s_1, s_2) + \ell \cdot p \cdot (1 - \text{jaro}(s_1, s_2))

where:

  • ℓ = length of common prefix (up to 4 characters)
  • p = scaling factor (typically 0.1)

Example: "MARTHA" vs "MARHTA"

  • Matches: M, A, R, T, H, A (m = 6)
  • Transpositions: T/H are transposed (t = 1)
  • Jaro = (1/3)(6/6 + 6/6 + (6 − 1)/6) = 0.944
  • Common prefix "MAR" (ℓ = 3)
  • Jaro-Winkler = 0.944 + 3 × 0.1 × (1 − 0.944) = 0.961
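
A minimal Python sketch of both steps, following the definitions above (the matching window, transposition count, and prefix cap are implemented as described; not optimized for production use):

def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match if they are equal and within the search window
    window = max(max(len1, len2) // 2 - 1, 0)
    s1_matches = [False] * len1
    s2_matches = [False] * len2
    matches = 0

    for i in range(len1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not s2_matches[j] and s1[i] == s2[j]:
                s1_matches[i] = s2_matches[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order, halved
    transpositions, j = 0, 0
    for i in range(len1):
        if s1_matches[i]:
            while not s2_matches[j]:
                j += 1
            if s1[i] != s2[j]:
                transpositions += 1
            j += 1
    transpositions //= 2

    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    score = jaro(s1, s2)
    # Boost for a shared prefix, capped at 4 characters
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == 4:
            break
        prefix += 1
    return score + prefix * p * (1 - score)

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961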

Best for: Personal names, short entity names.

N-gram Similarity (Jaccard)

N-grams split strings into overlapping substrings of length n. Comparison uses set operations.

For string "hello" with n=2 (bigrams):

\text{ngrams}(\text{"hello"}, 2) = \{\text{he}, \text{el}, \text{ll}, \text{lo}\}

Jaccard similarity between two n-gram sets:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}

Example: "night" vs "nacht"

String | Bigrams
night | {ni, ig, gh, ht}
nacht | {na, ac, ch, ht}
  • Intersection: {ht}
  • Union: {ni, ig, gh, ht, na, ac, ch}
  • Jaccard = 1/7 = 0.143

Dice coefficient (alternative, often better for short strings):

\text{Dice}(A, B) = \frac{2|A \cap B|}{|A| + |B|}
def ngram_similarity(s1, s2, n=2):
    def get_ngrams(s, n):
        return set(s[i:i+n] for i in range(len(s) - n + 1))

    ng1 = get_ngrams(s1.lower(), n)
    ng2 = get_ngrams(s2.lower(), n)

    intersection = len(ng1 & ng2)
    union = len(ng1 | ng2)

    return intersection / union if union > 0 else 0
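
The Dice variant changes only the final ratio; a sketch mirroring ngram_similarity above (dice_similarity is an illustrative name):

def dice_similarity(s1, s2, n=2):
    def get_ngrams(s, n):
        return set(s[i:i+n] for i in range(len(s) - n + 1))

    ng1 = get_ngrams(s1.lower(), n)
    ng2 = get_ngrams(s2.lower(), n)

    total = len(ng1) + len(ng2)
    # Dice = 2 * |intersection| / (|A| + |B|)
    return 2 * len(ng1 & ng2) / total if total > 0 else 0

print(round(dice_similarity("night", "nacht"), 2))  # 0.25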

Best for: Longer strings, partial matches, word reordering.

Phonetic Algorithms (Soundex)

Phonetic algorithms encode strings by pronunciation, matching words that sound alike despite different spellings.

Soundex Encoding Table:

Code | Letters
1 | B, F, P, V
2 | C, G, J, K, Q, S, X, Z
3 | D, T
4 | L
5 | M, N
6 | R
(drop) | A, E, I, O, U, H, W, Y

Algorithm: Keep first letter → encode consonants → remove vowels → remove consecutive duplicates → pad/truncate to 4 chars.

Example:

Name | Process | Soundex
Robert | R-O-B-E-R-T → R-1-6-3 | R163
Rupert | R-U-P-E-R-T → R-1-6-3 | R163
Smith | S-M-I-T-H → S-5-3 | S530
Smyth | S-M-Y-T-H → S-5-3 | S530
def soundex(name):
    if not name:
        return ""

    name = name.upper()

    mapping = {
        'B': '1', 'F': '1', 'P': '1', 'V': '1',
        'C': '2', 'G': '2', 'J': '2', 'K': '2',
        'Q': '2', 'S': '2', 'X': '2', 'Z': '2',
        'D': '3', 'T': '3',
        'L': '4',
        'M': '5', 'N': '5',
        'R': '6'
    }

    # Keep the first letter; track its code so an adjacent duplicate is not repeated
    code = name[0]
    prev = mapping.get(name[0], '')

    for char in name[1:]:
        if char in ('H', 'W'):
            # H and W are dropped and do not separate duplicate codes
            continue
        digit = mapping.get(char, '')  # vowels and Y map to '' and reset the duplicate check
        if digit and digit != prev:
            code += digit
        prev = digit

    return (code + '000')[:4]
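
Applied to the names in the table above:

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Smith"), soundex("Smyth"))    # S530 S530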

Best for: Names with spelling variations, phonetic typos.

Token-Based Matching

For multi-word entity names, tokenize first and compare token sets.

Token Jaccard:

\text{TokenJaccard}(s_1, s_2) = J(\text{tokens}(s_1), \text{tokens}(s_2))

Example: "University of California, Berkeley" vs "Berkeley, University of California"

  • Tokens: {university, of, california, berkeley} = identical sets
  • Similarity = 1.0

TF-IDF weighted matching assigns higher weight to distinctive tokens:

\text{tfidf}(t, d, D) = \text{tf}(t, d) \times \log\frac{|D|}{|\{d \in D : t \in d\}|}

Common words like "University" or "Corporation" get lower weights; distinctive words drive the match.

import re

def normalize(s):
    # Simple normalizer (an assumed helper): lowercase, strip punctuation, collapse whitespace
    return " ".join(re.sub(r"[^\w\s]", " ", s.lower()).split())

def token_similarity(s1, s2):
    tokens1 = set(normalize(s1).split())
    tokens2 = set(normalize(s2).split())

    intersection = len(tokens1 & tokens2)
    union = len(tokens1 | tokens2)

    return intersection / union if union > 0 else 0
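
One way to apply the TF-IDF weighting described above is a cosine-style score over IDF-weighted tokens. A sketch, reusing the normalize helper from above; doc_freq and n_docs are assumed to be precomputed statistics over the reference names:

import math

def tfidf_similarity(s1, s2, doc_freq, n_docs):
    # doc_freq: token -> number of reference names containing it (assumed precomputed)
    def weights(tokens):
        # +1 avoids division by zero for tokens never seen in the reference set
        return {t: math.log(n_docs / (1 + doc_freq.get(t, 0))) for t in tokens}

    t1 = set(normalize(s1).split())
    t2 = set(normalize(s2).split())
    w1, w2 = weights(t1), weights(t2)

    shared = t1 & t2
    numerator = sum(w1[t] * w2[t] for t in shared)
    norm1 = math.sqrt(sum(w * w for w in w1.values()))
    norm2 = math.sqrt(sum(w * w for w in w2.values()))
    return numerator / (norm1 * norm2) if norm1 and norm2 else 0.0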

Best for: Long names, word reordering, missing/extra words.

System Architecture

The matching system applies algorithms in layers, from fastest/strictest to slowest/fuzziest:

def match_entity(query, reference_db):
    query = normalize(query)

    # Layer 1: Exact match (O(1) with hash index)
    if match := reference_db.exact_match(query):
        return Result(match, confidence=1.0)

    if match := reference_db.alias_match(query):
        return Result(match, confidence=0.95)

    # Layer 2: Fuzzy matching
    candidates = []

    # Phonetic matching (fast, good for typos)
    phonetic_code = soundex(query)
    for match in reference_db.by_soundex(phonetic_code):
        score = jaro_winkler(query, match.name)
        candidates.append((match, score, "phonetic"))

    # Token matching (good for reordering)
    query_tokens = set(query.split())
    for match in reference_db.by_tokens(query_tokens):
        score = token_similarity(query, match.name)
        candidates.append((match, score, "token"))

    # Edit distance (expensive, last resort)
    for match in reference_db.top_by_ngram(query, limit=100):
        score = 1 - levenshtein_distance(query, match.name) / max(len(query), len(match.name))
        if score > 0.6:
            candidates.append((match, score, "edit"))

    # Rank and return
    candidates.sort(key=lambda x: -x[1])
    if candidates and candidates[0][1] > 0.85:
        return Result(candidates[0][0], candidates[0][1])

    return AmbiguousResult(candidates[:5]) if candidates else NoMatch()

Choosing the Right Algorithm

Variation Type | Best Algorithm | Reason
Typos (1-2 chars) | Levenshtein | Directly measures edit operations
Phonetic errors | Soundex / Metaphone | Matches by pronunciation
Name transposition | Jaro-Winkler | Handles transposed characters
Word reordering | Token Jaccard | Order-independent
Abbreviations | Token + Alias table | "Univ" → "University" via lookup
Mixed variations | Ensemble scoring | Combine multiple algorithms
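
For the ensemble case in the last row, a rough sketch that blends the measures defined earlier; the 0.4/0.3/0.3 weights are arbitrary placeholders rather than tuned values:

def ensemble_score(query, candidate):
    # Weighted blend of the similarity functions defined above
    scores = {
        "jaro_winkler": jaro_winkler(query, candidate),
        "token": token_similarity(query, candidate),
        "ngram": ngram_similarity(query, candidate),
    }
    weights = {"jaro_winkler": 0.4, "token": 0.3, "ngram": 0.3}
    return sum(weights[k] * scores[k] for k in scores)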

The Value

A well-designed entity matching system transforms data quality from an ongoing burden into a solved problem. The combination of exact matching for speed, phonetic matching for typos, and token matching for reordering handles the vast majority of real-world variations automatically.

Alias accumulation creates a virtuous cycle: each resolved variation can be stored as an alias and becomes an exact match for future queries. The system learns the particular variations common in your data sources, becoming more accurate over time.