Product Matching Across E-Commerce Sites: Algorithms That Actually Work
A technical deep-dive into fuzzy matching, AI-powered semantic matching, and vector embeddings for cross-site product comparison in e-commerce.
Scraping competitor prices is the easy part. The hard part is figuring out which of your products corresponds to which of theirs.
"Premium Glass Jar 4oz Clear" from your catalog might be the same as "4 oz Clear Glass Container - Premium Quality" from a competitor. Or it might not — maybe theirs has a different closure type. Getting this right is the foundation of useful competitive intelligence.
Here's a technical look at the algorithms that solve this problem, their tradeoffs, and how to combine them for reliable results.
The Challenge
Product matching across e-commerce sites is hard because:
- Naming conventions vary wildly. "1oz Mylar Bag" vs "Mylar Pouch 1 Ounce" vs "1-oz Flat Pouch, Mylar" — all the same product, all described differently.
- Attributes are embedded in titles. Size, color, material, and closure type are mashed into the product name rather than structured as separate fields. Extracting and comparing them requires parsing.
- Partial matches matter. Two products might be 80% similar — same material, same size, but different closure type. Whether that's a "match" depends on your business context.
- Scale compounds the problem. With 500 brand products and 5,000 competitor products, there are 2.5 million potential pairs to evaluate. Brute-force comparison doesn't work.

Algorithm 1: Fuzzy Text Matching
Fuzzy matching compares product name strings using edit distance or token-based similarity metrics. The most practical implementation for product titles is token sort ratio.
How Token Sort Ratio Works
Token sort ratio normalizes each title (lowercase, split on whitespace), sorts the tokens alphabetically, and then compares the two sorted strings with a standard similarity ratio:

- "Premium Glass Jar 4oz Clear" → sorted: "4oz clear glass jar premium"
- "4 oz Clear Glass Container Premium" → sorted: "4 clear container glass oz premium"

Token sort handles word reordering, which is the most common difference between product names across stores.
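The sort-then-compare mechanics can be sketched with a minimal, dependency-free implementation. In production you'd use RapidFuzz's `fuzz.token_sort_ratio`, which does the same thing with a Levenshtein-based ratio instead of difflib's matcher:

```python
# Minimal token sort ratio sketch: lowercase, sort tokens, compare.
# Scores differ slightly from RapidFuzz (difflib uses Ratcliff/Obershelp,
# not Levenshtein), but the behavior is the same in spirit.
from difflib import SequenceMatcher

def token_sort_ratio(a: str, b: str) -> float:
    """Lowercase, split on whitespace, sort tokens, then compare the results."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return 100 * SequenceMatcher(None, sorted_a, sorted_b).ratio()

score = token_sort_ratio("Premium Glass Jar 4oz Clear",
                         "4 oz Clear Glass Container Premium")
```

Note that reordered but otherwise identical titles score a perfect 100, which is exactly the property that makes this metric suit product names.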
Strengths
- Fast. RapidFuzz processes thousands of comparisons per second
- No external dependencies. Runs locally, no API calls
- Predictable. Same inputs always produce the same score
- Good for obvious matches. Products with similar names score 85-95%
Weaknesses
- Semantic blind spots. Doesn't know that "CR" means "child-resistant" or that "1oz" and "1 ounce" are the same
- Noise sensitivity. Marketing language ("Best Seller!", "NEW!") in titles reduces match accuracy
- No attribute awareness. Treats all words equally — can't distinguish size from color from material
When to Use
Fuzzy matching is a great first pass. Set a threshold of 75-85% and you'll catch the straightforward matches with high confidence. Products below the threshold need a smarter approach.
Algorithm 2: AI-Powered Semantic Matching
Language models (like GPT-4o-mini) understand product semantics. They know that "4oz" and "4 ounce" are equivalent, that "mylar" and "metalized polyester" refer to the same material, and that "pop top" is a closure type.
How It Works
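At its simplest, the AI pass is a prompt-plus-parse step: ask the model for a structured verdict on a pair of titles, then parse the reply. The function names and JSON schema below are illustrative assumptions, not a fixed interface:

```python
# Hedged sketch of LLM-based matching: build a prompt asking for a
# machine-readable verdict, then parse the model's JSON reply.
import json

def build_match_prompt(brand_title: str, competitor_title: str) -> str:
    """Ask the model whether two titles describe the same product."""
    return (
        "Do these two e-commerce product titles describe the same product?\n"
        f"Product A: {brand_title}\n"
        f"Product B: {competitor_title}\n"
        'Reply with JSON only: {"match": true|false, "confidence": 0-100, '
        '"reasoning": "..."}'
    )

def parse_verdict(raw: str) -> dict:
    """Parse the model's JSON reply; treat parse failures as 'needs review'."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"match": False, "confidence": 0, "reasoning": "unparseable reply"}

# The actual call would use the OpenAI Python client (requires an API key):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": build_match_prompt(a, b)}],
# )
# verdict = parse_verdict(resp.choices[0].message.content)
```

Requesting a confidence score and reasoning in the reply is what enables the "explains reasoning" and "handles ambiguity" strengths listed below.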
Strengths
- Semantic understanding. Handles synonyms, abbreviations, and domain knowledge
- Context-aware. Can use industry profile (categories, common terms) to improve accuracy
- Explains reasoning. The model can articulate why two products match or don't match
- Handles ambiguity. Can flag "possible matches" for human review
Weaknesses
- Cost. Each matching call costs money (though GPT-4o-mini is cheap at ~$0.15 per 1M input tokens)
- Latency. API calls take 1-5 seconds per batch
- Non-deterministic. The same inputs might produce slightly different results across runs
- Hallucination risk. The model might confidently match products that aren't actually the same
When to Use
AI matching is ideal as a second pass after fuzzy matching. Run fuzzy first to catch the easy matches, then send the remaining unmatched products to the LLM for semantic analysis.
Algorithm 3: Vector Embeddings
Vector embeddings represent product names as high-dimensional numerical vectors. Similar products have vectors that are close together in embedding space, regardless of how differently they're worded.
How It Works
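A minimal sketch of embedding-based retrieval: embed every product once, then rank competitor products by cosine similarity to each brand product. The tiny hand-made vectors below stand in for real embeddings, which would come from a model such as `text-embedding-3-small` and be stored in pgvector:

```python
# Sketch of nearest-neighbor candidate retrieval over embeddings.
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, catalog, k=20):
    """Return the k catalog items whose embeddings are closest to the query."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in catalog.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy catalog: product name -> 2-D stand-in embedding
catalog = {
    "4 oz Clear Glass Container Premium": [0.9, 0.1],
    "1oz Mylar Bag": [0.1, 0.9],
}
neighbors = top_k([0.85, 0.15], catalog, k=1)
```

In production the brute-force loop is replaced by an index (e.g. HNSW in pgvector), but the ranking logic is the same.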
Strengths
- Scales efficiently. Embedding generation is a one-time cost per product. Similarity search is fast with HNSW indexes
- Language-agnostic similarity. Captures semantic meaning without explicit rules
- Incrementally updateable. New products get embedded once and are immediately searchable
Weaknesses
- Black box. Hard to explain why two products matched or didn't
- Requires infrastructure. Needs a vector database (though pgvector adds this to PostgreSQL natively)
- Embedding quality varies. General-purpose embeddings may not capture domain-specific nuances (e.g., packaging terminology)
When to Use
Vector search works well as a candidate retrieval step. Find the top 10-20 nearest neighbors for each product, then use fuzzy matching or AI to confirm the actual match.
The Hybrid Approach
The most reliable strategy combines all three algorithms:
Pass 1: Candidate Retrieval with Embeddings
Generate embeddings for all products. For each brand product, retrieve the top 20 most similar competitor products by cosine similarity. This reduces the search space from thousands to a manageable candidate set.
Pass 2: Fuzzy Scoring
Run token sort ratio on all candidate pairs. Products scoring above 85% are high-confidence matches, products scoring 60-85% go to the AI pass, and products scoring below 60% are treated as non-matches.
Pass 3: AI Confirmation
Send ambiguous candidates (60-85% fuzzy score) to GPT-4o-mini for semantic evaluation. The model provides a confidence score and reasoning.
Pass 4: Human Review
Products that the AI is uncertain about (confidence below 70%) get flagged for manual review. This is typically 5-10% of the total — a manageable workload.
Practical Considerations
Price Ratio Guards
If your product costs $5 and the potential match costs $500, they're probably not the same product regardless of name similarity. Apply a price ratio guard (e.g., reject matches where the price ratio exceeds 10x).
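As a sketch, the guard is a one-liner check applied before accepting any match (the function name and 10x default are illustrative):

```python
# Price ratio guard: reject candidate pairs whose prices differ by more
# than a configurable factor, regardless of name similarity.
def passes_price_guard(price_a: float, price_b: float,
                       max_ratio: float = 10.0) -> bool:
    """True if the two prices are within max_ratio of each other."""
    if price_a <= 0 or price_b <= 0:
        return False  # missing or invalid prices can't be compared
    high, low = max(price_a, price_b), min(price_a, price_b)
    return high / low <= max_ratio
```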
Stale Product Detection
Products whose data hasn't been refreshed by a scrape in over 30 days should be flagged. They might be discontinued or out of stock, which makes matches against them unreliable.
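A minimal staleness check, assuming each product record carries a timezone-aware timestamp of its last successful scrape:

```python
# Stale-product flagging: anything not seen by a scrape within the
# cutoff window is excluded from confident matching.
from datetime import datetime, timedelta, timezone

def is_stale(last_scraped: datetime, max_age_days: int = 30) -> bool:
    """True if the product's last successful scrape is older than the cutoff."""
    age = datetime.now(timezone.utc) - last_scraped
    return age > timedelta(days=max_age_days)
```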
Confidence Tracking
Track the confidence distribution of your matches over time. If average confidence is dropping, it might indicate that competitors are changing their naming conventions or that your product catalog has shifted.
Implementation in VantageDash
VantageDash implements all three algorithms. The Comparison page shows matched products with confidence scores, and you can run fuzzy matching, AI matching, or hybrid matching from the dashboard. Product embeddings are stored via pgvector in Supabase, enabling fast similarity search across thousands of products.
Match results include confidence scores, reasoning from the AI model, and price-per-unit comparisons to help you make informed pricing decisions.