Product Matching Across E-Commerce Sites: Algorithms That Actually Work
A technical deep-dive into fuzzy matching, AI-powered semantic matching, and vector embeddings for cross-site product comparison in e-commerce.
Scraping competitor prices is the easy part. The hard part is figuring out which of your products corresponds to which of theirs.
"Premium Glass Jar 4oz Clear" from your catalog might be the same as "4 oz Clear Glass Container - Premium Quality" from a competitor. Or it might not — maybe theirs has a different closure type. Getting this right is the foundation of useful competitive intelligence.
Here's a technical look at the algorithms that solve this problem, their tradeoffs, and how to combine them for reliable results.
The Challenge
Product matching across e-commerce sites is hard because:
- Naming conventions vary wildly. "1oz Mylar Bag" vs "Mylar Pouch 1 Ounce" vs "1-oz Flat Pouch, Mylar" — all the same product, all described differently.
- Attributes are embedded in titles. Size, color, material, and closure type are mashed into the product name rather than structured as separate fields. Extracting and comparing them requires parsing.
- Partial matches matter. Two products might be 80% similar — same material, same size, but different closure type. Whether that's a "match" depends on your business context.
- Scale compounds the problem. With 500 brand products and 5,000 competitor products, there are 2.5 million potential pairs to evaluate. Brute-force comparison doesn't work.

Algorithm 1: Fuzzy Text Matching
Fuzzy matching compares product name strings using edit distance or token-based similarity metrics. The most practical implementation for product titles is token sort ratio.
How Token Sort Ratio Works
Token sort ratio normalizes each title (lowercase, split on whitespace), sorts the tokens alphabetically, and then compares the two sorted strings with a standard similarity ratio:

- "Premium Glass Jar 4oz Clear" → sorted: "4oz clear glass jar premium"
- "4 oz Clear Glass Container Premium" → sorted: "4 clear container glass oz premium"

Token sort handles word reordering, which is the most common difference between product names across stores.
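The sort-then-compare mechanics can be sketched with a minimal, dependency-free implementation. In production you'd use RapidFuzz's `fuzz.token_sort_ratio`, which does the same thing with a Levenshtein-based ratio instead of difflib's matcher:

```python
# Minimal token sort ratio sketch: lowercase, sort tokens, compare.
# Scores differ slightly from RapidFuzz (difflib uses Ratcliff/Obershelp,
# not Levenshtein), but the behavior is the same in spirit.
from difflib import SequenceMatcher

def token_sort_ratio(a: str, b: str) -> float:
    """Lowercase, split on whitespace, sort tokens, then compare the results."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return 100 * SequenceMatcher(None, sorted_a, sorted_b).ratio()

score = token_sort_ratio("Premium Glass Jar 4oz Clear",
                         "4 oz Clear Glass Container Premium")
```

Note that reordered but otherwise identical titles score a perfect 100, which is exactly the property that makes this metric suit product names.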
Strengths
- Fast. RapidFuzz processes thousands of comparisons per second
- No external dependencies. Runs locally, no API calls
- Predictable. Same inputs always produce the same score
- Good for obvious matches. Products with similar names score 85-95%
Weaknesses
- Semantic blind spots. Doesn't know that "CR" means "child-resistant" or that "1oz" and "1 ounce" are the same
- Noise sensitivity. Marketing language ("Best Seller!", "NEW!") in titles reduces match accuracy
- No attribute awareness. Treats all words equally — can't distinguish size from color from material
When to Use
Fuzzy matching is a great first pass. Set a threshold of 75-85% and you'll catch the straightforward matches with high confidence. Products below the threshold need a smarter approach.
Algorithm 2: AI-Powered Semantic Matching
Language models (like GPT-4o-mini) understand product semantics. They know that "4oz" and "4 ounce" are equivalent, that "mylar" and "metalized polyester" refer to the same material, and that "pop top" is a closure type.
How It Works
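At its simplest, the AI pass is a prompt-plus-parse step: ask the model for a structured verdict on a pair of titles, then parse the reply. The function names and JSON schema below are illustrative assumptions, not a fixed interface:

```python
# Hedged sketch of LLM-based matching: build a prompt asking for a
# machine-readable verdict, then parse the model's JSON reply.
import json

def build_match_prompt(brand_title: str, competitor_title: str) -> str:
    """Ask the model whether two titles describe the same product."""
    return (
        "Do these two e-commerce product titles describe the same product?\n"
        f"Product A: {brand_title}\n"
        f"Product B: {competitor_title}\n"
        'Reply with JSON only: {"match": true|false, "confidence": 0-100, '
        '"reasoning": "..."}'
    )

def parse_verdict(raw: str) -> dict:
    """Parse the model's JSON reply; treat parse failures as 'needs review'."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"match": False, "confidence": 0, "reasoning": "unparseable reply"}

# The actual call would use the OpenAI Python client (requires an API key):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": build_match_prompt(a, b)}],
# )
# verdict = parse_verdict(resp.choices[0].message.content)
```

Requesting a confidence score and reasoning in the reply is what enables the "explains reasoning" and "handles ambiguity" strengths listed below.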
Strengths
- Semantic understanding. Handles synonyms, abbreviations, and domain knowledge
- Context-aware. Can use industry profile (categories, common terms) to improve accuracy
- Explains reasoning. The model can articulate why two products match or don't match
- Handles ambiguity. Can flag "possible matches" for human review
Weaknesses
- Cost. Each matching call costs money (though GPT-4o-mini is cheap at ~$0.15 per 1M input tokens)
- Latency. API calls take 1-5 seconds per batch
- Non-deterministic. The same inputs might produce slightly different results across runs
- Hallucination risk. The model might confidently match products that aren't actually the same
When to Use
AI matching is ideal as a second pass after fuzzy matching. Run fuzzy first to catch the easy matches, then send the remaining unmatched products to the LLM for semantic analysis.
Algorithm 3: Vector Embeddings
Vector embeddings represent product names as high-dimensional numerical vectors. Similar products have vectors that are close together in embedding space, regardless of how differently they're worded.
How It Works
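A minimal sketch of embedding-based retrieval: embed every product once, then rank competitor products by cosine similarity to each brand product. The tiny hand-made vectors below stand in for real embeddings, which would come from a model such as `text-embedding-3-small` and be stored in pgvector:

```python
# Sketch of nearest-neighbor candidate retrieval over embeddings.
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, catalog, k=20):
    """Return the k catalog items whose embeddings are closest to the query."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in catalog.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy catalog: product name -> 2-D stand-in embedding
catalog = {
    "4 oz Clear Glass Container Premium": [0.9, 0.1],
    "1oz Mylar Bag": [0.1, 0.9],
}
neighbors = top_k([0.85, 0.15], catalog, k=1)
```

In production the brute-force loop is replaced by an index (e.g. HNSW in pgvector), but the ranking logic is the same.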
Strengths
- Scales efficiently. Embedding generation is a one-time cost per product. Similarity search is fast with HNSW indexes
- Language-agnostic similarity. Captures semantic meaning without explicit rules
- Incrementally updateable. New products get embedded once and are immediately searchable
Weaknesses
- Black box. Hard to explain why two products matched or didn't
- Requires infrastructure. Needs a vector database (though pgvector adds this to PostgreSQL natively)
- Embedding quality varies. General-purpose embeddings may not capture domain-specific nuances (e.g., packaging terminology)
When to Use
Vector search works well as a candidate retrieval step. Find the top 10-20 nearest neighbors for each product, then use fuzzy matching or AI to confirm the actual match.
The Hybrid Approach
The most reliable strategy combines all three algorithms:
Pass 1: Candidate Retrieval with Embeddings
Generate embeddings for all products. For each brand product, retrieve the top 20 most similar competitor products by cosine similarity. This reduces the search space from thousands to a manageable candidate set.
Pass 2: Fuzzy Scoring
Run token sort ratio on all candidate pairs. Products scoring above 85% are high-confidence matches, products scoring 60-85% go to the AI pass, and products scoring below 60% are treated as non-matches.
Pass 3: AI Confirmation
Send ambiguous candidates (60-85% fuzzy score) to GPT-4o-mini for semantic evaluation. The model provides a confidence score and reasoning.
Pass 4: Human Review
Products that the AI is uncertain about (confidence below 70%) get flagged for manual review. This is typically 5-10% of the total — a manageable workload.
Practical Considerations
Price Ratio Guards
If your product costs $5 and the potential match costs $500, they're probably not the same product regardless of name similarity. Apply a price ratio guard (e.g., reject matches where the price ratio exceeds 10x).
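As a sketch, the guard is a one-liner check applied before accepting any match (the function name and 10x default are illustrative):

```python
# Price ratio guard: reject candidate pairs whose prices differ by more
# than a configurable factor, regardless of name similarity.
def passes_price_guard(price_a: float, price_b: float,
                       max_ratio: float = 10.0) -> bool:
    """True if the two prices are within max_ratio of each other."""
    if price_a <= 0 or price_b <= 0:
        return False  # missing or invalid prices can't be compared
    high, low = max(price_a, price_b), min(price_a, price_b)
    return high / low <= max_ratio
```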
Stale Product Detection
Products whose data hasn't been refreshed by a scrape in over 30 days should be flagged. They might be discontinued or out of stock, which makes matches against them unreliable.
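A minimal staleness check, assuming each product record carries a timezone-aware timestamp of its last successful scrape:

```python
# Stale-product flagging: anything not seen by a scrape within the
# cutoff window is excluded from confident matching.
from datetime import datetime, timedelta, timezone

def is_stale(last_scraped: datetime, max_age_days: int = 30) -> bool:
    """True if the product's last successful scrape is older than the cutoff."""
    age = datetime.now(timezone.utc) - last_scraped
    return age > timedelta(days=max_age_days)
```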
Confidence Tracking
Track the confidence distribution of your matches over time. If average confidence is dropping, it might indicate that competitors are changing their naming conventions or that your product catalog has shifted.
Implementation in VantageDash
VantageDash implements all three algorithms. The Comparison page shows matched products with confidence scores, and you can run fuzzy matching, AI matching, or hybrid matching from the dashboard. Product embeddings are stored via pgvector in Supabase, enabling fast similarity search across thousands of products.
Match results include confidence scores, reasoning from the AI model, and price-per-unit comparisons to help you make informed pricing decisions.