Efficient Ranking and Evaluation in AI-Powered E-commerce Search
Retrieval and ranking are different problems. Retrieval answers: which products are relevant to this query? Ranking answers: in what order should they be shown to this specific user, right now?
In AI-powered e-commerce search, ranking is solved in two separate passes, one offline and one online, because they have fundamentally different requirements. The offline pass is global and batch-computed; the online pass is personal and real-time. Running both in the hot path would destroy latency SLOs. Running only one sacrifices either relevance or personalization.
Why Rank Twice?
Hybrid search returns a merged candidate set for a query like ‘running shoes’, say 500 products. The ranking problem is deciding which 10 to show first. A single pass combining global relevance and personal signals would require fetching real-time user data for every product in the candidate set, which is expensive and slow. A single offline pass produces globally relevant but non-personalized results. The two-pass approach separates these concerns cleanly.
Pass 1: Offline Static Ranking (Global Relevance)
Runs as part of the batch pipeline. Three signal categories feed into the global relevance score:
Collective user behavior: Historical data on which products performed best for which queries. A purchase is weighted more than an add-to-cart, which is weighted more than a click.
Textual relevance: Semantic and keyword match between the query and the product’s title, description, and attributes.
Product popularity: Overall sales velocity, view counts, and ratings, independent of any specific query.
This global relevance score is stored alongside the product in the search index and cache. It becomes the baseline ordering for any query, before personalization is applied.
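A minimal sketch of how the three signal categories might be combined into one batch-computed score. The weights and field names here are illustrative assumptions; the source only specifies the ordering purchase > add-to-cart > click, not exact numbers:

```python
# Illustrative behavior weights: a purchase counts more than an
# add-to-cart, which counts more than a click (ordering from the text;
# the numbers themselves are assumptions).
BEHAVIOR_WEIGHTS = {"purchase": 5.0, "add_to_cart": 2.0, "click": 1.0}

def global_relevance(behavior_counts, text_match, popularity,
                     w_behavior=0.5, w_text=0.3, w_pop=0.2):
    """Combine the three signal categories into one offline score.

    behavior_counts: per-event counts for this (query, product) pair
    text_match:      semantic/keyword match score in [0, 1]
    popularity:      query-independent popularity score in [0, 1]
    """
    behavior = sum(BEHAVIOR_WEIGHTS[event] * n
                   for event, n in behavior_counts.items())
    return w_behavior * behavior + w_text * text_match + w_pop * popularity

# Batch job computes this per product and writes it to the index/cache.
score = global_relevance({"purchase": 3, "add_to_cart": 10, "click": 120},
                         text_match=0.82, popularity=0.4)
```

In practice the behavior term would be normalized per query; the point is only that the score is computed once, offline, and stored as the baseline ordering.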
Pass 2: Online Personalized Re-ranking
At serving time, hybrid search returns the top 500 candidates. The online re-ranking model operates on this set using three personalization signals:
- short-term intent (recently viewed brand → boost that brand’s products),
- long-term preferences (historical purchases → boost preferred categories and attributes), and
- context (device type, geolocation, time of day).
The re-ranking model is lightweight by design, as it needs to run in milliseconds. Typically it is a gradient-boosted model or small neural network, not a large transformer. The heavy lifting of understanding query intent was done offline. The online model’s job is a fast reordering of a pre-qualified candidate set.
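A sketch of the fast reordering step. The product dicts and boost factors are hypothetical; a production system would use a trained gradient-boosted model over these features rather than hand-set multipliers:

```python
def rerank(candidates, recent_brand=None, preferred_categories=()):
    """Reorder the pre-qualified candidate set using personal signals.

    candidates: product dicts carrying the offline global_relevance score
    recent_brand: short-term intent signal (recently viewed brand)
    preferred_categories: long-term preference signal (from purchase history)
    """
    def personalized_score(p):
        score = p["global_relevance"]        # baseline from the offline pass
        if recent_brand and p["brand"] == recent_brand:
            score *= 1.3                     # short-term intent boost
        if p["category"] in preferred_categories:
            score *= 1.15                    # long-term preference boost
        return score

    return sorted(candidates, key=personalized_score, reverse=True)

candidates = [
    {"id": 1, "brand": "Nike",   "category": "running", "global_relevance": 0.90},
    {"id": 2, "brand": "Adidas", "category": "running", "global_relevance": 0.95},
]
# A user who recently viewed Nike sees Nike first despite its lower
# global score (0.90 * 1.3 = 1.17 > 0.95).
top = rerank(candidates, recent_brand="Nike")
```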
The Evaluation Stack
LLM-as-Judge: A judge model (GPT-4o or Claude 3.5 Sonnet) rates relevance on a 1–5 scale with concrete anchors:
1 is ‘query: Phone, result: Socks’.
3 is ‘query: Nike Shoes, result: Adidas Shoes’.
5 is ‘query: iPhone 15, result: Apple iPhone 15 Pro’.
Run daily on a random sample of 500 queries. The judge explains its reasoning, making scores actionable.
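A sketch of the judge side of this pipeline: building a prompt with the rubric's anchors and parsing the score back out. The prompt wording and the `Score: <n>` convention are assumptions; the actual model call is left out, since it depends on the vendor API:

```python
import re

# Rubric anchors taken from the text; the surrounding instructions are
# an assumed prompt format, not quoted from the source.
RUBRIC = """Rate how relevant the result is to the query on a 1-5 scale:
1 = unrelated (query: Phone, result: Socks)
3 = same category, wrong brand (query: Nike Shoes, result: Adidas Shoes)
5 = exact or near-exact match (query: iPhone 15, result: Apple iPhone 15 Pro)
Explain your reasoning, then end with a line 'Score: <n>'."""

def build_judge_prompt(query, result_title):
    return f"{RUBRIC}\n\nquery: {query}\nresult: {result_title}"

def parse_score(judge_output):
    """Extract the trailing 1-5 score from the judge's free-text answer."""
    m = re.search(r"Score:\s*([1-5])", judge_output)
    return int(m.group(1)) if m else None
```

Keeping the reasoning in the output (and only parsing the final score line) is what makes low scores actionable: an engineer can read why a result was judged irrelevant.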
Golden Set Regression Testing: Before any change ships, a pytest suite of critical queries must pass: ‘iphone’ must return Apple in the top 3. ‘Running shoes’ must return footwear. These block deployment. The golden set catches categorical failures LLM judges can miss. A plausible wrong result might score 3/5 from a judge but fail the golden set immediately.
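A hedged sketch of what such a golden-set suite might look like. The `search` stub and its return shape are assumptions standing in for the real search endpoint; in the book's setup these run under pytest and block deployment:

```python
def search(query):
    """Stand-in for the real search endpoint; returns ranked product dicts."""
    catalog = {
        "iphone": [{"brand": "Apple", "category": "phones"}],
        "running shoes": [{"brand": "Nike", "category": "footwear"}],
    }
    return catalog.get(query, [])

def test_iphone_returns_apple_in_top_3():
    # Categorical check: a plausible-but-wrong result fails here even if
    # an LLM judge would score it 3/5.
    assert any(r["brand"] == "Apple" for r in search("iphone")[:3])

def test_running_shoes_returns_footwear():
    results = search("running shoes")
    assert results and all(r["category"] == "footwear" for r in results[:10])
```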
A/B Testing on Revenue Per Session: Significant changes roll out to 5% of users. Bucketing: SHA-256 hash of user ID modulo 100, deterministic, same user always in the same bucket. Primary metric: Revenue Per Session, not NDCG, not CTR. The question is whether the change makes users more likely to buy.
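The bucketing scheme described above is simple enough to sketch directly; the function names are mine, but the mechanics (SHA-256 of the user ID, modulo 100, buckets 0-4 for a 5% rollout) follow the text:

```python
import hashlib

def bucket(user_id: str) -> int:
    """Deterministic bucket in [0, 100): SHA-256 of the user ID mod 100."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def in_treatment(user_id: str, rollout_pct: int = 5) -> bool:
    """Users in the first rollout_pct buckets see the new ranking."""
    return bucket(user_id) < rollout_pct

# Deterministic: the same user always lands in the same bucket, so a
# session is never split between control and treatment.
assert bucket("user-42") == bucket("user-42")
```

Hashing (rather than, say, `user_id % 100` on numeric IDs) avoids correlations between bucket assignment and how IDs were issued.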
HNSW vs. Inverted Partitioning
HNSW (Hierarchical Navigable Small World) builds a multi-layered graph. Searches start at the sparse top layer and descend through denser lower layers for precision. High recall, high throughput, low latency, but a large memory footprint. Suitable for 1M to 10M item catalogs.
Inverted Partitioning groups and compresses vectors into clusters. Dramatically lower memory footprint suitable for 1B+ item catalogs. Tradeoff: higher latency and lower recall than HNSW.
For a 10M product catalog, Chapter 5 recommends HNSW: the memory footprint is manageable, and the accuracy and latency advantages are worth it. Switch to inverted partitioning only when catalog size makes HNSW’s memory consumption prohibitive.
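The recommendation can be sanity-checked with a back-of-envelope memory estimate. The vector dimension, float width, and per-node link count below are assumptions (typical values, not from the source):

```python
def hnsw_memory_gb(n_vectors, dim=768, dtype_bytes=4,
                   links_per_node=32, link_bytes=4):
    """Rough HNSW footprint: raw float vectors plus graph links.

    Assumes 768-dim float32 embeddings and ~32 stored links per node;
    real indexes add per-layer overhead, so treat this as a lower bound.
    """
    vectors = n_vectors * dim * dtype_bytes
    graph = n_vectors * links_per_node * link_bytes
    return (vectors + graph) / 1e9

print(f"{hnsw_memory_gb(10_000_000):.1f} GB")    # ~32 GB: fits one large node
print(f"{hnsw_memory_gb(1_000_000_000):.0f} GB") # ~3200 GB: prohibitive
```

At 10M items the index fits comfortably in RAM on a single large machine; at 1B+ it does not, which is exactly where compressed inverted partitioning earns its recall and latency tradeoff.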
Closing Thoughts
The two-pass ranking architecture is a practical solution to a genuinely hard problem: relevant results AND personalized results, without making the hot path too expensive. The evaluation stack reflects the same principle: no single evaluation approach is sufficient. LLM-as-judge, golden set regression, and A/B testing together are more robust than any one alone, because each catches failure modes the others miss.
Based on Sections 5.9–5.14 of System Design for the LLM Era, which references engineering work from DoorDash, Instacart, Picnic Engineering, and Grab. Available at:
Gumroad: sampritimitra.gumroad.com/l/systemdesignforthellmera
Topmate: topmate.io/sampritimitra/1840112

