Designing Search Ranking System

May 8, 2022 4 min read

Table of Contents

Overview of Search Rank Chapter from educative.io and other sources:
Identify the metrics:
High Level architecture
- Document Selection:
References:
Topics/Concepts to learn:

Overview of Search Rank Chapter from educative.io and other sources:

Problem: Design a search rank system for a search engine

Clarify the following:
- Scope of the problem
- Scaling factors
- Personalization aspects

Identify the metrics:

Online Metrics: (Applicable per query)
- Click through rate: \[ \text{Click Through Rate} = \frac{\text{Number of clicks}}{\text{Total Number of views or impressions}} \]
- Sucessful Session Rate
- Time to Success
Offline Metrics:
- Normalized Discounted Cumulative Gain (NDCG)
  - Cumulative Gain is the sum of all document relevance
  - Discounted Cumulative Gain penalizes incorrect relevancy of the document based on ground truth. (Ground truth is actual relevance as rated by human raters)
  - DCG can’t be used to compare the search engine’s performance across different queries on an absolute scale. The is because the length of the result list varies from query to query. So, the DCG for a query with a longer result list may be higher due to its length instead of its quality. To remedy this, you need to move towards NDCG.
  - NDCG is calculated by dividing DCG with max(DCG) or by calculating ideal DCG. IDCG is calculated by rearranging search results as per human raters.
- Average NDCG across N queries

High Level architecture

Two step process:
- Document Selection
- Filtering and ranking relevancy

Document Selection:

Four actors:
- Searcher: prior query results, historical engagement
- Actual Query
- Document corpus
- Context

Overall approach for collecting documents for training:

Collect a large number of (randomly extracted) queries to form a test set.
For each request q- Collect the documents {dj} m j = 1 associated with the request.
Take the judgment on the appropriateness of each document through a human evaluation.
Use a ranking model to rank the documents.
Measure the difference between the results of the ranking and the assessment of appropriateness using an evaluation measure.
Use the average measure for all queries in the test set to evaluate the performance of the ranking model.

Query Processing:

Lexical Based: Spell check, expansion, relaxation (removing stop words, helps to improve recall), translation (another form of rewriting text translation applied to convert tail queries into head queries which inturn helps recall), normalization (Stemming, unicode standardization, removing accent, spell checking and tokenization) etc.
Graph based:

During query time, the query is parsed and tagged to the relevant concepts (i.e., nodes) in the knowledge graph. The query can then be expanded—to include relevant concepts—by finding the closest nodes in the knowledge graphs.

Both Lexical and Graph based approaches treat queries as bag of words and apply n-grams approach and struggle to understand synonyms, antonyms, morphological variants, spelling errors etc. Knowledge graph requires lots of manual work, huge effort to build the ontology and graph management.

Embedding based approaches require large behavioral data and is a problem with cold start.

Embeddings Based

Embedding-based retrieval is a technique to use embeddings to represent query and documents, and then convert the retrieval problem into a nearest neighbor (NN) search problem in the embedding space

Scaling Challenges

Document retreival layer has millions of documents to consider against a given query unlike the ranking layer which has to deal with only 100s of document results.

Building embeddings for these millions of documents and searching against them introduces lots of scaling challenges.

Learning to Rank:

-Feature based ranking:

BM25
VSM
Latent Semantic Indexing
Language Model for Info retrieval - LMIR
PageRank, TrustRank and Query-dependent Page Rank models

Machine learning approaches to ranking:

Pointwise Ranking
Pairwise Ranking
Listwise Ranking

References:

Topics/Concepts to learn:

Query Rewriting:
- Stemming
- Lemmetizing
- Normalization
- SpellChecking techniques
- Tokenization
Embedding:
- Skip-gram approach
- CBOW
- Word2vec
- Glove Algorithm
Metric:
- Jaccard Similarity
- Cosine Similarity
- NDCG
- Pointwise Mutual Information Matrix
NN architectures:
- Transformer
- RNN
- LSTM
- GRU
Ranking Approaches:
- Pointwise Ranking
- Pairwise Ranking
- Listwise Ranking

ML search_ranking