👉 For Information Retrieval Pipelines:
🎯 Hit Rate (HR)
: The fraction of queries for which a correct answer appears within the top-k retrieved documents. In short, how often the retriever gets it right within its top few guesses.
🎯 Recall
: The fraction of relevant documents that appear among the retrieved documents, measured over a set of queries. It is affected by the number of documents the retriever returns.
🎯 Precision
: How precise the system is: the fraction of all retrieved documents that are relevant to the query.
👉 For Question Answering Pipelines:
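The three retrieval metrics above (hit rate, recall, precision) can be sketched as follows; this is a minimal illustration with made-up document IDs, not code from any particular retrieval library:

```python
from typing import List, Set

def hit_rate(retrieved: List[List[str]], relevant: List[Set[str]], k: int) -> float:
    """Fraction of queries whose top-k results contain at least one relevant doc."""
    hits = sum(1 for docs, rel in zip(retrieved, relevant)
               if any(d in rel for d in docs[:k]))
    return hits / len(retrieved)

def recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of the relevant documents that were actually retrieved (one query)."""
    return len(set(retrieved) & relevant) / len(relevant)

def precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of the retrieved documents that are relevant (one query)."""
    return len(set(retrieved) & relevant) / len(retrieved)
```

For example, if query 1 retrieves `["d1", "d2"]` against gold set `{"d2"}` and query 2 retrieves `["d3", "d4"]` against `{"d9"}`, the hit rate at k=2 is 0.5.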
🎯 Mean Reciprocal Rank (MRR)
: The average, over all queries, of the reciprocal rank of the first correctly retrieved document, accounting for the fact that a query elicits multiple responses of varying relevance.
<aside> 📘 Effectively a reranking metric: it measures whether the best document the retriever returns appears at the very top of the result list.
$$ MRR = \frac{\sum_{i=1}^Q 1/rank_i}{|Q|} $$
Note: $|Q|$ is the total number of queries, and $rank_i$ is the position of the first correct answer in the results of query $i$.
</aside>
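The formula above can be sketched directly; the input is assumed (illustratively) to be the 1-based rank of the first correct answer per query, with `None` when no correct answer was retrieved:

```python
def mean_reciprocal_rank(first_correct_ranks):
    """MRR over a set of queries.

    `first_correct_ranks` holds, for each query, the 1-based position of the
    first correct answer in that query's results; None means no correct
    answer was retrieved and contributes 0 to the sum.
    """
    total = sum(1.0 / r for r in first_correct_ranks if r is not None)
    return total / len(first_correct_ranks)
```

With ranks `[1, 2, None]` this gives (1 + 1/2 + 0) / 3 = 0.5.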
🎯 Mean Average Precision (mAP)
: The position of every correctly retrieved document. This metric is handy when there is more than one correct document to be retrieved.
<aside>
📘 Also written mAP@K, measuring the positions of all correct answers within the top K results. It can be thought of as a multi-answer ("plural") version of MRR.
</aside>
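A minimal sketch of per-query average precision and its mean over queries (document IDs and data layout are illustrative assumptions, not from the source):

```python
def average_precision(retrieved, relevant):
    """AP for one query: mean of precision@i taken at each position i
    where a correct document appears, normalized by the number of
    relevant documents."""
    hits, precisions = 0, []
    for i, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(all_retrieved, all_relevant):
    """mAP: the mean of the per-query average precisions."""
    aps = [average_precision(r, rel)
           for r, rel in zip(all_retrieved, all_relevant)]
    return sum(aps) / len(aps)
```

For retrieval `["a", "b", "c", "d"]` with gold set `{"a", "c"}`, the correct documents sit at positions 1 and 3, so AP = (1/1 + 2/3) / 2 = 5/6.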
🎯 Normalized Discounted Cumulative Gain (NDCG)
: A ranking performance measure that focuses on the relevant document's position in search results.
<aside> 📘 How NDCG is computed
Gain (G)
: Suppose the retriever returns several results; each one is given an objective relevance score, its gain.
Cumulative Gain (CG)
: The sum of the gains of all returned results.
Discounted Cumulative Gain (DCG)
: Not a plain sum: each result's gain is weighted by a position coefficient before summing, and positions nearer the top get higher weight.
Ideal Discounted Cumulative Gain (IDCG)
: The highest score theoretically attainable, i.e. the DCG when the results are arranged in their best possible order.
Normalized Discounted Cumulative Gain (NDCG)
: $\frac{DCG}{IDCG}$, the normalized score.
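The steps above can be sketched as follows, using the common $1/\log_2(i+1)$ position discount (an assumption; other discount schemes exist):

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(position + 1),
    so results nearer the top contribute more."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains):
    """NDCG = DCG of the actual ordering / DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

A list already in its best order, e.g. gains `[3, 2, 1]`, scores exactly 1.0; the reversed order `[1, 2, 3]` scores strictly less.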
</aside>
🎯 Exact Match (EM)
: Measures the proportion of cases where the predicted answer is identical to the correct answer.
🎯 F1 Score
: More forgiving; it measures the word overlap between the labeled and the predicted answer.
<aside> 📘 Also known as the F-score: https://en.wikipedia.org/wiki/F-score
It is simply the harmonic mean of recall and precision:
$$ F_1 = \frac{2}{recall^{-1} + precision^{-1}} $$
</aside>
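EM and a token-level F1 can be sketched as below; the whitespace tokenization and lowercasing are simplifying assumptions (evaluation scripts for QA benchmarks typically also strip punctuation and articles):

```python
from collections import Counter

def exact_match(prediction: str, truth: str) -> float:
    """1.0 iff the normalized prediction equals the normalized gold answer."""
    return float(prediction.strip().lower() == truth.strip().lower())

def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred = prediction.lower().split()
    gold = truth.lower().split()
    # Count tokens shared between prediction and gold (with multiplicity).
    common = sum((Counter(pred) & Counter(gold)).values())
    if common == 0:
        return 0.0
    p = common / len(pred)  # precision over predicted tokens
    r = common / len(gold)  # recall over gold tokens
    return 2 / (1 / p + 1 / r)
```

For example, `f1_score("the cat sat", "the cat")` shares two tokens, giving precision 2/3, recall 1, and F1 = 0.8, while EM would score it 0.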
🎯 Semantic Answer Similarity (SAS)
: Uses a transformer-based, cross-encoder architecture to evaluate the semantic similarity of two answers rather than their lexical overlap.
Semantic Answer Similarity for Evaluating Question Answering Models