👉 For Information Retrieval Pipelines:
🎯 Hit Rate (HR)
: The fraction of queries for which a correct answer appears within the top-k retrieved documents. In short, how often the retriever gets it right within its top few guesses.
🎯 Recall
: The fraction of relevant documents that appear among the retrieved documents, measured over a set of queries. It is affected by the number of documents the retriever returns.
🎯 Precision
: How precise the system is: the fraction of all retrieved documents that are relevant to the query.
👉 For Question Answering Pipelines:
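The three retrieval metrics above (hit rate, recall, precision) can be sketched as follows; this is a minimal illustration with made-up document IDs, not code from any particular retrieval library:

```python
from typing import List, Set

def hit_rate(retrieved: List[List[str]], relevant: List[Set[str]], k: int) -> float:
    """Fraction of queries whose top-k results contain at least one relevant doc."""
    hits = sum(1 for docs, rel in zip(retrieved, relevant)
               if any(d in rel for d in docs[:k]))
    return hits / len(retrieved)

def recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of the relevant documents that were actually retrieved (one query)."""
    return len(set(retrieved) & relevant) / len(relevant)

def precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of the retrieved documents that are relevant (one query)."""
    return len(set(retrieved) & relevant) / len(retrieved)
```

For example, if query 1 retrieves `["d1", "d2"]` against gold set `{"d2"}` and query 2 retrieves `["d3", "d4"]` against `{"d9"}`, the hit rate at k=2 is 0.5.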
🎯 Mean Reciprocal Rank (MRR)
: The average, over all queries, of the reciprocal rank of the first correctly retrieved document, accounting for the fact that a query elicits multiple responses of varying relevance.
<aside> 📘 Effectively a reranking metric: it measures whether the best document the retriever returns appears at the very top of the result list.
$$ MRR = \frac{\sum_{i=1}^Q 1/rank_i}{|Q|} $$
Note: $|Q|$ is the total number of queries, and $rank_i$ is the position of the first correct answer in the results of query $i$.
</aside>
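The formula above can be sketched directly; the input is assumed (illustratively) to be the 1-based rank of the first correct answer per query, with `None` when no correct answer was retrieved:

```python
def mean_reciprocal_rank(first_correct_ranks):
    """MRR over a set of queries.

    `first_correct_ranks` holds, for each query, the 1-based position of the
    first correct answer in that query's results; None means no correct
    answer was retrieved and contributes 0 to the sum.
    """
    total = sum(1.0 / r for r in first_correct_ranks if r is not None)
    return total / len(first_correct_ranks)
```

With ranks `[1, 2, None]` this gives (1 + 1/2 + 0) / 3 = 0.5.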
🎯 Mean Average Precision (mAP)
: The position of every correctly retrieved document. This metric is handy when there is more than one correct document to be retrieved.
<aside>
📘 Also written mAP@K, measuring the positions of all correct answers within the top K results. It can be thought of as a multi-answer ("plural") version of MRR.
</aside>
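A minimal sketch of per-query average precision and its mean over queries (document IDs and data layout are illustrative assumptions, not from the source):

```python
def average_precision(retrieved, relevant):
    """AP for one query: mean of precision@i taken at each position i
    where a correct document appears, normalized by the number of
    relevant documents."""
    hits, precisions = 0, []
    for i, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(all_retrieved, all_relevant):
    """mAP: the mean of the per-query average precisions."""
    aps = [average_precision(r, rel)
           for r, rel in zip(all_retrieved, all_relevant)]
    return sum(aps) / len(aps)
```

For retrieval `["a", "b", "c", "d"]` with gold set `{"a", "c"}`, the correct documents sit at positions 1 and 3, so AP = (1/1 + 2/3) / 2 = 5/6.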
🎯 Normalized Discounted Cumulative Gain (NDCG)
: A ranking performance measure that focuses on the relevant document's position in search results.
<aside> 📘 How NDCG is computed
Gain (G)
: Suppose the retriever returns several results; each one is given an objective relevance score, its gain.
Cumulative Gain (CG)
: The sum of the gains of all returned results.
Discounted Cumulative Gain (DCG)
: Not a plain sum: each result's gain is weighted by a position coefficient before summing, and positions nearer the top get higher weight.
Ideal Discounted Cumulative Gain (IDCG)
: The highest score theoretically attainable, i.e. the DCG when the results are arranged in their best possible order.
Normalized Discounted Cumulative Gain (NDCG)
: $\frac{DCG}{IDCG}$, the normalized score.
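The steps above can be sketched as follows, using the common $1/\log_2(i+1)$ position discount (an assumption; other discount schemes exist):

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(position + 1),
    so results nearer the top contribute more."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains):
    """NDCG = DCG of the actual ordering / DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

A list already in its best order, e.g. gains `[3, 2, 1]`, scores exactly 1.0; the reversed order `[1, 2, 3]` scores strictly less.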
</aside>
🎯 Exact Match (EM)
: Measures the proportion of cases where the predicted answer is identical to the correct answer.
🎯 F1 Score
: More forgiving; it measures the word overlap between the labeled and the predicted answer.
<aside> 📘 Also known as the F-score: https://en.wikipedia.org/wiki/F-score
It is simply the harmonic mean of recall and precision:
$$ F_1 = \frac{2}{recall^{-1} + precision^{-1}} $$
</aside>
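EM and a token-level F1 can be sketched as below; the whitespace tokenization and lowercasing are simplifying assumptions (evaluation scripts for QA benchmarks typically also strip punctuation and articles):

```python
from collections import Counter

def exact_match(prediction: str, truth: str) -> float:
    """1.0 iff the normalized prediction equals the normalized gold answer."""
    return float(prediction.strip().lower() == truth.strip().lower())

def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred = prediction.lower().split()
    gold = truth.lower().split()
    # Count tokens shared between prediction and gold (with multiplicity).
    common = sum((Counter(pred) & Counter(gold)).values())
    if common == 0:
        return 0.0
    p = common / len(pred)  # precision over predicted tokens
    r = common / len(gold)  # recall over gold tokens
    return 2 / (1 / p + 1 / r)
```

For example, `f1_score("the cat sat", "the cat")` shares two tokens, giving precision 2/3, recall 1, and F1 = 0.8, while EM would score it 0.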
🎯 Semantic Answer Similarity (SAS)
: Uses a transformer-based, cross-encoder architecture to evaluate the semantic similarity of two answers rather than their lexical overlap.
Semantic Answer Similarity for Evaluating Question Answering Models