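The datasets and evaluation metrics commonly used for RAG and QA are mostly built around knowledge-intensive, question-answering benchmarks.

Datasets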

1.UltraDomain

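Used by LightRAG.

It is the benchmark proposed by MemoRAG.

UltraDomain covers multiple domains, and each domain consists of multiple books. Taking cs as an example, it contains 100 books and 100 corresponding questions. This domain focuses on computer science and covers key areas of data science and software engineering. It places particular emphasis on machine learning and big-data processing, with content on recommender systems, classification algorithms, and real-time analytics with Spark.

Dataset: TommyChien/UltraDomain (Hugging Face Datasets)

LightRAG uses an LLM to generate question-answer pairs.

The question-generation method comes from From Local to Global: A Graph RAG Approach to Query-Focused Summarization.

Given a description of the text, the LLM is asked to generate K user personas who would use the dataset (for a financial-news dataset, one user might be a financial journalist tracking market trends). For each user it then generates N tasks, and for each user-task pair it proposes M high-level questions, meaning questions that require understanding the whole dataset rather than extracting specific facts.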
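The persona-task-question scheme can be sketched as a small prompt builder. This is only an illustration under stated assumptions: the function names `build_question_prompt` and `total_questions`, the template wording, and the sample description are all made up for this example and are not the actual prompt used by LightRAG or GraphRAG.

```python
def build_question_prompt(description: str, k: int, n: int, m: int) -> str:
    """Build a question-generation prompt (hypothetical wording)."""
    return (
        f"Given the following description of a dataset:\n\n{description}\n\n"
        f"Identify {k} potential users who would engage with this dataset. "
        f"For each user, list {n} tasks they would perform with it. "
        f"Then, for each (user, task) pair, write {m} questions that require "
        "a high-level understanding of the entire dataset rather than "
        "the retrieval of specific facts."
    )


def total_questions(k: int, n: int, m: int) -> int:
    """K users x N tasks per user x M questions per (user, task) pair."""
    return k * n * m


# Hypothetical description; in practice it would summarize the real corpus.
prompt = build_question_prompt("A collection of financial news articles.", k=5, n=5, m=5)
print(prompt)
print(total_questions(5, 5, 5))  # 125
```

The total number of questions grows as K x N x M, so small values of each already produce a sizeable evaluation set.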

2.Datasets used by DAPR

MS MARCO, Natural Questions, MIRACL, Genomics, and ConditionalQA

3.HotpotQA

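It contains a train split (easy, medium, hard) and two test settings (distractor, full wiki).

Distractor: each question comes with 10 candidate paragraphs, of which 2 are relevant to the answer and 8 are irrelevant. These 10 paragraphs bound where the model has to look for the answer, so the search space is relatively small.

Full Wiki: an open-domain QA setting. The model must first retrieve documents from all of Wikipedia, then select paragraphs from those documents, and finally extract the answer from the paragraphs. The search space is the whole of Wikipedia, which is far larger and makes the task more challenging. (In practice the file also provides 10 articles per question, and their paragraphs are not much longer than those in the distractor setting.)

Download

From https://github.com/hotpotqa/hotpot/blob/master/download.sh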

1.hotpot_train_v1.1.json

2.hotpot_dev_distractor_v1.json

{
  "_id": "5a8b57f25542995d1e6f1371",  # question ID
  "answer": "yes",  # answer (short)
  "question": "Were Scott Derrickson and Ed Wood of the same nationality?",  # question
  "supporting_facts": [  # title of each document holding a gold sentence, plus the sentence index
    ["Scott Derrickson", 0],
    ["Ed Wood", 0]
  ],
  "context": [  # candidate documents; each is a title plus a list of sentences
    [
      "Ed Wood (film)",
      [
        "Ed Wood is a 1994 American biographical period comedy-drama film directed and produced by Tim Burton, and starring Johnny Depp as cult filmmaker Ed Wood.",
        " The film concerns the period in Wood's life when he made his best-known films as well as his relationship with actor Bela Lugosi, played by Martin Landau.",
        " Sarah Jessica Parker, Patricia Arquette, Jeffrey Jones, Lisa Marie, and Bill Murray are among the supporting cast."
      ]
    ],
    [
      "Scott Derrickson",
      [
        "Scott Derrickson (born July 16, 1966) is an American director, screenwriter and producer.",
        " He lives in Los Angeles, California.",
        " He is best known for directing horror films such as \"Sinister\", \"The Exorcism of Emily Rose\", and \"Deliver Us From Evil\", as well as the 2016 Marvel Cinematic Universe installment, \"Doctor Strange.\""
      ]
    ]
  ],
  "type": "comparison",  # question type
  "level": "hard"  # difficulty level
}
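As a minimal sketch of how `supporting_facts` indexes into `context`, the snippet below resolves each (title, sentence index) pair to its sentence. The helper name `gold_sentences` is an assumption for this example. The record is cut down from the excerpt above, which shows only two of the candidate documents, so the "Ed Wood" fact has no matching title here and is skipped.

```python
def gold_sentences(record):
    """Resolve each supporting fact to (title, sentence); skip unknown titles."""
    by_title = {title: sentences for title, sentences in record["context"]}
    found = []
    for title, sent_id in record["supporting_facts"]:
        sentences = by_title.get(title)
        if sentences is not None and 0 <= sent_id < len(sentences):
            found.append((title, sentences[sent_id].strip()))
    return found


# Abbreviated copy of the record above (only the fields used here).
record = {
    "supporting_facts": [["Scott Derrickson", 0], ["Ed Wood", 0]],
    "context": [
        [
            "Ed Wood (film)",
            ["Ed Wood is a 1994 American biographical period comedy-drama film directed and produced by Tim Burton, and starring Johnny Depp as cult filmmaker Ed Wood."],
        ],
        [
            "Scott Derrickson",
            [
                "Scott Derrickson (born July 16, 1966) is an American director, screenwriter and producer.",
                " He lives in Los Angeles, California.",
            ],
        ],
    ],
}

for title, sentence in gold_sentences(record):
    print(f"{title}: {sentence}")
```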

3.hotpot_dev_fullwiki_v1.json

4.2WikiMultiHopQA

Paper: Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps (ACL Anthology)

GitHub repo: Alab-NII/2wikimultihop

Data: https://www.dropbox.com/s/npidmtadreo6df2/data.zip

The format is similar to HotpotQA.

    {
        "_id": "str",
        "type": [    // the value is one of these four question types: compositional, inference, bridge comparison, comparison
            "compositional",
            "inference",
            "bridge_comparison",
            "comparison"
        ],
        "question": "str",
        "context": [    // can be pooled across questions to build a corpus
            [
                "str(Title)", // document title
                [
                    "str(Sent)", // sentence text
                    "str(Sent)" // sentence text
                    // ...
                ]
            ]
            // ...
        ],
        "supporting_facts": [    // gold sentences
            ["str(title)", "int(sent_id)"]    // title of the supporting document and the index of the sentence within it
        ],
        "evidences": [
            ["str(subject entity)", "str(relation)", "str(object entity)"]
            // a list of [subject entity, relation, object entity] triples, one per entry in `supporting_facts`
        ],
        "answer": "str"
    }
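Since the `context` fields can be pooled into a retrieval corpus, here is a minimal sketch of one way to do that. Deduplicating by title and joining sentences with a space are assumptions of this example, not something the dataset prescribes; the helper name `build_corpus` and the toy records are made up for illustration.

```python
def build_corpus(records):
    """Map each document title to its full text, keeping the first copy seen."""
    corpus = {}
    for record in records:
        for title, sentences in record["context"]:
            text = " ".join(sentence.strip() for sentence in sentences)
            corpus.setdefault(title, text)
    return corpus


# Toy records (hypothetical content) that follow the schema above.
records = [
    {"context": [["Doc A", ["First sentence.", " Second sentence."]],
                 ["Doc B", ["Only sentence."]]]},
    {"context": [["Doc A", ["First sentence.", " Second sentence."]],
                 ["Doc C", ["Another one."]]]},
]

corpus = build_corpus(records)
print(len(corpus))      # 3
print(corpus["Doc A"])  # First sentence. Second sentence.
```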

5.Qasper

Question answering over NLP papers.

Hugging Face: https://huggingface.co/datasets/allenai/qasper

Dataset download: https://qasper-dataset.s3.us-west-2.amazonaws.com/qasper-train-dev-v0.3.tgz

Evaluation metrics

1.LLM-based evaluation used by LightRAG

Answers are judged along several dimensions, the same ones used by GraphRAG:

• Comprehensiveness. How much detail does the answer provide to cover all aspects and details of the question?

• Diversity. How varied and rich is the answer in providing different perspectives and insights on the question?

• Empowerment. How well does the answer help the reader understand and make informed judgments about the topic?

2.NDCG

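I. What is NDCG?

NDCG stands for Normalized Discounted Cumulative Gain.

In search and recommendation tasks, the system usually returns a list of items. How do we measure whether the returned list is a good one?

For example, when we search for "recommendation ranking", the page returns a list of links related to that query. The list might be [A,B,C,G,D,E,F], or it might be [C,F,A,E,D]. When the system returns lists like these, how do we judge which one is better?

NDCG is a metric for exactly this: evaluating ranked results. It is common in search and recommendation tasks.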

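II. Understanding NDCG step by step

The metric builds up in four steps: G, CG, DCG, NDCG.

Gain: the relevance score of each item in the list. $rel(i)$ is the relevance score of the item at position $i$. $$Gain=rel(i)$$
Cumulative Gain: the sum of the Gain over the top K items. $CG_{k}=\sum_{i=1}^{k}{rel(i)}$ CG simply adds up relevance and ignores position.

Suppose the relevance scores are A=0.5, B=0.9, C=0.3, D=0.6, E=0.1.

If the system returns list_1=[A,B,C,D,E], the CG of list_1 is 0.5+0.9+0.3+0.6+0.1=2.4

If the system returns list_2=[D,A,E,C,B], the CG of list_2 is 0.6+0.5+0.1+0.3+0.9=2.4

So order does not affect CG. To evaluate the effect of different orderings, we need another metric, DCG.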

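Discounted Cumulative Gain: takes ranking order into account, so that items ranked near the top contribute more gain and items ranked lower are discounted.

CG is independent of order, whereas DCG measures its effect. The idea behind DCG is that the order of items in the list matters and each position contributes differently: items near the top generally have more impact and items further down have less (on a results page, for example, the top entries attract more clicks). Compared with CG, DCG therefore strengthens the influence of items near the top and weakens the influence of items further down.

$$DCG_{k}=\sum_{i=1}^{k}{\frac{rel(i)}{\log_{2}(i+1)}}$$

How is this idea implemented? Starting from CG, DCG divides each item's relevance by $\log_{2}(i+1)$. The larger $i$ is, the larger $\log_{2}(i+1)$ becomes, which amounts to discounting each item's relevance: the lower an item is ranked, the larger the discount.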

Using the same example:

list_1=[A,B,C,D,E] is computed as follows:

i	rel(i)	log2(i+1)	rel(i)/log2(i+1)
1 = A	0.5	1	0.5
2 = B	0.9	1.58	0.57
3 = C	0.3	2	0.15
4 = D	0.6	2.32	0.26
5 = E	0.1	2.58	0.04

DCG_1 of list_1 = 0.5+0.57+0.15+0.26+0.04=1.52

list_2=[D,A,E,C,B] is computed as follows:

i	rel(i)	log2(i+1)	rel(i)/log2(i+1)
1 = D	0.6	1	0.6
2 = A	0.5	1.58	0.32
3 = E	0.1	2	0.05
4 = C	0.3	2.32	0.13
5 = B	0.9	2.58	0.35

DCG_2 of list_2 = 0.6+0.32+0.05+0.13+0.35=1.45

DCG_1 > DCG_2, so in this example list_1 is better than list_2.
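The CG and DCG calculations can be checked with a short script. This is a minimal sketch that uses the relevance scores from the tables (A=0.5, B=0.9, C=0.3, D=0.6, E=0.1); the names `REL`, `cg`, and `dcg` are chosen for this example. Because the script sums unrounded terms, its totals can differ from the hand-rounded sums in the last digit.

```python
import math

# Relevance scores read off the worked example.
REL = {"A": 0.5, "B": 0.9, "C": 0.3, "D": 0.6, "E": 0.1}


def cg(ranking, rel=REL):
    """Cumulative Gain: plain sum of relevance, ignoring position."""
    return sum(rel[item] for item in ranking)


def dcg(ranking, rel=REL):
    """Discounted Cumulative Gain: rel(i) / log2(i + 1), positions start at 1."""
    return sum(rel[item] / math.log2(i + 1)
               for i, item in enumerate(ranking, start=1))


list_1 = ["A", "B", "C", "D", "E"]
list_2 = ["D", "A", "E", "C", "B"]

print(f"CG:  {cg(list_1):.2f} vs {cg(list_2):.2f}")    # same value for both orders
print(f"DCG: {dcg(list_1):.4f} vs {dcg(list_2):.4f}")  # list_1 scores higher
```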

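At this point we know that DCG can be used to evaluate different lists, so why is NDCG still needed?

NDCG (Normalized DCG): Normalized Discounted Cumulative Gain

Before NDCG, we need IDCG (ideal DCG). IDCG is obtained by sorting the items by rel(i) in descending order, that is, the best possible ordering. The DCG of that best ordering is the IDCG.

IDCG = DCG of the best ordering

For the example above, the best ordering by descending rel(i) is list_best=[B,D,A,C,E]

i	rel(i)	log2(i+1)	rel(i)/log2(i+1)
1 = B	0.9	1	0.9
2 = D	0.6	1.58	0.38
3 = A	0.5	2	0.25
4 = C	0.3	2.32	0.13
5 = E	0.1	2.58	0.04

IDCG = DCG_best of list_best = 0.9+0.38+0.25+0.13+0.04=1.7 (as expected, IDCG is larger than both DCG_1 and DCG_2)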

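Different queries return different numbers of results, so their raw DCG values cannot be compared directly. NDCG was introduced to solve this.

$$NDCG=\frac{DCG}{IDCG}$$

NDCG is defined as DCG/IDCG, which makes it a relative value, so different queries can be compared and evaluated by their NDCG.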
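A minimal sketch of the normalization step, applied to the two example lists. It assumes the ideal ordering is obtained by sorting the relevance scores of the returned list itself, which matches the worked example; in a real evaluation the ideal list is usually built from all judged items for the query. The function names are chosen for this example.

```python
import math


def dcg(relevances):
    """DCG over relevance scores given in ranked order (position 1 first)."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """NDCG = DCG / IDCG, where IDCG is the DCG of the descending-sorted list."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


list_1 = [0.5, 0.9, 0.3, 0.6, 0.1]     # [A, B, C, D, E]
list_2 = [0.6, 0.5, 0.1, 0.3, 0.9]     # [D, A, E, C, B]
list_best = [0.9, 0.6, 0.5, 0.3, 0.1]  # [B, D, A, C, E]

for name, rels in [("list_1", list_1), ("list_2", list_2), ("list_best", list_best)]:
    print(f"{name}: NDCG = {ndcg(rels):.3f}")
```

The best ordering scores exactly 1, and every other ordering of the same items falls between 0 and 1, which is what makes values comparable across queries.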

3.Precision

Of all the results that were retrieved, the fraction that should have been retrieved.

$$Precision=\frac{\text{correct results}}{\text{returned results}}$$
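The formula translates directly into code. In this minimal sketch the function name `precision`, the document IDs, and the choice to return 0.0 for an empty result list are assumptions made for the example.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved items that are relevant (0.0 if nothing retrieved)."""
    if not retrieved:
        return 0.0
    relevant = set(relevant)
    hits = sum(1 for item in retrieved if item in relevant)
    return hits / len(retrieved)


# Hypothetical IDs: 4 documents returned, 2 of them are in the relevant set.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d7"}
print(precision(retrieved, relevant))  # 0.5
```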
