1. Generate questions and answers with an LLM

2. Use a human-annotated dataset

1.LightRAG

Dataset

Uses the benchmark proposed by MemoRAG.

UltraDomain contains data from multiple domains, and each domain consists of several books. Taking cs as an example: it contains 100 books and 100 corresponding questions. This domain focuses on computer science, covering key areas of data science and software engineering. It places particular emphasis on machine learning and big-data processing, with content on recommender systems, classification algorithms, and real-time analytics with Spark. A sample record:

{
  "input": "How does Spark Streaming enable real-time data processing?",
  "answers": ["Spark Streaming extends ......"],
  "context": "Whole Book......",
  "length": 131651,
  "context_id": "7bcef8714a477fd61fc8fb0d499b2cc3",
  "_id": "b2fd8d9c6d1499d521d778ce3d6d06fa",
  "label": "cs",
  "meta": {"title": "Machine Learning With Spark", "authors": "Nick Pentreath"}
}

Dataset link: TommyChien/UltraDomain · Datasets at Hugging Face

Question generation

The question-generation method comes from *From Local to Global: A Graph RAG Approach to Query-Focused Summarization*.

Given the corpus, an LLM is asked to generate K personas of users who would use the dataset (e.g., for a financial-news dataset, a user might be a finance journalist tracking market trends). For each user it then generates N tasks, and for each user-task pair, M high-level questions that require understanding the dataset as a whole rather than retrieving specific facts.

 User: A tech journalist looking for insights and trends in the tech industry
 Task: Understanding how tech leaders view the role of policy and regulation
 Questions:
 1. Which episodes deal primarily with tech policy and government regulation?
 2. How do guests perceive the impact of privacy laws on technology development?
 3. Do any guests discuss the balance between innovation and ethical considerations?
 4. What are the suggested changes to current policies mentioned by the guests?
 5. Are collaborations between tech companies and governments discussed and how?
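The persona → task → question pipeline can be sketched as a prompt builder. The function name, prompt wording, and the way K/N/M are passed are illustrative assumptions, not the paper's actual prompt:

```python
def question_gen_prompt(corpus_description: str, k: int, n: int, m: int) -> str:
    """Illustrative prompt asking an LLM to derive K user personas,
    N tasks per persona, and M high-level questions per (persona, task)
    pair from a description of the dataset."""
    return (
        f"Given a dataset described as: {corpus_description}\n"
        f"1. Identify {k} potential users of this dataset.\n"
        f"2. For each user, identify {n} tasks they would perform with it.\n"
        f"3. For each (user, task) pair, generate {m} high-level questions "
        "that require understanding the entire corpus, not specific facts."
    )

prompt = question_gen_prompt("a collection of tech-podcast transcripts", k=5, n=5, m=5)
```

The single prompt here is a simplification; the paper's pipeline issues separate LLM calls per stage, but the persona → task → question structure is the same.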

Evaluation criteria

No gold-standard answers are used; instead an LLM judges the answers on three criteria:

• Comprehensiveness. How much detail does the answer provide to cover all aspects and details of the question?

• Diversity. How varied and rich is the answer in providing different perspectives and insights on the question?

• Empowerment. How well does the answer help the reader understand and make informed judgments about the topic?
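A minimal LLM-as-judge prompt covering the three criteria might look like the following; the wording and function name are illustrative sketches, not LightRAG's actual evaluation prompt:

```python
# Hypothetical pairwise-comparison prompt for LLM-based evaluation.
CRITERIA = {
    "Comprehensiveness": "How much detail does the answer provide to cover all aspects of the question?",
    "Diversity": "How varied and rich is the answer in providing different perspectives?",
    "Empowerment": "How well does the answer help the reader make informed judgments?",
}

def judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Build a prompt asking a judge LLM to pick the better of two
    answers on each criterion."""
    lines = [
        f"Question: {question}",
        f"Answer A: {answer_a}",
        f"Answer B: {answer_b}",
        "For each criterion below, state which answer is better and why:",
    ]
    lines += [f"- {name}: {desc}" for name, desc in CRITERIA.items()]
    return "\n".join(lines)

p = judge_prompt("How does Spark Streaming work?", "Answer text A", "Answer text B")
```

Pairwise comparison (rather than absolute scoring) is how the Graph RAG paper applies these criteria; the judge's verdicts are then aggregated into win rates.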


2.DAPR

Datasets

MS MARCO, Natural Questions, MIRACL, Genomics, and ConditionalQA


3.HotpotQA

Contains train (easy, medium, hard) and test (distractor, full wiki) settings.

Distractor: each question comes with 10 candidate passages, of which 2 paragraphs are relevant to the answer and 8 are irrelevant. These 10 passages bound the search space, which is therefore relatively small.

Full wiki: an open-domain QA setting. The model must retrieve documents from all of Wikipedia, then extract paragraphs from those documents, and finally extract the answer from the paragraphs. Since the search space is the whole of Wikipedia, it is far larger, which makes the task more challenging. (In practice the released file also provides 10 passages per question, and their paragraphs are not much longer than in the distractor setting.)

1.hotpot_train_v1.1.json

Total questions: 90447

Question type distribution:
  - comparison: 17456 (19.3%)
  - bridge: 72991 (80.7%)
  
Difficulty distribution:
  - medium: 56814 (62.8%)
  - hard: 15661 (17.3%)
  - easy: 17972 (19.9%)
  
Question length distribution:
  - 0-10: 0 (0.0%)
  - 11-20: 8 (0.0%)
  - 21-30: 96 (0.1%)
  - 31-40: 664 (0.7%)
  - 41-50: 3352 (3.7%)
  - 51-70: 18602 (20.6%)
  - 71-100: 31161 (34.5%)
  - 100+: 36564 (40.4%) (lengths around 100 are the most common)
  
supporting_facts statistics:
  - almost always 2, then 3 or 4; other counts are rare
  
context statistics:
  - nearly every question has 9-10 context documents

2.hotpot_dev_distractor_v1.json

{
  "_id": "5a8b57f25542995d1e6f1371",  # question ID
  "answer": "yes",  # answer (short)
  "question": "Were Scott Derrickson and Ed Wood of the same nationality?",  # question
  "supporting_facts": [  # titles of the gold documents and the indices of the supporting sentences
    ["Scott Derrickson", 0],
    ["Ed Wood", 0]
  ],
  "context": [  # related documents; each entry is a title plus a list of sentences
    [
      "Ed Wood (film)",
      [
        "Ed Wood is a 1994 American biographical period comedy-drama film directed and produced by Tim Burton, and starring Johnny Depp as cult filmmaker Ed Wood.",
        " The film concerns the period in Wood's life when he made his best-known films as well as his relationship with actor Bela Lugosi, played by Martin Landau.",
        " Sarah Jessica Parker, Patricia Arquette, Jeffrey Jones, Lisa Marie, and Bill Murray are among the supporting cast."
      ]
    ],
    [
      "Scott Derrickson",
      [
        "Scott Derrickson (born July 16, 1966) is an American director, screenwriter and producer.",
        " He lives in Los Angeles, California.",
        " He is best known for directing horror films such as \"Sinister\", \"The Exorcism of Emily Rose\", and \"Deliver Us From Evil\", as well as the 2016 Marvel Cinematic Universe installment, \"Doctor Strange.\""
      ]
    ]
  ],
  "type": "comparison",  # question type
  "level": "hard"  # difficulty level
}
Total questions: 7405

Question type distribution:
  - comparison: 1487 (20.1%)
  - bridge: 5918 (79.9%)

Difficulty distribution:
  - hard: 7405 (100.0%)
  
supporting_facts statistics:
  - almost always 2, then 3 or 4; other counts are rare
  - average supporting facts per question: 2.43
  - maximum supporting facts: 8
  - minimum supporting facts: 2

context statistics:
  - 10 passages per question: 2 relevant, 8 irrelevant
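Statistics like the ones above can be reproduced with a short script. A sketch, using the record format shown earlier (the `hotpot_stats` helper and the inline sample record are illustrative; in practice you would `json.load` the released file):

```python
import json
from collections import Counter

def hotpot_stats(records):
    """Count question types, difficulty levels, and supporting-fact
    counts over a list of HotpotQA records."""
    types = Counter(r["type"] for r in records)
    levels = Counter(r["level"] for r in records)
    sf_counts = Counter(len(r["supporting_facts"]) for r in records)
    return types, levels, sf_counts

# In practice: records = json.load(open("hotpot_dev_distractor_v1.json"))
records = [
    {"type": "comparison", "level": "hard",
     "supporting_facts": [["Scott Derrickson", 0], ["Ed Wood", 0]]},
]
types, levels, sf = hotpot_stats(records)
```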

3.hotpot_dev_fullwiki_v1.json

Question type distribution:
  - comparison: 1487 (20.1%)
  - bridge: 5918 (79.9%)

Difficulty distribution:
  - hard: 7405 (100.0%)

Question length distribution:
  - 0-10: 0 (0.0%)
  - 11-20: 0 (0.0%)
  - 21-30: 0 (0.0%)
  - 31-40: 49 (0.7%)
  - 41-50: 282 (3.8%)
  - 51-70: 1612 (21.8%)
  - 71-100: 2919 (39.4%)
  - 100+: 2543 (34.3%)

supporting_facts count distribution:
  - questions with 2 supporting_facts: 4990 (67.4%)
  - questions with 3 supporting_facts: 1774 (24.0%)
  - questions with 4 supporting_facts: 537 (7.3%)
  - questions with 5 supporting_facts: 80 (1.1%)
  - questions with 6 supporting_facts: 14 (0.2%)
  - questions with 7 supporting_facts: 9 (0.1%)
  - questions with 8 supporting_facts: 1 (0.0%)

context count distribution:
  - 10 passages per question; this is an open-domain QA setting (no manually constructed negative samples?)
 

4.NDCG

I. What is NDCG?

NDCG stands for Normalized Discounted Cumulative Gain.

In search and recommendation tasks, the system returns a list of items. How do we measure whether that returned list is good?

For example, when we search for "recommendation ranking", the page returns a list of relevant links. The list might be [A,B,C,G,D,E,F], or it might be [C,F,A,E,D]. When the system returns such lists, how do we decide which one is better?

**NDCG is a metric for evaluating ranked results**, and it is common in search and recommendation tasks.

II. Understanding NDCG step by step

Gain → CG → DCG → NDCG

  1. Gain: the relevance score of a single item in the list. $rel(i)$ denotes the relevance score of item $i$: $$Gain=rel(i)$$
  2. Cumulative Gain: the accumulated Gain of the top K items: $CG_{k}=\sum_{i=1}^{k}{rel(i)}$. CG simply sums relevance scores and ignores position.

Suppose the relevance scores are rel(A)=0.5, rel(B)=0.9, rel(C)=0.3, rel(D)=0.6, rel(E)=0.1. If the system returns list_1=[A,B,C,D,E], the CG of list_1 is 0.5+0.9+0.3+0.6+0.1 = 2.4.

If it returns list_2=[D,A,E,C,B], the CG of list_2 is 0.6+0.5+0.1+0.3+0.9 = 2.4.

So the order does not affect the CG score. To evaluate the effect of ordering, we need another metric: DCG.

  3. Discounted Cumulative Gain: takes the ranking order into account, so that higher-ranked items gain more and lower-ranked items are discounted.

CG is order-independent, while DCG measures the effect of order. The idea behind DCG: the position of an item in the list matters, and different positions contribute differently. In general, items near the top have more impact and items further down have less (on a results page, the top items get far more clicks). So, compared with CG, DCG amplifies the contribution of early items and attenuates that of later ones.

$$DCG_{k}=\sum_{i=1}^{k}{\frac{rel(i)}{\log_{2}(i+1)}}$$

How is this idea implemented? On top of CG, DCG divides each item's relevance by log2(i+1). The larger i is, the larger log2(i+1) becomes, so each item's relevance is discounted, and the further down the list an item sits, the heavier the discount.

Continuing the example above:

list_1=[A,B,C,D,E] is computed as follows:

| i | item | rel(i) | log2(i+1) | rel(i)/log2(i+1) |
|---|------|--------|-----------|------------------|
| 1 | A    | 0.5    | 1         | 0.5              |
| 2 | B    | 0.9    | 1.59      | 0.57             |
| 3 | C    | 0.3    | 2         | 0.15             |
| 4 | D    | 0.6    | 2.32      | 0.26             |
| 5 | E    | 0.1    | 2.59      | 0.04             |

DCG_1 of list_1 = 0.5+0.57+0.15+0.26+0.04 = 1.52

list_2=[D,A,E,C,B] is computed as follows:

| i | item | rel(i) | log2(i+1) | rel(i)/log2(i+1) |
|---|------|--------|-----------|------------------|
| 1 | D    | 0.6    | 1         | 0.6              |
| 2 | A    | 0.5    | 1.59      | 0.31             |
| 3 | E    | 0.1    | 2         | 0.05             |
| 4 | C    | 0.3    | 2.32      | 0.13             |
| 5 | B    | 0.9    | 2.59      | 0.35             |

DCG_2 of list_2 = 0.6+0.31+0.05+0.13+0.35 = 1.44

DCG_1 > DCG_2, so in this example list_1 is better than list_2.
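The two DCG values can be verified with a few lines of Python, using the relevance scores assumed in the example:

```python
import math

def dcg(rels):
    """DCG_k = sum_i rel(i) / log2(i + 1), with 1-based positions."""
    return sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))

rel = {"A": 0.5, "B": 0.9, "C": 0.3, "D": 0.6, "E": 0.1}
dcg_1 = dcg([rel[x] for x in "ABCDE"])  # ~1.51 (1.52 above comes from rounding each term first)
dcg_2 = dcg([rel[x] for x in "DAECB"])  # ~1.44
```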

At this point we can see that DCG alone lets us compare lists, so why is there still NDCG?

  4. NDCG (Normalized DCG): normalized discounted cumulative gain

Before NDCG, first meet IDCG (ideal DCG): the DCG of the ideal ranking, obtained by sorting the items by rel(i) in descending order, i.e., the best possible ordering. The DCG of that best ordering is the IDCG.

IDCG = the DCG of the best possible ordering

For the example above, sorting by rel(i) in descending order gives the best ordering list_best=[B,D,A,C,E]:

| i | item | rel(i) | log2(i+1) | rel(i)/log2(i+1) |
|---|------|--------|-----------|------------------|
| 1 | B    | 0.9    | 1         | 0.9              |
| 2 | D    | 0.6    | 1.59      | 0.38             |
| 3 | A    | 0.5    | 2         | 0.25             |
| 4 | C    | 0.3    | 2.32      | 0.13             |
| 5 | E    | 0.1    | 2.59      | 0.04             |

IDCG = DCG_best of list_best = 0.9+0.38+0.25+0.13+0.04 = 1.7 (as expected, IDCG > DCG_1 and IDCG > DCG_2)

Because different queries return different numbers of results, their raw DCG values cannot be compared directly. Hence NDCG.

$$NDCG=\frac{DCG}{IDCG}$$

So NDCG is expressed as DCG/IDCG. This makes NDCG a relative value, so different queries can be compared through their NDCG scores.
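Putting the pieces together, a minimal NDCG implementation using the same log2 discount as above:

```python
import math

def ndcg(rels):
    """NDCG = DCG / IDCG, where IDCG is the DCG of the same scores
    sorted in descending order (the ideal ranking)."""
    def dcg(rs):
        return sum(r / math.log2(i + 1) for i, r in enumerate(rs, start=1))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

score_1 = ndcg([0.5, 0.9, 0.3, 0.6, 0.1])  # list_1 = [A,B,C,D,E]
score_2 = ndcg([0.6, 0.5, 0.1, 0.3, 0.9])  # list_2 = [D,A,E,C,B]
```

A perfectly ordered list scores exactly 1.0, and since both numerator and denominator use the same scores, NDCG values are comparable across queries.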


5.Precision

Of all the retrieved results, how many should actually have been retrieved (i.e., are relevant)?

$$Precision=\frac{\text{relevant retrieved results}}{\text{retrieved results}}$$
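A minimal sketch, treating the retrieved and relevant results as sets of IDs (the document IDs here are illustrative):

```python
def precision(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

# 4 results returned, 2 of them relevant -> precision 0.5
p = precision(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
```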