Dataset:
Salesforce/rose
Language:
en

This repo contains the RoSE benchmark of our paper "Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation".
Please visit here for a demo page of this project.
The RoSE benchmark contains system outputs annotated with our ACU protocol. It consists of four parts, with statistics summarized below.
| Dataset | Split | #Doc. | #Sys. | #Total Summ. | HF Name |
|---|---|---|---|---|---|
| CNNDM | Test | 500 | 12 | 6000 | cnndm_test |
| CNNDM | Validation | 1000 | 8 | 8000 | cnndm_validation |
| XSum | Test | 500 | 8 | 4000 | xsum |
| SamSum | Test | 500 | 8 | 4000 | samsum |
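The splits above can be loaded with the Hugging Face `datasets` library using the names in the HF Name column. This is a minimal sketch, assuming those HF names are the dataset's config names and that the `datasets` package is installed; it also double-checks that each table total is #Doc. × #Sys.

```python
# Statistics from the table above: config -> (#Doc., #Sys., #Total Summ.)
STATS = {
    "cnndm_test": (500, 12, 6000),
    "cnndm_validation": (1000, 8, 8000),
    "xsum": (500, 8, 4000),
    "samsum": (500, 8, 4000),
}

# Sanity check: every system summarizes every document,
# so the total summary count is #Doc. x #Sys.
for name, (n_doc, n_sys, n_total) in STATS.items():
    assert n_doc * n_sys == n_total, name

CONFIGS = list(STATS)

def load_rose_split(config: str):
    """Load one RoSE config from the Hugging Face Hub (needs network)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    from datasets import load_dataset  # pip install datasets
    return load_dataset("Salesforce/rose", config)
```

For example, `load_rose_split("cnndm_test")` fetches the 6000 annotated summaries of the CNNDM test part.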
In total, we have system outputs annotated with four different human evaluation protocols, summarized below.
| Protocol | w/ Input Document | w/ Reference Summary | Fine-grained |
|---|---|---|---|
| Prior | ✗ | ✗ | ✗ |
| Ref-free | ✓ | ✗ | ✗ |
| Ref-based | ✗ | ✓ | ✗ |
| ACU | ✗ | ✓ | ✓ |
We annotated two sets of system summaries.