Dataset: Salesforce/rose
This repo contains the RoSE benchmark of our paper "Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation".
Please visit here for a demo page of this project.
The RoSE benchmark contains system outputs annotated with our ACU protocol. It consists of four parts, whose statistics are summarized below.
| Dataset | Split | #Doc. | #Sys. | #Total Summ. | HF Name |
|---|---|---|---|---|---|
| CNNDM | Test | 500 | 12 | 6000 | cnndm_test |
| CNNDM | Validation | 1000 | 8 | 8000 | cnndm_validation |
| XSum | Test | 500 | 8 | 4000 | xsum |
| SamSum | Test | 500 | 8 | 4000 | samsum |
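As a sanity check on the table above, the total number of summaries in each part is simply #Doc. × #Sys. A minimal Python sketch (the `ROSE_SUBSETS` dict and `total_summaries` helper are illustrative names, not part of the dataset's API):

```python
# Subset statistics from the table above, keyed by the "HF Name" column:
# (number of documents, number of systems) per subset.
ROSE_SUBSETS = {
    "cnndm_test": (500, 12),
    "cnndm_validation": (1000, 8),
    "xsum": (500, 8),
    "samsum": (500, 8),
}

def total_summaries(name: str) -> int:
    """Total annotated summaries for a subset = #documents * #systems."""
    docs, systems = ROSE_SUBSETS[name]
    return docs * systems
```

Assuming the standard Hugging Face configuration layout, each part can then be loaded by its HF name, e.g. `load_dataset("Salesforce/rose", "cnndm_test")` from the `datasets` library (requires network access).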
In total, we have system outputs annotated with four different human evaluation protocols, summarized below.
| Protocol | w/ Input Document | w/ Reference Summary | Fine-grained |
|---|---|---|---|
| Prior | ✗ | ✗ | ✗ |
| Ref-free | ✓ | ✗ | ✗ |
| Ref-based | ✗ | ✓ | ✗ |
| ACU | ✗ | ✓ | ✓ |
We annotated two sets of system summaries.