Dataset: Hellisotherpeople/DebateSum
Languages: en
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: crowdsourced
Annotation creators: expert-generated
Source datasets: original
Preprint: arxiv:2011.07251
License: mit

Corresponding code repo for the ArgMining 2020 paper: "DebateSum: A large-scale argument mining and summarization dataset"
arXiv pre-print available here: https://arxiv.org/abs/2011.07251
Check out the presentation date and time here: https://argmining2020.i3s.unice.fr/node/9
The full paper, as published in the ACL Anthology, is here: https://www.aclweb.org/anthology/2020.argmining-1.1/
Video of presentation at COLING 2020: https://underline.io/lecture/6461-debatesum-a-large-scale-argument-mining-and-summarization-dataset
The dataset is distributed as CSV files.
A search engine over DebateSum (as well as some additional evidence not included in DebateSum) is available at debate.cards. It is of very good quality and allows the evidence to be viewed in the format that debaters use.
DebateSum consists of 187,328 debate documents, arguments (which can also be thought of as abstractive summaries, or queries), word-level extractive summaries, citations, and associated metadata, organized by topic-year. This data is ready for analysis by NLP systems.
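For a quick look at the data, the CSV files can be read with pandas. The snippet below is only a minimal sketch: the file name and the field names mentioned in the comments are illustrative assumptions rather than the exact schema, so check the headers of the released files.

```python
import pandas as pd

# Minimal sketch: load one topic-year CSV from DebateSum.
# The file name below is an illustrative assumption.
df = pd.read_csv("debate2019.csv")

print(df.shape)             # number of rows (pieces of evidence) and columns
print(df.columns.tolist())  # inspect the actual column names

# Hypothetical field names for the full document, its argument
# (abstractive summary / query), and its word-level extractive summary:
# example = df.iloc[0]
# print(example["Full-Document"][:500])
# print(example["Abstract"])
# print(example["Extract"])
```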
All data is accessible in a parsed format, organized by topic-year, here.
Additionally, the trained word vectors for debate2vec are also found in that folder.
This is useful because the debaters who produce the evidence release their work every year; I will soon update the dataset to include the 2020-2021 topic.
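To experiment with the debate2vec word vectors, something like the following works with gensim. This is a sketch under the assumption that the vectors are released in the standard word2vec/fastText text format (a .vec file); the file name and the query words are illustrative.

```python
from gensim.models import KeyedVectors

# Sketch: load debate2vec, assuming a word2vec/fastText text-format .vec file.
# The file name below is an illustrative assumption.
vectors = KeyedVectors.load_word2vec_format("debate2vec.vec", binary=False)

# Example lookups: nearest neighbours of a term and the shape of a raw vector.
print(vectors.most_similar("hegemony", topn=5))
print(vectors["topicality"].shape)
```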
Step 1: Download all open evidence files from Open Evidence and unzip them into a directory. The links are as follows:
Step 2: Convert all evidence from docx files to html5 files using pandoc with this command:
for f in *.docx; do pandoc "$f" -s -o "${f%.docx}.html5"; done
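If you would rather drive the conversion from Python instead of a shell loop, the same pandoc invocation can be scripted as below. This is a sketch, not part of the original pipeline, and it assumes pandoc is installed and on your PATH.

```python
import pathlib
import subprocess

# Sketch: convert every .docx in the current directory to a standalone
# HTML5 file, mirroring the shell loop above. Assumes pandoc is on the PATH.
for docx in pathlib.Path(".").glob("*.docx"):
    html5 = docx.with_suffix(".html5")
    subprocess.run(["pandoc", str(docx), "-s", "-o", str(html5)], check=True)
```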
Step 3: Install the dependencies for make_debate_dataset.py:
pip install -r requirements.txt
Step 4: Modify the folder and file locations as needed for your system, and run make_debate_dataset.py:
python3 make_debate_dataset.py
Huge thanks to Arvind Balaji for making debate.cards and being second author on this paper!