Dataset: fewshot-goes-multilingual/cs_squad-3.0
Tasks: question answering
Sub-tasks: extractive-qa
Languages: cs
Multilinguality: monolingual
Size: 1K<n<10K
Language creators: crowdsourced
Annotation creators: crowdsourced
Source datasets: original
License: lgpl-3.0

This is a processed and filtered adaptation of an existing dataset. For the raw and larger dataset, see the Dataset Source section.
The data contains questions and answers based on Czech Wikipedia articles. Each question has one or more answers and a selected part of the context as evidence. A majority of the answers are extractive, i.e. they are present in the context in the exact form. The remaining cases are
All questions in the dataset are answerable from the context. A small minority of questions have multiple answers. Sometimes this means that any one of them is correct (e.g. both "Pacifik" and "Tichý oceán" are correct terms for the Pacific Ocean), and sometimes it means that all of them together form the correct answer (e.g. Who was Leonardo da Vinci? ["painter", "engineer"]).
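The two readings of a multi-answer question, and the notion of an extractive answer, can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the case-insensitive comparison is an assumption, and the dataset itself does not say which reading applies to a given question.

```python
from typing import Sequence


def is_extractive(answer: str, context: str) -> bool:
    """True if the answer occurs in the context in its exact form."""
    return answer in context


def _norm(text: str) -> str:
    # Assumption: compare answers ignoring case and surrounding whitespace.
    return text.strip().casefold()


def matches_any(prediction: str, answers: Sequence[str]) -> bool:
    """'Any of them is correct' reading: one gold answer is enough."""
    return any(_norm(prediction) == _norm(a) for a in answers)


def matches_all(predictions: Sequence[str], answers: Sequence[str]) -> bool:
    """'All of them together' reading: every gold answer must be given."""
    predicted = {_norm(p) for p in predictions}
    return all(_norm(a) in predicted for a in answers)
```

For example, `matches_any("Pacifik", ["Pacifik", "Tichý oceán"])` accepts either term, while `matches_all(["painter"], ["painter", "engineer"])` rejects an incomplete answer.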
Total number of examples is around:
Each example contains:
The dataset is a preprocessed adaptation of the existing SQAD 3.0 dataset (its URL is given in the citation below). This adaptation contains (almost) the same data, converted to a more convenient format. The data was also filtered to remove a statistical bias: in around 50% of the original dataset the answer was contained in the first sentence of the article, likely as a result of the data collection process.
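The first-sentence filter described above could look roughly like the following. This is a sketch under stated assumptions, not the code used to build this dataset: the field names `article` and `answer` are hypothetical, and the sentence splitter is deliberately naive (it would mis-split Czech abbreviations that end in a period).

```python
import re
from typing import Dict, List


def first_sentence(text: str) -> str:
    """Naive split: text up to the first '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)
    return parts[0]


def answer_in_first_sentence(example: Dict[str, str]) -> bool:
    # Hypothetical field names; the real schema may differ.
    return example["answer"] in first_sentence(example["article"])


def drop_first_sentence_examples(examples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only examples whose answer is not found in the article's first sentence."""
    return [ex for ex in examples if not answer_in_first_sentence(ex)]


examples = [
    {"article": "Praha je hlavní město Česka. Leží na řece Vltavě.", "answer": "Česka"},
    {"article": "Praha je hlavní město Česka. Leží na řece Vltavě.", "answer": "Vltavě"},
]
kept = drop_first_sentence_examples(examples)
```

Here only the second example survives, because its answer appears after the first sentence.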
Cite the authors of the original dataset:
@misc{11234/1-3069,
  title = {sqad 3.0},
  author = {Medve{\v d}, Marek and Hor{\'a}k, Ale{\v s}},
  url = {http://hdl.handle.net/11234/1-3069},
  note = {{LINDAT}/{CLARIAH}-{CZ} digital library at the Institute of Formal and Applied Linguistics ({{\'U}FAL}), Faculty of Mathematics and Physics, Charles University},
  copyright = {{GNU} Library or "Lesser" General Public License 3.0 ({LGPL}-3.0)},
  year = {2019}
}