Dataset:

spider

Tasks:

Text-to-text generation

Languages:

en

Multilinguality:

monolingual

Size:

1K<n<10K

Annotations creators:

expert-generated

Source datasets:

original

License:

cc-by-4.0

Dataset Card for Spider

Dataset Summary

Spider is a large-scale, complex, cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases.

Supported Tasks and Leaderboards

The leaderboard can be seen at https://yale-lily.github.io/spider

Languages

The text in the dataset is in English.

Dataset Structure

Data Instances

What do the instances that comprise the dataset represent?

Each instance is a natural language question paired with the equivalent SQL query.
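To make the pairing concrete, the sketch below runs the SQL half of one pair against a toy in-memory SQLite database. The table, its rows, and the question/query pair are hand-written assumptions for illustration; they are not records copied from the dataset.

```python
import sqlite3

# Illustrative pair (assumed for this sketch, not copied from the dataset).
example = {
    "question": "How many heads of the departments are older than 56?",
    "query": "SELECT count(*) FROM head WHERE age > 56",
}


def run_query(sql: str) -> list:
    """Execute `sql` against a toy database and return all result rows."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema and rows, just enough to answer the question.
        conn.execute(
            "CREATE TABLE head (head_id INTEGER PRIMARY KEY, name TEXT, age REAL)"
        )
        conn.executemany(
            "INSERT INTO head (name, age) VALUES (?, ?)",
            [("Ada", 67.0), ("Ben", 52.0), ("Cy", 69.0)],
        )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


rows = run_query(example["query"])
print(rows)  # two of the three toy rows have age > 56
```

A model trained on such pairs is expected to produce the `query` string from the `question` and the database schema; executing the query is one way to check that a prediction returns the intended rows.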

How many instances are there in total?

The train and dev splits listed in this card together contain 8,034 question and SQL query pairs.

What data does each instance consist of?

[More Information Needed]

Data Fields

  • db_id : Name of the database the question is asked against
  • question : Natural language question to be translated into SQL
  • query : Target SQL query
  • query_toks : List of tokens for the query
  • query_toks_no_value : List of tokens for the query, with literal values replaced by a placeholder
  • question_toks : List of tokens for the question
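The fields above can be sketched as a small typed record. This is a minimal illustration, assuming each instance arrives as a plain dictionary; the class name `SpiderExample`, the helper `from_dict`, and the sample record are hypothetical and hand-written, not taken from the dataset.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class SpiderExample:
    """One question/SQL pair, using the field names listed in this card."""

    db_id: str
    question: str
    query: str
    query_toks: List[str]
    query_toks_no_value: List[str]
    question_toks: List[str]

    @classmethod
    def from_dict(cls, record: dict) -> "SpiderExample":
        """Build an example from a dict, failing loudly on missing fields."""
        names = [f.name for f in fields(cls)]
        missing = [n for n in names if n not in record]
        if missing:
            raise KeyError(f"record is missing fields: {missing}")
        return cls(**{n: record[n] for n in names})


# Hand-written illustrative record; the token lists are assumptions about
# what tokenized output might look like, not values copied from the dataset.
record = {
    "db_id": "department_management",
    "question": "How many heads of the departments are older than 56?",
    "query": "SELECT count(*) FROM head WHERE age > 56",
    "query_toks": ["SELECT", "count", "(", "*", ")", "FROM", "head",
                   "WHERE", "age", ">", "56"],
    "query_toks_no_value": ["select", "count", "(", "*", ")", "from", "head",
                            "where", "age", ">", "value"],
    "question_toks": ["How", "many", "heads", "of", "the", "departments",
                      "are", "older", "than", "56", "?"],
}

example = SpiderExample.from_dict(record)
print(example.db_id, len(example.query_toks))
```

Wrapping records this way is optional; it mainly helps catch a missing or misspelled field early when writing preprocessing code.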

Data Splits

  • train : 7,000 question and SQL query pairs
  • dev : 1,034 question and SQL query pairs
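A quick sanity check after loading is to compare each split's length against the sizes stated here. The sketch below assumes the splits are available as a mapping from split name to a sized collection; the helper `check_split_sizes` is hypothetical, and the name `validation` for the dev split is an assumption about how a loader might label it.

```python
from typing import Mapping, Sized

# Split sizes as stated in this card. The key "validation" for the dev set
# is an assumption; adjust it to whatever name your loader uses.
EXPECTED_SIZES = {"train": 7000, "validation": 1034}


def check_split_sizes(splits: Mapping[str, Sized]) -> dict:
    """Return {split: (actual, expected)} for every split whose size differs."""
    mismatches = {}
    for name, expected in EXPECTED_SIZES.items():
        # A missing split is reported as having zero examples.
        actual = len(splits[name]) if name in splits else 0
        if actual != expected:
            mismatches[name] = (actual, expected)
    return mismatches


# Stand-in data so the sketch runs offline; replace with the real splits.
fake_splits = {"train": range(7000), "validation": range(1034)}
print(check_split_sizes(fake_splits))   # empty dict when all sizes match
print(sum(EXPECTED_SIZES.values()))     # total pairs across both splits
```

With the Hugging Face `datasets` library, the object returned by `load_dataset` is a mapping of this kind and could be passed in directly, provided the dataset identifier and split names are checked against the hub first.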

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

Who are the source language producers?

[More Information Needed]

Annotations

The dataset was annotated by 11 college students at Yale University.

Annotation process

Who are the annotators?

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

Discussion of Biases

[More Information Needed]

Other Known Limitations

Additional Information

The authors listed on the homepage maintain and support the dataset.

Dataset Curators

[More Information Needed]

Licensing Information

The Spider dataset is licensed under CC BY-SA 4.0.

[More Information Needed]

Citation Information

@article{yu2018spider,
  title={Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task},
  author={Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and Yao, Qingning and Roman, Shanelle and others},
  journal={arXiv preprint arXiv:1809.08887},
  year={2018}
}

Contributions

Thanks to @olinguyen for adding this dataset.