Dataset Card for "cfq"

Dataset Summary

The Compositional Freebase Questions (CFQ) dataset is designed specifically to measure compositional generalization. It is a simple yet realistic, large dataset of natural language questions and answers that also provides, for each question, a corresponding SPARQL query against the Freebase knowledge base. CFQ can therefore also be used for semantic parsing.

Supported Tasks and Leaderboards

More Information Needed

Languages

English (en).

Dataset Structure

Data Instances

mcd1
  • Size of downloaded dataset files: 267.60 MB
  • Size of the generated dataset: 42.90 MB
  • Total amount of disk used: 310.49 MB

An example of 'train' looks as follows.

{
  'query': 'SELECT count(*) WHERE {\n?x0 a ns:people.person .\n?x0 ns:influence.influence_node.influenced M1 .\n?x0 ns:influence.influence_node.influenced M2 .\n?x0 ns:people.person.spouse_s/ns:people.marriage.spouse|ns:fictional_universe.fictional_character.married_to/ns:fictional_universe.marriage_of_fictional_characters.spouses ?x1 .\n?x1 a ns:film.cinematographer .\nFILTER ( ?x0 != ?x1 )\n}',
  'question': 'Did a person marry a cinematographer , influence M1 , and influence M2'
}

mcd2
  • Size of downloaded dataset files: 267.60 MB
  • Size of the generated dataset: 44.77 MB
  • Total amount of disk used: 312.38 MB

An example of 'train' looks as follows.

{
  'query': 'SELECT count(*) WHERE {\n?x0 ns:people.person.parents|ns:fictional_universe.fictional_character.parents|ns:organization.organization.parent/ns:organization.organization_relationship.parent ?x1 .\n?x1 a ns:people.person .\nM1 ns:business.employer.employees/ns:business.employment_tenure.person ?x0 .\nM1 ns:business.employer.employees/ns:business.employment_tenure.person M2 .\nM1 ns:business.employer.employees/ns:business.employment_tenure.person M3 .\nM1 ns:business.employer.employees/ns:business.employment_tenure.person M4 .\nM5 ns:business.employer.employees/ns:business.employment_tenure.person ?x0 .\nM5 ns:business.employer.employees/ns:business.employment_tenure.person M2 .\nM5 ns:business.employer.employees/ns:business.employment_tenure.person M3 .\nM5 ns:business.employer.employees/ns:business.employment_tenure.person M4\n}',
  'question': "Did M1 and M5 employ M2 , M3 , and M4 and employ a person 's child"
}

mcd3
  • Size of downloaded dataset files: 267.60 MB
  • Size of the generated dataset: 43.60 MB
  • Total amount of disk used: 311.20 MB

An example of 'train' looks as follows.

{
    "query": "SELECT /producer M0 . /director M0 . ",
    "question": "Who produced and directed M0?"
}

query_complexity_split
  • Size of downloaded dataset files: 267.60 MB
  • Size of the generated dataset: 45.95 MB
  • Total amount of disk used: 313.55 MB

An example of 'train' looks as follows.

{
    "query": "SELECT /producer M0 . /director M0 . ",
    "question": "Who produced and directed M0?"
}

query_pattern_split
  • Size of downloaded dataset files: 267.60 MB
  • Size of the generated dataset: 46.12 MB
  • Total amount of disk used: 313.72 MB

An example of 'train' looks as follows.

{
    "query": "SELECT /producer M0 . /director M0 . ",
    "question": "Who produced and directed M0?"
}
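
The questions and queries in the examples above refer to entities through placeholders such as M0 or M1. As a minimal sketch, the placeholders mentioned in a string could be collected as follows. The helper name `entity_placeholders` and the regular expression are illustrative assumptions based only on the examples shown here; they are not part of the dataset or of any library.

```python
import re
from typing import List

# Assumption: a placeholder is the letter "M" followed by digits,
# as seen in the example records above.
_PLACEHOLDER = re.compile(r"\bM(\d+)\b")


def entity_placeholders(text: str) -> List[str]:
    """Return the distinct placeholders in `text`, ordered by numeric index."""
    indices = {int(match.group(1)) for match in _PLACEHOLDER.finditer(text)}
    return [f"M{index}" for index in sorted(indices)]


question = "Did M1 and M5 employ M2 , M3 , and M4 and employ a person 's child"
print(entity_placeholders(question))
```

Applying the same helper to a record's `question` and `query` gives a quick check that both strings mention the same set of placeholders.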

Data Fields

The data fields are the same among all splits and configurations:

  • question: a string feature.
  • query: a string feature.
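
Because both fields are plain strings, a record can be inspected with ordinary string handling. The sketch below breaks the `query` of the mcd1 example above into its header line and individual clauses. The function name `split_query` is hypothetical, and the code assumes the newline-separated layout seen in the mcd1 and mcd2 examples; it is not a SPARQL parser.

```python
from typing import List, Tuple


def split_query(query: str) -> Tuple[str, List[str]]:
    """Split a newline-separated query string into its header line and body clauses."""
    lines = [line.strip() for line in query.split("\n")]
    header = lines[0]
    clauses = []
    for line in lines[1:]:
        if line in ("", "}"):
            continue  # skip blank lines and the closing brace
        # Drop the trailing " ." separator when it is present.
        clauses.append(line[:-2] if line.endswith(" .") else line)
    return header, clauses


# The mcd1 'train' example shown above.
record = {
    "question": "Did a person marry a cinematographer , influence M1 , and influence M2",
    "query": (
        "SELECT count(*) WHERE {\n"
        "?x0 a ns:people.person .\n"
        "?x0 ns:influence.influence_node.influenced M1 .\n"
        "?x0 ns:influence.influence_node.influenced M2 .\n"
        "?x0 ns:people.person.spouse_s/ns:people.marriage.spouse"
        "|ns:fictional_universe.fictional_character.married_to"
        "/ns:fictional_universe.marriage_of_fictional_characters.spouses ?x1 .\n"
        "?x1 a ns:film.cinematographer .\n"
        "FILTER ( ?x0 != ?x1 )\n"
        "}"
    ),
}

header, clauses = split_query(record["query"])
print(header)
for clause in clauses:
    print("  ", clause)
```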

Data Splits

name                         train    test
mcd1                         95743   11968
mcd2                         95743   11968
mcd3                         95743   11968
query_complexity_split      100654    9512
query_pattern_split          94600   12589
question_complexity_split    98999   10340
question_pattern_split       95654   11909
random_split                 95744   11967
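
To compare the configurations, it can help to see what share of the listed examples falls in the test split. The following sketch copies the train and test counts listed above into a dictionary and computes that share for each configuration. The names `SPLIT_SIZES` and `held_out_fraction` are illustrative only.

```python
# Counts copied from the listing above: name -> (train, test).
SPLIT_SIZES = {
    "mcd1": (95743, 11968),
    "mcd2": (95743, 11968),
    "mcd3": (95743, 11968),
    "query_complexity_split": (100654, 9512),
    "query_pattern_split": (94600, 12589),
    "question_complexity_split": (98999, 10340),
    "question_pattern_split": (95654, 11909),
    "random_split": (95744, 11967),
}


def held_out_fraction(name: str) -> float:
    """Return test / (train + test) for the given configuration."""
    train, test = SPLIT_SIZES[name]
    return test / (train + test)


for name in SPLIT_SIZES:
    print(f"{name:<28}{held_out_fraction(name):.3f}")
```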

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information

@inproceedings{Keysers2020,
  title={Measuring Compositional Generalization: A Comprehensive Method on
         Realistic Data},
  author={Daniel Keysers and Nathanael Sch{\"a}rli and Nathan Scales and
          Hylke Buisman and Daniel Furrer and Sergii Kashubin and
          Nikola Momchev and Danila Sinopalnikov and Lukasz Stafiniak and
          Tibor Tihon and Dmitry Tsarkov and Xiao Wang and Marc van Zee and
          Olivier Bousquet},
  booktitle={ICLR},
  year={2020},
  url={https://arxiv.org/abs/1912.09713.pdf},
}

Contributions

Thanks to @thomwolf, @patrickvonplaten, @lewtun, and @brainshawn for adding this dataset.