数据集:

wiki_bio

语言:

en

计算机处理:

monolingual

大小:

100K<n<1M

语言创建人:

found

批注创建人:

found

源数据集:

original

预印本库:

arxiv:1603.07771
中文

Dataset Card for [Dataset Name]

Dataset Summary

This Dataset contains 728321 biographies extracted from Wikipedia containing the first paragraph of the biography and the tabular infobox.

Supported Tasks and Leaderboards

The main purpose of this dataset is developing text generation models.

Languages

English.

Dataset Structure

Data Instances

More Information Needed

Data Fields

The structure of a single sample is the following:

{
   "input_text":{
      "context":"pope michael iii of alexandria\n",
      "table":{
         "column_header":[
            "type",
            "ended",
            "death_date",
            "title",
            "enthroned",
            "name",
            "buried",
            "religion",
            "predecessor",
            "nationality",
            "article_title",
            "feast_day",
            "birth_place",
            "residence",
            "successor"
         ],
         "content":[
            "pope",
            "16 march 907",
            "16 march 907",
            "56th of st. mark pope of alexandria & patriarch of the see",
            "25 april 880",
            "michael iii of alexandria",
            "monastery of saint macarius the great",
            "coptic orthodox christian",
            "shenouda i",
            "egyptian",
            "pope michael iii of alexandria\n",
            "16 -rrb- march -lrb- 20 baramhat in the coptic calendar",
            "egypt",
            "saint mark 's church",
            "gabriel i"
         ],
         "row_number":[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
      }
   },
   "target_text":"pope michael iii of alexandria -lrb- also known as khail iii -rrb- was the coptic pope of alexandria and patriarch of the see of st. mark -lrb- 880 -- 907 -rrb- .\nin 882 , the governor of egypt , ahmad ibn tulun , forced khail to pay heavy contributions , forcing him to sell a church and some attached properties to the local jewish community .\nthis building was at one time believed to have later become the site of the cairo geniza .\n"
}

where, in the "table" field, all the information of the Wikpedia infobox is stored (the header of the infobox is stored in "column_header" and the information in the "content" field).

Data Splits

  • Train: 582659 samples.
  • Test: 72831 samples.
  • Validation: 72831 samples.

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

This dataset was announced in the paper Neural Text Generation from Structured Data with Application to the Biography Domain (arxiv link) and is stored in this repo (owned by DavidGrangier).

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

This dataset is ditributed under Creative Comons CC BY-SA 3.0 License.

Citation Information

For refering the original paper in BibTex format:

@article{DBLP:journals/corr/LebretGA16,
  author    = {R{\'{e}}mi Lebret and
               David Grangier and
               Michael Auli},
  title     = {Generating Text from Structured Data with Application to the Biography
               Domain},
  journal   = {CoRR},
  volume    = {abs/1603.07771},
  year      = {2016},
  url       = {http://arxiv.org/abs/1603.07771},
  archivePrefix = {arXiv},
  eprint    = {1603.07771},
  timestamp = {Mon, 13 Aug 2018 16:48:30 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/LebretGA16.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Contributions

Thanks to @alejandrocros for adding this dataset.