Dataset: wiki_bio
Tasks: table-to-text
Languages: en
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: found
Annotation creators: found
Source datasets: original
Preprint: arxiv:1603.07771
License: cc-by-sa-3.0
This dataset contains 728,321 biographies extracted from Wikipedia. Each entry consists of the first paragraph of the biography and the tabular infobox.
The main purpose of this dataset is developing text generation models.
English.
More Information Needed
The structure of a single sample is the following:
{
  "input_text": {
    "context": "pope michael iii of alexandria\n",
    "table": {
      "column_header": [
        "type", "ended", "death_date", "title", "enthroned",
        "name", "buried", "religion", "predecessor", "nationality",
        "article_title", "feast_day", "birth_place", "residence", "successor"
      ],
      "content": [
        "pope",
        "16 march 907",
        "16 march 907",
        "56th of st. mark pope of alexandria & patriarch of the see",
        "25 april 880",
        "michael iii of alexandria",
        "monastery of saint macarius the great",
        "coptic orthodox christian",
        "shenouda i",
        "egyptian",
        "pope michael iii of alexandria\n",
        "16 -rrb- march -lrb- 20 baramhat in the coptic calendar",
        "egypt",
        "saint mark 's church",
        "gabriel i"
      ],
      "row_number": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    }
  },
  "target_text": "pope michael iii of alexandria -lrb- also known as khail iii -rrb- was the coptic pope of alexandria and patriarch of the see of st. mark -lrb- 880 -- 907 -rrb- .\nin 882 , the governor of egypt , ahmad ibn tulun , forced khail to pay heavy contributions , forcing him to sell a church and some attached properties to the local jewish community .\nthis building was at one time believed to have later become the site of the cairo geniza .\n"
}
Here, the "table" field stores all the information from the Wikipedia infobox: the infobox field names are stored in "column_header" and the corresponding values, in the same order, in "content".
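Because "column_header" and "content" are parallel lists, a sample is often easier to work with once the two are zipped into a mapping. The following is a minimal sketch, not part of the dataset tooling: the helper name `infobox_to_dict` is hypothetical, the sample is an abridged copy of the one shown above, and the code assumes header names are unique within one infobox (as they are in that sample).

```python
from typing import Dict, List

# Abridged copy of the sample above: only four infobox fields are kept.
sample = {
    "input_text": {
        "context": "pope michael iii of alexandria\n",
        "table": {
            "column_header": ["type", "enthroned", "article_title", "successor"],
            "content": [
                "pope",
                "25 april 880",
                "pope michael iii of alexandria\n",
                "gabriel i",
            ],
            "row_number": [1, 1, 1, 1],
        },
    }
}


def infobox_to_dict(table: Dict[str, List]) -> Dict[str, str]:
    """Pair each infobox field name with its value (hypothetical helper).

    Assumes header names are unique; a repeated header would keep only
    the last value seen.
    """
    headers = table["column_header"]
    values = table["content"]
    if len(headers) != len(values):
        raise ValueError("column_header and content must have the same length")
    # Strip surrounding whitespace, e.g. the trailing newline on article_title.
    return {header: value.strip() for header, value in zip(headers, values)}


infobox = infobox_to_dict(sample["input_text"]["table"])
print(infobox["successor"])
```

Stripping whitespace is a convenience for lookups; skip it if the raw values must be preserved exactly.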
[More Information Needed]
This dataset was introduced in the paper Neural Text Generation from Structured Data with Application to the Biography Domain (arXiv:1603.07771) and is hosted in a repository owned by DavidGrangier.
Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers?
[More Information Needed]
[More Information Needed]
Who are the annotators?
[More Information Needed]
This dataset is distributed under the Creative Commons CC BY-SA 3.0 license.
To cite the original paper in BibTeX format:
@article{DBLP:journals/corr/LebretGA16,
  author        = {R{\'{e}}mi Lebret and David Grangier and Michael Auli},
  title         = {Generating Text from Structured Data with Application to the Biography Domain},
  journal       = {CoRR},
  volume        = {abs/1603.07771},
  year          = {2016},
  url           = {http://arxiv.org/abs/1603.07771},
  archivePrefix = {arXiv},
  eprint        = {1603.07771},
  timestamp     = {Mon, 13 Aug 2018 16:48:30 +0200},
  biburl        = {https://dblp.org/rec/journals/corr/LebretGA16.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
Thanks to @alejandrocros for adding this dataset.