Dataset:
code_x_glue_tt_text_to_text
CodeXGLUE text-to-text dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Text/text-to-text
The dataset we use is crawled and filtered from Microsoft Documentation, whose documents are located at https://github.com/MicrosoftDocs/.
da_en, lv_en, no_en, zh_en
da_en
An example of 'test' looks as follows.
{ "id": 0, "source": "4 . K\u00f8r modellen , og udgiv den som en webtjeneste .\n", "target": "4 . Run the model , and publish it as a web service .\n" }
lv_en
An example of 'train' looks as follows.
{ "id": 0, "source": "title : Pakalpojumu objektu izveide\n", "target": "title : Create service objects\n" }
no_en
An example of 'validation' looks as follows.
{ "id": 0, "source": "2 . \u00c5pne servicevaren du vil definere komponenter fra en stykkliste for .\n", "target": "2 . Open the service item for which you want to set up components from a BOM .\n" }
zh_en
An example of 'validation' looks as follows.
{ "id": 0, "source": "& # 124 ; MCDUserNotificationReadStateFilterAny & # 124 ; 0 & # 124 ; \u5305\u62ec \u901a\u77e5 , \u800c \u4e0d \u8003\u8651 \u8bfb\u53d6 \u72b6\u6001 \u3002 & # 124 ;\n", "target": "| MCDUserNotificationReadStateFilterAny | 0 | Include notifications regardless of read state . |\n" }
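As a minimal sketch of how such a record can be read, the snippet below parses the da_en 'test' example with Python's standard `json` module and strips the trailing newline that each field carries. It also shows one possible way to collapse the space-separated entity `& # 124 ;` seen in the zh_en source back into `|`. The helper name `unescape_tokenized_entities` is hypothetical and this normalization is an assumption for illustration, not a step the dataset itself prescribes.

```python
import json
import re

# Record copied from the da_en 'test' example; the raw string keeps the
# JSON escapes (\u00f8, \n) intact so that json.loads decodes them.
raw = r'{ "id": 0, "source": "4 . K\u00f8r modellen , og udgiv den som en webtjeneste .\n", "target": "4 . Run the model , and publish it as a web service .\n" }'

record = json.loads(raw)
source = record["source"].rstrip("\n")  # drop the trailing newline
target = record["target"].rstrip("\n")


def unescape_tokenized_entities(text: str) -> str:
    """Hypothetical helper: turn '& # 124 ;' style numeric entities
    (split by the tokenizer) back into the character they encode."""
    return re.sub(r"&\s*#\s*(\d+)\s*;", lambda m: chr(int(m.group(1))), text)


print(source)
print(target)
print(unescape_tokenized_entities("& # 124 ; 0 & # 124 ;"))
```

Whether to normalize the entities depends on the downstream model; leaving them untouched keeps the text identical to the released data.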
The following table describes each data field for each config. The data fields are the same among all splits.
da_en, lv_en, no_en, zh_en
field name | type | description |
---|---|---|
id | int32 | The index of the sample |
source | string | The source language version of the text |
target | string | The target language version of the text |
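The field table above can be expressed as a small schema check. This is a sketch under assumptions: the names `TextToTextRecord` and `is_valid_record` are hypothetical, and the int32 range test simply mirrors the declared type of `id`.

```python
from typing import Any, TypedDict


class TextToTextRecord(TypedDict):
    """Shape of one sample, following the field table."""
    id: int       # int32 index of the sample
    source: str   # source language version of the text
    target: str   # target language (English) version of the text


INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def is_valid_record(obj: Any) -> bool:
    """Return True if obj has exactly the three documented fields
    with the documented types."""
    if not isinstance(obj, dict) or set(obj) != {"id", "source", "target"}:
        return False
    rid = obj["id"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(rid, bool) or not isinstance(rid, int):
        return False
    if not INT32_MIN <= rid <= INT32_MAX:
        return False
    return isinstance(obj["source"], str) and isinstance(obj["target"], str)


sample: TextToTextRecord = {
    "id": 0,
    "source": "title : Pakalpojumu objektu izveide\n",
    "target": "title : Create service objects\n",
}
print(is_valid_record(sample))
```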
name | train | validation | test |
---|---|---|---|
da_en | 42701 | 1000 | 1000 |
lv_en | 18749 | 1000 | 1000 |
no_en | 44322 | 1000 | 1000 |
zh_en | 50154 | 1000 | 1000 |
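For a quick sense of scale, the split sizes in the table can be summed per config and overall. The dictionary below only restates the table; the variable names are illustrative.

```python
# Split sizes copied from the table above.
SPLIT_SIZES = {
    "da_en": {"train": 42701, "validation": 1000, "test": 1000},
    "lv_en": {"train": 18749, "validation": 1000, "test": 1000},
    "no_en": {"train": 44322, "validation": 1000, "test": 1000},
    "zh_en": {"train": 50154, "validation": 1000, "test": 1000},
}

# Total number of samples per config and across all configs.
totals = {name: sum(splits.values()) for name, splits in SPLIT_SIZES.items()}
train_total = sum(splits["train"] for splits in SPLIT_SIZES.values())
grand_total = sum(totals.values())

for name, total in totals.items():
    print(f"{name}: {total}")
print(f"train total: {train_total}")
print(f"all samples: {grand_total}")
```

The validation and test sets are fixed at 1000 samples each, so the held-out share is largest for lv_en, the smallest config.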
https://github.com/microsoft , https://github.com/madlag
Computational Use of Data Agreement (C-UDA) License.
@article{CodeXGLUE, title={CodeXGLUE: A Benchmark Dataset and Open Challenge for Code Intelligence}, year={2020},}
Thanks to @madlag (and partly also @ncoop57) for adding this dataset.