数据集:
neulab/mconala
MCoNaLa is a Multilingual Code/Natural Language Challenge dataset with 896 NL-Code pairs in three languages: Spanish, Japanese, and Russian.
Spanish, Japanese, Russian; Python
from datasets import load_dataset # Spanish subset load_dataset("neulab/mconala", "es") DatasetDict({ test: Dataset({ features: ['question_id', 'intent', 'rewritten_intent', 'snippet'], num_rows: 341 }) }) # Japanese subset load_dataset("neulab/mconala", "ja") DatasetDict({ test: Dataset({ features: ['question_id', 'intent', 'rewritten_intent', 'snippet'], num_rows: 210 }) }) # Russian subset load_dataset("neulab/mconala", "ru") DatasetDict({ test: Dataset({ features: ['question_id', 'intent', 'rewritten_intent', 'snippet'], num_rows: 345 }) })
Field | Type | Description |
---|---|---|
question_id | int | StackOverflow post id of the sample |
intent | string | Title of the Stackoverflow post as the initial NL intent |
rewritten_intent | string | nl intent rewritten by human annotators |
snippet | string | Python code solution to the NL intent |
The dataset contains 341, 210, and 345 samples in Spanish, Japanese, and Russian.
@article{wang2022mconala, title={MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages}, author={Zhiruo Wang, Grace Cuenca, Shuyan Zhou, Frank F. Xu, Graham Neubig}, journal={arXiv preprint arXiv:2203.08388}, year={2022} }