数据集:
neulab/mconala
MCoNaLa is a Multilingual Code/Natural Language Challenge dataset with 896 NL-Code pairs in three languages: Spanish, Japanese, and Russian.
Spanish, Japanese, Russian; Python
from datasets import load_dataset
# Spanish subset
load_dataset("neulab/mconala", "es")
DatasetDict({
test: Dataset({
features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
num_rows: 341
})
})
# Japanese subset
load_dataset("neulab/mconala", "ja")
DatasetDict({
test: Dataset({
features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
num_rows: 210
})
})
# Russian subset
load_dataset("neulab/mconala", "ru")
DatasetDict({
test: Dataset({
features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
num_rows: 345
})
})
| Field | Type | Description |
|---|---|---|
| question_id | int | StackOverflow post id of the sample |
| intent | string | Title of the Stackoverflow post as the initial NL intent |
| rewritten_intent | string | nl intent rewritten by human annotators |
| snippet | string | Python code solution to the NL intent |
The dataset contains 341, 210, and 345 samples in Spanish, Japanese, and Russian.
@article{wang2022mconala,
title={MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages},
author={Zhiruo Wang, Grace Cuenca, Shuyan Zhou, Frank F. Xu, Graham Neubig},
journal={arXiv preprint arXiv:2203.08388},
year={2022}
}