Dataset Card for MCoNaLa

Dataset Summary

MCoNaLa is a Multilingual Code/Natural Language Challenge dataset with 896 NL-Code pairs in three languages: Spanish, Japanese, and Russian.

Languages

Natural languages: Spanish, Japanese, Russian. Programming language: Python.

Dataset Structure

How to Use

from datasets import load_dataset

# Spanish subset
dataset_es = load_dataset("neulab/mconala", "es")
print(dataset_es)
# DatasetDict({
#     test: Dataset({
#         features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
#         num_rows: 341
#     })
# })

# Japanese subset
dataset_ja = load_dataset("neulab/mconala", "ja")
print(dataset_ja)
# DatasetDict({
#     test: Dataset({
#         features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
#         num_rows: 210
#     })
# })

# Russian subset
dataset_ru = load_dataset("neulab/mconala", "ru")
print(dataset_ru)
# DatasetDict({
#     test: Dataset({
#         features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
#         num_rows: 345
#     })
# })
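
To load all three subsets in one pass, a loop over the configuration names is enough. The sketch below assumes the config names "es", "ja", and "ru" and the single test split shown above. The helper name load_all_subsets and its loader parameter are illustrative and not part of the datasets library.

```python
from typing import Any, Callable, Dict, Optional

# Config names taken from the load_dataset calls above.
LANGUAGE_CONFIGS = ("es", "ja", "ru")


def load_all_subsets(loader: Optional[Callable[..., Any]] = None) -> Dict[str, Any]:
    """Return the test split of every language subset, keyed by config name.

    `loader` is a hypothetical hook for swapping in a stub (for example in
    offline tests); by default the real `datasets.load_dataset` is used.
    """
    if loader is None:
        # Imported lazily so the helper can be defined without `datasets` installed.
        from datasets import load_dataset
        loader = load_dataset
    return {lang: loader("neulab/mconala", lang)["test"] for lang in LANGUAGE_CONFIGS}
```

Calling load_all_subsets() with no arguments downloads the data from the Hugging Face Hub, so it needs network access.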

Data Fields

| Field | Type | Description |
|---|---|---|
| question_id | int | Stack Overflow post ID of the sample |
| intent | string | Title of the Stack Overflow post, used as the initial NL intent |
| rewritten_intent | string | NL intent rewritten by human annotators |
| snippet | string | Python code solution to the NL intent |
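
As a rough illustration of how these fields fit together, the sketch below turns one record into an (NL, code) pair, using rewritten_intent as the natural-language side. The record shown is a made-up placeholder, not an actual row from the dataset, and to_nl_code_pair is a hypothetical helper name.

```python
from typing import Mapping, Tuple


def to_nl_code_pair(record: Mapping[str, object]) -> Tuple[str, str]:
    """Pair the annotator-rewritten intent with its Python snippet."""
    return str(record["rewritten_intent"]), str(record["snippet"])


# Placeholder record with the four documented fields (all values are invented).
example = {
    "question_id": 0,
    "intent": "<post title>",
    "rewritten_intent": "<rewritten NL intent>",
    "snippet": "print('hello')",
}

nl, code = to_nl_code_pair(example)
print(nl)
print(code)
```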

Data Splits

The dataset contains 341, 210, and 345 samples in Spanish, Japanese, and Russian, respectively, for a total of 896. Each language subset provides a single test split.
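
The per-language counts can be checked against the 896 total given in the summary. The sketch below hard-codes the documented counts. check_counts is a hypothetical helper that compares them with any mapping from config name to a sized collection, such as the loaded test splits.

```python
from typing import Dict, Mapping, Sized

# Row counts as documented in this card.
EXPECTED_ROWS: Dict[str, int] = {"es": 341, "ja": 210, "ru": 345}


def check_counts(splits: Mapping[str, Sized]) -> Dict[str, bool]:
    """Report, per language, whether the split has the documented number of rows."""
    return {lang: len(splits[lang]) == n for lang, n in EXPECTED_ROWS.items()}


total = sum(EXPECTED_ROWS.values())
print(total)  # 896
```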

Citation Information

@article{wang2022mconala,
  title={MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages},
  author={Zhiruo Wang and Grace Cuenca and Shuyan Zhou and Frank F. Xu and Graham Neubig},
  journal={arXiv preprint arXiv:2203.08388},
  year={2022}
}