Dataset: codeparrot/xlcost-text-to-code
Tasks: text generation
Sub-tasks: language-modeling
Languages: code
Multilinguality: multilingual
ArXiv: arxiv:2206.08474
License: cc-by-sa-4.0

This is a subset of the XLCoST benchmark for text-to-code generation at the snippet level and the program level, covering 7 programming languages: Python, C, C#, C++, Java, Javascript and PHP.
The dataset contains English text and the corresponding code. Each program is divided into several code snippets: the snippet-level subsets contain these snippets with their corresponding comments, while in the program-level subsets the comments are concatenated into one long description. Moreover, programs in all languages are aligned at the snippet level, and the comment for a particular snippet is the same across all languages.
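As a rough illustration of how a program-level description relates to its snippet comments, the sketch below splits one description back into a title and a list of comments. It assumes that the title is separated by " | " and the comments by " ; ", which is inferred only from the single program-level example shown on this card; the helper name `split_description` is hypothetical, and comments that themselves contain " ; " would be split incorrectly.

```python
def split_description(text):
    """Split a program-level description into (title, comments).

    Assumption: the title is followed by " | " and the snippet comments
    are joined with " ; ". This is inferred from one sample only.
    """
    title, _, rest = text.partition(" | ")
    comments = [part.strip() for part in rest.split(" ; ") if part.strip()]
    return title.strip(), comments


# Description taken from the program-level example on this card.
description = (
    "Maximum Prefix Sum possible by merging two given arrays | "
    "Python3 implementation of the above approach ; "
    "Stores the maximum prefix sum of the array A [ ] ; "
    "Traverse the array A [ ] ; "
    "Stores the maximum prefix sum of the array B [ ] ; "
    "Traverse the array B [ ] ; "
    "Driver code"
)

title, comments = split_description(description)
print(title)
for comment in comments:
    print("-", comment)
```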
To load the dataset, you need to specify one of the 14 existing subsets: LANGUAGE-snippet-level or LANGUAGE-program-level, where LANGUAGE is one of Python, C, Csharp, C++, Java, Javascript or PHP. By default, Python-snippet-level is loaded.
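The naming scheme above can be spelled out programmatically. The following is a minimal sketch that only builds the 14 subset names from the pattern described on this card; the constant and function names (`LANGUAGES`, `LEVELS`, `subset_names`) are illustrative and not part of the dataset or of the `datasets` library.

```python
# Language identifiers exactly as they appear in the subset names.
LANGUAGES = ["Python", "C", "Csharp", "C++", "Java", "Javascript", "PHP"]
LEVELS = ["snippet-level", "program-level"]


def subset_names():
    """Return every LANGUAGE-LEVEL combination described on this card."""
    return [f"{lang}-{level}" for lang in LANGUAGES for level in LEVELS]


names = subset_names()
print(len(names))
print(names[:2])
```

Any of these strings can then be passed as the second argument to `load_dataset`.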
    from datasets import load_dataset

    # Keep a reference to the loaded dataset so it can be inspected afterwards
    data = load_dataset("codeparrot/xlcost-text-to-code", "Python-program-level")
    print(data)

    DatasetDict({
        train: Dataset({
            features: ['text', 'code'],
            num_rows: 9263
        })
        test: Dataset({
            features: ['text', 'code'],
            num_rows: 887
        })
        validation: Dataset({
            features: ['text', 'code'],
            num_rows: 472
        })
    })
    next(iter(data["train"]))

    {'text': 'Maximum Prefix Sum possible by merging two given arrays | Python3 implementation of the above approach ; Stores the maximum prefix sum of the array A [ ] ; Traverse the array A [ ] ; Stores the maximum prefix sum of the array B [ ] ; Traverse the array B [ ] ; Driver code',
     'code': 'def maxPresum ( a , b ) : NEW_LINE INDENT X = max ( a [ 0 ] , 0 ) NEW_LINE for i in range ( 1 , len ( a ) ) : NEW_LINE INDENT a [ i ] += a [ i - 1 ] NEW_LINE X = max ( X , a [ i ] ) NEW_LINE DEDENT Y = max ( b [ 0 ] , 0 ) NEW_LINE for i in range ( 1 , len ( b ) ) : NEW_LINE INDENT b [ i ] += b [ i - 1 ] NEW_LINE Y = max ( Y , b [ i ] ) NEW_LINE DEDENT return X + Y NEW_LINE DEDENT A = [ 2 , - 1 , 4 , - 5 ] NEW_LINE B = [ 4 , - 3 , 12 , 4 , - 3 ] NEW_LINE print ( maxPresum ( A , B ) ) NEW_LINE'}
Note that the data has been tokenized, which explains the additional whitespace and the use of special tokens, for example NEW_LINE instead of \n, INDENT instead of \t, and DEDENT to cancel indentation.
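To read or run a Python sample, the special tokens have to be mapped back to real line breaks and indentation. The sketch below is one possible way to do that, not an official utility of the dataset: the function name `detokenize` and the four-space indent are assumptions, and it treats NEW_LINE, INDENT and DEDENT as standalone whitespace-separated tokens, with INDENT and DEDENT changing the indentation level of the lines that follow. The extra spaces inside each line are left as they are, and string literals whose original spacing was altered by tokenization are not restored.

```python
def detokenize(code, indent="    "):
    """Rebuild line breaks and indentation from NEW_LINE/INDENT/DEDENT tokens."""
    lines = []
    current = []  # tokens of the line being assembled
    level = 0     # current indentation depth
    for token in code.split():
        if token == "NEW_LINE":
            lines.append(indent * level + " ".join(current))
            current = []
        elif token == "INDENT":
            level += 1
        elif token == "DEDENT":
            level = max(level - 1, 0)
        else:
            current.append(token)
    if current:  # trailing tokens without a final NEW_LINE
        lines.append(indent * level + " ".join(current))
    return "\n".join(lines)


# The 'code' field of the program-level example on this card.
sample = (
    "def maxPresum ( a , b ) : NEW_LINE INDENT X = max ( a [ 0 ] , 0 ) NEW_LINE "
    "for i in range ( 1 , len ( a ) ) : NEW_LINE INDENT a [ i ] += a [ i - 1 ] NEW_LINE "
    "X = max ( X , a [ i ] ) NEW_LINE DEDENT Y = max ( b [ 0 ] , 0 ) NEW_LINE "
    "for i in range ( 1 , len ( b ) ) : NEW_LINE INDENT b [ i ] += b [ i - 1 ] NEW_LINE "
    "Y = max ( Y , b [ i ] ) NEW_LINE DEDENT return X + Y NEW_LINE DEDENT "
    "A = [ 2 , - 1 , 4 , - 5 ] NEW_LINE B = [ 4 , - 3 , 12 , 4 , - 3 ] NEW_LINE "
    "print ( maxPresum ( A , B ) ) NEW_LINE"
)

restored = detokenize(sample)
print(restored)
```

For this sample the restored text is valid Python despite the extra spaces, so it can be inspected or executed directly.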
Each subset has three splits: train, test and validation.
@misc{zhu2022xlcost,
  title = {XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence},
  url = {https://arxiv.org/abs/2206.08474},
  author = {Zhu, Ming and Jain, Aneesh and Suresh, Karthik and Ravindran, Roshan and Tipirneni, Sindhu and Reddy, Chandan K.},
  year = {2022},
  eprint = {2206.08474},
  archivePrefix = {arXiv}
}