Dataset:

efederici/alpaca-gpt4-it

Alpaca GPT4 English-to-Italian Translated Instructions (WIP)

This dataset contains 15,209 instructions that have been translated from English to Italian using gpt-3.5-turbo.

Alpaca GPT4: the original alpaca_gpt4_data.json dataset contains 52K instruction-following examples generated by GPT-4 from the Alpaca prompts. The JSON file has the same format as the Alpaca data, except that the output is generated by GPT-4:

  • instruction: str, describes the task the model should perform. Each of the 52K instructions is unique.
  • input: str, optional context or input for the task.
  • output: str, the answer to the instruction as generated by GPT-4.
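As a rough illustration of the three fields above, the following Python sketch models one record and checks its shape. The names `AlpacaRecord` and `is_valid_record` are hypothetical, the sample record is made up for illustration and not quoted from the dataset, and representing a missing input as an empty string is an assumption.

```python
from typing import TypedDict


class AlpacaRecord(TypedDict):
    """One example in the Alpaca-style JSON format (hypothetical type name)."""

    instruction: str  # the task the model should perform
    input: str        # optional context; assumed to be "" when absent
    output: str       # the answer to the instruction


def is_valid_record(record: dict) -> bool:
    """Return True if the record has exactly the three expected string fields."""
    required = ("instruction", "input", "output")
    return (
        set(record) == set(required)
        and all(isinstance(record[key], str) for key in required)
        and record["instruction"].strip() != ""
    )


# Illustrative record only; not taken from the dataset.
example: AlpacaRecord = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "Eat a balanced diet, exercise regularly, and get enough sleep.",
}
```

A check like this can be run over the loaded JSON list before any further processing, so that malformed entries surface early.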

The instructions were sourced from the Alpaca GPT4 dataset and translated using the following prompt:

Act as an unrivaled English-to-Italian Translator. Your task is to translate the given passage into Italian, as you are a native Italian speaker. Each message passage contains an instruction, with an optional input (preceded by [|IN|]) and an output ([|OUT|]). You MUST provide accurate and fluent Italian translations. When translating the instruction use the second person singular. Translate each section. Keep [|IN|] and [|OUT|] placeholders. If the input or output doesn't make sense in Italian, revise them.

ENGLISH:
"""
{passage}
"""

ITALIAN:

where 'passage' represents the original English text. Each passage is laid out so that gpt-3.5-turbo can tell the fields apart: <instruction> (optional [|IN|] <input>) [|OUT|] <output>. The translation keeps this layout and preserves the context of the original English instructions.
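The passage layout and the ENGLISH/ITALIAN frame of the prompt can be sketched in Python as follows. This is a minimal sketch, not the author's actual pipeline: the helper names `format_passage`, `build_prompt`, and `parse_passage` are hypothetical, joining the fields with single spaces is an assumption, and the instruction text of the prompt is passed in as `header` rather than hard-coded.

```python
IN_TAG = "[|IN|]"
OUT_TAG = "[|OUT|]"


def format_passage(instruction: str, input_text: str, output: str) -> str:
    """Join the fields as: <instruction> [|IN|] <input> [|OUT|] <output>.

    The [|IN|] part is skipped when the input is empty (assumed convention).
    """
    parts = [instruction.strip()]
    if input_text.strip():
        parts.append(f"{IN_TAG} {input_text.strip()}")
    parts.append(f"{OUT_TAG} {output.strip()}")
    return " ".join(parts)


def build_prompt(header: str, passage: str) -> str:
    """Wrap the passage in the ENGLISH / ITALIAN frame shown in the prompt."""
    return f'{header}\n\nENGLISH:\n"""\n{passage}\n"""\n\nITALIAN:\n'


def parse_passage(passage: str) -> dict:
    """Split a (translated) passage back into its three fields."""
    head, sep, output = passage.partition(OUT_TAG)
    if not sep:
        raise ValueError("missing [|OUT|] placeholder")
    instruction, _, input_text = head.partition(IN_TAG)
    return {
        "instruction": instruction.strip(),
        "input": input_text.strip(),
        "output": output.strip(),
    }
```

Because the prompt asks the model to keep the [|IN|] and [|OUT|] placeholders, a parser of this kind can recover the Italian fields from the reply; replies that drop [|OUT|] raise an error and can be retried or discarded.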

License

The original Alpaca GPT4 dataset and the translations generated by gpt-3.5-turbo may each carry their own license, so comply with any usage restrictions specified by the original data sources. Because this dataset is a partial translation of that data, proper attribution and compliance with the relevant licenses are recommended. The data is intended and licensed for research use only. The dataset is released under CC BY-NC 4.0 (non-commercial use only), and models trained on it should not be used outside of research purposes.

Citation

@article{peng2023instruction,
  title={Instruction Tuning with GPT-4},
  author={Peng, Baolin and Li, Chunyuan and He, Pengcheng and Galley, Michel and Gao, Jianfeng},
  journal={arXiv preprint arXiv:2304.03277},
  year={2023}
}