Dataset:
TurkuNLP/register_oscar
The Register Oscar dataset is a multilingual dataset containing languages from the Oscar dataset that have been tagged with register information.
The tagging uses 8 main-level registers. For further description of the labels, see Douglas Biber and Jesse Egbert. 2018. Register Variation Online.
Code used to tag Register Oscar can be found at https://github.com/TurkuNLP/register-labeling
The dataset currently contains the following languages: Arabic, Bengali, Catalan, English, Spanish, Basque, French, Hindi, Indonesian, Portuguese, Swahili, Urdu, Vietnamese and Chinese.
For further information on the languages and data, see https://huggingface.co/datasets/oscar
{"id": "0", "labels": ["NA"], "text": "Zarif: Iran inajua mpango wa Saudia wa kufanya mauaji ya kigaidi dhidi ya maafisa wa ngazi za juu wa Iran\n"}