Dataset:
musabg/wikipedia-oscar-tr
Welcome to the "Wikipedia and OSCAR Turkish" Hugging Face repo!
This repo contains a Turkish-language dataset created by merging Wikipedia with the OSCAR cleaned Common Crawl corpus. The dataset contains over 13 million examples with a single feature named text.
The dataset can be useful for natural language processing tasks in Turkish.
To download the dataset, you can use the Hugging Face Datasets library. Here's some sample code to get started:
from datasets import load_dataset
dataset = load_dataset("musabg/wikipedia-oscar-tr")
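With over 13 million examples, a full download can take a while. If you only want a quick look at the data first, the sketch below uses the library's streaming mode and prints nothing until you call it. It assumes the repo exposes a split named "train", which this README does not confirm. The helper names preview_texts and stream_dataset are made up for illustration and are not part of the dataset or the library.

```python
from itertools import islice


def preview_texts(records, n=3, max_chars=200):
    """Return the "text" field of the first n records, truncated to max_chars."""
    return [record["text"][:max_chars] for record in islice(records, n)]


def stream_dataset(split="train"):
    """Open the dataset in streaming mode so examples are fetched lazily.

    Assumption: a split called "train" exists; adjust if loading fails.
    """
    # Imported here so preview_texts can be used without the library installed.
    from datasets import load_dataset

    return load_dataset("musabg/wikipedia-oscar-tr", split=split, streaming=True)
```

Calling preview_texts(stream_dataset()) should return the first three text snippets without downloading the whole dataset. preview_texts accepts any iterable of dictionaries with a "text" key, so it also works on a regular, fully downloaded split.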
Have fun exploring this dataset and training language models on it!