In this repository I release GPT-2 model, that was trained on various texts for Turkish.
The model is meant to be an entry point for fine-tuning on other texts.
I used a Turkish corpora that is taken from oscar-corpus.
It was possible to create byte-level BPE with Tokenizers library of Huggingface.
With the Tokenizers library, I created a 52K byte-level BPE vocab based on the training corpora.
After creating the vocab, I could train the GPT-2 for Turkish on two 2080TI over the complete training corpus (five epochs).
Logs during training: https://tensorboard.dev/experiment/3AWKv8bBTaqcqZP5frtGkw/#scalars
Both PyTorch and Tensorflow compatible weights are available.
Model | Downloads |
---|---|
redrussianarmy/gpt2-turkish-cased | config.json • merges.txt • pytorch_model.bin • special_tokens_map.json • tf_model.h5 • tokenizer_config.json • traning_args.bin • vocab.json |
The model itself can be used in this way:
from transformers import AutoTokenizer, AutoModelWithLMHead tokenizer = AutoTokenizer.from_pretrained("redrussianarmy/gpt2-turkish-cased") model = AutoModelWithLMHead.from_pretrained("redrussianarmy/gpt2-turkish-cased")
Here's an example that shows how to use the great Transformers Pipelines for generating text:
from transformers import pipeline pipe = pipeline('text-generation', model="redrussianarmy/gpt2-turkish-cased", tokenizer="redrussianarmy/gpt2-turkish-cased", config={'max_length':800}) text = pipe("Akşamüstü yolda ilerlerken, ")[0]["generated_text"] print(text)
git lfs install git clone https://huggingface.co/redrussianarmy/gpt2-turkish-cased
For questions about the GPT2-Turkish model, just open an issue here ?