Model:
codeparrot/codeparrot-small-multi
CodeParrot-Multi is a GPT-2 model (110M parameters) trained to generate code in 9 programming languages: Java, JavaScript, PHP, Python, C#, C++, Go, Ruby, and TypeScript.
You can load the CodeParrot-Multi model and tokenizer directly in `transformers`:
```python
from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot-small-multi")
model = AutoModelWithLMHead.from_pretrained("codeparrot/codeparrot-small-multi")

inputs = tokenizer("def hello_world():", return_tensors="pt")
outputs = model(**inputs)
```
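Note that calling the model directly as above only returns the raw outputs (logits). To obtain an actual code completion you can call `generate` and decode the result; a minimal sketch, with illustrative sampling parameters that are not prescribed by this model card:

```python
from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot-small-multi")
model = AutoModelWithLMHead.from_pretrained("codeparrot/codeparrot-small-multi")

inputs = tokenizer("def hello_world():", return_tensors="pt")
# max_new_tokens, do_sample and top_p are illustrative choices, not recommended settings.
generated = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```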
or with a `pipeline`:
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="codeparrot/codeparrot-small-multi")
outputs = pipe("def hello_world():")
```
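The pipeline also accepts the usual generation keyword arguments; a small sketch with illustrative values (not settings from this card):

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="codeparrot/codeparrot-small-multi")
# max_new_tokens and num_return_sequences are illustrative values.
outputs = pipe("def hello_world():", max_new_tokens=64, num_return_sequences=2, do_sample=True)
for out in outputs:
    print(out["generated_text"])
```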
The model was trained on the small GitHub code dataset after near-deduplication, a subset of the GitHub code dataset, with the following settings (a hypothetical mapping of these settings onto `TrainingArguments` is sketched after the table):
| Config | Value |
|---|---|
| Batch size | 192 |
| Context size | 1024 |
| Training steps | 300,000 |
| Gradient accumulation | 2 |
| Gradient checkpointing | False |
| Learning rate | 5e-4 |
| Weight decay | 0.1 |
| Warmup steps | 2000 |
| Schedule | Cosine |
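For illustration, the table above could be mapped onto `transformers.TrainingArguments` roughly as follows. This is a hypothetical sketch, not the actual CodeParrot training script, and the way the batch size of 192 is split across devices and accumulation steps is an assumption:

```python
from transformers import TrainingArguments

# Hypothetical mapping of the hyperparameters above; the actual CodeParrot training setup may differ.
args = TrainingArguments(
    output_dir="codeparrot-small-multi",
    max_steps=300_000,
    per_device_train_batch_size=6,   # 6 x 16 GPUs x 2 accumulation steps = 192 (assumption)
    gradient_accumulation_steps=2,
    gradient_checkpointing=False,
    learning_rate=5e-4,
    weight_decay=0.1,
    warmup_steps=2000,
    lr_scheduler_type="cosine",
)
# The context size of 1024 is applied when tokenizing/chunking the dataset, not here.
```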
The training was executed on 16 × A100 (40GB) GPUs. This setting amounts to roughly 58 billion tokens (192 sequences × 1,024 tokens × 300,000 steps).
We evaluated the model on OpenAI's HumanEval benchmark, which consists of programming challenges:
| Metric | Value |
|---|---|
| pass@1 | --% |
| pass@10 | --% |
| pass@100 | --% |
The pass@k metric gives the probability that at least one out of k generations passes the tests.
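For reference, pass@k is usually computed with the unbiased estimator from the HumanEval (Codex) paper, where n generations are sampled per problem and c of them pass the tests. A minimal sketch of that standard estimator (not the exact CodeParrot evaluation code):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n = generations per problem, c = correct ones, k = budget."""
    if n - c < k:
        return 1.0
    # 1 - C(n - c, k) / C(n, k), evaluated as a numerically stable running product.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 11 of which pass, evaluated at k = 10.
print(pass_at_k(n=200, c=11, k=10))
```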