Model:
flax-community/gpt-neo-125M-code-clippy
Please refer to our new GitHub Wiki, which documents in detail our efforts to create an open source version of GitHub Copilot.
GPT-Neo-125M-Code-Clippy is a GPT-Neo-125M model finetuned using causal language modeling on the version of our Code Clippy Data dataset that retains duplicates. The dataset was scraped from public GitHub repositories (more information in the provided link). This model is specialized to autocomplete methods in multiple programming languages. As discussed in OpenAI's Codex paper, we modified the GPT-Neo model and tokenizer to accommodate additional whitespace characters. Specifically, we add the following tokens: ["\t\t", "  ", "    ", "        "] (a double tab and runs of two, four, and eight spaces). Since they are all related to indentation, we initialize the embeddings of these tokens with the same weights as the \t token already present in the model, in the hope that the model will learn to associate these whitespace characters with indentation faster. A script to do this automatically can be found here.
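The embedding initialization described above can be illustrated with a small, dependency-free sketch. This is not the project's script: the function name `add_tokens_like`, the toy vocabulary, and the embedding values are hypothetical, and the space-run lengths (2, 4, and 8) are used here for illustration. It only shows the idea of giving each new whitespace token a copy of the existing `\t` embedding.

```python
def add_tokens_like(vocab, embeddings, new_tokens, source_token="\t"):
    """Append new_tokens to vocab, giving each one a copy of source_token's embedding row.

    vocab maps token string -> row index; embeddings is a list of rows (lists of floats).
    """
    source_row = embeddings[vocab[source_token]]
    for token in new_tokens:
        if token in vocab:
            continue  # skip tokens the vocabulary already contains
        vocab[token] = len(embeddings)       # next free row index
        embeddings.append(list(source_row))  # independent copy, so rows can diverge during training
    return vocab, embeddings


# Toy vocabulary and 3-dimensional embeddings (hypothetical values)
vocab = {"def": 0, "\t": 1, "return": 2}
embeddings = [[0.1, 0.2, 0.3], [0.5, -0.4, 0.9], [0.0, 0.7, -0.2]]

# Double tab plus runs of spaces, written as " " * n so the lengths stay visible
whitespace_tokens = ["\t\t", " " * 2, " " * 4, " " * 8]
add_tokens_like(vocab, embeddings, whitespace_tokens)

print(len(embeddings))              # 7 rows: 3 original + 4 new
print(embeddings[vocab[" " * 4]])   # same values as the "\t" row
```

With a real model, the same idea would be applied to the model's input embedding matrix after the tokenizer and the embedding table have been extended by the new tokens.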
The training script used to train this model can be found here.
To reproduce the training, run the script above with this command:
    ./run_clm_streaming_flax.py \
        --output_dir $HOME/gpt-neo-125M-code-clippy \
        --model_name_or_path="flax-community/gpt-neo-125M-code-clippy" \
        --dataset_name $HOME/gpt-code-clippy/data_processing/code_clippy.py \
        --data_dir /home/shared/code_clippy_data \
        --text_column_name="text" \
        --do_train --do_eval \
        --block_size="2048" \
        --per_device_train_batch_size="8" \
        --per_device_eval_batch_size="16" \
        --preprocessing_num_workers="8" \
        --learning_rate="1e-4" \
        --max_steps 100000 \
        --warmup_steps 2500 \
        --decay_steps 25000 \
        --adam_beta1="0.9" \
        --adam_beta2="0.95" \
        --weight_decay="0.1" \
        --overwrite_output_dir \
        --logging_steps="100" \
        --eval_steps="500" \
        --push_to_hub="False" \
        --report_to="all" \
        --dtype="bfloat16" \
        --skip_memory_metrics="True" \
        --save_steps="500" \
        --save_total_limit 10 \
        --gradient_accumulation_steps 16 \
        --report_to="wandb" \
        --run_name="125m_1e-4lr_1024bs" \
        --max_eval_samples 2000 \
        --save_optimizer true
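As a rough sanity check on these settings, the arithmetic below shows how the per-device batch size and gradient accumulation combine into an effective batch size. The device count of 8 is an assumption (it is not stated in the command); it is chosen because it makes the result match the `1024bs` suffix in the run name. The helper name `effective_batch_size` is hypothetical.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices):
    """Number of sequences that contribute to a single optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


per_device_train_batch_size = 8    # from the command
gradient_accumulation_steps = 16   # from the command
block_size = 2048                  # tokens per sequence, from the command
num_devices = 8                    # ASSUMPTION: not given in the command

sequences_per_update = effective_batch_size(
    per_device_train_batch_size, gradient_accumulation_steps, num_devices
)
tokens_per_update = sequences_per_update * block_size

print(sequences_per_update)  # 1024
print(tokens_per_update)     # 2097152
```

On hardware with a different number of devices, the accumulation steps or per-device batch size would need to change to keep the same effective batch size.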
The model is finetuned on text files from GitHub repositories (mostly source code in various programming languages, but also Markdown and other project-related files).
You can use this model directly for text generation. Because sampling is enabled, this example generates a different sequence each time it's run:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Use a GPU when one is available, otherwise fall back to the CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = AutoModelForCausalLM.from_pretrained("flax-community/gpt-neo-125M-code-clippy").to(device)
    tokenizer = AutoTokenizer.from_pretrained("flax-community/gpt-neo-125M-code-clippy")

    prompt = """def greet(name):
      '''A function to greet user. Given a user name it should say hello'''
    """

    input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device)
    start = input_ids.size(1)  # prompt length, used to strip the prompt from the output

    out = model.generate(
        input_ids,
        do_sample=True,
        max_length=50,
        num_beams=2,
        early_stopping=True,
        eos_token_id=tokenizer.eos_token_id,
    )

    # Print only the newly generated tokens
    print(tokenizer.decode(out[0][start:]))
The model is intended to be used for research purposes and comes with no guarantees of quality of generated code.
The paper "Evaluating Large Language Models Trained on Code" from OpenAI has a good discussion of what the impact of a large language model trained on code could be. Some parts of that discussion are therefore highlighted here as they pertain to this dataset and to models that may be trained on it, along with some points where our views differ from the paper, particularly around legal implications.
GPT-Neo-125M-Code-Clippy is finetuned from GPT-Neo and might have inherited biases and limitations from it. See the GPT-Neo model card for details.
Coming soon...