Model:
HuggingFaceM4/opt-1.3b-fp16-8b-samples
This model is the outcome of an experiment: training https://huggingface.co/facebook/opt-1.3b from scratch for just 8B tokens in fp16, fp32, and bf16, so that the resulting models can be compared when they are used to train a multimodal model. It can, of course, be used for any other purpose; just be aware that these models are very undertrained. Most language models are trained for about 300B tokens; this one saw only 8B.
The 3 repositories (one per training dtype) are:

- HuggingFaceM4/opt-1.3b-fp16-8b-samples
- HuggingFaceM4/opt-1.3b-bf16-8b-samples
- HuggingFaceM4/opt-1.3b-fp32-8b-samples
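Any of these checkpoints can be loaded like a regular causal LM. A minimal sketch, assuming the fp16 variant (generation quality will of course reflect the undertraining):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

mname = "HuggingFaceM4/opt-1.3b-fp16-8b-samples"
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForCausalLM.from_pretrained(mname, torch_dtype=torch.float16)

inputs = tokenizer("Today the weather is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))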
Get transformers:
git clone https://github.com/huggingface/transformers
cd transformers
Prepare an initialized opt-1.3b model:
cat << EOT > prep-fp16.py
from transformers import AutoConfig, AutoModel, AutoTokenizer
import torch

mname = "facebook/opt-1.3b"

config = AutoConfig.from_pretrained(mname)
model = AutoModel.from_config(config, torch_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(mname)

path = "opt-1.3b-fp16"

model.save_pretrained(path)
tokenizer.save_pretrained(path)
EOT
Run:
python prep-fp16.py
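To sanity-check that the checkpoint was really written in half precision, you can reload it and inspect the parameter dtypes. A minimal sketch; torch_dtype="auto" keeps the stored dtype instead of upcasting to fp32:

from transformers import AutoModel

model = AutoModel.from_pretrained("opt-1.3b-fp16", torch_dtype="auto")
print({p.dtype for p in model.parameters()})  # expected: {torch.float16}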
Train from scratch on a single 8x 80GB A100 node on the realnewslike subset of https://huggingface.co/datasets/c4:
git clone https://github.com/huggingface/transformers
cd transformers
PYTHONPATH="src" python -m torch.distributed.run \
--nproc_per_node=8 \
--nnode=1 \
--node_rank=0 \
--master_addr=127.0.0.1 \
--master_port=9901 \
examples/pytorch/language-modeling/run_clm.py \
--fp16 \
--tf32 1 \
--seed 42 \
--dataset_name c4 \
--dataset_config_name realnewslike \
--model_name_or_path opt-1.3b-fp16 \
--per_device_train_batch_size 6 \
--per_device_eval_batch_size 6 \
--gradient_accumulation_steps 2 \
--do_train \
--logging_steps 5 \
--save_steps 1000 \
--eval_steps 1000 \
--weight_decay 0.1 \
--num_train_epochs 1 \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--learning_rate 0.0002 \
--lr_scheduler_type linear \
--warmup_steps 1000 \
--report_to tensorboard \
--output_dir saved \
--logging_dir tb \
--log_level warning \
--preprocessing_num_workers 32
The training took about 40h.
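As a rough, back-of-the-envelope cross-check of the numbers above (a sketch only; the sequence length is an assumption since the command does not pass --block_size, so adjust it to whatever run_clm.py actually used):

seq_len = 2048                            # assumed block size (OPT's context length); may differ in practice
global_batch = 6 * 8 * 2                  # per-device batch * 8 GPUs * grad accumulation = 96 sequences/step
tokens_per_step = global_batch * seq_len  # ~196,608 tokens per optimizer step
steps = 8_000_000_000 / tokens_per_step   # ~40,700 steps to reach ~8B tokens
print(f"{tokens_per_step=}  steps≈{steps:,.0f}")
# At ~40h wall time, 8B tokens works out to roughly 200M tokens/hour (~55K tokens/s across the node).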