Model:

HuggingFaceM4/opt-1.3b-bf16-8b-samples


This model is the outcome of an experiment: training https://huggingface.co/facebook/opt-1.3b from scratch for just 8B tokens in fp16, fp32, and bf16, so that the resulting models can be compared when they are later used to train a multimodal model. It can, of course, be used for any other purpose; just be aware that these models are very undertrained. Most language models are trained for about 300B tokens, whereas this one saw only 8B.

The 3 repositories are:

- https://huggingface.co/HuggingFaceM4/opt-1.3b-fp16-8b-samples
- https://huggingface.co/HuggingFaceM4/opt-1.3b-bf16-8b-samples
- https://huggingface.co/HuggingFaceM4/opt-1.3b-fp32-8b-samples
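These checkpoints behave like any other causal LM checkpoint. A minimal sketch of loading this one with the standard transformers generation API (the prompt and generation settings below are arbitrary, and the output quality will be poor given how undertrained the model is):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

mname = "HuggingFaceM4/opt-1.3b-bf16-8b-samples"

tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForCausalLM.from_pretrained(mname, torch_dtype=torch.bfloat16)

# greedy generation from an arbitrary prompt
inputs = tokenizer("The weather today is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))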

The training

Get the transformers repository:

git clone https://github.com/huggingface/transformers
cd transformers

Prepare a freshly initialized opt-1.3b model:

cat << EOT > prep-bf16.py
from transformers import AutoConfig, AutoModel, AutoTokenizer
import torch

mname = "facebook/opt-1.3b"

config = AutoConfig.from_pretrained(mname)
# create a randomly initialized opt-1.3b model in bf16 (the pretrained weights are not loaded)
model = AutoModel.from_config(config, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(mname)

path = "opt-1.3b-bf16"

model.save_pretrained(path)
tokenizer.save_pretrained(path)
EOT

Run:

python prep-bf16.py
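As an optional sanity check (not part of the original recipe), confirm that the saved checkpoint is indeed in bf16:

python -c 'from transformers import AutoModel; m = AutoModel.from_pretrained("opt-1.3b-bf16", torch_dtype="auto"); print(next(m.parameters()).dtype)'

This should print torch.bfloat16.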

Train from scratch on a single node with 8x 80GB A100 GPUs, on the realnewslike subset of https://huggingface.co/datasets/c4:

PYTHONPATH="src" python -m torch.distributed.run \
    --nproc_per_node=8 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=127.0.0.1 \
    --master_port=9901 \
    examples/pytorch/language-modeling/run_clm.py \
    --bf16 \
    --tf32 1 \
    --seed 42 \
    --dataset_name c4 \
    --dataset_config_name realnewslike \
    --model_name_or_path opt-1.3b-bf16 \
    --per_device_train_batch_size 6 \
    --per_device_eval_batch_size 6 \
    --gradient_accumulation_steps 2 \
    --do_train \
    --logging_steps 5 \
    --save_steps 1000 \
    --eval_steps 1000 \
    --weight_decay 0.1 \
    --num_train_epochs 1 \
    --adam_beta1 0.9 \
    --adam_beta2 0.95 \
    --learning_rate 0.0002 \
    --lr_scheduler_type linear \
    --warmup_steps 1000 \
    --report_to tensorboard \
    --output_dir saved \
    --logging_dir tb \
    --log_level warning \
    --preprocessing_num_workers 32
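The run reports to TensorBoard (--report_to tensorboard, --logging_dir tb), so the loss curves can be monitored while it trains:

tensorboard --logdir tb

Checkpoints are written to saved/ every 1000 steps (--save_steps 1000); if the run is interrupted, it can be resumed by re-running the same command with the Trainer's --resume_from_checkpoint flag pointing at the latest checkpoint directory.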

The training took about 40h.