Model:
michaelfeil/ct2fast-flan-alpaca-base
Speed up inference by 2x to 8x using int8 inference in C++
Quantized version of declare-lab/flan-alpaca-base
pip install "hf_hub_ctranslate2>=1.0.0" "ctranslate2>=3.13.0"
Checkpoint compatible with ctranslate2 and hf-hub-ctranslate2
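For context, a minimal sketch of how such a checkpoint can be produced from the original model, assuming ctranslate2's TransformersConverter API; the output directory name is illustrative, not the exact command used for this repo:

import ctranslate2

# Convert the original Hugging Face checkpoint into a CTranslate2 model
# directory with int8_float16 weights (output path is illustrative).
converter = ctranslate2.converters.TransformersConverter("declare-lab/flan-alpaca-base")
converter.convert("ct2fast-flan-alpaca-base", quantization="int8_float16")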
from hf_hub_ctranslate2 import TranslatorCT2fromHfHub, GeneratorCT2fromHfHub

model_name = "michaelfeil/ct2fast-flan-alpaca-base"
# flan-alpaca-base is an encoder-decoder (T5-based) model, so the
# translator class is used here; load it in int8 on CUDA.
model = TranslatorCT2fromHfHub(
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
)
outputs = model.generate(
    text=[
        "How do you call a fast Flan-ingo?",
        "Translate to german: How are you doing?",
    ],
    min_decoding_length=24,
    max_decoding_length=32,
    max_input_length=512,
    beam_size=5,
)
print(outputs)
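Without a GPU, the same wrapper can be pointed at the CPU; the sketch below is an assumption based on the generic CTranslate2 compute types (int8 on CPU), not a configuration documented in this repo:

from hf_hub_ctranslate2 import TranslatorCT2fromHfHub

# Hypothetical CPU variant: same class and model, int8 weights without float16.
model_cpu = TranslatorCT2fromHfHub(
    model_name_or_path="michaelfeil/ct2fast-flan-alpaca-base",
    device="cpu",
    compute_type="int8",
)
outputs = model_cpu.generate(
    text=["Translate to german: How are you doing?"],
    max_decoding_length=32,
    beam_size=2,
)
print(outputs)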
This is just a quantized version. License conditions are intended to be identical to those of the original Hugging Face repo.