Model:
THUDM/visualglm-6b
Github Repo • Twitter • [GLM@ACL 22] [GitHub] • [GLM-130B@ICLR 23] [GitHub]
VisualGLM-6B is an open-source multimodal dialogue language model that supports images as well as Chinese and English. The language model is based on ChatGLM-6B, with 6.2 billion parameters. The vision component uses a BLIP2-Qformer to bridge the visual encoder and the language model, bringing the total to 7.8 billion parameters.
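To make the bridging idea concrete, here is a minimal PyTorch sketch of the general BLIP2-style Q-Former pattern: a fixed set of learnable query tokens cross-attends to frozen image features, and the distilled queries are projected into the language model's embedding space as prefix "visual tokens". The class name `QFormerBridge`, the dimensions, and the single cross-attention layer are illustrative assumptions, not the actual VisualGLM-6B implementation (which stacks multiple Q-Former layers).

```python
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    """Illustrative BLIP2-style bridge (names/dims are assumptions,
    not the real VisualGLM-6B code)."""

    def __init__(self, num_queries=32, vision_dim=1408, qformer_dim=768, lm_dim=4096):
        super().__init__()
        # Learnable query tokens, shared across all images.
        self.queries = nn.Parameter(torch.randn(1, num_queries, qformer_dim) * 0.02)
        # Project frozen ViT patch features into the Q-Former's width.
        self.vision_proj = nn.Linear(vision_dim, qformer_dim)
        # Cross-attention: queries attend to the image patch features.
        self.cross_attn = nn.MultiheadAttention(qformer_dim, num_heads=12, batch_first=True)
        self.norm = nn.LayerNorm(qformer_dim)
        # Map the distilled queries into the LM's token-embedding space.
        self.lm_proj = nn.Linear(qformer_dim, lm_dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, vision_dim) from a frozen ViT.
        kv = self.vision_proj(image_feats)
        q = self.queries.expand(image_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, kv, kv)
        attended = self.norm(attended + q)
        # (batch, num_queries, lm_dim): prefix "visual tokens" for the LM.
        return self.lm_proj(attended)

bridge = QFormerBridge()
dummy_patches = torch.randn(2, 257, 1408)   # e.g. a ViT with a CLS token
visual_tokens = bridge(dummy_patches)
print(visual_tokens.shape)                  # torch.Size([2, 32, 4096])
```

The key design point this illustrates is that only a small, fixed number of query outputs (here 32) are handed to the language model, regardless of how many image patches the vision encoder produces.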
VisualGLM-6B is pre-trained on 30 million high-quality Chinese image-text pairs from the CogView dataset and 300 million carefully filtered English image-text pairs, with Chinese and English weighted equally during training. This approach effectively aligns visual information with the semantic space of ChatGLM. In the subsequent fine-tuning stage, the model is trained on long visual question-answering data to generate answers that align with human preferences.
```shell
pip install "SwissArmyTransformer>=0.3.6" "torch>1.10.0" torchvision "transformers>=4.27.1" cpm_kernels
```
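As a quick sanity check (not part of the official instructions), you can verify that the version requirements above are met and that a GPU is visible before loading the model; the package-name lookups here assume the distribution names used on PyPI:

```python
import torch
from importlib.metadata import version

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
for pkg in ("SwissArmyTransformer", "transformers"):
    print(pkg, version(pkg))
```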
You can generate conversations using the VisualGLM-6B model by calling the following code:
```python
>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
>>> model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True).half().cuda()
>>> image_path = "your image path"
>>> # "描述这张图片。" = "Describe this image."
>>> response, history = model.chat(tokenizer, image_path, "描述这张图片。", history=[])
>>> print(response)
>>> # "这张图片可能是在什么场所拍摄的?" = "Where might this photo have been taken?"
>>> response, history = model.chat(tokenizer, image_path, "这张图片可能是在什么场所拍摄的?", history=history)
>>> print(response)
```
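Note that `chat` returns the updated `history` along with the response; passing `history=history` into the next call is what lets the follow-up question refer back to the same image and the earlier answer.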
For more instructions, including how to run the CLI and web demos and how to quantize the model, please refer to our GitHub repo.
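If GPU memory is tight, models in the ChatGLM family expose a `quantize` method on the loaded model. The sketch below follows that documented pattern for INT8 weights, which roughly halves GPU memory relative to FP16; confirm the exact API for VisualGLM-6B against the GitHub repo:

```python
>>> # Hedged sketch following the ChatGLM-family quantization pattern:
>>> # load in INT8 instead of FP16 (see the GitHub repo for details).
>>> model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True).quantize(8).half().cuda()
```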
The code in this repository is open source under the Apache-2.0 license. The use of the VisualGLM-6B model's weights is subject to the Model License.
If you find our work helpful, please consider citing the following papers:
```bibtex
@inproceedings{du2022glm,
  title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
  author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={320--335},
  year={2022}
}
```
```bibtex
@article{ding2021cogview,
  title={Cogview: Mastering text-to-image generation via transformers},
  author={Ding, Ming and Yang, Zhuoyi and Hong, Wenyi and Zheng, Wendi and Zhou, Chang and Yin, Da and Lin, Junyang and Zou, Xu and Shao, Zhou and Yang, Hongxia and others},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  pages={19822--19835},
  year={2021}
}
```