
VisualGLM-6B

Github Repo • Twitter • [GLM@ACL 22] [GitHub] • [GLM-130B@ICLR 23] [GitHub]

Join our Slack and WeChat

Introduction

VisualGLM-6B is an open-source multimodal dialogue language model that supports images, Chinese, and English. The language model is based on ChatGLM-6B with 6.2 billion parameters; the visual part uses the Q-Former from BLIP2 to bridge the visual model and the language model, for a total of 7.8 billion parameters.

VisualGLM-6B is pre-trained on 30 million high-quality Chinese image-text pairs from the CogView dataset and 300 million carefully selected English image-text pairs, with Chinese and English weighted equally during training. This training approach aligns visual information well with the semantic space of ChatGLM. In the subsequent fine-tuning stage, the model is trained on long visual question-answering data to generate answers that align with human preferences.

Software Dependencies

pip install "SwissArmyTransformer>=0.3.6" "torch>1.10.0" torchvision "transformers>=4.27.1" cpm_kernels
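
If you are unsure whether your environment already satisfies these requirements, a quick check along the following lines can help (a minimal sketch; the names are the pip distribution names from the command above):

from importlib.metadata import version, PackageNotFoundError

# Print the installed version of each dependency so it can be compared
# against the minimum versions in the pip command above.
for name in ["SwissArmyTransformer", "torch", "torchvision", "transformers", "cpm_kernels"]:
    try:
        print(f"{name}: {version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed")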

Usage

You can generate conversations using the VisualGLM-6B model by calling the following code:

>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
>>> model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True).half().cuda()
>>> image_path = "your image path"
>>> response, history = model.chat(tokenizer, image_path, "描述这张图片。", history=[])
>>> print(response)
>>> response, history = model.chat(tokenizer, image_path, "这张图片可能是在什么场所拍摄的?", history=history)
>>> print(response)
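
The snippet below wraps the same calls into a small standalone script, which can be more convenient outside of an interactive session (a minimal sketch; the image path and questions are placeholders, and a CUDA GPU is assumed, as in the example above):

from transformers import AutoModel, AutoTokenizer

def main():
    # Load the tokenizer and model; trust_remote_code is required because the
    # model ships custom modeling code in its repository.
    tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
    model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True).half().cuda()
    model = model.eval()

    image_path = "your image path"  # placeholder: point this at a local image file
    history = []
    # Ask a sequence of questions about the same image, passing the returned
    # history back in so follow-up questions keep the conversational context.
    for question in ["描述这张图片。", "这张图片可能是在什么场所拍摄的?"]:
        response, history = model.chat(tokenizer, image_path, question, history=history)
        print(f"Q: {question}\nA: {response}\n")

if __name__ == "__main__":
    main()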

For more instructions, including how to run CLI and web demos, and model quantization, please refer to our Github Repo.

License

The code in this repository is open source under the Apache-2.0 license. The use of the VisualGLM-6B model's weights is subject to the Model License.

Citation

If you find our work helpful, please consider citing the following papers:

@inproceedings{du2022glm,
  title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
  author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={320--335},
  year={2022}
}
@article{ding2021cogview,
  title={Cogview: Mastering text-to-image generation via transformers},
  author={Ding, Ming and Yang, Zhuoyi and Hong, Wenyi and Zheng, Wendi and Zhou, Chang and Yin, Da and Lin, Junyang and Zou, Xu and Shao, Zhou and Yang, Hongxia and others},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  pages={19822--19835},
  year={2021}
}