DePlot Model Card

Table of contents

Introduction

Using the model

Contribution

Citation

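Introduction

The abstract of the paper states the following:

Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art (SOTA) models require at least tens of thousands of training examples, and their reasoning capabilities are still very limited, especially on complex human-written queries. This paper presents a one-shot solution to visual language reasoning. We decompose the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The key to this method is a modality conversion module, named DePlot, which translates the image of a plot or chart into a linearized table. The output of DePlot can then be used directly to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs. To obtain DePlot, we standardize the plot-to-table task by establishing unified task formats and metrics, and train DePlot end-to-end on this task. DePlot can then be used off-the-shelf together with LLMs in a plug-and-play fashion. Compared with a SOTA model finetuned on more than 28k data points, DePlot+LLM with just one-shot prompting achieves a 24.0% improvement on human-written queries from the chart QA task.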
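The second step of the decomposition above (reasoning over the translated text) amounts to placing the linearized table into an LLM prompt together with one solved example. The sketch below is only an illustration under stated assumptions, not the authors' prompting code: the function name `build_one_shot_prompt`, the prompt layout, and both example tables are hypothetical, and the tables are assumed to have already been produced by DePlot.

```python
def build_one_shot_prompt(example_table, example_question, example_answer,
                          table, question):
    """Assemble a one-shot prompt: one solved example, then the new query.

    The layout is a hypothetical illustration, not the format used in the paper.
    """
    demo = (
        f"Table:\n{example_table}\n"
        f"Question: {example_question}\n"
        f"Answer: {example_answer}"
    )
    query = f"Table:\n{table}\nQuestion: {question}\nAnswer:"
    return demo + "\n\n" + query


# Made-up tables standing in for DePlot output (step 1).
demo_table = "Year | Sales\n2020 | 10\n2021 | 15"
new_table = "Country | Share\nA | 40\nB | 60"

prompt = build_one_shot_prompt(
    demo_table, "Which year had higher sales?", "2021",
    new_table, "Which country has the larger share?",
)
print(prompt)  # this text would be sent to an LLM of your choice (step 2)
```

Keeping the prompt builder separate from the chart-to-table model reflects the plug-and-play design described in the abstract: either component can be swapped without retraining the other.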

Using the model

Converting from T5x to Hugging Face

You can use the convert_pix2struct_checkpoint_to_pytorch.py script as follows:

python convert_pix2struct_checkpoint_to_pytorch.py --t5x_checkpoint_path PATH_TO_T5X_CHECKPOINTS --pytorch_dump_path PATH_TO_SAVE --is_vqa

If you are converting a large model, run:

python convert_pix2struct_checkpoint_to_pytorch.py --t5x_checkpoint_path PATH_TO_T5X_CHECKPOINTS --pytorch_dump_path PATH_TO_SAVE --use-large --is_vqa

Once saved, you can push the converted model to the Hub with the following code:

from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

model = Pix2StructForConditionalGeneration.from_pretrained(PATH_TO_SAVE)
processor = Pix2StructProcessor.from_pretrained(PATH_TO_SAVE)

model.push_to_hub("USERNAME/MODEL_NAME")
processor.push_to_hub("USERNAME/MODEL_NAME")

Running predictions

You can run a prediction by passing an input image and a text prompt to the model, as follows:

from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor
import requests
from PIL import Image

model = Pix2StructForConditionalGeneration.from_pretrained('google/deplot')
processor = Pix2StructProcessor.from_pretrained('google/deplot')
url = "https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/5090.png"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, text="Generate underlying data table of the figure below:", return_tensors="pt")
predictions = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(predictions[0], skip_special_tokens=True))
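The decoded string printed above is a linearized table, so downstream code usually needs to split it back into rows and cells. The helper below is a minimal sketch, not part of the transformers API. It assumes that rows are separated by the literal token `<0x0A>` and cells by `|`; check this against your own model output before relying on it. The function name `parse_linearized_table` and the sample string are hypothetical.

```python
def parse_linearized_table(text, row_sep="<0x0A>", col_sep="|"):
    """Split a linearized table string into a list of rows (lists of cell strings).

    The default separators are an assumption about the output format.
    """
    rows = []
    for line in text.split(row_sep):
        line = line.strip()
        if not line:
            continue  # skip empty segments, e.g. from a trailing separator
        rows.append([cell.strip() for cell in line.split(col_sep)])
    return rows


# Made-up string in the assumed format, not real model output.
sample = "TITLE | Example chart <0x0A> Year | Value <0x0A> 2020 | 1.5 <0x0A> 2021 | 2.0"
for row in parse_linearized_table(sample):
    print(row)
```

If the separators in your output differ, pass them through the `row_sep` and `col_sep` arguments instead of editing the function.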

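Contribution

This model was originally contributed by Fangyu Liu, Julian Martin Eisenschlos et al. and added to the Hugging Face ecosystem by Younes Belkada.

Citation

If you want to cite this work, please consider citing the original paper: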

@misc{liu2022deplot,
      title={DePlot: One-shot visual language reasoning by plot-to-table translation},
      author={Liu, Fangyu and Eisenschlos, Julian Martin and Piccinno, Francesco and Krichene, Syrine and Pang, Chenxi and Lee, Kenton and Joshi, Mandar and Chen, Wenhu and Collier, Nigel and Altun, Yasemin},
      year={2022},
      eprint={2212.10505},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Authors:

Google AI

Dataset size:

1.06 GB