Multimodal - Image Captioning

2023-10-16

Introduction

The previous post, Multimodal - CLIP, covered how CLIP matches text with images. This post covers how to generate a caption from an image, i.e. Image Caption, which is an image-to-text task.

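The overall pipeline uses the transformer/vision_encoder_decoder architecture: ViT serves as the image encoder and gpt2 as the text decoder. Other models can be swapped in for either side. The overall architecture is shown in the figure below.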
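To make the encoder/decoder wiring concrete, here is a minimal sketch, assuming the Hugging Face `transformers` library and PyTorch are installed. It builds a deliberately tiny, randomly initialised ViT + GPT-2 pair so nothing has to be downloaded. All sizes, the helper names `num_encoder_tokens`, `build_tiny_vit_gpt2` and `demo_forward`, and the dummy inputs are illustrative assumptions, not the setup used in the referenced projects.

```python
def num_encoder_tokens(image_size: int, patch_size: int) -> int:
    """Number of tokens ViT produces for a square image: one per patch plus [CLS]."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


def build_tiny_vit_gpt2():
    """Build an untrained ViT-encoder / GPT-2-decoder captioning model (toy sizes)."""
    # Imported lazily so the helper above works without the heavy dependencies.
    from transformers import (
        GPT2Config,
        ViTConfig,
        VisionEncoderDecoderConfig,
        VisionEncoderDecoderModel,
    )

    # Hypothetical toy sizes; real checkpoints are far larger.
    encoder_config = ViTConfig(
        image_size=32,
        patch_size=8,
        hidden_size=32,
        num_hidden_layers=1,
        num_attention_heads=2,
        intermediate_size=64,
    )
    decoder_config = GPT2Config(
        vocab_size=100, n_positions=32, n_embd=32, n_layer=1, n_head=2
    )

    # This helper marks the decoder as a decoder and enables cross-attention,
    # which is how GPT-2 gets to attend to the ViT image tokens.
    config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
        encoder_config, decoder_config
    )
    return VisionEncoderDecoderModel(config=config)


def demo_forward():
    """Run one forward pass on random data and return the logits shape."""
    import torch

    model = build_tiny_vit_gpt2()
    model.eval()
    pixel_values = torch.randn(1, 3, 32, 32)            # one fake RGB image
    decoder_input_ids = torch.randint(0, 100, (1, 4))   # four fake caption tokens
    with torch.no_grad():
        outputs = model(
            pixel_values=pixel_values, decoder_input_ids=decoder_input_ids
        )
    # Logits are (batch, caption length, vocabulary size).
    return tuple(outputs.logits.shape)
```

Calling `demo_forward()` pushes the image through ViT, lets GPT-2 cross-attend to the resulting image tokens, and returns per-token vocabulary logits. Because the weights are random, the model has to be trained on image-caption pairs, or replaced with a pretrained checkpoint, before it produces meaningful captions.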

References

1. zero_nlp vit-gpt2-image-chinese-captioning
2. The Illustrated Image Captioning using transformers
