[ICML 2022] VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
Introduction

Two mainstream architectures are widely used in previous work.

Dual-stream: images and text are encoded separately, and modality interaction is handled by the cosine similarity of the image and text feature vectors. This architecture is effective for retrieval tasks, especially over large collections of images and text.
Representative models: CLIP, ALIGN
Limitation: its shallow interaction is not enough..
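
As a rough illustration of the dual-stream scoring described above, the following minimal sketch ranks images for a text query by cosine similarity. The feature vectors are made-up toy values standing in for encoder outputs, and `cosine_similarity` and `rank_images` are hypothetical helper names; this is not code from VLMo, CLIP, or ALIGN.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_images(text_vec, image_vecs):
    """Return image indices sorted from most to least similar to the text."""
    scores = [cosine_similarity(text_vec, img) for img in image_vecs]
    return sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors (assumed values); in practice these would come from
# separate image and text encoders.
text_vec = [1.0, 0.0, 1.0]
image_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the text vector
    [2.0, 0.0, 2.0],  # same direction as the text vector
    [1.0, 1.0, 0.0],  # partially aligned
]

ranking = rank_images(text_vec, image_vecs)
print(ranking)
```

Because each image vector can be computed once and stored, only the text vector and the similarity scores need to be computed at query time, which is why this design suits retrieval over large collections.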
Paper Review/Multimodal
2024. 4. 6. 18:57