LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Figure: The LXMERT model takes two inputs, an image and a related sentence (e.g., a caption or a question), and produces three outputs for language, vision, and cross-modality representations, respectively.
Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT framework to learn these vision-and-language connections. In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
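To make the three-encoder layout concrete, the following PyTorch sketch wires up the idea under stated assumptions: a language encoder and an object-relationship encoder apply self-attention within each modality, and a cross-modality stage lets each modality attend to the other. This is an illustrative sketch, not the authors' implementation; all names (ThreeEncoderSketch, word_embs, obj_feats), layer counts, and hidden sizes are placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn


class ThreeEncoderSketch(nn.Module):
    """Minimal sketch of a three-encoder layout: a language encoder, an
    object-relationship encoder, and a cross-modality stage. Layer counts
    and hidden sizes are illustrative placeholders, not the paper's config."""

    def __init__(self, hidden=768, heads=12, n_lang=2, n_vis=2, n_cross=2):
        super().__init__()
        # Language encoder: self-attention over word embeddings.
        self.lang_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, heads, batch_first=True),
            num_layers=n_lang)
        # Object-relationship encoder: self-attention over object-region features.
        self.obj_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, heads, batch_first=True),
            num_layers=n_vis)
        # Cross-modality stage: each modality attends to the other.
        self.lang_to_vis = nn.ModuleList(
            [nn.MultiheadAttention(hidden, heads, batch_first=True) for _ in range(n_cross)])
        self.vis_to_lang = nn.ModuleList(
            [nn.MultiheadAttention(hidden, heads, batch_first=True) for _ in range(n_cross)])

    def forward(self, word_embs, obj_feats):
        # word_embs: (batch, n_tokens, hidden); obj_feats: (batch, n_objects, hidden)
        lang = self.lang_encoder(word_embs)
        vis = self.obj_encoder(obj_feats)
        for l2v, v2l in zip(self.lang_to_vis, self.vis_to_lang):
            # Queries come from one modality, keys/values from the other
            # (residual connections added; both updates use the previous features).
            new_lang = lang + l2v(lang, vis, vis, need_weights=False)[0]
            new_vis = vis + v2l(vis, lang, lang, need_weights=False)[0]
            lang, vis = new_lang, new_vis
        # Language, vision, and (via the attended features) cross-modality outputs.
        return lang, vis


# Example shapes: 20 word tokens and 36 object regions, both projected to the hidden size.
model = ThreeEncoderSketch()
lang_out, vis_out = model(torch.randn(2, 20, 768), torch.randn(2, 36, 768))
```

The split into single-modality encoders followed by a cross-attention stage mirrors the description above: each modality is first contextualized on its own, and only then are the alignments and relationships between vision and language modeled.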