LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Venue: Conference on Empirical Methods in Natural Language Processing (EMNLP)
Programming languages: Python
Inputs: an image and a related sentence (e.g., a caption or a question)
Outputs: three representations, for language, vision, and cross-modality, respectively
Project website: https://github.com/airsplay/lxmert
Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT framework to learn these vision-and-language connections. In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
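The three-encoder layout above can be illustrated with a minimal, hypothetical sketch: each modality is first encoded with self-attention, then a cross-modality step lets each modality attend to the other, yielding language, vision, and pooled cross-modality outputs. This is a simplified NumPy illustration of the dataflow under assumed toy dimensions, not the paper's actual implementation (which stacks full Transformer layers with feed-forward sublayers and learned parameters).

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention with a numerically stable softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def lxmert_sketch(lang, vis):
    """Sketch of LXMERT's encoder dataflow (toy version, no learned weights).

    lang: (n_tokens, d) word embeddings for the sentence
    vis:  (n_objects, d) features for detected objects
    """
    # Single-modality encoders: self-attention within each modality.
    l = attention(lang, lang, lang)   # language encoder
    v = attention(vis, vis, vis)      # object relationship encoder
    # Cross-modality encoder: each modality attends to the other.
    l_cross = attention(l, v, v)      # language attends to vision
    v_cross = attention(v, l, l)      # vision attends to language
    # Three outputs: language, vision, and a pooled cross-modality vector.
    cross = np.concatenate([l_cross.mean(axis=0), v_cross.mean(axis=0)])
    return l_cross, v_cross, cross

# Usage with toy shapes: 3 word tokens, 5 detected objects, hidden size 4.
rng = np.random.default_rng(0)
l_out, v_out, x_out = lxmert_sketch(rng.normal(size=(3, 4)), rng.normal(size=(5, 4)))
```

In the real model the cross-modality encoder is a stack of layers, each containing cross-attention, self-attention, and feed-forward sublayers; the sketch keeps only the attention pattern that connects the two modalities.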