LXMERT: Learning Cross-Modality Encoder Representations from Transformers

Year: 2019
Journal: Conference on Empirical Methods in Natural Language Processing
Languages: English
Programming languages: Python
Input data:

image and related sentence (e.g. caption or question)

Output data:

three outputs for language, vision, and cross-modality, respectively

Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT framework to learn these vision-and-language connections. In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
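
The data flow described above can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: it drops the learned projections, multi-head attention, feed-forward sub-layers, and layer normalization of a real Transformer, and the names `attend`, `residual`, and `lxmert_like_forward` are hypothetical. It only shows how two single-modality encoders feed a cross-modality step that yields three outputs. Taking the first language position as the cross-modality summary is an assumption of this sketch.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Scaled dot-product attention with no learned projections (toy version)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, context)) for i in range(d)])
    return out


def residual(x, update):
    """Element-wise residual connection: x + update."""
    return [[a + b for a, b in zip(row, upd)] for row, upd in zip(x, update)]


def lxmert_like_forward(word_feats, object_feats):
    # Single-modality encoders: each modality attends to itself.
    lang = residual(word_feats, attend(word_feats, word_feats))
    vis = residual(object_feats, attend(object_feats, object_feats))
    # Cross-modality step: each modality attends to the other.
    lang_x = residual(lang, attend(lang, vis))
    vis_x = residual(vis, attend(vis, lang))
    # Assumption: the first language position stands in for a pooled summary.
    cross = lang_x[0]
    return lang_x, vis_x, cross


# Made-up features: 3 word vectors and 2 object vectors, each of dimension 4.
words = [[0.1, 0.2, 0.0, 0.5], [0.3, 0.1, 0.4, 0.0], [0.0, 0.6, 0.2, 0.1]]
objects = [[0.5, 0.0, 0.1, 0.2], [0.2, 0.3, 0.3, 0.0]]
lang_out, vis_out, cross_out = lxmert_like_forward(words, objects)
print(len(lang_out), len(vis_out), len(cross_out))
```

The language output keeps one vector per word, the vision output keeps one vector per object, and the cross-modality output is a single vector, matching the three outputs listed under "Output data".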
