Robustly optimized BERT approach
concatenation of two segments (sequences of tokens)
Segments usually consist of more than one natural sentence. The two segments are presented as a single input sequence to (RoBERTa) with special tokens delimiting them (s. BERT)
-> pair of sentences
tokens, segment and position embeddings
We present a replication study of BERT pretraining that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it.