Introduction
BART stands for Bidirectional and Auto-Regressive Transformers. It is a denoising autoencoder built as a standard sequence-to-sequence model, which makes it applicable to a very wide range of end tasks. Pre-training has two stages:
(1) Text is corrupted with an arbitrary noising function, and
(2) A sequence-to-sequence model is learned to reconstruct the original text.
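As a concrete illustration of these two stages, the snippet below uses the Hugging Face transformers library: a sentence is corrupted with a <mask> token and the public facebook/bart-large checkpoint reconstructs it. This is an inference-time sketch of the idea, not the authors' pre-training code.

```python
# Minimal sketch of "corrupt, then reconstruct" with a pre-trained BART.
# Assumes the Hugging Face `transformers` library and the public
# `facebook/bart-large` checkpoint; this is not the original training code.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# Stage (1): corrupt the text with a noising function (here: mask a word).
corrupted = "BART is a denoising <mask> for pretraining sequence-to-sequence models."
inputs = tokenizer(corrupted, return_tensors="pt")

# Stage (2): the sequence-to-sequence model reconstructs the original text.
output_ids = model.generate(**inputs, num_beams=4, max_length=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```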
BART Sequence-to-Sequence
BART has both an encoder that is very much like BERT (Bidirectional Encoder Representations from Transformers) and a decoder that is like the Generative Pre-trained Transformer (GPT), essentially getting the best of both worlds.
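The transformers library exposes these two halves separately, so a small sketch can make the split concrete: the code below pushes a sentence through BART's bidirectional encoder and then through its autoregressive decoder, which cross-attends to the encoder states. Checkpoint name and inputs are illustrative.

```python
# Sketch of BART's two halves via `transformers` (assumes `facebook/bart-base`).
from transformers import BartTokenizer, BartModel

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartModel.from_pretrained("facebook/bart-base")

inputs = tokenizer("BART pairs a BERT-like encoder with a GPT-like decoder.",
                   return_tensors="pt")

# The encoder attends bidirectionally over the whole (possibly corrupted) input.
encoder_out = model.encoder(input_ids=inputs["input_ids"],
                            attention_mask=inputs["attention_mask"])

# The decoder attends causally (left-to-right) and cross-attends to the encoder.
decoder_out = model.decoder(input_ids=inputs["input_ids"],
                            encoder_hidden_states=encoder_out.last_hidden_state,
                            encoder_attention_mask=inputs["attention_mask"])

print(encoder_out.last_hidden_state.shape, decoder_out.last_hidden_state.shape)
```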
Figure 1 (c) from the BART paper
The encoder is trained with a denoising objective: it receives text in which a random subset of the words has been masked out, which is very similar to BERT. The decoder, on the other hand, reproduces the original sequence token by token, autoregressively, which is similar to GPT.
BART is trained by corrupting documents and then optimizing the reconstruction loss: the cross-entropy between the decoder’s output and the original document.
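A hedged sketch of this reconstruction loss with the transformers library: passing labels (the original text) alongside the corrupted input makes the model return exactly this token-level cross-entropy. Checkpoint name and sentences are illustrative.

```python
# Sketch of the reconstruction (cross-entropy) loss; not the paper's training code.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

original = "BART is trained to reconstruct the original document."
corrupted = "BART is trained to <mask> the original document."

inputs = tokenizer(corrupted, return_tensors="pt")
labels = tokenizer(original, return_tensors="pt").input_ids

# With `labels`, the forward pass returns the cross-entropy between the
# decoder's output distribution and the original document's tokens.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()   # one gradient step would follow during pre-training
print(float(outputs.loss))
```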
Several corruption schemes are used during pre-training of BART:
(1) Token Masking: random tokens are sampled and replaced with a mask element, as in BERT.
(2) Token Deletion: random tokens are deleted from the input, so the model must also decide which positions are missing.
(3) Text Infilling: text spans (with lengths drawn from a Poisson distribution) are each replaced with a single mask token.
(4) Sentence Permutation: the document is split into sentences, which are shuffled into a random order.
(5) Document Rotation: a token is chosen uniformly at random and the document is rotated so that it begins with that token.
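The toy snippet below (my own illustration, not the paper’s implementation) sketches two of these schemes, token masking and sentence permutation, on plain Python strings.

```python
# Toy noising functions illustrating two corruption schemes; illustrative only.
import random

def token_masking(tokens, mask_token="<mask>", p=0.15):
    """Replace a random ~p fraction of tokens with a mask element (BERT-style)."""
    return [mask_token if random.random() < p else tok for tok in tokens]

def sentence_permutation(document):
    """Split a document into sentences and shuffle them into a random order."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."

print(token_masking("BART is a denoising autoencoder for text".split()))
print(sentence_permutation("First sentence. Second sentence. Third sentence."))
```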
Machine Translation using BART:
To use BART for machine translation, the authors replace BART’s encoder embedding layer with a new, randomly initialized encoder. The model is then trained end-to-end, which teaches the new encoder to map foreign words into an input that BART can de-noise into English.
The new encoder can also use a separate vocabulary from the original BART model. The source encoder is trained in two steps, in both cases backpropagating the cross-entropy loss from the output of the BART model. In the first step, most BART parameters are frozen and only the randomly initialized source encoder, the BART positional embeddings, and the self-attention input projection matrix of BART’s first encoder layer are updated. In the second step, all model parameters are trained for a small number of iterations.
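The sketch below illustrates this idea in simplified form with the transformers library: only the encoder’s token embeddings are swapped for a new, randomly initialized embedding over a hypothetical foreign-language vocabulary (the paper adds a small new encoder rather than just embeddings), and the first step trains only the new parameters while the rest of BART stays frozen.

```python
# Simplified sketch of adapting BART for translation; the vocabulary size and
# the "embeddings only" swap are assumptions for illustration, not the exact setup.
import torch.nn as nn
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

FOREIGN_VOCAB_SIZE = 32_000  # hypothetical size of the new source vocabulary
new_source_embed = nn.Embedding(FOREIGN_VOCAB_SIZE, model.config.d_model)
model.model.encoder.embed_tokens = new_source_embed  # replace encoder embeddings

# Step 1: freeze the original BART parameters ...
for param in model.parameters():
    param.requires_grad = False
# ... and train only the newly added source-side embeddings.
for param in new_source_embed.parameters():
    param.requires_grad = True

# Step 2 (later): unfreeze everything and train all parameters for a small
# number of further iterations.
# for param in model.parameters():
#     param.requires_grad = True
```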
Conclusion:
BART is a pre-training approach that learns to map corrupted documents back to the original documents. It achieves performance similar to RoBERTa (Robustly Optimized BERT Pre-training Approach) on discriminative tasks while setting new state-of-the-art results on numerous text generation tasks. Future work could explore new methods for corrupting documents for pre-training, perhaps tailoring them to specific end tasks.
Some of the important use cases for the BART model are Text Summarization, Abstractive Question Answering, Language Translation, etc.
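As a brief example of one of these tasks, the snippet below runs abstractive summarization with the publicly available facebook/bart-large-cnn checkpoint (BART fine-tuned on CNN/DailyMail) through the transformers pipeline API.

```python
# Summarization with a BART checkpoint fine-tuned on CNN/DailyMail.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = ("BART is a denoising autoencoder for pretraining sequence-to-sequence "
           "models. It is trained by corrupting text with an arbitrary noising "
           "function and learning a model to reconstruct the original text.")
print(summarizer(article, max_length=40, min_length=10)[0]["summary_text"])
```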