Bidirectional and Auto Regressive Transformers

Posted By: Ashish Bhatnagar | 19th September 2021


BART is a pre-training approach that combines Bidirectional and Auto-Regressive Transformers. It is a denoising autoencoder built on a sequence-to-sequence model, which makes it applicable to a very wide range of end tasks. Pre-training has two stages:
(1) Text is corrupted with an arbitrary noising function, and

(2) A sequence-to-sequence model is learned to reconstruct the original text.

BART Sequence-to-Sequence

BART has both an encoder, which is very much like BERT (Bidirectional Encoder Representations from Transformers), and a decoder, which is like GPT (Generative Pre-trained Transformer), essentially getting the best of both worlds.


Figure 1(c) from the BART paper


The encoder reads the corrupted text bidirectionally, much as BERT reads text in which a random subset of the words has been masked out. The decoder, on the other hand, reproduces the original sequence token by token from left to right, which is similar to an autoregressive model such as GPT.
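The difference between the two halves can be pictured as a difference in attention masks. The following is a minimal sketch in plain Python, not BART's actual implementation; the function names are made up for illustration.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    print(show(bidirectional_mask(4)))
    print()
    print(show(causal_mask(4)))
```

The causal mask is what forces the decoder to generate one token at a time: when predicting a token it can only see the tokens that came before it, plus the encoder's view of the corrupted input.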

BART is trained by corrupting documents and then optimizing the reconstruction loss, that is, the cross-entropy between the decoder's output and the original document.
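As a rough illustration of that loss, the sketch below computes the average token-level cross-entropy for a toy example. The function name, the toy vocabulary of four tokens, and the probabilities are assumptions made for this example; a real model would produce the distributions from its decoder.

```python
import math


def reconstruction_loss(predicted_probs, target_ids):
    """Average token-level cross-entropy (negative log-likelihood).

    predicted_probs[t] is the decoder's probability distribution over the
    vocabulary at step t; target_ids[t] is the index of the original token.
    """
    if not target_ids or len(predicted_probs) != len(target_ids):
        raise ValueError("need one non-empty distribution per target token")
    nll = [-math.log(dist[tok]) for dist, tok in zip(predicted_probs, target_ids)]
    return sum(nll) / len(nll)


if __name__ == "__main__":
    # Toy setting: vocabulary of 4 tokens, 3 decoding steps.
    probs = [
        [0.70, 0.10, 0.10, 0.10],
        [0.10, 0.80, 0.05, 0.05],
        [0.25, 0.25, 0.25, 0.25],
    ]
    original = [0, 1, 3]  # indices of the uncorrupted tokens
    print(reconstruction_loss(probs, original))
```

The loss is lowest when the decoder puts all of its probability on the original token at every step, which is exactly what reconstructing the uncorrupted document means.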

Several corruption schemes are used during the pre-training of BART:

  1. Token Masking: The same strategy used in BERT. Random tokens are sampled and replaced with [MASK] elements.
  2. Token Deletion: Random tokens are deleted from the input. Unlike token masking, the model must decide which positions are missing inputs.
  3. Text Infilling: A number of text spans of variable length are each replaced with a single [MASK] token, so the model must also predict how many tokens are missing from each span.
  4. Sentence Permutation: A document is divided into sentences based on full stops, and these sentences are shuffled in random order.
  5. Document Rotation: A token is chosen at random and the document is rotated so that it begins with that token. This task trains the model to identify the start of the document.
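The five schemes above can be sketched on a list of tokens as follows. This is a simplified illustration rather than the code used to train BART: the function names and the `rate` parameter are assumptions, and `text_infilling` replaces one explicitly given span instead of sampling several spans of random length.

```python
import random

MASK = "[MASK]"


def token_masking(tokens, rate, rng):
    """Replace each token with [MASK] independently with probability `rate`."""
    return [MASK if rng.random() < rate else t for t in tokens]


def token_deletion(tokens, rate, rng):
    """Drop each token independently with probability `rate`."""
    return [t for t in tokens if rng.random() >= rate]


def text_infilling(tokens, start, length):
    """Replace tokens[start:start + length] with a single [MASK].

    A zero-length span inserts a [MASK] without removing any token.
    """
    return tokens[:start] + [MASK] + tokens[start + length:]


def sentence_permutation(sentences, rng):
    """Return the sentences in a random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def document_rotation(tokens, rng):
    """Rotate the document so it starts at a randomly chosen token."""
    if not tokens:
        return []
    pivot = rng.randrange(len(tokens))
    return tokens[pivot:] + tokens[:pivot]


if __name__ == "__main__":
    rng = random.Random(0)
    doc = "the cat sat on the mat".split()
    print(token_masking(doc, 0.3, rng))
    print(token_deletion(doc, 0.3, rng))
    print(text_infilling(doc, 1, 3))
    print(sentence_permutation(["A b.", "C d.", "E f."], rng))
    print(document_rotation(doc, rng))
```

Note how `text_infilling` shortens the sequence: a three-token span collapses into one [MASK], so the output gives no direct hint about how many tokens were removed.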


Machine Translation using BART:

For machine translation, the BART authors replaced BART's encoder embedding layer with a new, randomly initialized encoder. The model is then trained end-to-end, which trains the new encoder to map foreign words into an input that BART can de-noise to English.

The new encoder can also use a separate vocabulary from the original BART model. The authors trained the source encoder in two steps, in both cases backpropagating the cross-entropy loss from the output of the BART model.

  1. In the first step, they froze most of the BART parameters and only updated the randomly initialized source encoder, the BART positional embeddings, and the self-attention input projection matrix of BART's first encoder layer.
  2. In the second step, they trained all model parameters for a small number of iterations.
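The two-step schedule amounts to choosing which parameters receive gradient updates in each step. The sketch below shows that selection logic only. All parameter names are hypothetical and invented for this example, and treating both encoder and decoder positional embeddings as trainable in step one is an assumption about what "the BART positional embeddings" covers.

```python
# Hypothetical parameter names, invented for illustration only.
PARAMS = [
    "source_encoder.layer0.weight",           # new, randomly initialized encoder
    "bart.encoder.positional_embedding",
    "bart.encoder.layer0.self_attn.in_proj",  # first encoder layer's input projection
    "bart.encoder.layer0.ffn.weight",
    "bart.encoder.layer1.self_attn.in_proj",
    "bart.decoder.positional_embedding",
    "bart.decoder.layer0.self_attn.in_proj",
]

FIRST_LAYER_IN_PROJ = "bart.encoder.layer0.self_attn.in_proj"


def trainable(name, step):
    """Return True if parameter `name` is updated during training step 1 or 2."""
    if step == 2:
        return True  # step 2: every parameter is trained
    if step != 1:
        raise ValueError("step must be 1 or 2")
    return (
        name.startswith("source_encoder.")
        or name.endswith("positional_embedding")
        or name == FIRST_LAYER_IN_PROJ
    )


if __name__ == "__main__":
    for step in (1, 2):
        print(f"step {step}:")
        for name in PARAMS:
            state = "train" if trainable(name, step) else "frozen"
            print(f"  {state:6s} {name}")
```

In a framework such as PyTorch, the same decision would typically be applied by toggling `requires_grad` on each parameter before building the optimizer.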



BART is a pre-training approach that learns to map corrupted documents back to the original document. It achieves performance similar to RoBERTa (Robustly Optimized BERT Pretraining Approach) on discriminative tasks while achieving new state-of-the-art results on a number of text generation tasks. Future work could explore new methods for corrupting documents for pre-training, perhaps tailoring them to specific end tasks.

Some important use cases for the BART model are text summarization, abstractive question answering, and language translation.


