Wav2Vec: A Powerful Model for Speech-to-Text Representations

Posted By: Ashish Bhatnagar | 5th June 2021



The idea discussed here is to learn speech representations from raw audio without using any labeled data.


Given, say, 100 hours of unlabeled audio, how can we use it to learn representations that are useful for speech recognition tasks?


Wav2Vec 2.0 masks the speech input in the latent space rather than in the input space, and solves a contrastive task defined over a quantization of the latent representations, which are learned jointly.

The quantized representations serve as the targets for training the model. The most appealing result is that, using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data, the model still achieves a 4.8/8.2 WER (Word Error Rate).

This demonstrates the feasibility of speech recognition with limited amounts of labeled data.



Source: https://venturebeat.com/2020/06/23/facebook-claims-wav2vec-2-0-tops-speech-recognition-performance-with-10-minutes-of-labeled-data/


Model Architecture: 

The model takes raw audio as its input, which can be read directly from a .wav file as a one-dimensional sequence of numbers. The raw waveform X is passed to a feature encoder, a stack of CNN (Convolutional Neural Network) layers, which outputs the latent speech representations Z = z1, . . . , zT for T time steps. Z is then fed into a Transformer, which builds contextual representations c1, . . . , cT that capture information from the entire sequence. In parallel, the output of the feature encoder is discretized to qt by a quantization module, and these quantized vectors represent the targets in the self-supervised objective.
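As a minimal sketch of the first step, the snippet below reads a .wav file into a one-dimensional list of floats using only the Python standard library. The function name `read_wav_mono16` is made up for illustration, and the sketch assumes 16-bit mono PCM audio; other formats would need extra handling.

```python
import array
import sys
import wave


def read_wav_mono16(path):
    """Read a 16-bit mono PCM .wav file into a list of floats in [-1, 1)."""
    with wave.open(path, "rb") as wav_file:
        if wav_file.getnchannels() != 1 or wav_file.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        sample_rate = wav_file.getframerate()
        raw_bytes = wav_file.readframes(wav_file.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw_bytes)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    # Scale integers to floats: this is the raw waveform X.
    return [s / 32768.0 for s in samples], sample_rate
```

The returned list is the 1-dimensional sequence that the feature encoder consumes.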


Compared with earlier versions of the model, this one uses a Transformer that builds context representations over continuous speech representations. Self-attention captures dependencies over the entire sequence of latent representations end-to-end.

Let us now look at the key components of the model in more detail: the quantization module, the feature encoder, and contextualized representations with Transformers.


1) Quantization module: 

Z = (z1, z2, …, zn) is a sequence of continuous vectors. The quantization module replaces each zt with a discrete counterpart qt drawn from a codebook, which acts as a learned dictionary of vectors, giving q = (q1, q2, …, qn).
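The codebook lookup can be sketched as follows. This is a deliberate simplification: it picks the nearest codebook entry by squared distance, whereas the Wav2Vec 2.0 paper selects entries from several codebooks with a Gumbel softmax so that the choice stays differentiable. The names `quantize` and `squared_distance`, and the toy codebook values, are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents, codebook):
    """Replace each continuous vector z_t with its closest codebook entry q_t."""
    indices = []
    quantized = []
    for z in latents:
        best = min(range(len(codebook)),
                   key=lambda i: squared_distance(z, codebook[i]))
        indices.append(best)
        quantized.append(codebook[best])
    return indices, quantized


# Toy example: a codebook with three 2-dimensional entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]]
indices, quantized = quantize(latents, codebook)
print(indices)  # [1, 0, 2]
```

Each continuous vector is now represented by one entry of a finite dictionary, which is what makes it usable as a classification-style target.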


2) Feature encoder:

The feature encoder consists of several blocks, each containing a temporal convolution followed by layer normalization and a GELU (Gaussian Error Linear Unit) activation function. The raw audio waveform, which is the input to the encoder, is normalized to zero mean and unit variance. The overall stride of the encoder determines the number of time steps T that are input to the Transformer.
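To make the link between stride and T concrete, here is a small sketch that normalizes a waveform and computes how many time steps a stack of unpadded convolutions produces. The kernel widths and strides in `CONV_LAYERS` are the ones reported in the Wav2Vec 2.0 paper, not something stated in this post, so treat them as an assumption; the function names are illustrative.

```python
import math

# Assumed (kernel_width, stride) per block, as reported in the paper.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def normalize(waveform, eps=1e-7):
    """Scale a waveform to zero mean and unit variance."""
    n = len(waveform)
    mean = sum(waveform) / n
    variance = sum((x - mean) ** 2 for x in waveform) / n
    std = math.sqrt(variance + eps)
    return [(x - mean) / std for x in waveform]


def overall_stride(layers=CONV_LAYERS):
    """Product of all strides: input samples consumed per output step."""
    total = 1
    for _, stride in layers:
        total *= stride
    return total


def num_time_steps(num_samples, layers=CONV_LAYERS):
    """Number of output steps T for convolutions without padding."""
    length = num_samples
    for kernel, stride in layers:
        length = (length - kernel) // stride + 1
    return length


print(overall_stride())       # 320
print(num_time_steps(16000))  # 49
```

Under these assumptions, one second of 16 kHz audio becomes 49 latent vectors, that is, roughly one vector every 20 ms.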


3) Contextualized representations with Transformers:

The output of the feature encoder is fed to a context network, which follows the Transformer architecture. Instead of fixed positional embeddings that encode absolute positional information, the authors use a convolutional layer that acts as a relative positional embedding. The output of the convolution, followed by a GELU, is added to the inputs, and layer normalization is then applied.
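The convolutional positional embedding can be sketched in plain Python as below. It is only an illustration under simplifying assumptions: a single small kernel is shared by all channels and given by hand, while the real model learns a much wider grouped convolution, and the layer normalization here has no learnable scale or bias. All function names are made up for this example.

```python
import math


def gelu(x):
    """Exact GELU activation using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer_norm(vector, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]


def add_conv_positions(frames, kernel):
    """frames: list of T feature vectors; kernel: odd-length list of weights."""
    num_steps, dim, half = len(frames), len(frames[0]), len(kernel) // 2
    output = []
    for t in range(num_steps):
        position = []
        for d in range(dim):
            acc = 0.0
            for k, weight in enumerate(kernel):
                src = t + k - half  # neighbouring time step, zero-padded
                if 0 <= src < num_steps:
                    acc += weight * frames[src][d]
            position.append(gelu(acc))
        # Add the positional signal to the input, then normalize.
        summed = [x + p for x, p in zip(frames[t], position)]
        output.append(layer_norm(summed))
    return output
```

Because the positional signal at step t depends only on its neighbours, it encodes relative rather than absolute position.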



The training objective requires identifying the correct quantized latent audio representation among a set of distractors for each masked time step. After pre-training, the model is fine-tuned on labeled data. Pre-training relies on masking: a proportion of the feature encoder outputs (time steps) is masked before being fed to the context network, and each masked step is replaced with a trained feature vector shared between all masked time steps.
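A minimal sketch of the masking step is shown below. The defaults (`start_prob=0.065`, `span=10`) are the values reported in the Wav2Vec 2.0 paper rather than in this post, forcing at least one start index is a choice made here for short inputs, and the function names are hypothetical.

```python
import random


def sample_mask(num_steps, start_prob=0.065, span=10, seed=0):
    """Pick random start indices and mask `span` consecutive steps from each."""
    rng = random.Random(seed)
    num_starts = min(num_steps, max(1, round(num_steps * start_prob)))
    starts = rng.sample(range(num_steps), num_starts)  # without replacement
    masked = [False] * num_steps
    for start in starts:
        for t in range(start, min(start + span, num_steps)):
            masked[t] = True  # spans are allowed to overlap
    return masked


def apply_mask(latents, masked, mask_vector):
    """Replace every masked latent with the same shared mask vector."""
    return [list(mask_vector) if is_masked else list(z)
            for z, is_masked in zip(latents, masked)]
```

In the real model `mask_vector` is a learned parameter; here it would simply be passed in as a fixed list.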


The loss function which they are trying to minimize is defined as given below:


L = Lm + αLd 


α is a tuned hyperparameter 

Lm is Contrastive Loss 

Ld is Diversity Loss.
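The two terms can be sketched as follows. The formulas follow the Wav2Vec 2.0 paper: the contrastive loss is a softmax over cosine similarities between the context vector and the candidate quantized vectors, and the diversity loss is the negative entropy of the averaged codebook probabilities. The defaults `temperature=0.1` and `alpha=0.1` are the paper's values and are assumptions with respect to this post; the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Lm for one masked step: -log softmax score of the true target."""
    candidates = [target] + list(distractors)
    logits = [cosine_similarity(context, q) / temperature for q in candidates]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    return log_sum - logits[0]


def diversity_loss(avg_probs_per_group):
    """Ld: negative entropy of averaged code probabilities, scaled by 1/(G*V)."""
    num_groups = len(avg_probs_per_group)
    num_entries = len(avg_probs_per_group[0])
    total = 0.0
    for probs in avg_probs_per_group:
        total += sum(p * math.log(p) for p in probs if p > 0)
    return total / (num_groups * num_entries)


def total_loss(lm, ld, alpha=0.1):
    """L = Lm + alpha * Ld."""
    return lm + alpha * ld
```

Minimizing Ld pushes the model to use all codebook entries equally often, since a uniform distribution has the highest entropy.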


Fine Tuning: 

The pre-trained models are fine-tuned for speech recognition by adding a randomly initialized linear projection on top of the context network, with one output per token in the vocabulary of the task. With this technique, fine-tuning on a small, low-resource speech dataset is enough for the model to achieve state-of-the-art results.
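The sketch below illustrates the added projection and one simple way to turn its outputs into text. This post does not name the fine-tuning loss; the Wav2Vec 2.0 paper uses CTC, so greedy CTC decoding (take the best token per frame, collapse repeats, drop blanks) is assumed here. The toy vocabulary, the initialization scale, and the function names are made up for illustration.

```python
import random


def make_projection(hidden_dim, vocab_size, seed=0):
    """Randomly initialized weight matrix of shape (vocab_size, hidden_dim)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(hidden_dim)]
            for _ in range(vocab_size)]


def project(context_vectors, weights):
    """Map each context vector c_t to one logit per vocabulary token."""
    return [[sum(w * c for w, c in zip(row, vector)) for row in weights]
            for vector in context_vectors]


def greedy_ctc_decode(logits, vocab, blank=0):
    """Pick the best token per frame, collapse repeats, and drop blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    chars = []
    previous = None
    for index in best:
        if index != previous and index != blank:
            chars.append(vocab[index])
        previous = index
    return "".join(chars)


vocab = ["_", "a", "b"]  # index 0 is the CTC blank token
frame_logits = [[0, 1, 0], [0, 2, 0], [3, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 5]]
print(greedy_ctc_decode(frame_logits, vocab))  # aab
```

Only the projection weights start from scratch; everything below them is initialized from pre-training.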


With the introduction of Wav2Vec2 in the Transformers library, Hugging Face has made it much simpler to work with audio data and to build a state-of-the-art speech recognition system in just a few lines of code.
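As a rough sketch of what this looks like in practice, the function below transcribes a waveform with the Transformers library. It assumes that `torch` and `transformers` are installed, that the audio is sampled at 16 kHz, and that the `facebook/wav2vec2-base-960h` checkpoint is the one you want; none of these choices come from this post. The wrapper name `transcribe` is hypothetical.

```python
def transcribe(waveform, sampling_rate=16000,
               model_name="facebook/wav2vec2-base-960h"):
    """Transcribe a 1-D sequence of float samples with a fine-tuned model."""
    # Heavy dependencies are imported lazily, so they are only needed
    # when the function is actually called.
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    # Downloads the checkpoint on first use.
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = Wav2Vec2ForCTC.from_pretrained(model_name)
    model.eval()

    # The processor normalizes the waveform and builds a batch of size 1.
    inputs = processor(waveform, sampling_rate=sampling_rate,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits

    # Greedy decoding: best token per frame, then CTC collapsing.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Calling `transcribe(samples)` with a list of floats read from a 16 kHz .wav file should return the recognized text as a string.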







About Author

Ashish Bhatnagar

He is an enthusiastic learner with a good grip on the latest technologies such as ML, DL, and computer vision. He is focused and always willing to learn.
