Introduction to Image Captioning
With the latest breakthroughs in Deep Learning research, our computers can achieve abilities that were previously the stuff of science fiction. Machine Learning has already delivered feats such as object tracking, automatic speech recognition, image recognition, visual art processing, and natural language processing; now we can add one more impressive ability to the list: automatic image captioning.
In this article, we will use a pre-trained model to generate a descriptive caption for any random image.
Image captioning: A few lines of text that describe and elaborate on a photograph.
Encoder-Decoder architecture: A model that uses an Encoder (a CNN) to encode the input image and a Decoder (an RNN) to decode that encoding into a word sequence.
BLEU: Bilingual Evaluation Understudy, a score that compares generated text against reference text for a given case. It is widely used to evaluate the output of natural language processing models.
LSTM: Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture with feedback connections, able to process entire sequences of data (such as speech or video). LSTMs are heavily used for speech recognition and for unsegmented, connected handwriting recognition.
Embeddings: Dense vectors of real numbers, one per word in the vocabulary.
Transfer Learning: A Machine Learning technique in which a model already trained on one task is reused or fine-tuned for another, similar task.
Inference: Using a trained model to predict values for unseen samples; it performs the same forward pass as training, but without updating any weights.
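To make the "Embeddings" definition above concrete, here is a tiny sketch using PyTorch's embedding layer; the vocabulary size and vector dimension are arbitrary choices for illustration:

```python
import torch

# Each word index in the vocabulary maps to a dense vector of real numbers.
vocab_size, embed_dim = 10, 4
embedding = torch.nn.Embedding(vocab_size, embed_dim)

word_ids = torch.tensor([1, 5, 3])   # three word indices
vectors = embedding(word_ids)        # shape: (3, 4), one dense vector per word
```

During training, these vectors are learned along with the rest of the model, so words used in similar contexts end up with similar vectors.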
The Encoder converts a 3-color-channel image into "learned" channels.
These learned channels capture all the features of the input image.
We will use an already trained CNN to encode images.
Specifically, we will use a 152-layer Residual Network (ResNet-152) trained on the ImageNet classification task (available in PyTorch).
The encoding produced by our ResNet-152 encoder is 14x14 with 2048 channels, i.e., a tensor of size (2048, 14, 14).
The last layer or two of the model must be removed so that we get the feature maps rather than the classification output.
Fig 1. Working of Encoder
The encoded image is processed by the Decoder to produce a sequence of words describing the image.
Recurrent Neural Network (RNN) is used to process sequences of data.
We will average the encoded image across its pixels, feed the result into the Decoder as its first hidden state, and generate the caption word by word. Each generated word is used to generate the next.
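The averaging step above can be sketched as follows. The hidden size of 512 and the linear projection layers are illustrative assumptions, not fixed by the article:

```python
import torch

# Average the encoder output over its 14*14 spatial pixels, one mean per channel.
encoder_out = torch.randn(1, 2048, 14, 14)             # from the encoder
flat = encoder_out.view(1, 2048, -1).mean(dim=2)       # shape: (1, 2048)

# Project the averaged encoding to the LSTM's hidden and cell sizes to
# form the Decoder's initial states (sizes are assumed for illustration).
hidden_size = 512
init_h = torch.nn.Linear(2048, hidden_size)(flat)      # initial hidden state
init_c = torch.nn.Linear(2048, hidden_size)(flat)      # initial cell state
```

Two separate linear layers are used so the hidden and cell states can be initialized independently from the same image summary.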
Fig 2. Working of Decoder
The overall flow is as follows.
After the Encoder generates its output, we transform the encoding to create the initial hidden state (and cell state) for the LSTM Decoder.
At each decoding step:
the encoded output and the previous hidden state are used to generate a weight for each pixel;
the output of the previous step and the weighted average of the encoding are fed to the Decoder to produce the next probable word.
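A single decoding step as described above can be sketched like this. All layer names and dimensions here (embedding size 256, hidden size 512, vocabulary of 1000 words) are illustrative assumptions, not the exact tutorial code:

```python
import torch
import torch.nn.functional as F

num_pixels, enc_dim, hid_dim, vocab_size = 196, 2048, 512, 1000   # 196 = 14*14

enc = torch.randn(1, num_pixels, enc_dim)   # encoder output, one vector per pixel
h = torch.randn(1, hid_dim)                 # previous decoder hidden state
c = torch.randn(1, hid_dim)                 # previous decoder cell state
prev_word_emb = torch.randn(1, 256)         # embedding of the previously generated word

# Step 1: encoded output + previous hidden state -> one weight per pixel.
att = torch.nn.Linear(enc_dim + hid_dim, 1)
h_tiled = h.unsqueeze(1).expand(-1, num_pixels, -1)
scores = att(torch.cat([enc, h_tiled], dim=2))       # shape: (1, 196, 1)
alpha = F.softmax(scores, dim=1)                     # weights sum to 1 over pixels

# Step 2: weighted average of the encoding + previous word -> next word scores.
context = (alpha * enc).sum(dim=1)                   # shape: (1, 2048)
lstm = torch.nn.LSTMCell(256 + enc_dim, hid_dim)
h, c = lstm(torch.cat([prev_word_emb, context], dim=1), (h, c))
logits = torch.nn.Linear(hid_dim, vocab_size)(h)     # scores for the next word
```

At inference time, the word with the highest score (or a beam-search candidate) becomes the input to the next step, and decoding repeats until an end-of-sentence token is produced.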
Fig 3. Overall WorkFlow