Generative Pretrained Transformer 2 for Starters
Understanding the Generative Pretrained Transformer 2
The year 2019 was an important one for the machine learning community, which witnessed a remarkable feat: OpenAI's GPT-2 showed an astonishing ability to write coherent, compelling essays, surpassing what we expected current language models to be able to generate.
GPT-2 was not a particularly novel design; its layout is very close to the decoder-only transformer. Nevertheless, GPT-2 was a very large transformer-based language model trained on a huge dataset.
Transformers: The transformer is a neural network architecture that turns one sequence into another, originally built from two components: an encoder and a decoder.
Attention Mechanism: The attention mechanism enables the decoder, at each step of output generation, to attend to different parts of the source sequence.
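To make the idea concrete, here is a minimal sketch of scaled dot-product attention for a single query, in plain Python. The vectors and the helper names (`softmax`, `attention`) are illustrative, not from any particular library:

```python
import math

def softmax(xs):
    # Numerically stable softmax: shift by the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query.

    Scores each key against the query, converts the scores into
    weights with softmax, and returns the weighted sum of the values.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(dim)]

# Toy example: the query matches the second key more closely,
# so the output leans toward the second value vector.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = attention([0.0, 1.0], keys, values)
```

The softmax weights always sum to 1, so the output stays a convex combination of the value vectors; the query simply decides how much of each value flows through.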
Encoder: The encoder is a network that takes the input and produces a feature map / vector / tensor. These feature vectors carry the information that represents the input data.
Decoder: The decoder is a network (usually with the same configuration as the encoder, but in the opposite orientation) that takes the feature vector from the encoder and produces the closest possible match to the actual input or intended output.
WordPiece: A strategy for segmenting words into subwords for NLP tasks, where the vocabulary is initialized with all the individual characters in the corpus, and the most frequent / most likely combinations of symbols in the vocabulary are then added iteratively.
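The iterative-merging idea can be sketched with a toy frequency-based merge loop. Note this is the simpler byte-pair-encoding (BPE) variant of the idea, which merges the most *frequent* pair; WordPiece proper scores candidate merges by likelihood. The corpus and helper names here are invented for illustration:

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across all words, weighted by frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Rewrite every word, fusing each occurrence of `pair` into one symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny corpus: word -> frequency. Vocabulary starts as single characters.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("low"): 7}
vocab = {c for w in words for c in w}
for _ in range(2):  # two merge steps
    pair = most_frequent_pair(words)
    vocab.add(pair[0] + pair[1])
    words = merge_pair(words, pair)
```

After two merges the vocabulary has grown beyond single characters to include common subwords such as "lo" and "low", which is exactly how frequent fragments end up as reusable tokens.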
Language Model: A machine-learning model that takes part of a sentence and predicts the next word. A familiar example of a language model is the auto-suggest feature on your phone's keyboard, which proposes the next word as you type.
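A minimal sketch of this "predict the next word" idea is a bigram model that simply counts which word tends to follow which. This is nothing like GPT-2's neural architecture, just the smallest possible language model, with a made-up training sentence:

```python
from collections import Counter, defaultdict

class BigramModel:
    """A tiny next-word predictor: counts which word follows which."""

    def __init__(self):
        self.follows = defaultdict(Counter)

    def train(self, text):
        words = text.split()
        for prev, nxt in zip(words, words[1:]):
            self.follows[prev][nxt] += 1

    def predict(self, word):
        # Return the word most often seen after `word`, if any.
        options = self.follows.get(word)
        return options.most_common(1)[0][0] if options else None

model = BigramModel()
model.train("the cat sat on the mat and the cat slept")
suggestion = model.predict("the")  # "cat" follows "the" most often here
```

Phone keyboards and GPT-2 both refine this same objective; GPT-2 just conditions on the entire preceding context with a deep network rather than on one previous word with a count table.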
The transformer architecture consists of an encoder and a decoder, each of which is a stack of what we might call transformer blocks. The design was well suited to machine translation tasks.
GPT-2 (the successor to GPT) was trained on 40 GB of Internet text to predict the next word.
The OpenAI team initially released a much smaller model for researchers to work with, along with a technical report, as an exercise in responsible disclosure.
GPT-2 is trained with a single objective: to predict the next word, given all the preceding words in a text. The diversity of the dataset allows this simple goal to capture naturally occurring demonstrations of many tasks across different domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters, trained on more than 10X the data.
GPT-2 displays a wide range of capabilities, including the ability to generate conditional synthetic text samples of unprecedented coherence, where we prime the model with an input and have it generate a lengthy continuation.
In addition, GPT-2 outperforms other language models trained on specific domains (such as Wikipedia, news, or books) without needing to use those domain-specific training datasets.
GPT-2 begins to learn language tasks such as question answering, reading comprehension, summarization, and translation from the raw text alone, using no task-specific training data.
GPT-2 is built from the decoder blocks of the transformer. BERT, by contrast, uses the transformer's encoder blocks. A key difference between the two is that GPT-2 emits one token at a time, like traditional language models.
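The token-at-a-time behaviour can be sketched as an autoregressive decoding loop: each emitted token is appended to the context that the model sees on the next step. The `fake_model` below is a deterministic stand-in for a real next-token predictor, used only to show the loop's shape:

```python
def generate(predict_next, prompt_tokens, max_new_tokens):
    """Autoregressive decoding: emit one token at a time, feeding
    each prediction back in as context for the next step."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = predict_next(tokens)
        if nxt is None:  # treat None as an end-of-text signal
            break
        tokens.append(nxt)
    return tokens

# Stand-in "model": ignores the content of the context and just
# continues with a fixed cycle of tokens based on context length.
cycle = ["a", "b", "c"]
fake_model = lambda toks: cycle[len(toks) % 3]

out = generate(fake_model, ["start"], 4)  # -> ["start", "b", "c", "a", "b"]
```

In GPT-2, `predict_next` is the full transformer decoder stack scoring the whole vocabulary; a BERT-style encoder, by contrast, reads the entire sequence at once and has no such left-to-right generation loop.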
A simple illustration of the GPT-2 language model.