DETR: Detection Transformer combines a set-based global loss that forces unique predictions via bipartite matching with a transformer encoder-decoder architecture. Most deep learning detectors perform object detection in multiple steps, which leads to near-duplicate detections and false positives. DETR aims to solve this problem in a novel and efficient way.
It treats object detection as a direct set prediction problem with the help of a transformer-based encoder-decoder architecture. By set, I mean the set of bounding boxes and their positions.
Transformer: It is an architecture for transforming one sequence into another with the help of two parts (an encoder and a decoder), but it differs from previously described sequence-to-sequence models in that it does not use any recurrent networks (GRU, LSTM, etc.).
It relies on a simple yet powerful mechanism called attention, which allows models to selectively focus on certain parts of their input and thus reason more effectively. A minimal sketch of this mechanism follows.
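The sketch below shows scaled dot-product attention, the core computation behind this mechanism. It assumes PyTorch, and all names are illustrative rather than taken from any particular codebase.

```python
# A minimal sketch of scaled dot-product attention, assuming PyTorch.
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_model)
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # Softmax turns scores into weights that sum to 1 over the keys.
    weights = torch.softmax(scores, dim=-1)
    # Each output is a weighted mix of the values: the model "focuses"
    # on the inputs with the highest weights.
    return weights @ v

q = k = v = torch.randn(1, 10, 64)
out = scaled_dot_product_attention(q, k, v)  # (1, 10, 64)
```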
DETR Pipeline:
An input image is passed through a CNN backbone, then through a transformer encoder-decoder, and finally through prediction heads that output a fixed-size set of N predictions in a single forward pass.
Inference:
At inference time, each of the N predictions is either a class with a box or a special "no object" output, so confident detections can be read off directly without post-processing. A minimal sketch follows.
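This inference sketch assumes the pretrained detr_resnet50 model published via torch.hub by the facebookresearch/detr repository; image.jpg and the 0.9 confidence threshold are placeholder choices.

```python
# A minimal DETR inference sketch using the pretrained torch.hub model.
import torch
import torchvision.transforms as T
from PIL import Image

model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()

# Standard ImageNet normalization used by this model.
transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = transform(Image.open('image.jpg').convert('RGB')).unsqueeze(0)
with torch.no_grad():
    out = model(img)

# out['pred_logits']: (1, 100, 92) class scores for 100 object queries;
# out['pred_boxes']:  (1, 100, 4) normalized (cx, cy, w, h) boxes.
probs = out['pred_logits'].softmax(-1)[0, :, :-1]  # drop "no object" class
keep = probs.max(-1).values > 0.9                  # confidence threshold
print(out['pred_boxes'][0, keep])                  # no NMS needed
```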
Training:
During training, a bipartite matching computed with the Hungarian algorithm assigns each ground-truth object to exactly one of the N predictions, and the set-based loss (classification plus box regression) is applied to the matched pairs. The matching step is sketched below.
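This sketch of the matching step assumes SciPy's Hungarian solver. The cost here combines only class probability and L1 box distance, a simplification of DETR's full matching cost (which also includes a generalized-IoU term), and the 5.0 weight is illustrative.

```python
# A simplified sketch of DETR's bipartite matching step.
import torch
from scipy.optimize import linear_sum_assignment

def match(pred_probs, pred_boxes, gt_labels, gt_boxes):
    # pred_probs: (N, num_classes), pred_boxes: (N, 4)
    # gt_labels: (M,), gt_boxes: (M, 4), with N >= M
    cost_class = -pred_probs[:, gt_labels]             # (N, M)
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)  # (N, M)
    cost = cost_class + 5.0 * cost_box                 # weighted sum
    # Hungarian algorithm: one-to-one assignment minimizing total cost.
    pred_idx, gt_idx = linear_sum_assignment(cost.numpy())
    return pred_idx, gt_idx

probs = torch.rand(100, 92).softmax(-1)
boxes = torch.rand(100, 4)
pi, gi = match(probs, boxes, torch.tensor([3, 17]), torch.rand(2, 4))
print(pi, gi)  # indices of the matched prediction/ground-truth pairs
```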
Advantage of the DETR Pipeline:
Because the matching forces unique, one-to-one predictions, DETR drops the hand-crafted components of earlier detectors (anchor generation, proposal refinement, non-maximum suppression) and is trained end-to-end.
DETR Architecture:
It contains three main components: a CNN backbone, a transformer encoder-decoder, and a feed-forward network (FFN).
Backbone: ResNet-50 is frequently used as the backbone of DETR, though in principle any backbone can be used depending on the complexity of the task. It provides a lower-dimensional representation of the image with refined features. A minimal sketch follows.
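This backbone sketch assumes torchvision's ResNet-50: the pooling and classification layers are dropped, and a 1x1 convolution (analogous to DETR's input projection, but illustrative here) reduces the 2048-channel feature map to the transformer width d = 256.

```python
# A minimal sketch of the backbone feature extraction.
import torch
import torch.nn as nn
import torchvision

resnet = torchvision.models.resnet50(weights=None)
backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool & fc
proj = nn.Conv2d(2048, 256, kernel_size=1)               # 2048 -> d = 256

x = torch.randn(1, 3, 800, 800)   # input image
feat = proj(backbone(x))          # (1, 256, 25, 25): an H/32 x W/32 grid
print(feat.shape)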
Encoder: Each encoder layer has a standard architecture and consists of a multi-head self-attention module and an FFN, as sketched below.
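This encoder sketch assumes PyTorch's built-in transformer modules; DETR additionally adds fixed positional encodings to the attention at every layer, which is omitted here for brevity.

```python
# A sketch of the encoder over the flattened backbone features.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(layer, num_layers=6)

# The backbone feature map is flattened into a sequence of HW tokens.
feat = torch.randn(1, 256, 25, 25)
tokens = feat.flatten(2).permute(2, 0, 1)  # (HW=625, batch, d=256)
memory = encoder(tokens)                   # same shape, globally attended
print(memory.shape)
```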
Decoder: It follows the standard structure of the transformer, transforming N embeddings of size d using multi-head self-attention and encoder-decoder attention mechanisms.
The difference from the original transformer is that the DETR model decodes the N objects in parallel at each decoder layer, rather than autoregressively. A sketch appears below.
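This decoder sketch again assumes PyTorch's built-in modules: N = 100 learned object queries attend to each other (self-attention) and to the encoder output (encoder-decoder attention), and all N embeddings are updated in parallel because no causal mask is applied.

```python
# A sketch of the decoder with learned object queries.
import torch
import torch.nn as nn

layer = nn.TransformerDecoderLayer(d_model=256, nhead=8, dim_feedforward=2048)
decoder = nn.TransformerDecoder(layer, num_layers=6)

queries = nn.Parameter(torch.randn(100, 1, 256))  # N=100 object queries
memory = torch.randn(625, 1, 256)                 # encoder output (HW tokens)

# No causal mask: all N embeddings are decoded simultaneously, unlike the
# autoregressive decoding of the original transformer.
hs = decoder(queries, memory)                     # (100, 1, 256)
print(hs.shape)
```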
FFN: A 3-layer perceptron predicts the normalized center coordinates, height, and width of the box w.r.t. the input image, and a linear layer predicts the class label using a softmax function; a sketch follows.
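This sketch of the prediction heads assumes d = 256 and 91 COCO classes plus one extra "no object" class; the layer sizes follow the 3-layer perceptron described above, but the names are illustrative.

```python
# A sketch of DETR's prediction heads on top of the decoder output.
import torch
import torch.nn as nn

d, num_classes, N = 256, 91, 100

box_head = nn.Sequential(            # 3-layer MLP for box regression
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, 4),
)
class_head = nn.Linear(d, num_classes + 1)  # +1 for the "no object" class

hs = torch.randn(N, 1, d)            # decoder output embeddings
boxes = box_head(hs).sigmoid()       # normalized (cx, cy, w, h) in [0, 1]
logits = class_head(hs)              # softmax is applied in the loss
print(boxes.shape, logits.shape)     # (100, 1, 4), (100, 1, 92)
```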
Results:
On the COCO benchmark, DETR reaches accuracy comparable to a well-tuned Faster R-CNN baseline, performing better on large objects and worse on small ones.
For more details, see the DETR paper, "End-to-End Object Detection with Transformers" (Carion et al., 2020): https://arxiv.org/abs/2005.12872.