Object Detection using DETR

Posted By: Pradeep Farthyal | 1st September 2020

DETR (Detection Transformer) combines a set-based global loss, which forces unique predictions via bipartite matching, with a transformer encoder-decoder architecture. Most modern deep learning detectors perform object detection in multiple steps, which leads to problems such as false positives. DETR aims to solve this innovatively and efficiently.
It treats object detection as a direct set prediction problem with the help of an encoder-decoder architecture based on transformers. By set, I mean the set of bounding boxes.
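
To make the bipartite matching idea concrete, here is a minimal sketch in plain Python. It is a simplification, not DETR's actual loss: the matching cost in DETR also includes class terms and is solved with the Hungarian algorithm, whereas this sketch assumes an L1 box distance only and tries every one-to-one assignment by brute force. The names l1_box_cost and match_predictions are hypothetical, and boxes are assumed to be (cx, cy, w, h) tuples normalized to the image size.

```python
from itertools import permutations


def l1_box_cost(pred, target):
    """Sum of absolute differences between two (cx, cy, w, h) boxes."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def match_predictions(preds, targets):
    """Find the cheapest one-to-one assignment of predictions to targets.

    Returns (assignment, cost), where assignment[i] is the index of the
    prediction matched to targets[i]. Brute force over all assignments,
    so it is only practical for tiny sets. Assumes len(preds) >= len(targets).
    """
    best_cost, best_assignment = float("inf"), None
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(l1_box_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assignment = cost, perm
    return best_assignment, best_cost


# Three predicted boxes, two ground-truth boxes (made-up values).
preds = [
    (0.80, 0.80, 0.20, 0.20),
    (0.25, 0.25, 0.30, 0.30),
    (0.50, 0.50, 0.10, 0.10),
]
targets = [
    (0.20, 0.20, 0.30, 0.30),
    (0.80, 0.75, 0.20, 0.25),
]
assignment, cost = match_predictions(preds, targets)
print(assignment, round(cost, 3))
```

Each ground-truth box ends up with exactly one prediction, so duplicate detections of the same object cannot all be rewarded. Predictions left unmatched (index 2 here) would be trained to predict "no object".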

Transformer: It is an architecture for transforming one sequence into another with the help of two parts (an encoder and a decoder), but it differs from earlier sequence-to-sequence models because it does not use any recurrent network (GRU, LSTM, etc.).

It relies on a simple yet powerful mechanism called attention, which allows AI models to selectively focus on certain parts of their input and thus think more effectively.
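
As an illustration of the attention mechanism, the following is a minimal sketch of single-head scaled dot-product attention in plain Python. Real transformers use several heads, learned projection matrices and tensor libraries; the function names and the list-of-lists data layout here are assumptions made for readability.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for lists of vectors (single head)."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
# The first query strongly matches the first key; the second matches neither.
result = attention([[10.0, 0.0], [0.0, 0.0]], keys, values)
print(result)
```

The first output stays close to the first value vector because almost all attention weight goes to the matching key, while the second output is a plain average of both values.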



DETR Pipeline:


  1. Compute image features with the backbone.
  2. Pass the features to the transformer encoder-decoder.
  3. Predict a set of detections.
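
The three steps above can be traced in terms of tensor shapes. The sketch below only computes shapes and runs no model. All default values are assumptions for illustration: a ResNet-50 style backbone with 2048 output channels and an overall stride of 32, a transformer width of 256, 100 object queries, and 91 classes plus one extra "no object" class. The function name detr_shapes is hypothetical.

```python
import math


def detr_shapes(height, width, backbone_channels=2048, stride=32,
                d_model=256, num_queries=100, num_classes=91):
    """Trace approximate tensor shapes through the three pipeline stages.

    Every default is an illustrative assumption, not a value read from a model.
    """
    # Step 1: the backbone downsamples the image by the overall stride.
    fh, fw = math.ceil(height / stride), math.ceil(width / stride)
    return {
        "backbone_features": (backbone_channels, fh, fw),
        # Step 2: the feature map is projected to d_model and flattened
        # into a sequence with one token per spatial location.
        "encoder_sequence": (fh * fw, d_model),
        "decoder_output": (num_queries, d_model),
        # Step 3: one class distribution (plus "no object") and one box per query.
        "class_logits": (num_queries, num_classes + 1),
        "boxes": (num_queries, 4),
    }


shapes = detr_shapes(800, 1200)
for name, shape in shapes.items():
    print(name, shape)
```

Note that the number of predictions is fixed by the number of queries and does not depend on the image content.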



Advantages of the DETR Pipeline:

  1. Easy to use
  2. No custom layers
  3. Easy to extend to other tasks
  4. No prior information about anchors and no hand-crafted algorithms such as NMS (non-maximum suppression) are needed

DETR Architecture:

It contains three components:

  1. A CNN backbone that extracts a compact feature representation
  2. A transformer encoder-decoder
  3. An FFN (feed-forward network) that makes the final detection prediction

Backbone: ResNet-50 is frequently used as the backbone of DETR. In principle, any backbone can be used, depending on the complexity of the task. The backbone provides a low-dimensional representation of the image with refined features.









Encoder: Each encoder layer has a standard architecture and consists of a multi-head self-attention module and an FFN.


Decoder: It follows the standard architecture of the transformer, transforming N embeddings of size d using multi-head self-attention and encoder-decoder attention mechanisms.

The difference from the original transformer is that the DETR model decodes the N objects in parallel at each decoder layer.

FFN: It predicts the normalized center coordinates, height and width of the box w.r.t. the input image, and a linear layer predicts the class label using a softmax function.
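
A minimal sketch of how one such output could be turned into a usable detection, assuming the box is given as (cx, cy, w, h) normalized to [0, 1] and the class scores are raw logits. The function names and the example numbers are hypothetical; a real model would also include a "no object" class among the logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_prediction(box, class_logits, image_width, image_height):
    """Convert one normalized (cx, cy, w, h) box and its logits to pixels and a label."""
    cx, cy, w, h = box
    # Center/size format -> corner format, then scale to pixel units.
    x_min = (cx - w / 2) * image_width
    y_min = (cy - h / 2) * image_height
    x_max = (cx + w / 2) * image_width
    y_max = (cy + h / 2) * image_height
    probs = softmax(class_logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return (x_min, y_min, x_max, y_max), label, probs[label]


corners, label, score = decode_prediction(
    (0.5, 0.5, 0.2, 0.4), [0.1, 2.0, -1.0], image_width=640, image_height=480
)
print(corners, label, round(score, 3))
```

Because the coordinates are normalized, the same prediction can be mapped onto any resized version of the input image by changing image_width and image_height.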






