Introduction to AWS Textract and how we use it for data extraction

Posted by: Hemant Chauhan | 28th December 2020

 

Amazon Textract is an optical character recognition (OCR) service that automatically extracts text and data from scanned documents. Its main advantage is that it can also pull out the information stored in tables and the contents of form fields.

 

In our experience, Textract's results are more accurate than those of Tesseract, the open-source OCR engine, and Tesseract on its own does not extract table cells or form fields. Textract accepts common image formats such as PNG and JPEG, including scanned images, and it can also process PDF documents stored in an Amazon S3 bucket. When you upload a document in the Textract console, a prompt asks for permission to create an S3 bucket. If you answer yes, the console creates the bucket in your Amazon S3 account and stores the uploaded document in it.
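
The split between image files and PDF documents can be captured in a small helper. This is only a sketch: the function name pick_textract_call and the extension lists are assumptions made for illustration, mirroring the way the examples later in this post send images to detect_document_text and PDFs to start_document_text_detection. Check the current Textract documentation for the formats each operation accepts.

```python
import os

# Assumed mapping, based on the examples in this post: images go to the
# synchronous call, PDFs go to the asynchronous job API.
SYNC_EXTENSIONS = {".png", ".jpg", ".jpeg"}
ASYNC_EXTENSIONS = {".pdf"}


def pick_textract_call(document_name):
    """Return the boto3 Textract method name to use for a file (hypothetical helper)."""
    ext = os.path.splitext(document_name)[1].lower()
    if ext in SYNC_EXTENSIONS:
        return "detect_document_text"
    if ext in ASYNC_EXTENSIONS:
        return "start_document_text_detection"
    raise ValueError("Unsupported document type: {!r}".format(ext))


print(pick_textract_call("OneKeyValue.png"))
print(pick_textract_call("document.pdf"))
```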

 

In AWS Textract, we get three types of results:

 

1. Raw text: All text in the document is extracted. If the document contains key-value pairs, they appear in the raw text as “key: value”.

2. Forms: If the document contains a form, its data is returned as key-value pairs.

3. Tables: Textract detects every table in the document and extracts its contents cell by cell. When you download the results, the table data is provided as a CSV file.
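
In the API response, these three views come from different block types in the "Blocks" list: LINE blocks carry the raw text, KEY_VALUE_SET blocks carry form data, and TABLE blocks mark tables. The form and table blocks only appear when analyze_document is called with the matching FeatureTypes. The sketch below counts blocks per view. The helper name summarize_blocks and the sample response are made up for illustration and are not real Textract output.

```python
from collections import Counter


def summarize_blocks(response):
    """Count blocks by the result view they feed (raw text, forms, tables)."""
    view_for_type = {
        "LINE": "raw_text",
        "KEY_VALUE_SET": "forms",
        "TABLE": "tables",
    }
    counts = Counter()
    for block in response.get("Blocks", []):
        view = view_for_type.get(block["BlockType"])
        if view:
            counts[view] += 1
    return dict(counts)


# Hand-written sample, not real Textract output.
sample = {"Blocks": [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "Name: Jane"},
    {"BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"]},
    {"BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"]},
    {"BlockType": "TABLE"},
]}
print(summarize_blocks(sample))
```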

 

How we use Amazon Textract

 

1. Go to https://aws.amazon.com/textract/, sign in to the AWS console, and click "Get started with Amazon Textract".

 

2. Click "Try Amazon Textract". You will now see an upload option where you can upload any document.

 

3. After uploading, four options appear on the right side: Raw text, Forms, Tables, and Human review. The extracted document data is shown there. You can also download it with the Download results option once extraction has finished.

 

How we use Textract in Python with a few lines of code

 

1. Processing a document from the local machine

 

 import boto3  
 # Document  
 documentName = "OneKeyValue.png"  
 # Read document content  
 with open(documentName, 'rb') as document:  
   imageBytes = bytearray(document.read())  
 # Amazon Textract client  
 textract = boto3.client('textract')  
 # Call Amazon Textract  
 response = textract.detect_document_text(Document={'Bytes': imageBytes})  
 #print(response)  
 # Print detected text  
 for item in response["Blocks"]:  
   if item["BlockType"] == "LINE":  
     print ('\033[94m' + item["Text"] + '\033[0m')  
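
Each LINE block also carries a "Confidence" score between 0 and 100, which can be used to drop uncertain text before further processing. The following is a minimal sketch. The helper name lines_above_confidence, the 90.0 threshold, and the sample response are assumptions for illustration.

```python
def lines_above_confidence(response, min_confidence=90.0):
    """Return (text, confidence) for LINE blocks at or above the threshold."""
    results = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        confidence = block.get("Confidence", 0.0)
        if confidence >= min_confidence:
            results.append((block["Text"], confidence))
    return results


# Hand-written sample, not real Textract output.
sample = {"Blocks": [
    {"BlockType": "LINE", "Text": "Invoice 1001", "Confidence": 99.2},
    {"BlockType": "LINE", "Text": "smudged text", "Confidence": 41.0},
    {"BlockType": "WORD", "Text": "Invoice", "Confidence": 99.5},
]}
for text, confidence in lines_above_confidence(sample):
    print("{} ({:.1f})".format(text, confidence))
```

In the script above, the same function could be called on the response returned by detect_document_text.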

 

2. Processing a document stored in an Amazon S3 bucket. Before using the bucket, set up the AWS credentials and config files on your local system to match the account and region of your S3 bucket. Also make sure the document has already been saved to the bucket before running the script.

 

 import boto3  
 # Document  
 s3BucketName = "Your s3 bucket Name"  
 documentName = "document.jpg"  
 # Amazon Textract client  
 textract = boto3.client('textract')  
 # Call Amazon Textract  
 response = textract.detect_document_text(  
   Document={  
     'S3Object': {  
       'Bucket': s3BucketName,  
       'Name': documentName  
     }  
   })  
 #print(response)  
 # Print detected text  
 for item in response["Blocks"]:  
   if item["BlockType"] == "LINE":  
     print ('\033[94m' + item["Text"] + '\033[0m')  
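
Before running the S3 example, it can help to confirm that the local credentials file contains the expected profile. The sketch below parses the text of a credentials file in the usual INI layout and checks for the two standard keys. The helper name profile_has_credentials and the sample values are assumptions. boto3 can also pick up credentials from environment variables and other sources, so treat this as a rough check only.

```python
import configparser

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def profile_has_credentials(credentials_text, profile="default"):
    """Return True if the profile section exists and holds both access keys."""
    # Interpolation is disabled so that '%' characters in secrets are harmless.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(credentials_text)
    if not parser.has_section(profile):
        return False
    return all(parser.has_option(profile, key) for key in REQUIRED_KEYS)


# Placeholder content, not real credentials.
sample = (
    "[default]\n"
    "aws_access_key_id = EXAMPLEKEYID\n"
    "aws_secret_access_key = EXAMPLESECRET\n"
)
print(profile_has_credentials(sample))
```

To check a real setup, read the text from the credentials file in the .aws folder of your home directory and pass it to the function.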

 

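3. Processing forms (key/value pairs). The trp module imported here is the Textract response parser published with the AWS samples. It is not part of boto3 and has to be installed separately.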

 

 import boto3  
 from trp import Document  
 # Document  
 s3BucketName = "Your s3 bucket name"  
 documentName = "document.jpg"  
 # Amazon Textract client  
 textract = boto3.client('textract')  
 # Call Amazon Textract  
 response = textract.analyze_document(  
   Document={  
     'S3Object': {  
       'Bucket': s3BucketName,  
       'Name': documentName  
     }  
   },  
   FeatureTypes=["FORMS"])  
 #print(response)  
 doc = Document(response)  
 for page in doc.pages:  
   # Print fields  
   print("Fields:")  
   for field in page.form.fields:  
     print("Key: {}, Value: {}".format(field.key, field.value))  
   # Get field by key  
   print("\nGet Field by Key:")  
   key = "Phone Number:"  
   field = page.form.getFieldByKey(key)  
   if(field):  
     print("Key: {}, Value: {}".format(field.key, field.value))  
   # Search fields by key  
   print("\nSearch Fields:")  
   key = "address"  
   fields = page.form.searchFieldsByKey(key)  
   for field in fields:  
     print("Key: {}, Value: {}".format(field.key, field.value))  
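
The parser hides how key/value pairs are encoded in the raw response. As a rough sketch of what happens underneath: a KEY_VALUE_SET block whose EntityTypes contains "KEY" links to its value block through a "VALUE" relationship, and both link to their WORD blocks through "CHILD" relationships. The helper names and the sample response below are made up for illustration, and the sketch ignores selection elements such as checkboxes.

```python
def get_text(block, blocks_by_id):
    """Join the WORD children of a block into a single string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response):
    """Map each form key to its value text (hypothetical helper)."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(get_text(blocks_by_id[value_id], blocks_by_id))
        pairs[get_text(block, blocks_by_id)] = " ".join(value_parts)
    return pairs


# Hand-written sample, not real Textract output.
sample = {"Blocks": [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Phone"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Number:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "555-0100"},
]}
print(extract_key_values(sample))
```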

 

4. Processing tables

 

 import boto3  
 from trp import Document  
 # Document  
 s3BucketName = "Your s3 bucket name"  
 documentName = "document.jpg"  
 # Amazon Textract client  
 textract = boto3.client('textract')  
 # Call Amazon Textract  
 response = textract.analyze_document(  
   Document={  
     'S3Object': {  
       'Bucket': s3BucketName,  
       'Name': documentName  
     }  
   },  
   FeatureTypes=["TABLES"])  
 #print(response)  
 doc = Document(response)  
 for page in doc.pages:  
   # Print tables  
   for table in page.tables:  
     for r, row in enumerate(table.rows):  
       for c, cell in enumerate(row.cells):  
         print("Table[{}][{}] = {}".format(r, c, cell.text))  
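
The console offers table results as a CSV download. A similar file can be produced in code once the cell text has been collected into rows. This is a minimal sketch using the standard csv module. The helper name table_rows_to_csv and the sample rows are assumptions for illustration.

```python
import csv
import io


def table_rows_to_csv(rows):
    """Serialize a list of rows (lists of cell strings) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hand-written sample rows, standing in for one extracted table.
rows = [
    ["Item", "Price"],
    ["Pen, blue", "1.50"],
]
print(table_rows_to_csv(rows))
```

With the script above, the rows for one table could be built from the parsed document with a nested loop over table.rows and row.cells, collecting cell.text for each cell.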

 

5. Processing a PDF document

 

 import boto3  
 import time  
 def startJob(s3BucketName, objectName):  
   response = None  
   client = boto3.client('textract')  
   response = client.start_document_text_detection(  
   DocumentLocation={  
     'S3Object': {  
       'Bucket': s3BucketName,  
       'Name': objectName  
     }  
   })  
   return response["JobId"]  
 def isJobComplete(jobId):  
   time.sleep(5)  
   client = boto3.client('textract')  
   response = client.get_document_text_detection(JobId=jobId)  
   status = response["JobStatus"]  
   print("Job status: {}".format(status))  
   while(status == "IN_PROGRESS"):  
     time.sleep(5)  
     response = client.get_document_text_detection(JobId=jobId)  
     status = response["JobStatus"]  
     print("Job status: {}".format(status))  
   return status  
 def getJobResults(jobId):  
   pages = []  
   time.sleep(5)  
   client = boto3.client('textract')  
   response = client.get_document_text_detection(JobId=jobId)  
   pages.append(response)  
   print("Resultset page received: {}".format(len(pages)))  
   nextToken = None  
   if('NextToken' in response):  
     nextToken = response['NextToken']  
   while(nextToken):  
     time.sleep(5)  
     response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)  
     pages.append(response)  
     print("Resultset page received: {}".format(len(pages)))  
     nextToken = None  
     if('NextToken' in response):  
       nextToken = response['NextToken']  
   return pages  
 # Document  
 s3BucketName = "Your s3 bucket name"  
 documentName = "document.pdf"  
 jobId = startJob(s3BucketName, documentName)  
 print("Started job with id: {}".format(jobId))  
 # isJobComplete returns the final status string, so compare it explicitly  
 status = isJobComplete(jobId)  
 if status == "SUCCEEDED":  
   response = getJobResults(jobId)  
   #print(response)  
   # Print detected text  
   for resultPage in response:  
     for item in resultPage["Blocks"]:  
       if item["BlockType"] == "LINE":  
         print('\033[94m' + item["Text"] + '\033[0m')  
 else:  
   print("Job did not succeed: {}".format(status))  
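
The polling loop in the script runs for as long as the job reports IN_PROGRESS. One possible variation is to cap the number of attempts so that a stuck job cannot block the script forever. The sketch below takes the status lookup as a callable so it can be tried without calling AWS. The name wait_for_job and its parameters are assumptions for illustration.

```python
import time


def wait_for_job(get_status, poll_seconds=5, max_attempts=60, sleep=time.sleep):
    """Poll get_status() until it stops returning IN_PROGRESS or attempts run out."""
    for _ in range(max_attempts):
        status = get_status()
        if status != "IN_PROGRESS":
            # Final states include SUCCEEDED and FAILED.
            return status
        sleep(poll_seconds)
    raise TimeoutError("Job still in progress after {} attempts".format(max_attempts))


# Simulated status sequence, standing in for calls to Textract.
statuses = iter(["IN_PROGRESS", "IN_PROGRESS", "SUCCEEDED"])
print(wait_for_job(lambda: next(statuses), sleep=lambda seconds: None))
```

In the script above, get_status could be a small function that calls get_document_text_detection with the job ID and returns response["JobStatus"].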

 


About Author

Hemant Chauhan

Hemant is an accomplished backend developer with extensive experience in software development. He possesses an in-depth understanding of various technologies and has a strong command over Java, Spring Boot, MySQL, Elasticsearch, Selenium with Java, GitHub/GitLab, HTML/CSS, and MongoDB. Hemant has worked on several related projects, including Tesseract OCR, Sikuli with Selenium Automation, Transleqo, and currently, SecureNow. He excels at managing trading bots, developing centralized exchanges, and has a creative mindset with exceptional analytical skills.
