Amazon Textract is an optical character recognition (OCR) service that automatically extracts text and other information from scanned documents. Its main advantage is that you can easily get the information stored in tables and the contents of form fields.
The results of Amazon Textract are usually more accurate than those of Tesseract, which is an open source OCR engine. Tesseract on its own also does not extract table cells or form fields. Textract accepts common image formats such as JPEG and PNG and works well on scanned images. Through an Amazon S3 bucket, Textract supports PDF documents too. When you upload a document in the Textract console, it pops up a message asking you to create an Amazon S3 bucket. When you answer yes, it automatically creates the bucket in your Amazon S3 account and stores the uploaded PDF in it.
In AWS Textract, we get three types of results:
1. Raw text: All text in the document is extracted. If the document has key-value pairs, the resulting raw text looks like “key: value”.
2. Forms: If the document contains a form, the resulting data is returned as key-value pairs.
3. Tables: Textract detects all tables in the document and extracts the tabular information cell by cell. When you download the results, the table data is also converted into a CSV file.
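In the API response these three views are not separate payloads. They arrive as one flat list of blocks, where LINE and WORD blocks carry the raw text, KEY_VALUE_SET blocks carry form data, and TABLE and CELL blocks carry tables. The sketch below counts the block types in a response. The `sample_response` dictionary is hand-made illustrative data, not real Textract output, and `count_block_types` is a helper name introduced here.

```python
from collections import Counter


def count_block_types(response):
    """Count how many blocks of each BlockType a Textract-style response contains."""
    return Counter(block["BlockType"] for block in response.get("Blocks", []))


# Hand-made stand-in for a real response (illustrative data only).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Name: Jane"},
        {"BlockType": "WORD", "Text": "Name:"},
        {"BlockType": "WORD", "Text": "Jane"},
        {"BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"]},
        {"BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"]},
        {"BlockType": "TABLE"},
        {"BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1},
    ]
}

counts = count_block_types(sample_response)
print(dict(counts))
```

Running the same function on a real response is a quick way to see whether a document produced any form or table blocks at all.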
How we use Amazon Textract
1. Go to https://aws.amazon.com/textract/, sign in to the AWS console and click on Get started with Amazon Textract.
2. Click on Try Amazon Textract. You will now see the upload document option, where you can upload any document.
3. After uploading, you can see four options on the right side: Raw text, Forms, Tables and Human review. Here you can see the extracted document data. You can also download the results using the Download results option after extraction.
How we use Textract in Python with a few lines of code
1. Showing document processing on the local machine.
import boto3
# Document
documentName = "OneKeyValue.png"
# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())
# Amazon Textract client
textract = boto3.client('textract')
# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})
#print(response)
# Print detected text (the ANSI escape codes color the output blue)
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print('\033[94m' + item["Text"] + '\033[0m')
2. Showing document processing from an Amazon S3 bucket. Before using the S3 bucket, make sure your local AWS credentials and config file are set up to match the account and region of the bucket. Also make sure the document has already been saved to the bucket before running the script.
import boto3
# Document
s3BucketName = "Your s3 bucket Name"
documentName = "document.jpg"
# Amazon Textract client
textract = boto3.client('textract')
# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })
#print(response)
# Print detected text (the ANSI escape codes color the output blue)
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print('\033[94m' + item["Text"] + '\033[0m')
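Each block in the response also carries a Confidence score between 0 and 100, which can be used to drop doubtful lines instead of printing everything. The following is a minimal sketch under that assumption. `extract_lines` and its `min_confidence` parameter are names introduced here for illustration, and `sample` is made-up data standing in for a real response.

```python
def extract_lines(response, min_confidence=0.0):
    """Return the text of LINE blocks whose Confidence is at least min_confidence."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        if block.get("Confidence", 0.0) >= min_confidence:
            lines.append(block["Text"])
    return lines


# Made-up data in the shape of a detect_document_text response.
sample = {
    "Blocks": [
        {"BlockType": "LINE", "Text": "Invoice 42", "Confidence": 99.1},
        {"BlockType": "WORD", "Text": "Invoice", "Confidence": 99.3},
        {"BlockType": "LINE", "Text": "T0tal: l00", "Confidence": 41.7},
    ]
}

all_lines = extract_lines(sample)
confident_lines = extract_lines(sample, min_confidence=90.0)
print(all_lines)
print(confident_lines)
```

The threshold of 90 is only an example value. A suitable cut-off depends on the scan quality of your own documents.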
3. Showing form (key/value) processing
import boto3
# trp is the Textract response parser module from the AWS samples
from trp import Document
# Document
s3BucketName = "Your s3 bucket name"
documentName = "document.jpg"
# Amazon Textract client
textract = boto3.client('textract')
# Call Amazon Textract
response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["FORMS"])
#print(response)
doc = Document(response)
for page in doc.pages:
    # Print fields
    print("Fields:")
    for field in page.form.fields:
        print("Key: {}, Value: {}".format(field.key, field.value))
    # Get field by key
    print("\nGet Field by Key:")
    key = "Phone Number:"
    field = page.form.getFieldByKey(key)
    if(field):
        print("Key: {}, Value: {}".format(field.key, field.value))
    # Search fields by key
    print("\nSearch Fields:")
    key = "address"
    fields = page.form.searchFieldsByKey(key)
    for field in fields:
        print("Key: {}, Value: {}".format(field.key, field.value))
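The trp helper hides how form data is laid out in the raw response. As a rough sketch of what happens underneath: a KEY block of type KEY_VALUE_SET points to its words through a CHILD relationship and to its VALUE block through a VALUE relationship. The functions `get_text` and `extract_key_values` are names introduced here, and `sample` is hand-written data in that shape, not real output.

```python
def get_text(block, blocks_by_id):
    """Join the text of a block's CHILD words (or selection status for checkboxes)."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def extract_key_values(response):
    """Build a {key text: value text} dict from KEY_VALUE_SET blocks."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(
                    get_text(blocks_by_id[i], blocks_by_id) for i in rel["Ids"]
                )
        pairs[get_text(block, blocks_by_id)] = value_text
    return pairs


# Hand-written data in the shape of an analyze_document FORMS response.
sample = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Phone"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Number:"},
        {"Id": "w3", "BlockType": "WORD", "Text": "555-0100"},
    ]
}

pairs = extract_key_values(sample)
print(pairs)
```

This version keeps only the last value when two fields share the same key text, which is a simplification. Use a list of pairs instead if your forms repeat keys.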
4. Showing table processing
import boto3
# trp is the same response parser module used in the form example
from trp import Document
# Document
s3BucketName = "Your s3 bucket name"
documentName = "document.jpg"
# Amazon Textract client
textract = boto3.client('textract')
# Call Amazon Textract
response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["TABLES"])
#print(response)
doc = Document(response)
for page in doc.pages:
    # Print tables
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}".format(r, c, cell.text))
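The console offers table results as CSV, and something similar can be produced from the raw blocks. The sketch below assumes the documented block layout, where a TABLE block lists its CELL blocks as children and each CELL has a 1-based RowIndex and ColumnIndex plus child WORD blocks. `table_to_rows` and `tables_to_csv` are names introduced here, `sample` is made-up data, and merged cells are not handled.

```python
import csv
import io


def table_to_rows(table_block, blocks_by_id):
    """Turn one TABLE block into a list of rows (lists of cell strings)."""
    cells = {}
    for rel in table_block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for cell_id in rel["Ids"]:
            cell = blocks_by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                blocks_by_id[word_id]["Text"]
                for cell_rel in cell.get("Relationships", [])
                if cell_rel["Type"] == "CHILD"
                for word_id in cell_rel["Ids"]
                if blocks_by_id[word_id]["BlockType"] == "WORD"
            ]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
    if not cells:
        return []
    n_rows = max(r for r, _ in cells)
    n_cols = max(c for _, c in cells)
    # Textract indexes rows and columns from 1; missing cells become "".
    return [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]


def tables_to_csv(response):
    """Write every table in the response to one CSV string."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for block in response["Blocks"]:
        if block["BlockType"] == "TABLE":
            writer.writerows(table_to_rows(block, blocks_by_id))
    return buffer.getvalue()


# Made-up 2x2 table in the shape of an analyze_document TABLES response.
sample = {
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c11", "c12", "c21", "c22"]}]},
        {"Id": "c11", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "c12", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
        {"Id": "c21", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
        {"Id": "c22", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w5"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Price"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Blue"},
        {"Id": "w4", "BlockType": "WORD", "Text": "Pen"},
        {"Id": "w5", "BlockType": "WORD", "Text": "3"},
    ]
}

csv_text = tables_to_csv(sample)
print(csv_text)
```

To save the result, write `csv_text` to a file opened in text mode.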
5. Showing PDF document processing
import boto3
import time
def startJob(s3BucketName, objectName):
    # Start an asynchronous text detection job for a document in S3
    client = boto3.client('textract')
    response = client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': objectName
            }
        })
    return response["JobId"]
def isJobComplete(jobId):
    # Poll every 5 seconds until the job leaves the IN_PROGRESS state,
    # then return the final status string
    time.sleep(5)
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))
    while(status == "IN_PROGRESS"):
        time.sleep(5)
        response = client.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))
    return status
def getJobResults(jobId):
    # Collect every page of results by following NextToken
    pages = []
    time.sleep(5)
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    nextToken = None
    if('NextToken' in response):
        nextToken = response['NextToken']
    while(nextToken):
        time.sleep(5)
        response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)
        pages.append(response)
        print("Resultset page received: {}".format(len(pages)))
        nextToken = None
        if('NextToken' in response):
            nextToken = response['NextToken']
    return pages
# Document
s3BucketName = "Your s3 bucket name"
documentName = "document.pdf"
jobId = startJob(s3BucketName, documentName)
print("Started job with id: {}".format(jobId))
# isJobComplete returns a status string (always truthy), so compare it explicitly
if isJobComplete(jobId) == "SUCCEEDED":
    response = getJobResults(jobId)
    #print(response)
    # Print detected text
    for resultPage in response:
        for item in resultPage["Blocks"]:
            if item["BlockType"] == "LINE":
                print('\033[94m' + item["Text"] + '\033[0m')
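The NextToken handling above can be tried without an AWS account by passing the fetch function in as a parameter. This is a minimal sketch of the same pagination idea. `collect_all_pages` is a helper name introduced here, and `FakeTextractClient` is a hypothetical in-memory stand-in that only imitates the shape of get_document_text_detection responses.

```python
def collect_all_pages(get_page, job_id):
    """Call get_page repeatedly, following NextToken until none is returned."""
    pages = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = get_page(**kwargs)
        pages.append(response)
        next_token = response.get("NextToken")
        if not next_token:
            return pages


class FakeTextractClient:
    """Hypothetical stand-in for the boto3 Textract client (illustration only)."""

    def __init__(self):
        # Two result pages; the first one points to the second via NextToken.
        self._pages = {
            None: {
                "JobStatus": "SUCCEEDED",
                "Blocks": [{"BlockType": "LINE", "Text": "page one"}],
                "NextToken": "t1",
            },
            "t1": {
                "JobStatus": "SUCCEEDED",
                "Blocks": [{"BlockType": "LINE", "Text": "page two"}],
            },
        }

    def get_document_text_detection(self, JobId, NextToken=None):
        return self._pages[NextToken]


client = FakeTextractClient()
pages = collect_all_pages(client.get_document_text_detection, "job-123")
texts = [
    block["Text"]
    for page in pages
    for block in page["Blocks"]
    if block["BlockType"] == "LINE"
]
print(texts)
```

With real credentials, the same helper should work by passing `boto3.client('textract').get_document_text_detection` as `get_page`, once the job status is SUCCEEDED.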