Hemant Vishwakarma: Using Textract for OCR locally

Saturday 3 October 2020

Using Textract for OCR locally

I want to extract text from an image using Python. (The Tesseract library does not work for me because it requires installation.)

I have found the boto3 library and Amazon Textract, but I'm having trouble getting them to work. I'm still new to this. Can you tell me what I need to do to run my script correctly? This is my code:

import cv2
import boto3
import textract


#img = cv2.imread('slika2.jpg') #this is jpg file
with open('slika2.pdf', 'rb') as document:
    img = bytearray(document.read())

textract = boto3.client('textract',region_name='us-west-2')

response = textract.detect_document_text(Document={'Bytes': img})  # gives me error
print(response)

When I run this code, I get:

Invalid type for parameter Document.Bytes, value: '''very long array'''
type: <class 'numpy.ndarray'>, valid types: <class 'bytes'>, <class 'bytearray'>, file-like object
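
The type named in that message, numpy.ndarray, is what cv2.imread returns, so the error most likely comes from passing the decoded OpenCV image (the commented-out line) instead of the file's raw bytes. Below is a minimal sketch of the difference, assuming the image is a file on disk. The names load_document_bytes and is_valid_bytes_param are hypothetical helpers, not part of boto3.

```python
from pathlib import Path


def load_document_bytes(path):
    """Return the raw, still-encoded file contents as bytes.

    Hypothetical helper: Document={'Bytes': ...} expects the encoded
    image file bytes, not a decoded pixel array.
    """
    return Path(path).read_bytes()


def is_valid_bytes_param(value):
    """Mirror the 'valid types' listed in the error message:
    bytes, bytearray, or a file-like object."""
    return isinstance(value, (bytes, bytearray)) or hasattr(value, "read")


# If the image has already been loaded with OpenCV, it can be
# re-encoded to bytes first (cv2.imencode returns a flag and a buffer):
#   ok, buf = cv2.imencode('.jpg', img)
#   image_bytes = buf.tobytes()
```

With this, load_document_bytes('slika2.jpg') gives a value that passes the type check, while the array returned by cv2.imread('slika2.jpg') would not.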

I have also tried this:

# Document
documentName = "slika2.jpg"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Amazon Textract client
textract = boto3.client('textract',region_name='us-west-2')

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes}) #ERROR

#print(response)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print ('\033[94m' +  item["Text"] + '\033[0m')

But I get this error:

botocore.exceptions.NoCredentialsError: Unable to locate credentials

I'm a beginner with this, so any help would be appreciated. How can I read text from my image or PDF file?

I have also added this block of code, but the error is still "Unable to locate credentials".

session = boto3.Session(
    aws_access_key_id='xxxxxxxxxxxx',
    aws_secret_access_key='yyyyyyyyyyyyyyyyyyyyy'
)
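
One likely reason the error persists: the session above is created but never used. A call to boto3.client(...) builds its client from the default session, which does not know about these keys. Below is a hedged sketch that creates the client from the session instead and collects the LINE blocks. The names make_textract_client, extract_lines and detect_lines are hypothetical helpers, and the key values are placeholders.

```python
def make_textract_client(access_key, secret_key, region="us-west-2"):
    """Build a Textract client from an explicit session."""
    import boto3  # imported here so the helpers below work without boto3

    session = boto3.Session(
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        region_name=region,
    )
    # session.client(...) uses this session's credentials;
    # boto3.client(...) would use the default session instead.
    return session.client("textract")


def extract_lines(response):
    """Return the text of every LINE block in a detect_document_text response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]


def detect_lines(client, path):
    """Send the raw file bytes to Textract and return the detected lines."""
    with open(path, "rb") as document:
        response = client.detect_document_text(
            Document={"Bytes": document.read()}
        )
    return extract_lines(response)
```

Usage would look like detect_lines(make_textract_client('xxxxxxxxxxxx', 'yyyyyyyyyyyyyyyyyyyyy'), 'slika2.jpg'). Hard-coding keys is only for illustration. boto3 can also read them from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables or from the ~/.aws/credentials file.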
