How can I analyze a PDF document synchronously with Amazon Textract?

    import boto3
    import time


    def startJob(s3BucketName, objectName):
        # Start an asynchronous text-detection job for a document stored in S3
        client = boto3.client('textract')
        response = client.start_document_text_detection(
            DocumentLocation={
                'S3Object': {
                    'Bucket': s3BucketName,
                    'Name': objectName
                }
            })
        return response["JobId"]


    def isJobComplete(jobId):
        # For production use cases, use SNS based notification
        # Details at: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
        time.sleep(5)
        client = boto3.client('textract')
        response = client.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))

        # Poll until the job leaves the IN_PROGRESS state
        while status == "IN_PROGRESS":
            time.sleep(5)
            response = client.get_document_text_detection(JobId=jobId)
            status = response["JobStatus"]
            print("Job status: {}".format(status))

        return status


    def getJobResults(jobId):
        # Collect every page of results by following NextToken
        pages = []
        client = boto3.client('textract')
        response = client.get_document_text_detection(JobId=jobId)
        pages.append(response)
        print("Resultset page received: {}".format(len(pages)))
        nextToken = response.get('NextToken')

        while nextToken:
            response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)
            pages.append(response)
            print("Resultset page received: {}".format(len(pages)))
            nextToken = response.get('NextToken')

        return pages


    # Document
    s3BucketName = "ki-textract-demo-docs"
    documentName = "Amazon-Textract-Pdf.pdf"

    jobId = startJob(s3BucketName, documentName)
    print("Started job with id: {}".format(jobId))

    # isJobComplete returns the final status string, so compare it explicitly;
    # any non-empty string (including "FAILED") would otherwise be truthy
    if isJobComplete(jobId) == "SUCCEEDED":
        response = getJobResults(jobId)
        # print(response)

        # Print detected text
        for resultPage in response:
            for item in resultPage["Blocks"]:
                if item["BlockType"] == "LINE":
                    print('\033[94m' + item["Text"] + '\033[0m')

1 Answer

User

#1 · Posted on 2024-04-19 15:50:47

At present, Textract cannot process PDF documents synchronously. From the Textract documentation:

Amazon Textract synchronous operations (DetectDocumentText and AnalyzeDocument) support the PNG and JPEG image formats. Asynchronous operations (StartDocumentTextDetection, StartDocumentAnalysis) also support the PDF file format.

One workaround is to convert the PDF document into images in your code, and then process those images with the synchronous API operations.
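
A minimal sketch of that workaround, under some assumptions: the answer does not name a conversion library, so this uses the third-party `pdf2image` package (which needs poppler installed) to render pages, and the helper names `pdf_to_png_bytes`, `extract_lines`, `detect_text_sync`, and `run_example` are made up for illustration. Each rendered page is sent to the synchronous `DetectDocumentText` operation as raw PNG bytes.

```python
import io


def pdf_to_png_bytes(pdf_path, dpi=200):
    """Render each PDF page to PNG bytes.

    Assumption: pdf2image (and poppler) are installed; the original answer
    does not say which converter to use.
    """
    from pdf2image import convert_from_path  # third-party, imported lazily

    for page in convert_from_path(pdf_path, dpi=dpi):
        buffer = io.BytesIO()
        page.save(buffer, format="PNG")
        yield buffer.getvalue()


def extract_lines(response):
    """Return the text of every LINE block in a Textract response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]


def detect_text_sync(client, page_images):
    """Call the synchronous DetectDocumentText API once per page image."""
    lines = []
    for image_bytes in page_images:
        response = client.detect_document_text(Document={"Bytes": image_bytes})
        lines.extend(extract_lines(response))
    return lines


def run_example(pdf_path="Amazon-Textract-Pdf.pdf"):
    """Hypothetical end-to-end run; needs AWS credentials and a local PDF."""
    import boto3

    client = boto3.client("textract")
    for line in detect_text_sync(client, pdf_to_png_bytes(pdf_path)):
        print(line)
```

Calling `run_example()` with a local copy of the PDF prints the detected lines page by page. Because each page is a separate request, this approach trades the polling loop of the asynchronous job for one API call per page.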
