Amazon Textract Caller tools
Project description
Textract-Caller
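amazon-textract-caller provides a collection of ready-to-use functions and sample implementations to speed up the evaluation and development of any project that uses Amazon Textract.
It makes Amazon Textract easy to call regardless of file type and location.
Installation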
> python -m pip install amazon-textract-caller
Functions
from textractcaller import call_textract
def call_textract(input_document: Union[str, bytes],
                  features: Optional[List[Textract_Features]] = None,
                  queries_config: Optional[QueriesConfig] = None,
                  output_config: Optional[OutputConfig] = None,
                  adapters_config: Optional[AdaptersConfig] = None,
                  kms_key_id: str = "",
                  job_tag: str = "",
                  notification_channel: Optional[NotificationChannel] = None,
                  client_request_token: str = "",
                  return_job_id: bool = False,
                  force_async_api: bool = False,
                  call_mode: Textract_Call_Mode = Textract_Call_Mode.DEFAULT,
                  boto3_textract_client=None,
                  job_done_polling_interval=1) -> dict:
get_full_json is also useful for retrieving the JSON response of an asynchronous job (start_document_text_detection or start_document_analysis).
from textractcaller import get_full_json
def get_full_json(job_id: str = None,
                  textract_api: Textract_API = Textract_API.DETECT,
                  boto3_textract_client=None) -> dict:
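Asynchronous Textract results are returned in pages linked by a NextToken value, so collecting the full JSON means following that token until it is no longer present. The following is a minimal sketch of that idea, not the library's own code. The helper name collect_all_pages is hypothetical, and the client is assumed to be a boto3 Textract client (or any object exposing get_document_text_detection).

```python
from typing import Any, Dict, List


def collect_all_pages(textract_client: Any, job_id: str) -> Dict[str, Any]:
    """Merge every result page of a finished text-detection job into one dict.

    Hypothetical helper for illustration; not part of amazon-textract-caller.
    """
    blocks: List[dict] = []
    response: Dict[str, Any] = {}
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = textract_client.get_document_text_detection(**kwargs)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
        if not next_token:
            break
    # Keep the metadata of the last page, but replace Blocks with the full list.
    merged = {k: v for k, v in response.items() if k not in ("Blocks", "NextToken")}
    merged["Blocks"] = blocks
    return merged
```

For jobs started with start_document_analysis, the same loop would presumably call get_document_analysis instead.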
get_full_json_from_output_config is useful for retrieving the JSON from the OutputConfig location.
from textractcaller import get_full_json_from_output_config
def get_full_json_from_output_config(output_config: OutputConfig = None,
                                     job_id: str = None,
                                     s3_client=None) -> dict:
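When an OutputConfig is supplied, Textract writes the job result to S3 as several part files. The sketch below shows one way such parts could be read back and merged. It assumes the parts are stored as numerically named objects under `<prefix>/<job_id>/` and that a single list_objects_v2 page is enough; both are assumptions made for illustration, and read_output_config_result is a hypothetical name, not a library function.

```python
import json
from typing import Any, Dict, List


def read_output_config_result(s3_client: Any, bucket: str, prefix: str,
                              job_id: str) -> Dict[str, Any]:
    """Merge the Blocks of all part files written for one job.

    Hypothetical helper; assumes parts are named "1", "2", ... under
    "<prefix>/<job_id>/".
    """
    base = "/".join(p for p in (prefix.strip("/"), job_id) if p) + "/"
    listing = s3_client.list_objects_v2(Bucket=bucket, Prefix=base)
    # Keep only numerically named parts; this skips helper objects such as
    # an access-check marker file.
    part_keys = [obj["Key"] for obj in listing.get("Contents", [])
                 if obj["Key"][len(base):].isdigit()]
    # Sort numerically so that part "10" comes after part "2".
    part_keys.sort(key=lambda key: int(key[len(base):]))
    blocks: List[dict] = []
    for key in part_keys:
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        blocks.extend(json.loads(body).get("Blocks", []))
    return {"Blocks": blocks}
```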
Samples
Calling with a file from the local filesystem, using only detect_text
textract_json = call_textract(input_document="/folder/local-filesystem-file.png")
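Calling with a file from the local filesystem, using only detect_text, and passing the result to the Textract Response Parser
(needs the trp dependency, installed through python -m pip install amazon-textract-response-parser)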
import json
from trp import Document
from textractcaller import call_textract
textract_json = call_textract(input_document="/folder/local-filesystem-file.png")
d = Document(textract_json)
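For readers who want to inspect the response without the parser, the returned JSON is a dict whose "Blocks" list contains entries with a "BlockType" such as PAGE, LINE, or WORD. A minimal sketch that groups the LINE text by page is shown here; lines_by_page is a hypothetical helper name, and the sample dict is made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def lines_by_page(textract_json: dict) -> Dict[int, List[str]]:
    """Group the text of all LINE blocks by page number.

    Hypothetical helper; blocks without a "Page" field are treated as page 1.
    """
    pages: Dict[int, List[str]] = defaultdict(list)
    for block in textract_json.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            pages[block.get("Page", 1)].append(block.get("Text", ""))
    return dict(pages)


# Made-up response fragment, only to show the shape of the data.
sample = {"Blocks": [
    {"BlockType": "PAGE", "Id": "p1"},
    {"BlockType": "LINE", "Id": "l1", "Text": "Employee name"},
    {"BlockType": "WORD", "Id": "w1", "Text": "Employee"},
    {"BlockType": "LINE", "Id": "l2", "Text": "Gross pay", "Page": 2},
]}
for page, lines in lines_by_page(sample).items():
    print(page, lines)
```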
Calling with Queries for a multi-page document and extracting the answers
This sample also uses the amazon-textract-response-parser
python -m pip install amazon-textract-caller amazon-textract-response-parser
import textractcaller as tc
import trp.trp2 as t2
import boto3
textract = boto3.client('textract', region_name="us-east-2")
q1 = tc.Query(text="What is the employee SSN?", alias="SSN", pages=["1"])
q2 = tc.Query(text="What is YTD gross pay?", alias="GROSS_PAY", pages=["2"])
textract_json = tc.call_textract(
    input_document="s3://amazon-textract-public-content/blogs/2-pager.pdf",
    queries_config=tc.QueriesConfig(queries=[q1, q2]),
    features=[tc.Textract_Features.QUERIES],
    force_async_api=True,
    boto3_textract_client=textract)
t_doc: t2.TDocument = t2.TDocumentSchema().load(textract_json)  # type: ignore
for page in t_doc.pages:
    query_answers = t_doc.get_query_answers(page=page)
    for x in query_answers:
        print(f"{x[1]},{x[2]}")
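In the raw JSON, each QUERY block carries the question text and alias in its "Query" field and points to its QUERY_RESULT block through a relationship of type ANSWER. The sketch below walks that structure directly, as a way to see what the parser call is resolving. The helper query_answers_from_json is hypothetical and the sample data is invented; the tuple order (query, alias, answer) is chosen to mirror the indices printed in the sample above, which is an assumption about the parser's return shape.

```python
from typing import List, Tuple


def query_answers_from_json(textract_json: dict) -> List[Tuple[str, str, str]]:
    """Return (query text, alias, answer text) for every answered query.

    Hypothetical helper for illustration; not part of either library.
    """
    blocks = textract_json.get("Blocks", [])
    by_id = {block["Id"]: block for block in blocks}
    results: List[Tuple[str, str, str]] = []
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        query = block.get("Query", {})
        for relationship in block.get("Relationships") or []:
            if relationship.get("Type") != "ANSWER":
                continue
            for answer_id in relationship.get("Ids", []):
                answer = by_id.get(answer_id, {})
                results.append((query.get("Text", ""),
                                query.get("Alias", ""),
                                answer.get("Text", "")))
    return results


# Invented response fragment with one answered and one unanswered query.
sample = {"Blocks": [
    {"BlockType": "QUERY", "Id": "q1",
     "Query": {"Text": "What is the employee SSN?", "Alias": "SSN"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
    {"BlockType": "QUERY_RESULT", "Id": "a1", "Text": "000-00-0000"},
    {"BlockType": "QUERY", "Id": "q2",
     "Query": {"Text": "What is YTD gross pay?", "Alias": "GROSS_PAY"}},
]}
for _, alias, answer in query_answers_from_json(sample):
    print(f"{alias},{answer}")
```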
Calling with Custom Queries for a multi-page document using an adapter
This sample also uses the amazon-textract-response-parser
python -m pip install amazon-textract-caller amazon-textract-response-parser
import textractcaller as tc
import trp.trp2 as t2
import boto3
textract = boto3.client('textract', region_name="us-east-2")
q1 = tc.Query(text="What is the employee SSN?", alias="SSN", pages=["1"])
q2 = tc.Query(text="What is YTD gross pay?", alias="GROSS_PAY", pages=["2"])
adapter1 = tc.Adapter(adapter_id="2e9bf1c4aa31", version="1", pages=["1"])
textract_json = tc.call_textract(
    input_document="s3://amazon-textract-public-content/blogs/2-pager.pdf",
    queries_config=tc.QueriesConfig(queries=[q1, q2]),
    adapters_config=tc.AdaptersConfig(adapters=[adapter1]),
    features=[tc.Textract_Features.QUERIES],
    force_async_api=True,
    boto3_textract_client=textract)
t_doc: t2.TDocument = t2.TDocumentSchema().load(textract_json)  # type: ignore
for page in t_doc.pages:
    query_answers = t_doc.get_query_answers(page=page)
    for x in query_answers:
        print(f"{x[1]},{x[2]}")
Calling with a file from the local filesystem with the TABLES feature
from textractcaller import call_textract, Textract_Features
features = [Textract_Features.TABLES]
response = call_textract(
    input_document="/folder/local-filesystem-file.png", features=features)
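With the TABLES feature, the response contains TABLE blocks whose CHILD relationships point to CELL blocks, and each CELL carries a RowIndex and ColumnIndex (both 1-based) plus CHILD links to its WORD blocks. A rough sketch of turning that into rows of strings follows. The helper tables_as_rows is hypothetical, it ignores merged cells and selection elements, and the sample response is invented.

```python
from typing import Dict, List, Tuple


def tables_as_rows(textract_json: dict) -> List[List[List[str]]]:
    """Rebuild each TABLE block as a list of rows of cell text.

    Hypothetical helper; merged cells and selection elements are ignored.
    """
    blocks = textract_json.get("Blocks", [])
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block: dict) -> List[str]:
        ids: List[str] = []
        for relationship in block.get("Relationships") or []:
            if relationship.get("Type") == "CHILD":
                ids.extend(relationship.get("Ids", []))
        return ids

    tables: List[List[List[str]]] = []
    for block in blocks:
        if block.get("BlockType") != "TABLE":
            continue
        cells: Dict[Tuple[int, int], str] = {}
        for cell_id in child_ids(block):
            cell = by_id.get(cell_id, {})
            if cell.get("BlockType") != "CELL":
                continue
            words = [by_id[w]["Text"] for w in child_ids(cell)
                     if by_id.get(w, {}).get("BlockType") == "WORD"]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append([[cells.get((r, c), "") for c in range(1, n_cols + 1)]
                       for r in range(1, n_rows + 1)])
    return tables


# Invented 2x2 table, only to show the block relationships.
sample = {"Blocks": [
    {"BlockType": "TABLE", "Id": "t1",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"BlockType": "CELL", "Id": "c1", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"BlockType": "CELL", "Id": "c2", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"BlockType": "CELL", "Id": "c3", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"BlockType": "CELL", "Id": "c4", "RowIndex": 2, "ColumnIndex": 2},
    {"BlockType": "WORD", "Id": "w1", "Text": "Item"},
    {"BlockType": "WORD", "Id": "w2", "Text": "Amount"},
    {"BlockType": "WORD", "Id": "w3", "Text": "Gross"},
    {"BlockType": "WORD", "Id": "w4", "Text": "pay"},
]}
for row in tables_as_rows(sample)[0]:
    print(row)
```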
Calling with an image located on S3 but forcing the asynchronous API
from textractcaller import call_textract
response = call_textract(input_document="s3://some-bucket/w2-example.png", force_async_api=True)
Calling with OutputConfig and a Customer-Managed Key
from textractcaller import call_textract
from textractcaller import OutputConfig, Textract_Features
output_config = OutputConfig(s3_bucket="somebucket-encrypted", s3_prefix="output/")
response = call_textract(input_document="s3://someprefix/somefile.png",
                         force_async_api=True,
                         output_config=output_config,
                         kms_key_id="arn:aws:kms:us-east-1:12345678901:key/some-key-id-ref-erence",
                         return_job_id=False,
                         job_tag="sometag",
                         client_request_token="sometoken")
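To relate these arguments to the underlying service call, the sketch below builds the keyword arguments that boto3's start_document_text_detection accepts (DocumentLocation, OutputConfig, KMSKeyId, JobTag, ClientRequestToken). It is an illustration of the mapping under the assumption that the caller forwards the values more or less directly; build_start_request is a hypothetical helper, not the library's implementation.

```python
from typing import Any, Dict


def build_start_request(s3_uri: str,
                        output_bucket: str = "",
                        output_prefix: str = "",
                        kms_key_id: str = "",
                        job_tag: str = "",
                        client_request_token: str = "") -> Dict[str, Any]:
    """Build kwargs for boto3's start_document_text_detection.

    Hypothetical helper for illustration; empty values are left out.
    """
    if not s3_uri.lower().startswith("s3://"):
        raise ValueError("expected an s3://bucket/key URI")
    # Split "s3://bucket/key/with/slashes" into bucket and key.
    bucket, _, key = s3_uri[len("s3://"):].partition("/")
    if not bucket or not key:
        raise ValueError("S3 URI must contain both a bucket and a key")
    request: Dict[str, Any] = {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}}
    }
    if output_bucket:
        request["OutputConfig"] = {"S3Bucket": output_bucket}
        if output_prefix:
            request["OutputConfig"]["S3Prefix"] = output_prefix
    if kms_key_id:
        request["KMSKeyId"] = kms_key_id
    if job_tag:
        request["JobTag"] = job_tag
    if client_request_token:
        request["ClientRequestToken"] = client_request_token
    return request


request = build_start_request("s3://someprefix/somefile.png",
                              output_bucket="somebucket-encrypted",
                              output_prefix="output/",
                              job_tag="sometag",
                              client_request_token="sometoken")
print(request)
```

With a real client, the result could then be passed on as textract.start_document_text_detection(**request).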
Calling with a PDF located on S3 and forcing the return of the JobId instead of the JSON response
from textractcaller import call_textract
response = call_textract(input_document="s3://some-bucket/some-document.pdf", return_job_id=True)
job_id = response['JobId']
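Once only the JobId is in hand, the job status has to be checked before results can be fetched. One simple approach is to poll until the JobStatus leaves IN_PROGRESS, sketched below against boto3's get_document_text_detection. The helper wait_for_job and its max_attempts parameter are hypothetical; in production a notification channel is often preferable to polling.

```python
import time
from typing import Any, Dict


def wait_for_job(textract_client: Any, job_id: str,
                 polling_interval: float = 1.0,
                 max_attempts: int = 600) -> Dict[str, Any]:
    """Poll a text-detection job until it finishes and return the first result page.

    Hypothetical helper for illustration; not part of amazon-textract-caller.
    """
    for _ in range(max_attempts):
        response = textract_client.get_document_text_detection(JobId=job_id)
        status = response.get("JobStatus")
        if status in ("SUCCEEDED", "PARTIAL_SUCCESS"):
            return response
        if status == "FAILED":
            message = response.get("StatusMessage", "")
            raise RuntimeError(f"Textract job {job_id} failed: {message}")
        # Still IN_PROGRESS: wait before asking again.
        time.sleep(polling_interval)
    raise TimeoutError(f"Textract job {job_id} did not finish in time")
```

The returned dict is only the first page of results; any NextToken in it still has to be followed to obtain the rest.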