pytesseract · PyPI · Python Package Index

Python-tesseract is a Python wrapper for Google's Tesseract-OCR

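Project description

Python-tesseract is an optical character recognition (OCR) tool for Python. That is, it recognizes and "reads" the text embedded in images.

Python-tesseract is a wrapper for Google's Tesseract-OCR Engine. It is also useful as a stand-alone invocation script for tesseract, because it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, when used as a script, Python-tesseract prints the recognized text instead of writing it to a file.

Usage

Quickstart

Note: Test images are located in the tests/data folder of the Git repo.

Library usage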

from PIL import Image

import pytesseract

# If you don't have tesseract executable in your PATH, include the following:
pytesseract.pytesseract.tesseract_cmd = r'<full_path_to_your_tesseract_executable>'
# Example tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract'

# Simple image to string
print(pytesseract.image_to_string(Image.open('test.png')))

# In order to bypass the image conversions of pytesseract, just use relative or absolute image path
# NOTE: In this case you should provide tesseract supported images or tesseract will return error
print(pytesseract.image_to_string('test.png'))

# List of available languages
print(pytesseract.get_languages(config=''))

# French text image to string
print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))

# Batch processing with a single file containing the list of multiple image file paths
print(pytesseract.image_to_string('images.txt'))

# Timeout/terminate the tesseract job after a period of time
try:
    print(pytesseract.image_to_string('test.jpg', timeout=2)) # Timeout after 2 seconds
    print(pytesseract.image_to_string('test.jpg', timeout=0.5)) # Timeout after half a second
except RuntimeError as timeout_error:
    # Tesseract processing is terminated
    pass

# Get bounding box estimates
print(pytesseract.image_to_boxes(Image.open('test.png')))

# Get verbose data including boxes, confidences, line and page numbers
print(pytesseract.image_to_data(Image.open('test.png')))

# Get information about orientation and script detection
print(pytesseract.image_to_osd(Image.open('test.png')))

# Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
with open('test.pdf', 'w+b') as f:
    f.write(pdf) # pdf type is bytes by default

# Get HOCR output
hocr = pytesseract.image_to_pdf_or_hocr('test.png', extension='hocr')

# Get ALTO XML output
xml = pytesseract.image_to_alto_xml('test.png')

# getting multiple types of output with one call to save compute time
# currently supports mix and match of the following: txt, pdf, hocr, box, tsv
text, boxes = pytesseract.run_and_get_multiple_output('test.png', extensions=['txt', 'box'])
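
The string returned by image_to_boxes follows Tesseract's box-file layout: one character per line as `char left bottom right top page`, with the origin at the bottom-left corner of the image. The sketch below is an illustration rather than part of pytesseract. The helper name parse_boxes, the CharBox type, and the sample string are made up for this example. It converts each row to a top-left origin, which is what Pillow and OpenCV drawing calls expect.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    top: int      # measured from the top edge of the image
    right: int
    bottom: int   # measured from the top edge of the image


def parse_boxes(box_text: str, image_height: int) -> List[CharBox]:
    """Parse image_to_boxes output into top-left-origin boxes (hypothetical helper)."""
    boxes = []
    for line in box_text.splitlines():
        # Split from the right: the last five fields are always numbers.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char, left, bottom, right, top, _page = parts
        boxes.append(CharBox(char,
                             int(left),
                             image_height - int(top),
                             int(right),
                             image_height - int(bottom)))
    return boxes


# Hand-written sample in the box-file layout, not real OCR output.
sample = "H 10 70 30 95 0\ni 34 70 40 95 0\n"
print(parse_boxes(sample, image_height=100))
```

The image height is needed because the flip from bottom-left to top-left origin depends on it; with a Pillow image it is available as image.height.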

Support for OpenCV image/NumPy array objects

import cv2
import pytesseract
from PIL import Image

img_cv = cv2.imread(r'/<path_to_image>/digits.png')

# By default OpenCV stores images in BGR format and since pytesseract assumes RGB format,
# we need to convert from BGR to RGB format/mode:
img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)
print(pytesseract.image_to_string(img_rgb))
# OR
# Pillow expects size as (width, height), while img_cv.shape is (height, width, channels)
img_rgb = Image.frombytes('RGB', img_cv.shape[1::-1], img_cv, 'raw', 'BGR', 0, 0)
print(pytesseract.image_to_string(img_rgb))
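
One detail to watch when moving between the two libraries: a NumPy/OpenCV array reports its shape as (height, width, channels), while Pillow sizes are (width, height). A minimal sketch with a hypothetical helper (pil_size_from_shape is not part of any library) makes the swap explicit:

```python
from typing import Sequence, Tuple


def pil_size_from_shape(shape: Sequence[int]) -> Tuple[int, int]:
    """Turn a NumPy-style shape (height, width[, channels]) into a Pillow (width, height) size."""
    height, width = shape[0], shape[1]
    return (width, height)


# A 480-row by 640-column BGR image has shape (480, 640, 3).
print(pil_size_from_shape((480, 640, 3)))  # (640, 480)
```

On a real array the same result comes from the slice shape[1::-1]. For square images the order makes no difference, which is why a mix-up can go unnoticed in tests.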

If you need custom configuration such as oem/psm, use the config keyword.

# Example of adding any additional options
custom_oem_psm_config = r'--oem 3 --psm 6'
pytesseract.image_to_string(image, config=custom_oem_psm_config)

# Example of using pre-defined tesseract config file with options
cfg_filename = 'words'
pytesseract.run_and_get_output(image, extension='txt', config=cfg_filename)

If you get a tessdata error such as "Error opening data file...", add the following config:

# Example config: r'--tessdata-dir "C:\Program Files (x86)\Tesseract-OCR\tessdata"'
# It's important to add double quotes around the dir path.
tessdata_dir_config = r'--tessdata-dir "<replace_with_your_tessdata_dir_path>"'
pytesseract.image_to_string(image, lang='chi_sim', config=tessdata_dir_config)
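
When several of these options are combined, building the config string in one place helps keep the quoting consistent. The following is a minimal sketch, not a pytesseract feature: build_config is a hypothetical helper, and it only covers the three options shown in this section.

```python
from typing import Optional


def build_config(oem: Optional[int] = None,
                 psm: Optional[int] = None,
                 tessdata_dir: Optional[str] = None) -> str:
    """Assemble a Tesseract option string for the config keyword (hypothetical helper)."""
    parts = []
    if oem is not None:
        parts.append(f"--oem {oem}")
    if psm is not None:
        parts.append(f"--psm {psm}")
    if tessdata_dir is not None:
        # Double quotes keep a path containing spaces together as one argument.
        parts.append(f'--tessdata-dir "{tessdata_dir}"')
    return " ".join(parts)


config = build_config(oem=3, psm=6,
                      tessdata_dir=r"C:\Program Files (x86)\Tesseract-OCR\tessdata")
print(config)
# The result would then be passed on, for example:
# pytesseract.image_to_string(image, lang='chi_sim', config=config)
```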

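Functions

get_languages Returns all languages currently supported by Tesseract OCR.
get_tesseract_version Returns the Tesseract version installed on the system.
image_to_string Returns the unmodified output of Tesseract OCR processing as a string.
image_to_boxes Returns a result containing recognized characters and their box boundaries.
image_to_data Returns a result containing box boundaries, confidences, and other information. Requires Tesseract 3.05+. For more information, see the Tesseract TSV documentation.
image_to_osd Returns a result containing information about orientation and script detection.
image_to_alto_xml Returns the result in Tesseract's ALTO XML format.
run_and_get_output Returns the raw output from Tesseract OCR. Gives more control over the parameters sent to tesseract.
run_and_get_multiple_output Returns output like run_and_get_output, but can handle multiple extensions. In place of the extension: str keyword argument it takes extensions: List[str], a list of extensions whose corresponding data is returned after a single tesseract call. This reduces the number of calls to tesseract when several output formats, such as text and bounding boxes, are needed.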
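
Because image_to_data needs Tesseract 3.05 or newer, a script can check the installed version before calling it. A minimal sketch, assuming the version is available as a dotted string (for instance the text form of what get_tesseract_version returns); version_tuple and supports_image_to_data are hypothetical names, not pytesseract functions:

```python
import re
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric components, e.g. '4.1.1' -> (4, 1, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def supports_image_to_data(version: str) -> bool:
    """True if the version meets the 3.05 minimum stated for image_to_data."""
    return version_tuple(version) >= (3, 5)


for v in ("3.04.01", "3.05.00", "5.3.0"):
    print(v, supports_image_to_data(v))
```

Comparing tuples of integers avoids the pitfalls of comparing version strings directly, where "3.10" would sort before "3.5".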

Parameters

image_to_data(image, lang=None, config='', nice=0, output_type=Output.STRING, timeout=0, pandas_config=None)

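image Object or String - either a PIL Image, a NumPy array, or the file path of the image to be processed by Tesseract. If you pass an object instead of a file path, pytesseract will implicitly convert the image to RGB mode.
lang String - Tesseract language code string. Defaults to eng if not specified. Example for multiple languages: lang='eng+fra'
config String - any additional custom configuration flags that are not available via the pytesseract function. For example: config='--psm 6'
nice Integer - modifies the processor priority for the Tesseract run. Not supported on Windows. Nice adjusts the niceness of Unix-like processes.
output_type Class attribute - specifies the type of the output; defaults to string. For the full list of supported types, see the definition of the pytesseract.Output class.
timeout Integer or Float - duration in seconds for the OCR processing, after which pytesseract will terminate and raise RuntimeError.
pandas_config Dict - only for the Output.DATAFRAME type. Dictionary with custom arguments for pandas.read_csv. Allows you to customize the output of image_to_data.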
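
With the default output_type, image_to_data returns a tab-separated string. The sketch below shows one way to turn that string into word records and drop low-confidence words. It assumes the standard Tesseract TSV columns (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text). The helper name words_from_tsv, the threshold of 60, and the sample string are illustrative and not part of pytesseract.

```python
import csv
import io
from typing import Dict, List


def words_from_tsv(tsv: str, min_conf: float = 60.0) -> List[Dict[str, object]]:
    """Return recognized words with confidence >= min_conf (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])  # layout rows (page, block, line) carry conf -1
        if text and conf >= min_conf:
            words.append({
                "text": text,
                "conf": conf,
                "box": (int(row["left"]), int(row["top"]),
                        int(row["width"]), int(row["height"])),
            })
    return words


# Hand-written sample in the TSV layout, not real OCR output.
sample = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t200\t50\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t12\t60\t20\t96\tHello\n"
    "5\t1\t1\t1\t1\t2\t80\t12\t70\t20\t41\tw0rld\n"
)
print(words_from_tsv(sample))
```

Passing output_type=pytesseract.Output.DICT returns the same columns as a dictionary of lists, which avoids the parsing step when only Python-side filtering is needed.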

CLI usage

pytesseract [-l lang] image_file

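Installation

Prerequisites

Python-tesseract requires Python 3.6+
You will need the Python Imaging Library (PIL) (or the Pillow fork). See the Pillow documentation for a basic Pillow installation.
Install Google Tesseract OCR (additional info on how to install the engine on Linux, Mac OSX and Windows). You must be able to invoke the tesseract command as tesseract. If this isn't the case, for example because tesseract isn't in your PATH, you will have to change the "tesseract_cmd" variable pytesseract.pytesseract.tesseract_cmd. Under Debian/Ubuntu you can use the package tesseract-ocr. For Mac OS users, install the homebrew package tesseract.

Note: In some rare cases, you might also need to install tessconfigs and configs from tesseract-ocr/tessconfigs if the OS-specific package doesn't include them.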
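
A script can check whether the tesseract command is reachable before any OCR call, and fall back to a known install location when it is not on PATH. This is a minimal sketch using only the standard library. find_tesseract is a hypothetical helper, and the Windows path is an assumed location based on the example path in the usage section with .exe appended.

```python
import shutil
from pathlib import Path
from typing import Iterable, Optional


def find_tesseract(extra_candidates: Iterable[str] = ()) -> Optional[str]:
    """Return a path to the tesseract executable, or None if it cannot be found."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Not on PATH: try the caller-supplied locations in order.
    for candidate in extra_candidates:
        if Path(candidate).is_file():
            return candidate
    return None


# Assumed install location; adjust for your system.
tesseract_path = find_tesseract([r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"])
if tesseract_path is None:
    print("tesseract not found; install it or add it to PATH")
else:
    print("using", tesseract_path)
    # With pytesseract installed, the path would then be applied like this:
    # import pytesseract
    # pytesseract.pytesseract.tesseract_cmd = tesseract_path
```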

Installing via pip

Check the pytesseract package page for more information.

pip install pytesseract

Or, if you have git installed:

pip install -U git+https://github.com/madmaze/pytesseract.git

Installing from source

git clone https://github.com/madmaze/pytesseract.git
cd pytesseract && pip install -U .

Installing with conda (via conda-forge)

conda install -c conda-forge pytesseract

Testing

To run this project's test suite, install and run tox. Make sure tesseract is installed and on your PATH.

pip install tox
tox

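License

Check the LICENSE file included in the Python-tesseract repository/distribution. As of Python-tesseract 0.3.1, the license is Apache License Version 2.0.

Contributors

Originally written by Samuel Hoffstaetter.
Full list of contributors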

Release history

Version	Release date
0.3.13 (this version)	Aug 16, 2024
0.3.10	Aug 16, 2022
0.3.9	Feb 19, 2022
0.3.8	Jun 28, 2021
0.3.7	Dec 15, 2020
0.3.6	Sep 4, 2020
0.3.5	Aug 9, 2020
0.3.4	Apr 19, 2020
0.3.3	Mar 8, 2020
0.3.2	Jan 25, 2020
0.3.1	Dec 20, 2019
0.3.0	Aug 23, 2019
0.2.9	Aug 16, 2019
0.2.8	Aug 16, 2019
0.2.7	Jun 19, 2019
0.2.6	Dec 16, 2018
0.2.5	Oct 5, 2018
0.2.4	Jul 20, 2018
0.2.2	May 31, 2018
0.2.0	Jan 31, 2018
0.1.9	Jan 31, 2018
0.1.8	Jan 21, 2018
0.1.7	May 29, 2017
0.1.6	Mar 19, 2015
0.1.5	Aug 14, 2014
0.1.4	Aug 11, 2014
0.1.3	Aug 4, 2014
0.1	Feb 6, 2014

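Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source distribution

pytesseract-0.3.13.tar.gz (17.7 kB)

Uploaded Aug 16, 2024, Source

Built distribution

pytesseract-0.3.13-py3-none-any.whl (14.7 kB)

Uploaded Aug 16, 2024, Python 3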

Hashes for pytesseract-0.3.13.tar.gz
Algorithm	Hash digest
SHA256	`4bf5f880c99406f52a3cfc2633e42d9dc67615e69d8a509d74867d3baddb5db9`
MD5	`73f9645e59b437f064d05882b95832ce`
BLAKE2b-256	`9fa67d679b83c285974a7cb94d739b461fa7e7a9b17a3abfd7bf6cbc5c2394b0`

Hashes for pytesseract-0.3.13-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7a99c6c2ac598360693d83a416e36e0b33a67638bb9d77fdcac094a3589d4b34`
MD5	`5f7a5e451c773cce28c6834fc79c7699`
BLAKE2b-256	`7a338312d7ce74670c9d39a532b2c246a853861120486be9443eebf048043637`
