pdftotext
Simple PDF text extraction
import pdftotext
# Load your PDF
with open("lorem_ipsum.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)
# If it's password-protected
with open("secure.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, "secret")
# How many pages?
print(len(pdf))
# Iterate over all the pages
for page in pdf:
    print(page)
# Read some individual pages
print(pdf[0])
print(pdf[1])
# Read all the text into one string
print("\n\n".join(pdf))
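Because a loaded PDF can be iterated page by page, ordinary Python tools apply to it directly. The sketch below is one way to find which pages mention a term. The helper names `pages_containing` and `read_pages` are hypothetical and not part of pdftotext; only `pdftotext.PDF(f)` and `pdftotext.PDF(f, password)` come from the usage shown above.

```python
def pages_containing(pages, term):
    """Return 0-based indexes of pages whose text contains `term`.

    `pages` is any iterable of page strings, for example a pdftotext.PDF
    object or a plain list. Matching is case-insensitive.
    (Hypothetical helper, not part of pdftotext.)
    """
    needle = term.lower()
    return [i for i, text in enumerate(pages) if needle in text.lower()]


def read_pages(path, password=None):
    """Load a PDF from `path` and return its pages as a list of strings.

    (Hypothetical helper.) pdftotext is imported lazily so that
    pages_containing() stays usable without it.
    """
    import pdftotext

    with open(path, "rb") as f:
        if password is None:
            pdf = pdftotext.PDF(f)
        else:
            pdf = pdftotext.PDF(f, password)
        return list(pdf)
```

With the sample file from above, `pages_containing(read_pages("lorem_ipsum.pdf"), "ipsum")` would list the pages that mention "ipsum".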
OS Dependencies
These instructions assume you are using Python 3 on a recent OS. Package names may differ for Python 2 or for an older OS.
Debian, Ubuntu, and friends
sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev
Fedora, Red Hat, and friends
sudo yum install gcc-c++ pkgconfig poppler-cpp-devel python3-devel
macOS
brew install pkg-config poppler python
Windows
Currently tested only when using conda:
- Install the Microsoft Visual C++ Build Tools
- Install poppler through conda:
conda install -c conda-forge poppler
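The per-platform commands above can be collected into a small lookup. This is only a sketch: the function `suggest_command` is hypothetical, and the assumption that the presence of `apt`, `yum`, `brew`, or `conda` on the PATH identifies the right command is a simplification. On Windows, the Microsoft Visual C++ Build Tools still have to be installed separately.

```python
import shutil
import sys

# Commands copied from the OS dependency instructions, keyed by package tool.
COMMANDS = {
    "apt": "sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev",
    "yum": "sudo yum install gcc-c++ pkgconfig poppler-cpp-devel python3-devel",
    "brew": "brew install pkg-config poppler python",
    "conda": "conda install -c conda-forge poppler",
}


def suggest_command(platform=sys.platform, which=shutil.which):
    """Return the dependency command for this platform, or None if unknown.

    (Hypothetical helper.) `which` is injectable so the lookup can be
    exercised without the tools actually being installed.
    """
    if platform.startswith("linux"):
        candidates = ["apt", "yum"]
    elif platform == "darwin":
        candidates = ["brew"]
    elif platform == "win32":
        candidates = ["conda"]
    else:
        candidates = []
    for tool in candidates:
        if which(tool):
            return COMMANDS[tool]
    return None
```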
Install
pip install pdftotext
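After installing, a quick check that the module can be found by the current interpreter may save some debugging. This is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be located by the import system.

    (Hypothetical helper.) This only locates the module; it does not
    import it, so it will not reveal a broken native build.
    """
    return importlib.util.find_spec(module_name) is not None
```

Calling `is_installed("pdftotext")` in the environment where `pip` ran should return `True` once the install succeeded.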
Hashes for pdftotext-2.2.2.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 2a9aa89bc62022408781b39d188fabf5a3ad1103b6630f32c4e27e395f7966ee
MD5 | 8814a3bdc5c9ad6bc6c3189914b597af
BLAKE2b-256 | e0e379a2ad7ca71160fb6442772155389881672c98bd44c6022303ce242cbfb9
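A downloaded archive can be compared against the SHA256 digest listed above with the standard `hashlib` module. The sketch below assumes the file was saved locally as `pdftotext-2.2.2.tar.gz`; the function names are hypothetical.

```python
import hashlib

# SHA256 digest listed for pdftotext-2.2.2.tar.gz.
EXPECTED_SHA256 = "2a9aa89bc62022408781b39d188fabf5a3ad1103b6630f32c4e27e395f7966ee"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals `expected`."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected("pdftotext-2.2.2.tar.gz")` should be `True` for an intact download of that release.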