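Python Text Processing

Processing PDF Files

A detailed tutorial on processing PDF files
Python can read a PDF file, extract its text, and print the content. To do this, first install the required module, PyPDF2, using the command below. pip must already be available in your Python environment.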
# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-23
pip install pypdf2
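Before moving on, it can help to confirm that the module is importable from the interpreter you plan to use. The following is a minimal sketch using only the standard library; the helper name is_module_available is made up for this illustration and is not part of PyPDF2.

```python
import importlib.util


def is_module_available(name):
    """Return True if the top-level module `name` can be found for import."""
    # find_spec() returns None when no such top-level module exists.
    return importlib.util.find_spec(name) is not None


# 'PyPDF2' is the import name used by the examples in this tutorial.
if is_module_available('PyPDF2'):
    print('PyPDF2 is installed')
else:
    print('PyPDF2 is missing; run: pip install pypdf2')
```

If the check reports the module as missing even after installation, pip and the interpreter are probably bound to different Python environments.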
Once the module is installed successfully, you can read a PDF file with the methods it provides.
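import PyPDF2
# Raw string, so the backslash in the Windows path is not read as a tab escape (\t).
pdfName = r'path\test.pdf'
# PdfReader, pages and extract_text() are the PyPDF2 2.x/3.x names for
# PdfFileReader, getPage() and extractText(), which PyPDF2 3.0 removed.
read_pdf = PyPDF2.PdfReader(pdfName)
page = read_pdf.pages[0]
page_content = page.extract_text()
print(page_content)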
Running the program above produces the following output:
Lidihuo Point originated from the idea that there exists a class of readers who respond better
to online content and prefer to learn new skills at their own pace from the comforts of their
drawing rooms.
The journey commenced with a single tutorial on HTML in 2006 and elated by the response
it generated, we worked our way to adding fresh tutorials to our repository which now
proudly flaunts a wealth of tutorials and allied articles on topics ranging from programming
languages to web designing to academics and much more.
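One detail worth noting about Windows-style paths such as path\test.pdf: inside an ordinary Python string literal, the sequence \t is interpreted as a tab character, so the file name silently changes. The sketch below compares an ordinary literal, a raw string, and a path built with pathlib; the variable names are chosen only for this illustration.

```python
from pathlib import PureWindowsPath

plain = 'path\test.pdf'   # '\t' is read as a single tab character
raw = r'path\test.pdf'    # raw string: the backslash is kept as written
# Building the path from its parts avoids writing a backslash at all.
built = str(PureWindowsPath('path') / 'test.pdf')

print(repr(plain))        # shows the embedded tab escape
print(repr(raw))          # shows a literal backslash
print(raw == built)       # the raw string and the pathlib result agree
```

A raw string or a forward slash is usually the smallest change; pathlib is a reasonable choice when paths are assembled from several parts.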
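Reading Multiple Pages
To read a PDF with multiple pages and print each page along with its page number, loop over the pages and look up each page's number with getPageNumber() (named get_page_number() in PyPDF2 2.x and later). The example below uses a two-page PDF file; the content of each page is printed under its own page heading.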
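import PyPDF2
# Raw string, so the backslash in the Windows path is not read as a tab escape (\t).
pdfName = r'Path\test.pdf'
# PdfReader, pages, get_page_number() and extract_text() are the PyPDF2 2.x/3.x
# names for PdfFileReader, getPage(), getPageNumber() and extractText().
read_pdf = PyPDF2.PdfReader(pdfName)
for page in read_pdf.pages:
    # get_page_number() is zero-based, so add 1 for display.
    print('Page No - ' + str(1 + read_pdf.get_page_number(page)))
    page_content = page.extract_text()
    print(page_content)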
Running the example code above produces the following result:
Page No - 1
Lidihuo Point originated from the idea that there exists a class of readers who respond better to
online content and prefer to learn new skills at their own pace from the comforts of their drawing
rooms.
Page No - 2
The journey commenced with a single tutorial on HTML in 2006 and elated by the response it
generated, we worked our way to adding fresh tutorials to our repository which now proudly flaunts
a wealth of tutorials and allied articles on topics ranging from programming languages to web
designing to academics and much more.
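The page-heading layout shown above can also be produced without asking the reader object for each page number, by counting pages with enumerate(). The following is a minimal sketch under that assumption; format_pages is a hypothetical helper, and the sample strings stand in for text that would normally come from each page's extracted content.

```python
def format_pages(page_texts):
    """Return one string with a 'Page No - N' heading before each page's text."""
    lines = []
    # enumerate(..., start=1) yields human-friendly, one-based page numbers.
    for number, text in enumerate(page_texts, start=1):
        lines.append('Page No - ' + str(number))
        lines.append(text)
    return '\n'.join(lines)


# Placeholder strings; real input would be the text extracted from each page.
sample = ['first page text', 'second page text']
print(format_pages(sample))
```

Keeping the formatting separate from the PDF reading makes the heading logic easy to check on plain strings before running it against a real file.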