Python Text Processing

Processing Word Documents

A step-by-step tutorial on processing Word documents
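To read a Word document in Python, use the docx module. First install it as shown below, then write a program that uses the module's functions to read the whole file paragraph by paragraph.
Use the following command to add the docx module to your environment. The module is published on PyPI as python-docx and imported as docx; the PyPI package named plain docx is an older, different project.
# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-23
pip install python-docx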
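Before running the examples, it can help to confirm that the module is importable. The following is a minimal sketch using only the standard library; the helper name is_module_available is hypothetical and not part of any library. It only checks that something named docx can be imported, not which package provides it.

```python
import importlib.util


def is_module_available(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


if is_module_available("docx"):
    print("docx is importable")
else:
    print("docx not found; try: pip install python-docx")
```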
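The example below reads the contents of a Word document by appending the text of each paragraph to a list and then printing all of the collected text.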
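import docx  # provided by the python-docx package

def readtxt(filename):
    """Return the text of every paragraph in the document, one per line."""
    doc = docx.Document(filename)
    fullText = []
    # Collect the text of each paragraph in document order
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

# Raw string so that the backslash in the path is not read as a tab escape (\t)
print(readtxt(r'path\test.docx'))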
Running the program above produces the following output:
Lidihuo Point originated from the idea that there exists a class of readers who respond
better to online content and prefer to learn new skills at their own pace from the comforts
of their drawing rooms.
The journey commenced with a single tutorial on HTML in 2006 and elated by the response it generated,
we worked our way to adding fresh tutorials to our repository which now proudly flaunts
a wealth of tutorials and allied articles on topics ranging from programming languages
to web designing to academics and much more.
Reading Individual Paragraphs
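A specific paragraph can be read from a Word document through the paragraphs attribute, which behaves like a list. The example below reads only the second paragraph of the document.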
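import docx

# Raw string so that the backslash in the path is not read as a tab escape (\t)
doc = docx.Document(r'path\test.docx')
# Total number of paragraphs in the document
print(len(doc.paragraphs))
# Indexing is zero-based, so index 1 is the second paragraph
print(doc.paragraphs[1].text)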
Running the program above produces the following output:
The journey commenced with a single tutorial on HTML in 2006 and elated by the response
it generated, we worked our way to adding fresh tutorials to our repository
which now proudly flaunts a wealth of tutorials and allied articles on topics
ranging from programming languages to web designing to academics and much more.
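Picking a paragraph by position is easy to get wrong, because people count from 1 while list indices start at 0, and a document may contain empty paragraphs. The sketch below works on a plain list of paragraph strings, such as one built with [p.text for p in doc.paragraphs]. The helper names get_paragraph and non_empty are hypothetical.

```python
def get_paragraph(paragraph_texts, number):
    """Return paragraph `number` (counted from 1) from a list of strings.

    Raises IndexError with a readable message if the number is out of range.
    """
    if not 1 <= number <= len(paragraph_texts):
        raise IndexError(
            f"paragraph {number} requested, but only "
            f"{len(paragraph_texts)} available"
        )
    return paragraph_texts[number - 1]


def non_empty(paragraph_texts):
    """Drop paragraphs that are empty or contain only whitespace."""
    return [text for text in paragraph_texts if text.strip()]


texts = ["Intro.", "", "Second real paragraph."]
print(get_paragraph(non_empty(texts), 2))
```

Filtering empty paragraphs first is a design choice: it makes "the second paragraph" match what a reader sees on the page, at the cost of no longer matching the raw indices in the file.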