Python文本处理

删除停用词

删除停用词详细操作教程
停用词是英语单词,对句子没有多大意义。 在不牺牲句子含义的情况下,可以安全地忽略它们。 例如,the, he, have等等的单词已经在名为语料库的语料库中捕获了这些单词。 我们首先将它下载到python环境中。如下代码 -
# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-23
import nltk
nltk.download('stopwords')
它将下载带有英语停用词的文件。
验证停用词
# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-23
from nltk.corpus import stopwords
stopwords.words('english')
print stopwords.words() [620:680]
当运行上面的程序时,得到以下输出 -
# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-23
[u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she',
u"she's", u'her', u'hers', u'herself', u'it', u"it's", u'its', u'itself', u'they', u'them',
u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this',
u'that', u"that'll", u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be',
u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing',
u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until',
u'while', u'of', u'at']
除了英语之外,具有这些停用词的各种语言如下。
# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-23
from nltk.corpus import stopwords
print stopwords.fileids()
当运行上面的程序时,我们得到以下输出 -
# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-23
[u'arabic', u'azerbaijani', u'danish', u'dutch', u'english', u'finnish',
u'french', u'german', u'greek', u'hungarian', u'indonesian', u'italian',
u'kazakh', u'nepali', u'norwegian', u'portuguese', u'romanian', u'russian',
u'spanish', u'swedish', u'turkish']
示例
参考下面的示例来说明如何从单词列表中删除停用词。
# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-23
from nltk.corpus import stopwords
en_stops = set(stopwords.words('english'))
all_words = ['There', 'is', 'a', 'tree','near','the','river']
for word in all_words:
    if word not in en_stops:
        print(word)
当运行上面的程序时,我们得到以下输出 -
# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-23
There
tree
near
river
昵称: 邮箱:
Copyright © 2022 立地货 All Rights Reserved.
备案号:京ICP备14037608号-4