Python文本处理

文本分类

文本分类详细操作教程
很多时候,需要通过一些预先定义的标准将可用文本分类为各种类别。 nltk提供此类功能作为各种语料库的一部分。 在下面的示例中,查看电影评论语料库并检查可用的分类。
# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-23
# Lets See how the movies are classified
from nltk.corpus import movie_reviews
all_cats = []
for w in movie_reviews.categories():
    all_cats.append(w.lower())
print(all_cats)
当运行上面的程序时,我们得到以下输出 -
# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-23
['neg', 'pos']
现在看一下带有正面评论的文件的内容。这个文件中的句子是标记化的,打印前四个句子来查看样本。
# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-23
from nltk.corpus import movie_reviews
from nltk.tokenize import sent_tokenize
fields = movie_reviews.fileids()
sample = movie_reviews.raw("pos/cv944_13521.txt")
token = sent_tokenize(sample)
for lines in range(4):
    print(token[lines])
当运行上面的程序时,我们得到以下输出 -
# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-23
meteor threat set to blow away all volcanoes & twisters !
summer is here again !
this season could probably be the most ambitious = season this decade with hollywood churning out films
like deep impact , = godzilla , the x-files , armageddon , the truman show ,
all of which has but = one main aim , to rock the box office .
leading the pack this summer is = deep impact , one of the first few film
releases from the = spielberg-katzenberg-geffen's dreamworks production company .
接下来,通过使用nltk中的FreqDist函数来标记每个文件中的单词并找到最常用的单词。
# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-23
import nltk
from nltk.corpus import movie_reviews
fields = movie_reviews.fileids()
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
print(all_words.most_common(10))
当运行上面的程序时,我们得到以下输出 -
# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-23
[(,', 77717), (the', 76529), (.', 65876), (a', 38106), (and', 35576),
(of', 34123), (to', 31937), (u"'", 30585), (is', 25195), (in', 21822)]
昵称: 邮箱:
Copyright © 2022 立地货 All Rights Reserved.
备案号:京ICP备14037608号-4