频率分布

频率分布详细操作教程

在文本处理期间经常需要计算文本主体中单词出现的频率。这可以通过应用word_tokenize()函数并将结果附加到列表以保持单词的计数来实现，如下面的程序所示。

# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-23
from nltk.tokenize import word_tokenize
from nltk.corpus import gutenberg
sample = gutenberg.raw("blake-poems.txt")
token = word_tokenize(sample)
wlist = []
for i in range(50):
wlist.append(token[i])
wordfreq = [wlist.count(w) for w in wlist]
print("Pairs\n" + str(zip(token, wordfreq)))

当运行上面的程序时，我们得到以下输出 -

# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-23
[([', 1), (Poems', 1), (by', 1), (William', 1), (Blake', 1), (1789', 1), (]', 1), (SONGS', 2), (OF', 3), (INNOCENCE', 2), (AND', 1), (OF', 3), (EXPERIENCE', 1), (and', 1), (THE', 1), (BOOK', 1), (of', 2), (THEL', 1), (SONGS', 2), (OF', 3), (INNOCENCE', 2), (INTRODUCTION', 1), (Piping', 2), (down', 1), (the', 1), (valleys', 1), (wild', 1), (,', 3), (Piping', 2), (songs', 1), (of', 2), (pleasant', 1), (glee', 1), (,', 3), (On', 1), (a', 2), (cloud', 1), (I', 1), (saw', 1), (a', 2), (child', 1), (,', 3), (And', 1), (he', 1), (laughing', 1), (said', 1), (to', 1), (me', 1), (:', 1), (``', 1)]

条件频率分布

当想要计算满足特定crteria满足一组文本的单词时，使用条件频率分布。

# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-23
import nltk
#from nltk.tokenize import word_tokenize
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
          (genre, word)
          for genre in brown.categories()
          for word in brown.words(categories=genre))
categories = ['hobbies', 'romance','humor']
searchwords = [ 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=categories, samples=searchwords)

当运行上面的程序时，我们得到以下输出 -

# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-23
may might must will
hobbies 131 22 83 264
romance 11 51 45 43
humor 8 8 9 13

找工作要求35岁以下，35岁以上的程序员都干什么去了？

长久以来，一直有一个问题困扰着技术人——如何打破“程序员的35岁职业魔咒”，这一天迟早会到来，或早或晚。

或许是选错了行业，程序员薪水虽高，但光鲜的外表下，背后的苦衷只有自己知道。三十多岁本该是一个人事业的黄金期，但技术变化日新月异，行业竞争异常残酷，对一个企业来说，永远有比你更年轻、劳动成本更低的人可以选择，这让你的中年危机提前到来。破局的智慧可以看看这本书！>>

<< 文字换行词干算法 >>

昵称：邮箱：