Text Analysis in Python | Implementing TF-IDF
Contents
Introduction to TF-IDF
TF-IDF algorithm steps
A TF-IDF example
TfidfVectorizer
TF-IDF without stop words
TfidfVectorizer
Computing TF-IDF by hand: reproducing the TfidfVectorizer results
Computing TF-IDF in Python: reproducing the TfidfVectorizer results
TF-IDF with stop words
TfidfVectorizer
Introduction to TF-IDF
TF-IDF (term frequency–inverse document frequency) is a weighting technique widely used in information retrieval and data mining, often applied to extract keywords from documents.
TF-IDF combines two quantities: term frequency (TF) and inverse document frequency (IDF).
TF-IDF algorithm steps
Step 1: compute TF. Several variants are in use:
TF = number of times the term appears in the document
TF = number of times the term appears in the document / total number of words in the document
TF = number of times the term appears in the document / count of the most frequent word in the document
Step 2: compute IDF. Several variants are in use:
IDF = log(total number of documents / number of documents containing the term)
IDF = log(total number of documents / number of documents containing the term) + 1
Step 3: compute TF-IDF
TF-IDF = TF * IDF
A TF-IDF example
A: The car is driven on the road.
B: The truck is driven on the highway.
For this example we use:
TF = number of times the term appears in the document / total number of words in the document
IDF = log(total number of documents / number of documents containing the term)
TF-IDF = TF * IDF
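The example above can be worked through in Python. A minimal sketch, using the relative-frequency TF and the plain log IDF (natural log here, which is also what numpy uses):

```python
import math

docs = {
    "A": "the car is driven on the road",
    "B": "the truck is driven on the highway",
}

def tf(term, doc):
    ## TF = occurrences of the term in the document / total words in the document
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    ## IDF = log(total number of documents / number of documents containing the term)
    n_containing = sum(1 for d in docs.values() if term in d.split())
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

## "the" appears in both documents, so IDF = log(2/2) = 0 and TF-IDF = 0
print(tf_idf("the", docs["A"], docs))
## "car" appears only in A: TF = 1/7, IDF = log(2/1)
print(tf_idf("car", docs["A"], docs))
```

Note how the shared words ("the", "is", "driven", "on") all get TF-IDF 0: a word occurring in every document carries no discriminative weight under this IDF variant.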
TfidfVectorizer
TfidfVectorizer is a class in the sklearn library that computes TF-IDF values for a collection of text documents.
sklearn's definition of TF:
TF = number of times the term appears in the document
sklearn's definition of IDF:
With smooth_idf = True (the default):
IDF = log[(1 + total number of documents) / (1 + number of documents containing the term)] + 1
With smooth_idf = False:
IDF = log(total number of documents / number of documents containing the term) + 1
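Both definitions can be checked against TfidfVectorizer directly. A minimal sketch (the two-sentence corpus is just for illustration, and assumes scikit-learn is installed):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["petrol cars are cheaper than diesel cars",
          "diesel is cheaper than petrol"]
n_docs = len(corpus)

def doc_freq(word):
    ## number of documents containing the word
    return sum(1 for d in corpus if word in d.split())

## smooth_idf=True (the default): IDF = log[(1 + N) / (1 + df)] + 1
v_smooth = TfidfVectorizer(smooth_idf=True).fit(corpus)
for word, idx in v_smooth.vocabulary_.items():
    expected = np.log((1 + n_docs) / (1 + doc_freq(word))) + 1
    assert np.isclose(v_smooth.idf_[idx], expected)

## smooth_idf=False: IDF = log(N / df) + 1
v_plain = TfidfVectorizer(smooth_idf=False).fit(corpus)
for word, idx in v_plain.vocabulary_.items():
    assert np.isclose(v_plain.idf_[idx], np.log(n_docs / doc_freq(word)) + 1)

print("both IDF formulas match")
```

The "+1" and the smoothing terms ensure that no term gets a zero (or undefined) IDF, so words occurring in every document are damped but not erased.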
TF-IDF without stop words
TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
## list of text documents
d1 = "petrol cars are cheaper than diesel cars"
d2 = "diesel is cheaper than petrol"
corpus = [d1, d2]
## create a vectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
## get the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
bag_of_words = vectorizer.get_feature_names_out().tolist()
print('vocabulary: ', bag_of_words)
print('index of vocabulary: ', vectorizer.vocabulary_)
print('idfs: ', vectorizer.idf_)
## transform the corpus into TF-IDF vectors
vector = vectorizer.transform(corpus)
print(vector.toarray())
Output:
vocabulary: ['are', 'cars', 'cheaper', 'diesel', 'is', 'petrol', 'than']
index of vocabulary: {'petrol': 5, 'cars': 1, 'are': 0, 'cheaper': 2, 'than': 6, 'diesel': 3, 'is': 4}
idfs: [1.40546511 1.40546511 1. 1. 1.40546511 1.
1. ]
[[0.37729199 0.75458397 0.26844636 0.26844636 0. 0.26844636
0.26844636]
[0. 0. 0.4090901 0.4090901 0.57496187 0.4090901
0.4090901 ]]
Computing TF-IDF by hand: reproducing the TfidfVectorizer results
Step 1: compute TF
sklearn's definition of TF:
TF = number of times the term appears in the document
 | are | cars | cheaper | diesel | is | petrol | than
---|---|---|---|---|---|---|---
d1 | 1 | 2 | 1 | 1 | 0 | 1 | 1
d2 | 0 | 0 | 1 | 1 | 1 | 1 | 1
Step 2: compute IDF
sklearn's definition of IDF, with smooth_idf = True (the default):
idf | computation | result
---|---|---
are | log[(1+2)/(1+1)]+1 | 1.40546511
cars | log[(1+2)/(1+1)]+1 | 1.40546511
cheaper | log[(1+2)/(1+2)]+1 | 1
diesel | log[(1+2)/(1+2)]+1 | 1
is | log[(1+2)/(1+1)]+1 | 1.40546511
petrol | log[(1+2)/(1+2)]+1 | 1
than | log[(1+2)/(1+2)]+1 | 1
Step 3: compute TF-IDF

doc | word | tf*idf | result
---|---|---|---
d1 | are | 1*1.40546511 | 1.40546511
d1 | cars | 2*1.40546511 | 2.81093022
d1 | cheaper | 1*1 | 1
d1 | diesel | 1*1 | 1
d1 | is | 0*1.40546511 | 0
d1 | petrol | 1*1 | 1
d1 | than | 1*1 | 1
d2 | are | 0*1.40546511 | 0
d2 | cars | 0*1.40546511 | 0
d2 | cheaper | 1*1 | 1
d2 | diesel | 1*1 | 1
d2 | is | 1*1.40546511 | 1.40546511
d2 | petrol | 1*1 | 1
d2 | than | 1*1 | 1
Step 4: normalize the TF-IDF values. By default (norm='l2'), TfidfVectorizer divides each document's TF-IDF vector by its Euclidean (L2) norm.
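This L2 normalization can be sketched for d1, using the raw TF-IDF values from the table above: each entry is divided by the Euclidean norm of the whole vector.

```python
import numpy as np

## raw TF-IDF vector for d1, ordered [are, cars, cheaper, diesel, is, petrol, than]
d1 = np.array([1.40546511, 2.81093022, 1.0, 1.0, 0.0, 1.0, 1.0])

## L2 normalization: divide by the Euclidean norm of the vector
d1_normalized = d1 / np.linalg.norm(d1)
print(d1_normalized)
## this reproduces the first row of TfidfVectorizer's output:
## [0.37729199 0.75458397 0.26844636 0.26844636 0.         0.26844636 0.26844636]
```

After normalization every document vector has unit length, so TF-IDF scores are comparable across documents of different lengths.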
Computing TF-IDF in Python: reproducing the TfidfVectorizer results
Build the word set
import pandas as pd
import numpy as np
## list of text documents
d1 = "petrol cars are cheaper than diesel cars"
d2 = "diesel is cheaper than petrol"
corpus = [d1, d2]
## collect the distinct words of the corpus
words_set = set()
for doc in corpus:
    words = doc.split(' ')
    words_set = words_set.union(set(words))
print('Number of words in the corpus:', len(words_set))
print('The words in the corpus: \n', words_set)
Output:
Number of words in the corpus: 7
The words in the corpus:
{'cars', 'petrol', 'cheaper', 'are', 'than', 'diesel', 'is'}
Compute TF
n_docs = len(corpus)  ## number of documents
n_words_set = len(words_set)
df_tf = pd.DataFrame(np.zeros((n_docs, n_words_set)), columns=list(words_set))
## compute term frequency (TF)
for i in range(n_docs):
    words = corpus[i].split(' ')
    for w in words:
        df_tf.loc[i, w] = df_tf.loc[i, w] + 1  ## TF: raw count of the word in the document
        ## alternative TF: count / total number of words in the document
        ## df_tf.loc[i, w] = df_tf.loc[i, w] + (1 / len(words))
print(df_tf)
Output:
cars petrol cheaper are than diesel is
0 2.0 1.0 1.0 1.0 1.0 1.0 0.0
1 0.0 1.0 1.0 0.0 1.0 1.0 1.0
Compute IDF
print("IDF : ")
idf = {}
for w in words_set:
    k = 0  ## number of documents containing the word
    for i in range(n_docs):
        if w in corpus[i].split():
            k += 1
    idf[w] = np.log((n_docs + 1) / (k + 1)) + 1  ## smoothed IDF, as with smooth_idf=True
print(idf)
Output:
IDF :
{'cars': 1.4054651081081644, 'petrol': 1.0, 'cheaper': 1.0, 'are': 1.4054651081081644, 'than': 1.0, 'diesel': 1.0, 'is': 1.4054651081081644}
Compute TF-IDF
df_tf_idf = df_tf.copy()
for w in words_set:
    for i in range(n_docs):
        df_tf_idf.loc[i, w] = df_tf.loc[i, w] * idf[w]
print(df_tf_idf)
Output:
cars petrol cheaper are than diesel is
0 2.81093 1.0 1.0 1.405465 1.0 1.0 0.000000
1 0.00000 1.0 1.0 0.000000 1.0 1.0 1.405465
Normalize TF-IDF
## L2-normalize each row: divide by the square root of the row's sum of squares
df = df_tf_idf.div(np.sqrt((df_tf_idf ** 2).sum(axis=1)), axis=0)
print(df)
Output:
       cars    petrol   cheaper       are      than    diesel        is
0  0.754584  0.268446  0.268446  0.377292  0.268446  0.268446  0.000000
1  0.000000  0.409090  0.409090  0.000000  0.409090  0.409090  0.574962
TF-IDF with stop words
TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
## list of text documents
d1 = "petrol cars are cheaper than diesel cars"
d2 = "diesel is cheaper than petrol"
corpus = [d1, d2]
## create a vectorizer that removes English stop words
vectorizer = TfidfVectorizer(stop_words='english')
vectorizer.fit(corpus)
## get the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
bag_of_words = vectorizer.get_feature_names_out().tolist()
print('vocabulary: ', bag_of_words)
print('index of vocabulary: ', vectorizer.vocabulary_)
print('idfs: ', vectorizer.idf_)
## transform the corpus into TF-IDF vectors
vector = vectorizer.transform(corpus)
print(vector.toarray())
Output:
vocabulary: ['cars', 'cheaper', 'diesel', 'petrol']
index of vocabulary: {'petrol': 3, 'cars': 0, 'cheaper': 1, 'diesel': 2}
idfs: [1.40546511 1. 1. 1. ]
[[0.85135433 0.30287281 0.30287281 0.30287281]
[0. 0.57735027 0.57735027 0.57735027]]
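With stop_words='english', the vectorizer drops any token found in scikit-learn's built-in English stop-word list (ENGLISH_STOP_WORDS), which is why 'are', 'is', and 'than' vanish from the vocabulary. A quick check:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

## the tokens that disappeared are all built-in English stop words
print(all(w in ENGLISH_STOP_WORDS for w in ['are', 'is', 'than']))
## the surviving tokens are not
print(any(w in ENGLISH_STOP_WORDS for w in ['petrol', 'cars', 'cheaper', 'diesel']))
```

For domain-specific corpora you can instead pass your own list, e.g. TfidfVectorizer(stop_words=['diesel', 'petrol']), to exclude words that are uninformative in your setting.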