Text Analysis in Python | Implementing TF-IDF
Contents
Introduction to TF-IDF
TF-IDF algorithm steps
A TF-IDF example
TfidfVectorizer
TF-IDF without stop words
TfidfVectorizer
Computing TF-IDF by hand: reproducing the TfidfVectorizer results
Computing TF-IDF in Python: reproducing the TfidfVectorizer results
TF-IDF with stop words
TfidfVectorizer
Introduction to TF-IDF
TF-IDF (term frequency–inverse document frequency) is a weighting technique widely used in information retrieval and data mining, often applied to extract keywords from documents.
TF-IDF combines two quantities: term frequency (TF) and inverse document frequency (IDF).
TF-IDF algorithm steps
Step 1: compute TF. Several variants are in use:
TF = number of times the term appears in the document
TF = number of times the term appears in the document / total number of words in the document
TF = number of times the term appears in the document / count of the most frequent word in the document
Step 2: compute IDF. Several variants are in use:
IDF = log(total number of documents / number of documents containing the term)
IDF = log(total number of documents / number of documents containing the term) + 1
Step 3: compute TF-IDF
TF-IDF = TF * IDF
A TF-IDF example
A: The car is driven on the road.
B: The truck is driven on the highway.
For this example we use:
TF = number of times the term appears in the document / total number of words in the document
IDF = log(total number of documents / number of documents containing the term)
TF-IDF = TF * IDF
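The example above can be worked through in Python. A minimal sketch, using the relative-frequency TF and the plain log IDF (natural log here, which is also what numpy uses):

```python
import math

docs = {
    "A": "the car is driven on the road",
    "B": "the truck is driven on the highway",
}

def tf(term, doc):
    ## TF = occurrences of the term in the document / total words in the document
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    ## IDF = log(total number of documents / number of documents containing the term)
    n_containing = sum(1 for d in docs.values() if term in d.split())
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

## "the" appears in both documents, so IDF = log(2/2) = 0 and TF-IDF = 0
print(tf_idf("the", docs["A"], docs))
## "car" appears only in A: TF = 1/7, IDF = log(2/1)
print(tf_idf("car", docs["A"], docs))
```

Note how the shared words ("the", "is", "driven", "on") all get TF-IDF 0: a word occurring in every document carries no discriminative weight under this IDF variant.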
TfidfVectorizer
TfidfVectorizer is a class in the sklearn library that computes TF-IDF values for a collection of text documents.
sklearn's definition of TF:
TF = number of times the term appears in the document
sklearn's definition of IDF:
With smooth_idf = True (the default):
IDF = log[(1 + total number of documents) / (1 + number of documents containing the term)] + 1
With smooth_idf = False:
IDF = log(total number of documents / number of documents containing the term) + 1
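Both definitions can be checked against TfidfVectorizer directly. A minimal sketch (the two-sentence corpus is just for illustration, and assumes scikit-learn is installed):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["petrol cars are cheaper than diesel cars",
          "diesel is cheaper than petrol"]
n_docs = len(corpus)

def doc_freq(word):
    ## number of documents containing the word
    return sum(1 for d in corpus if word in d.split())

## smooth_idf=True (the default): IDF = log[(1 + N) / (1 + df)] + 1
v_smooth = TfidfVectorizer(smooth_idf=True).fit(corpus)
for word, idx in v_smooth.vocabulary_.items():
    expected = np.log((1 + n_docs) / (1 + doc_freq(word))) + 1
    assert np.isclose(v_smooth.idf_[idx], expected)

## smooth_idf=False: IDF = log(N / df) + 1
v_plain = TfidfVectorizer(smooth_idf=False).fit(corpus)
for word, idx in v_plain.vocabulary_.items():
    assert np.isclose(v_plain.idf_[idx], np.log(n_docs / doc_freq(word)) + 1)

print("both IDF formulas match")
```

The "+1" and the smoothing terms ensure that no term gets a zero (or undefined) IDF, so words occurring in every document are damped but not erased.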
TF-IDF without stop words
TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
## list of text documents
d1 = "petrol cars are cheaper than diesel cars"
d2 = "diesel is cheaper than petrol"
corpus = [d1, d2]
## create a vectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
## get the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
bag_of_words = vectorizer.get_feature_names_out().tolist()
print('vocabulary: ', bag_of_words)
print('index of vocabulary: ', vectorizer.vocabulary_)
print('idfs: ', vectorizer.idf_)
## transform the corpus into TF-IDF vectors
vector = vectorizer.transform(corpus)
print(vector.toarray())
Output:
vocabulary: ['are', 'cars', 'cheaper', 'diesel', 'is', 'petrol', 'than']
index of vocabulary: {'petrol': 5, 'cars': 1, 'are': 0, 'cheaper': 2, 'than': 6, 'diesel': 3, 'is': 4}
idfs: [1.40546511 1.40546511 1. 1. 1.40546511 1.
1. ]
[[0.37729199 0.75458397 0.26844636 0.26844636 0. 0.26844636
0.26844636]
[0. 0. 0.4090901 0.4090901 0.57496187 0.4090901
0.4090901 ]]
Computing TF-IDF by hand: reproducing the TfidfVectorizer results
Step 1: compute TF
sklearn's definition of TF:
TF = number of times the term appears in the document
 | are | cars | cheaper | diesel | is | petrol | than
---|---|---|---|---|---|---|---
d1 | 1 | 2 | 1 | 1 | 0 | 1 | 1
d2 | 0 | 0 | 1 | 1 | 1 | 1 | 1
Step 2: compute IDF
sklearn's definition of IDF, with smooth_idf = True (the default):
idf | computation | result
---|---|---
are | log[(1+2)/(1+1)]+1 | 1.40546511
cars | log[(1+2)/(1+1)]+1 | 1.40546511
cheaper | log[(1+2)/(1+2)]+1 | 1
diesel | log[(1+2)/(1+2)]+1 | 1
is | log[(1+2)/(1+1)]+1 | 1.40546511
petrol | log[(1+2)/(1+2)]+1 | 1
than | log[(1+2)/(1+2)]+1 | 1
Step 3: compute TF-IDF

doc | word | tf*idf | result
---|---|---|---
d1 | are | 1*1.40546511 | 1.40546511
d1 | cars | 2*1.40546511 | 2.81093022
d1 | cheaper | 1*1 | 1
d1 | diesel | 1*1 | 1
d1 | is | 0*1.40546511 | 0
d1 | petrol | 1*1 | 1
d1 | than | 1*1 | 1
d2 | are | 0*1.40546511 | 0
d2 | cars | 0*1.40546511 | 0
d2 | cheaper | 1*1 | 1
d2 | diesel | 1*1 | 1
d2 | is | 1*1.40546511 | 1.40546511
d2 | petrol | 1*1 | 1
d2 | than | 1*1 | 1
Step 4: normalize the TF-IDF values. By default (norm='l2'), TfidfVectorizer divides each document's TF-IDF vector by its Euclidean (L2) norm.
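This L2 normalization can be sketched for d1, using the raw TF-IDF values from the table above: each entry is divided by the Euclidean norm of the whole vector.

```python
import numpy as np

## raw TF-IDF vector for d1, ordered [are, cars, cheaper, diesel, is, petrol, than]
d1 = np.array([1.40546511, 2.81093022, 1.0, 1.0, 0.0, 1.0, 1.0])

## L2 normalization: divide by the Euclidean norm of the vector
d1_normalized = d1 / np.linalg.norm(d1)
print(d1_normalized)
## this reproduces the first row of TfidfVectorizer's output:
## [0.37729199 0.75458397 0.26844636 0.26844636 0.         0.26844636 0.26844636]
```

After normalization every document vector has unit length, so TF-IDF scores are comparable across documents of different lengths.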
Computing TF-IDF in Python: reproducing the TfidfVectorizer results
Build the word set
import pandas as pd
import numpy as np
## list of text documents
d1 = "petrol cars are cheaper than diesel cars"
d2 = "diesel is cheaper than petrol"
corpus = [d1, d2]
## collect the distinct words of the corpus
words_set = set()
for doc in corpus:
    words = doc.split(' ')
    words_set = words_set.union(set(words))
print('Number of words in the corpus:', len(words_set))
print('The words in the corpus: \n', words_set)
Output:
Number of words in the corpus: 7
The words in the corpus:
{'cars', 'petrol', 'cheaper', 'are', 'than', 'diesel', 'is'}
Compute TF
n_docs = len(corpus)  ## number of documents
n_words_set = len(words_set)
df_tf = pd.DataFrame(np.zeros((n_docs, n_words_set)), columns=list(words_set))
## compute term frequency (TF)
for i in range(n_docs):
    words = corpus[i].split(' ')
    for w in words:
        df_tf.loc[i, w] = df_tf.loc[i, w] + 1  ## TF: raw count of the word in the document
        ## alternative TF: count / total number of words in the document
        ## df_tf.loc[i, w] = df_tf.loc[i, w] + (1 / len(words))
print(df_tf)
Output:
cars petrol cheaper are than diesel is
0 2.0 1.0 1.0 1.0 1.0 1.0 0.0
1 0.0 1.0 1.0 0.0 1.0 1.0 1.0
Compute IDF
print("IDF : ")
idf = {}
for w in words_set:
    k = 0  ## number of documents containing the word
    for i in range(n_docs):
        if w in corpus[i].split():
            k += 1
    idf[w] = np.log((n_docs + 1) / (k + 1)) + 1  ## smoothed IDF, as with smooth_idf=True
print(idf)
Output:
IDF :
{'cars': 1.4054651081081644, 'petrol': 1.0, 'cheaper': 1.0, 'are': 1.4054651081081644, 'than': 1.0, 'diesel': 1.0, 'is': 1.4054651081081644}
Compute TF-IDF
df_tf_idf = df_tf.copy()
for w in words_set:
    for i in range(n_docs):
        df_tf_idf.loc[i, w] = df_tf.loc[i, w] * idf[w]
print(df_tf_idf)
Output:
cars petrol cheaper are than diesel is
0 2.81093 1.0 1.0 1.405465 1.0 1.0 0.000000
1 0.00000 1.0 1.0 0.000000 1.0 1.0 1.405465
Normalize TF-IDF
## L2-normalize each row: divide by the square root of the row's sum of squares
df = df_tf_idf.div(np.sqrt((df_tf_idf ** 2).sum(axis=1)), axis=0)
print(df)
Output:
       cars    petrol   cheaper       are      than    diesel        is
0  0.754584  0.268446  0.268446  0.377292  0.268446  0.268446  0.000000
1  0.000000  0.409090  0.409090  0.000000  0.409090  0.409090  0.574962
TF-IDF with stop words
TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
## list of text documents
d1 = "petrol cars are cheaper than diesel cars"
d2 = "diesel is cheaper than petrol"
corpus = [d1, d2]
## create a vectorizer that removes English stop words
vectorizer = TfidfVectorizer(stop_words='english')
vectorizer.fit(corpus)
## get the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
bag_of_words = vectorizer.get_feature_names_out().tolist()
print('vocabulary: ', bag_of_words)
print('index of vocabulary: ', vectorizer.vocabulary_)
print('idfs: ', vectorizer.idf_)
## transform the corpus into TF-IDF vectors
vector = vectorizer.transform(corpus)
print(vector.toarray())
Output:
vocabulary: ['cars', 'cheaper', 'diesel', 'petrol']
index of vocabulary: {'petrol': 3, 'cars': 0, 'cheaper': 1, 'diesel': 2}
idfs: [1.40546511 1. 1. 1. ]
[[0.85135433 0.30287281 0.30287281 0.30287281]
[0. 0.57735027 0.57735027 0.57735027]]
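With stop_words='english', the vectorizer drops any token found in scikit-learn's built-in English stop-word list (ENGLISH_STOP_WORDS), which is why 'are', 'is', and 'than' vanish from the vocabulary. A quick check:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

## the tokens that disappeared are all built-in English stop words
print(all(w in ENGLISH_STOP_WORDS for w in ['are', 'is', 'than']))
## the surviving tokens are not
print(any(w in ENGLISH_STOP_WORDS for w in ['petrol', 'cars', 'cheaper', 'diesel']))
```

For domain-specific corpora you can instead pass your own list, e.g. TfidfVectorizer(stop_words=['diesel', 'petrol']), to exclude words that are uninformative in your setting.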