Using Stata to Extract Word Frequencies from a Large Batch of Text Files and Plot Them

Original RStata 2023-10-24

Attachment download: https://rstata.duanshu.com/#/brief/course/efd05edf653e4eccaf9fc8228948767e

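Training-course members are welcome to join tomorrow's 8 p.m. live session: "Using Stata to Extract Word Frequencies from a Large Batch of Text Files and Plot Them"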

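How can Stata be used to extract word frequencies from a large batch of text files and plot them?

A reader recently asked a similar question, so today we work through it using the 1978 to 2022 Government Work Reports as the example.

The 78-22政府工作报告 folder in the attachment holds the text of each year's Government Work Report from 1978 to 2022 (one txt file per year). We want to count how often each of the following words appears in each year's report: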

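发展 (development)
建设 (construction)
改革 (reform)
经济 (economy)
工作 (work)
企业 (enterprise)
社会 (society)
国家 (nation)
人民 (the people)
政府 (government)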

First, save these words as a dta dataset:

clear all
input str6 word
"发展"
"建设"
"改革"
"经济"
"工作"
"企业"
"社会"
"国家"
"人民"
"政府"
end 
save myword, replace

Next, read each year's report into Stata and save the result, with the full text of one year's report stored in a single observation:

clear all
set obs 45
gen year = 1977 + _n 
gen content = ""
forval i = 1/`=_N' {
	local temp = fileread("78-22政府工作报告/`=year[`i']'.txt")
	replace content = `"`temp'"' in `i'
}
save mydata, replace
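As a hedged alternative to the loop above, fileread() can also be applied to a whole column of file paths at once, and fileerror() can flag files that failed to load. The sketch below is self-contained: it writes one small temporary file and pairs it with a deliberately missing path, so the variable names path and rc and the "_missing" suffix are illustrative assumptions, not part of the original workflow.

```stata
clear all

* Write a small UTF-8 text file to a temporary location (demo data only)
tempfile demo
tempname fh
file open `fh' using "`demo'", write text replace
file write `fh' "发展经济，发展教育" _n
file close `fh'

* Two observations: one valid path, one path that does not exist
set obs 2
gen path = "`demo'"
replace path = path + "_missing" in 2

* fileread() returns the whole file as one string; strL holds long text
gen strL content = fileread(path)

* fileerror() is 0 for a successful read, otherwise a return code
gen int rc = fileerror(content)
list path rc
```

With the folder layout used in this post, the same idea would read roughly as gen strL content = fileread("78-22政府工作报告/" + string(year) + ".txt"), followed by assert fileerror(content) == 0 to stop early if any year's file is missing.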

Then use the cross command to form every pairwise combination of the two datasets:

use mydata, clear 
cross using myword

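The subinstr() function replaces every occurrence of word in content with "". Comparing the string length before and after the replacement then tells us how many times word occurs in content. strlen() counts bytes, and each Chinese character occupies 3 bytes in Stata's UTF-8 encoding, so the difference has to be divided by 6 (two Chinese characters per word).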

gen length1 = strlen(content)
gen content2 = subinstr(content, word, "", .)
gen length2 = strlen(content2)

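* number of occurrences of word
gen num = (length1 - length2) / 6
keep year word num
* save the long-format counts; the plotting code below loads wordnum
save wordnum, replace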

This gives the frequency of each word in each year's Government Work Report.

list in 1/10

*>     +-------------------+
*>     | year   word   num |
*>     |-------------------|
*>  1. | 1978   发展   105 |
*>  2. | 1979   发展   102 |
*>  3. | 1980   发展    44 |
*>  4. | 1981   发展   163 |
*>  5. | 1982   发展   110 |
*>     |-------------------|
*>  6. | 1983   发展    93 |
*>  7. | 1984   发展    72 |
*>  8. | 1985   发展    81 |
*>  9. | 1986   发展   156 |
*> 10. | 1987   发展   124 |
*>     +-------------------+

The result is in long format. To convert it to wide format, use the spread command:

* Install spread: ssc install tidy
spread word num 
list in 1/10

*>      +----------------------------------------------------------------------------+
*>      | year   人民   企业   发展   国家   工作   建设   改革   政府   社会   经济 |
*>      |----------------------------------------------------------------------------|
*>   1. | 1978    119     27    105     66     49     64      2     10     95     72 |
*>   2. | 1979    160     61    102     76     87     68     18     68    151     89 |
*>   3. | 1980     17     44     44     24     25     29     22      3     13     82 |
*>   4. | 1981     84    102    163     73     85    120     36     21    106    277 |
*>   5. | 1982     34    124    110     75     74    130     41     17     95    183 |
*>      |----------------------------------------------------------------------------|
*>   6. | 1983     84     52     93     82     65    119     71     35     84    127 |
*>   7. | 1984     48     64     72    122     29     57     42     19     45     95 |
*>   8. | 1985     45     40     81     41     18     22     99      4     35    112 |
*>   9. | 1986     64     70    156     58     56     88    106     12    107    249 |
*>  10. | 1987     51    109    124     47     47     90    112     21    116    131 |
*>      +----------------------------------------------------------------------------+

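However, the code above fails if run as is, because spread does not accept Chinese text as variable names. We can work around this by editing spread's source code:

* Install adoedit: ssc install adoedit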
adoedit spread

Then delete or comment out lines 52 to 56:

			/* if `r(N)' > 0 {
				levelsof `variable' if `temp' == 1
				display as error `"Some observations for `variable' don't have valid variable names: `=r(levels)'"'
				exit 4
			} */

Finally, save and close the file, then run clear all in Stata for the change to take effect.
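If editing a community-contributed ado-file is undesirable, the built-in reshape command may be an option. The sketch below assumes Stata 14 or later, where Unicode variable names are permitted; under that assumption, reshape wide with a string j() variable should create one column per word, named with the num prefix. The four rows are taken from the listing earlier in this post.

```stata
clear
input int year str6 word int num
1978 "发展" 105
1978 "改革" 2
1979 "发展" 102
1979 "改革" 18
end

* One row per year, one num<word> column per word
reshape wide num, i(year) j(word) string
describe
list
```

Unlike spread, this keeps the num prefix on every column, which is a difference in naming rather than in content.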

We can also draw a figure showing how the frequencies of these words change over time:

use wordnum, clear
ren word v 
gsort v year
gen num1 = -num

* Left panel
replace num = num + 2000 if v == "发展"
replace num1 = num1 + 2000 if v == "发展"
replace num = num + 1500 if v == "经济"
replace num1 = num1 + 1500 if v == "经济"
replace num = num + 1000 if v == "工作"
replace num1 = num1 + 1000 if v == "工作"
replace num = num + 500 if v == "社会"
replace num1 = num1 + 500 if v == "社会"

tw ///
rarea num num1 year if v == "发展", fc(red*0.6) lc(red*0.6) ///
	text(2050 2001 "发展", color(red*0.7) size(*2)) || ///
rarea num num1 year if v == "经济", fc(green*0.6) lc(green*0.6) ///
	text(1550 2001 "经济", color(green*0.7) size(*2)) || ///
rarea num num1 year if v == "工作", fc(orange*0.6) lc(orange*0.6) ///
	text(1050 2001 "工作", color(orange*0.7) size(*2)) || ///
rarea num num1 year if v == "社会", fc(pink*0.6) lc(pink*0.6) ///
	text(550 2001 "社会", color(pink*0.7) size(*2)) || ///
rarea num num1 year if v == "人民", fc(brown*0.6) lc(brown*0.6) ///
	text(50 2001 "人民", color(brown*0.7) size(*2)) ||, ///
	xline(1978 1983 1988 1993 1998 2003 2008 2013 2018 2022, lc(grey*0.1)) ///
	yla(, nogrid) ysc(off range(0 2200)) xsc(off) leg(off) ///
	plotr(fc(white) lc(white)) xla(, nogrid) sch(s1mono) ///
	text(2400 1978 "1978", color(grey*0.2) size(*1.5)) /// 
	text(2400 1988 "1988", color(grey*0.2) size(*1.5)) /// 
	text(2400 1998 "1998", color(grey*0.2) size(*1.5)) /// 
	text(2400 2008 "2008", color(grey*0.2) size(*1.5)) /// 
	text(2400 2018 "2018", color(grey*0.2) size(*1.5)) /// 
	graphr(margin(8 8 8 8)) name(a, replace)

* Right panel
replace num = num + 2000 if v == "建设"
replace num1 = num1 + 2000 if v == "建设"
replace num = num + 1500 if v == "改革"
replace num1 = num1 + 1500 if v == "改革"
replace num = num + 1000 if v == "企业"
replace num1 = num1 + 1000 if v == "企业"
replace num = num + 500 if v == "国家"
replace num1 = num1 + 500 if v == "国家"

tw ///
rarea num num1 year if v == "建设", fc(cranberry*0.8) lc(cranberry*0.8) ///
	text(2050 2001 "建设", color(cranberry*0.9) size(*2)) || ///
rarea num num1 year if v == "改革", fc(blue*0.6) lc(blue*0.6) ///
	text(1550 2001 "改革", color(blue*0.7) size(*2)) || ///
rarea num num1 year if v == "企业", fc(dkorange*0.6) lc(dkorange*0.6) ///
	text(1050 2001 "企业", color(dkorange*0.7) size(*2)) || ///
rarea num num1 year if v == "国家", fc(khaki*0.6) lc(khaki*0.6) ///
	text(550 2001 "国家", color(khaki*0.7) size(*2)) || ///
rarea num num1 year if v == "政府", fc(erose*0.6) lc(erose*0.6) ///
	text(50 2001 "政府", color(erose*0.7) size(*2)) ||, ///
	xline(1978 1983 1988 1993 1998 2003 2008 2013 2018 2022, lc(grey*0.1)) ///
	yla(, nogrid) ysc(off range(-150 2200)) xsc(off) leg(off) ///
	plotr(fc(white) lc(white)) xla(, nogrid) sch(s1mono) ///
	text(2400 1978 "1978", color(grey*0.2) size(*1.5)) /// 
	text(2400 1988 "1988", color(grey*0.2) size(*1.5)) /// 
	text(2400 1998 "1998", color(grey*0.2) size(*1.5)) /// 
	text(2400 2008 "2008", color(grey*0.2) size(*1.5)) /// 
	text(2400 2018 "2018", color(grey*0.2) size(*1.5)) /// 
	graphr(margin(8 8 8 8)) name(b, replace)

gr combine a b, title("四十五年来政府工作报告中常青词汇的词频变化", pos(12)) ///
	subti("数据来源：中国政府网", pos(12)) ///
	caption("数据处理 & 绘图：微信公众号 RStata")

gr export "pic1.png", width(2400) replace
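The plotting code above draws each word as a band between num and num1 = -num, so the band is symmetric around zero and its thickness in a given year is twice the word's frequency. Adding the same constant to both variables moves the band's centre line up, which is how the five words are stacked at 2000, 1500, 1000, 500 and 0. The pairs of replace statements can be sketched as a loop; the local macro base and the variable mid are hypothetical names, and the six data rows are copied from the wide table shown earlier.

```stata
clear
input str6 v int year int num
"发展" 1978 105
"发展" 1979 102
"经济" 1978 72
"经济" 1979 89
"人民" 1978 119
"人民" 1979 160
end

* Mirror the counts so each band is symmetric around its centre line
gen num1 = -num

* Shift each word's band to its own centre line, 500 apart, top to bottom
local base = 2000
foreach w in 发展 经济 工作 社会 人民 {
	replace num  = num  + `base' if v == "`w'"
	replace num1 = num1 + `base' if v == "`w'"
	local base = `base' - 500
}

* The midpoint of every band should equal its centre line
gen mid = (num + num1) / 2
list
```

Changing the order of the words in the foreach list changes the vertical order of the bands without touching the rest of the code.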

The same approach works for another set of words:

* A second set of words
* Save the words whose frequencies we want to count as a dta file
clear all
input str6 word
"城镇"
"农村"
"工业"
"农业"
"制造"
"服务"
"计划"
"市场"
end 
save myword2, replace 

use mydata, clear 
cross using myword2
gen length1 = strlen(content)
gen content2 = subinstr(content, word, "", .)
gen length2 = strlen(content2)

* number of occurrences of word
gen num = (length1 - length2) / 6
keep year word num
save wordnum2, replace 

* Plot
use wordnum2, clear
ren word v 
gsort v year
gen num1 = -num

replace num = num + 600 if inlist(v, "农村", "城镇")
replace num1 = num1 + 600 if inlist(v, "农村", "城镇")

replace num = num + 400 if inlist(v, "农业", "工业")
replace num1 = num1 + 400 if inlist(v, "农业", "工业")

replace num = num + 200 if inlist(v, "制造", "服务")
replace num1 = num1 + 200 if inlist(v, "制造", "服务")

tw rarea num num1 year if v == "农村", fc(green*0.6) lc(green*0.6) ///
	text(680 2010 "农村", color(green*0.7) size(*1.5)) ///
	fintensity(inten20) || ///
rarea num num1 year if v == "城镇", fc(red*0.6) lc(red*0.6) ///
	text(550 2000 "城镇", color(red*0.7) size(*1.5)) ///
	fintensity(inten20) || ///
rarea num num1 year if v == "农业", fc(orange*0.6) lc(orange*0.6) ///
	text(480 2010 "农业", color(orange*0.7) size(*1.5)) ///
	fintensity(inten20) || ///
rarea num num1 year if v == "工业", fc(pink*0.6) lc(pink*0.6) ///
	text(350 2000 "工业", color(pink*0.7) size(*1.5)) ///
	fintensity(inten20) || ///
rarea num num1 year if v == "服务", fc(brown*0.6) lc(brown*0.6) ///
	text(280 2010 "服务", color(brown*0.7) size(*1.5)) ///
	fintensity(inten20) || ///
rarea num num1 year if v == "制造", fc(cranberry*0.8) lc(cranberry*0.8) ///
	text(150 2000 "制造", color(cranberry*0.9) size(*1.5)) || ///
rarea num num1 year if v == "市场", fc(blue*0.6) lc(blue*0.6) ///
	text(80 2010 "市场", color(blue*0.7) size(*1.5)) ///
	fintensity(inten20) || ///
rarea num num1 year if v == "计划", fc(dkorange*0.6) lc(dkorange*0.6) ///
	text(-70 2000 "计划", color(dkorange*0.7) size(*1.5)) ///
	fintensity(inten20) ||, ///
	xline(1978 1983 1988 1993 1998 2003 2008 2013 2018 2022, lc(grey*0.1)) ///
	yla(, nogrid) ysc(off) xsc(off) leg(off) ///
	plotr(fc(white) lc(white)) xla(, nogrid) sch(s1mono) ///
	text(750 1978 "1978", color(grey*0.2)) /// 
	text(750 1988 "1988", color(grey*0.2)) /// 
	text(750 1998 "1998", color(grey*0.2)) /// 
	text(750 2008 "2008", color(grey*0.2)) /// 
	text(750 2018 "2018", color(grey*0.2)) /// 
	graphr(margin(8 8 2 8)) name(c, replace)

gr combine c, title("四十五年来政府工作报告中一些关键词的对比", pos(12)) ///
	subti("数据来源：中国政府网", pos(12)) ///
	caption("数据处理 & 绘图：微信公众号 RStata")

gr export "pic2.png", width(2400) replace
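Because the counting step is repeated unchanged for each word list, it could be wrapped in a small program. The following is only a sketch: countwords and its options corpus(), words() and saving() are hypothetical names introduced here, and the two-row corpus is made-up demo data standing in for mydata. It divides by strlen(word) instead of 6 so that words of any length are counted correctly.

```stata
capture program drop countwords
program define countwords
	* Hypothetical helper: replaces the data in memory with the counts
	syntax, corpus(string) words(string) saving(string)
	use `"`corpus'"', clear
	cross using `"`words'"'
	gen num = (strlen(content) - strlen(subinstr(content, word, "", .))) / strlen(word)
	keep year word num
	save `"`saving'"', replace
end

* Tiny demo corpus: one observation per year
clear
input int year str30 content
1978 "计划经济与市场"
1979 "市场和市场经济"
end
tempfile corpus
save `corpus'

* Word list to count
clear
input str6 word
"计划"
"市场"
end
tempfile words
save `words'

tempfile out
countwords, corpus("`corpus'") words("`words'") saving("`out'")
sort year word
list
```

With the files from this post, the call would look something like countwords, corpus("mydata") words("myword2") saving("wordnum2").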

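Live session details

To help everyone understand the code above, training-course members are welcome to join tomorrow's 8 p.m. live session: "Using Stata to Extract Word Frequencies from a Large Batch of Text Files and Plot Them"

Live session venue: Tencent Meeting (requires enrollment in the RStata training course)
Lecture materials: require enrollment in the RStata training course; for details, see "What to do if you cannot process the data for your thesis? How to get started with R, Stata, and econometrics?"

For more information about RStata membership, add the WeChat account r_stata.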
