How to Do Text Analysis: Steps, Tools, Approaches, and Visualization

ML Econometrics Research Group, Econometrics Circle (计量经济圈), 2021-10-23

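Yesterday we shared a post on the abrupt departure of another World Bank chief economist after Nobel laureate Paul Romer, which looked at the tensions between the World Bank and two of its chief economists. Today the ML Econometrics Research Group introduces the guide "Text Analysis: Steps, Tools, Approaches, and Visualization" by Angela Zoss of Duke University. The short URL below (or "Read the original") opens the links that appear in the text: http://dwz.date/yaA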

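Introduction to Text Analysis

"Text analysis" is a broad term covering a variety of processes by which text and natural-language documents can be modified so that they can be organized and described.

This guide collects resources for several stages of the text analysis process, including text collection, text parsing and cleaning, text summarization and analysis methods, and text visualization.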

Overviews/summaries

Ted Underwood – Where to start with text mining
Tooling Up for Digital Humanities – Text Analysis
Ryan Shaw – Text Mining
John Laudin – Text Analytics 101
O'Connor, Bamman, & Smith (2011) – Computational Text Analysis for Social Science
Ben Schmidt – Comparing Corpuses by Word Use

Possible Sources of Text

Native digital text

(Thunderbird extension, MUSE*)

HTML
RSS feeds
Sample specific services:

Twitter
Wikipedia
Data Liberation Front
New York Times API
CMU Movie Summary Corpus
Corpus of Global Web-Based English (GloWbE)
PLOS Text Mining Collection

Tutorials for data collection from various services

Digitized

Internet Archive
Project Gutenberg
Google Books
Hathi Trust (Hathi Download Helper)
JSTOR Data for Research* (with Early Journal Content bundle, also from archive.org)
PubMed Open Access Subset
Monk Workbench*
Document Cloud*
Open American National Corpus (collection of American English from various sources)
WordHoard* (tagged literary texts)
Corpus of Contemporary American English

* - also has some processing/analysis capabilities

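Cleaning Text for Analysis

Before starting a text analysis project, it is often necessary to do a great deal of cleaning and parsing. Much text is created and stored so that people can understand it, and it is not always easy for a computer to process that same text.

Computers work well when a data source has structure, or at least some recognizable regular patterns. Most cleaning and parsing steps for text analysis either increase regularity (e.g., correcting typos) or add structure (marking certain words as important, or even splitting a document into parts with special meaning: title, author, chapter, etc.).

The main methods for analyzing text are listed in the section on types of text analysis, and you may need to know something about the analysis method and the tool you will use before you can tell what kind of cleaning is needed. For example, some techniques and tools are very literal when counting individual words, and they may count the lowercase and uppercase versions of the same word separately. Here are some other cleaning and parsing techniques you may want to look into: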

Removing stop words (deleting very common words like "a", "the", "and", etc.)
Stemming or lemmatization (ways of combining words that have the same linguistic root or stem)

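Tip: Tools such as Wordle may remove stop words (very common words), but they may still count a word and its plural separately, or keep the case differences described above. Before loading content into a word cloud generator, try converting everything to lowercase and running it through a quick stemming tool.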
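
The cleaning steps above (lowercasing, stop-word removal, stemming) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, and `naive_stem` is a hypothetical helper that merely strips simple plural endings. It is not the Porter stemmer or a real lemmatizer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "are"}


def naive_stem(word):
    """Crude plural stripping; a stand-in for a real stemmer, not Porter's algorithm."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # "stories" -> "story"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]                # "cats" -> "cat"
    return word


def clean_tokens(text):
    """Lowercase the text, split it into words, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


sample = "The Cats and the cat chase Stories and a story."
cleaned = clean_tokens(sample)
print(cleaned)
```

After cleaning, "Cats" and "cat" are counted as the same word, which is usually what a word cloud or frequency list needs.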

File Conversion

Extracting from PDF:

More timesavers to unlock public records data (PDFs into spreadsheets)
Tabula (Java program for all platforms)
gImageReader (OCR for images, PDFs)

Cleaning HTML / XML:

Beautiful Soup
scrubber (also lemmatizes, removes stop words with prepared lists)
HTML to Text (or Story) from Data Science Toolkit
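
For readers who prefer a script to the tools listed here, the idea of stripping markup can be sketched with Python's standard-library `html.parser`. This is a minimal illustration, not how Beautiful Soup or the other tools work internally; the class name `TextExtractor` and the sample page are made up for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b>!</p></body></html>"
)
text = html_to_text(page)
print(text)
```

Joining the pieces with single spaces is a simplification; a dedicated library handles whitespace, entities, and malformed markup far more carefully.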

Changing tabs to commas, removing line breaks, etc.

Sort My List (also changes case, removes punctuation)
TextFixer
Transformer (rescue texts from old file formats)
Text Mechanic
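
The same kind of conversion can also be scripted. The sketch below uses Python's standard `csv` module to turn tab-separated text into comma-separated text and a one-line helper to remove line breaks. The function names `tsv_to_csv` and `unwrap_lines` are hypothetical, chosen for this example.

```python
import csv
import io


def tsv_to_csv(tsv_text):
    """Convert tab-separated text to comma-separated text, quoting where needed."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


def unwrap_lines(text):
    """Join hard-wrapped lines into one line, collapsing runs of whitespace."""
    return " ".join(text.split())


tsv = "word\tcount\nhello, world\t2\n"
csv_text = tsv_to_csv(tsv)
print(csv_text)
print(unwrap_lines("a line\nbroken  in\r\ntwo"))
```

Using the `csv` module instead of a plain find-and-replace matters when a field already contains a comma: the writer quotes that field so the columns stay aligned.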

Correcting/Standardizing Text

Google Refine for entity normalization
Vard 2 for cleaning historical text
TextFixer for changing case, removing whitespace, sorting
Porter stemmer online for stemming text
Microsoft Word to convert formatting to structure

Finding and replacing formatting and special characters in Word
Using regular expressions in Word
Convert text to table and back

Microsoft Excel to split, concatenate, filter data

Excel Text to Columns tool
Excel Concatenate function
Word Frequency in Excel with Filters, COUNTIF

Help with Regular Expressions

Text Editors with Regular Expression Support

Windows
(See also Top 10 Cheap Windows Text Editors with Regular Expressions)

Notepad++
GNU Emacs
Vim
Kate
jEdit (instructions)
NoteTab Light
Microsoft Word (Extended Instructions)
Notepad RE
Zeus Lite Editor
Programmer's Notepad
EditPad Lite
PSPad
SciTE
Crimson Editor
Sublime

Mac
(See also Top 10 Cheap Mac OS X Text Editors with Regular Expressions)

GNU Emacs
Vim
jEdit (instructions)
Kate
Aquamacs
TextWrangler
Sublime
Microsoft Word (Extended Instructions)
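
The find-and-replace patterns these editors support behave much like the ones in a scripting language. As a rough illustration, the Python sketch below applies three typical cleanup patterns; the `normalize` function and the sample string are assumptions made for this example, and pattern syntax differs slightly from editor to editor.

```python
import re


def normalize(text):
    # Rejoin words hyphenated across a line break: "analy-\nsis" -> "analysis".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Strip trailing spaces at the end of each line.
    text = re.sub(r" +$", "", text, flags=re.MULTILINE)
    return text


raw = "text analy-\nsis  is\tfun \nnext"
print(normalize(raw))
```

Note that the first pattern will also join genuinely hyphenated compounds that happen to break across lines, so it is worth spot-checking the result.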

Types of Text Analysis

Basic Text Summarization and Analysis

Word frequency (lists of words and their frequencies)
(See also: Word counts are amazing, Ted Underwood)
Collocation (words commonly appearing near each other)
Concordance (the contexts of a given word or set of words)
N-grams (common two-, three-, etc.- word phrases)
Entity recognition (identifying names, places, time periods, etc.)
Dictionary tagging (locating a specific set of words in the texts)
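
Three of the basic summaries listed above (word frequency, n-grams, and concordance) can be sketched with the Python standard library. This is a minimal illustration with a made-up sample text. The tokenizer is deliberately naive, and n-grams here run across sentence boundaries, which a real tool might avoid.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return all consecutive n-word sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concordance(tokens, keyword, width=2):
    """Keyword-in-context: (left context, keyword, right context) per occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


text = "The quick brown fox jumps over the lazy dog. The quick dog sleeps."
tokens = tokenize(text)
word_freq = Counter(tokens)
bigram_freq = Counter(ngrams(tokens, 2))
kwic = concordance(tokens, "dog")

print(word_freq.most_common(3))
print(bigram_freq.most_common(1))
print(kwic)
```

Collocation analysis builds on the same counts by comparing how often two words appear together with how often each appears alone.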

Higher-Level Goals of Text Analysis

(From Underwood, T. (2012). Where to start with text mining.)

Document Classification

Information retrieval (e.g., search engines)
Supervised classification (e.g., guessing genres)
Unsupervised clustering (e.g., alternative “genres”)

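Corpus comparison (e.g., political speeches)
Language use over time (e.g., Google ngram viewer)
Detecting clusters of document features (i.e., topic modeling)
Entity recognition/extraction (e.g., geoparsing)
Visualization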

Tools and Their Analysis Methods

Web-Based Tools

Voyant Tools – word frequencies, concordance, word clouds, visualizations
TAPorWare – various data cleaning, annotating, and summarizing tools in a web interface
Netlytic – word frequencies, concordance, dictionary tagging, network analysis
Wmatrix – frequency profiles, concordances, compare frequency lists, n-grams and c-grams, collocations
Natural Language Processor & Analyzer - word frequencies, collocations, concordance, tokenizer, etc.
ManyEyes – interactive text visualizations (network diagram, word tree, phrase net, tag cloud, word cloud)
Overview – Automatic topic tagging and visualization
Monk Workbench – Corpus selection from library holdings, frequencies and corpora comparisons, supervised classification
LIWC - Web version will output a few linguistic dimensions; full version can be licensed for ~$100

Downloadable Applications (No Programming Required)

AntWord – word frequencies
AntConc – frequency lists, concordances, collocations, keywords, n-grams
TextSTAT – word frequencies, concordances
Concordance – word frequencies, concordances, indexes
Cowo - semantic network
WordHoard - word frequencies, concordances, collocations, scripting (includes tagged literary corpora)
CasualConc - kwic concordance lines, word clusters, collocation analysis, and word count
NVivo (Duke info) - can cluster sources based on text, also produces phrase nets and tag clouds
Tableau (LibGuide) - word clouds

Other Lists of Tools

TAPoR 2

TAPoRware recipes (tutorials)

DiRT - digital research tools

Advanced Text Analysis

Text Annotation Tools

NVivo
brat rapid annotation tool

Natural Language Processing

GATE
nltk
Stanford NLP Group Software
National Centre for Text Mining (includes some tools for medical texts)
Reporters' Lab Reviews: Entity Extraction
Michael Collins' notes on NLP
Natural (natural language facilities for Node.js)

Sentiment Analysis

Most powerful open source sentiment analysis tools
Bing Liu's Resources on Opinion Mining (including a sentiment lexicon)
NaCTeM Sentiment Analysis Test Site (web form)
pattern web mining module (python)
SentiWordNet
Umigon (for tweets, etc.)
List of sentiment analysis tools for Twitter
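
Several of the resources above are built around sentiment lexicons. The basic idea of lexicon-based scoring can be sketched as follows. The two word sets are tiny made-up samples, not taken from Bing Liu's lexicon or SentiWordNet, and the scoring rule (positive minus negative, divided by the number of matched words) is one simple choice among many.

```python
import re

# Tiny illustrative lexicons (assumptions; real lexicons hold thousands of entries).
POSITIVE = {"good", "great", "excellent", "happy", "love"}
NEGATIVE = {"bad", "poor", "terrible", "sad", "hate"}


def sentiment_score(text):
    """Return a score in [-1, 1]; 0.0 when no lexicon word is found."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


for sentence in ["I love this great book", "A terrible and sad ending", "Good but bad"]:
    print(sentence, "->", sentiment_score(sentence))
```

A sketch like this ignores negation ("not good"), intensity, and sarcasm, which is a large part of why dedicated sentiment tools exist.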

Programming Resources

The Programming Historian - Lessons
Basic Unix workflow for Text Processing
Helpful Unix commands
Similarity and Dissimilarity Measures
An introduction to text analysis with python
Basic Text Analysis in Mathematica
Zend Framework - PHP framework for collecting data
Text Analysis with R for Students of Literature
Python Programming for the Humanities
Document Similarity with R
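
As a small taste of what the similarity resources above cover, the sketch below computes cosine similarity between two documents represented as bag-of-words counts. It uses only the Python standard library. The helper names and sample sentences are made up for the example, and real workflows often weight the counts (for instance with tf-idf) before comparing.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Represent a document as a Counter of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors; 0.0 if either is empty."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = bag_of_words("Text analysis of political speeches")
doc2 = bag_of_words("Analysis of speeches and novels")
doc3 = bag_of_words("Weather forecast for tomorrow")

print(cosine_similarity(doc1, doc2))
print(cosine_similarity(doc1, doc3))
```

Documents that share no words score 0.0, and a document compared with itself scores 1.0 (up to floating-point rounding).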

Text Visualization Examples

Various Text Analysis Projects with Visualizations

With Criminal Intent
Various artistic analyses/interpretations of texts by Stefanie Posavec
The state of our union is... dumber
wordcollider
Popcornjs sentiment tracker
Metropho.rs
Novel Views: Les Miserables
A Christmas Carol (TULP interactive)
Tolkien's Books Analyzed

Word Frequency Visualizations

Google n-gram viewer - word frequencies over time
bookworm Open Library - word frequencies over time
Historical culturomics of pronoun frequencies - pronoun frequencies by gender over time
The Words They Used - bubble cloud of words from national convention speeches, with size and color coding
Bib.ly - word frequencies throughout the Bible
Ye Shall Know Them By Their Words - word frequencies by topic for presidential nomination speeches (additional description)
FACTA+ Visualizer - tree map of term frequency
Inaugural language (Boston Globe) - radial scatterplots
Mining Books to Map Emotions - frequencies of sentiment terms over time

Topic Model Visualizations

Termite - tabular, proportional symbol visualization of words and topics
PMLA topic network - a network view of the topics from a topic model of PMLA, where links are created for shared words between topics (additional description)
Using Word Clouds for Topic Modeling Results - visualizing the distribution of words for each topic as separate word clouds

Source:

https://www.evernote.com/shard/s438/client/snv?noteGuid=2794ac6f-aef1-2a32-b400-bd287fd0651f%C2%ACeKey=9a55480d677710d9813d2458e4589890&sn=https%3A%2F%2Fwww.evernote.com%2Fshard%2Fs438%2Fsh%2F2794ac6f-aef1-2a32-b400-bd287fd0651f%2F9a55480d677710d9813d2458e4589890&title=%25E6%2596%2587%25E6%259C%25AC%25E5%2588%2586%25E6%259E%2590%25E6%25AD%25A5%25E9%25AA%25A4%252C%2B%25E5%25B7%25A5%25E5%2585%25B7%252C%2B%25E9%2580%2594%25E5%25BE%2584%25E5%2592%258C%25E5%258F%25AF%25E8%25A7%2586%25E5%258C%2596%25E5%25A6%2582%25E4%25BD%2595%25E5%2581%259A%25EF%25BC%259F
