How to Do Text Analysis: Steps, Tools, Approaches, and Visualization
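Yesterday we featured "After Nobel laureate Romer, another World Bank chief economist has left in a hurry! The reasons behind it are thought-provoking!", which looked at the disputes between the World Bank and two of its chief economists. Today the ML econometrics research group introduces the guide "How to Do Text Analysis: Steps, Tools, Approaches, and Visualization" by Angela Zoss of Duke University. The short URL below (or "Read the original") opens the links that appear in the text: http://dwz.date/yaA
Introduction to Text Analysis
"Text analysis" is a broad term covering a variety of processes by which text and natural-language documents can be modified so that they can be organized and described.
This guide collects resources for several stages of the text analysis process: text collection, text parsing and cleaning, text summarization and analysis methods, and text visualization.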
Overviews/summaries
Ted Underwood – Where to start with text mining
Tooling Up for Digital Humanities – Text Analysis
Ryan Shaw – Text Mining
John Laudin – Text Analytics 101
O'Connor, Bamman, & Smith (2011) – Computational Text Analysis for Social Science
Ben Schmidt – Comparing Corpuses by Word Use
Possible Text Sources
Native digital text:
  Email (Thunderbird extension, MUSE*)
  HTML
  RSS feeds
  Sample specific services:
    Twitter
    Wikipedia
    Data Liberation Front
    New York Times API
    CMU Movie Summary Corpus
    Corpus of Global Web-Based English (GloWbE)
    PLOS Text Mining Collection
    Tutorials for data collection from various services
Digitized:
  Internet Archive
  Project Gutenberg
  Google Books
  Hathi Trust (Hathi Download Helper)
  JSTOR Data for Research* (with Early Journal Content bundle, also from archive.org)
  PubMed Open Access Subset
  Monk Workbench*
  Document Cloud*
  Open American National Corpus (collection of American English from various sources)
  WordHoard* (tagged literary texts)
  Corpus of Contemporary American English
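Cleaning Text for Analysis
Before starting a text analysis project, the text usually needs substantial cleaning and parsing. Much text is created and stored so that people can understand it, and it is not always easy for a computer to process.
Computers work well when the data source has structure, or at least some regular pattern that can be recognized. Most cleaning and parsing steps for text analysis either add regularity (for example, correcting typos) or add structure (tagging certain words as important, or splitting a document into parts with special meaning such as title, author, and chapter).
Removing stop words (deleting very common words like "a", "the", "and", etc.)
Stemming or lemmatization (ways of combining words that have the same linguistic root or stem)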
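The two cleaning steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list and the suffix rules are made up for the example, and the stemmer is a naive suffix stripper, not the Porter algorithm or a real lemmatizer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "were"}

# Suffixes tried in order by the naive stemmer below (hypothetical rule set).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Tokenize, remove stop words, then stem what is left."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


print(clean("The cats are chasing the dogs and jumped walls"))
```

Note that a crude stemmer like this produces non-words (for instance "chasing" becomes "chas"). That is usually acceptable when the goal is only to group related word forms, but a lemmatizer is the better choice when readable output matters.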
File Conversion
Extracting from PDF:
  More timesavers to unlock public records data (PDFs into spreadsheets)
  Tabula (Java program for all platforms)
  gImageReader (OCR for images, PDFs)
Cleaning HTML/XML:
  Beautiful Soup
  scrubber (also lemmatizes, removes stop words with prepared lists)
  HTML to Text (or Story) from Data Science Toolkit
Changing tabs to commas, removing line breaks, etc.:
  Sort My List (also changes case, removes punctuation)
  TextFixer
  Transformer (rescue texts from old file formats)
  Text Mechanic
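For readers who prefer a script to the tools listed above, here is a minimal sketch of two of these conversions: stripping HTML markup and turning tabs into commas. It uses only Python's standard-library `html.parser` rather than Beautiful Soup, so it is less forgiving of broken markup. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string on a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text chunks with spaces, then collapse all whitespace runs
    # (including line breaks) into single spaces.
    return " ".join(" ".join(parser.parts).split())


def tabs_to_commas(line):
    """Turn one tab-separated line into a comma-separated line (no quoting)."""
    return ",".join(field.strip() for field in line.split("\t"))


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1>\n<p>Tom &amp; Jerry</p></body></html>"
)
print(html_to_text(sample))
print(tabs_to_commas("word\tcount\n"))
```

Joining chunks with a space keeps words from neighboring tags apart, at the cost of splitting a word that has inline markup in the middle of it. The tab conversion does no quoting, so for fields that may themselves contain commas the standard `csv` module is the safer route.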
Correcting/Normalizing Text
Google Refine for entity normalization
Vard 2 for cleaning historical text
TextFixer for changing case, removing whitespace, sorting
Porter stemmer online for stemming text
Microsoft Word to convert formatting to structure:
  Finding and replacing formatting and special characters in Word
  Using regular expressions in Word
  Convert text to table and back
Microsoft Excel to split, concatenate, filter data:
  Excel Text to Columns tool
  Excel Concatenate function
  Word Frequency in Excel with Filters, COUNTIF
Regular Expression Help
Text Editors with Regular Expression Support
Windows
(See also Top 10 Cheap Windows Text Editors with Regular Expressions)
Notepad++
GNU Emacs
Vim
Kate
jEdit (instructions)
NoteTab Light
Microsoft Word (Extended Instructions)
Notepad RE
Zeus Lite Editor
Programmer's Notepad
EditPad Lite
PSPad
SciTE
Crimson Editor
Sublime
Mac
(See also Top 10 Cheap Mac OS X Text Editors with Regular Expressions)
GNU Emacs
Vim
jEdit (instructions)
Kate
Aquamacs
TextWrangler
Sublime
Microsoft Word (Extended Instructions)
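The same kind of find-and-replace that these editors offer can also be scripted. The sketch below uses Python's `re` module to show three typical cleanup patterns. The patterns and the function name are illustrative assumptions, and regex syntax differs slightly between editors.

```python
import re


def normalize(text):
    """Apply three common regex cleanups to text extracted from a document."""
    # Rejoin words hyphenated across a line break: "analy-\nsis" -> "analysis".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace single line breaks inside a paragraph with a space,
    # but leave blank lines (paragraph breaks) intact.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


raw = "Text analy-\nsis is   a broad\nterm.\n\nNew\tparagraph."
print(normalize(raw))
```

One caveat: the first pattern also joins genuine compounds that happen to break at the hyphen (for example "well-" followed by "known"), so results on real documents are worth spot-checking.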
Types of Text Analysis
Basic Text Summaries and Analyses
Word frequency (lists of words and their frequencies) (See also: Word counts are amazing, Ted Underwood)
Collocation (words commonly appearing near each other)
Concordance (the contexts of a given word or set of words)
N-grams (common two-, three-, etc.- word phrases)
Entity recognition (identifying names, places, time periods, etc.)
Dictionary tagging (locating a specific set of words in the texts)
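Several of the basic summaries above need very little code. The following sketch covers word frequency, n-grams, concordance (keyword in context), and dictionary tagging using only the Python standard library. All function names are hypothetical and the tokenizer is deliberately simple. Counting adjacent bigrams is the crudest form of collocation; entity recognition needs trained models and is not attempted here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Map each word to the number of times it occurs."""
    return Counter(tokens)


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concordance(tokens, keyword, width=2):
    """Keyword in context: each hit with `width` tokens on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def dictionary_tag(tokens, dictionary):
    """Count how often each word from a fixed word list occurs."""
    return dict(Counter(t for t in tokens if t in dictionary))


text = "The cat sat on the mat. The dog sat on the log."
tokens = tokenize(text)
print(word_frequencies(tokens).most_common(3))
print(Counter(ngrams(tokens, 2)).most_common(2))
print(concordance(tokens, "sat"))
print(dictionary_tag(tokens, {"cat", "dog", "bird"}))
```

On real corpora the most frequent words and bigrams are dominated by function words such as "the" and "on", which is why stop-word removal is usually applied before counting.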
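Higher-Level Goals of Text Analysis
Document categorization:
  Information retrieval (e.g., search engines)
  Supervised classification (e.g., guessing genres)
  Unsupervised clustering (e.g., alternative "genres")
Corpus comparison (e.g., political speeches)
Language use over time (e.g., Google ngram viewer)
Detecting clusters of document features (i.e., topic modeling)
Entity recognition/extraction (e.g., geoparsing)
Visualization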
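As one concrete case of the goals above, information retrieval can be sketched as ranking documents by their similarity to a query. The example below weights words by TF-IDF and compares vectors with cosine similarity, using only the standard library. The smoothed idf formula, the function names, and the three sample documents are assumptions chosen for illustration; the same document vectors are also a common starting point for classification and clustering.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed idf so that terms found in every document keep a small weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, docs):
    """Rank document indices by cosine similarity to the query, best first."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the corpus get weight 0.
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    scores = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    return [i for score, i in sorted(scores, key=lambda s: (-s[0], s[1]))]


docs = [
    "Tax policy and the federal budget",
    "The team won the football game",
    "Budget cuts and tax reform in congress",
]
print(search("tax budget", docs))
```

Cosine similarity normalizes for document length, so a short document that matches the query is not penalized relative to a long one that mentions the same words once among many others.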
Tools and Their Analysis Methods
Web-Based Tools
Voyant Tools – word frequencies, concordance, word clouds, visualizations
TAPoRware – various data cleaning, annotating, and summarizing tools in a web interface
Netlytic – word frequencies, concordance, dictionary tagging, network analysis
Wmatrix – frequency profiles, concordances, compare frequency lists, n-grams and c-grams, collocations
Natural Language Processor & Analyzer – word frequencies, collocations, concordance, tokenizer, etc.
ManyEyes – interactive text visualizations (network diagram, word tree, phrase net, tag cloud, word cloud)
Overview – automatic topic tagging and visualization
Monk Workbench – corpus selection from library holdings, frequencies and corpora comparisons, supervised classification
LIWC – web version will output a few linguistic dimensions; full version can be licensed for ~$100
Downloadable Applications (No Programming Required)
AntWord – word frequencies
AntConc – frequency lists, concordances, collocations, keywords, n-grams
TextSTAT – word frequencies, concordances
Concordance – word frequencies, concordances, indexes
Cowo – semantic network
WordHoard – word frequencies, concordances, collocations, scripting (includes tagged literary corpora)
CasualConc – KWIC concordance lines, word clusters, collocation analysis, and word count
NVivo (Duke info) – can cluster sources based on text, also produces phrase nets and tag clouds
Tableau (LibGuide) – word clouds
Other Tool Lists
TAPoR 2
TAPoRware recipes (tutorials)
DiRT - digital research tools
Advanced Text Analysis
Text Annotation Tools
NVivo
brat rapid annotation tool
Natural Language Processing
GATE
nltk
Stanford NLP Group Software
National Centre for Text Mining (includes some tools for medical texts)
Reporters' Lab Reviews: Entity Extraction
Michael Collins' notes on NLP
Natural (natural language facilities for Node.js)
Sentiment Analysis
Most powerful open source sentiment analysis tools
Bing Liu's Resources on Opinion Mining (including a sentiment lexicon)
NaCTeM Sentiment Analysis Test Site (web form)
pattern web mining module (python)
SentiWordNet
Umigon (for tweets, etc.)
List of sentiment analysis tools for Twitter
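Several of the resources above are built around sentiment lexicons. To show the basic idea, here is a minimal lexicon-based scorer. The word lists are tiny hypothetical stand-ins, not taken from Bing Liu's lexicon or SentiWordNet, and the negation rule (flip a word that directly follows "not", "no", or "never") is a deliberate simplification.

```python
import re

# Tiny hypothetical lexicon for illustration only; real lexicons
# contain thousands of entries.
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "sad"}
NEGATORS = {"not", "no", "never"}


def sentiment_score(text):
    """Return positive minus negative hits, flipping words after a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        # +1 for a positive word, -1 for a negative word, 0 otherwise.
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score


for sentence in [
    "I love this great book",
    "The plot was not good",
    "A sad, terrible ending",
]:
    print(sentence, "->", sentiment_score(sentence))
```

A score above zero suggests positive wording and a score below zero negative wording. Sarcasm, intensifiers, and negation at a distance are all beyond a simple count like this.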
Programming Resources
The Programming Historian - Lessons
Basic Unix workflow for Text Processing
Helpful Unix commands
Similarity and Dissimilarity Measures
An introduction to text analysis with python
Basic Text Analysis in Mathematica
Zend Framework - PHP framework for collecting data
Text Analysis with R for Students of Literature
Python Programming for the Humanities
Document Similarity with R
Text Visualization Examples
Various Text Analysis Projects with Visualizations
With Criminal Intent
Various artistic analyses/interpretations of texts by Stefanie Posavec
The state of our union is... dumber
wordcollider
Popcornjs sentiment tracker
Metropho.rs
Novel Views: Les Miserables
A Christmas Carol (TULP interactive)
Tolkien's Books Analyzed
Word Frequency Visualizations
Google n-gram viewer - word frequencies over time
bookworm Open Library - word frequencies over time
Historical culturomics of pronoun frequencies - pronoun frequencies by gender over time
The Words They Used - bubble cloud of words from national convention speeches, with size and color coding
Bib.ly - word frequencies throughout the Bible
Ye Shall Know Them By Their Words - word frequencies by topic for presidential nomination speeches (additional description)
FACTA+ Visualizer - tree map of term frequency
Inaugural language (Boston Globe) - radial scatterplots
Mining Books to Map Emotions - frequencies of sentiment terms over time
Topic Model Visualizations
Termite - tabular, proportional symbol visualization of words and topics
PMLA topic network - a network view of the topics from a topic model of PMLA, where links are created for shared words between topics (additional description)
Using Word Clouds for Topic Modeling Results - visualizing the distribution of words for each topic as separate word clouds