河南南阳收割机被堵事件:官员缺德,祸患无穷

极目新闻领导公开“记者毕节采访被打”细节:他们打人后擦去指纹

突发!员工跳楼!只拿低保工资!央企设计院集体罢工!

退休后的温家宝

突发!北京某院集体罢工!

生成图片,分享到微信朋友圈

自由微信安卓APP发布,立即下载! | 提交文章网址
查看原文

R语言实现EDA + HeatMap

文章来源知乎专栏:李小二姜的R语言可视化和数据挖掘专栏

在Kaggle上面找到了Datasets区域暂时排行第一的数据集,看了一下是关于Google IO上面的捐赠项目的分析,最近一直在弄PHP,对于R语言有点生疏,在参考其他kernel上面的大佬的EDA之后,我也拿着个数据集来练习一下数据可视化和数据操作的熟练度。

Data Science For Good: DonorsChoose.orgwww.kaggle.com

数据集介绍

DonorsChoose.org是一个由美国历史老师在2000年建立的一个公益性组织,它主要是公立学校老师用来给他们的学生申请所需教学物资的平台,已经有全国三分之四的老师都在平台上面使用过,这个平台很好的支持了美国公立教育。

数据集一共包含6个数据集,每一个数据集在Data部分都有介绍,包含Donations,Donors,Projects,Resources,Schools和Teachers,不同数据集之间都有SQL主键进行连接,可以很方便的连表查询


加载数据包和数据集

library(plotly,quietly = T) library(data.table,quietly = T) library(tidyverse,quietly = T) library(lubridate) library(stringr,quietly = T) library(treemapify) library(tm) library(magrittr) library(tidytext) Teacher<-fread("Teachers.csv")Donations<-fread("Donations.csv")Donors<-fread("Donors.csv")Projects<-fread("Projects.csv")Resources<-fread("Resources.csv")Schools<-fread("Schools.csv")

基本都是很常用的data manipulation包以及绘图和文件IO的包,因为本身这几个数据集都很大,答主都是用哪个就加载哪个,没办法啊,rstudio把我16G的内存台式机都弄垮了。


数据操作和EDA

Donations部分包含的数据部分不多,主要是以下:

Observations: 4,687,884 Variables: 7 $ Project ID                          <chr> "000009891526c0ade7180f8423792063", "000009891526c0ade7180f8423792063", "00000... $ Donation ID                         <chr> "688729120858666221208529ee3fc18e", "dcf1071da3aa3561f91ac689d1f73dee", "18a23... $ Donor ID                            <chr> "1f4b5b6e68445c6c4a0509b3aca93f38", "4aaab6d244bf3599682239ed5591af8a", "0b076... $ Donation Included Optional Donation <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Y... $ Donation Amount                     <dbl> 178.37, 25.00, 20.00, 25.00, 25.00, 15.00, 50.00, 50.00, 200.00, 10.00, 100.00... $ Donor Cart Sequence                 <int> 11, 2, 3, 1, 2, 1, 1, 2, 2, 44, 1, 1, 1, 1, 1, 2, 1, 3, 2, 9, 1, 5, 1, 1, 43, ... $ Donation Received Date              <chr> "2016-08-23 13:15:57", "2016-06-06 20:05:23", "2016-06-06 14:08:46", "2016-05-...

检查里面Donation Amount的数据基本分布和基本的数据分布情况:

mean(Donations$`Donation Amount`);median(Donations$`Donation Amount`);max(Donations$`Donation Amount`);min(Donations$`Donation Amount`) #平均值60.669,中位数值25,最大60000USD,最少0.01USD p1<-Donations %>% filter(`Donation Amount`<=median(Donations$`Donation Amount`)) %>% ggplot(aes(x=`Donation Amount`))+geom_histogram(fill="red")+  labs(y="Total Counts",title="Amount distribution")+theme_bw() p2<-Donations %>% ggplot(aes(x=`Donation Included Optional Donation`))+geom_bar(fill=c("red","blue"))+coord_polar(theta = "x")+  theme(axis.text.y = element_blank(),axis.title.x = element_blank(),axis.title.y = element_blank())+  labs(title="Donation Included Optional Donation") p3<-Donors %>% filter(!is.na(`Donor Is Teacher`)) %>% group_by(`Donor Is Teacher`) %>% summarise(Total = n()) %>%  arrange(desc(Total)) %>% ungroup() %>% ggplot(aes(x=`Donor Is Teacher`))+geom_bar(fill=c("red","blue"))+coord_polar(theta = "x")+  theme(axis.text.y = element_blank(),axis.title.x = element_blank(),axis.title.y = element_blank())+  labs(title="Donor Is Teacher") grid.arrange(p1,p2,p3,ncol=1)

再主要看一下接收时间的一个分布情况:

Donations<-Donations %>% rename(DRD=`Donation Received Date`) %>% mutate(DRD=ymd_hms(DRD)) Donations %>% mutate(yyy=year(DRD)) %>% ggplot(aes(x=yyy,fill="#b388ff"))+  geom_density(aes(y=..scaled..))+labs(y="",x="",title="Donation Received Date Count")+ggthemes::theme_pander()+  theme(legend.position = "none")

很直观的看出来2017年是接收最多的一年,可能2018年的数据还是不是很完整,所以2018的数据下降不少。


接下来是Donors的数据集:

> glimpse(Donors) Observations: 2,122,640 Variables: 5 $ Donor ID         <chr> "00000ce845c00cbf0686c992fc369df4", "00002783bc5d108510f3f9666c8b1edd", "00002d44003ed46b066607c5... $ Donor City       <chr> "Evanston", "Appomattox", "Winton", "Indianapolis", "Paterson", "", "Stamford", "Green Bay", "Arg... $ Donor State      <chr> "Illinois", "other", "California", "Indiana", "New Jersey", "other", "Connecticut", "Wisconsin", ... $ Donor Is Teacher <chr> "No", "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "No", "No",... $ Donor Zip        <chr> "602", "245", "953", "462", "75", "", "69", "543", "128", "463", "190", "87", "198", "775", "105"...

除了最后一个Donor Zip的字段我没有用之外,其他的字段都用到了。

p1<-Donors %>% filter(`Donor City`!='' & !is.na(`Donor City`)) %>% group_by(`Donor City`) %>% summarise(total=n()) %>% arrange(desc(total)) %>%  top_n(15,wt=total) %>% ggplot(aes(x=`Donor City`,y=total,fill=as.factor(`Donor City`)))+geom_bar(stat="identity")+coord_flip()+  labs(title="Donors City Frequency",fill="",y="",x="")+theme(legend.position = "none") p2<-Donors %>% filter(`Donor State`!='' & !is.na(`Donor State`)) %>% group_by(`Donor State`) %>% summarise(total=n()) %>% arrange(desc(total)) %>%  top_n(15,wt=total) %>% mutate(`Donor State`=reorder(`Donor State`,total)) %>% ggplot(aes(x=`Donor State`,y=total,fill=as.factor(`Donor State`)))+geom_bar(stat="identity")+coord_flip()+  labs(title="Donor State Frequency",fill="",y="",x="")+theme(legend.position = "none") grid.arrange(p1,p2)

可以看书芝加哥是捐助最多的城市,加州是捐助最多的州。


接下来是最重要的Projects数据集,包含的信息最多,所以我们也分析的最多。

Observations: 1,110,017 Variables: 18 $ Project ID                       <chr> "7685f0265a19d7b52a470ee4bac883ba", "f9f4af7099061fb4bf44642a03e5c331", "afd99a01... $ School ID                        <chr> "e180c7424cb9c68cb49f141b092a988f", "08b20f1e2125103ed7aa17e8d76c71d4", "1287f512... $ Teacher ID                       <chr> "4ee5200e89d9e2998ec8baad8a3c5968", "cca2d1d277fb4adb50147b49cdc3b156", "6c5bd0d4... $ Teacher Project Posted Sequence  <int> 25, 3, 1, 40, 2, 4, 3, 57, 14, 1, 19, 1, 1, 2, 4, 1, 5, 8, 3, 1, 146, 1, 35, 2, 1... $ Project Type                     <chr> "Teacher-Led", "Teacher-Led", "Teacher-Led", "Teacher-Led", "Teacher-Led", "Teach... $ Project Title                    <chr> "Stand Up to Bullying: Together We Can!", "Learning in Color!", "Help Second Grad... $ Project Essay                    <chr> "Did you know that 1-7 students in grades K-12 is either a bully or a victim of b... $ Project Short Description        <chr> "Did you know that 1-7 students in grades K-12 is either a bully or a victim of b... $ Project Need Statement           <chr> "My students need 25 copies of \"\"Bullying in Schools\"\" for each to keep, \"\"... $ Project Subject Category Tree    <chr> "Applied Learning", "Applied Learning, Literacy & Language", "Literacy & Language... $ Project Subject Subcategory Tree <chr> "Character Education, Early Development", "Early Development, Literacy", "ESL", "... $ Project Grade Level Category     <chr> "Grades PreK-2", "Grades PreK-2", "Grades PreK-2", "Grades 3-5", "Grades 3-5", "G... $ Project Resource Category        <chr> "Technology", "Technology", "Supplies", "Books", "Supplies", "Other", "Trips", "B... $ Project Cost                     <dbl> 361.80, 512.85, 435.92, 161.26, 264.19, 175.15, 3020.59, 566.19, 339.20, 566.73, ... $ Project Posted Date              <chr> "2013-01-01", "2013-01-01", "2013-01-01", "2013-01-01", "2013-01-01", "2013-01-01... $ Project Expiration Date          <chr> "2013-05-30", "2013-05-31", "2013-05-30", "2013-05-31", "2013-05-30", "2013-05-31... $ Project Current Status           <chr> "Fully Funded", "Expired", "Fully Funded", "Fully Funded", "Fully Funded", "Fully... $ Project Fully Funded Date        <chr> "2013-01-11", "", "2013-05-22", "2013-02-06", "2013-01-01", "2013-02-01", "2013-0...

信息量很多,我们一个一个来:

Projects %>% filter(`Project Current Status`!='' && !is.na(`Project Current Status`)) %>% group_by(`Project Current Status`) %>%  summarise(total=n()) %>% mutate(`Project Current Status`=as.factor(`Project Current Status`)) %>%  plot_ly(labels=~`Project Current Status`,values=~total,type="pie",textposition = 'inside',textinfo = 'label+percent') %>%  layout(title="Project Current Status Calculation") Projects %>% filter(`Project Resource Category`!='' && !is.na(`Project Resource Category`)) %>% group_by(`Project Resource Category`) %>%  summarise(total=n()) %>% mutate(`Project Resource Category`=as.factor(`Project Resource Category`)) %>%  plot_ly(labels=~`Project Resource Category`,values=~total,type="pie",textposition = 'inside',textinfo = 'label+percent') %>% layout(title="Project Resource Category Calculation") p3<-Projects %>% filter(`Project Type`!='' && !is.na(`Project Type`)) %>% group_by(`Project Type`) %>%  summarise(total=n()) %>% mutate(`Project Type`=as.factor(`Project Type`)) %>% ggplot(aes(x=`Project Type`,y=total,fill=as.factor(`Project Type`)))+  geom_bar(stat = "identity")+labs(title="Project Type",y="",x="",fill="")+theme(legend.position = "none")+coord_flip() p4<-Projects %>% filter(`Project Resource Category`!='' && !is.na(`Project Resource Category`)) %>% group_by(`Project Resource Category`) %>%  summarise(total=n()) %>% mutate(`Project Resource Category`=as.factor(`Project Resource Category`)) %>% ggplot(aes(x=`Project Resource Category`,y=total,fill=as.factor(`Project Resource Category`)))+  geom_bar(stat = "identity")+labs(title="Project Resource Category",y="",x="",fill="")+theme(legend.position = "none",axis.text.x = element_blank())+coord_flip() grid.arrange(p3,p4,nrow=2)
p1<-Projects %>% filter(`Project Subject Category Tree`!='' && !is.na(`Project Subject Category Tree`)) %>% group_by(`Project Subject Category Tree`) %>%  summarise(total=n()) %>% filter(total>20000) %>% ggplot(aes(area=total,fill=total,label=as.factor(`Project Subject Category Tree`)))+  geom_treemap()+geom_treemap_text(fontface = "italic", colour = "red", place = "centre",grow = TRUE,alpha=1)+  scale_fill_continuous(low="lightblue",high = "darkblue")+ggtitle("Project Subject Category Tree") p2<-Projects %>% filter(`Project Subject Subcategory Tree`!='' && !is.na(`Project Subject Subcategory Tree`)) %>% group_by(`Project Subject Subcategory Tree`) %>%  summarise(total=n()) %>% filter(total>20000) %>% ggplot(aes(area=total,fill=total,label=as.factor(`Project Subject Subcategory Tree`)))+  geom_treemap()+geom_treemap_text(fontface = "italic", colour = "blue", place = "centre",grow = TRUE,alpha=1)+  scale_fill_continuous(low="white",high = "red")+ggtitle("Project Subject Subcategory Tree") grid.arrange(p1,p2,nrow=1)

根据树状图可以找出哪一些项目类型是最多的,也是新学的一个图样。

Projects %>% sample_n(200) %>% unnest_tokens(word,`Project Essay`) %>% filter(!(word %in% stop_words$word)) %>%  count(word,sort = TRUE) %>% ungroup() %>% top_n(200,wt=word) %>% filter(str_detect(word,letters)) %>%ggplot(aes(word, n)) +  geom_col(fill = "#ffc107") + labs(x = NULL, y = "tf-idf") + coord_flip() + theme_bw()+ggtitle("Words in Project Essay") Projects %>% unnest_tokens(word,`Project Title`) %>% filter(!(word %in% stop_words$word)) %>%  count(word,sort = TRUE) %>% ungroup()  %>% head(100) %>%  with(wordcloud::wordcloud(word, n, max.words = 30,colors=RColorBrewer::brewer.pal(8, "Dark2")))

这两段代码是从essay和title里面提取词,来寻找那个是essay里面的关键文字

p1<-Projects %>% group_by(`Project Posted Date`) %>% summarise(total=n()) %>% mutate(`Project Posted Date`=as.POSIXct(`Project Posted Date`)) %>% ggplot(aes(x=`Project Posted Date`,y=total,group=1))+  geom_path(color="blue")+scale_x_datetime(limits = as.POSIXct(c("2013-01-01","2018-05-01")))+labs(y="",title="Project Posted Date") p2<-Projects %>% group_by(`Project Expiration Date`) %>% summarise(total=n()) %>% filter(`Project Expiration Date` != "") %>% mutate(`Project Expiration Date`=as.POSIXct(`Project Expiration Date`)) %>% ggplot(aes(x=`Project Expiration Date`,y=total,group=1))+  geom_path(color="red")+scale_x_datetime(limits = as.POSIXct(c("2013-05-01","2018-12-31")))+labs(y="",title="Project Expiration Date") p3<-Projects %>% group_by(`Project Fully Funded Date`) %>% summarise(total=n()) %>% filter(`Project Fully Funded Date` != "") %>% mutate(`Project Fully Funded Date`=as.POSIXct(`Project Fully Funded Date`)) %>% ggplot(aes(x=`Project Fully Funded Date`,y=total,group=1))+  geom_path(color="green")+scale_x_datetime(limits = as.POSIXct(c("2013-05-01","2018-12-31")))+labs(y="",title="Project Fully Funded Date") gridExtra::grid.arrange(p1,p2,p3,nrow=3)

还有一部分projects的数据是和后面的数据进行连表查询之后的结果,后面会给大家一起放上来。

接下来是School和Resource部分:

> glimpse(Schools) Observations: 72,993 Variables: 9 $ School ID                    <chr> "00003e0fdd601b8ea0a6eb44057b9c5e", "00004e32a448b4832e1b993500bf0731", "0002021bb799... $ School Name                  <chr> "Capon Bridge Middle School", "The Woodlands College Park High School", "Samantha Smi... $ School Metro Type            <chr> "rural", "urban", "suburban", "unknown", "rural", "unknown", "urban", "unknown", "urb... $ School Percentage Free Lunch <int> 56, 41, 2, 76, 50, 63, 17, 15, 46, 29, 26, 31, 47, 70, 61, 82, 34, 49, 95, 86, 45, 32... $ School State                 <chr> "West Virginia", "Texas", "Washington", "Michigan", "Texas", "Arizona", "New York", "... $ School Zip                   <int> 26711, 77384, 98074, 48370, 75573, 85706, 10029, 29045, 95122, 60025, 76262, 20659, 2... $ School City                  <chr> "Capon Bridge", "The Woodlands", "Sammamish", "Oxford", "Redwater", "Tucson", "New Yo... $ School County                <chr> "Hampshire", "Montgomery", "King", "Oakland", "Bowie", "Pima", "New York (Manhattan)"... $ School District              <chr> "Hampshire Co School District", "Conroe Ind School District", "Lake Washington Sch Di... > glimpse(Resources) Observations: 7,210,448 Variables: 5 $ Project ID           <chr> "000009891526c0ade7180f8423792063", "00000ce845c00cbf0686c992fc369df4", "00002d44003ed46b0666... $ Resource Item Name   <chr> "chair move and store cart", "sony mdr zx100 blk   headphones", "gaiam kids stay-n-play balan... $ Resource Quantity    <dbl> 1, 40, 4, 1, 1, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,... $ Resource Unit Price  <dbl> 350.00, 12.86, 19.00, 269.00, 131.85, 33.88, 172.11, 6.43, 24.20, 5.90, 3.46, 5.23, 7.87, 5.4... $ Resource Vendor Name <chr> "", "CDW-G", "Amazon Business", "Lakeshore Learning Materials", "Amazon Business", "Amazon Bu...

首先观察school里面的所有变量:

p1<-Schools %>% filter(`School Name` != "" && !is.na(`School Name`)) %>% group_by(`School Name`) %>% summarise(total=n()) %>%  arrange(desc(total)) %>% top_n(20,wt=total) %>% ggplot(aes(x=`School Name`,y=total,fill=as.factor(`School Name`)))+  geom_bar(stat = "identity")+labs(y="",fill="",x="",title="School Name")+coord_flip()+theme(legend.position = "none") p2<-Schools %>% filter(`School City` != "" && !is.na(`School City`)) %>% group_by(`School City`) %>% summarise(total=n()) %>%  arrange(desc(total)) %>% top_n(20,wt=total) %>% ggplot(aes(x=`School City`,y=total,fill=as.factor(`School City`)))+  geom_bar(stat = "identity")+labs(y="",fill="",x="",title="School City")+coord_flip()+theme(legend.position = "none") p3<-Schools %>% filter(`School State` != "" && !is.na(`School State`)) %>% group_by(`School State`) %>% summarise(total=n()) %>%  arrange(desc(total)) %>% top_n(20,wt=total) %>% ggplot(aes(x=reorder(`School State`,total),y=total,fill=as.factor(`School State`)))+  geom_bar(stat = "identity")+labs(y="",fill="",x="",title="School State")+coord_flip()+ggthemes::theme_economist()+  theme(legend.position = "none") p4<-Schools %>% filter(`School District` != "" && !is.na(`School District`)) %>% group_by(`School District`) %>% summarise(total=n()) %>%  arrange(desc(total)) %>% top_n(20,wt=total) %>% ggplot(aes(x=reorder(`School District`,total),y=total,fill=as.factor(`School District`)))+  geom_bar(stat = "identity")+labs(y="",fill="",x="",title="School District")+coord_flip()+ggthemes::theme_tufte()+  theme(legend.position = "none") gridExtra::grid.arrange(p1,p2,p3,p4,nrow=2,ncol=2)
p1<-Schools %>%  rename(FreeLunch = `School Percentage Free Lunch`) %>%  ggplot(aes(x = FreeLunch)) +  geom_histogram(bins = 30,fill = "orange") +  labs(x= 'School Percentage Free Lunch',y = 'Count', title = paste("Distribution of", ' School Percentage Free Lunch ')) +  theme_bw() range(Resources$`Resource Unit Price` %>% na.omit()) p2<-Resources %>% rename(uprice=`Resource Unit Price`) %>% na.omit() %>% filter(uprice<=1000) %>%ggplot(aes(x=uprice)) + geom_density(fill="#558b2f")+  labs(x="Resource Unit Price",title="Distribution of Resource Unit Price",y="")+  ggplot2::annotate("text",x=800,y=0.04,label="range of price 0~97085.5")+theme_bw() Projects$duration <- as.numeric(as.Date(Projects$`Project Fully Funded Date`)-as.Date(Projects$`Project Posted Date`)) p3<-Projects %>% na.omit() %>% ggplot(aes(duration))+geom_density(fill="#0277bd")+labs(y="",title="Duration of Time duration")

这里在projects里面构建了一个新变量,就是项目开始和接收的时间的区间时长,看的出来大部分都会在很短的时间里面就完成。

我们在resource里面找到了`Resource Item Name`这个变量,是一些物件的名称,我们提取字来观察那些是最多的,但是由于数据集多大,我随即挑选了5000个样本来观察,估计很难反应全体情况,只是作为一个参考。

Resources %>% sample_n(5e3) %>% rename(rin=`Resource Item Name`) %>% filter(rin != '' && !is.na(rin)) %>%  unnest_tokens(word,rin) %>% filter(!(word %in% stop_words$word)) %>% count(word,sort=T)  %>% top_n(30,wt=n) %>%  filter(str_length(word) >= 2 & !(word %in% as.character(c(1:1000)))) %>% ggplot(aes(x=word,y=n,fill=as.factor(word)))+geom_bar(stat = "identity")+  coord_polar(theta = "x")+labs(title="Resource Item Name Frequency",y="",x="")+ggthemes::theme_fivethirtyeight()+  theme(legend.position = "none",axis.text.y = element_blank())

ipad和books???为毛不直接用ibooks呢???

后面我们联合几个数据集进行联合查询,找寻不同变量之间和捐款金额的差异热图。

teacher_amount<-inner_join(Donations,Donors,by="Donor ID") teacher_amount %>% group_by(`Donor Is Teacher`,`Donation Included Optional Donation`) %>% summarise(total=n()) %>%  ggplot(aes(x=`Donor Is Teacher`,y=`Donation Included Optional Donation`,fill=total))+geom_tile()+  scale_fill_gradient2(low="blue",high = "red")+theme_bw()+labs(title="Relation between DIT and DIOD",fill="Counts") prefix_amount<-inner_join(Donations,Projects,by="Project ID") %>% inner_join(Teacher,by="Teacher ID") prefix_amount %<>% select(`Teacher Prefix`,`Donation Amount`,`Donation Included Optional Donation`) %>% filter(`Teacher Prefix` != "") prefix_amount %>% group_by(`Teacher Prefix`,`Donation Included Optional Donation`) %>% summarise(total=mean(`Donation Amount`)) %>%  ggplot(aes(x=`Teacher Prefix`,y=`Donation Included Optional Donation`,fill=total))+geom_tile()+  scale_fill_continuous(low = "#84ffff",high = "#ff5722")+theme_bw()+labs(title="Relation between TP and DIOD",fill="Donation Amount(mean)")

哪种类型的人捐的多一目了然。


文章有问题?点此查看未经处理的缓存