R语言实现EDA + HeatMap
文章来源知乎专栏:李小二姜的R语言可视化和数据挖掘专栏
在Kaggle上面找到了Datasets区域暂时排行第一的数据集,看了一下是关于Google IO上面的捐赠项目的分析,最近一直在弄PHP,对于R语言有点生疏,在参考其他kernel上面的大佬的EDA之后,我也拿着个数据集来练习一下数据可视化和数据操作的熟练度。
Data Science For Good: DonorsChoose.orgwww.kaggle.com
数据集介绍
http://DonorsChoose.org是一个由美国历史老师在2000年建立的一个公益性组织,它主要是公立学校老师用来给他们的学生申请所需教学物资的平台,已经有全国三分之四的老师都在平台上面使用过,这个平台很好的支持了美国公立教育。
数据集一共包含6个数据集,每一个数据集在Data部分都有介绍,包含Donations,Donors,Projects,Resources,Schools和Teachers,不同数据集之间都有SQL主键进行连接,可以很方便的连表查询
加载数据包和数据集
library(plotly,quietly = T)
library(data.table,quietly = T)
library(tidyverse,quietly = T)
library(lubridate)
library(stringr,quietly = T)
library(treemapify)
library(tm)
library(magrittr)
library(tidytext)
Teacher<-fread("Teachers.csv")Donations<-fread("Donations.csv")Donors<-fread("Donors.csv")Projects<-fread("Projects.csv")Resources<-fread("Resources.csv")Schools<-fread("Schools.csv")基本都是很常用的data manipulation包以及绘图和文件IO的包,因为本身这几个数据集都很大,答主都是用哪个就加载哪个,没办法啊,rstudio把我16G的内存台式机都弄垮了。
数据操作和EDA
Donations部分包含的数据部分不多,主要是以下:
Observations: 4,687,884
Variables: 7
$ Project ID <chr> "000009891526c0ade7180f8423792063", "000009891526c0ade7180f8423792063", "00000...
$ Donation ID <chr> "688729120858666221208529ee3fc18e", "dcf1071da3aa3561f91ac689d1f73dee", "18a23...
$ Donor ID <chr> "1f4b5b6e68445c6c4a0509b3aca93f38", "4aaab6d244bf3599682239ed5591af8a", "0b076...
$ Donation Included Optional Donation <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Y...
$ Donation Amount <dbl> 178.37, 25.00, 20.00, 25.00, 25.00, 15.00, 50.00, 50.00, 200.00, 10.00, 100.00...
$ Donor Cart Sequence <int> 11, 2, 3, 1, 2, 1, 1, 2, 2, 44, 1, 1, 1, 1, 1, 2, 1, 3, 2, 9, 1, 5, 1, 1, 43, ...
$ Donation Received Date <chr> "2016-08-23 13:15:57", "2016-06-06 20:05:23", "2016-06-06 14:08:46", "2016-05-...检查里面Donation Amount的数据基本分布和基本的数据分布情况:
mean(Donations$`Donation Amount`);median(Donations$`Donation Amount`);max(Donations$`Donation Amount`);min(Donations$`Donation Amount`)
#平均值60.669,中位数值25,最大60000USD,最少0.01USD
p1<-Donations %>% filter(`Donation Amount`<=median(Donations$`Donation Amount`)) %>% ggplot(aes(x=`Donation Amount`))+geom_histogram(fill="red")+
labs(y="Total Counts",title="Amount distribution")+theme_bw()
p2<-Donations %>% ggplot(aes(x=`Donation Included Optional Donation`))+geom_bar(fill=c("red","blue"))+coord_polar(theta = "x")+
theme(axis.text.y = element_blank(),axis.title.x = element_blank(),axis.title.y = element_blank())+
labs(title="Donation Included Optional Donation")
p3<-Donors %>% filter(!is.na(`Donor Is Teacher`)) %>% group_by(`Donor Is Teacher`) %>% summarise(Total = n()) %>%
arrange(desc(Total)) %>% ungroup() %>% ggplot(aes(x=`Donor Is Teacher`))+geom_bar(fill=c("red","blue"))+coord_polar(theta = "x")+
theme(axis.text.y = element_blank(),axis.title.x = element_blank(),axis.title.y = element_blank())+
labs(title="Donor Is Teacher")
grid.arrange(p1,p2,p3,ncol=1)再主要看一下接收时间的一个分布情况:
Donations<-Donations %>% rename(DRD=`Donation Received Date`) %>% mutate(DRD=ymd_hms(DRD))
Donations %>% mutate(yyy=year(DRD)) %>% ggplot(aes(x=yyy,fill="#b388ff"))+
geom_density(aes(y=..scaled..))+labs(y="",x="",title="Donation Received Date Count")+ggthemes::theme_pander()+
theme(legend.position = "none")很直观的看出来2017年是接收最多的一年,可能2018年的数据还是不是很完整,所以2018的数据下降不少。
接下来是Donors的数据集:
> glimpse(Donors)
Observations: 2,122,640
Variables: 5
$ Donor ID <chr> "00000ce845c00cbf0686c992fc369df4", "00002783bc5d108510f3f9666c8b1edd", "00002d44003ed46b066607c5...
$ Donor City <chr> "Evanston", "Appomattox", "Winton", "Indianapolis", "Paterson", "", "Stamford", "Green Bay", "Arg...
$ Donor State <chr> "Illinois", "other", "California", "Indiana", "New Jersey", "other", "Connecticut", "Wisconsin", ...
$ Donor Is Teacher <chr> "No", "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "No", "No",...
$ Donor Zip <chr> "602", "245", "953", "462", "75", "", "69", "543", "128", "463", "190", "87", "198", "775", "105"...除了最后一个Donor Zip的字段我没有用之外,其他的字段都用到了。
p1<-Donors %>% filter(`Donor City`!='' & !is.na(`Donor City`)) %>% group_by(`Donor City`) %>% summarise(total=n()) %>% arrange(desc(total)) %>%
top_n(15,wt=total) %>% ggplot(aes(x=`Donor City`,y=total,fill=as.factor(`Donor City`)))+geom_bar(stat="identity")+coord_flip()+
labs(title="Donors City Frequency",fill="",y="",x="")+theme(legend.position = "none")
p2<-Donors %>% filter(`Donor State`!='' & !is.na(`Donor State`)) %>% group_by(`Donor State`) %>% summarise(total=n()) %>% arrange(desc(total)) %>%
top_n(15,wt=total) %>% mutate(`Donor State`=reorder(`Donor State`,total)) %>% ggplot(aes(x=`Donor State`,y=total,fill=as.factor(`Donor State`)))+geom_bar(stat="identity")+coord_flip()+
labs(title="Donor State Frequency",fill="",y="",x="")+theme(legend.position = "none")
grid.arrange(p1,p2)可以看书芝加哥是捐助最多的城市,加州是捐助最多的州。
接下来是最重要的Projects数据集,包含的信息最多,所以我们也分析的最多。
Observations: 1,110,017
Variables: 18
$ Project ID <chr> "7685f0265a19d7b52a470ee4bac883ba", "f9f4af7099061fb4bf44642a03e5c331", "afd99a01...
$ School ID <chr> "e180c7424cb9c68cb49f141b092a988f", "08b20f1e2125103ed7aa17e8d76c71d4", "1287f512...
$ Teacher ID <chr> "4ee5200e89d9e2998ec8baad8a3c5968", "cca2d1d277fb4adb50147b49cdc3b156", "6c5bd0d4...
$ Teacher Project Posted Sequence <int> 25, 3, 1, 40, 2, 4, 3, 57, 14, 1, 19, 1, 1, 2, 4, 1, 5, 8, 3, 1, 146, 1, 35, 2, 1...
$ Project Type <chr> "Teacher-Led", "Teacher-Led", "Teacher-Led", "Teacher-Led", "Teacher-Led", "Teach...
$ Project Title <chr> "Stand Up to Bullying: Together We Can!", "Learning in Color!", "Help Second Grad...
$ Project Essay <chr> "Did you know that 1-7 students in grades K-12 is either a bully or a victim of b...
$ Project Short Description <chr> "Did you know that 1-7 students in grades K-12 is either a bully or a victim of b...
$ Project Need Statement <chr> "My students need 25 copies of \"\"Bullying in Schools\"\" for each to keep, \"\"...
$ Project Subject Category Tree <chr> "Applied Learning", "Applied Learning, Literacy & Language", "Literacy & Language...
$ Project Subject Subcategory Tree <chr> "Character Education, Early Development", "Early Development, Literacy", "ESL", "...
$ Project Grade Level Category <chr> "Grades PreK-2", "Grades PreK-2", "Grades PreK-2", "Grades 3-5", "Grades 3-5", "G...
$ Project Resource Category <chr> "Technology", "Technology", "Supplies", "Books", "Supplies", "Other", "Trips", "B...
$ Project Cost <dbl> 361.80, 512.85, 435.92, 161.26, 264.19, 175.15, 3020.59, 566.19, 339.20, 566.73, ...
$ Project Posted Date <chr> "2013-01-01", "2013-01-01", "2013-01-01", "2013-01-01", "2013-01-01", "2013-01-01...
$ Project Expiration Date <chr> "2013-05-30", "2013-05-31", "2013-05-30", "2013-05-31", "2013-05-30", "2013-05-31...
$ Project Current Status <chr> "Fully Funded", "Expired", "Fully Funded", "Fully Funded", "Fully Funded", "Fully...
$ Project Fully Funded Date <chr> "2013-01-11", "", "2013-05-22", "2013-02-06", "2013-01-01", "2013-02-01", "2013-0...信息量很多,我们一个一个来:
Projects %>% filter(`Project Current Status`!='' && !is.na(`Project Current Status`)) %>% group_by(`Project Current Status`) %>%
summarise(total=n()) %>% mutate(`Project Current Status`=as.factor(`Project Current Status`)) %>%
plot_ly(labels=~`Project Current Status`,values=~total,type="pie",textposition = 'inside',textinfo = 'label+percent') %>%
layout(title="Project Current Status Calculation")
Projects %>% filter(`Project Resource Category`!='' && !is.na(`Project Resource Category`)) %>% group_by(`Project Resource Category`) %>%
summarise(total=n()) %>% mutate(`Project Resource Category`=as.factor(`Project Resource Category`)) %>%
plot_ly(labels=~`Project Resource Category`,values=~total,type="pie",textposition = 'inside',textinfo = 'label+percent') %>%
layout(title="Project Resource Category Calculation")
p3<-Projects %>% filter(`Project Type`!='' && !is.na(`Project Type`)) %>% group_by(`Project Type`) %>%
summarise(total=n()) %>% mutate(`Project Type`=as.factor(`Project Type`)) %>% ggplot(aes(x=`Project Type`,y=total,fill=as.factor(`Project Type`)))+
geom_bar(stat = "identity")+labs(title="Project Type",y="",x="",fill="")+theme(legend.position = "none")+coord_flip()
p4<-Projects %>% filter(`Project Resource Category`!='' && !is.na(`Project Resource Category`)) %>% group_by(`Project Resource Category`) %>%
summarise(total=n()) %>% mutate(`Project Resource Category`=as.factor(`Project Resource Category`)) %>% ggplot(aes(x=`Project Resource Category`,y=total,fill=as.factor(`Project Resource Category`)))+
geom_bar(stat = "identity")+labs(title="Project Resource Category",y="",x="",fill="")+theme(legend.position = "none",axis.text.x = element_blank())+coord_flip()
grid.arrange(p3,p4,nrow=2)p1<-Projects %>% filter(`Project Subject Category Tree`!='' && !is.na(`Project Subject Category Tree`)) %>% group_by(`Project Subject Category Tree`) %>%
summarise(total=n()) %>% filter(total>20000) %>% ggplot(aes(area=total,fill=total,label=as.factor(`Project Subject Category Tree`)))+
geom_treemap()+geom_treemap_text(fontface = "italic", colour = "red", place = "centre",grow = TRUE,alpha=1)+
scale_fill_continuous(low="lightblue",high = "darkblue")+ggtitle("Project Subject Category Tree")
p2<-Projects %>% filter(`Project Subject Subcategory Tree`!='' && !is.na(`Project Subject Subcategory Tree`)) %>% group_by(`Project Subject Subcategory Tree`) %>%
summarise(total=n()) %>% filter(total>20000) %>% ggplot(aes(area=total,fill=total,label=as.factor(`Project Subject Subcategory Tree`)))+
geom_treemap()+geom_treemap_text(fontface = "italic", colour = "blue", place = "centre",grow = TRUE,alpha=1)+
scale_fill_continuous(low="white",high = "red")+ggtitle("Project Subject Subcategory Tree")
grid.arrange(p1,p2,nrow=1)根据树状图可以找出哪一些项目类型是最多的,也是新学的一个图样。
Projects %>% sample_n(200) %>% unnest_tokens(word,`Project Essay`) %>% filter(!(word %in% stop_words$word)) %>%
count(word,sort = TRUE) %>% ungroup() %>% top_n(200,wt=word) %>% filter(str_detect(word,letters)) %>%ggplot(aes(word, n)) +
geom_col(fill = "#ffc107") + labs(x = NULL, y = "tf-idf") + coord_flip() + theme_bw()+ggtitle("Words in Project Essay")
Projects %>% unnest_tokens(word,`Project Title`) %>% filter(!(word %in% stop_words$word)) %>%
count(word,sort = TRUE) %>% ungroup() %>% head(100) %>%
with(wordcloud::wordcloud(word, n, max.words = 30,colors=RColorBrewer::brewer.pal(8, "Dark2")))这两段代码是从essay和title里面提取词,来寻找那个是essay里面的关键文字
p1<-Projects %>% group_by(`Project Posted Date`) %>% summarise(total=n()) %>% mutate(`Project Posted Date`=as.POSIXct(`Project Posted Date`)) %>% ggplot(aes(x=`Project Posted Date`,y=total,group=1))+
geom_path(color="blue")+scale_x_datetime(limits = as.POSIXct(c("2013-01-01","2018-05-01")))+labs(y="",title="Project Posted Date")
p2<-Projects %>% group_by(`Project Expiration Date`) %>% summarise(total=n()) %>% filter(`Project Expiration Date` != "") %>% mutate(`Project Expiration Date`=as.POSIXct(`Project Expiration Date`)) %>% ggplot(aes(x=`Project Expiration Date`,y=total,group=1))+
geom_path(color="red")+scale_x_datetime(limits = as.POSIXct(c("2013-05-01","2018-12-31")))+labs(y="",title="Project Expiration Date")
p3<-Projects %>% group_by(`Project Fully Funded Date`) %>% summarise(total=n()) %>% filter(`Project Fully Funded Date` != "") %>% mutate(`Project Fully Funded Date`=as.POSIXct(`Project Fully Funded Date`)) %>% ggplot(aes(x=`Project Fully Funded Date`,y=total,group=1))+
geom_path(color="green")+scale_x_datetime(limits = as.POSIXct(c("2013-05-01","2018-12-31")))+labs(y="",title="Project Fully Funded Date")
gridExtra::grid.arrange(p1,p2,p3,nrow=3)还有一部分projects的数据是和后面的数据进行连表查询之后的结果,后面会给大家一起放上来。
接下来是School和Resource部分:
> glimpse(Schools)
Observations: 72,993
Variables: 9
$ School ID <chr> "00003e0fdd601b8ea0a6eb44057b9c5e", "00004e32a448b4832e1b993500bf0731", "0002021bb799...
$ School Name <chr> "Capon Bridge Middle School", "The Woodlands College Park High School", "Samantha Smi...
$ School Metro Type <chr> "rural", "urban", "suburban", "unknown", "rural", "unknown", "urban", "unknown", "urb...
$ School Percentage Free Lunch <int> 56, 41, 2, 76, 50, 63, 17, 15, 46, 29, 26, 31, 47, 70, 61, 82, 34, 49, 95, 86, 45, 32...
$ School State <chr> "West Virginia", "Texas", "Washington", "Michigan", "Texas", "Arizona", "New York", "...
$ School Zip <int> 26711, 77384, 98074, 48370, 75573, 85706, 10029, 29045, 95122, 60025, 76262, 20659, 2...
$ School City <chr> "Capon Bridge", "The Woodlands", "Sammamish", "Oxford", "Redwater", "Tucson", "New Yo...
$ School County <chr> "Hampshire", "Montgomery", "King", "Oakland", "Bowie", "Pima", "New York (Manhattan)"...
$ School District <chr> "Hampshire Co School District", "Conroe Ind School District", "Lake Washington Sch Di...
> glimpse(Resources)
Observations: 7,210,448
Variables: 5
$ Project ID <chr> "000009891526c0ade7180f8423792063", "00000ce845c00cbf0686c992fc369df4", "00002d44003ed46b0666...
$ Resource Item Name <chr> "chair move and store cart", "sony mdr zx100 blk headphones", "gaiam kids stay-n-play balan...
$ Resource Quantity <dbl> 1, 40, 4, 1, 1, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ Resource Unit Price <dbl> 350.00, 12.86, 19.00, 269.00, 131.85, 33.88, 172.11, 6.43, 24.20, 5.90, 3.46, 5.23, 7.87, 5.4...
$ Resource Vendor Name <chr> "", "CDW-G", "Amazon Business", "Lakeshore Learning Materials", "Amazon Business", "Amazon Bu...首先观察school里面的所有变量:
p1<-Schools %>% filter(`School Name` != "" && !is.na(`School Name`)) %>% group_by(`School Name`) %>% summarise(total=n()) %>%
arrange(desc(total)) %>% top_n(20,wt=total) %>% ggplot(aes(x=`School Name`,y=total,fill=as.factor(`School Name`)))+
geom_bar(stat = "identity")+labs(y="",fill="",x="",title="School Name")+coord_flip()+theme(legend.position = "none")
p2<-Schools %>% filter(`School City` != "" && !is.na(`School City`)) %>% group_by(`School City`) %>% summarise(total=n()) %>%
arrange(desc(total)) %>% top_n(20,wt=total) %>% ggplot(aes(x=`School City`,y=total,fill=as.factor(`School City`)))+
geom_bar(stat = "identity")+labs(y="",fill="",x="",title="School City")+coord_flip()+theme(legend.position = "none")
p3<-Schools %>% filter(`School State` != "" && !is.na(`School State`)) %>% group_by(`School State`) %>% summarise(total=n()) %>%
arrange(desc(total)) %>% top_n(20,wt=total) %>% ggplot(aes(x=reorder(`School State`,total),y=total,fill=as.factor(`School State`)))+
geom_bar(stat = "identity")+labs(y="",fill="",x="",title="School State")+coord_flip()+ggthemes::theme_economist()+
theme(legend.position = "none")
p4<-Schools %>% filter(`School District` != "" && !is.na(`School District`)) %>% group_by(`School District`) %>% summarise(total=n()) %>%
arrange(desc(total)) %>% top_n(20,wt=total) %>% ggplot(aes(x=reorder(`School District`,total),y=total,fill=as.factor(`School District`)))+
geom_bar(stat = "identity")+labs(y="",fill="",x="",title="School District")+coord_flip()+ggthemes::theme_tufte()+
theme(legend.position = "none")
gridExtra::grid.arrange(p1,p2,p3,p4,nrow=2,ncol=2)p1<-Schools %>%
rename(FreeLunch = `School Percentage Free Lunch`) %>%
ggplot(aes(x = FreeLunch)) +
geom_histogram(bins = 30,fill = "orange") +
labs(x= 'School Percentage Free Lunch',y = 'Count', title = paste("Distribution of", ' School Percentage Free Lunch ')) +
theme_bw()
range(Resources$`Resource Unit Price` %>% na.omit())
p2<-Resources %>% rename(uprice=`Resource Unit Price`) %>% na.omit() %>% filter(uprice<=1000) %>%ggplot(aes(x=uprice)) + geom_density(fill="#558b2f")+
labs(x="Resource Unit Price",title="Distribution of Resource Unit Price",y="")+
ggplot2::annotate("text",x=800,y=0.04,label="range of price 0~97085.5")+theme_bw()
Projects$duration <- as.numeric(as.Date(Projects$`Project Fully Funded Date`)-as.Date(Projects$`Project Posted Date`))
p3<-Projects %>% na.omit() %>% ggplot(aes(duration))+geom_density(fill="#0277bd")+labs(y="",title="Duration of Time duration")这里在projects里面构建了一个新变量,就是项目开始和接收的时间的区间时长,看的出来大部分都会在很短的时间里面就完成。
我们在resource里面找到了`Resource Item Name`这个变量,是一些物件的名称,我们提取字来观察那些是最多的,但是由于数据集多大,我随即挑选了5000个样本来观察,估计很难反应全体情况,只是作为一个参考。
Resources %>% sample_n(5e3) %>% rename(rin=`Resource Item Name`) %>% filter(rin != '' && !is.na(rin)) %>%
unnest_tokens(word,rin) %>% filter(!(word %in% stop_words$word)) %>% count(word,sort=T) %>% top_n(30,wt=n) %>%
filter(str_length(word) >= 2 & !(word %in% as.character(c(1:1000)))) %>% ggplot(aes(x=word,y=n,fill=as.factor(word)))+geom_bar(stat = "identity")+
coord_polar(theta = "x")+labs(title="Resource Item Name Frequency",y="",x="")+ggthemes::theme_fivethirtyeight()+
theme(legend.position = "none",axis.text.y = element_blank())ipad和books???为毛不直接用ibooks呢???
后面我们联合几个数据集进行联合查询,找寻不同变量之间和捐款金额的差异热图。
teacher_amount<-inner_join(Donations,Donors,by="Donor ID")
teacher_amount %>% group_by(`Donor Is Teacher`,`Donation Included Optional Donation`) %>% summarise(total=n()) %>%
ggplot(aes(x=`Donor Is Teacher`,y=`Donation Included Optional Donation`,fill=total))+geom_tile()+
scale_fill_gradient2(low="blue",high = "red")+theme_bw()+labs(title="Relation between DIT and DIOD",fill="Counts")
prefix_amount<-inner_join(Donations,Projects,by="Project ID") %>% inner_join(Teacher,by="Teacher ID")
prefix_amount %<>% select(`Teacher Prefix`,`Donation Amount`,`Donation Included Optional Donation`) %>% filter(`Teacher Prefix` != "")
prefix_amount %>% group_by(`Teacher Prefix`,`Donation Included Optional Donation`) %>% summarise(total=mean(`Donation Amount`)) %>%
ggplot(aes(x=`Teacher Prefix`,y=`Donation Included Optional Donation`,fill=total))+geom_tile()+
scale_fill_continuous(low = "#84ffff",high = "#ff5722")+theme_bw()+labs(title="Relation between TP and DIOD",fill="Donation Amount(mean)")哪种类型的人捐的多一目了然。