过去50年在美国发生的枪击案——R语言数据分析
文章来源知乎专栏:李小二姜的R语言可视化和数据挖掘专栏
拉斯维加斯以赌城、度假村、购物和娱乐中心闻名世界。但一年前的10月1日,这里还成为了美国由一个人造成的最大的枪击事件的发生地。64岁的斯蒂芬•帕多克从一个酒店的高层房间的窗户向一个户外音乐会发射了1000多发步枪子弹,造成大约60死亡,527人受伤。他在警察逼近时自杀。所以今天分享一篇关于美国枪击事件的数据分析。
数据集
数据集来源于kaggle网站,该数据描述的是过去50年来在美国发生的枪击案的具体内容,由于2017年在拉斯维加斯发生的恶性枪击案事件,使得这次案件的死亡人数比2015和2016年死亡人数之和还要多,可谓是触目惊心。
S# :数据编号
Location:案件发生城市,州
Date:时间
Summary:案件总结
Fatalities:死亡人数
Injured:受伤人数
Total victims:受害者总人数
Mental Health Issues:精神状况
Race:种族
Gender:性别
Latitude:纬度
Longitude:经度
数据读取
setwd("E:/kaggle/us-mass-shootings-last-50-years")
rm(list = ls())
library(tidyverse)
library(stringr)
library(data.table)
library(maps)
library(lubridate)
library(leaflet)
shooting<-as.tibble(fread("Mass Shootings Dataset Ver 2.csv"))glimpse(shooting)我们使用version2版本的数据,该数据在原数据基础上添加了一些内容,我们先审视我们数据 的类型的情况:
Observations: 320Variables: 13$ S# <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,...$ Title <chr> "Las Vegas Strip mass shooting", "San Francisco UPS shooting", "Pennsylvania supermarket shooting", "Florida awning manufacturer s...$ Location <chr> "Las Vegas, NV", "San Francisco, CA", "Tunkhannock, PA", "Orlando, Florida", "Kirkersville, Ohio", "Fresno, California", "Fort Lau...$ Date <chr> "10/1/2017", "6/14/2017", "6/7/2017", "6/5/2017", "5/12/2017", "4/18/2017", "1/6/2017", "9/23/2016", "7/17/2016", "7/7/2016", "6/1...$ Summary <chr> "", "Jimmy Lam, 38, fatally shot three coworkers and wounded two others inside a UPS facility in San Francisco. Lam killed himself...$ Fatalities <int> 58, 3, 3, 5, 3, 3, 5, 5, 3, 5, 49, 0, 1, 0, 0, 1, 4, 1, 0, 0, 0, 0, 3, 2, 3, 2, 2, 4, 4, 1, 2, 0, 0, 2, 0, 2, 6, 5, 5, 0, 4, 1, 0,...$ Injured <int> 527, 2, 0, 0, 0, 0, 6, 0, 3, 11, 53, 4, 4, 6, 4, 4, 0, 4, 3, 4, 4, 5, 1, 2, 0, 2, 2, 0, 0, 4, 2, 4, 4, 2, 5, 2, 3, 3, 0, 4, 0, 6, ...$ Total victims <int> 585, 5, 3, 5, 3, 3, 11, 5, 6, 16, 102, 4, 5, 6, 4, 5, 4, 4, 4, 4, 4, 5, 3, 3, 3, 3, 3, 3, 3, 5, 4, 4, 4, 3, 5, 4, 9, 8, 5, 4, 4, 7...$ Mental Health Issues <chr> "Unclear", "Yes", "Unclear", "Unclear", "Yes", "Unclear", "Yes", "Yes", "Yes", "Unclear", "Unclear", "Unknown", "Unknown", "Unknow...$ Race <chr> "White", "Asian", "White", "", "White", "Black", "Latino", "", "Black", "Black", "Other", "Unknown", "Unknown", "Black American or...$ Gender <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "Unknown", "Unknown", "Male", "Male", "Unknown", "Male", "Male", "Unknown",...$ Latitude <dbl> 36.18127, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 38.74422, 38.84511, 36.76597, 39.29039, 41.87811, 29.78579, 34.16204, 33.77005, ...$ Longitude <dbl> -115.13413, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -90.30539, -76.87497, -78.92834, -76.61219, -87.62980, -95.82440, -86.47554, -...数据量其实并不大,但是存在很多缺失数据,后期会进行处理。下一步进行基本的数据清洗和整理工作,首要就是日期的处理,Race,Gender以及Mental Health Issues的特征统一化。
shooting<-shooting %>% select(1:13) %>% mutate(Date=mdy(Date),year=year(Date))shooting$Gender<-if_else(shooting$Gender=="M","Male",shooting$Gender)shooting$Race<-if_else(str_detect(shooting$Race,"Black American or African American"),"Black",shooting$Race)shooting$Race<-if_else(str_detect(shooting$Race,"White American or European American"),"White",shooting$Race)shooting$Race<-if_else(str_detect(shooting$Race,"Asian American"),"Asian",shooting$Race)shooting$Race<-if_else(str_detect(shooting$Race,"Some other race"),"Other",shooting$Race)shooting$Race<-if_else(str_detect(shooting$Race,"Native American or Alaska Native"),"Native American",shooting$Race)shooting$yearcut<-cut(shooting$year,breaks = 10)shooting$`Mental Health Issues`<-if_else(str_detect(shooting$`Mental Health Issues`,"Un"),"Unknown",shooting$`Mental Health Issues`)shooting$Race<-str_to_upper(shooting$Race)shooting$`Mental Health Issues`<-str_to_upper(shooting$`Mental Health Issues`)#将location分成city和state两个变量shooting$city<-sapply(shooting$Location,function(x){
return(unlist(str_split(x,","))[1] %>% str_trim())})shooting$state<-sapply(shooting$Location,function(x){
return(unlist(str_split(x,","))[2] %>% str_trim())})我们将Gender全部规范为“Male”,“Female”,“Unknown”,在Race问题上面,按照原数据的样例可以分成白种人(White),黑种人(Black),亚裔(Asian),拉丁裔(Latino),北美原著(Native American)以及其他。同时后期为了方便统计每个城市以及每个州发生枪击案的频率,我们通过Location变量处理成city以及state。
探索性数据分析
首先我们观察每一年的枪击死亡人数变化
shooting %>% group_by(year) %>% summarise(total=sum(`Total victims`)) %>% ggplot(aes(x=year,y=total))+
geom_bar(stat = "identity",fill="blue")+geom_text(aes(label=total),vjust=-0.2)+xlim(c(1969,2020))+geom_line(color="red")+ylab("Total victims every year")+
ggtitle("People died because of gun shoot every year")数据显示触目惊心,尤其是今年拉斯维加斯枪击案发生之后,2017的死亡人数上升特别明显。我们已知每个案件的经纬度,我们使用R数据包可以将发生地点可视化出来,这一段处理我借鉴了别人的kernel,我们分别使用死亡人数和死伤总人数作为不同的判别对象来通过地图可视化:
#发生枪击案的地点和死亡人数
shooting %>% select(`Total victims`,Fatalities,Longitude,Latitude,Summary) %>% na.omit() %>% leaflet() %>% addProviderTiles(providers$OpenStreetMap) %>%
fitBounds(-124, 30, -66, 43) %>% addCircles(color="#8A0707", lng = ~Longitude, lat = ~Latitude, weight = 1,
radius = ~sqrt(`Total victims`) * 20000, popup = ~Summary)
shooting %>% select(`Total victims`,Fatalities,Longitude,Latitude,Summary) %>%na.omit()%>%leaflet() %>% addProviderTiles(providers$OpenStreetMap) %>%
fitBounds(-124, 30, -66, 43) %>% addCircles(color="blue", lng = ~Longitude, lat = ~Latitude, weight = 1,
radius = ~sqrt(Fatalities) * 40000, popup = ~Summary)图形结果显而易见,全美个个地方都会枪击案的发生,但是西部地区明显比东部地区发生的枪击案数量要少,空白区域属于美国落基山脉地区,地广人稀所以没有枪击案发生也在情理之中。接下来我们去观察枪手的特征分布。
#杀人和性别的关系
shooting %>% ggplot(aes(x=factor(Gender),fill=factor(Gender)))+geom_bar()+labs(x="Gender",y="Number of each gender",title="The distribution of gender",fill="Gender")
#人种和发生案件的频率以及死亡人数分布
shooting %>% na.omit() %>% group_by(Race) %>% summarise(num=sum(`Total victims`)) %>% ggplot(aes(x=factor(Race),y=num,fill=factor(Race)))+geom_bar(stat = "identity")+
coord_polar(theta = "y")+labs(x="Race",y="Number of killed people",fill="Race")+ggtitle("People killed by different races")
#不同性别的杀人分布
shooting %>% na.omit() %>% ggplot(aes(x=factor(Gender),y=`Total victims`,fill=factor(Gender)))+geom_boxplot()+ylim(c(0,90))
shooting %>% mutate(month=month(Date)) %>% group_by(month) %>% summarise(n=sum(`Total victims`)) %>% ggplot(aes(x=factor(month),y=n))+geom_bar(stat = "identity")+
labs(x="Month",title="The distribution of killed people every month",y="Number of killed people")+geom_text(aes(label=n),vjust=-0.2,color="red")+theme_bw()
#精神疾病和死亡人数之间,枪击案发生的关系
shooting %>% na.omit() %>% ggplot(aes(x=`Mental Health Issues`)) + geom_bar()+scale_x_discrete(limits=c("NO","YES"))+theme_bw()
shooting %>% na.omit() %>% group_by(`Mental Health Issues`) %>% summarise(n=sum(`Total victims`)) %>% ggplot(aes(x=factor(`Mental Health Issues`),y=n,group=1))+
geom_bar(stat = "identity",fill="pink")+scale_x_discrete(limits=c("NO","YES"))+geom_text(aes(label=n),vjust=-0.2)+geom_line(color="red")
#不同时间段枪击案的人种变化
shooting %>% na.omit() %>% group_by(yearcut) %>% ggplot(aes(yearcut,fill=Race))+geom_bar(position = "dodge")基本上绝大部分枪手都是男性。
排除unknown的情况,黑人和白人看来任然是枪击案的主角。
十月份是杀人高峰期??大家在美国的十月份尽量就不要出门了,我去把这个消息告诉我在美国读书的同学去,哈哈哈哈~~
看来杀人犯是不是患有精神疾病看来并不是有很大的差别,说明枪击案的发生也可能是我们常常说的普通人和我们没啥两样。
但是但是但是,上面这幅图告诉我们,精神病人枪杀人比非患病人要多一倍。
上面这幅图告诉我们,以前枪手都是白人,现在黑人的数量急速增长,这对美国治安是个考验,这也不算种族歧视,这是赤裸裸的事实。
后面我们要知道枪手的年龄分布,以及不同年龄下种族差异和患有疾病的分布。
tem <- mutate(shooting,age=str_extract_all(shooting$Summary,pattern="(,\\s)\\d{2}(,)"),age2=str_extract_all(shooting$Summary,pattern="(a\\s)\\d{2}(-year)"))
tem$age<- str_sub(tem$age,3,4)
tem$age2<- str_sub(tem$age2,3,4)
te <- subset(tem,tem$age!="ar")
te2 <- subset(tem,tem$age2!="ar")
te <- rbind(te,te2)
for(i in 1:nrow(te))
if(te$age[i]=="ar") te$age[i] <- te$age2[i]
te <- arrange(te,age)
te<- te[-c(1:4),]#Remove the non-digital part
te <- arrange(te,`S#`)
te$age <- as.integer(te$age)
te3 <- te %>% select(`S#`,age) %>% mutate(agecut=cut(te$age,breaks = 10*(1:7))) #Segment the age
shoot_age <- left_join(te3,shooting)
#观察年龄分布
ggplot(shoot_age,aes(agecut))+geom_bar(fill="blue")+theme_bw()
ggplot(shoot_age,aes(agecut,fill=`Mental Health Issues`))+geom_bar()#每个城市和州发生枪击案的次数
shooting %>% group_by(city) %>% summarise(count=n()) %>% filter(city!="" & count>=2) %>% ggplot(aes(x=reorder(city,count),y=count))+
geom_bar(stat = "identity",fill="lightblue")+coord_flip()+labs(y="Number of gun-shot happended",x="City",title="The number of case happened in each city")
shooting %>% group_by(state) %>% summarise(count=n()) %>% filter(state!="" & count>=2) %>% ggplot(aes(x=reorder(state,count),y=count))+
geom_bar(stat = "identity",fill="lightblue")+coord_flip()+labs(y="Number of gun-shot happended",x="State",title="The number of case happened in each state")吓得我再也不敢想去加州了~~~
说在最后
这个kaggle数据集还是很具有分析价值的,这为我们了解整个美国社会的枪击案发生情况有了一个初步的影响。我在参考了其他的kernel之后,发现还有他们将枪击案分成小与5人,大于5人但是小与10人,大于10人的一种情况,答主鉴于还要在实验室搬生物信息的砖,就不再深入分析了,当然我还分析了其他的kaggle数据,这些都在我的GitHub主页上面,大家可以积极交流,相互学习。