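R in the Left Hand, Python in the Right (9): Merging and Splitting Strings
In text processing and data cleaning, splitting, extracting, or merging strings and character variables may not be the most frequent task, but it is often an important one.
Below is a brief overview of the functions commonly used to split and merge strings in R and Python.
R:
For character vectors:
strsplit #splits character vectors
str_split #splits character vectors; a function from the stringr package
paste #merges vectors
For data frames:
unite #merges several columns of a data frame
separate #splits one column of a data frame into several columns according to a pattern
R: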
library(dplyr)
library(stringr)
library(tidyr)
myyear<-sprintf("20%02d",sample(0:17,10))
mymonth<-sprintf("%02d",sample(0:12,10))
myday<-sprintf("%02d",sample(0:31,10))
myyear;mymonth;myday
[1] "2000" "2010" "2002" "2012" "2015" "2006" "2001" "2017" "2005" "2013"
[1] "10" "03" "01" "09" "04" "02" "05" "07" "00" "12"
[1] "18" "15" "28" "00" "11" "20" "31" "19" "04" "12"
First, merge with the paste function:
full<-paste(myyear,mymonth,myday,sep = "-");full #when the vectors have equal length, elements are merged pairwise
[1] "2000-10-18" "2010-03-15" "2002-01-28" "2012-09-00" "2015-04-11" "2006-02-20" "2001-05-31" "2017-07-19" "2005-00-04" "2013-12-12"
Split with the strsplit function:
myyear1=mymonth1=myday1=NULL
for( i in 1:length(full)){
myyear1[i]<-strsplit(full[i],"-")[[1]][1]
mymonth1[i]<-strsplit(full[i],"-")[[1]][2]
myday1[i]<-strsplit(full[i],"-")[[1]][3]
}
myyear1;mymonth1;myday1
[1] "2000" "2010" "2002" "2012" "2015" "2006" "2001" "2017" "2005" "2013"
[1] "10" "03" "01" "09" "04" "02" "05" "07" "00" "12"
[1] "18" "15" "28" "00" "11" "20" "31" "19" "04" "12"
The str_split function is used in much the same way as strsplit:
myyear1=mymonth1=myday1=NULL
for( i in 1:length(full)){
myyear1[i]<-str_split(full[i],"-")[[1]][1]
mymonth1[i]<-str_split(full[i],"-")[[1]][2]
myday1[i]<-str_split(full[i],"-")[[1]][3]
}
myyear1;mymonth1;myday1
[1] "2000" "2010" "2002" "2012" "2015" "2006" "2001" "2017" "2005" "2013"
[1] "10" "03" "01" "09" "04" "02" "05" "07" "00" "12"
[1] "18" "15" "28" "00" "11" "20" "31" "19" "04" "12"
Next, how to merge and split columns directly on a data frame:
mydata<-data.frame(myyear,mymonth,myday);mydata
myyear mymonth myday
1 2000 10 18
2 2010 03 15
3 2002 01 28
4 2012 09 00
5 2015 04 11
6 2006 02 20
7 2001 05 31
8 2017 07 19
9 2005 00 04
10 2013 12 12
unite (data,col, ..., sep = "-", remove = TRUE)
separate(data,col, into,sep="-", remove = TRUE)
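unite and separate are paired functions whose arguments mirror each other. The first argument is the data frame to operate on. The second is the name of the new merged column (for separate, the column to be split). The third is the vector of column names to merge (for separate, the names of the new columns produced by the split). sep is the separator used for merging (or splitting), and remove controls whether the original columns are dropped from the output data frame: the source columns in the case of unite, the split column in the case of separate.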
mydata1<-unite(mydata,col="datetime",c("myyear","mymonth","myday"),sep="-",remove=FALSE);mydata1
datetime myyear mymonth myday
1 2000-10-18 2000 10 18
2 2010-03-15 2010 03 15
3 2002-01-28 2002 01 28
4 2012-09-00 2012 09 00
5 2015-04-11 2015 04 11
6 2006-02-20 2006 02 20
7 2001-05-31 2001 05 31
8 2017-07-19 2017 07 19
9 2005-00-04 2005 00 04
10 2013-12-12 2013 12 12
mydata2<-unite(mydata1,col="datetime1",c("myyear","mymonth","myday"),sep="-",remove=FALSE);mydata2
datetime datetime1 myyear mymonth myday
1 2000-10-18 2000-10-18 2000 10 18
2 2010-03-15 2010-03-15 2010 03 15
3 2002-01-28 2002-01-28 2002 01 28
4 2012-09-00 2012-09-00 2012 09 00
5 2015-04-11 2015-04-11 2015 04 11
6 2006-02-20 2006-02-20 2006 02 20
7 2001-05-31 2001-05-31 2001 05 31
8 2017-07-19 2017-07-19 2017 07 19
9 2005-00-04 2005-00-04 2005 00 04
10 2013-12-12 2013-12-12 2013 12 12
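Merging and splitting strings in Python:
My command of Python string handling is limited, and Python string operations are extremely flexible (comprehensions and anonymous functions handle most tasks conveniently), so I only show the approaches I use most often as examples; this is not an exhaustive list:
String merging:
String concatenation operator: "+"
String merging function: join
String splitting: split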
import random
import pandas as pd
myyear=random.sample(range(2000,2018),10);myyear  # range excludes its upper bound, so this covers 2000-2017
mymonth=['%02d' % i for i in random.sample(range(1,13),10)];mymonth  # months 01-12
myday=['%02d' % i for i in random.sample(range(1,32),10)];myday  # days 01-31
[2006, 2000, 2007, 2001, 2015, 2016, 2002, 2012, 2010, 2004]
['04', '11', '06', '10', '07', '08', '05', '02', '03', '01']
['13', '28', '21', '06', '08', '03', '17', '16', '04', '20']
String merging:
mydate=[str(i)+"-"+j+"-"+k for i,j,k in zip(myyear,mymonth,myday)]
['2011-04-25', '2008-11-30', '2003-06-02', '2007-10-22', '2009-07-13', '2005-08-27', '2014-05-28', '2012-02-10', '2016-03-14', '2015-01-21']
mydate=["-".join([str(i),j,k]) for i,j,k in zip(myyear,mymonth,myday)]
['2011-04-25', '2008-11-30', '2003-06-02', '2007-10-22', '2009-07-13', '2005-08-27', '2014-05-28', '2012-02-10', '2016-03-14', '2015-01-21']
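Both versions above call str() on the year by hand because join and "+" only accept strings. As a minimal sketch (the helper name merge_parts is hypothetical, and the sample values are simply the first three rows of the output above), the conversion can be folded into one reusable function:

```python
# Sample data: first three rows of the output shown above.
years = [2011, 2008, 2003]       # integers, as produced by random.sample
months = ["04", "11", "06"]
days = ["25", "30", "02"]

def merge_parts(*columns, sep="-"):
    """Join the i-th element of every column into one string per row.

    merge_parts is a hypothetical helper name used for illustration only.
    """
    # zip(*columns) pairs up elements row by row; map(str, row) converts
    # non-string items (the integer years) so str.join does not raise TypeError.
    return [sep.join(map(str, row)) for row in zip(*columns)]

dates = merge_parts(years, months, days)
print(dates)  # ['2011-04-25', '2008-11-30', '2003-06-02']
```

Like paste in R, this pairs elements by position. Unlike paste, zip stops at the shortest input instead of recycling shorter vectors.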
String splitting:
Method 1 (list comprehension):
myyear1=[i.split("-")[0] for i in mydate];myyear1
mymonth1=[i.split("-")[1] for i in mydate];mymonth1
myday1=[i.split("-")[2] for i in mydate];myday1
['2011', '2008', '2003', '2007', '2009', '2005', '2014', '2012', '2016', '2015']
['04', '11', '06', '10', '07', '08', '05', '02', '03', '01']
['25', '30', '02', '22', '13', '27', '28', '10', '14', '21']
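The three comprehensions above split every string three times. A possible alternative, shown here only as a sketch with three sample values taken from the output above, is to split each string once and then transpose the result with zip:

```python
# Sample data: first three dates from the output shown above.
dates = ["2011-04-25", "2008-11-30", "2003-06-02"]

# Split each string exactly once; parts is a list of [year, month, day] lists.
parts = [d.split("-") for d in dates]

# zip(*parts) transposes rows into columns. This assumes every string
# contains the same number of fields; zip would silently truncate otherwise.
years, months, days = (list(col) for col in zip(*parts))

print(years)   # ['2011', '2008', '2003']
print(months)  # ['04', '11', '06']
print(days)    # ['25', '30', '02']
```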
Method 2 (build a pandas DataFrame from a dictionary, then use str.split):
mydata=pd.DataFrame({"date":mydate})
mydata["date"].str.split("-",expand=True)
0 1 2
0 2011 04 25
1 2008 11 30
2 2003 06 02
3 2007 10 22
4 2009 07 13
5 2005 08 27
6 2014 05 28
7 2012 02 10
8 2016 03 14
9 2015 01 21
myyear2=mydata["date"].str.split("-",expand=True)[0];print(myyear2)
mymonth2=mydata["date"].str.split("-",expand=True)[1];print(mymonth2)
myday2=mydata["date"].str.split("-",expand=True)[2];print(myday2)
0 2011
1 2008
2 2003
3 2007
4 2009
5 2005
6 2014
7 2012
8 2016
9 2015
Name: 0, dtype: object
0 04
1 11
2 06
3 10
4 07
5 08
6 05
7 02
8 03
9 01
Name: 1, dtype: object
0 25
1 30
2 02
3 22
4 13
5 27
6 28
7 10
8 14
9 21
Name: 2, dtype: object
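For comparison with tidyr's separate and unite, the following is a rough pandas sketch of the same two steps on a data frame. The column names year, month, day, and date_again are assumptions chosen for illustration, and the data is limited to three sample dates from the output above:

```python
import pandas as pd

# Sample data: first three dates from the output shown above.
df = pd.DataFrame({"date": ["2011-04-25", "2008-11-30", "2003-06-02"]})

# Rough counterpart of separate(): split once, name the resulting columns,
# and attach them to the original frame (the original column is kept).
parts = df["date"].str.split("-", expand=True)
parts.columns = ["year", "month", "day"]   # illustrative column names
df = df.join(parts)

# Rough counterpart of unite(): concatenate string columns row by row.
df["date_again"] = df["year"].str.cat([df["month"], df["day"]], sep="-")

print(df)
print((df["date_again"] == df["date"]).all())  # True
```

Splitting once and reusing parts also avoids calling str.split three times, as the myyear2, mymonth2, and myday2 lines above do.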
Summary: splitting and merging strings
R:
Splitting:
strsplit
str_split
tidyr::separate
Merging:
paste
tidyr::unite
Python:
Splitting:
.split
Merging:
"+"
join