
What Characterizes Data Scientists Who Change Jobs? (Part 1)

云朵君 数据STUDIO 2022-04-28

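An earlier article, 数据分析之探索性数据分析 (on exploratory data analysis), explains in detail what EDA is and covers some common analysis methods; interested readers can refer to it.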

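This article analyzes a dataset on data scientists changing jobs, applying common EDA methods to examine each feature and its relationship with the target variable. seaborn is used for the visualizations that help show what characterizes data scientists who change jobs.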

Because of its length, this exploratory data analysis case study is split into two parts.

Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False 

Loading and inspecting the data

path_train = '../input/hr-analytics-job-change-of-data-scientists/aug_train.csv'
path_test = '../input/hr-analytics-job-change-of-data-scientists/aug_test.csv'
path_submission = '../input/hr-analytics-job-change-of-data-scientists/sample_submission.csv'
train = pd.read_csv(path_train)
test = pd.read_csv(path_test)
train.head()

test.sample(5)
# Returns 5 randomly selected rows.
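Because sample draws rows at random, the rows shown change on every run. A minimal sketch of making the draw repeatable follows; the toy DataFrame and the seed value 42 are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the test set (values are made up).
toy = pd.DataFrame({
    "enrollee_id": range(10),
    "training_hours": range(10, 110, 10),
})

# Passing the same random_state yields the same rows on every call.
first = toy.sample(5, random_state=42)
second = toy.sample(5, random_state=42)

print(first.equals(second))
```

On the real data the same idea would read test.sample(5, random_state=42).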

Exploratory data analysis

General overview

train.shape
(19158, 14)
test.shape
(2129, 13)
train.dtypes
enrollee_id int64
city object
city_development_index float64
gender object
relevent_experience object
enrolled_university object
education_level object
major_discipline object
experience object
company_size object
company_type object
last_new_job object
training_hours int64
target float64
dtype: object
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2129 entries, 0 to 2128
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 enrollee_id 2129 non-null int64
1 city 2129 non-null object
2 city_development_index 2129 non-null float64
3 gender 1621 non-null object
4 relevent_experience 2129 non-null object
5 enrolled_university 2098 non-null object
6 education_level 2077 non-null object
7 major_discipline 1817 non-null object
8 experience 2124 non-null object
9 company_size 1507 non-null object
10 company_type 1495 non-null object
11 last_new_job 2089 non-null object
12 training_hours 2129 non-null int64
dtypes: float64(1), int64(2), object(10)
memory usage: 216.4+ KB
list(train.columns)
['enrollee_id',
'city',
'city_development_index',
'gender',
'relevent_experience',
'enrolled_university',
'education_level',
'major_discipline',
'experience',
'company_size',
'company_type',
'last_new_job',
'training_hours',
'target']
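The shapes above differ by one column (14 versus 13). A short sketch of confirming which column the test set lacks is given below; the two column lists are copied from the train.columns and test.info() output above, and the variable names are illustrative.

```python
import pandas as pd

# Column names copied from the output shown above.
train_cols = pd.Index([
    "enrollee_id", "city", "city_development_index", "gender",
    "relevent_experience", "enrolled_university", "education_level",
    "major_discipline", "experience", "company_size", "company_type",
    "last_new_job", "training_hours", "target",
])
test_cols = pd.Index([
    "enrollee_id", "city", "city_development_index", "gender",
    "relevent_experience", "enrolled_university", "education_level",
    "major_discipline", "experience", "company_size", "company_type",
    "last_new_job", "training_hours",
])

# Columns present in the training set but absent from the test set.
only_in_train = train_cols.difference(test_cols)
print(list(only_in_train))  # ['target']
```

With the loaded frames, the equivalent call would be train.columns.difference(test.columns).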

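The describe method generates descriptive statistics for a DataFrame, which makes it easy to inspect how the dataset is distributed. Note that these statistics exclude NaN values. With include='all', non-numeric columns additionally report unique, top, and freq.

  • count: the number of non-null values.
  • mean: the arithmetic mean.
  • std: the standard deviation.
  • min: the minimum.
  • 25%: the first quartile (25th percentile).
  • 50%: the median (50th percentile).
  • 75%: the third quartile (75th percentile).
  • max: the maximum.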
train.describe(include='all').T
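To see how describe treats missing values, here is a small sketch on a made-up four-row frame (the values are assumptions, not taken from the dataset). NaN entries are skipped, so count is smaller than the number of rows.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column (values are made up).
toy = pd.DataFrame({
    "training_hours": [10.0, 20.0, np.nan, 30.0],
    "gender": ["Male", None, "Female", "Male"],
})

# include="all" summarizes numeric and non-numeric columns together.
summary = toy.describe(include="all").T
print(summary[["count", "unique", "top", "freq", "mean"]])
```

Both columns report a count of 3 even though the frame has 4 rows, and the mean of training_hours is computed over the three non-missing values only.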

Missing-value analysis

def percentage_nulls(df):

    number_nulls = pd.DataFrame(df.isnull().sum(),columns=['Total'])
    number_nulls['% nulls'] = round((number_nulls['Total'] / df.shape[0])*100,1)
    
    return number_nulls
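The helper above divides the null count by the number of rows. A compact variant, sketched here under the hypothetical name percentage_nulls_compact, gets the same percentage from isnull().mean(), since the mean of a boolean mask is the fraction of True values. The toy data is made up.

```python
import numpy as np
import pandas as pd

def percentage_nulls_compact(df):
    """Null count and null percentage per column (illustrative variant)."""
    mask = df.isnull()
    return pd.DataFrame({
        "Total": mask.sum(),
        "% nulls": (mask.mean() * 100).round(1),
    })

toy = pd.DataFrame({
    "gender": ["Male", np.nan, "Female", np.nan],
    "training_hours": [10, 20, 30, 40],
})
result = percentage_nulls_compact(toy)
print(result)
```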

Missing values in the training set.

nulls_train = percentage_nulls(train)
nulls_train

                        Total  % nulls
enrollee_id                 0      0.0
city                        0      0.0
city_development_index      0      0.0
gender                   4508     23.5
relevent_experience         0      0.0
enrolled_university       386      2.0
education_level           460      2.4
major_discipline         2813     14.7
experience                 65      0.3
company_size             5938     31.0
company_type             6140     32.0
last_new_job              423      2.2
training_hours              0      0.0
target                      0      0.0
# Select the 5 columns with the most missing values.
nulls_train.nlargest(5, 'Total')

                  Total  % nulls
company_type       6140     32.0
company_size       5938     31.0
gender             4508     23.5
major_discipline   2813     14.7
education_level     460      2.4
percentage_nulls(test)

                        Total  % nulls
enrollee_id                 0      0.0
city                        0      0.0
city_development_index      0      0.0
gender                    508     23.9
relevent_experience         0      0.0
enrolled_university        31      1.5
education_level            52      2.4
major_discipline          312     14.7
experience                  5      0.2
company_size              622     29.2
company_type              634     29.8
last_new_job               40      1.9
training_hours              0      0.0

Visualizing missing values

import missingno as msno
msno.matrix(train)

Missing values in the test set.

msno.matrix(test)

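To make the later analysis easier, missing values are filled with the string 'Unknown'. The test set is filled the same way, because df_test is used in the sections below.

df_train = train.fillna("Unknown")
df_test = test.fillna("Unknown")
# After filling, check that no missing values remain.
df_train.isnull().sum()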
enrollee_id 0
city 0
city_development_index 0
gender 0
relevent_experience 0
enrolled_university 0
education_level 0
major_discipline 0
experience 0
company_size 0
company_type 0
last_new_job 0
training_hours 0
target 0
dtype: int64
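In this dataset only text columns contain missing values, so a blanket fillna with a string is harmless. In a frame where a numeric column also has gaps, filling it with 'Unknown' would mix strings into that column. The sketch below, on made-up data, restricts the fill to non-numeric columns; it is one possible approach, not the one used in this article.

```python
import numpy as np
import pandas as pd

# Toy frame with a gap in a text column and in a numeric column (made up).
toy = pd.DataFrame({
    "gender": ["Male", np.nan, "Female"],
    "city_development_index": [0.92, np.nan, 0.62],
})

# Fill only the non-numeric columns; numeric columns keep dtype and NaN.
text_cols = toy.select_dtypes(exclude="number").columns
filled = toy.copy()
filled[text_cols] = filled[text_cols].fillna("Unknown")

print(filled)
print(filled.dtypes)
```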

For more on missing-value analysis, see the article 缺失值处理,你真的会了吗? (on handling missing values).

Analyzing each feature

'city': the city

# Select the 'city' column
city_train = df_train['city']
city_train.value_counts()
city_103 4355
city_21 2702
city_16 1533
city_114 1336
city_160 845
...
city_129 3
city_111 3
city_121 3
city_140 1
city_171 1
Name: city, Length: 123, dtype: int64
city_train.value_counts().plot()

city_test = df_test['city']
city_test.value_counts()
city_103 473
city_21 318
city_16 168
city_114 155
city_160 113
...
city_84 1
city_171 1
city_25 1
city_93 1
city_141 1
Name: city, Length: 108, dtype: int64

As the output shows, cities are encoded with numeric IDs; the training set contains 123 distinct cities and the test set contains 108.
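The two counts alone do not say whether every test-set city also occurs in the training set. A sketch of that check with a set difference follows; the city codes in the toy Series are placeholders, and city_999 is invented for the example.

```python
import pandas as pd

# Toy stand-ins for df_train['city'] and df_test['city'].
train_cities = pd.Series(["city_103", "city_21", "city_16", "city_140"])
test_cities = pd.Series(["city_103", "city_21", "city_999"])  # city_999 is made up

# Cities that appear in the test data but never in the training data.
unseen = set(test_cities.unique()) - set(train_cities.unique())
print(unseen)  # {'city_999'}
```

On the real data this would read set(city_test.unique()) - set(city_train.unique()); an empty set means every test city was seen during training.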

'city_development_index': city development index

For a definition of the term, see Wikipedia: https://en.wikipedia.org/wiki/City_development_index

# Inspect the distribution of the feature
sns.displot(data=df_train, 
            x='city_development_index',
            height=6,
            aspect = 2,            
            color = 'lightblue')
sns.set_context(font_scale=1)


with sns.axes_style():
    g = sns.displot(data=df_test, 
            x='city_development_index',
            height=5,
            aspect = 2,
            color = 'coral')
    g.set_ylabels(fontsize=15)
    g.set_xlabels(fontsize=15)
    g.set_xticklabels(fontsize=15)
    g.set_yticklabels(fontsize=15)

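Note this idiom: the FacetGrid methods above set the font size of the x- and y-axis labels and tick labels.

'gender': gender

Gender and the next several features are categorical variables, so for each one the count and percentage of every category are computed.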

gender_train = df_train['gender']
gender_test = df_test['gender']

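The per-category counts and percentages, including the 'Unknown' placeholder for missing values, are needed repeatedly below, so the computation is wrapped in a function.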

def percent_nan(df):

    number = pd.DataFrame(df.value_counts())
    number.columns = ['Total']
    number['%'] = round((number['Total'] / df.notnull().sum())*100,1)
    
    return number
percent_nan(gender_train)

         Total     %
Male     13221  69.0
Unknown   4508  23.5
Female    1238   6.5
Other      191   1.0
percent_nan(gender_test)

         Total     %
Male      1460  68.6
Unknown    508  23.9
Female     137   6.4
Other       24   1.1

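The share of missing values is about the same in both datasets, so the two are plotted side by side to make any differences easier to see.

As before, the plotting code is wrapped in a function so that it can be reused in the sections that follow.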

def draw_countplot(feature,palette,order=None):
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

    sns.countplot(x=feature,data=df_train,palette=palette,ax=ax[0],order=order).set_title('Train')
    sns.countplot(x=feature,data=df_test,palette=palette,ax=ax[1],order=order).set_title('Test')
    fig.tight_layout()
draw_countplot("gender","Set1")
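Count plots compare absolute counts, and the training set is roughly nine times larger than the test set, so the bar heights are not directly comparable. As a numeric companion, the sketch below puts the percentage shares of both sets in one table. The helper name compare_shares and the toy data are assumptions made for illustration.

```python
import pandas as pd

def compare_shares(train_col, test_col):
    """Percentage share of each category in two Series, side by side."""
    shares = pd.concat(
        [
            train_col.value_counts(normalize=True).rename("Train %"),
            test_col.value_counts(normalize=True).rename("Test %"),
        ],
        axis=1,
    )
    # Categories missing from one side get a share of 0.
    return (shares * 100).round(1).fillna(0.0)

toy_train = pd.Series(["Male", "Male", "Female", "Unknown"])
toy_test = pd.Series(["Male", "Unknown"])

table = compare_shares(toy_train, toy_test)
table["Diff"] = (table["Train %"] - table["Test %"]).round(1)
print(table)
```

With the real frames, compare_shares(df_train['gender'], df_test['gender']) would produce the same kind of table for any categorical feature.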

'relevent_experience': relevant experience

relevent_experience_train = df_train['relevent_experience']
relevent_experience_test = df_test['relevent_experience']
percent_nan(relevent_experience_train)

                         Total     %
Has relevent experience  13792  72.0
No relevent experience    5366  28.0
percent_nan(relevent_experience_test)

                         Total     %
Has relevent experience   1524  71.6
No relevent experience     605  28.4
draw_countplot("relevent_experience","Set2")

'enrolled_university': university enrollment

The type of university course the candidate is enrolled in, if any.

enrolled_university_train = df_train['enrolled_university']
enrolled_university_test = df_test['enrolled_university']
percent_nan(enrolled_university_train)

                  Total     %
no_enrollment     13817  72.1
Full time course   3757  19.6
Part time course   1198   6.3
Unknown             386   2.0
percent_nan(enrolled_university_test)

                  Total     %
no_enrollment      1519  71.3
Full time course    435  20.4
Part time course    144   6.8
Unknown              31   1.5
order_enrolled_university = percent_nan(enrolled_university_train).index
order_enrolled_university
Index(['no_enrollment', 'Full time course',
'Part time course', 'Unknown'],
dtype='object')
draw_countplot('enrolled_university',"Set3",order_enrolled_university)

'education_level': education level

education_level_train = df_train['education_level']
education_level_test = df_test['education_level']
percent_nan(education_level_train)

                Total     %
Graduate        11598  60.5
Masters          4361  22.8
High School      2017  10.5
Unknown           460   2.4
Phd               414   2.2
Primary School    308   1.6
percent_nan(education_level_test)

                Total     %
Graduate         1269  59.6
Masters           496  23.3
High School       222  10.4
Phd                54   2.5
Unknown            52   2.4
Primary School     36   1.7
order_education_level = percent_nan(education_level_train).index
draw_countplot('education_level',"Set1",order_education_level)
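The order used above sorts the bars by frequency. Education level also has a natural ranking, so an alternative is to order the categories from lowest to highest level. The sketch below uses an ordered pandas Categorical; the ranking itself and the choice to put 'Unknown' first are assumptions, not something the dataset defines.

```python
import pandas as pd

# Assumed ranking from lowest to highest; 'Unknown' is placed first by choice.
education_order = [
    "Unknown", "Primary School", "High School", "Graduate", "Masters", "Phd",
]

toy = pd.Series(["Masters", "Graduate", "Unknown", "Phd", "High School"])
ordered = pd.Series(
    pd.Categorical(toy, categories=education_order, ordered=True)
)

# Sorting now follows the declared ranking rather than alphabetical order.
print(ordered.sort_values().tolist())
# ['Unknown', 'High School', 'Graduate', 'Masters', 'Phd']
```

The same list could be passed as the order argument: draw_countplot('education_level', "Set1", education_order).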

'major_discipline': major discipline

major_discipline_train = df_train['major_discipline']
major_discipline_test = df_test['major_discipline']
percent_nan(major_discipline_train)

                 Total     %
STEM             14492  75.6
Unknown           2813  14.7
Humanities         669   3.5
Other              381   2.0
Business Degree    327   1.7
Arts               253   1.3
No Major           223   1.2
percent_nan(major_discipline_test)

                 Total     %
STEM              1621  76.1
Unknown            312  14.7
Humanities          80   3.8
Other               40   1.9
Business Degree     37   1.7
No Major            22   1.0
Arts                17   0.8
order_major_discipline = percent_nan(major_discipline_train).index
draw_countplot('major_discipline',"Set2",order_major_discipline)

'experience': years of work experience

experience_train = df_train['experience']
experience_test = df_test['experience']
percent_nan(experience_train)

         Total     %
>20       3286  17.2
5         1430   7.5
4         1403   7.3
3         1354   7.1
6         1216   6.3
2         1127   5.9
7         1028   5.4
10         985   5.1
9          980   5.1
8          802   4.2
15         686   3.6
11         664   3.5
14         586   3.1
1          549   2.9
<1         522   2.7
16         508   2.7
12         494   2.6
13         399   2.1
17         342   1.8
19         304   1.6
18         280   1.5
20         148   0.8
Unknown     65   0.3
percent_nan(experience_test)

         Total     %
>20        383  18.0
5          163   7.7
3          154   7.2
4          145   6.8
6          130   6.1
2          128   6.0
7          116   5.4
9          113   5.3
10          96   4.5
11          86   4.0
8           82   3.9
<1          74   3.5
16          68   3.2
15          59   2.8
1           56   2.6
14          55   2.6
13          54   2.5
12          52   2.4
17          36   1.7
19          29   1.4
18          26   1.2
20          19   0.9
Unknown      5   0.2
order_experience = percent_nan(experience_train).index
draw_countplot('experience',"Set3",order_experience)
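Ordered by frequency, the experience labels appear on the axis out of numeric sequence ('>20', '5', '4', ...). A sketch of sorting them by years instead is shown below. The helper experience_to_years is hypothetical, and the mappings '<1' to 0, '>20' to 21, and 'Unknown' to -1 are conventions assumed for the example.

```python
def experience_to_years(label):
    """Map an experience label to a sortable number (assumed conventions)."""
    if label == "Unknown":
        return -1  # sentinel so that missing values sort first
    if label == "<1":
        return 0
    if label == ">20":
        return 21
    return int(label)

labels = ["5", ">20", "<1", "Unknown", "12"]
order_by_years = sorted(labels, key=experience_to_years)
print(order_by_years)
# ['Unknown', '<1', '5', '12', '>20']
```

Applied to the real labels, sorted(order_experience, key=experience_to_years) would give an order that can be passed to draw_countplot.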

'company_size': company size

company_size_train = df_train['company_size']
company_size_test = df_test['company_size']
percent_nan(company_size_train)

           Total     %
Unknown     5938  31.0
50-99       3083  16.1
100-500     2571  13.4
10000+      2019  10.5
10/49       1471   7.7
1000-4999   1328   6.9
<10         1308   6.8
500-999      877   4.6
5000-9999    563   2.9
percent_nan(company_size_test)

           Total     %
Unknown      622  29.2
50-99        338  15.9
100-500      318  14.9
10000+       217  10.2
10/49        172   8.1
<10          163   7.7
1000-4999    143   6.7
500-999       88   4.1
5000-9999     68   3.2
order_company_size = percent_nan(company_size_train).index
draw_countplot('company_size',"Set1",order_company_size)
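One label in the tables above, '10/49', is written differently from the other ranges. Assuming it means 10 to 49 employees (the source does not say so), the sketch below renames it and lists the counts in size order rather than frequency order. The toy Series and the size_order list are illustrative.

```python
import pandas as pd

# Assumed size ranking, smallest to largest, with 'Unknown' last.
size_order = [
    "<10", "10-49", "50-99", "100-500", "500-999",
    "1000-4999", "5000-9999", "10000+", "Unknown",
]

toy = pd.Series(["10/49", "50-99", "Unknown", "<10", "10/49"])

# Rename the oddly formatted label (assumption: '10/49' means '10-49').
cleaned = toy.replace({"10/49": "10-49"})

# Reindex the counts so that rows follow company size, filling gaps with 0.
counts = cleaned.value_counts().reindex(size_order, fill_value=0)
print(counts)
```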

'company_type': company type

company_type_train = df_train['company_type']
company_type_test = df_test['company_type']
percent_nan(company_type_train)

                     Total     %
Pvt Ltd               9817  51.2
Unknown               6140  32.0
Funded Startup        1001   5.2
Public Sector          955   5.0
Early Stage Startup    603   3.1
NGO                    521   2.7
Other                  121   0.6
percent_nan(company_type_test)

                     Total     %
Pvt Ltd               1141  53.6
Unknown                634  29.8
Public Sector          127   6.0
Funded Startup          97   4.6
Early Stage Startup     65   3.1
NGO                     53   2.5
Other                   12   0.6
order_company_type = percent_nan(company_type_train).index
draw_countplot('company_type',"Set2",order_company_type)

'last_new_job': years between the previous job and the current one

last_new_job_train = df_train['last_new_job']
last_new_job_test = df_test['last_new_job']
percent_nan(last_new_job_train)

         Total     %
1         8040  42.0
>4        3290  17.2
2         2900  15.1
never     2452  12.8
4         1029   5.4
3         1024   5.3
Unknown    423   2.2
percent_nan(last_new_job_test)

         Total     %
1          884  41.5
>4         353  16.6
2          342  16.1
never      258  12.1
3          133   6.2
4          119   5.6
Unknown     40   1.9
order_last_new_job = percent_nan(last_new_job_train).index
draw_countplot('last_new_job',"Set3",order_last_new_job)

'training_hours': training hours completed

sns.displot(data=df_train, 
            x='training_hours',
            height=5,
            aspect=1.5,
            color = 'lightblue')

sns.displot(data=df_test, 
            x='training_hours',
            height=5,
            aspect=1.5,
            color = 'coral')
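Two histograms can only be compared by eye. A small numeric summary makes the comparison concrete. The sketch below computes a few quantiles and the skewness for two samples. The data is synthetic (drawn from an exponential distribution purely to get a right-skewed shape) and says nothing about the actual training_hours values; the helper name summarize is likewise illustrative.

```python
import numpy as np
import pandas as pd

QUANTILES = {"q25": 0.25, "median": 0.50, "q75": 0.75, "q95": 0.95}

def summarize(values):
    """Selected quantiles plus skewness of a numeric Series."""
    stats = {name: values.quantile(q) for name, q in QUANTILES.items()}
    stats["skew"] = values.skew()
    return pd.Series(stats)

# Synthetic, right-skewed samples standing in for the two datasets.
rng = np.random.default_rng(0)
toy_train = pd.Series(rng.exponential(scale=60, size=1000))
toy_test = pd.Series(rng.exponential(scale=60, size=200))

summary = pd.DataFrame({
    "train": summarize(toy_train),
    "test": summarize(toy_test),
})
print(summary.round(2))
```

On the real data, summarize(df_train['training_hours']) and summarize(df_test['training_hours']) would fill the two columns.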

'Target'

0 - Not looking for job change
1 - Looking for a job change

target = df_train['target']
percent_nan(target)

     Total     %
0.0  14381  75.1
1.0   4777  24.9
sns.countplot(x='target',
              data=df_train,
              palette="Set1").set_title('Train')
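The target is clearly imbalanced. As a quick worked check, the class counts from the table above give the shares and the ratio between the two classes; the variable names are illustrative.

```python
import pandas as pd

# Class counts taken from the table above.
counts = pd.Series({0.0: 14381, 1.0: 4777}, name="Total")

# Percentage share of each class and the majority-to-minority ratio.
shares = (counts / counts.sum() * 100).round(1)
imbalance_ratio = counts.max() / counts.min()

print(shares.to_dict())           # {0.0: 75.1, 1.0: 24.9}
print(round(imbalance_ratio, 2))  # 3.01
```

Roughly three candidates are not looking for a job change for every one who is.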


The data used in this article is a Kaggle dataset. To get it, follow the official account and reply 'hr'.

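Recommended reading

Python数据分析之探索性数据分析(EDA) (Python data analysis: exploratory data analysis)

缺失值处理,你真的会了吗? (Handling missing values: have you really mastered it?)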
