Datawhale Hands-On Data Analysis, Chapter 1

1.1 Loading the data

1.1.1 Task 1: Import numpy and pandas

import numpy as np
import pandas as pd

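1.1.2 Task 2: Load the data

df = pd.read_csv('train.csv')  # load via a relative path
df = pd.read_csv(r'D:\titanic\train.csv')  # load via an absolute path

When loading from an absolute Windows path, either double each backslash or prefix the string with r (a raw string), so that sequences such as \t are not interpreted as escape characters.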
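
As a minimal sketch of the point about backslashes, the snippet below compares three spellings of the same Windows path. The directory D:\titanic is taken from the example above and is not assumed to exist, so nothing is read from disk; the variable names are made up for illustration.

```python
from pathlib import PureWindowsPath

# Three spellings of the same Windows path (no file is opened here).
raw_path = r'D:\titanic\train.csv'        # raw string: backslashes are kept literally
escaped_path = 'D:\\titanic\\train.csv'   # each backslash escaped by doubling it
forward_path = 'D:/titanic/train.csv'     # forward slashes are also understood on Windows

print(raw_path == escaped_path)
print(PureWindowsPath(forward_path) == PureWindowsPath(raw_path))

# Without the r prefix or doubling, '\t' inside the literal becomes a TAB character.
broken_path = 'D:\titanic\train.csv'
print('\t' in broken_path)
```

All three checks print True, which is why the unescaped form fails to find the file.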

df.shape
(891, 12)

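Differences between pd.read_csv() and pd.read_table()

Both read_csv and read_table load delimited text and use the delimiter to split each line into fields. They differ only in the default delimiter: read_table splits on the tab character \t, whereas read_csv splits on the comma. Reading a comma-separated file with read_table therefore puts each whole line into a single column.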

pd.read_csv('train.csv')
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

pd.read_table('train.csv')
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0 1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/...
1 2,1,1,"Cumings, Mrs. John Bradley (Florence Br...
2 3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,S...
3 4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May ...
4 5,0,3,"Allen, Mr. William Henry",male,35,0,0,3...
... ...
886 887,0,2,"Montvila, Rev. Juozas",male,27,0,0,21...
887 888,1,1,"Graham, Miss. Margaret Edith",female,...
888 889,0,3,"Johnston, Miss. Catherine Helen ""Car...
889 890,1,1,"Behr, Mr. Karl Howell",male,26,0,0,11...
890 891,0,3,"Dooley, Mr. Patrick",male,32,0,0,3703...

891 rows × 1 columns

How can the two be made to give the same result?

Set the sep parameter of read_table to ",".

pd.read_table('train.csv',sep=',')
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns
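
The same comparison can be reproduced without train.csv. The sketch below uses a tiny made-up sample held in memory (the rows and variable names are illustrative assumptions) and checks the shapes each reader produces.

```python
import io
import pandas as pd

# A tiny comma-separated sample standing in for train.csv (made-up rows).
csv_text = 'PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n'

by_csv = pd.read_csv(io.StringIO(csv_text))                      # default sep=','
by_table = pd.read_table(io.StringIO(csv_text))                  # default sep='\t'
by_table_comma = pd.read_table(io.StringIO(csv_text), sep=',')   # override the default

print(by_csv.shape)                   # (2, 3): split on commas
print(by_table.shape)                 # (2, 1): no tab found, so one column per line
print(by_csv.equals(by_table_comma))  # True
```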

1.1.3 Task 3: Read the data in chunks of 1000 rows each

type(df)
pandas.core.frame.DataFrame
chunker = pd.read_csv("train.csv",chunksize=1000)  # chunksize makes read_csv return the data in chunks
print(type(chunker))
for piece in chunker:
    print(type(piece))
    print(len(piece))
    piece.head()
<class 'pandas.io.parsers.readers.TextFileReader'>
<class 'pandas.core.frame.DataFrame'>
891
type(chunker)
pandas.io.parsers.readers.TextFileReader

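The chunksize parameter splits the data into chunks. With it, read_csv returns a TextFileReader, an iterator that yields DataFrame pieces, instead of a single DataFrame. Because train.csv has only 891 rows, fewer than the chunk size of 1000, the loop above runs once and yields a single 891-row piece.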
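
To see more than one chunk, a smaller chunk size is needed. The following sketch uses ten made-up rows and an arbitrary chunksize of 4 (both are assumptions for illustration) and accumulates a running total across the pieces.

```python
import io
import pandas as pd

# Ten made-up rows standing in for a large file.
csv_text = 'id,value\n' + ''.join(f'{i},{i * 10}\n' for i in range(10))

chunk_lengths = []
total = 0
# Each piece yielded by the reader is an ordinary DataFrame.
for piece in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_lengths.append(len(piece))
    total += piece['value'].sum()

print(chunk_lengths)  # [4, 4, 2]
print(total)          # 450
```

Processing piece by piece like this keeps only one chunk in memory at a time, which is the usual reason to reach for chunksize.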

1.1.4 Task 4: Change the column headers to Chinese and set the index to the passenger ID

Method 1: assign to the columns attribute directly

df.columns=['乘客ID','是否幸存','乘客等级(1/2/3等舱位)','乘客姓名','性别','年龄','堂兄弟/妹个数','父母与小孩个数','船票信息','票价','客舱','登船港口']
df
乘客ID 是否幸存 乘客等级(1/2/3等舱位) 乘客姓名 性别 年龄 堂兄弟/妹个数 父母与小孩个数 船票信息 票价 客舱 登船港口
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

Method 2: use rename

For example:

df.rename(columns={'乘客ID':'PassengerId'})
df
乘客ID 是否幸存 乘客等级(1/2/3等舱位) 乘客姓名 性别 年龄 堂兄弟/妹个数 父母与小孩个数 船票信息 票价 客舱 登船港口
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

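The call above did not change df: rename returns a new DataFrame by default, and since inplace was not set the result was discarded. The corrected call: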

df.rename(columns={'乘客ID':'PassengerId'},inplace=True)
df
PassengerId 是否幸存 乘客等级(1/2/3等舱位) 乘客姓名 性别 年龄 堂兄弟/妹个数 父母与小孩个数 船票信息 票价 客舱 登船港口
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

df.rename(columns={'PassengerId':'乘客ID'},inplace=True)

With inplace=True specified, the rename takes effect.
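
The task also asks for the index to be set to the passenger ID. One way this could be done is with set_index; the sketch below uses two made-up rows and shows that both rename and set_index return new objects unless the result is assigned back.

```python
import pandas as pd

# Two made-up rows with a subset of the Titanic columns.
sample = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1]})

# rename returns a new DataFrame by default; sample itself is left unchanged.
renamed = sample.rename(columns={'PassengerId': '乘客ID', 'Survived': '是否幸存'})

# set_index also returns a new object; the passenger ID becomes the row index.
indexed = renamed.set_index('乘客ID')

print(list(sample.columns))   # ['PassengerId', 'Survived']
print(list(indexed.columns))  # ['是否幸存']
print(indexed.index.name)     # 乘客ID
```

Assigning the result to a name, as here, is an alternative to inplace=True and makes it easier to chain steps.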

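1.2 Initial inspection

1.2.1 Task 1: View basic information about the data

df.info()          # print a summary
df.describe()      # descriptive statistics
df.values          # data as <ndarray>
df.to_numpy()      # data as <ndarray> (recommended)
df.shape           # shape (rows, columns)
df.columns         # column labels <Index>
df.columns.values  # column labels <ndarray>
df.index           # row labels <Index>
df.index.values    # row labels <ndarray>
df.head(1)         # first n rows
df.tail(1)         # last n rows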
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   乘客ID            891 non-null    int64  
 1   是否幸存            891 non-null    int64  
 2   乘客等级(1/2/3等舱位)  891 non-null    int64  
 3   乘客姓名            891 non-null    object 
 4   性别              891 non-null    object 
 5   年龄              714 non-null    float64
 6   堂兄弟/妹个数         891 non-null    int64  
 7   父母与小孩个数         891 non-null    int64  
 8   船票信息            891 non-null    object 
 9   票价              891 non-null    float64
 10  客舱              204 non-null    object 
 11  登船港口            889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
乘客ID 是否幸存 乘客等级(1/2/3等舱位) 乘客姓名 性别 年龄 堂兄弟/妹个数 父母与小孩个数 船票信息 票价 客舱 登船港口
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q

1.2.2 Task 2: View the first 10 rows and the last 15 rows of the table

df.head(10)
df.tail(15)
乘客ID 是否幸存 乘客等级(1/2/3等舱位) 乘客姓名 性别 年龄 堂兄弟/妹个数 父母与小孩个数 船票信息 票价 客舱 登船港口
876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 7534 9.8458 NaN S
877 878 0 3 Petroff, Mr. Nedelio male 19.0 0 0 349212 7.8958 NaN S
878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 NaN S
881 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 NaN S
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

1.2.3 Task 3: Check whether values are missing, returning True where a value is missing and False elsewhere

df.isnull()
乘客ID 是否幸存 乘客等级(1/2/3等舱位) 乘客姓名 性别 年龄 堂兄弟/妹个数 父母与小孩个数 船票信息 票价 客舱 登船港口
0 False False False False False False False False False False True False
1 False False False False False False False False False False False False
2 False False False False False False False False False False True False
3 False False False False False False False False False False False False
4 False False False False False False False False False False True False
... ... ... ... ... ... ... ... ... ... ... ... ...
886 False False False False False False False False False False True False
887 False False False False False False False False False False False False
888 False False False False False True False False False False True False
889 False False False False False False False False False False False False
890 False False False False False False False False False False True False

891 rows × 12 columns
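
A full True/False table is hard to read at 891 rows, so the mask is often summarized. The sketch below, on three made-up rows, counts missing values per column and flags the rows that contain any missing value.

```python
import numpy as np
import pandas as pd

# Three made-up rows; NaN marks a missing value.
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Cabin': [np.nan, 'C85', 'E46'],
    'Sex': ['male', 'female', 'female'],
})

missing_per_column = sample.isnull().sum()       # each True counts as 1
rows_with_missing = sample.isnull().any(axis=1)  # True if any column is missing in that row

print(missing_per_column.to_dict())  # {'Age': 1, 'Cabin': 1, 'Sex': 0}
print(rows_with_missing.tolist())    # [True, True, False]
```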

1.3 Saving the data

1.3.1 Task 1: Save the data you loaded and modified to a new file, train_chinese.csv, in the working directory

df.to_csv('train_chinese.csv')
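
By default to_csv also writes the row index as an extra, unnamed first column. The sketch below writes a made-up two-row frame to an in-memory buffer instead of a file (an assumption made so that nothing is written to disk) and compares the header line with and without index=False.

```python
import io
import pandas as pd

sample = pd.DataFrame({'乘客ID': [1, 2], '是否幸存': [0, 1]})

# Default: the row index is written as an extra, unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)

# index=False leaves the row index out of the output.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)

print(with_index.getvalue().splitlines()[0])     # ,乘客ID,是否幸存
print(without_index.getvalue().splitlines()[0])  # 乘客ID,是否幸存

# Reading the index-free text back gives an identical frame.
reloaded = pd.read_csv(io.StringIO(without_index.getvalue()))
print(reloaded.equals(sample))                   # True
```

When the file is meant to be opened in a spreadsheet program, passing encoding='utf-8-sig' may also help the Chinese headers display correctly, though that depends on the program.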

1.5 Filtering logic

1.5.1 Task 1: Using "Age" as the filter condition, display the passengers younger than 10.

df1[df1.Age<10]
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
7 8 3 Palsson, Master. Gosta Leonard male 2.00 3 1 349909 21.0750 NaN S
10 11 3 Sandstrom, Miss. Marguerite Rut female 4.00 1 1 PP 9549 16.7000 G6 S
16 17 3 Rice, Master. Eugene male 2.00 4 1 382652 29.1250 NaN Q
24 25 3 Palsson, Miss. Torborg Danira female 8.00 3 1 349909 21.0750 NaN S
43 44 2 Laroche, Miss. Simonne Marie Anne Andree female 3.00 1 2 SC/Paris 2123 41.5792 NaN C
... ... ... ... ... ... ... ... ... ... ... ...
827 828 2 Mallet, Master. Andre male 1.00 0 2 S.C./PARIS 2079 37.0042 NaN C
831 832 2 Richards, Master. George Sibley male 0.83 1 1 29106 18.7500 NaN S
850 851 3 Andersson, Master. Sigvard Harald Elias male 4.00 4 2 347082 31.2750 NaN S
852 853 3 Boulos, Miss. Nourelain female 9.00 1 1 2678 15.2458 NaN C
869 870 3 Johnson, Master. Harold Theodor male 4.00 1 1 347742 11.1333 NaN S

62 rows × 11 columns

1.5.2 Task 2: Using "Age" as the condition, display the passengers older than 10 and younger than 50, and name the result midage

midage=df1[(df1.Age>10) & (df1.Age<50)]
midage
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
885 886 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
886 887 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
889 890 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

576 rows × 11 columns
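
Two details of this filter are easy to miss: each comparison needs its own parentheses, and rows with a missing age satisfy neither condition. A minimal sketch on six made-up ages:

```python
import numpy as np
import pandas as pd

# Made-up ages, including the two boundary values and one missing value.
ages = pd.DataFrame({'Age': [5.0, 10.0, 22.0, 50.0, np.nan, 49.0]})

# & binds more tightly than > and <, so each comparison is parenthesized.
mask = (ages.Age > 10) & (ages.Age < 50)
filtered = ages[mask]

print(mask.tolist())            # [False, False, True, False, False, True]
print(filtered.index.tolist())  # [2, 5]: the original row labels are kept
```

The boundary ages 10 and 50 are excluded because the comparisons are strict, and the NaN row is dropped because any comparison with NaN is False.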

1.5.3 Task 3: Display the "Pclass" and "Sex" values in row 100 of midage

midage.iloc[[99],[1,3]] 
Pclass Sex
148 2 male

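1.5.4 Task 4: Use loc to display the "Pclass", "Name" and "Sex" values in rows 100, 105 and 108 of midage

The task asks for loc, but loc selects by index label, and after filtering the labels in midage no longer match the row positions. The index therefore has to be rebuilt first so that label and row number line up.
reset_index replaces the old index (the unnamed leftmost column) with a new one starting at 0. Pass drop=True to discard the old index; otherwise it is kept as a new column named index, as shown here: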

midage.reset_index()
index PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 1 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 1 2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 2 3 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 3 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 4 5 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
571 885 886 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
572 886 887 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
573 887 888 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
574 889 890 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
575 890 891 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

576 rows × 12 columns

The following code completes the task after resetting the index.

pd.options.display.max_rows=10
midage.reset_index(drop=True).loc[[99,104,107],['Pclass','Name','Sex']]
Pclass Name Sex
99 2 Navratil, Mr. Michel ("Louis M Hoffman") male
104 3 Corn, Mr. Harry male
107 3 Bengtsson, Mr. John Viktor male
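
The difference between label and position is easier to see on a small frame. In this sketch (four made-up rows, hypothetical variable names) filtering leaves the labels 0, 2, 3, so position 1 and label 1 no longer refer to the same thing until the index is reset.

```python
import pandas as pd

people = pd.DataFrame({
    'Pclass': [3, 1, 3, 2],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 8.0, 26.0, 35.0],
})
adults = people[people.Age > 10]            # keeps the original labels 0, 2, 3

print(adults.index.tolist())                # [0, 2, 3]
print(1 in adults.index)                    # False: label 1 was filtered out
print(adults.iloc[1]['Sex'])                # female: second row by position

renumbered = adults.reset_index(drop=True)  # labels become 0, 1, 2
print(renumbered.loc[1, 'Sex'])             # female: label 1 now equals position 1
```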

1.5.5 Task 5: Use iloc to display the "Pclass", "Name" and "Sex" values in rows 100, 105 and 108 of midage

newMidage=midage.reset_index(drop=True)
col=list(newMidage.columns) # used to look up the positions of the required columns
newMidage.iloc[[99,104,107],[col.index('Pclass'),col.index('Name'),col.index('Sex')]]
Pclass Name Sex
99 2 Navratil, Mr. Michel ("Louis M Hoffman") male
104 3 Corn, Mr. Harry male
107 3 Bengtsson, Mr. John Viktor male
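
As an alternative to building a list and calling index on it for each name, the column Index offers get_indexer, which converts several labels to positions in one call. A sketch on a made-up two-row frame:

```python
import pandas as pd

sample = pd.DataFrame({
    'PassengerId': [1, 2],
    'Pclass': [3, 1],
    'Name': ['A', 'B'],        # placeholder names, not real passengers
    'Sex': ['male', 'female'],
})

# Index.get_indexer maps column labels to their integer positions.
positions = sample.columns.get_indexer(['Pclass', 'Name', 'Sex'])
subset = sample.iloc[[0, 1], positions]

print(positions.tolist())    # [1, 2, 3]
print(list(subset.columns))  # ['Pclass', 'Name', 'Sex']
```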

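1.4 Know what your data is called

1.4.1 Task 1: pandas has two data types, DataFrame and Series. Look them up to get a basic understanding, then write a small example of each.

Series: a Series is a one-dimensional data structure in which every element has a label, similar to a NumPy array whose elements are labeled. Labels can be numbers or strings.

Note: labels may be repeated.

pandas.Series(data, index, dtype, copy)

Creating a Series

Parameter  Description
data       The data, in various forms such as an ndarray, a list, or a constant.
index      Index labels must be hashable and match the data in length; they do not have to be unique. Defaults to 0, 1, ..., n-1 if no index is passed.
dtype      The data type. If omitted, it is inferred from the data.
copy       Whether to copy the input data. Defaults to False.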
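
As one possible answer to the "small example" part of the task, the sketch below builds a Series with a repeated label and a DataFrame with two columns. The values and names are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional values, each with an index label.
fares = pd.Series([7.25, 71.2833, 7.925], index=['a', 'b', 'a'], name='Fare')
print(fares['b'])       # 71.2833: a unique label returns a scalar
print(len(fares['a']))  # 2: a repeated label returns every matching entry

# A DataFrame: a table whose columns are Series sharing one index.
table = pd.DataFrame({'Pclass': [3, 1, 3], 'Sex': ['male', 'female', 'female']})
print(type(table['Pclass']).__name__)  # Series
print(table.shape)                     # (3, 2)
```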

1.4.2 Task 2: Load the "train.csv" file using the method from the previous lesson

df1=pd.read_csv('train.csv')

1.4.3 Task 3: View the name of every column in the DataFrame

df1.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

This returns a pandas Index object rather than a plain array.

list(df1)
['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

This returns a list.

df1.keys()
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

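keys() returns the "info axis" of a pandas object: the index for a Series and the columns for a DataFrame. For the former Panel type it returned major_axis, but Panel has since been removed from pandas.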
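
The return types of the three approaches can be checked directly. A short sketch on a made-up one-row frame:

```python
import pandas as pd

sample = pd.DataFrame({'PassengerId': [1], 'Pclass': [3]})

print(type(sample.columns).__name__)         # Index
print(type(list(sample)).__name__)           # list
print(sample.keys().equals(sample.columns))  # True: keys() is the columns for a DataFrame
print(sample.columns.tolist())               # ['PassengerId', 'Pclass']
```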

1.4.4 Task 4: View all values in the "Cabin" column

df1.Cabin.values 
array([nan, 'C85', nan, 'C123', nan, nan, 'E46', nan, nan, nan, 'G6',
       'C103', nan, nan, nan, nan, nan, nan, nan, nan, nan, 'D56', nan,
       'A6', nan, nan, nan, 'C23 C25 C27', nan, nan, nan, 'B78', nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, 'D33', nan, 'B30', 'C52', nan, nan, nan,
       nan, nan, 'B28', 'C83', nan, nan, nan, 'F33', nan, nan, nan, nan,
       nan, nan, nan, nan, 'F G73', nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, 'C23 C25 C27', nan, nan, nan, 'E31', nan,
       nan, nan, 'A5', 'D10 D12', nan, nan, nan, nan, 'D26', nan, nan,
       nan, nan, nan, nan, nan, 'C110', nan, nan, nan, nan, nan, nan, nan,
       'B58 B60', nan, nan, nan, nan, 'E101', 'D26', nan, nan, nan,
       'F E69', nan, nan, nan, nan, nan, nan, nan, 'D47', 'C123', nan,
       'B86', nan, nan, nan, nan, nan, nan, nan, nan, 'F2', nan, nan,
       'C2', nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, 'E33', nan, nan, nan, 'B19', nan, nan, nan, 'A7', nan,
       nan, 'C49', nan, nan, nan, nan, nan, 'F4', nan, 'A32', nan, nan,
       nan, nan, nan, nan, nan, 'F2', 'B4', 'B80', nan, nan, nan, nan,
       nan, nan, nan, nan, nan, 'G6', nan, nan, nan, 'A31', nan, nan, nan,
       nan, nan, 'D36', nan, nan, 'D15', nan, nan, nan, nan, nan, 'C93',
       nan, nan, nan, nan, nan, 'C83', nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, 'C78', nan, nan, 'D35', nan,
       nan, 'G6', 'C87', nan, nan, nan, nan, 'B77', nan, nan, nan, nan,
       'E67', 'B94', nan, nan, nan, nan, 'C125', 'C99', nan, nan, nan,
       'C118', nan, 'D7', nan, nan, nan, nan, nan, nan, nan, nan, 'A19',
       nan, nan, nan, nan, nan, nan, 'B49', 'D', nan, nan, nan, nan,
       'C22 C26', 'C106', 'B58 B60', nan, nan, nan, 'E101', nan,
       'C22 C26', nan, 'C65', nan, 'E36', 'C54', 'B57 B59 B63 B66', nan,
       nan, nan, nan, nan, nan, 'C7', 'E34', nan, nan, nan, nan, nan,
       'C32', nan, 'D', nan, 'B18', nan, 'C124', 'C91', nan, nan, nan,
       'C2', 'E40', nan, 'T', 'F2', 'C23 C25 C27', nan, nan, nan, 'F33',
       nan, nan, nan, nan, nan, 'C128', nan, nan, nan, nan, 'E33', nan,
       nan, nan, nan, nan, nan, nan, nan, nan, 'D37', nan, nan, 'B35',
       'E50', nan, nan, nan, nan, nan, nan, 'C82', nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, 'B96 B98', nan, nan, 'D36',
       'G6', nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, 'C78', nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, 'E10', 'C52', nan,
       nan, nan, 'E44', 'B96 B98', nan, nan, 'C23 C25 C27', nan, nan, nan,
       nan, nan, nan, 'A34', nan, nan, nan, 'C104', nan, nan, 'C111',
       'C92', nan, nan, 'E38', 'D21', nan, nan, 'E12', nan, 'E63', nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, 'D', nan, 'A14', nan,
       nan, nan, nan, nan, nan, nan, nan, 'B49', nan, 'C93', 'B37', nan,
       nan, nan, nan, 'C30', nan, nan, nan, 'D20', nan, 'C22 C26', nan,
       nan, nan, nan, nan, 'B79', 'C65', nan, nan, nan, nan, nan, nan,
       'E25', nan, nan, 'D46', 'F33', nan, nan, nan, 'B73', nan, nan,
       'B18', nan, nan, nan, 'C95', nan, nan, nan, nan, nan, nan, nan,
       nan, 'B38', nan, nan, 'B39', 'B22', nan, nan, nan, 'C86', nan, nan,
       nan, nan, nan, 'C70', nan, nan, nan, nan, nan, 'A16', nan, 'E67',
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 'C101',
       'E25', nan, nan, nan, nan, 'E44', nan, nan, nan, 'C68', nan, 'A10',
       nan, 'E68', nan, 'B41', nan, nan, nan, 'D20', nan, nan, nan, nan,
       nan, nan, nan, 'A20', nan, nan, nan, nan, nan, nan, nan, nan, nan,
       'C125', nan, nan, nan, nan, nan, nan, nan, nan, 'F4', nan, nan,
       'D19', nan, nan, nan, 'D50', nan, 'D9', nan, nan, 'A23', nan,
       'B50', nan, nan, nan, nan, nan, nan, nan, nan, 'B35', nan, nan,
       nan, 'D33', nan, 'A26', nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, 'D48', nan, nan, 'E58', nan, nan, nan, nan, nan,
       nan, 'C126', nan, 'B71', nan, nan, nan, nan, nan, nan, nan,
       'B51 B53 B55', nan, 'D49', nan, nan, nan, nan, nan, nan, nan, 'B5',
       'B20', nan, nan, nan, nan, nan, nan, nan, 'C68', 'F G63',
       'C62 C64', 'E24', nan, nan, nan, nan, nan, 'E24', nan, nan, 'C90',
       'C124', 'C126', nan, nan, 'F G73', 'C45', 'E101', nan, nan, nan,
       nan, nan, nan, 'E8', nan, nan, nan, nan, nan, 'B5', nan, nan, nan,
       nan, nan, nan, 'B101', nan, nan, 'D45', 'C46', 'B57 B59 B63 B66',
       nan, nan, 'B22', nan, nan, 'D30', nan, nan, 'E121', nan, nan, nan,
       nan, nan, nan, nan, 'B77', nan, nan, nan, 'B96 B98', nan, 'D11',
       nan, nan, nan, nan, nan, nan, 'E77', nan, nan, nan, 'F38', nan,
       nan, 'B3', nan, 'B20', 'D6', nan, nan, nan, nan, nan, nan,
       'B82 B84', nan, nan, nan, nan, nan, nan, 'D17', nan, nan, nan, nan,
       nan, 'B96 B98', nan, nan, nan, 'A36', nan, nan, 'E8', nan, nan,
       nan, nan, nan, 'B102', nan, nan, nan, nan, 'B69', nan, nan, 'E121',
       nan, nan, nan, nan, nan, 'B28', nan, nan, nan, nan, nan, 'E49',
       nan, nan, nan, 'C47', nan, nan, nan, nan, nan, nan, nan, nan, nan,
       'C92', nan, nan, nan, 'D28', nan, nan, nan, 'E17', nan, nan, nan,
       nan, 'D17', nan, nan, nan, nan, 'A24', nan, nan, nan, 'D35',
       'B51 B53 B55', nan, nan, nan, nan, nan, nan, 'C50', nan, nan, nan,
       nan, nan, nan, nan, 'B42', nan, 'C148', nan], dtype=object)
df1['Cabin'].values
array([nan, 'C85', nan, 'C123', nan, nan, 'E46', nan, nan, nan, 'G6',
       'C103', nan, nan, nan, nan, nan, nan, nan, nan, nan, 'D56', nan,
       'A6', nan, nan, nan, 'C23 C25 C27', nan, nan, nan, 'B78', nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, 'D33', nan, 'B30', 'C52', nan, nan, nan,
       nan, nan, 'B28', 'C83', nan, nan, nan, 'F33', nan, nan, nan, nan,
       nan, nan, nan, nan, 'F G73', nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, 'C23 C25 C27', nan, nan, nan, 'E31', nan,
       nan, nan, 'A5', 'D10 D12', nan, nan, nan, nan, 'D26', nan, nan,
       nan, nan, nan, nan, nan, 'C110', nan, nan, nan, nan, nan, nan, nan,
       'B58 B60', nan, nan, nan, nan, 'E101', 'D26', nan, nan, nan,
       'F E69', nan, nan, nan, nan, nan, nan, nan, 'D47', 'C123', nan,
       'B86', nan, nan, nan, nan, nan, nan, nan, nan, 'F2', nan, nan,
       'C2', nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, 'E33', nan, nan, nan, 'B19', nan, nan, nan, 'A7', nan,
       nan, 'C49', nan, nan, nan, nan, nan, 'F4', nan, 'A32', nan, nan,
       nan, nan, nan, nan, nan, 'F2', 'B4', 'B80', nan, nan, nan, nan,
       nan, nan, nan, nan, nan, 'G6', nan, nan, nan, 'A31', nan, nan, nan,
       nan, nan, 'D36', nan, nan, 'D15', nan, nan, nan, nan, nan, 'C93',
       nan, nan, nan, nan, nan, 'C83', nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, 'C78', nan, nan, 'D35', nan,
       nan, 'G6', 'C87', nan, nan, nan, nan, 'B77', nan, nan, nan, nan,
       'E67', 'B94', nan, nan, nan, nan, 'C125', 'C99', nan, nan, nan,
       'C118', nan, 'D7', nan, nan, nan, nan, nan, nan, nan, nan, 'A19',
       nan, nan, nan, nan, nan, nan, 'B49', 'D', nan, nan, nan, nan,
       'C22 C26', 'C106', 'B58 B60', nan, nan, nan, 'E101', nan,
       'C22 C26', nan, 'C65', nan, 'E36', 'C54', 'B57 B59 B63 B66', nan,
       nan, nan, nan, nan, nan, 'C7', 'E34', nan, nan, nan, nan, nan,
       'C32', nan, 'D', nan, 'B18', nan, 'C124', 'C91', nan, nan, nan,
       'C2', 'E40', nan, 'T', 'F2', 'C23 C25 C27', nan, nan, nan, 'F33',
       nan, nan, nan, nan, nan, 'C128', nan, nan, nan, nan, 'E33', nan,
       nan, nan, nan, nan, nan, nan, nan, nan, 'D37', nan, nan, 'B35',
       'E50', nan, nan, nan, nan, nan, nan, 'C82', nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, 'B96 B98', nan, nan, 'D36',
       'G6', nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, 'C78', nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, 'E10', 'C52', nan,
       nan, nan, 'E44', 'B96 B98', nan, nan, 'C23 C25 C27', nan, nan, nan,
       nan, nan, nan, 'A34', nan, nan, nan, 'C104', nan, nan, 'C111',
       'C92', nan, nan, 'E38', 'D21', nan, nan, 'E12', nan, 'E63', nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, 'D', nan, 'A14', nan,
       nan, nan, nan, nan, nan, nan, nan, 'B49', nan, 'C93', 'B37', nan,
       nan, nan, nan, 'C30', nan, nan, nan, 'D20', nan, 'C22 C26', nan,
       nan, nan, nan, nan, 'B79', 'C65', nan, nan, nan, nan, nan, nan,
       'E25', nan, nan, 'D46', 'F33', nan, nan, nan, 'B73', nan, nan,
       'B18', nan, nan, nan, 'C95', nan, nan, nan, nan, nan, nan, nan,
       nan, 'B38', nan, nan, 'B39', 'B22', nan, nan, nan, 'C86', nan, nan,
       nan, nan, nan, 'C70', nan, nan, nan, nan, nan, 'A16', nan, 'E67',
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 'C101',
       'E25', nan, nan, nan, nan, 'E44', nan, nan, nan, 'C68', nan, 'A10',
       nan, 'E68', nan, 'B41', nan, nan, nan, 'D20', nan, nan, nan, nan,
       nan, nan, nan, 'A20', nan, nan, nan, nan, nan, nan, nan, nan, nan,
       'C125', nan, nan, nan, nan, nan, nan, nan, nan, 'F4', nan, nan,
       'D19', nan, nan, nan, 'D50', nan, 'D9', nan, nan, 'A23', nan,
       'B50', nan, nan, nan, nan, nan, nan, nan, nan, 'B35', nan, nan,
       nan, 'D33', nan, 'A26', nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, 'D48', nan, nan, 'E58', nan, nan, nan, nan, nan,
       nan, 'C126', nan, 'B71', nan, nan, nan, nan, nan, nan, nan,
       'B51 B53 B55', nan, 'D49', nan, nan, nan, nan, nan, nan, nan, 'B5',
       'B20', nan, nan, nan, nan, nan, nan, nan, 'C68', 'F G63',
       'C62 C64', 'E24', nan, nan, nan, nan, nan, 'E24', nan, nan, 'C90',
       'C124', 'C126', nan, nan, 'F G73', 'C45', 'E101', nan, nan, nan,
       nan, nan, nan, 'E8', nan, nan, nan, nan, nan, 'B5', nan, nan, nan,
       nan, nan, nan, 'B101', nan, nan, 'D45', 'C46', 'B57 B59 B63 B66',
       nan, nan, 'B22', nan, nan, 'D30', nan, nan, 'E121', nan, nan, nan,
       nan, nan, nan, nan, 'B77', nan, nan, nan, 'B96 B98', nan, 'D11',
       nan, nan, nan, nan, nan, nan, 'E77', nan, nan, nan, 'F38', nan,
       nan, 'B3', nan, 'B20', 'D6', nan, nan, nan, nan, nan, nan,
       'B82 B84', nan, nan, nan, nan, nan, nan, 'D17', nan, nan, nan, nan,
       nan, 'B96 B98', nan, nan, nan, 'A36', nan, nan, 'E8', nan, nan,
       nan, nan, nan, 'B102', nan, nan, nan, nan, 'B69', nan, nan, 'E121',
       nan, nan, nan, nan, nan, 'B28', nan, nan, nan, nan, nan, 'E49',
       nan, nan, nan, 'C47', nan, nan, nan, nan, nan, nan, nan, nan, nan,
       'C92', nan, nan, nan, 'D28', nan, nan, nan, 'E17', nan, nan, nan,
       nan, 'D17', nan, nan, nan, nan, 'A24', nan, nan, nan, 'D35',
       'B51 B53 B55', nan, nan, nan, nan, nan, nan, 'C50', nan, nan, nan,
       nan, nan, nan, nan, 'B42', nan, 'C148', nan], dtype=object)

1.4.5 Task 5: Load the file "test_1.csv", compare it with "train.csv" to find the extra columns, and then delete them

test1df=pd.read_csv('test.csv')
test1df
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
413 1305 3 Spector, Mr. Woolf male NaN 0 0 A.5. 3236 8.0500 NaN S
414 1306 1 Oliva y Ocana, Dona. Fermina female 39.0 0 0 PC 17758 108.9000 C105 C
415 1307 3 Saether, Mr. Simon Sivertsen male 38.5 0 0 SOTON/O.Q. 3101262 7.2500 NaN S
416 1308 3 Ware, Mr. Frederick male NaN 0 0 359309 8.0500 NaN S
417 1309 3 Peter, Master. Michael J male NaN 1 1 2668 22.3583 NaN C

418 rows × 11 columns

df1.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
test1df.columns
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

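test1df has one column fewer than df1: "Survived" appears only in df1.

There are two ways to delete the extra column:
1. Call the drop method of the DataFrame object
2. Use del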

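# Using drop
# inplace controls whether the original object is modified; the default False is used here,
# so df1 is left unchanged and the second method can still be demonstrated
df1.drop('Survived',axis=1)  # axis=1 means drop a column
# Using del
del df1['Survived']
df1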
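
Rather than comparing the two column lists by eye, the extra columns can be computed. The sketch below uses Index.difference on two made-up one-row frames standing in for the train and test data.

```python
import pandas as pd

# Made-up stand-ins for the train and test frames.
train = pd.DataFrame({'PassengerId': [1], 'Survived': [0], 'Pclass': [3]})
test = pd.DataFrame({'PassengerId': [892], 'Pclass': [3]})

extra = train.columns.difference(test.columns)  # labels present in train but not in test
trimmed = train.drop(columns=extra)             # returns a copy; train is unchanged

print(extra.tolist())            # ['Survived']
print(trimmed.columns.tolist())  # ['PassengerId', 'Pclass']
print(train.columns.tolist())    # ['PassengerId', 'Survived', 'Pclass']
```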

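1.4.6 Task 6: Hide the columns ['PassengerId','Name','Age','Ticket'] and view only the remaining columns

There are two ways to "hide" columns:
1. Use the hide_columns method of the style attribute, which returns a Styler object with the given columns hidden, ready for display.
2. Use the drop method of the DataFrame with inplace=False, which produces a copy without the given columns, and display that copy.

# Using style.hide_columns (deprecated since pandas 1.4 in favor of Styler.hide with axis='columns')
df1.style.hide_columns(['PassengerId','Name','Age','Ticket'])
# Using drop
df1.drop(['PassengerId','Name','Age','Ticket'],axis=1,inplace=False)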
Pclass Sex SibSp Parch Fare Cabin Embarked
0 3 male 1 0 7.2500 NaN S
1 1 female 1 0 71.2833 C85 C
2 3 female 0 0 7.9250 NaN S
3 1 female 1 0 53.1000 C123 S
4 3 male 0 0 8.0500 NaN S
... ... ... ... ... ... ... ...
886 2 male 0 0 13.0000 NaN S
887 1 female 0 0 30.0000 B42 S
888 3 female 1 2 23.4500 NaN S
889 1 male 0 0 30.0000 C148 C
890 3 male 0 0 7.7500 NaN Q

891 rows × 7 columns
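
A third option, offered here only as a sketch, is to select the columns that are not in the hidden list using a boolean mask. Unlike drop with its default settings, this does not raise an error when a listed column is absent. The frame below is made up and deliberately lacks 'Age' and 'Ticket'.

```python
import pandas as pd

sample = pd.DataFrame({
    'PassengerId': [1, 2],
    'Name': ['A', 'B'],        # placeholder names
    'Pclass': [3, 1],
    'Sex': ['male', 'female'],
})
hidden = ['PassengerId', 'Name', 'Age', 'Ticket']

# Keep every column whose label is not in the hidden list.
visible = sample.loc[:, ~sample.columns.isin(hidden)]

print(visible.columns.tolist())  # ['Pclass', 'Sex']
print(sample.shape)              # (2, 4): the original frame is untouched
```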

1.5 筛选的逻辑

1.5.1 任务一: 我们以"Age"为筛选条件,显示年龄在10岁以下的乘客信息。

df1[df1.Age<10]
.dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
7 8 3 Palsson, Master. Gosta Leonard male 2.00 3 1 349909 21.0750 NaN S
10 11 3 Sandstrom, Miss. Marguerite Rut female 4.00 1 1 PP 9549 16.7000 G6 S
16 17 3 Rice, Master. Eugene male 2.00 4 1 382652 29.1250 NaN Q
24 25 3 Palsson, Miss. Torborg Danira female 8.00 3 1 349909 21.0750 NaN S
43 44 2 Laroche, Miss. Simonne Marie Anne Andree female 3.00 1 2 SC/Paris 2123 41.5792 NaN C
... ... ... ... ... ... ... ... ... ... ... ...
827 828 2 Mallet, Master. Andre male 1.00 0 2 S.C./PARIS 2079 37.0042 NaN C
831 832 2 Richards, Master. George Sibley male 0.83 1 1 29106 18.7500 NaN S
850 851 3 Andersson, Master. Sigvard Harald Elias male 4.00 4 2 347082 31.2750 NaN S
852 853 3 Boulos, Miss. Nourelain female 9.00 1 1 2678 15.2458 NaN C
869 870 3 Johnson, Master. Harold Theodor male 4.00 1 1 347742 11.1333 NaN S

62 rows × 11 columns
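
The expression df1.Age<10 builds a boolean Series, and indexing with it keeps only the rows marked True. One detail worth knowing is that a comparison with a missing age (NaN) evaluates to False, so passengers without a recorded age never pass the filter. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; passenger B has no recorded age
toy = pd.DataFrame({"Name": ["A", "B", "C", "D"],
                    "Age": [2.0, np.nan, 35.0, 9.0]})

mask = toy["Age"] < 10
print(mask.tolist())           # [True, False, False, True]

young = toy[mask]
print(young["Name"].tolist())  # ['A', 'D']
```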

1.5.2 Task 2: Using "Age" as the condition, display the passengers older than 10 and younger than 50, and name this data midage

midage=df1[(df1.Age>10) & (df1.Age<50)]
midage
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
885 886 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
886 887 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
889 890 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

576 rows × 11 columns
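
Each condition needs its own parentheses because & binds more tightly than the comparison operators in Python. The sketch below, on a made-up Age column, shows the masked form next to an equivalent DataFrame.query call; both bounds are exclusive, so ages of exactly 10 and 50 are left out.

```python
import pandas as pd

# Hypothetical ages, including the boundary values 10 and 50
toy = pd.DataFrame({"Age": [5.0, 10.0, 30.0, 49.0, 50.0, 70.0]})

# Boolean-mask form, as used for midage
mid_mask = toy[(toy["Age"] > 10) & (toy["Age"] < 50)]

# The same filter written as a query string
mid_query = toy.query("Age > 10 and Age < 50")

print(mid_mask["Age"].tolist())    # [30.0, 49.0]
print(mid_mask.equals(mid_query))  # True
```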

1.5.3 Task 3: Display the "Pclass" and "Sex" values in row 100 of midage

midage.iloc[[99],[1,3]] 
Pclass Sex
148 2 male

1.5.4 Task 4: Use the loc method to display the "Pclass", "Name" and "Sex" values in rows 100, 105 and 108 of midage

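The task asks for loc, but loc addresses rows by index label rather than by position, so the index must first be rebuilt so that the labels match the row numbers.
Use reset_index to discard the old index (the unnamed leftmost column), passing drop=True to throw the old index away. Without drop=True the old index is put back into the data as a column named index, as shown below: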

midage.reset_index()
index PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 1 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 1 2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 2 3 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 3 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 4 5 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
571 885 886 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
572 886 887 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
573 887 888 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
574 889 890 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
575 890 891 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

576 rows × 12 columns

The code below completes the task after rebuilding the index.

pd.options.display.max_rows=10
midage.reset_index(drop=True).loc[[99,104,107],['Pclass','Name','Sex']]
Pclass Name Sex
99 2 Navratil, Mr. Michel ("Louis M Hoffman") male
104 3 Corn, Mr. Harry male
107 3 Bengtsson, Mr. John Viktor male
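
Why the index has to be rebuilt can be seen on a small example: filtering keeps the original labels, so after a filter the label of a row and its position no longer agree. The frame toy below is made up for illustration.

```python
import pandas as pd

# Hypothetical ages with the default labels 0..4
toy = pd.DataFrame({"Age": [22.0, 5.0, 38.0, 60.0, 26.0]})

# Filtering keeps the original labels of the surviving rows
mid = toy[(toy["Age"] > 10) & (toy["Age"] < 50)]
print(mid.index.tolist())    # [0, 2, 4]

# The second row: position 1 for iloc, but label 2 for loc
print(mid.iloc[1]["Age"])    # 38.0
print(mid.loc[2, "Age"])     # 38.0

# After reset_index(drop=True) labels and positions agree again
renumbered = mid.reset_index(drop=True)
print(renumbered.loc[1, "Age"])  # 38.0
```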

1.5.5 Task 5: Use the iloc method to display the "Pclass", "Name" and "Sex" values in rows 100, 105 and 108 of midage

newMidage=midage.reset_index(drop=True)
col=list(newMidage.columns) # used to look up the positions of the required columns
newMidage.iloc[[99,104,107],[col.index('Pclass'),col.index('Name'),col.index('Sex')]]
Pclass Name Sex
99 2 Navratil, Mr. Michel ("Louis M Hoffman") male
104 3 Corn, Mr. Harry male
107 3 Bengtsson, Mr. John Viktor male
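
As an alternative to converting the columns to a list and calling index for each name, the column Index has a get_indexer method that returns all positions in one call. The sketch below uses a made-up frame; note that get_indexer returns -1 for a label that does not exist, so the result is worth checking before passing it to iloc.

```python
import pandas as pd

# Hypothetical toy frame with a few Titanic-style columns
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Pclass": [3, 1, 2],
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "male"],
})

# Integer positions of the wanted columns
col_pos = toy.columns.get_indexer(["Pclass", "Name", "Sex"])
print(col_pos.tolist())  # [1, 2, 3]

# Rows by position, columns by the looked-up positions
subset = toy.iloc[[0, 2], col_pos]
print(list(subset.columns))  # ['Pclass', 'Name', 'Sex']
```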

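1.6 Do you know your data?

# Load the previously saved train_chinese.csv; the Titanic tasks below use this data
# Its column names are Chinese versions of the original headers, in the same order:
# 乘客ID=PassengerId, 是否幸存=Survived, 年龄=Age, 堂兄弟/妹个数=SibSp, 父母与小孩个数=Parch, 票价=Fare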
df=pd.read_csv('train_chinese.csv')
pd.options.display.max_rows=6
df
Unnamed: 0 乘客ID 是否幸存 乘客等级(1/2/3等舱位) 乘客姓名 性别 年龄 堂兄弟/妹个数 父母与小孩个数 船票信息 票价 客舱 登船港口
0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ... ...
888 888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 13 columns

1.6.1 Task 1: Sort the sample data with Pandas in ascending order

First build a DataFrame that contains only numbers

dfTmp=pd.DataFrame(np.random.randint(0,10,size=(4,4)),index=[2,1,0,3],columns=['d','a','b','c'])
dfTmp
d a b c
2 1 2 8 3
1 7 3 9 3
0 0 1 8 9
3 1 9 9 9

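Use the sort_values method of the DataFrame: the by parameter names the column or columns used as sort keys, and the ascending parameter decides whether the order is ascending.

All the sorting calls below keep inplace at its default value False, so nothing is written back to the original object.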

dfTmp.sort_values(by=['d'],ascending=True)
d a b c
0 0 1 8 9
2 1 2 8 3
3 1 9 9 9
1 7 3 9 3

The different ways of sorting are summarized below:

1. Sort the row index in ascending order

dfTmp.sort_index(axis=0)
d a b c
0 0 1 8 9
1 7 3 9 3
2 1 2 8 3
3 1 9 9 9

2. Sort the column index in ascending order

dfTmp.sort_index(axis=1)
a b c d
2 2 8 3 1
1 3 9 3 7
0 1 8 9 0
3 9 9 9 1

3. Sort the column index in descending order

dfTmp.sort_index(axis=1,ascending=False)
d c b a
2 1 3 8 2
1 7 3 9 3
0 0 9 8 1
3 1 9 9 9

4. Pick any two columns as the primary and secondary sort keys and sort both in descending order

dfTmp.sort_values(by=['b','d'],ascending=False)
d a b c
1 7 3 9 3
3 1 9 9 9
2 1 2 8 3
0 0 1 8 9
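
The ascending parameter also accepts a list with one flag per key, which allows a different direction for each column. The sketch below rebuilds the values of dfTmp shown above (under the made-up name toy, since the original frame is random) and sorts by 'b' descending, breaking ties by 'd' ascending.

```python
import pandas as pd

# Same values as the dfTmp output above, fixed so the result is reproducible
toy = pd.DataFrame(
    {"d": [1, 7, 0, 1], "a": [2, 3, 1, 9], "b": [8, 9, 8, 9], "c": [3, 3, 9, 9]},
    index=[2, 1, 0, 3],
)

# 'b' descending first; rows with equal 'b' are ordered by 'd' ascending
mixed = toy.sort_values(by=["b", "d"], ascending=[False, True])
print(mixed.index.tolist())  # [3, 1, 0, 2]
```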

1.6.2 Task 2: Sort the Titanic data (train.csv) by the fare and age columns together (in descending order). What can you learn from this data?

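# Number of survivors among the 20 passengers with the highest fares ('票价' = fare, '是否幸存' = survived)
df.sort_values(by=['票价'],ascending=False).head(20).是否幸存.sum()
14
# Number of survivors among the 20 passengers with the lowest fares
df.sort_values(by=['票价']).head(20).是否幸存.sum()
1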

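This shows a large gap in survival between the highest-fare and lowest-fare passengers (14 versus 1 survivors out of 20).
At the same time, the fare itself is often affected by the length of the journey.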

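# Number of survivors among the 10 youngest passengers ('年龄' = age)
df.sort_values(by=['年龄']).head(10).是否幸存.sum()
9
# Number of survivors among the last 10 rows when sorted by age.
# sort_values puts missing ages (NaN) last by default, so these rows are passengers with no recorded age,
# not the 10 oldest; call dropna(subset=['年龄']) before sorting to get the oldest.
df.sort_values(by=['年龄']).tail(10).是否幸存.sum()
2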

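Nine of the 10 youngest passengers survived, which suggests that most of the chances of survival were given to children.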
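
One caveat when combining sort_values with tail: rows whose sort key is missing are placed last by default, so tail returns the rows with NaN rather than the largest values. The sketch below uses a made-up frame to show the effect and one possible way around it (dropping rows with a missing key before sorting).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; passenger B has no recorded age
toy = pd.DataFrame({"Name": ["A", "B", "C", "D"],
                    "Age": [30.0, np.nan, 80.0, 4.0]})

# Default behavior: the NaN row ends up last
by_age = toy.sort_values(by="Age")
print(by_age["Name"].tolist())   # ['D', 'A', 'C', 'B']

# Drop missing ages first to get the truly oldest passenger
oldest = toy.dropna(subset=["Age"]).sort_values(by="Age").tail(1)
print(oldest["Name"].tolist())   # ['C']
```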

1.6.3 Task 3: Use Pandas arithmetic to compute the sum of two DataFrames

df1=pd.DataFrame(np.random.randint(0,10,size=(4,4)),index=[0,1,2,3],columns=['a','b','c','d'])
df1
a b c d
0 5 8 9 8
1 1 3 8 9
2 2 0 1 7
3 5 8 2 7
df2=pd.DataFrame(np.random.randint(0,10,size=(4,4)),index=[1,2,3,4],columns=['a','b','e','f'])
df2
a b e f
1 3 2 5 3
2 6 5 1 6
3 8 1 0 8
4 9 8 4 8

Add df1 and df2

df1+df2
a b c d e f
0 NaN NaN NaN NaN NaN NaN
1 4.0 5.0 NaN NaN NaN NaN
2 8.0 5.0 NaN NaN NaN NaN
3 13.0 9.0 NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN

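Adding two DataFrames returns a new DataFrame: values whose row and column labels exist in both frames are added together, and every position missing from either frame becomes NaN.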
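
If NaN is not wanted where only one of the frames has a value, the add method with fill_value is one option: the missing side is treated as the fill value, and only positions missing from both frames stay NaN. The sketch below reuses the values printed above under the made-up names left and right.

```python
import pandas as pd

# Same values as the df1 and df2 outputs above, fixed for reproducibility
left = pd.DataFrame(
    [[5, 8, 9, 8], [1, 3, 8, 9], [2, 0, 1, 7], [5, 8, 2, 7]],
    index=[0, 1, 2, 3], columns=["a", "b", "c", "d"],
)
right = pd.DataFrame(
    [[3, 2, 5, 3], [6, 5, 1, 6], [8, 1, 0, 8], [9, 8, 4, 8]],
    index=[1, 2, 3, 4], columns=["a", "b", "e", "f"],
)

filled = left.add(right, fill_value=0)

print(filled.loc[1, "a"])  # present in both frames: 1 + 3
print(filled.loc[0, "a"])  # only in left: 5 + fill value 0
print(filled.loc[0, "e"])  # in neither frame at this position: stays NaN
```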

1.6.4 Task 4: Using the Titanic data, how can we work out how many people were in the largest family on board?

df
Unnamed: 0 乘客ID 是否幸存 乘客等级(1/2/3等舱位) 乘客姓名 性别 年龄 堂兄弟/妹个数 父母与小孩个数 船票信息 票价 客舱 登船港口
0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ... ...
888 888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 13 columns

max(df['堂兄弟/妹个数']+df['父母与小孩个数'])
10
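
The two columns added here correspond to SibSp and Parch in the original file, which count a passenger's relatives on board and do not include the passenger. Whether the passenger should be counted as part of the family is a matter of interpretation; the sketch below, on made-up numbers, shows both readings.

```python
import pandas as pd

# Hypothetical relative counts for four passengers
toy = pd.DataFrame({"SibSp": [1, 0, 8, 4], "Parch": [0, 0, 2, 2]})

# Relatives on board, excluding the passenger
relatives = toy["SibSp"] + toy["Parch"]
print(relatives.max())      # 10

# Family size if the passenger is counted as well (an assumption)
print(relatives.max() + 1)  # 11
```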

1.6.5 Task 5: Learn to use the Pandas describe() function to view basic statistics of the data

df1
a b c d
0 5 8 9 8
1 1 3 8 9
2 2 0 1 7
3 5 8 2 7
pd.options.display.max_rows=20
df1.describe()
a b c d
count 4.000000 4.000000 4.000000 4.000000
mean 3.250000 4.750000 5.000000 7.750000
std 2.061553 3.947573 4.082483 0.957427
min 1.000000 0.000000 1.000000 7.000000
25% 1.750000 2.250000 1.750000 7.000000
50% 3.500000 5.500000 5.000000 7.500000
75% 5.000000 8.000000 8.250000 8.250000
max 5.000000 8.000000 9.000000 9.000000

1.6.6 Task 6: Look at the basic statistics of the fare and parents/children columns in the Titanic dataset. What do you notice?

df.describe()
Unnamed: 0 乘客ID 是否幸存 乘客等级(1/2/3等舱位) 年龄 堂兄弟/妹个数 父母与小孩个数 票价
count 891.000000 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 445.000000 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 222.500000 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 445.000000 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 667.500000 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 890.000000 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

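More than 75% of passengers paid no more than the mean fare: the 75th percentile (31.00) is below the mean (32.20).
If the effect of journey length on the fare is ignored, this suggests a fairly large wealth gap among the passengers on board.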
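
The share of values at or below the mean can also be computed directly, instead of being read off the quartiles. The sketch below uses a made-up, right-skewed series called fare (not the real Titanic fares) to show the idea: a few very large values pull the mean well above the median.

```python
import pandas as pd

# Hypothetical right-skewed fares, chosen only for illustration
fare = pd.Series([5, 5, 10, 10, 10, 20, 20, 30, 90, 300], name="Fare")

print(fare.mean())    # 50.0
print(fare.median())  # 15.0

# Fraction of values that do not exceed the mean
share_below_mean = (fare <= fare.mean()).mean()
print(share_below_mean)  # 0.8
```

On the real data the same expression can be applied to the fare column of df.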

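Compute the statistics for a few more columns and see what you can find.

Among passengers with a recorded age, the mean age is about 30. Half of them are 28 or younger and only a quarter are older than 38, so the passengers are fairly young overall.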
