常用的pandas处理数据的命令

使用pandas处理table数据,代码汇总

example 0:取出某几列/某几行
取出某几列:

1
df.loc[:,['State','RegionName']]

取出某几个位置的几列:

1
df_housing_price.iloc[:, [1, 3]]

取出df的10-20行:

1
df[10:20]

从a列到b列的subset:

1
df_housing_price = df.loc[:, 'a':'b']

example 1:寻找dataframe某一列(’Gold’)的最大值以及最大值所在的index

1
2
df['Gold'].max()
country = Top15['ratio'].argmax() #找到top15数据集里ratio列中最大值的index(此处index为country)

example 2:找到某个观测值的某几列最大值(比如从2001-01到2001-09)
先将这几列取出来变成df_sub,比如:

1
2
df_sub = census_df[["POPESTIMATE2010","POPESTIMATE2011"]
max = df2.max(axis=1)

(axis=1表示看行,默认是axis=0,看列)

example 3:赋值
直接用

1
df['a'] == xx

将第a列赋值为xx

or:

1
df['Gold'] = df['Gold_winter']+ df ['Gold_summer']

总的金牌数等于冬天的金牌数加上夏天的

example 4:找到得到最多金牌数的三行
先按照想要的列sort,默认ascending

1
census=census_df.sort_values(by=['CENSUS2010POP'])

return最后的三个

1
return census['STNAME'].tail(3)

example 5:有条件地选取列
当POPESTIMATE2015>POPESTIMATE2014列时

1
counties = census_df[(census_df["POPESTIMATE2015"] > census_df['POPESTIMATE2014'])]

当region列等于1或者2时:

1
counties1= counties[(census_df["REGION"] == 2) | (counties["REGION"] == 1 )]

当CTYNAME这一列含有Washington时

1
counties2=counties1[counties1['CTYNAME'].str.contains('Washington')]

example 7: rename

1
energy.rename(columns={'old': 'new', 'old1': 'new1'}, inplace=True)

example 8: apply函数的使用
apply函数对dataframe列中每一个obs采取措施并且返回一个拼接好的series,有很大的用处。
比如:
将energy(dataframe)中的Country列名中的数字结尾和类似于(xxxstate)的格式去掉

1
2
3
4
5
6
7
8
9
def clean_names(data):
# 将末尾数字去掉并且合并成一个新的list
newData = ''.join([i for i in data if not i.isdigit()])
#去掉末尾的(state),先找到"("所在的位置,并且将其去掉
i = newData.find('(')
if i>-1:
newData = newData[:i]
return newData.strip()
energy['Country']= energy['Country'].apply(clean_names)

example 9: 将旧的obs的值替换成新的

1
2
3
4
5
6
7
#先将旧的值和将要替换的值变成一个dict
di = {"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"}
#用replace替换掉原来的值
energy.replace({"Country": di},inplace = True)

example 10: dataframe group by a dict
假设有一个dataframe(Top15)有很多country的信息,包括人口等。一个dict,mapping country to continent。现在要create 新的dataframe,用大洲信息来做index,包括了每个大洲的人口的std, mean,每个大洲的size(在dataset里含有几个country)等

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
先建立一个新的dataframe(最终的)叫groups
groups = pd.DataFrame(columns = ['size', 'sum', 'mean', 'std'])
for group, frame in Top15.groupby(ContinentDict):
groups.loc[group] = [len(frame), frame['Estimate Population'].sum(),frame['Estimate Population'].mean(),frame['Estimate Population'].std()]
return groups
#此处的group即为每一个大洲,frame则是top15里相对应的每一组所含有的countries的所有列信息

example 11: 用每一组的第一行的数值去填充

1
2
3
4
5
6
7
8
df['State'] = df['State'].ffill()
原来比如: 现在:
State1 State1
None State1
None State1
State2 State2
None State2
None State2

example 12: 将某一列变成list并且从中找到某个数值的位置

1
2
diff = df['gdp_diff'].tolist()
i=diff.index('xxx')

example 13: 在dataframe里求lag

1
2
#shift方法
df['lag_gdp'] = df['GDP in billions of chained 2009 dollars'].shift(1)

example 14: 根据条件创建新的列并且构建subset

1
2
3
4
5
#当RegionName列的数在townlist_new的list里是,record as true,否则则是false
df_want['isUniTown'] = df_want['RegionName'].apply(lambda x: x in townlist_new)
#根据df_want['isUniTown']列值为True 和False构建两个df subset
uni_data = df_want[df_want.isUniTown].copy().dropna()
non_uni_data = df_want[~ df_want.isUniTown].copy().dropna()