常用的pandas处理数据的命令

使用pandas处理table数据，代码汇总

example 0:取出某几列/某几行
取出某几列：

1	df.loc[:,['State','RegionName']]

取出某几个位置的几列：

1	df_housing_price.iloc[:, [1, 3]]

取出df的10-20行：

df[10:20]

从a列到b列的subset:

1	df_housing_price = df.loc[:, 'a':'b']

example 1:寻找dataframe某一列（’Gold’)的最大值以及最大值所在的index

1 2	df['Gold'].max() country = Top15['ratio'].argmax() #找到top15数据集里ratio列中最大值的index(此处index为country)

example 2:找到某个观测值的某几列最大值（比如从2001-01到2001-09）
先将这几列取出来变成df_sub，比如：

1 2	df_sub = census_df[["POPESTIMATE2010","POPESTIMATE2011"] max = df2.max(axis=1)

(axis=1表示看行，默认是axis=0,看列）

example 3:赋值
直接用

1	df['a'] == xx

将第a列赋值为xx

or:

1	df['Gold'] = df['Gold_winter']+ df ['Gold_summer']

总的金牌数等于冬天的金牌数加上夏天的

example 4：找到得到最多金牌数的三行
先按照想要的列sort，默认ascending

1	census=census_df.sort_values(by=['CENSUS2010POP'])

return最后的三个

1	return census['STNAME'].tail(3)

example 5：有条件地选取列
当POPESTIMATE2015>POPESTIMATE2014列时

1	counties = census_df[(census_df["POPESTIMATE2015"] > census_df['POPESTIMATE2014'])]

当region列等于1或者2时：

1	counties1= counties[(census_df["REGION"] == 2) \| (counties["REGION"] == 1 )]

当CTYNAME这一列含有Washington时

1	counties2=counties1[counties1['CTYNAME'].str.contains('Washington')]

example 7: rename

1	energy.rename(columns={'old': 'new', 'old1': 'new1'}, inplace=True)

example 8: apply函数的使用
apply函数对dataframe列中每一个obs采取措施并且返回一个拼接好的series，有很大的用处。
比如：
将energy（dataframe）中的Country列名中的数字结尾和类似于（xxxstate）的格式去掉

def clean_names(data):
    # 将末尾数字去掉并且合并成一个新的list
    newData = ''.join([i for i in data if not i.isdigit()])
    #去掉末尾的（state），先找到"（"所在的位置，并且将其去掉
    i = newData.find('(')
    if i>-1: 
        newData = newData[:i]
    return newData.strip()
energy['Country']= energy['Country'].apply(clean_names)

example 9: 将旧的obs的值替换成新的

#先将旧的值和将要替换的值变成一个dict
di = {"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"}
#用replace替换掉原来的值
energy.replace({"Country": di},inplace = True)

example 10: dataframe group by a dict
假设有一个dataframe(Top15)有很多country的信息，包括人口等。一个dict，mapping country to continent。现在要create 新的dataframe，用大洲信息来做index，包括了每个大洲的人口的std, mean,每个大洲的size(在dataset里含有几个country)等

ContinentDict  = {'China':'Asia', 
                  'United States':'North America', 
                  'Japan':'Asia', 
                  'United Kingdom':'Europe', 
                  'Russian Federation':'Europe', 
                  'Canada':'North America', 
                  'Germany':'Europe', 
                  'India':'Asia',
                  'France':'Europe', 
                  'South Korea':'Asia', 
                  'Italy':'Europe', 
                  'Spain':'Europe', 
                  'Iran':'Asia',
                  'Australia':'Australia', 
                  'Brazil':'South America'}
先建立一个新的dataframe（最终的）叫groups
groups = pd.DataFrame(columns = ['size', 'sum', 'mean', 'std'])   
for group, frame in Top15.groupby(ContinentDict):
    groups.loc[group] = [len(frame), frame['Estimate Population'].sum(),frame['Estimate Population'].mean(),frame['Estimate Population'].std()]
    return groups
#此处的group即为每一个大洲，frame则是top15里相对应的每一组所含有的countries的所有列信息

example 11: 用每一组的第一行的数值去填充

df['State'] = df['State'].ffill()
原来比如:                   现在：
State1                      State1
None                        State1
None                        State1
State2                      State2
None                        State2
None                        State2

example 12: 将某一列变成list并且从中找到某个数值的位置

1 2	diff = df['gdp_diff'].tolist() i=diff.index('xxx')

example 13: 在dataframe里求lag

1 2	#shift方法 df['lag_gdp'] = df['GDP in billions of chained 2009 dollars'].shift(1)

example 14: 根据条件创建新的列并且构建subset

#当RegionName列的数在townlist_new的list里是，record as true，否则则是false
df_want['isUniTown'] = df_want['RegionName'].apply(lambda x: x in townlist_new)
#根据df_want['isUniTown']列值为True 和False构建两个df subset
uni_data = df_want[df_want.isUniTown].copy().dropna()
non_uni_data = df_want[~ df_want.isUniTown].copy().dropna()