Pandas - GroupBy

  • 简述

    任何groupby操作涉及对原始对象的以下操作之一。他们是 -
    • Splitting对象
    • Applying一个函数
    • Combining结果集
    在许多情况下,我们将数据拆分为多个集合,并对每个子集应用一些功能。在应用功能中,我们可以执行以下操作 -
    • 聚合− 计算汇总统计量
    • 转换− 执行一些特定于组的操作
    • 清洗− 在某些条件下丢弃数据
    现在让我们创建一个 DataFrame 对象并对其执行所有操作 -
    
    #import the pandas library
    import pandas as pd
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
       'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
       'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
       'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
       'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    df = pd.DataFrame(ipl_data)
    print df
    
    它的输出如下 -
    
        Points   Rank     Team   Year
    0      876      1   Riders   2014
    1      789      2   Riders   2015
    2      863      2   Devils   2014
    3      673      3   Devils   2015
    4      741      3    Kings   2014
    5      812      4    kings   2015
    6      756      1    Kings   2016
    7      788      1    Kings   2017
    8      694      2   Riders   2016
    9      701      4   Royals   2014
    10     804      1   Royals   2015
    11     690      2   Riders   2017
    
  • 将数据拆分为组

    Pandas 对象可以拆分为它们的任何对象。有多种方法可以拆分对象,例如 -
    • obj.groupby('key')
    • obj.groupby(['key1','key2'])
    • obj.groupby(key,axis=1)
    现在让我们看看如何将分组对象应用于 DataFrame 对象

    例子

    
    # import the pandas library
    import pandas as pd
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
       'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
       'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
       'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
       'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    df = pd.DataFrame(ipl_data)
    print df.groupby('Team')
    
    它的输出如下 -
    
    <pandas.core.groupby.DataFrameGroupBy object at 0x7fa46a977e50>
    
  • 查看组

    
    # import the pandas library
    import pandas as pd
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
       'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
       'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
       'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
       'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    df = pd.DataFrame(ipl_data)
    print df.groupby('Team').groups
    
    它的输出如下 -
    
    {'Kings': Int64Index([4, 6, 7],      dtype='int64'),
    'Devils': Int64Index([2, 3],         dtype='int64'),
    'Riders': Int64Index([0, 1, 8, 11],  dtype='int64'),
    'Royals': Int64Index([9, 10],        dtype='int64'),
    'kings' : Int64Index([5],            dtype='int64')}
    

    例子

    Group by多列 -
    
    # import the pandas library
    import pandas as pd
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
       'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
       'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
       'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
       'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    df = pd.DataFrame(ipl_data)
    print df.groupby(['Team','Year']).groups
    
    它的输出如下 -
    
    {('Kings', 2014): Int64Index([4], dtype='int64'),
     ('Royals', 2014): Int64Index([9], dtype='int64'),
     ('Riders', 2014): Int64Index([0], dtype='int64'),
     ('Riders', 2015): Int64Index([1], dtype='int64'),
     ('Kings', 2016): Int64Index([6], dtype='int64'),
     ('Riders', 2016): Int64Index([8], dtype='int64'),
     ('Riders', 2017): Int64Index([11], dtype='int64'),
     ('Devils', 2014): Int64Index([2], dtype='int64'),
     ('Devils', 2015): Int64Index([3], dtype='int64'),
     ('kings', 2015): Int64Index([5], dtype='int64'),
     ('Royals', 2015): Int64Index([10], dtype='int64'),
     ('Kings', 2017): Int64Index([7], dtype='int64')}
    
  • 遍历组

    随着groupby手头的对象,我们可以像 itertools.obj 一样遍历对象。
    
    # import the pandas library
    import pandas as pd
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
       'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
       'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
       'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
       'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    df = pd.DataFrame(ipl_data)
    grouped = df.groupby('Year')
    for name,group in grouped:
       print name
       print group
    
    它的输出如下 -
    
    2014
       Points  Rank     Team   Year
    0     876     1   Riders   2014
    2     863     2   Devils   2014
    4     741     3   Kings    2014
    9     701     4   Royals   2014
    2015
       Points  Rank     Team   Year
    1     789     2   Riders   2015
    3     673     3   Devils   2015
    5     812     4    kings   2015
    10    804     1   Royals   2015
    2016
       Points  Rank     Team   Year
    6     756     1    Kings   2016
    8     694     2   Riders   2016
    2017
       Points  Rank    Team   Year
    7     788     1   Kings   2017
    11    690     2  Riders   2017
    
    默认情况下,groupby对象具有与组名称相同的标签名称。
  • 选择一个组

    使用get_group()方法,我们可以选择单个组。
    
    # import the pandas library
    import pandas as pd
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
       'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
       'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
       'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
       'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    df = pd.DataFrame(ipl_data)
    grouped = df.groupby('Year')
    print grouped.get_group(2014)
    
    它的输出如下 -
    
       Points  Rank     Team    Year
    0     876     1   Riders    2014
    2     863     2   Devils    2014
    4     741     3   Kings     2014
    9     701     4   Royals    2014
    
  • 聚合

    聚合函数为每个组返回一个聚合值。一旦group by创建对象后,可以对分组的数据执行多个聚合操作。
    一个明显的方法是通过聚合或等价物进行聚合agg方法 -
    
    # import the pandas library
    import pandas as pd
    import numpy as np
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
       'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
       'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
       'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
       'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    df = pd.DataFrame(ipl_data)
    grouped = df.groupby('Year')
    print grouped['Points'].agg(np.mean)
    
    它的输出如下 -
    
    Year
    2014   795.25
    2015   769.50
    2016   725.00
    2017   739.00
    Name: Points, dtype: float64
    
    查看每个组大小的另一种方法是应用 size() 函数 -
    
    import pandas as pd
    import numpy as np
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
       'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
       'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
       'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
       'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    df = pd.DataFrame(ipl_data)
    Attribute Access in Python Pandas
    grouped = df.groupby('Team')
    print grouped.agg(np.size)
    
    它的输出如下 -
    
             Points   Rank   Year
    Team
    Devils        2      2      2
    Kings         3      3      3
    Riders        4      4      4
    Royals        2      2      2
    kings         1      1      1
    

    一次应用多个聚合函数

    使用分组系列,您还可以通过list要么dict of functions进行聚合,并生成 DataFrame 作为输出 -
    
    # import the pandas library
    import pandas as pd
    import numpy as np
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
       'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
       'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
       'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
       'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    df = pd.DataFrame(ipl_data)
    grouped = df.groupby('Team')
    print grouped['Points'].agg([np.sum, np.mean, np.std])
    
    它的输出如下 -
    
    Team      sum      mean          std
    Devils   1536   768.000000   134.350288
    Kings    2285   761.666667    24.006943
    Riders   3049   762.250000    88.567771
    Royals   1505   752.500000    72.831998
    kings     812   812.000000          NaN
    
  • 转换

    对组或列的转换返回一个对象,该对象的索引与被分组的大小相同。因此,转换应返回与组块大小相同的结果。
    
    # import the pandas library
    import pandas as pd
    import numpy as np
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
       'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
       'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
       'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
       'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    df = pd.DataFrame(ipl_data)
    grouped = df.groupby('Team')
    score = lambda x: (x - x.mean()) / x.std()*10
    print grouped.transform(score)
    
    它的输出如下 -
    
           Points        Rank        Year
    0   12.843272  -15.000000  -11.618950
    1   3.020286     5.000000   -3.872983
    2   7.071068    -7.071068   -7.071068
    3  -7.071068     7.071068    7.071068
    4  -8.608621    11.547005  -10.910895
    5        NaN          NaN         NaN
    6  -2.360428    -5.773503    2.182179
    7  10.969049    -5.773503    8.728716
    8  -7.705963     5.000000    3.872983
    9  -7.071068     7.071068   -7.071068
    10  7.071068    -7.071068    7.071068
    11 -8.157595     5.000000   11.618950
    
  • 过滤

    过滤根据定义的标准过滤数据并返回数据子集。这filter()函数用于过滤数据。
    
    import pandas as pd
    import numpy as np
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
       'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
       'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
       'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
       'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    df = pd.DataFrame(ipl_data)
    print df.groupby('Team').filter(lambda x: len(x) >= 3)
    
    它的输出如下 -
    
        Points  Rank     Team   Year
    0      876     1   Riders   2014
    1      789     2   Riders   2015
    4      741     3   Kings    2014
    6      756     1   Kings    2016
    7      788     1   Kings    2017
    8      694     2   Riders   2016
    11     690     2   Riders   2017
    
    在上述过滤条件中,我们要求返回参加过 3 次或以上 IPL 的团队。