
Conditional Sums For Pandas Aggregate

I recently switched from R to Python and have had some trouble getting used to data frames again, as opposed to R's data.table. The problem I've been havin

Solution 1:

To complement unutbu's answer, here's an approach using apply on the groupby object.

>>> df.groupby('A_id').apply(lambda x: pd.Series(dict(
    sum_up=(x.B == 'up').sum(),
    sum_down=(x.B == 'down').sum(),
    over_200_up=((x.B == 'up') & (x.C > 200)).sum()
)))
      over_200_up  sum_down  sum_up
A_id                               
a1              0         0       1
a2              0         1       0
a3              1         0       2
a4              0         0       0
a5              0         0       0
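The snippet above relies on an existing `df`. As a minimal self-contained sketch, assuming the sample frame from Solution 2 matches the asker's data, the same apply-based idea can be written with a named helper (`conditional_sums` is a hypothetical name introduced here). Selecting `['B', 'C']` before `apply` is a choice made for this sketch so that the grouping column is not passed to the function, which newer pandas versions warn about.

```python
import pandas as pd

# Sample frame borrowed from Solution 2 (an assumption about the asker's data).
df = pd.DataFrame({'A_id': 'a1 a2 a3 a3 a4 a5'.split(),
                   'B': 'up down up up left right'.split(),
                   'C': [100, 102, 100, 250, 100, 102]})


def conditional_sums(group):
    """Count rows matching each condition within one A_id group."""
    is_up = group['B'] == 'up'
    return pd.Series({
        'sum_up': is_up.sum(),
        'sum_down': (group['B'] == 'down').sum(),
        'over_200_up': (is_up & (group['C'] > 200)).sum(),
    })


# Apply the helper to the B and C columns of every group.
result = df.groupby('A_id')[['B', 'C']].apply(conditional_sums)
print(result)
```

Summing a boolean Series counts its True values, which is why each condition reduces to a `.sum()` call.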

Solution 2:

There might be a better way; I'm pretty new to pandas, but this works:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A_id':'a1 a2 a3 a3 a4 a5'.split(),
                   'B': 'up down up up left right'.split(),
                   'C': [100, 102, 100, 250, 100, 102]})

df['D'] = (df['B']=='up') & (df['C'] > 200)
grouped = df.groupby(['A_id'])

def sum_up(grp):
    return np.sum(grp=='up')
def sum_down(grp):
    return np.sum(grp=='down')
def over_200_up(grp):
    return np.sum(grp)

result = grouped.agg({'B': [sum_up, sum_down],
                      'D': [over_200_up]})
result.columns = [col[1] for col in result.columns]
print(result)

yields

      sum_up  sum_down  over_200_up
A_id                               
a1         1         0            0
a2         0         1            0
a3         2         0            1
a4         0         0            0
a5         0         0            0
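The column-flattening step in Solution 2 can probably be avoided with named aggregation. The following is only a sketch of that alternative, not the answerer's method; it assumes pandas 1.0 or later, where several lambdas on the same column are accepted in named aggregation.

```python
import pandas as pd

df = pd.DataFrame({'A_id': 'a1 a2 a3 a3 a4 a5'.split(),
                   'B': 'up down up up left right'.split(),
                   'C': [100, 102, 100, 250, 100, 102]})

# Helper column, as in Solution 2: True where B is 'up' and C exceeds 200.
df['D'] = (df['B'] == 'up') & (df['C'] > 200)

# Each keyword names an output column; the tuple is (input column, aggregation).
result = df.groupby('A_id').agg(
    sum_up=('B', lambda s: (s == 'up').sum()),
    sum_down=('B', lambda s: (s == 'down').sum()),
    over_200_up=('D', 'sum'),
)
print(result)
```

Because the output names are given directly, the result already has flat column labels and no renaming is needed.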

Solution 3:

This is an old question, but I feel a better way, one that avoids apply, is to build a new dataframe of boolean columns before grouping and aggregating:


df = df.set_index('A_id')

outcome = {'sum_up' : df.B.eq('up'),
           'sum_down': df.B.eq('down'),
           'over_200_up' : df.B.eq('up') & df.C.gt(200)}

outcome = pd.DataFrame(outcome).groupby(level=0).sum()

outcome
 
      sum_up  sum_down  over_200_up
A_id                               
a1         1         0            0
a2         0         1            0
a3         2         0            1
a4         0         0            0
a5         0         0            0

Another option would be to unstack before grouping; however, I feel it is a longer, unnecessary process:

(df
  .set_index(['A_id', 'B'], append = True)
  .C
  .unstack('B')
  .assign(gt_200 = lambda df: df.up.gt(200))
  .groupby(level='A_id')
  .agg(sum_up=('up', 'count'), 
       sum_down =('down', 'count'), 
       over_200_up = ('gt_200', 'sum')
      )
)

      sum_up  sum_down  over_200_up
A_id                               
a1         1         0            0
a2         0         1            0
a3         2         0            1
a4         0         0            0
a5         0         0            0
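To see why `'count'` works in the unstack variant, it may help to look at the intermediate frame. This sketch assumes the original `df` with its default integer index (that is, before the `set_index('A_id')` call used earlier in this answer); the names `wide` and `counts` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A_id': 'a1 a2 a3 a3 a4 a5'.split(),
                   'B': 'up down up up left right'.split(),
                   'C': [100, 102, 100, 250, 100, 102]})

# Keep the row number in the index so every (row, A_id, B) entry is unique,
# then pivot the B labels into columns holding the C values.
wide = (df
        .set_index(['A_id', 'B'], append=True)
        .C
        .unstack('B'))
print(wide)

# Each row has a value only under its own B label and NaN elsewhere,
# so counting non-null cells per A_id counts occurrences of each label.
counts = wide.groupby(level='A_id').count()
print(counts)
```

Since `count` skips NaN, the `up` and `down` columns of `counts` correspond to `sum_up` and `sum_down` in the output above.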

Solution 4:

Here is what I recently learned, using DataFrame.assign and NumPy's where function:

df3=

   A_id       B    C
1:   a1    "up"  100
2:   a2  "down"  102
3:   a3    "up"  100
3:   a3    "up"  250
4:   a4  "left"  100
5:   a5 "right"  102

df3.assign(sum_up=np.where(df3['B'] == 'up', 1, 0),
           sum_down=np.where(df3['B'] == 'down', 1, 0),
           over_200_up=np.where((df3['B'] == 'up') & (df3['C'] > 200), 1, 0)
          ).groupby('A_id', as_index=False).agg({'sum_up': 'sum', 'sum_down': 'sum', 'over_200_up': 'sum'})

outcome=

   A_id  sum_up   sum_down  over_200_up
0   a1    1        0         0
1   a2    0        1         0
2   a3    2        0         1
3   a4    0        0         0
4   a5    0        0         0

This resembles SQL's CASE expression, if you are familiar with it and want to apply the same logic in pandas:

select a,
       sum(case when B = 'up' then 1 else 0 end) as sum_up
       ....
from table
group by a
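To check the SQL analogy against the pandas results, one option is to run a complete version of such a query with Python's built-in sqlite3 module. This is only a sketch: the table name `t`, the column names, and the two extra CASE expressions are assumptions filled in for illustration, not part of the query above.

```python
import sqlite3

# Same sample rows as the pandas examples (assumed to match the asker's data).
rows = [("a1", "up", 100), ("a2", "down", 102), ("a3", "up", 100),
        ("a3", "up", 250), ("a4", "left", 100), ("a5", "right", 102)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (A_id TEXT, B TEXT, C INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

# One SUM(CASE ...) per condition, mirroring the np.where columns.
query = """
SELECT A_id,
       SUM(CASE WHEN B = 'up' THEN 1 ELSE 0 END) AS sum_up,
       SUM(CASE WHEN B = 'down' THEN 1 ELSE 0 END) AS sum_down,
       SUM(CASE WHEN B = 'up' AND C > 200 THEN 1 ELSE 0 END) AS over_200_up
FROM t
GROUP BY A_id
ORDER BY A_id
"""
sql_result = conn.execute(query).fetchall()
conn.close()

for row in sql_result:
    print(row)
```

Each CASE expression plays the role of one `np.where(condition, 1, 0)` column, and `GROUP BY` with `SUM` corresponds to `groupby(...).agg(...)`.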
