df.groupby(...).agg(set) Produces Different Result Compared To df.groupby(...).agg(lambda x: set(x))

While answering this question, it turned out that df.groupby(...).agg(set) and df.groupby(...).agg(lambda x: set(x)) produce different results. Data: df = pd.DataFrame({ 'use

Solution 1:

OK, what is happening here is that set isn't handled in _aggregate because it fails the is_list_like check:

elif is_list_like(arg) and arg not in compat.string_types:

see source

Because set isn't is_list_like, None is returned up the call chain, ending up at this line:

results.append(colg.aggregate(a))

see source

This raises a TypeError: 'type' object is not iterable

which then raises:

if not len(results):
    raise ValueError("no results")

see source
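The TypeError quoted above is ordinary Python behaviour and can be seen without pandas. The sketch below uses a hypothetical helper, iteration_error, that is not part of pandas. It only shows that the class object set cannot be iterated even though it exposes an __iter__ attribute. It makes no claim about which exact check pandas performs internally.

```python
def iteration_error(obj):
    """Return the TypeError message raised by iterating obj, or None."""
    try:
        iter(obj)
    except TypeError as exc:
        return str(exc)
    return None


# The class object `set` is not iterable, although a set instance is.
print(iteration_error(set))        # 'type' object is not iterable
print(iteration_error({1, 2, 3}))  # None

# The class still exposes an __iter__ attribute (the unbound method),
# so a hasattr-based iterability check would be misled by it.
print(hasattr(set, '__iter__'))    # True
```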

so because we have no results we end up calling _aggregate_generic:

see source

this then calls:

result[name] = self._try_cast(func(data, *args, **kwargs), data)

see source

This then ends up as:

(Pdb) n
> c:\programdata\anaconda3\lib\site-packages\pandas\core\groupby.py(3779)_aggregate_generic()
-> return self._wrap_generic_output(result, obj)

(Pdb) result
{1: {'user_id', 'instructor', 'class_type'}, 2: {'user_id', 'instructor', 'class_type'}, 3: {'user_id', 'instructor', 'class_type'}, 4: {'user_id', 'instructor', 'class_type'}}

I'm running a slightly different version of pandas but the equivalent source line is https://github.com/pandas-dev/pandas/blob/v0.22.0/pandas/core/groupby.py#L3779

So essentially, because set counts as neither a function nor an iterable here, it collapses to calling the constructor on each group as a whole. Iterating over a DataFrame yields its column labels, so every group's result is the set of column names. You can see the same effect here:

In [8]:

df.groupby('user_id').agg(lambda x: print(set(x.columns)))
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
Out[8]: 
        class_type instructor
user_id                      
1             None       None
2             None       None
3             None       None
4             None       None

But when you use the lambda, which is an anonymous function, this works as expected.
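The difference between the two results can also be checked outside groupby. This is a minimal sketch, assuming a small made-up frame called group (the question's data is truncated above, so it is not reproduced here). It shows that set() applied to a DataFrame collects column labels, while set() applied to a single column collects that column's values.

```python
import pandas as pd

# Hypothetical stand-in for one group; not the question's original data.
group = pd.DataFrame({
    'user_id': [1, 1],
    'class_type': ['Krav Maga', 'Ju-jitsu'],
    'instructor': ['Bob', 'Alice'],
})

# Iterating a DataFrame yields its column labels,
# so set() on the whole frame collects the labels.
labels = set(group)

# Iterating a Series yields its values,
# so set() on one column collects that column's values.
values = set(group['class_type'])

print(labels)
print(values)
```

The first call mirrors what agg(set) ended up doing per group in the pdb output above, and the second mirrors what the lambda does per column.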

Solution 2:

Perhaps, as @Edchum commented, agg applies Python built-in functions to each group as a whole, treating it as a mini DataFrame, whereas a user-defined function is applied to every column separately. An example to illustrate this uses print.

df.groupby('user_id').agg(print,end='\n\n')

  class_type instructor  user_id
0  Krav Maga        Bob        1
4   Ju-jitsu      Alice        1

  class_type instructor  user_id
1       Yoga      Alice        2
5  Krav Maga      Alice        2

  class_type instructor  user_id
2   Ju-jitsu        Bob        3
6     Karate        Bob        3


df.groupby('user_id').agg(lambda x : print(x,end='\n\n'))

0    Krav Maga
4     Ju-jitsu
Name: class_type, dtype: object

1         Yoga
5    Krav Maga
Name: class_type, dtype: object

2    Ju-jitsu
6      Karate
Name: class_type, dtype: object

3    Krav Maga
Name: class_type, dtype: object

...

Hopefully this explains why applying set gave the result mentioned above.
