df.groupby(...).agg(set) Produces Different Result Compared To df.groupby(...).agg(lambda x: set(x))
Solution 1:
OK, what is happening here is that set isn't being handled, because it is not is_list_like in _aggregate:
elif is_list_like(arg) and arg not in compat.string_types:
see source
This isn't is_list_like, so it returns None up the call chain and ends up at this line:
results.append(colg.aggregate(a))
see source
This raises TypeError: 'type' object is not iterable, which then raises:
if not len(results):
    raise ValueError("no results")
see source
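The error message itself can be reproduced without any groupby call. The sketch below is only an illustration of where the wording comes from, not the pandas code path: it assumes nothing beyond the builtin set type and the public pandas.api.types.is_list_like helper, and the variable names are made up for the example.

```python
from pandas.api.types import is_list_like

# An actual set instance is a container, so pandas treats it as list-like.
instance_is_list_like = is_list_like({"a", "b"})

# The set *type* is a class rather than a container. How is_list_like
# classifies it may depend on the pandas version, so it is only printed here.
print("is_list_like(set):", is_list_like(set))

# Trying to iterate over the type itself produces the message quoted above.
try:
    iter(set)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Any code that loops over the argument as if it were a list of functions would hit this same TypeError when handed the bare type.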
So because we have no results, we end up calling _aggregate_generic:
see source
This then calls:
result[name] = self._try_cast(func(data, *args, **kwargs), data)
see source
This then ends up as:
(Pdb) n
> c:\programdata\anaconda3\lib\site-packages\pandas\core\groupby.py(3779)_aggregate_generic()
-> return self._wrap_generic_output(result, obj)
(Pdb) result
{1: {'user_id', 'instructor', 'class_type'}, 2: {'user_id', 'instructor', 'class_type'}, 3: {'user_id', 'instructor', 'class_type'}, 4: {'user_id', 'instructor', 'class_type'}}
I'm running a slightly different version of pandas but the equivalent source line is https://github.com/pandas-dev/pandas/blob/v0.22.0/pandas/core/groupby.py#L3779
So essentially, because set doesn't count as a function or an iterable here, it just collapses to calling the constructor on each group's DataFrame. Iterating a DataFrame yields its column labels, which is why every group produces the same set of column names. You can see the same effect here:
In [8]: df.groupby('user_id').agg(lambda x: print(set(x.columns)))
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
Out[8]:
        class_type instructor
user_id
1             None       None
2             None       None
3             None       None
4             None       None
But when you use the lambda, which is an anonymous function, this works as expected.
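To make the difference concrete, here is a minimal sketch. The DataFrame is hypothetical, loosely based on the rows printed elsewhere in this thread, and the variable names are only for illustration. It shows that set applied to a whole DataFrame collects column labels, set applied to a Series collects values, and the lambda form gives one set of values per group and column.

```python
import pandas as pd

# Hypothetical data, loosely reconstructed from the output shown in this thread.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2, 3],
    "class_type": ["Krav Maga", "Yoga", "Ju-jitsu",
                   "Ju-jitsu", "Krav Maga", "Karate"],
    "instructor": ["Bob", "Alice", "Bob", "Alice", "Alice", "Bob"],
})

# Iterating a DataFrame yields its column labels, so this is a set of names.
column_names = set(df)

# Iterating a Series yields its values, so this is a set of values.
class_values = set(df["class_type"])

# The lambda receives one column of one group at a time and returns its values.
per_group = df.groupby("user_id").agg(lambda x: set(x))
print(per_group)
```

If you need per-group sets, the lambda form is the safer spelling, since it does not depend on how a given pandas version dispatches a bare builtin type.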
Solution 2:
Perhaps, as @Edchum commented, agg applies Python builtin functions to each group as a mini DataFrame, whereas a user-defined function is applied to every column of the group. An example using print illustrates this.
df.groupby('user_id').agg(print,end='\n\n')
  class_type instructor  user_id
0  Krav Maga        Bob        1
4   Ju-jitsu      Alice        1

  class_type instructor  user_id
1       Yoga      Alice        2
5  Krav Maga      Alice        2

  class_type instructor  user_id
2   Ju-jitsu        Bob        3
6     Karate        Bob        3
df.groupby('user_id').agg(lambda x : print(x,end='\n\n'))
0    Krav Maga
4     Ju-jitsu
Name: class_type, dtype: object

1         Yoga
5    Krav Maga
Name: class_type, dtype: object

2    Ju-jitsu
6      Karate
Name: class_type, dtype: object

3    Krav Maga
Name: class_type, dtype: object
...
This is probably why applying set directly gave the result shown above.
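If you want to check what your own pandas version passes to the aggregation function, a small probe can help. This is a rough sketch under assumptions: the DataFrame and the probe function are hypothetical, and which type gets recorded (Series, DataFrame, or both) may differ between pandas versions.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2, 3],
    "class_type": ["Krav Maga", "Yoga", "Ju-jitsu",
                   "Ju-jitsu", "Krav Maga", "Karate"],
    "instructor": ["Bob", "Alice", "Bob", "Alice", "Alice", "Bob"],
})

seen = []

def probe(x):
    # Record the kind of object pandas hands to the aggregation function.
    seen.append(type(x).__name__)
    # Return a scalar so the aggregation itself succeeds.
    return len(x)

df.groupby("user_id").agg(probe)

# Show the distinct object types that were passed in.
print(sorted(set(seen)))
```

Seeing Series means the function was applied column by column. Seeing DataFrame means it was handed the whole group, which is the situation described above for builtins.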