
How Can I Sort Within Partitions Defined By One Column But Leave The Partitions Where They Are?

Consider the dataframe df

df = pd.DataFrame(dict(
    A=list('XXYYXXYY'),
    B=range(8, 0, -1)
))
print(df)

   A  B
0  X  8
1  X  7
2  Y  6
3  Y  5
4  X  4
5  X  3
6

Solution 1:

You can use transform to get back your new desired index order, then use reindex to reorder your DataFrame:

# Use transform to return the new ordered index values.
new_idx = df.groupby('A')['B'].transform(lambda grp: grp.sort_values().index)

# Reindex.
df = df.reindex(new_idx.rename(None))

You could combine the two lines above into one long line, if so desired.

The resulting output:

   A  B
5  X  3
4  X  4
7  Y  1
6  Y  2
1  X  7
0  X  8
3  Y  5
2  Y  6
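
As a minimal sketch of the one-line variant mentioned above, the two steps can be chained as follows. The name `result` is only illustrative, and `.to_numpy()` is used here in place of `.rename(None)` to pass bare index labels to `reindex`:

```python
import pandas as pd

df = pd.DataFrame(dict(A=list('XXYYXXYY'), B=range(8, 0, -1)))

# Intermediate step, shown for inspection: for every row position, the label
# of the row that should end up there once each group is sorted by B.
new_idx = df.groupby('A')['B'].transform(lambda grp: grp.sort_values().index)

# Both steps chained into a single expression.
result = df.reindex(
    df.groupby('A')['B'].transform(lambda grp: grp.sort_values().index).to_numpy()
)
print(result)
```

Column `A` keeps its original `X X Y Y X X Y Y` layout, while `B` is ascending within each group.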

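Note that if you don't care about maintaining your old index, you can directly reassign from the transform. Returning the sorted values as a plain array ensures they are assigned by position rather than by index label:

df['B'] = df.groupby('A')['B'].transform(lambda grp: grp.sort_values().to_numpy())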

Which yields:

   A  B
0  X  3
1  X  4
2  Y  1
3  Y  2
4  X  7
5  X  8
6  Y  5
7  Y  6

Solution 2:

The only way I found to solve this efficiently was to sort twice and unwind once.

v = df.values

# argsort just first column with kind='mergesort' to preserve subgroup order
a1 = v[:, 0].argsort(kind='mergesort')

# Fill in an un-sort array to unwind the `a1` argsort
a_ = np.empty_like(a1)
a_[a1] = np.arange(len(a1))

# argsort by both columns... not exactly what I want, yet.
a2 = np.lexsort(v.T[::-1])

# Sort with `a2` then unwind the first layer with `a_`
pd.DataFrame(v[a2][a_], df.index[a2][a_], df.columns)

   A  B
5  X  3
4  X  4
7  Y  1
6  Y  2
1  X  7
0  X  8
3  Y  5
2  Y  6
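
To make the "unwind" step easier to follow, here is a small sketch of the same idea that works on the two columns as separate arrays instead of the mixed-type `df.values`. The names `a`, `b`, `inverse`, and `order` are illustrative and not part of the answer above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=list('XXYYXXYY'), B=range(8, 0, -1)))

a = df['A'].to_numpy(dtype=str)  # partition column as a NumPy string array
b = df['B'].to_numpy()

# Stable argsort of the partition column, and its inverse permutation.
a1 = a.argsort(kind='mergesort')
inverse = np.empty_like(a1)
inverse[a1] = np.arange(len(a1))

# Sanity check: applying a1 and then inverse restores the original layout.
assert (a[a1][inverse] == a).all()

# Full sort by A, then B (np.lexsort treats the LAST key as the primary one).
a2 = np.lexsort((b, a))

# Unwinding the A ordering sends each group back to its original positions,
# now sorted by B within the group.
order = a2[inverse]
result = df.iloc[order]
print(result)
```

Selecting rows with `df.iloc[order]` keeps the original column dtypes, whereas rebuilding the frame from `df.values` yields object columns when the frame mixes strings and integers.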

Testing

Code

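import numpy as np
import pandas as pd

def np_intra_sort(df):
    v = df.values
    # Stable argsort of the partition column and its inverse permutation
    a1 = v[:, 0].argsort(kind='mergesort')
    a_ = np.empty_like(a1)
    a_[a1] = np.arange(len(a1))
    # Full sort by column A, then B; unwind the A ordering with `a_`
    a2 = np.lexsort(v.T[::-1])
    return pd.DataFrame(v[a2][a_], df.index[a2][a_], df.columns)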

def pd_intra_sort(df):

    def sub_sort(x):
        return x.sort_values().index

    idx = df.groupby('A').B.transform(sub_sort).values

    return df.reindex(idx)

Small data

[Image from the original answer: results on the small data set; not preserved]

Large data

df = pd.DataFrame(dict(
        A=list('XXYYXXYY') * 10000,
        B=range(8 * 10000, 0, -1)
    ))

[Image from the original answer: results on the large data set; not preserved]
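
A minimal sketch of how the two functions could be compared on the large frame, using the standard-library `timeit` module. The repeat count of 3 and the names `large`, `timings`, `res_np`, and `res_pd` are arbitrary choices for illustration, and no particular timings are implied:

```python
import timeit

import numpy as np
import pandas as pd


def np_intra_sort(df):
    v = df.values
    a1 = v[:, 0].argsort(kind='mergesort')
    a_ = np.empty_like(a1)
    a_[a1] = np.arange(len(a1))
    a2 = np.lexsort(v.T[::-1])
    return pd.DataFrame(v[a2][a_], df.index[a2][a_], df.columns)


def pd_intra_sort(df):
    def sub_sort(x):
        return x.sort_values().index

    idx = df.groupby('A').B.transform(sub_sort).values
    return df.reindex(idx)


large = pd.DataFrame(dict(
    A=list('XXYYXXYY') * 10000,
    B=range(8 * 10000, 0, -1)
))

# Check that both approaches agree before timing them.
res_np = np_intra_sort(large)
res_pd = pd_intra_sort(large)
assert (res_np.index.to_numpy() == res_pd.index.to_numpy()).all()

# Average wall-clock seconds per call over a few runs.
timings = {
    f.__name__: timeit.timeit(lambda: f(large), number=3) / 3
    for f in (np_intra_sort, pd_intra_sort)
}
for name, seconds in timings.items():
    print(f'{name}: {seconds:.4f} s per call')
```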

