Count Number Of Rows Between Two Dates BY ID In A Pandas GroupBy Dataframe
Solution 1:
My usual approach for these problems is to pivot and think in terms of events changing an accumulator. Every new "stdt" we see adds +1 to the count; every "enddt" we see adds -1. (Adds -1 the next day, at least if I'm interpreting "between" the way you are. Some days I think we should ban the word as too ambiguous.)
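To see the accumulator idea in isolation, here is a tiny sketch on made-up events (the dates are invented, not taken from your data): each start is a +1 event, each shifted end is a -1 event, and a cumulative sum over the sorted events is the number of open jobs on any given date.
import pandas as pd

events = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-06", "2015-01-07", "2015-01-16", "2015-01-17"]),
    "change": [1, 1, -1, -1],   # +1 at each stdt, -1 the day after each enddt
})
events = events.sort_values("date")
events["running_count"] = events["change"].cumsum()   # -> 1, 2, 1, 0
Everything below is just about building that event stream from your frame and filling in the days in between.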
IOW, if we turn your frame to something like
>>> df.head()
    cid  jid  change       date
0     1  100       1 2015-01-06
1     1  101       1 2015-01-07
21    1  100      -1 2015-01-16
22    1  101      -1 2015-01-17
17    1  117       1 2015-03-01
then what we want is simply the cumulative sum of change (after suitable regrouping). For example, something like
df["enddt"] += timedelta(days=1)
df = pd.melt(df, id_vars=["cid", "jid"], var_name="change", value_name="date")
df["change"] = df["change"].replace({"stdt": 1, "enddt": -1})
df = df.sort(["cid", "date"])
df = df.groupby(["cid", "date"],as_index=False)["change"].sum()
df["count"] = df.groupby("cid")["change"].cumsum()
new_time = pd.date_range(df.date.min(), df.date.max())
df_parts = []
for cid, group in df.groupby("cid"):
full_count = group[["date", "count"]].set_index("date")
full_count = full_count.reindex(new_time)
full_count = full_count.ffill().fillna(0)
full_count["cid"] = cid
df_parts.append(full_count)
df_new = pd.concat(df_parts)
which gives me something like
>>> df_new.head(15)
            count  cid
2015-01-03      0    1
2015-01-04      0    1
2015-01-05      0    1
2015-01-06      1    1
2015-01-07      2    1
2015-01-08      2    1
2015-01-09      2    1
2015-01-10      2    1
2015-01-11      2    1
2015-01-12      2    1
2015-01-13      2    1
2015-01-14      2    1
2015-01-15      2    1
2015-01-16      1    1
2015-01-17      0    1
There may be off-by-one differences with regard to your expectations; you may have different ideas about how to handle multiple overlapping jids in the same time window (here they would count as 2); but the basic idea of working with events should prove useful even if you have to tweak the details.
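For the off-by-one part specifically, the knob is the one-day shift applied to "enddt" near the top of the code above; a quick sketch of the two conventions (nothing here comes from your question, it is just that shift spelled out):
from datetime import timedelta

# Inclusive end (what the code above does): shift enddt by one day so the -1
# event only takes effect the day after enddt.
df["enddt"] += timedelta(days=1)

# Exclusive end: leave enddt untouched, so the -1 event fires on enddt itself
# and the job no longer counts that day. (Use one convention or the other.)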
Solution 2:
Here is a solution I came up with (it loops over the Cartesian product of the unique cids and the full date range, computing the count for each pair):
from itertools import product

# One row per (cid, date) pair, then count the jids whose [stdt, enddt] window contains that date.
df_new_date = pd.DataFrame(list(product(df.cid.unique(), pd.date_range(df.stdt.min(), df.enddt.max()))),
                           columns=['cid', 'newdate'])
df_new_date['cnt'] = df_new_date.apply(
    lambda row: df[(df['cid'] == row['cid']) & (df['stdt'] <= row['newdate']) & (df['enddt'] >= row['newdate'])]['jid'].count(), axis=1)
>>> df_new_date.head(20)
    cid    newdate  cnt
0     1 2015-07-01    0
1     1 2015-07-02    0
2     1 2015-07-03    0
3     1 2015-07-04    0
4     1 2015-07-05    0
5     1 2015-07-06    1
6     1 2015-07-07    1
7     1 2015-07-08    1
8     1 2015-07-09    1
9     1 2015-07-10    1
10    1 2015-07-11    2
11    1 2015-07-12    3
12    1 2015-07-13    3
13    1 2015-07-14    2
14    1 2015-07-15    3
15    1 2015-07-16    3
16    1 2015-07-17    3
17    1 2015-07-18    3
18    1 2015-07-19    2
19    1 2015-07-20    1
You could then drop the zeros if you don't want them. I don't think this will be much better than your original solution, however.
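If you do want to drop the zeros, filtering the frame built above is enough, e.g.:
df_new_date = df_new_date[df_new_date['cnt'] != 0]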
I would also suggest the following improvement to the loop in @DSM's solution:
df_parts = []
for cid in df.cid.unique():
    # Reindex this cid's dates to daily frequency, forward-filling counts, and keep nonzero days.
    full_count = (df[df.cid == cid][['cid', 'date', 'count']]
                  .set_index("date")
                  .asfreq("D", method='ffill')[['cid', 'count']]
                  .reset_index())
    df_parts.append(full_count[full_count['count'] != 0])
df_new = pd.concat(df_parts)
>>> df_new
         date  cid  count
0  2015-07-06    1      1
1  2015-07-07    1      1
2  2015-07-08    1      1
3  2015-07-09    1      1
4  2015-07-10    1      1
5  2015-07-11    1      2
6  2015-07-12    1      3
7  2015-07-13    1      3
8  2015-07-14    1      2
9  2015-07-15    1      3
10 2015-07-16    1      3
11 2015-07-17    1      3
12 2015-07-18    1      3
13 2015-07-19    1      2
14 2015-07-20    1      1
15 2015-07-21    1      1
16 2015-07-22    1      1
0  2015-07-01    2      1
1  2015-07-02    2      1
2  2015-07-03    2      1
3  2015-07-04    2      1
4  2015-07-05    2      1
5  2015-07-06    2      1
6  2015-07-07    2      2
7  2015-07-08    2      2
8  2015-07-09    2      2
9  2015-07-10    2      3
10 2015-07-11    2      3
11 2015-07-12    2      4
12 2015-07-13    2      4
13 2015-07-14    2      5
14 2015-07-15    2      4
15 2015-07-16    2      4
16 2015-07-17    2      3
17 2015-07-18    2      2
18 2015-07-19    2      2
19 2015-07-20    2      1
20 2015-07-21    2      1
The only real improvement over what @DSM provided is that it avoids creating a groupby object for the loop, and it limits each cid to its own date range (from its min stdt to its max enddt) while dropping the zero-count rows.
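If you want to sanity-check either approach end to end, here is a minimal self-contained sketch on invented data (the cid/jid/stdt/enddt layout follows the question, but the sample rows and values are made up):
import pandas as pd
from datetime import timedelta

# Invented sample frame in the question's layout.
df = pd.DataFrame({
    "cid":   [1, 1, 2],
    "jid":   [100, 101, 200],
    "stdt":  pd.to_datetime(["2015-01-06", "2015-01-07", "2015-01-05"]),
    "enddt": pd.to_datetime(["2015-01-15", "2015-01-16", "2015-01-10"]),
})

# Event-based count (Solution 1): +1 at stdt, -1 the day after enddt, cumsum per cid.
df["enddt"] += timedelta(days=1)
events = pd.melt(df, id_vars=["cid", "jid"], var_name="change", value_name="date")
events["change"] = events["change"].map({"stdt": 1, "enddt": -1})
events = (events.sort_values(["cid", "date"])
                .groupby(["cid", "date"], as_index=False)["change"].sum())
events["count"] = events.groupby("cid")["change"].cumsum()
print(events)
From there, either the reindex/ffill loop in Solution 1 or the asfreq version above will expand the per-event counts into one row per day.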