python - Pandas taking Cumulative Sum with Reset -
problem
i'm trying keep running total of consecutive timestamps (minute frequency). have way of taking cumulative sum , resetting on condition 2 columns not match, done loop. wondering if there way without loop.
code
cb_arbitrage['shift'] = cb_arbitrage.index.shift(1, freq='t')
returns:
cccccccc bbbbbbbb cb_spread shift timestamp 2017-07-07 18:23:00 2535.002000 2524.678462 10.323538 2017-07-07 18:24:00 2017-07-07 18:24:00 2535.007826 2523.297619 11.710207 2017-07-07 18:25:00 2017-07-07 18:25:00 2535.004167 2524.391000 10.613167 2017-07-07 18:26:00 2017-07-07 18:26:00 2534.300000 2521.838667 12.461333 2017-07-07 18:27:00 2017-07-07 18:27:00 2530.231429 2520.195625 10.035804 2017-07-07 18:28:00 2017-07-07 18:28:00 2529.444667 2518.782143 10.662524 2017-07-07 18:29:00 2017-07-07 18:29:00 2528.988000 2518.802963 10.185037 2017-07-07 18:30:00 2017-07-07 18:59:00 2514.403367 2526.473333 12.069966 2017-07-07 19:00:00 2017-07-07 19:01:00 2516.410000 2528.980000 12.570000 2017-07-07 19:02:00
then following:
cb_arbitrage['shift'] = cb_arbitrage['shift'].shift(1) cb_arbitrage['shift'][0] = cb_arbitrage.index[0] cb_arbitrage['count'] = 0
which returns:
cccccccc bbbbbbbb cb_spread shift count timestamp 2017-07-07 18:23:00 2535.002000 2524.678462 10.323538 2017-07-07 18:23:00 0 2017-07-07 18:24:00 2535.007826 2523.297619 11.710207 2017-07-07 18:24:00 0 2017-07-07 18:25:00 2535.004167 2524.391000 10.613167 2017-07-07 18:25:00 0 2017-07-07 18:26:00 2534.300000 2521.838667 12.461333 2017-07-07 18:26:00 0 2017-07-07 18:27:00 2530.231429 2520.195625 10.035804 2017-07-07 18:27:00 0 2017-07-07 18:28:00 2529.444667 2518.782143 10.662524 2017-07-07 18:28:00 0 2017-07-07 18:29:00 2528.988000 2518.802963 10.185037 2017-07-07 18:29:00 0 2017-07-07 18:59:00 2514.403367 2526.473333 12.069966 2017-07-07 18:30:00 0 2017-07-07 19:01:00 2516.410000 2528.980000 12.570000 2017-07-07 19:00:00 0
then, loop calculate cumulative sum, reset:
count = 0 i, row in cb_arbitrage.iterrows(): if == cb_arbitrage.loc[i]['shift']: count += 1 cb_arbitrage.set_value(i, 'count', count) else: count = 1 cb_arbitrage.set_value(i, 'count', count)
which gives me expected result:
cccccccc bbbbbbbb cb_spread shift count timestamp 2017-07-07 18:23:00 2535.002000 2524.678462 10.323538 2017-07-07 18:23:00 1 2017-07-07 18:24:00 2535.007826 2523.297619 11.710207 2017-07-07 18:24:00 2 2017-07-07 18:25:00 2535.004167 2524.391000 10.613167 2017-07-07 18:25:00 3 2017-07-07 18:26:00 2534.300000 2521.838667 12.461333 2017-07-07 18:26:00 4 2017-07-07 18:27:00 2530.231429 2520.195625 10.035804 2017-07-07 18:27:00 5 2017-07-07 18:28:00 2529.444667 2518.782143 10.662524 2017-07-07 18:28:00 6 2017-07-07 18:29:00 2528.988000 2518.802963 10.185037 2017-07-07 18:29:00 7 2017-07-07 18:59:00 2514.403367 2526.473333 12.069966 2017-07-07 18:30:00 1 2017-07-07 19:01:00 2516.410000 2528.980000 12.570000 2017-07-07 19:00:00 1 2017-07-07 21:55:00 2499.904560 2510.814000 10.909440 2017-07-07 19:02:00 1 2017-07-07 21:56:00 2500.134615 2510.812857 10.678242 2017-07-07 21:56:00 2
you can use diff
method finds difference between current row , previous row. can check , see if difference equal 1 minute. here, there lots of trickery reset streaks within data.
we first take cumulative sum of boolean series, gets close want. reset series multiply cumulative sum series original boolean, since false evaluates 0.
s = cb_arbitrage.timestamp.diff() == pd.timedelta('1 minute') s1 = s.cumsum() s.mul(s1).diff().where(lambda x: x < 0).ffill().add(s1, fill_value=0) + 1 0 1.0 1 2.0 2 3.0 3 4.0 4 5.0 5 6.0 6 7.0 7 1.0 8 1.0 9 1.0 10 2.0
Comments
Post a Comment