I'm trying to improve the performance of a pandas.groupby.aggregate operation that uses a custom aggregating function. I noticed that - correct me if I'm wrong - pandas calls the aggregating function on each group in sequence (I suspect it is a simple for-loop). Since pandas is heavily based on numpy, is there a way to speed up the calculation using numpy's vectorization features?
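To check this mental model, here is a tiny probe of my own (the probe function is purely illustrative) that counts the invocations - one per group:

import numpy as np
import pandas as pd

calls = {'n': 0}

def probe(s):
    # Incremented once per invocation, i.e. once per group
    calls['n'] += 1
    return s.mean()

s = pd.Series(np.arange(6.0),
              index=pd.date_range('2000-01-01', periods=6, freq='30min'))
s.groupby(pd.Grouper(freq='H')).agg(probe)
print(calls['n'])  # 3 hourly groups -> 3 calls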
My code
In my code I need to aggregate wind data by averaging samples together. While averaging wind speeds is trivial, averaging wind directions requires more careful code (e.g. the average of 1 deg and 359 deg is 0 deg, not 180 deg).
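To illustrate the problem, here is a minimal sketch comparing a plain arithmetic mean with a vector (circular) mean:

import numpy as np

dirs = np.array([1.0, 359.0])
naive = dirs.mean()  # 180.0, clearly wrong for directions
rad = np.deg2rad(dirs)
# Average the unit vectors, then take the angle of the resultant
circ = np.rad2deg(np.arctan2(np.sin(rad).sum(), np.cos(rad).sum())) % 360
print(naive, circ)  # 180.0 vs ~0.0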
What my aggregating function does is:
- remove NaNs
- return NaN if no other value is present
- check if a special flag indicating variable wind direction is present. If it is, return the flag
- average the wind directions with a vector-averaging algorithm
The function is:
def meandir(x):
    '''
    Parameters
    ----------
    x : pandas.Series
        pandas series to be averaged

    Returns
    -------
    float
        averaged wind direction
    '''
    # Remove the NaNs from the recording
    x = x.dropna()
    # If the record is empty, return NaN
    if len(x) == 0:
        return np.nan
    # If the record contains variable samples (990) return variable (990)
    elif np.any(x == 990):
        return 990
    # Otherwise sum the unit vectors and return the angle of the resultant
    else:
        angle = np.rad2deg(
            np.arctan2(
                np.sum(np.sin(np.deg2rad(x))),
                np.sum(np.cos(np.deg2rad(x)))
            )
        )
        # Wrap angles from (-180, 180] to [0, 360)
        return (angle + 360) % 360
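A few spot checks (values chosen by me) confirm the edge cases behave as intended:

import numpy as np
import pandas as pd

print(meandir(pd.Series([1.0, 359.0])))      # ~0.0, not 180.0
print(meandir(pd.Series([np.nan, np.nan])))  # nan (empty after dropna)
print(meandir(pd.Series([10.0, 990.0])))     # 990, the variable-wind flag wins
print(meandir(pd.Series([350.0, 10.0])))     # ~0.0, wraps around north correctly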
You can benchmark it with:
from timeit import repeat
import pandas as pd
import numpy as np
N_samples = int(1e4)
N_nan = N_var = int(0.02 * N_samples)
# Generate random data
data = np.random.rand(N_samples,2) * [30, 360]
data[np.random.choice(N_samples, N_nan), 1] = np.nan
data[np.random.choice(N_samples, N_var), 1] = 990
# Create dataset
df = pd.DataFrame(data, columns=['WindSpeed', 'WindDir'])
df.index = pd.date_range(start='2000-01-01 00:00', periods=N_samples, freq='10min')
# Run groupby + aggregate
grouped = df.groupby(pd.Grouper(freq='H')) # Data from e.g. 15:00 to 15:59 are grouped into the 15:00 bin
aggfuns1 = {'WindSpeed': np.mean, 'WindDir':meandir}
aggfuns2 = {'WindSpeed': np.mean, 'WindDir':np.mean}
res = repeat(stmt='grouped.agg(aggfuns1)', globals=globals(), number=1, repeat=10)
print(f'With custom aggregating function {min(res)*1000:.2f} ms')
res = repeat(stmt='grouped.agg(aggfuns2)', globals=globals(), number=1, repeat=10)
print(f'Without custom aggregating function {min(res)*1000:.2f} ms')
which on my PC, for N_samples=1e4, outputs:
With custom aggregating function 1500.79 ms
Without custom aggregating function 2.08 ms
i.e. the custom aggregating function is roughly 750 times slower, and for N_samples=1e6 outputs:
With custom aggregating function 142967.17 ms
Without custom aggregating function 21.92 ms
i.e. the custom aggregating function is roughly 6500 times slower!
Is there a way to speed up this aggregation?
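One idea I've been toying with (just a sketch under my assumptions; meandir_vectorized is a hypothetical name, and I haven't tested it beyond the cases above) is to precompute the sines and cosines once for the whole column, so the groupby only has to perform cheap built-in sums:

import numpy as np
import pandas as pd

def meandir_vectorized(df, col='WindDir', freq='H'):
    # Precompute per-sample quantities once, outside the groupby
    rad = np.deg2rad(df[col])
    tmp = pd.DataFrame({
        'sin': np.sin(rad),
        'cos': np.cos(rad),
        'valid': df[col].notna(),    # summed -> count of non-NaN samples per bin
        'variable': df[col] == 990,  # summed -> count of 990 flags per bin
    }, index=df.index)
    # Built-in sum is vectorized; NaNs are skipped by default
    g = tmp.groupby(pd.Grouper(freq=freq)).sum()
    angle = (np.rad2deg(np.arctan2(g['sin'], g['cos'])) + 360) % 360
    angle[g['valid'] == 0] = np.nan  # no data in the bin
    angle[g['variable'] > 0] = 990   # bin contains the variable-wind flag
    return angle

If I reason it through, this should match meandir bin by bin: NaN samples drop out of the sums on their own, and the 990 rows only pollute the sums of bins that are overwritten with the flag anyway. But I'd welcome confirmation (or a better approach).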