I often use the Pandas `mask` and `where` methods for cleaner logic when updating values in a series conditionally. However, in relatively performance-critical code I notice a significant performance drop relative to `numpy.where`.
While I'm happy to accept this for specific cases, I'm interested to know:
- Do the Pandas `mask` / `where` methods offer any additional functionality, apart from the `inplace` / `errors` / `try_cast` parameters? I understand those 3 parameters but rarely use them. For example, I have no idea what the `level` parameter refers to.
- Is there any non-trivial counter-example where `mask` / `where` outperforms `numpy.where`? If such an example exists, it could influence how I choose appropriate methods going forwards.
For reference, here's some benchmarking on Pandas 0.19.2 / Python 3.6.0:
    import numpy as np
    import pandas as pd

    np.random.seed(0)
    n = 10000000
    df = pd.DataFrame(np.random.random(n))

    # Sanity check: both approaches should give identical values
    assert (df.mask(df > 0.5, 1).values == np.where(df > 0.5, 1, df)).all()

    %timeit df.mask(df > 0.5, 1)         # 145 ms per loop
    %timeit np.where(df > 0.5, 1, df)    # 113 ms per loop
The performance appears to diverge further for non-scalar values:
    %timeit df.mask(df > 0.5, df*2)         # 338 ms per loop
    %timeit np.where(df > 0.5, df*2, df)    # 153 ms per loop
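In case it helps anyone reproduce this outside IPython, here is a minimal sketch of the non-scalar comparison using the standard-library `timeit` module instead of `%timeit`. The `best_time` helper and the `timings` dict are my own illustrative names, and `n` is reduced from the value above purely so the script finishes quickly, so the absolute numbers will not match the figures quoted here.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(0)
n = 1_000_000  # assumption: smaller than the benchmark above, to keep the run short
df = pd.DataFrame(np.random.random(n))
cond = df > 0.5

# Sanity check: mask, where (with the inverted condition) and np.where
# should all produce the same values.
expected = np.where(cond, df * 2, df)
assert (df.mask(cond, df * 2).values == expected).all()
assert (df.where(~cond, df * 2).values == expected).all()


def best_time(func, repeat=3, number=5):
    """Hypothetical helper: best per-call time in seconds over several repeats."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


# The condition is recomputed inside each call, as in the %timeit lines above.
timings = {
    "df.mask": best_time(lambda: df.mask(df > 0.5, df * 2)),
    "np.where": best_time(lambda: np.where(df > 0.5, df * 2, df)),
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.1f} ms per call")
```

Taking the minimum over repeats is a common way to reduce noise from other processes, but results will still vary with the machine and with the Pandas and NumPy versions installed.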