I'm currently trying to get a better understanding of memory related performance issues. I read somewhere that memory locality is more important for reading than for writing, because in the former case the CPU has to actually wait for the data whereas in the latter case it can just ship them out and forget about them.
With that in mind, I did the following quick-and-dirty test:
import numpy as np
from timeit import timeit
def setup():
global a, b, c
a = np.random.permutation(N)
b = np.random.random(N)
c = np.empty_like(b)
def fwd():
c = b[a]
def inv():
c[a] = b
N = 10_000
setup()
timeit(fwd, number=100_000)
# 1.4942631321027875
timeit(inv, number=100_000)
# 2.531870319042355
N = 100_000
setup()
timeit(fwd, number=10_000)
# 2.4054739447310567
timeit(inv, number=10_000)
# 3.2365565397776663
N = 1_000_000
setup()
timeit(fwd, number=1_000)
# 11.131387163884938
timeit(inv, number=1_000)
# 14.19817715883255
Now, the result is the opposite of what I would have expected: fwd
which does random read and contiguous write is consistently faster than inv
which does contiguous read and random write.
I suspect I'm missing something fairly obvious, but anyway, here goes:
Why is this so?
And while we are at it, why is the whole thing so nonlinear? My numbers are not insanely big are they?
UPDATE: As pointed out by @Trilarion and @Yann Vernier my snippets aren't properly balanced, so I replaced them with
def fwd():
c[d] = b[a]
b[d] = c[a]
def inv():
c[a] = b[d]
b[a] = c[d]
where d = np.arange(N)
(I shuffle everything both ways to hopefully reduce across trial caching effects). I also replaced timeit
with repeat
and reduced the numbers of repeats by a factor of 10.
Then I get
[0.6757169323973358, 0.6705542299896479, 0.6702114241197705] #fwd
[0.8183442652225494, 0.8382121799513698, 0.8173762648366392] #inv
[1.0969422250054777, 1.0725746559910476, 1.0892365919426084] #fwd
[1.0284497970715165, 1.025063106790185, 1.0247828317806125] #inv
[3.073981977067888, 3.077839042060077, 3.072118630632758] #fwd
[3.2967213969677687, 3.2996009718626738, 3.2817375687882304] #inv
So there still seems to be a difference, but it is much more subtle and can now go either way depending on the problem size.
Update 2:
Given the comments and answers so far I would appreciate clarification as to
- is @B.M.'s numba experiment conclusive?
- or could it go either way depending on cache details
Please explain in a beginner friendly way the various cache concepts that may be relevant here (cf. comments by @Trilarion, @Yann Vernier's answer). Supporting code may be in C / cython / numpy /numba or python.
Optionally, explain the behavior of my clearly inadequate python experiments.
from Why is fancy assignment slower than fancy lookup?
No comments:
Post a Comment