==================================================
Exercises for Memory-Efficient Computing (answers)
==================================================

For the solutions of the exercises, you may want to check the
accompanying material:

- `Timings-escher.ods`, a LibreOffice spreadsheet with data and graphs
  for exercises 1 to 6.  These benchmarks were run on a machine with 8
  physical cores.

- `Timings-40cores.ods`, a LibreOffice spreadsheet with data and graphs
  for exercises 7 to 10.  These benchmarks were run on a machine with
  40 physical cores.

Optimizing arithmetic expressions
=================================

Exercise 1
~~~~~~~~~~

- Set the `what` parameter to "numexpr" and take note of the speed-up
  versus the "numpy" case.  Why do you think the speed-up is so large?

Numexpr basically follows the blocking technique explained above.  The
additional speed-up is due to the fact that the virtual machine in
numexpr is implemented at the C level, and hence it is faster than the
pure Python implementation above.  Also, numexpr can do the ``x**3``
--> ``x*x*x`` expansion, avoiding the expensive `pow()` operation
(which is performed in software).

Exercise 2
~~~~~~~~~~

- Why do you think numpy is so much more efficient with this new
  expression?

Because `pow()` is computationally very expensive, and we removed it.
Note that NumPy cannot expand `pow()` very far, as doing so would
create too many (*big*) temporaries.  Since numexpr temporaries are
much smaller (they fit in cache), it can expand `pow()` much more
cheaply.

- Why is the speed-up in numexpr not as high in comparison?

The virtual machine of numexpr already made the ``x**3 -> x*x*x``
expansion automatically; the factorized version merely contains fewer
temporaries, which is why we can still get a bit more performance.

- Why does numexpr continue to be faster than numpy?

In this case, basically just because numexpr uses the blocking
technique.  Hence, this is a pretty fair comparison of how much this
technique alone can do for your computations (a short sketch of the
comparison appears after Exercise 5 below).

Exercise 3
~~~~~~~~~~

- Why do you think it is more efficient than the approaches above?

The reason is twofold.  On the one hand, the C code only needs scalar
temporaries that are kept in CPU registers, as the computation is done
element by element, so no blocking technique is really needed here.
On the other hand, the C code does not have the overhead of numexpr's
virtual machine, resulting in the fastest implementation.

Evaluating transcendental functions
===================================

Exercise 4
~~~~~~~~~~

- Why is the difference in time between NumPy and numexpr so small?

Because in this case the bottleneck is mainly the CPU, not memory, so
numexpr (using a single thread) cannot offer much of a speed-up (but
still, there is some).

Exercise 5
~~~~~~~~~~

- Does this pure C approach go faster than the Python-based ones?

Not by much, and for the same reason as above: the bottleneck is still
the CPU.

- What would be needed to accelerate the computations?

As this is a CPU-bound calculation, using several cores would probably
help.
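As a reference for exercises 1 and 2, here is a minimal sketch of the
numpy/numexpr comparison.  The array size and the polynomial are
illustrative assumptions, not the exact setup of the exercise sources::

    import numpy as np
    import numexpr as ne

    x = np.linspace(0, 1, 10_000_000)

    # NumPy: whole-array temporaries, and x**3 goes through the
    # (software) pow() routine.
    y_numpy = 0.25*x**3 + 0.75*x**2 - 1.5*x - 2

    # numexpr: same expression, evaluated block by block in its C
    # virtual machine; temporaries fit in cache and x**3 is expanded
    # into x*x*x automatically.
    y_numexpr = ne.evaluate("0.25*x**3 + 0.75*x**2 - 1.5*x - 2")

    # The factorized form of exercise 2 removes pow() for NumPy too.
    y_factorized = ((0.25*x + 0.75)*x - 1.5)*x - 2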
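And a sketch for the transcendental case of exercises 4 and 5.  The
particular functions here are an assumption; any mix of sin/cos/exp
shows the same effect::

    import numpy as np
    import numexpr as ne

    ne.set_num_threads(1)    # single-threaded, as in exercise 4
    x = np.linspace(0, 1, 10_000_000)

    # Most of the time is spent inside the transcendental routines,
    # so the numpy/numexpr gap narrows: the CPU, not memory traffic,
    # is the limit now.
    y_numpy = np.sin(x)**2 + np.cos(x)**2
    y_numexpr = ne.evaluate("sin(x)**2 + cos(x)**2")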
Parallelism with threads
========================

Exercise 6
~~~~~~~~~~

- How does the efficiency scale?

Scaling should approximately follow a negative exponential curve.
However, you will notice that the asymptote is different from zero.
That asymptote is basically fixed by the memory bandwidth of the
machine.

- Why do you think it scales that way?

Because you are hitting the memory bandwidth limit in some of your
computations.  In other words, the problem becomes bottlenecked by
memory access when you use a large number of threads.

- How does the performance compare with the pure C computation?

You will see that numexpr can beat a pure C computation.  This is for
a good reason: numexpr uses several threads here, while the C code
does not.  But see below (a thread-scaling sketch also appears at the
end of this document).

Exercise 7
~~~~~~~~~~

- How does the efficiency scale?

It scales very similarly to numexpr, but it performs a bit better.

- Which is the asymptotic limit?

It is still closer to the bandwidth limit than the numexpr one.  The
reason for this is twofold: the virtual machine in numexpr always has
some overhead that C does not (i.e. dealing with blocks takes time),
and numexpr still has to deal with (small) temporaries in cache.

Exercise 8
~~~~~~~~~~

- Compare the performance of this with polynomial evaluation.

``y = x`` performs pretty similarly to the polynomial::

    y = ((.25*x + .75)*x - 1.5)*x - 2

- Why does it scale so similarly to the polynomial evaluation?

Because the bottleneck is memory access, not CPU computing time.

- Could you make a guess at the memory bandwidth of this machine?

A rough estimation: if the saturation point is around 0.01 seconds for
computing an array of size::

    8 bytes * 1e7 elements ~ 76 MB

then the bandwidth is around::

    76 MB / .01 s ~ 7600 MB/s ~ 7.5 GB/s

Using Numba
===========

Exercise 9
~~~~~~~~~~

- Run several expressions and determine which method is faster.  What
  is the compilation time for numba, and how does it compare with the
  execution time?

On most machines it is between 0.3 s and 0.5 s (it depends on the
hardware).  In this case, it is pretty close to the time that the run
itself takes, and that overhead should be taken into account when
evaluating speed-ups.

- Raise the number of data points to 100 million.  What happens?

The execution time scales linearly, while the compilation time remains
the same, so the compilation becomes much less of an overhead compared
to the run time.

- Set the number of threads for numexpr to 8 and redo the computation.
  How does its speed compare with numba?

numexpr clearly beats numba in this case (especially when evaluating
the expression with transcendental functions).  Numba does not support
multi-threading well yet (this is something that will be fixed).

- Given this, which do you think is the best scenario for numba?
  Which is the best scenario for numexpr?

numba adapts very well to scenarios where you want to accelerate
generic Python code, especially code that is not easy to vectorize.
numexpr adapts better to cases where you want to get rid of the
relatively high compilation times of numba, and to CPU-bound problems,
because it supports efficient multi-threading.
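The thread-scaling measurement behind exercises 6 and 7, as a minimal
sketch (the expression, array size and thread counts are assumptions,
not the exact exercise setup)::

    import time
    import numpy as np
    import numexpr as ne

    x = np.linspace(0, 1, 10_000_000)

    for nthreads in (1, 2, 4, 8):
        ne.set_num_threads(nthreads)
        t0 = time.time()
        ne.evaluate("((.25*x + .75)*x - 1.5)*x - 2")
        print(nthreads, "threads:", round(time.time() - t0, 4), "s")
    # The times drop as threads are added, but flatten towards an
    # asymptote set by memory bandwidth, not by the number of cores.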
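The back-of-the-envelope bandwidth estimate of exercise 8, spelled out
(the 0.01 s saturation time is the figure quoted in the answer above)::

    n_elements = 1e7           # array length
    bytes_per_element = 8      # float64
    size_mb = n_elements * bytes_per_element / 2**20  # ~76 MB
    t_saturation = 0.01        # seconds at the scaling asymptote
    bandwidth_mb_s = size_mb / t_saturation           # ~7600 MB/s
    print(round(size_mb), "MB ->",
          round(bandwidth_mb_s / 1024, 1), "GB/s")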
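Finally, a hedged sketch of the numba timing of exercise 9.  The
polynomial and array size are assumptions; the point is that the first
call pays the compilation overhead discussed above::

    import time
    import numpy as np
    from numba import njit

    @njit
    def poly(x):
        # An explicit loop: the kind of generic, hard-to-vectorize
        # code that numba accelerates well.
        y = np.empty_like(x)
        for i in range(x.shape[0]):
            y[i] = ((.25*x[i] + .75)*x[i] - 1.5)*x[i] - 2
        return y

    x = np.linspace(0, 1, 10_000_000)

    t0 = time.time()
    poly(x)              # first call: compilation + execution
    t1 = time.time()
    poly(x)              # second call: execution only
    t2 = time.time()
    print("compile+run:", round(t1 - t0, 2), "s;",
          "run only:", round(t2 - t1, 2), "s")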