==================================================
Exercises for Memory-Efficient Computing (answers)
==================================================
For the solutions to the exercises, you may want to check the
accompanying material:

- `Timings-escher.ods`, a LibreOffice spreadsheet with data and
  graphs for exercises 1 to 6. These benchmarks were run on a
  machine with 8 physical cores.
- `Timings-40cores.ods`, a LibreOffice spreadsheet with data and
  graphs for exercises 7 to 10. These benchmarks were run on a
  machine with 40 physical cores.
Optimizing arithmetic expressions
=================================
Exercise 1
~~~~~~~~~~
- Set the `what` parameter to "numexpr" and take note of the speed-up
  versus the "numpy" case. Why do you think the speed-up is so large?

Numexpr basically follows the blocking technique explained above. The
additional speed-up comes from the fact that the virtual machine in
numexpr is implemented at the C level, and hence it is faster than the
pure-Python implementation above.

Also, numexpr can apply the ``x**3`` --> ``x*x*x`` expansion, avoiding
the expensive `pow()` operation (which is performed in software).
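The blocking technique that numexpr implements at the C level can be
sketched in pure Python with NumPy. This is only an illustrative
sketch: the polynomial is the one used later in these exercises, and
the block size is an arbitrary assumption chosen to fit in cache:

```python
import numpy as np

def blocked_eval(x, block_size=4096):
    """Evaluate .25*x**3 + .75*x**2 - 1.5*x - 2 block by block,
    so that temporaries stay small enough to fit in cache."""
    out = np.empty_like(x)
    for i in range(0, len(x), block_size):
        b = x[i:i + block_size]
        # x*x*x instead of x**3: avoids the expensive pow() call
        out[i:i + block_size] = ((.25*b + .75)*b - 1.5)*b - 2
    return out

x = np.linspace(0, 10, 100_000)
y = blocked_eval(x)
```

Each iteration only allocates temporaries of `block_size` elements, so
they stay resident in cache instead of streaming through main memory.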
Exercise 2
~~~~~~~~~~
- Why do you think numpy performs much more efficiently with this new
  expression?

Because `pow()` is computationally very expensive, and we removed it.
Note that NumPy cannot expand `pow()` too aggressively, as that would
create too many (*big*) temporaries. Since numexpr temporaries are
much smaller (they fit in cache), it can expand `pow()` much more
cheaply.
- Why is the speed-up in numexpr not so high in comparison?

The virtual machine of numexpr already performed the `x**3 -> x*x*x`
expansion automatically.

- Why does numexpr continue to be faster than NumPy?

In this case, basically because numexpr uses the blocking technique.
Hence, this is a pretty fair comparison of how much this technique can
do for your computations.
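The `pow()`-free rewrite discussed above is just a Horner-style
factorization of the polynomial; both forms compute the same values:

```python
import numpy as np

x = np.linspace(0, 10, 1_000_000)

# Original expression: three pow() calls and several big temporaries
y1 = .25*x**3 + .75*x**2 - 1.5*x - 2

# Rewritten (Horner) form: only multiplies and adds, no pow()
y2 = ((.25*x + .75)*x - 1.5)*x - 2
```

The rewrite is what NumPy benefits from directly, and what numexpr's
virtual machine was already doing behind the scenes.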
Exercise 3
~~~~~~~~~~
- Why do you think it is more efficient than the above approaches?
The reason is twofold. On the one hand, C code only needs scalar
temporaries that are kept in CPU registers, as the computation is done
element by element, so no blocking technique is really needed here.
On the other hand, C code does not have numexpr's virtual-machine
overhead, resulting in the fastest implementation.
Parallelism with threads
========================
Exercise 4
~~~~~~~~~~
- How does the efficiency scale?

Scaling should approximately follow a negative exponential curve.
However, you will notice that the asymptote is different from zero.
That asymptote is essentially fixed by the memory bandwidth of the
machine.
- Why do you think it scales that way?

Because you are hitting the memory bandwidth limit in some of your
computations: the problem becomes bottlenecked by memory access when
you use a large number of threads.
- How does performance compare with the pure C computation?

You will see that numexpr can beat a pure C computation, and for a
good reason: numexpr uses several threads here, while the C code does
not. But see below.
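What numexpr does internally (splitting the array into chunks and
handing them to a pool of worker threads) can be mimicked with the
standard library. This is a sketch, not numexpr's actual code: the
chunking scheme and the polynomial are assumptions, and the threads
can overlap only because NumPy releases the GIL inside its inner
loops:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def chunk_eval(x, out, lo, hi):
    # NumPy releases the GIL while running its compiled inner loops,
    # so threads working on disjoint chunks can run in parallel
    out[lo:hi] = ((.25*x[lo:hi] + .75)*x[lo:hi] - 1.5)*x[lo:hi] - 2

def parallel_eval(x, nthreads=4):
    out = np.empty_like(x)
    bounds = np.linspace(0, len(x), nthreads + 1, dtype=int)
    with ThreadPoolExecutor(nthreads) as pool:
        jobs = [pool.submit(chunk_eval, x, out, lo, hi)
                for lo, hi in zip(bounds[:-1], bounds[1:])]
        for job in jobs:
            job.result()  # re-raise any exception from the workers
    return out

x = np.linspace(0, 10, 1_000_000)
y = parallel_eval(x)
```

Past a few threads, adding more workers to this loop stops helping,
because all of them compete for the same memory bus.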
Exercise 5
~~~~~~~~~~
- How does the efficiency scale?

It scales very similarly to numexpr, but performs a bit better.

- What is the asymptotic limit?

It is still closer to the bandwidth limit than the numexpr one.
The reason for this is twofold: the virtual machine in numexpr always
has some overhead that C does not (i.e. dealing with blocks takes
time), and numexpr still has to deal with (small) temporaries in
cache.
Exercise 6
~~~~~~~~~~
- Compare the performance of this with the polynomial evaluation.

`y = x` performs pretty similarly to the polynomial::

    y = ((.25*x + .75)*x - 1.5)*x - 2

- Why does it scale very similarly to the polynomial evaluation?

The bottleneck is memory access, not CPU computing time.
- Could you have a guess at the memory bandwidth of this machine?

A rough estimate: if the saturation point is around 0.01 seconds for
computing an array of size::

    8 * 1e7 bytes ~ 76 MB

then the bandwidth is around::

    (76 / .01) ~ 7600 MB/s ~ 7.5 GB/s
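The same back-of-the-envelope arithmetic, spelled out (the array
length and the 0.01 s timing are the figures from the estimate above):

```python
nelem = int(1e7)              # elements in the benchmark array
nbytes = 8 * nelem            # float64 takes 8 bytes per element
mbytes = nbytes / 2**20       # ~76.3 MB
elapsed = 0.01                # seconds at the saturation point
bandwidth = mbytes / elapsed  # ~7630 MB/s, i.e. ~7.5 GB/s
```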
Evaluating with carray
======================
Exercise 7
~~~~~~~~~~
- Why do you think carray evaluates faster than NumPy, even when
  using the Python VM (virtual machine)?

carray executes faster because it applies the blocking technique
regardless of the virtual machine chosen. NumPy does not do that.

- How much does compression slow down the evaluation? What is the
  compression ratio achieved? Is that a lot?

Compression slows down the computation of the polynomial by a factor
of 2x. The compression ratio is 2x, which is pretty usual for real
data.
Exercise 8
~~~~~~~~~~
- How much does compression slow down the evaluation? What is the
  compression ratio achieved? Is that a lot?

In this case, the compression ratio is much better: 100x for
compression level 5. And yes, 100x is really a lot.

- Why do you think the results vary so dramatically?

Because the output of the new expression is boolean, with very low
entropy in the data distribution. Hence, Blosc can do the compression
in a very easy and fast way. This is why performance is not affected
by compression in this case.
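The entropy effect is easy to reproduce with the standard library's
`zlib` (used here only as a stand-in for Blosc; the actual ratios are
illustrative and codec-dependent). A boolean result made of long runs
compresses orders of magnitude better than the float input:

```python
import zlib

import numpy as np

x = np.linspace(0, 10, 1_000_000)

floats = x.tobytes()        # high-entropy float64 payload
bools = (x > 5).tobytes()   # boolean result: long runs of 0s and 1s

ratio_floats = len(floats) / len(zlib.compress(floats))
ratio_bools = len(bools) / len(zlib.compress(bools))
```

The boolean buffer is essentially two runs, so the compressor reduces
it to a handful of bytes, while the float buffer barely shrinks.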
Querying Big Data
=================
Exercise 9
~~~~~~~~~~
- How does a carray query compare with a NumPy one?

First, the syntax is simpler. Just compare the NumPy way::

    sum(r["f0"] for r in nt[(nt["f0"]<10)])

with the carray way::

    sum(r.f0 for r in t.where("f0<10"))
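For a runnable point of comparison, here is the NumPy side of the
'simple' query on a small, made-up structured array (the field layout
and contents are assumptions standing in for the exercise's `nt`
table):

```python
import numpy as np

# Hypothetical stand-in for the `nt` table used in the exercise
nt = np.zeros(1000, dtype=[("f0", "i8"), ("f1", "f8")])
nt["f0"] = np.arange(1000)
nt["f1"] = np.linspace(0., 1., 1000)

# 'simple' query: sum f0 over the rows where f0 < 10
result = sum(r["f0"] for r in nt[nt["f0"] < 10])
```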
- What is the compression ratio achieved in the ctable `t`?

It is around a factor of 2x, which is a normal ratio.
- How do the 'simple' and 'complex' queries execute in comparison
  with the NumPy ones?

The 'simple' query executes faster on a NumPy container, while the
'complex' query executes faster on a carray container. The reason is
that for the 'simple' query, carray shows a higher overhead for
dealing with chunked arrays. For the 'complex' query the overhead is
the computation of the expression, and here is where numexpr shows
its muscle, making the carray approach faster.
- If you are on the big Intel Lab machine, increase NROWS by one
  order of magnitude and re-run the benchmark. What do you see?

The time used increases linearly and the compression ratio stays more
or less the same.
Exercise 10
~~~~~~~~~~~
- Try to find the sweet spot for the 'simple' query.

The sweet spot is using 1 thread for both NROWS = 1e7 and NROWS =
1e8. The problem is probably memory-bound, so using more threads
does not help much.
- Repeat for the 'complex' query.

The sweet spot in this case is around 13 threads for NROWS = 1e7 and
around 16 threads for NROWS = 1e8.
- Why do you think there is such a large difference in the sweet spot?

The big difference between the 'simple' query and the 'complex' one is
that the numexpr VM used by carray can make better use of threads
when evaluating a complex expression.