==================================================
Exercises for Memory-Efficient Computing (answers)
==================================================
For the solutions to the exercises, you may want to check the
accompanying material:

- `Timings-escher.ods`, a LibreOffice spreadsheet with data and
  graphs for exercises 1 to 6. These benchmarks were run on a
  machine with 8 physical cores.
- `Timings-40cores.ods`, a LibreOffice spreadsheet with data and
  graphs for exercises 7 to 10. These benchmarks were run on a
  machine with 40 physical cores.
Optimizing arithmetic expressions
=================================
Exercise 1
~~~~~~~~~~
- Set the `what` parameter to "numexpr" and take note of the speed-up
  versus the "numpy" case. Why do you think the speed-up is so large?

Numexpr basically follows the blocking technique explained above. The
additional speed-up comes from the fact that the virtual machine in
numexpr is implemented at the C level, and hence it is faster than the
pure-Python implementation above.

Also, numexpr can apply the ``x**3`` --> ``x*x*x`` expansion, avoiding
the expensive `pow()` operation (which is performed in software).
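The blocking technique that numexpr implements at the C level can be
sketched in pure Python with NumPy. This is only an illustrative
sketch: the polynomial is the one used later in these exercises, and
the block size is an arbitrary assumption chosen to fit in cache:

```python
import numpy as np

def blocked_eval(x, block_size=4096):
    """Evaluate .25*x**3 + .75*x**2 - 1.5*x - 2 block by block,
    so that temporaries stay small enough to fit in cache."""
    out = np.empty_like(x)
    for i in range(0, len(x), block_size):
        b = x[i:i + block_size]
        # x*x*x instead of x**3: avoids the expensive pow() call
        out[i:i + block_size] = ((.25*b + .75)*b - 1.5)*b - 2
    return out

x = np.linspace(0, 10, 100_000)
y = blocked_eval(x)
```

Each iteration only allocates temporaries of `block_size` elements, so
they stay resident in cache instead of streaming through main memory.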
Exercise 2
~~~~~~~~~~
- Why do you think numpy performs much more efficiently with this new
  expression?

Because `pow()` is computationally very expensive, and we removed it.
Note that NumPy cannot expand `pow()` too aggressively, as that would
create too many (*big*) temporaries. Since numexpr temporaries are
much smaller (they fit in cache), it can expand `pow()` much more
cheaply.
- Why is the speed-up in numexpr not so high in comparison?

The virtual machine of numexpr already performed the `x**3 -> x*x*x`
expansion automatically.

- Why does numexpr continue to be faster than NumPy?

In this case, basically because numexpr uses the blocking technique.
Hence, this is a pretty fair comparison of how much this technique can
do for your computations.
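The `pow()`-free rewrite discussed above is just a Horner-style
factorization of the polynomial; both forms compute the same values:

```python
import numpy as np

x = np.linspace(0, 10, 1_000_000)

# Original expression: three pow() calls and several big temporaries
y1 = .25*x**3 + .75*x**2 - 1.5*x - 2

# Rewritten (Horner) form: only multiplies and adds, no pow()
y2 = ((.25*x + .75)*x - 1.5)*x - 2
```

The rewrite is what NumPy benefits from directly, and what numexpr's
virtual machine was already doing behind the scenes.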
Exercise 3
~~~~~~~~~~
- Why do you think it is more efficient than the above approaches?
The reason is twofold. On the one hand, C code only needs scalar
temporaries that are kept in CPU registers, as the computation is done
element by element, so no blocking technique is really needed here.
On the other hand, C code does not have numexpr's virtual-machine
overhead, resulting in the fastest implementation.
Parallelism with threads
========================
Exercise 4
~~~~~~~~~~
- How does the efficiency scale?

Scaling should approximately follow a negative exponential curve.
However, you will notice that the asymptote is different from zero.
That asymptote is essentially fixed by the memory bandwidth of the
machine.
- Why do you think it scales that way?

Because you are hitting the memory bandwidth limit in some of your
computations: the problem becomes bottlenecked by memory access when
you use a large number of threads.
- How does performance compare with the pure C computation?

You will see that numexpr can beat a pure C computation, and for a
good reason: numexpr uses several threads here, while the C code does
not. But see below.
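What numexpr does internally (splitting the array into chunks and
handing them to a pool of worker threads) can be mimicked with the
standard library. This is a sketch, not numexpr's actual code: the
chunking scheme and the polynomial are assumptions, and the threads
can overlap only because NumPy releases the GIL inside its inner
loops:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def chunk_eval(x, out, lo, hi):
    # NumPy releases the GIL while running its compiled inner loops,
    # so threads working on disjoint chunks can run in parallel
    out[lo:hi] = ((.25*x[lo:hi] + .75)*x[lo:hi] - 1.5)*x[lo:hi] - 2

def parallel_eval(x, nthreads=4):
    out = np.empty_like(x)
    bounds = np.linspace(0, len(x), nthreads + 1, dtype=int)
    with ThreadPoolExecutor(nthreads) as pool:
        jobs = [pool.submit(chunk_eval, x, out, lo, hi)
                for lo, hi in zip(bounds[:-1], bounds[1:])]
        for job in jobs:
            job.result()  # re-raise any exception from the workers
    return out

x = np.linspace(0, 10, 1_000_000)
y = parallel_eval(x)
```

Past a few threads, adding more workers to this loop stops helping,
because all of them compete for the same memory bus.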
Exercise 5
~~~~~~~~~~
- How does the efficiency scale?

It scales very similarly to numexpr, but performs a bit better.

- What is the asymptotic limit?

It is still closer to the bandwidth limit than the numexpr one.
The reason for this is twofold: the virtual machine in numexpr always
has some overhead that C does not (i.e. dealing with blocks takes
time), and numexpr still has to deal with (small) temporaries in
cache.
Exercise 6
~~~~~~~~~~
- Compare the performance of this with the polynomial evaluation.

`y = x` performs pretty similarly to the polynomial::

    y = ((.25*x + .75)*x - 1.5)*x - 2

- Why does it scale very similarly to the polynomial evaluation?

The bottleneck is memory access, not CPU computing time.
- Could you have a guess at the memory bandwidth of this machine?

A rough estimate: if the saturation point is around 0.01 seconds for
computing an array of size::

    8 * 1e7 bytes ~ 76 MB

then the bandwidth is around::

    (76 / .01) ~ 7600 MB/s ~ 7.5 GB/s
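The same back-of-the-envelope arithmetic, spelled out (the array
length and the 0.01 s timing are the figures from the estimate above):

```python
nelem = int(1e7)              # elements in the benchmark array
nbytes = 8 * nelem            # float64 takes 8 bytes per element
mbytes = nbytes / 2**20       # ~76.3 MB
elapsed = 0.01                # seconds at the saturation point
bandwidth = mbytes / elapsed  # ~7630 MB/s, i.e. ~7.5 GB/s
```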
Evaluating with carray
======================
Exercise 7
~~~~~~~~~~
- Why do you think carray evaluates faster than NumPy, even when
  using the Python VM (virtual machine)?

carray executes faster because it applies the blocking technique
regardless of the virtual machine chosen. NumPy does not do that.

- How much does compression slow down the evaluation? What is the
  compression ratio achieved? Is that a lot?

Compression slows down the computation of the polynomial by a factor
of 2x. The compression ratio is 2x, which is pretty usual for real
data.
Exercise 8
~~~~~~~~~~
- How much does compression slow down the evaluation? What is the
  compression ratio achieved? Is that a lot?

In this case, the compression ratio is much better: 100x for
compression level 5. And yes, 100x is really a lot.

- Why do you think the results vary so dramatically?

Because the output of the new expression is boolean, with very low
entropy in the data distribution. Hence, Blosc can do the compression
in a very easy and fast way. This is why performance is not affected
by compression in this case.
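The entropy effect is easy to reproduce with the standard library's
`zlib` (used here only as a stand-in for Blosc; the actual ratios are
illustrative and codec-dependent). A boolean result made of long runs
compresses orders of magnitude better than the float input:

```python
import zlib

import numpy as np

x = np.linspace(0, 10, 1_000_000)

floats = x.tobytes()        # high-entropy float64 payload
bools = (x > 5).tobytes()   # boolean result: long runs of 0s and 1s

ratio_floats = len(floats) / len(zlib.compress(floats))
ratio_bools = len(bools) / len(zlib.compress(bools))
```

The boolean buffer is essentially two runs, so the compressor reduces
it to a handful of bytes, while the float buffer barely shrinks.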
Querying Big Data
=================
Exercise 9
~~~~~~~~~~
- How does a carray query compare with a NumPy one?

First, the syntax is simpler. Just compare the NumPy way::

    sum(r["f0"] for r in nt[(nt["f0"]<10)])

with the carray way::

    sum(r.f0 for r in t.where("f0<10"))
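For a runnable point of comparison, here is the NumPy side of the
'simple' query on a small, made-up structured array (the field layout
and contents are assumptions standing in for the exercise's `nt`
table):

```python
import numpy as np

# Hypothetical stand-in for the `nt` table used in the exercise
nt = np.zeros(1000, dtype=[("f0", "i8"), ("f1", "f8")])
nt["f0"] = np.arange(1000)
nt["f1"] = np.linspace(0., 1., 1000)

# 'simple' query: sum f0 over the rows where f0 < 10
result = sum(r["f0"] for r in nt[nt["f0"] < 10])
```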
- What is the compression ratio achieved in the ctable `t`?

It is around a factor of 2x, which is a normal ratio.
- How do the 'simple' and 'complex' queries execute in comparison
  with the NumPy ones?

The 'simple' query executes faster on a NumPy container, while the
'complex' query executes faster on a carray container. The reason is
that for the 'simple' query, carray shows a higher overhead for
dealing with chunked arrays. For the 'complex' query the overhead is
the computation of the expression, and here is where numexpr shows
its muscle, making the carray approach faster.
- If you are on the big Intel Lab machine, increase NROWS by one
  order of magnitude and re-run the benchmark. What do you see?

The time used increases linearly and the compression ratio stays more
or less the same.
Exercise 10
~~~~~~~~~~~
- Try to find the sweet spot for the 'simple' query.

The sweet spot is using 1 thread for both NROWS = 1e7 and NROWS =
1e8. The problem is probably memory-bound, so using more threads
does not help much.
- Repeat for the 'complex' query.

The sweet spot in this case is around 13 threads for NROWS = 1e7 and
around 16 threads for NROWS = 1e8.
- Why do you think there is such a large difference in the sweet spot?

The big difference between the 'simple' query and the 'complex' one is
that the numexpr VM used by carray can make better use of threads
when evaluating a complex expression.