========================================
Exercises for Memory-Efficient Computing
========================================
In-memory computations: Numexpr as an accelerator of NumPy expressions
======================================================================
Initially, we are going to see how to optimize the computation of
expressions that fit comfortably in main memory. For the exercises in
this section we will mainly use the ``poly1.py`` script.
**Warning**: For this part, please remember to log in to the remote
server to perform the computations.
1. Use the ``poly1.py`` script to check how much time it takes to
evaluate the following polynomial::

    y = .25*x**3 + .75*x**2 - 1.5*x - 2

with x in the range [-1, 1], and with 10 million points.
- Set the `what` parameter to "numexpr" and take note of the
speed-up versus the "numpy" case. Why do you think the speed-up
is so large?
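For reference, the numpy-versus-numexpr comparison done by ``poly1.py`` can be sketched as a stand-alone script like this (a minimal sketch, not the script itself; the variable names and timing code are illustrative, only the expression and the 10-million-point size come from the exercise):

```python
import numpy as np
import numexpr as ne
from time import time

N = 10 * 1000 * 1000                     # 10 million points
x = np.linspace(-1, 1, N)                # x in the range [-1, 1]

t0 = time()
y_np = .25*x**3 + .75*x**2 - 1.5*x - 2   # NumPy: one in-memory temporary per operation
t_np = time() - t0

t0 = time()
y_ne = ne.evaluate(".25*x**3 + .75*x**2 - 1.5*x - 2")  # numexpr: blocked, multi-threaded
t_ne = time() - t0

print("numpy: %.3fs  numexpr: %.3fs  speed-up: %.1fx" % (t_np, t_ne, t_np / t_ne))
```

Note that ``ne.evaluate()`` takes the expression as a string and picks up `x` from the calling scope, which is what lets it avoid NumPy's intermediate temporaries.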
2. The expression below::

    y = ((.25*x + .75)*x - 1.5)*x - 2

represents the same polynomial as the original one, but with some
interesting side effects on efficiency. Repeat the computation for
numpy and numexpr and draw your own conclusions.
- Why do you think numpy performs much more efficiently with
this new expression?
- Why is the speed-up in numexpr not as high in comparison?
- Why does numexpr continue to be faster than numpy?
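The effect of the factorized (Horner) form can be observed with plain NumPy alone; this sketch assumes the same 10-million-point x as in exercise 1:

```python
import numpy as np
from time import time

x = np.linspace(-1, 1, 10 * 1000 * 1000)

t0 = time()
y1 = .25*x**3 + .75*x**2 - 1.5*x - 2     # naive form: uses the costly ** operator
t_naive = time() - t0

t0 = time()
y2 = ((.25*x + .75)*x - 1.5)*x - 2       # Horner form: only multiplies and adds
t_horner = time() - t0

print("naive: %.3fs  Horner: %.3fs" % (t_naive, t_horner))
```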
3. The C program ``poly.c`` does the same computation as above, but
in pure C. Compile it like this::

    gcc -O3 -o poly poly.c -lm

and execute it.
- Why do you think it is more efficient than the above approaches?
4. Make sure that you are on a multi-processor machine and repeat the
last computation in poly1.py (using numexpr), but increasing the
number of threads one by one (use the `ne.set_num_threads()`
function).
- How does the efficiency scale?
- Why do you think it scales that way?
- How does performance compare with the C computation?
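The thread-scaling loop can be sketched like this (assuming numexpr is installed; `ne.detect_number_of_cores()` and `ne.set_num_threads()` are part of the numexpr API):

```python
import numpy as np
import numexpr as ne
from time import time

x = np.linspace(-1, 1, 10 * 1000 * 1000)
expr = "((.25*x + .75)*x - 1.5)*x - 2"

times = []
for nthreads in range(1, ne.detect_number_of_cores() + 1):
    ne.set_num_threads(nthreads)         # numexpr splits the blocks among threads
    t0 = time()
    ne.evaluate(expr)
    times.append(time() - t0)
    print("%d thread(s): %.3fs" % (nthreads, times[-1]))
```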
5. The expression::

    y = sin(x)**2 + cos(x)**2

contains the sine and cosine, transcendental functions that
cannot easily be computed in terms of simple CPU operations and
need many cycles to complete. Compute this using numpy first.
Then use numexpr with several threads.
- How does the efficiency scale?
- Why does it scale differently from the previous polynomial
expression?
- Modify poly.c so that you can evaluate this transcendental
expression. How does it perform compared with numpy/numexpr?
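A minimal sketch of the numpy/numexpr part of this exercise (the thread count shown is just one of the values you should try):

```python
import numpy as np
import numexpr as ne
from time import time

x = np.linspace(-1, 1, 10 * 1000 * 1000)

t0 = time()
y_np = np.sin(x)**2 + np.cos(x)**2       # transcendental functions dominate the cost
t_np = time() - t0

ne.set_num_threads(2)                    # repeat with 1, 2, 3, ... threads
t0 = time()
y_ne = ne.evaluate("sin(x)**2 + cos(x)**2")
t_ne = time() - t0

print("numpy: %.3fs  numexpr (2 threads): %.3fs" % (t_np, t_ne))
```

Since sin(x)**2 + cos(x)**2 is identically 1, the result is also a handy correctness check.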
Out-of-memory computations: ``numpy.memmap`` versus ``tables.Expr``
===================================================================
Now we are going to make use of the script ``poly2.py`` to compute
the same problem as above, but using an out-of-memory paradigm.
Comparing ``numpy.memmap`` and ``tables.Expr`` approaches
---------------------------------------------------------
6. Use the script ``poly2.py`` to study the `compute_numpy` and
`compute_tables` functions and try to understand how the different
``numpy.memmap`` and ``tables.Expr`` paradigms work.
- Compare the times for computing the polynomial via both
``numpy.memmap`` and ``tables.Expr`` (set the `what` variable
appropriately). Do you notice any difference? Why?
- Compare the latter times with the times for the in-memory
approach. Why do you think the out-of-memory paradigm is slower?
- With the out-of-memory approach, try putting the result
in-memory. Is the improvement noticeable?
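For orientation, the two paradigms compared by ``poly2.py`` can be sketched as below. This is not the script itself: the file names (``x.bin``, ``y.bin``, ``poly.h5``) and the small array size are illustrative.

```python
import numpy as np
import tables as tb

N = 1000 * 1000   # kept small for the sketch; the exercise uses many more points

# --- numpy.memmap: the files behave like ndarrays backed by disk
x = np.memmap("x.bin", dtype="float64", mode="w+", shape=(N,))
x[:] = np.linspace(-1, 1, N)
y = np.memmap("y.bin", dtype="float64", mode="w+", shape=(N,))
y[:] = ((.25*x + .75)*x - 1.5)*x - 2   # note: the temporaries still live in RAM

# --- tables.Expr: blockwise evaluation, streaming the result to HDF5
with tb.open_file("poly.h5", "w") as f:
    xh = f.create_carray(f.root, "x", tb.Float64Atom(), (N,))
    xh[:] = np.linspace(-1, 1, N)
    yh = f.create_carray(f.root, "y", tb.Float64Atom(), (N,))
    expr = tb.Expr("((.25*x + .75)*x - 1.5)*x - 2", uservars={"x": xh})
    expr.set_output(yh)
    expr.eval()   # never materializes the whole x or y in memory
```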
Playing with compression
------------------------
7. With the ``tables.Expr`` module, play with different compression
levels (including 0, i.e. no compression) for the Blosc compressor.
- Which one compresses better?
- Which one achieves the best compression/time ratio?
- Is this competitive in terms of speed with the non-compressed
mode?
8. Compare 'blosc' with other compressors in PyTables, like 'zlib' or
'lzo'.
- Which one compresses better?
- Which one achieves the best compression/time ratio?
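A sketch of how compressors and levels can be swept with ``tables.Filters`` (the file name and array size are illustrative; add "lzo" to the list if it is available in your PyTables build):

```python
import os
from time import time

import numpy as np
import tables as tb

N = 1000 * 1000
x = np.linspace(-1, 1, N)

for complib in ("blosc", "zlib"):          # add "lzo" here if available
    for complevel in (0, 3, 9):            # 0 means no compression
        filters = tb.Filters(complib=complib, complevel=complevel)
        t0 = time()
        with tb.open_file("cmp.h5", "w") as f:
            ca = f.create_carray(f.root, "x", tb.Float64Atom(), (N,),
                                 filters=filters)
            ca[:] = x
        print("%-5s level %d: %6.3fs  %9d bytes"
              % (complib, complevel, time() - t0, os.path.getsize("cmp.h5")))
```

Comparing the printed file sizes against the raw 8*N bytes gives the compression ratio for each setting.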
Making real "out-of-memory" computations
----------------------------------------
Of course, the advantage of the out-of-memory approach is that you can
still perform your computations even if they exceed your available
memory.
**Warning**: In order not to overload the server, please do the next
exercises on your laptops only.
9. Set the number of elements (N) in vector x to some value that
slightly exceeds the amount of *physical* memory in your
laptop, but is still less than the *virtual* memory.
**Hint**: the working set for this problem is 2*N*size(datatype).
As the datatype is double precision, size(datatype)=8. So,
for a laptop with 1 GB of main memory, setting N=80 million is
fine.
**Warning**: For this part, you should make sure that you have some
swap space available (check with the `free` command). If you don't,
please create some.
- Which approach (``numpy.memmap`` or ``tables.Expr``) is faster?
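The sizing hint above can be checked with a quick back-of-the-envelope calculation:

```python
# Working set = 2 * N * size(datatype): one N-element x vector plus
# one N-element y vector, both float64 (8 bytes per element).
ram = 1 * 2**30                 # 1 GB of physical memory
N = 80 * 1000 * 1000            # 80 million elements, as in the hint
working_set = 2 * N * 8         # in bytes
print("working set: %.2f GB" % (working_set / 2**30))   # ~1.19 GB > 1 GB of RAM
```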
10. You will surely have noticed some significant jitter while doing
measurements in this section. Uncomment the::

    os.system("sync")

line in the `print_filesize()` function and see if measurements are a
bit more reproducible.
- Why do you think that is?
11. With this setup, try with ``tables.Expr`` together with Blosc and
different compression levels.
- Which compression level gives the best speed? Can you explain why?
Beyond virtual memory limits
----------------------------
12. Finally, use a working set slightly larger than your *virtual*
memory. First try ``tables.Expr`` and then ``numpy.memmap``. Monitor
the memory consumption in another terminal with the "top" utility.
**Hint**: In this test ``numpy.memmap`` will ask for more virtual
memory than your system can possibly deliver, so be prepared to
see your process killed by the OS or, even worse, you may end up
with your kernel frozen for several minutes. If you are a bit
faint of heart, you are not forced to check this experimentally ;-)
- Why do you think ``tables.Expr`` consumes so little memory?