========================================
Exercises for Memory-Efficient Computing
========================================
Optimizing arithmetic expressions
=================================
1. Use script ``poly.py`` to check how much time it takes to evaluate
the next polynomial::
y = .25*x**3 + .75*x**2 - 1.5*x - 2
with x in the range [-1, 1], and with 10 millions points.
- Set the `what` parameter to "numexpr" and take note of the
speed-up versus the "numpy" case. Why do you think the speed-up
is so large?
2. The expression below::
y = ((.25*x + .75)*x - 1.5)*x - 2
represents the same polynomial than the original one, but with some
interesting side-effects in efficiency. Repeat this computation for
numpy and numexpr and get your own conclusions.
- Why do you think numpy is doing much more efficiently with this
new expression?
- Why the speed-up in numexpr is not so high in comparison?
- Why numexpr continues to be faster than numpy?
3. The C program ``poly.c`` does the same computation than above, but
in pure C. Compile it like this::
gcc -O3 -o poly poly.c -lm
and execute it.
- Why do you think it is more efficient than the above approaches?
Evaluating transcental functions
================================
4. Activate the evaluation of the "sin(x)**2+cos(x)**2" expression in
poly.py, a function that includes transcendental functions and run
the script.
- Why the difference in time between NumPy and Numexpr is so small?
5. In poly.c, comment out expression 2) (around line 53) and uncomment
expression 4) (the transcendental function).
- Do this pure C approach go faster than the Python-based ones?
- What would be needed to accelerate the computations?
Parallelism with threads
========================
6. Be sure that you are on a multi-processor machine and activate the::
y = ((.25*x + .75)*x - 1.5)*x - 2
expression in poly-mp.py. Repeat the computation for both numpy and
numexpr for a different number of processes (numpy) or threads
(numexpr) (pass the desired number as a parameter to the script).
- How the efficiency scales?
- Why do you think it scales that way?
- How performance compares with the pure C computation?
7. With the same multi-processor machine, activate the evaluation of
polynomial 2) in poly.c. Recompile now but with OpenMP support::
gcc -O3 -o poly poly.c -lm -fopenmp # notice the new -fopenmp flag!
and execute it for several number of threads (up to e.g. 8)::
OMP_NUM_THREADS=desired_number_of_threads ./poly
Compare its performance with the parallel numexpr.
- How the efficiency scales?
- Which is the asymptotic limit?
8. With the previous examples, compute the expression::
y = x
That is, do a simple copy of the `x` vector. What's the
performance that you are seeing?
- How does it evolve when using different threads? Why it scales very
similarly than the polynomial evaluation?
- Could you have a guess at the memory bandwidth of this machine?
Using Numba
===========
The goal of Numba is to compile arbitrarily complex Python code
on-the-flight and executing it for you. It is fast, although one should
take in account the compile times.
9. Edit poly-numba.py and look at how numba works.
- Run several expressions and determine which method is faster. What
is the compilation time for numba and how it compares with the
execution time?
- Raise the amount of data points to 100 millions. What happens?
- Set the number of threads for numexpr to 8 and redo the
computation. How its speed compares with numba?
- Provided this, which do you think is the best scenario for numba?
Which is the best scenario for numexpr?