========================================
Exercises for Memory-Efficient Computing
========================================

Optimizing arithmetic expressions
=================================

1. Use the script ``poly.py`` to check how much time it takes to evaluate
   the following polynomial::

     y = .25*x**3 + .75*x**2 - 1.5*x - 2

   with x in the range [-1, 1] and with 10 million points.

   - Set the `what` parameter to "numexpr" and take note of the speed-up
     versus the "numpy" case.  Why do you think the speed-up is so large?

2. The expression below::

     y = ((.25*x + .75)*x - 1.5)*x - 2

   represents the same polynomial as the original one, but with some
   interesting side effects on efficiency.  Repeat this computation for
   numpy and numexpr (a minimal timing sketch appears at the end of this
   section) and draw your own conclusions.

   - Why do you think numpy handles this new expression so much more
     efficiently?
   - Why is the speed-up for numexpr not as high in comparison?
   - Why does numexpr remain faster than numpy?

3. The C program ``poly.c`` performs the same computation as above, but in
   pure C.  Compile it like this::

     gcc -O3 -o poly poly.c -lm

   and execute it.

   - Why do you think it is more efficient than the approaches above?
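In case ``poly.py`` is not at hand, the sketch below reproduces the spirit
of exercises 1 and 2.  The timing harness is an assumption, not the actual
script; the ``numexpr.evaluate`` calls are the real numexpr API::

  import time
  import numpy as np
  import numexpr as ne

  N = 10_000_000                     # 10 million points
  x = np.linspace(-1, 1, N)

  def bench(label, func):
      t0 = time.time()
      func()
      print("%-18s %.3f s" % (label, time.time() - t0))

  # Expanded form: numpy allocates a large temporary per operation,
  # and x**3 is an expensive power operation.
  bench("numpy expanded", lambda: .25*x**3 + .75*x**2 - 1.5*x - 2)
  # Horner form: fewer, cheaper operations and fewer temporaries.
  bench("numpy Horner", lambda: ((.25*x + .75)*x - 1.5)*x - 2)

  # numexpr compiles the whole expression and evaluates it blockwise,
  # so the intermediate temporaries stay in cache.
  bench("numexpr expanded",
        lambda: ne.evaluate(".25*x**3 + .75*x**2 - 1.5*x - 2"))
  bench("numexpr Horner",
        lambda: ne.evaluate("((.25*x + .75)*x - 1.5)*x - 2"))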
Evaluating transcendental functions
===================================

4. Activate the evaluation of the "sin(x)**2+cos(x)**2" expression in
   poly.py (an expression that includes transcendental functions) and run
   the script.

   - Why is the difference in time between NumPy and numexpr so small?

5. In poly.c, comment out expression 2) (around line 53) and uncomment
   expression 4) (the transcendental one).

   - Does this pure C approach run faster than the Python-based ones?
   - What would be needed to accelerate the computations?

Parallelism with threads
========================

6. Make sure that you are on a multi-processor machine and activate the::

     y = ((.25*x + .75)*x - 1.5)*x - 2

   expression in poly-mp.py.  Repeat the computation for both numpy and
   numexpr with different numbers of processes (numpy) or threads
   (numexpr); pass the desired number as a parameter to the script.  (A
   sketch of how numexpr's thread count can be set programmatically
   appears at the end of this document.)

   - How does the efficiency scale?
   - Why do you think it scales that way?
   - How does the performance compare with the pure C computation?

7. On the same multi-processor machine, activate the evaluation of
   polynomial 2) in poly.c.  Recompile, this time with OpenMP support::

     gcc -O3 -o poly poly.c -lm -fopenmp  # notice the new -fopenmp flag!

   and execute it for several numbers of threads (up to e.g. 8)::

     OMP_NUM_THREADS=desired_number_of_threads ./poly

   Compare its performance with the parallel numexpr.

   - How does the efficiency scale?
   - What is the asymptotic limit?

8. With the previous examples, compute the expression::

     y = x

   That is, do a simple copy of the `x` vector.  What performance are you
   seeing?

   - How does it evolve when using different numbers of threads?  Why
     does it scale so similarly to the polynomial evaluation?
   - Can you make a guess at the memory bandwidth of this machine?  (A
     sketch of such an estimate appears at the end of this document.)

Using Numba
===========

The goal of Numba is to compile arbitrarily complex Python code on the
fly and execute it for you.  It is fast, although one should take the
compile times into account.

9. Edit poly-numba.py and look at how numba works (a minimal sketch of
   the pattern appears at the end of this document).

   - Run several expressions and determine which method is faster.  What
     is the compilation time for numba, and how does it compare with the
     execution time?
   - Raise the number of data points to 100 million.  What happens?
   - Set the number of threads for numexpr to 8 and redo the computation.
     How does its speed compare with numba's?
   - Given all this, which do you think is the best scenario for numba?
     Which is the best scenario for numexpr?
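For exercise 6, numexpr's thread pool can be resized at run time with
``numexpr.set_num_threads`` (a real numexpr call); the timing loop below
is only a sketch of the measurement, not the actual ``poly-mp.py``::

  import time
  import numpy as np
  import numexpr as ne

  x = np.linspace(-1, 1, 10_000_000)

  for nthreads in (1, 2, 4, 8):
      ne.set_num_threads(nthreads)   # resize numexpr's thread pool
      t0 = time.time()
      ne.evaluate("((.25*x + .75)*x - 1.5)*x - 2")
      print("%d thread(s): %.3f s" % (nthreads, time.time() - t0))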
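For exercise 8, a rough memory-bandwidth figure can be derived from the
copy time: ``y = x`` reads the whole of `x` and writes the whole of `y`,
so roughly twice the array size crosses the memory bus.  The
two-transfers-per-element accounting below is the main assumption
(write-allocate traffic and caches can distort it)::

  import time
  import numpy as np
  import numexpr as ne

  x = np.linspace(-1, 1, 10_000_000)

  t0 = time.time()
  y = ne.evaluate("x")               # a simple copy of the x vector
  elapsed = time.time() - t0

  # read x + write y => about 2 * x.nbytes moved through memory
  gbytes = 2 * x.nbytes / 1e9
  print("copy: %.3f s  ->  ~%.1f GB/s" % (elapsed, gbytes / elapsed))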
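Finally, for exercise 9, the pattern that ``poly-numba.py`` relies on
looks roughly like the sketch below: the first call to an
``@njit``-decorated function pays the compilation cost, while later calls
run the cached machine code.  The function body is an assumption about
what the script computes::

  import time
  import numpy as np
  from numba import njit

  @njit
  def poly(x):
      y = np.empty_like(x)
      for i in range(x.shape[0]):
          y[i] = ((.25*x[i] + .75)*x[i] - 1.5)*x[i] - 2
      return y

  x = np.linspace(-1, 1, 10_000_000)

  t0 = time.time()
  poly(x)                            # first call: compile + execute
  print("first call (compile + run): %.3f s" % (time.time() - t0))

  t0 = time.time()
  poly(x)                            # second call: execute only
  print("second call (run only):     %.3f s" % (time.time() - t0))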