Use the script ``poly1.py`` to check how much time it takes to evaluate the following polynomial:

y = .25*x**3 + .75*x**2 - 1.5*x - 2

with x in the range [-1, 1], and with 10 million points.

- Set the `what` parameter to “numexpr” and take note of the speed-up versus the “numpy” case.
- Why do you think the speed-up is so large?
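If you want to reproduce the measurement outside ``poly1.py``, a minimal timing sketch could look like this (assuming numexpr is installed; the variable names are mine, not the script's):

```python
import numpy as np
import numexpr as ne
from time import time

N = 10 * 1000 * 1000  # 10 million points
x = np.linspace(-1, 1, N)

t0 = time()
y_np = .25*x**3 + .75*x**2 - 1.5*x - 2   # numpy: one temporary array per operation
t_np = time() - t0

t0 = time()
# numexpr: evaluates the whole expression in cache-sized blocks, no big temporaries
y_ne = ne.evaluate(".25*x**3 + .75*x**2 - 1.5*x - 2")
t_ne = time() - t0

print("numpy: %.3fs  numexpr: %.3fs  speed-up: %.1fx" % (t_np, t_ne, t_np / t_ne))
```

Both results should agree to floating-point precision; only the time to compute them differs.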

The expression below:

y = ((.25*x + .75)*x - 1.5)*x - 2

represents the same polynomial as the original one, but with some interesting implications for efficiency. Repeat the computation for numpy and numexpr and draw your own conclusions.

- Why do you think numpy performs much more efficiently with this new expression?

- Why is the speed-up in numexpr not as high in comparison?

- Why does numexpr continue to be faster than numpy?
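The factored form is Horner's rule: it replaces the expensive ``x**3`` and ``x**2`` power operations with plain multiplies and adds. A quick numpy-only sketch that checks both forms agree and compares their cost:

```python
import numpy as np
from time import time

x = np.linspace(-1, 1, 10 * 1000 * 1000)

t0 = time()
y1 = .25*x**3 + .75*x**2 - 1.5*x - 2   # naive form: power calls plus many temporaries
t1 = time() - t0

t0 = time()
y2 = ((.25*x + .75)*x - 1.5)*x - 2     # Horner form: only multiplies and adds
t2 = time() - t0

print("naive: %.3fs  Horner: %.3fs" % (t1, t2))
```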

The C program ``poly.c`` does the same computation as above, but in pure C. Compile it like this:

gcc -O3 -o poly poly.c -lm

and execute it.

- Why do you think it is more efficient than the above approaches?

Make sure that you are on a multi-processor machine and repeat the last computation in poly1.py, increasing the number of threads one by one (change the number in the ``for nt in range(1):`` loop).

- How does the efficiency scale?

- Why do you think it scales that way?

- How does performance compare with the pure C computation?
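Outside ``poly1.py``, the same thread sweep can be sketched with numexpr's own ``set_num_threads()`` (the range below is just an example; adjust it to your core count):

```python
import numpy as np
import numexpr as ne
from time import time

x = np.linspace(-1, 1, 10 * 1000 * 1000)

for nt in range(1, 5):          # try 1..4 threads; adjust to your machine
    ne.set_num_threads(nt)
    t0 = time()
    y = ne.evaluate("((.25*x + .75)*x - 1.5)*x - 2")
    print("%d thread(s): %.3fs" % (nt, time() - t0))
```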

On the same multi-processor machine, recompile ``poly.c``, but with OpenMP support:

gcc -O3 -o poly poly.c -lm -fopenmp # notice the new -fopenmp flag!

and execute it for several numbers of threads:

OMP_NUM_THREADS=desired_number_of_threads ./poly

Compare its performance with the parallel numexpr.

- How does the efficiency scale?

- What is the asymptotic limit?

With the previous examples, compute the expression:

y = x

That is, do a simple copy of the `x` vector. What performance are you seeing? How does it evolve when using different numbers of threads?
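A sketch of the copy benchmark (the GB/s figure counts one read plus one write of the buffer; the actual number depends entirely on your memory subsystem, which is the point of the exercise):

```python
import numpy as np
import numexpr as ne
from time import time

x = np.linspace(-1, 1, 10 * 1000 * 1000)

t0 = time()
y = ne.evaluate("x")            # a plain copy: no arithmetic, pure memory traffic
t = time() - t0
gbs = 2 * x.nbytes / t / 2**30  # bytes read + bytes written, in GiB/s
print("copy: %.3fs (~%.1f GB/s)" % (t, gbs))
```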

Look into the sources of ``carray-eval.py`` and run it. For the first expression evaluation, i.e.:

((.25*x + .75)*x - 1.5)*x - 2

- Why do you think carray evaluates faster than NumPy, even when using the Python VM (virtual machine)?

- How much does the compression slow down the evaluation? What compression ratio is achieved? Is that a lot?

Repeat your reasoning with the second expression:

((.25*x + .75)*x - 1.5)*x - 2 < 0

- Why do you think the results vary so dramatically?

Look into the sources of the ``carray-ctable.py`` script and run it.

- How does a carray query compare with a numpy one?

- What is the compression ratio achieved in the ctable `t`?

- How do the 'simple' and 'complex' queries execute in comparison with the NumPy ones?

- If you are on the big Intel Lab machine, increase NROWS by one order of magnitude and re-run the benchmark. What do you see?

Enter the ipython console and generate the big `t` ctable (just copy and paste the appropriate statements from the previous ``carray-ctable.py``).

- Try to find the sweet spot for the 'simple' query by selecting different numbers of threads, running:

ca.set_nthreads(your_number_of_threads)

- Repeat for the 'complex' query.

- Why do you think there is such a large difference in the sweet spot?