Use the script ``poly1.py`` to check how much time it takes to evaluate the following polynomial:

y = .25*x**3 + .75*x**2 - 1.5*x - 2

with x in the range [-1, 1], and with 10 million points.

- Set the `what` parameter to “numexpr” and take note of the speed-up versus the “numpy” case.
- Why do you think the speed-up is so large?
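If you want to reproduce the measurement outside ``poly1.py``, a minimal timing sketch could look like this (assuming numexpr is installed; the variable names are mine, not the script's):

```python
import numpy as np
import numexpr as ne
from time import time

N = 10 * 1000 * 1000  # 10 million points
x = np.linspace(-1, 1, N)

t0 = time()
y_np = .25*x**3 + .75*x**2 - 1.5*x - 2   # numpy: one temporary array per operation
t_np = time() - t0

t0 = time()
# numexpr: evaluates the whole expression in cache-sized blocks, no big temporaries
y_ne = ne.evaluate(".25*x**3 + .75*x**2 - 1.5*x - 2")
t_ne = time() - t0

print("numpy: %.3fs  numexpr: %.3fs  speed-up: %.1fx" % (t_np, t_ne, t_np / t_ne))
```

Both results should agree to floating-point precision; only the time to compute them differs.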

The expression below:

y = ((.25*x + .75)*x - 1.5)*x - 2

represents the same polynomial as the original one, but with some interesting implications for efficiency. Repeat the computation for numpy and numexpr and draw your own conclusions.

- Why do you think numpy performs much more efficiently with this new expression?

- Why is the speed-up in numexpr not as high in comparison?

- Why does numexpr continue to be faster than numpy?
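The factored form is Horner's rule: it replaces the expensive ``x**3`` and ``x**2`` power operations with plain multiplies and adds. A quick numpy-only sketch that checks both forms agree and compares their cost:

```python
import numpy as np
from time import time

x = np.linspace(-1, 1, 10 * 1000 * 1000)

t0 = time()
y1 = .25*x**3 + .75*x**2 - 1.5*x - 2   # naive form: power calls plus many temporaries
t1 = time() - t0

t0 = time()
y2 = ((.25*x + .75)*x - 1.5)*x - 2     # Horner form: only multiplies and adds
t2 = time() - t0

print("naive: %.3fs  Horner: %.3fs" % (t1, t2))
```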

The C program ``poly.c`` does the same computation as above, but in pure C. Compile it like this:

gcc -O3 -o poly poly.c -lm

and execute it.

- Why do you think it is more efficient than the above approaches?

Make sure that you are on a multi-processor machine and repeat the last computation in poly1.py, increasing the number of threads one by one (change the number in the ``for nt in range(1):`` loop).

- How does the efficiency scale?

- Why do you think it scales that way?

- How does performance compare with the pure C computation?
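Outside ``poly1.py``, the same thread sweep can be sketched with numexpr's own ``set_num_threads()`` (the range below is just an example; adjust it to your core count):

```python
import numpy as np
import numexpr as ne
from time import time

x = np.linspace(-1, 1, 10 * 1000 * 1000)

for nt in range(1, 5):          # try 1..4 threads; adjust to your machine
    ne.set_num_threads(nt)
    t0 = time()
    y = ne.evaluate("((.25*x + .75)*x - 1.5)*x - 2")
    print("%d thread(s): %.3fs" % (nt, time() - t0))
```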

On the same multi-processor machine, recompile ``poly.c``, but with OpenMP support:

gcc -O3 -o poly poly.c -lm -fopenmp # notice the new -fopenmp flag!

and execute it for several numbers of threads:

OMP_NUM_THREADS=desired_number_of_threads ./poly

Compare its performance with the parallel numexpr.

- How does the efficiency scale?

- What is the asymptotic limit?

With the previous examples, compute the expression:

y = x

That is, do a simple copy of the `x` vector. What performance are you seeing? How does it evolve when using different numbers of threads?
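A sketch of the copy benchmark (the GB/s figure counts one read plus one write of the buffer; the actual number depends entirely on your memory subsystem, which is the point of the exercise):

```python
import numpy as np
import numexpr as ne
from time import time

x = np.linspace(-1, 1, 10 * 1000 * 1000)

t0 = time()
y = ne.evaluate("x")            # a plain copy: no arithmetic, pure memory traffic
t = time() - t0
gbs = 2 * x.nbytes / t / 2**30  # bytes read + bytes written, in GiB/s
print("copy: %.3fs (~%.1f GB/s)" % (t, gbs))
```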

Look into the sources of ``carray-eval.py`` and run it. For the first expression evaluation, i.e.:

((.25*x + .75)*x - 1.5)*x - 2

- Why do you think carray evaluates faster than NumPy, even when using the Python VM (virtual machine)?

- How much does the compression slow down the evaluation? What compression ratio is achieved? Is that a lot?

Repeat your reasoning with the second expression:

((.25*x + .75)*x - 1.5)*x - 2 < 0

- Why do you think the results vary so dramatically?

Look into the sources of the ``carray-ctable.py`` script and run it.

- How does a carray query compare with a numpy one?

- What is the compression ratio achieved in the ctable `t`?

- How do the 'simple' and 'complex' queries execute in comparison with the NumPy ones?

- If you are on the big Intel Lab machine, increase NROWS by one order of magnitude and re-run the benchmark. What do you see?

Enter the ipython console and generate the big `t` ctable (just copy and paste the appropriate statements from the previous ``carray-ctable.py``).

- Try to find the sweet spot for the 'simple' query by selecting different numbers of threads, running:

ca.set_nthreads(your_number_of_threads)

- Repeat for the 'complex' query.

- Why do you think there is such a large difference in the sweet spot?