When parallelization does not help: the starving CPUs problem

Executive summary

Computational scientists should know that most of the time their CPUs are waiting for data to arrive. Knowing where the low-level bottlenecks are, and knowing what can be done to ameliorate them, may save hours of frustration when trying to understand why apparently well-written programs perform poorly.

I'll talk about why current CPUs are starving for data, and how to address this issue in modern computers by using different techniques.

Contents

1. Motivation

2. The Data Access Issue

  • Why Modern CPUs Are Starving?
  • Caches And The Hierarchical Memory Model
  • Techniques For Fighting Data Starvation

3. High Performance Libraries

Preliminary slides here. The multiprocessing script for NumPy is here.

Exercises

Fetch the guidelines in TXT format or PDF.

You will need these files:

Obviously, the solutions are not provided (yet!).

Server for exercises
You should see the following
Host key fingerprint is 08:43:b6:f6:ca:fb:7d:de:4a:19:9a:31:70:fd:1b:eb
+--[ RSA 2048]----+
|    o            |
|   o .   .       |
|    = . . .      |
|   . + +   .     |
|      o S . o    |
|   . .   = o +   |
|    o   o o o    |
|     . . ..o     |
|    ... .oo.E    |
+-----------------+

The exercises need to be performed on a machine with more power than a normal notebook. To access the server you need to use a different port (to bypass the firewall :( ):

ssh -p 443 student@escher.fuw.edu.pl -o VisualHostKey=yes
CPU vs Memory Benchmark

Save the following as cpu_vs_mem.py and execute. Thanks to Stéfan van der Walt for contributing this.

cpu_vs_mem.py
import numpy as np
import timeit
import sys
 
N = 10*1000*1000
x = np.linspace(1, 10, N)
 
def raw():
    """Straight-forward NumPy evaluation of polynomial.
 
    """
    return (((.25 * x) + .75) * x - 1.5) * x - 2
 
def inplace(block_size=20000):
    """Blocked evaluation of polynomial.
 
    """
    y = np.empty(len(x))
    for k in range(len(x) // block_size + 1):
        b, e = k * block_size, (k+1) * block_size
        y[b:e] = x[b:e]
        y[b:e] *= .25
        y[b:e] += .75
        y[b:e] *= x[b:e]
        y[b:e] -= 1.5
        y[b:e] *= x[b:e]
        y[b:e] -= 2
 
    return y
 
def bench():
    """Illustrate CPU vs memory trade-off.
 
    Break up a computation in chunks and benchmark. Small blocks fit into
    cache easily, but the NumPy overhead and the outer Python for-loop
    takes long to execute.  With large blocks, the overhead for NumPy and
    the for-loop is negligible, but the blocks no longer fit into cache,
    resulting in delays.
 
    Returns
    -------
    block_sizes : list
        Size of the different data chunks.
    times : list
        Execution times.
 
    """
    times = []
    blocks = np.round(np.logspace(3, 6, num=50))
    for b in blocks:
        times.append(timeit.timeit('cpu_vs_mem.inplace(block_size=%d)' % b,
                                   'import cpu_vs_mem', number=1))
        print 'Block size: %d  Execution time: %.2f' % (b, times[-1])
        sys.stdout.flush()
 
    return blocks, times
 
if __name__ == "__main__":
    import matplotlib.pyplot as plt
 
    blocks, times = bench()
    plt.semilogx(blocks, times, 'o-')
    plt.xlabel('Block size [b]')
    plt.ylabel('Execution time [s]')
    plt.title('CPU vs Memory Benchmark')
    plt.show()
 
materials/starving_cpus.txt · Last modified: 2010/10/07 17:53 by python-faculty
 
Except where otherwise noted, content on this wiki is licensed under the following license:CC Attribution-Noncommercial-Share Alike 3.0 Unported
Recent changes RSS feed Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki