Problem-set: Advanced NumPy

Please do explore beyond the problems given, and feel free to ask questions at any time.

Note: Solutions to some problems are provided, but try to avoid peeking at them until you've made an attempt! Before we start, load NumPy:

import numpy as np

These exercises explore some of the more advanced features of NumPy. If you prefer to learn the basics from scratch, have a look at this introduction.

If you run out of exercises, there are 90 more here!

The NumPy array

  • Let's explore the basics of a NumPy array. Create a 3x4 array of random values (using np.random.random), and call it x. Create another array as follows: y = x[2]. What happens when you modify y, does x also change?
  • In the array above, have a look at its different properties: x.shape, x.ndim, x.dtype, x.strides. What does each property tell you about the array?
  • Compare the strides property of x.T to the above. What is x.T and can you explain the new strides?
  • Compute the sum of the columns of x. You may find the function to do that by using np.<TAB> in IPython / Jupyter, print(dir(np)) in standard Python or np.lookfor('sum'). Remember that you can always look at the docstring of any function in Python (use np.func_name?<ENTER> in IPython or help(np.func_name) in standard Python).
  • Construct the array x = np.array([0, 1, 2, 3], dtype=np.uint8) (here, uint8 represents a single byte in memory, an unsigned integer between 0 and 255). Can you explain the results obtained by x + 1 and x / 2? Also try x.astype(float) + 1 and x.astype(float) / 2.
  • What is the maximum number of dimensions a NumPy array can have? Use one of the array constructors (np.zeros, np.empty, np.random.random, etc.) to find out.

Broadcasting

Q1

Take a look at the following. Many of you may know it as a meshgrid:

x, y = np.mgrid[:10, :5]
z = x + y

Now, reproduce z from the above snippet, but this time use NumPy's ogrid together with broadcasting.

In our solution, broadcasting is used "behind the scenes". To see what happens more explicitly, apply np.broadcast_arrays on the x and y given by ogrid. This should correspond to the x and y produced by mgrid.

Benchmark the two approaches (one using mgrid, the other ogrid), using IPython's %timeit function, on a larger grid. Can you explain the difference in execution time?

Q2

Given a list of XYZ-coordinates, p,

[[1.0, 2.0, 10],
 [3.0, 4.0, 20],
 [5.0, 6.0, 30],
 [7.0, 8.0, 40]]

Normalise each coordinate by dividing with its Z (3rd) element. For example, the first row becomes:

[1/10, 2/10, 10/10]

Hint: extract the last column into a variable z, and then change its dimensions so that p / z works.

Indexing

Q1

Create a 3x3 ndarray. Use fancy indexing to slice out the diagonal elements.

Q2

Generate a 10 x 3 array of random numbers (all between 0 and 1). From each row, pick the number closest to 0.75. Make use of np.abs and np.argmax to find the column j which contains the closest element in each row.

Q3

Predict and verify the shape of the following slicing operation.

x = np.empty((10, 8, 6))

idx0 = np.zeros((3, 8)).astype(int)
idx1 = np.zeros((3, 1)).astype(int)
idx2 = np.zeros((1, 1)).astype(int)

x[idx0, idx1, idx2]

Q4 [advanced]

Strictly speaking, this is not a question on indexing, but it's a fun exercise either way.

Construct an array

x = np.arange(12, dtype=np.int32).reshape((3, 4))

so that x is

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

Now, provide to np.lib.stride_tricks.as_strided the strides necessary to view a sliding 2x2 window over this array. The output should be

array([[[[ 0,  1],
         [ 4,  5]],

        [[ 1,  2],
         [ 5,  6]],

        [[ 2,  3],
         [ 6,  7]]],


       [[[ 4,  5],
         [ 8,  9]],

        [[ 5,  6],
         [ 9, 10]],

        [[ 6,  7],
         [10, 11]]]], dtype=int32)

The code is of the form

z = as_strided(x, shape=(2, 3, 2, 2),
                  strides=(..., ..., ..., ...))

This sort of stride manipulation is handy for implementing techniques such as region based statistics, convolutions, etc.

Structured Arrays

Q1: Structured data types

Design a data-type for storing the following record:

  • Timestamp in nanoseconds (a 64-bit unsigned integer)
  • Position (x- and y-coordinates, stored as floating point numbers)

Use it to represent the following data:

dt = np.dtype(<your code here>)
x = np.array([(100, (0, 0.5)),
              (200, (0, 10.3)),
              (300, (5.5, 15.1))], dtype=dt)

Have a look at the np.dtype docstring if you need help. After constructing x, try to print all the x values for which time is greater than 100 (hint: something of the form y[y > 100]).

Q2: Structured file I/O

Place the following data in a text file, data.txt:

% rank         lemma (10 letters max)      frequency       dispersion
21             they                        1865844         0.96
42             her                         969591          0.91
49             as                          829018          0.95
7              to                          6332195         0.98
63             take                        670745          0.97
14             you                         3085642         0.92
35             go                          1151045         0.93
56             think                       772787          0.91
28             not                         1638883         0.98

Now, design a suitable structured data type, then load the data from the text file using np.loadtxt (look at the docstring to see how to handle the '%' comment character). Here's a skeleton to start with:

import numpy as np
txtdata = open('data.txt', 'r')

# Construct the data-type
# In IPython type np.dtype?<ENTER> for examples, e.g.,
#
# np.dtype([('x', np.float), ('y', np.int), ('z', np.uint8)])

dt = np.dtype(...)  # Modify this line to give the correct answer

data = np.loadtxt(...)  # Load data with loadtxt

Examine the data you got, for example:

  • Extract words only
  • Extract the 3rd row
  • Print all words with rank < 30

Sort the data according to frequency (see np.sort).

Save the result to a compressed Numpy data file, sorted.npz, using np.savez and load it back with out = np.load(...).

Do you get back what you put in? If not, use dir(out) or TAB-completion on out to figure out how to access the data (HINT: out behaves like a dictionary).

Array Interface

An author of a foreign package (included with the exercizes as problems/mutable_str.py) provides a string class that allocates its own memory:

In [1]: from mutable_str import MutableString
In [2]: s = MutableString('abcde')
In [3]: print(s)
abcde

You'd like to view these mutable (mutable means the ability to modify in place) strings as ndarrays, in order to manipulate the underlying memory.

Add an array_interface dictionary attribute to s, then convert s to an ndarray. Numerically add "2" to the array (use the in-place operator +=).

Then print the original string to ensure that its value was modified.

Here's a skeleton outline:

import numpy as np
from mutable_str import MutableString

s = MutableString('abcde')

# --- EDIT THIS SECTION ---

# Create an array interface to this foreign object
s.__array_interface__ = {'data' : (XXX, False), # (ptr, is read_only?)
                         'shape' : XXX,
                         'typestr' : '|u1', # typecode unsigned character
                         }

# --- EDIT THIS SECTION ---

print('String before converting to array:', s)
sa = np.asarray(s)

print('String after converting to array:', sa)

sa += 2
print('String after adding "2" to array:', s)

Hint: Documentation for NumPy's __array_interface__ may be found in the online docs.

Advanced Exercises

Q1

Construct the following two arrays:

x = np.array([[1, 2], [3, 4]], order='C', dtype=np.uint8)
y = np.array([[1, 2], [3, 4]], order='F', dtype=np.uint8)

Compare the bytes they store in memory by using

import ctypes
print(list(ctypes.string_at(x.ctypes.data, 4)))
print(list(ctypes.string_at(y.ctypes.data, 4)))

For any array in general, you can do the above using

print(list(ctypes.string_at(x.ctypes.data, x.size * x.itemsize)))

Note that, even though these arrays store data in different memory order, they are identical from the user's perspective.

print(x)
print(y)

Examine the bytes stored by the following array:

x = np.array([[1, 2], [3, 4]], dtype=np.uint32)

Note that, on most laptops, the byte order will be little Endian, i.e. least significant byte first.

Very Advanced Exercises

Q1 : Tensordot

Consider a set of p matrices, each with shape (n, n), and a set of p vectors, each with shape (n, 1).

Compute the sum of their p matrix products using a single call to np.tensordot. The result has shape (n, 1).

Start with the following code as template:

p, n = 10, 20
M = np.ones((p,n,n))
V = np.ones((p,n,1))

S = np.tensordot(M, V, ...)

print(S)

Example run:

X = np.array([[[1, 1],
              [1, 1]],

             [[2, 2],
              [2, 2]]])

X.shape is (2, 2, 2), i.e. we have two arrays of (2, 2).

Y = np.array([[[1],
               [1]],

              [[2],
               [2]]])

Y.shape is (2, 2, 1), i.e. we have two vectors of (2, 1).

out = np.tensordot(X, Y, ...)

The output is:

array([[10],
       [10]])

Additional note: Since v1.8, numpy implements dot as a generalised ufunc, which means that it can be applied to stacks of matrices.

Q2

Attempt the Fortran-ordering quiz.

In [1]:
%reload_ext load_style
%load_style css/exercises.css