01-numpy.md 24.9 KB
Newer Older
1
---
Greg Wilson's avatar
Greg Wilson committed
2
3
4
5
layout: page
title: Programming with Python
subtitle: Analyzing Patient Data
minutes: 30
6
---
Greg Wilson's avatar
Greg Wilson committed
7
8
9
> ## Learning Objectives {.objectives}
>
> *   Explain what a library is, and what libraries are used for.
10
> *   Import a Python library and use the things it contains.
Greg Wilson's avatar
Greg Wilson committed
11
12
13
14
15
> *   Read tabular data from a file into a program.
> *   Assign values to variables.
> *   Select individual values and subsections from data.
> *   Perform operations on arrays of data.
> *   Display simple graphs.
16

17
Words are useful,
18
but what's more useful are the sentences and stories we build with them.
19
Similarly,
jstapleton's avatar
jstapleton committed
20
while a lot of powerful, general tools are built into languages like Python,
21
specialized tools built up from these basic units live in [libraries](reference.html#library)
jstapleton's avatar
jstapleton committed
22
that can be called upon when needed.
23
24

In order to load our inflammation data,
Trevor Bekolay's avatar
Trevor Bekolay committed
25
26
we need to access ([import](reference.html#import) in Python terminology)
a library called [NumPy](http://docs.scipy.org/doc/numpy/ "NumPy Documentation").
Greg Wilson's avatar
Greg Wilson committed
27
In general you should use this library if you want to do fancy things with numbers,
28
especially if you have matrices or arrays.
Trevor Bekolay's avatar
Trevor Bekolay committed
29
We can import NumPy using:
30

Greg Wilson's avatar
Greg Wilson committed
31
32
33
~~~ {.python}
import numpy
~~~
34

35
Importing a library is like getting a piece of lab equipment out of a storage locker
36
and setting it up on the bench. Libraries provide additional functionality to the basic Python package, much like a new piece of equipment adds functionality to a lab space.
37
Once you've imported the library,
38
we can ask the library to read our data file for us:
39

Greg Wilson's avatar
Greg Wilson committed
40
~~~ {.python}
Peter Cock's avatar
Peter Cock committed
41
numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')
Greg Wilson's avatar
Greg Wilson committed
42
43
44
~~~
~~~ {.output}
array([[ 0.,  0.,  1., ...,  3.,  0.,  0.],
45
46
       [ 0.,  1.,  2., ...,  1.,  0.,  1.],
       [ 0.,  1.,  1., ...,  2.,  1.,  1.],
47
       ...,
48
49
       [ 0.,  1.,  1., ...,  1.,  1.,  1.],
       [ 0.,  0.,  0., ...,  0.,  2.,  0.],
Greg Wilson's avatar
Greg Wilson committed
50
51
       [ 0.,  0.,  1., ...,  1.,  1.,  0.]])
~~~
52

Greg Wilson's avatar
Greg Wilson committed
53
The expression `numpy.loadtxt(...)` is a [function call](reference.html#function-call)
54
that asks Python to run the [function](reference.html#function) `loadtxt` which belongs to the `numpy` library.
Greg Wilson's avatar
Greg Wilson committed
55
This [dotted notation](reference.html#dotted-notation) is used everywhere in Python
56
to refer to the parts of things as `thing.component`.
57

Greg Wilson's avatar
Greg Wilson committed
58
`numpy.loadtxt` has two [parameters](reference.html#parameter):
59
the name of the file we want to read,
Greg Wilson's avatar
Greg Wilson committed
60
61
and the [delimiter](reference.html#delimiter) that separates values on a line.
These both need to be character strings (or [strings](reference.html#string) for short),
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
so we put them in quotes.

When we are finished typing and press Shift+Enter,
the notebook runs our command.
Since we haven't told it to do anything else with the function's output,
the notebook displays it.
In this case,
that output is the data we just loaded.
By default,
only a few rows and columns are shown
(with `...` to omit elements when displaying big arrays).
To save space,
Python displays numbers as `1.` instead of `1.0`
when there's nothing interesting after the decimal point.

Our call to `numpy.loadtxt` read our file,
but didn't save the data in memory.
To do that,
Greg Wilson's avatar
Greg Wilson committed
80
we need to [assign](reference.html#assignment) the array to a [variable](reference.html#variable).
81
A variable is just a name for a value,
Greg Wilson's avatar
Greg Wilson committed
82
such as `x`, `current_temperature`, or `subject_id`.
83
Python's variables must begin with a letter and are [case sensitive](reference.html#case-sensitive).
Kyler Brown's avatar
Kyler Brown committed
84
We can create a new variable by assigning a value to it using `=`.
Greg Wilson's avatar
Greg Wilson committed
85
86
87
88
As an illustration,
let's step back and instead of considering a table of data,
consider the simplest "collection" of data,
a single value.
89
The line below assigns the value `55` to a variable `weight_kg`:
Greg Wilson's avatar
Greg Wilson committed
90
91
92
93

~~~ {.python}
weight_kg = 55
~~~
94

95
Once a variable has a value, we can print it to the screen:
96

Greg Wilson's avatar
Greg Wilson committed
97
~~~ {.python}
98
print(weight_kg)
Greg Wilson's avatar
Greg Wilson committed
99
100
101
102
~~~
~~~ {.output}
55
~~~
103

104
and do arithmetic with it:
105

Greg Wilson's avatar
Greg Wilson committed
106
~~~ {.python}
107
print('weight in pounds:', 2.2 * weight_kg)
Greg Wilson's avatar
Greg Wilson committed
108
109
110
111
~~~
~~~ {.output}
weight in pounds: 121.0
~~~
112

jstapleton's avatar
jstapleton committed
113
114
115
As the example above shows,
we can print several things at once by separating them with commas.

116
We can also change a variable's value by assigning it a new one:
117

Greg Wilson's avatar
Greg Wilson committed
118
119
~~~ {.python}
weight_kg = 57.5
120
print('weight in kilograms is now:', weight_kg)
Greg Wilson's avatar
Greg Wilson committed
121
122
123
124
~~~
~~~ {.output}
weight in kilograms is now: 57.5
~~~
125

126
127
If we imagine the variable as a sticky note with a name written on it,
assignment is like putting the sticky note on a particular value:
128

129
![Variables as Sticky Notes](fig/python-sticky-note-variables-01.svg)
130

131
132
133
This means that assigning a value to one variable does *not* change the values of other variables.
For example,
let's store the subject's weight in pounds in a variable:
134

Greg Wilson's avatar
Greg Wilson committed
135
136
~~~ {.python}
weight_lb = 2.2 * weight_kg
137
print('weight in kilograms:', weight_kg, 'and in pounds:', weight_lb)
Greg Wilson's avatar
Greg Wilson committed
138
139
140
141
~~~
~~~ {.output}
weight in kilograms: 57.5 and in pounds: 126.5
~~~
142

143
![Creating Another Variable](fig/python-sticky-note-variables-02.svg)
144

145
and then change `weight_kg`:
146

Greg Wilson's avatar
Greg Wilson committed
147
148
~~~ {.python}
weight_kg = 100.0
149
print('weight in kilograms is now:', weight_kg, 'and weight in pounds is still:', weight_lb)
Greg Wilson's avatar
Greg Wilson committed
150
151
152
153
~~~
~~~ {.output}
weight in kilograms is now: 100.0 and weight in pounds is still: 126.5
~~~
154

155
![Updating a Variable](fig/python-sticky-note-variables-03.svg)
156

157
158
159
160
Since `weight_lb` doesn't "remember" where its value came from,
it isn't automatically updated when `weight_kg` changes.
This is different from the way spreadsheets work.

Benjamin Laken's avatar
Benjamin Laken committed
161
162
> ## Who's who in the memory {.callout}
>
163
164
165
> You can use the `%whos` command at any time to see what
> variables you have created and what modules you have loaded into the computer's memory.
> As this is an IPython command, it will only work if you are in an IPython terminal or the Jupyter Notebook.
Benjamin Laken's avatar
Benjamin Laken committed
166
>
167
168
169
170
171
172
173
174
175
176
> ~~~ {.python}
> %whos
> ~~~
> ~~~ {.output}
> Variable    Type       Data/Info
> --------------------------------
> numpy       module     <module 'numpy' from '/Us<...>kages/numpy/__init__.py'>
> weight_kg   float      100.0
> weight_lb   float      126.5
> ~~~
Benjamin Laken's avatar
Benjamin Laken committed
177

Johnny Lin's avatar
Johnny Lin committed
178
179
Just as we can assign a single value to a variable, we can also assign an array of values
to a variable using the same syntax.  Let's re-run `numpy.loadtxt` and save its result:
180

Greg Wilson's avatar
Greg Wilson committed
181
182
183
~~~ {.python}
data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')
~~~
184

185
186
187
This statement doesn't produce any output because assignment doesn't display anything.
If we want to check that our data has been loaded,
we can print the variable's value:
188

Greg Wilson's avatar
Greg Wilson committed
189
~~~ {.python}
190
print(data)
Greg Wilson's avatar
Greg Wilson committed
191
192
193
~~~
~~~ {.output}
[[ 0.  0.  1. ...,  3.  0.  0.]
194
195
 [ 0.  1.  2. ...,  1.  0.  1.]
 [ 0.  1.  1. ...,  2.  1.  1.]
196
 ...,
197
198
199
 [ 0.  1.  1. ...,  1.  1.  1.]
 [ 0.  0.  0. ...,  0.  2.  0.]
 [ 0.  0.  1. ...,  1.  1.  0.]]
Greg Wilson's avatar
Greg Wilson committed
200
~~~
201

202
203
204
Now that our data is in memory,
we can start doing things with it.
First,
Greg Wilson's avatar
Greg Wilson committed
205
let's ask what [type](reference.html#type) of thing `data` refers to:
206

Greg Wilson's avatar
Greg Wilson committed
207
~~~ {.python}
208
print(type(data))
Greg Wilson's avatar
Greg Wilson committed
209
210
~~~
~~~ {.output}
211
<class 'numpy.ndarray'>
Greg Wilson's avatar
Greg Wilson committed
212
~~~
213

214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
The output tells us that `data` currently refers to
an N-dimensional array created by the NumPy library.
These data correspond to arthritis patients' inflammation.
The rows are the individual patients and the columns
are their daily inflammation measurements.

> ## Data Type {.callout}
>
> A Numpy array contains one or more elements
> of the same type. `type` will only tell you that
> a variable is a NumPy array.
> We can also find the out the type
> of the data contained in the NumPy array.
>
> ~~~ {.python}
> print(data.dtype)
> ~~~
> ~~~ {.output}
> dtype('float64')
> ~~~
>
> This tells us that the NumPy array's elements are
> [floating-point numbers](reference.html#floating-point number).

jstapleton's avatar
jstapleton committed
238
We can see what the array's [shape](reference.html#shape) is like this:
239

Greg Wilson's avatar
Greg Wilson committed
240
~~~ {.python}
241
print(data.shape)
Greg Wilson's avatar
Greg Wilson committed
242
243
244
245
~~~
~~~ {.output}
(60, 40)
~~~
246

247
248
This tells us that `data` has 60 rows and 40 columns. When we created the
variable `data` to store our arthritis data, we didn't just create the array, we also
Trevor Bekolay's avatar
Trevor Bekolay committed
249
created information about the array, called [members](reference.html#member) or
250
251
attributes. This extra information describes `data` in
the same way an adjective describes a noun.
Bartosz T's avatar
Bartosz T committed
252
`data.shape` is an attribute  of `data` which describes the dimensions of `data`.
253
We use the same dotted notation for the attributes of variables
254
255
that we use for the functions in libraries
because they have the same part-and-whole relationship.
256

257
If we want to get a single number from the array,
Greg Wilson's avatar
Greg Wilson committed
258
we must provide an [index](reference.html#index) in square brackets,
259
just as we do in math:
260

Greg Wilson's avatar
Greg Wilson committed
261
~~~ {.python}
262
print('first value in data:', data[0, 0])
Greg Wilson's avatar
Greg Wilson committed
263
264
265
266
~~~
~~~ {.output}
first value in data: 0.0
~~~
267

Greg Wilson's avatar
Greg Wilson committed
268
~~~ {.python}
269
print('middle value in data:', data[30, 20])
Greg Wilson's avatar
Greg Wilson committed
270
271
272
273
~~~
~~~ {.output}
middle value in data: 13.0
~~~
274

275
276
277
278
279
The expression `data[30, 20]` may not surprise you,
but `data[0, 0]` might.
Programming languages like Fortran and MATLAB start counting at 1,
because that's what human beings have done for thousands of years.
Languages in the C family (including C++, Java, Perl, and Python) count from 0
280
281
282
because that's more convenient when indices are computed rather than constant
(see [Mike Hoye's blog post](http://exple.tive.org/blarg/2013/10/22/citation-needed/)
for historical details).
283
284
285
286
287
288
289
290
As a result,
if we have an M&times;N array in Python,
its indices go from 0 to M-1 on the first axis
and 0 to N-1 on the second.
It takes a bit of getting used to,
but one way to remember the rule is that
the index is how many steps we have to take from the start to get the item we want.

Greg Wilson's avatar
Greg Wilson committed
291
> ## In the Corner {.callout}
292
293
294
295
296
297
>
> What may also surprise you is that when Python displays an array,
> it shows the element with index `[0, 0]` in the upper left corner
> rather than the lower left.
> This is consistent with the way mathematicians draw matrices,
> but different from the Cartesian coordinates.
298
> The indices are (row, column) instead of (column, row) for the same reason,
299
> which can be confusing when plotting data.
300
301
302
303
304

An index like `[30, 20]` selects a single element of an array,
but we can select whole sections as well.
For example,
we can select the first ten days (columns) of values
305
for the first four patients (rows) like this:
306

Greg Wilson's avatar
Greg Wilson committed
307
~~~ {.python}
308
print(data[0:4, 0:10])
Greg Wilson's avatar
Greg Wilson committed
309
310
311
~~~
~~~ {.output}
[[ 0.  0.  1.  3.  1.  2.  4.  7.  8.  3.]
312
313
314
 [ 0.  1.  2.  1.  2.  1.  3.  2.  2.  6.]
 [ 0.  1.  1.  3.  3.  2.  6.  2.  5.  9.]
 [ 0.  0.  2.  0.  4.  2.  2.  1.  6.  7.]]
Greg Wilson's avatar
Greg Wilson committed
315
~~~
316

Greg Wilson's avatar
Greg Wilson committed
317
The [slice](reference.html#slice) `0:4` means,
318
319
320
321
322
323
"Start at index 0 and go up to, but not including, index 4."
Again,
the up-to-but-not-including takes a bit of getting used to,
but the rule is that the difference between the upper and lower bounds is the number of values in the slice.

We don't have to start slices at 0:
324

Greg Wilson's avatar
Greg Wilson committed
325
~~~ {.python}
326
print(data[5:10, 0:10])
Greg Wilson's avatar
Greg Wilson committed
327
328
329
~~~
~~~ {.output}
[[ 0.  0.  1.  2.  2.  4.  2.  1.  6.  4.]
330
331
332
333
 [ 0.  0.  2.  2.  4.  2.  2.  5.  5.  8.]
 [ 0.  0.  1.  2.  3.  1.  2.  3.  5.  3.]
 [ 0.  0.  0.  3.  1.  5.  6.  5.  5.  8.]
 [ 0.  1.  1.  2.  1.  3.  5.  3.  5.  8.]]
Greg Wilson's avatar
Greg Wilson committed
334
~~~
335

336
337
338
339
340
341
342
343
We also don't have to include the upper and lower bound on the slice.
If we don't include the lower bound,
Python uses 0 by default;
if we don't include the upper,
the slice runs to the end of the axis,
and if we don't include either
(i.e., if we just use ':' on its own),
the slice includes everything:
344

Greg Wilson's avatar
Greg Wilson committed
345
346
~~~ {.python}
small = data[:3, 36:]
347
348
print('small is:')
print(small)
Greg Wilson's avatar
Greg Wilson committed
349
350
351
~~~
~~~ {.output}
small is:
352
353
354
[[ 2.  3.  0.  0.]
 [ 1.  1.  0.  1.]
 [ 2.  2.  1.  1.]]
Greg Wilson's avatar
Greg Wilson committed
355
~~~
356

357
Arrays also know how to perform common mathematical operations on their values.
Greg Wilson's avatar
Greg Wilson committed
358
359
360
361
362
The simplest operations with data are arithmetic:
add, subtract, multiply, and divide.
 When you do such operations on arrays,
the operation is done on each individual element of the array.
Thus:
363

Greg Wilson's avatar
Greg Wilson committed
364
365
366
~~~ {.python}
doubledata = data * 2.0
~~~
367

Greg Wilson's avatar
Greg Wilson committed
368
369
will create a new array `doubledata`
whose elements have the value of two times the value of the corresponding elements in `data`:
370

Greg Wilson's avatar
Greg Wilson committed
371
~~~ {.python}
372
373
374
375
print('original:')
print(data[:3, 36:])
print('doubledata:')
print(doubledata[:3, 36:])
Greg Wilson's avatar
Greg Wilson committed
376
377
378
~~~
~~~ {.output}
original:
379
380
381
382
383
384
385
[[ 2.  3.  0.  0.]
 [ 1.  1.  0.  1.]
 [ 2.  2.  1.  1.]]
doubledata:
[[ 4.  6.  0.  0.]
 [ 2.  2.  0.  2.]
 [ 4.  4.  2.  2.]]
Greg Wilson's avatar
Greg Wilson committed
386
~~~
387

Greg Wilson's avatar
Greg Wilson committed
388
389
If,
instead of taking an array and doing arithmetic with a single value (as above)
390
you did the arithmetic operation with another array of the same shape,
Greg Wilson's avatar
Greg Wilson committed
391
392
the operation will be done on corresponding elements of the two arrays.
Thus:
393

Greg Wilson's avatar
Greg Wilson committed
394
395
396
~~~ {.python}
tripledata = doubledata + data
~~~
397

398
399
400
will give you an array where `tripledata[0,0]` will equal `doubledata[0,0]` plus `data[0,0]`,
and so on for all other elements of the arrays.

Greg Wilson's avatar
Greg Wilson committed
401
~~~ {.python}
402
403
print('tripledata:')
print(tripledata[:3, 36:])
Greg Wilson's avatar
Greg Wilson committed
404
405
406
~~~
~~~ {.output}
tripledata:
407
408
409
[[ 6.  9.  0.  0.]
 [ 3.  3.  0.  3.]
 [ 6.  6.  3.  3.]]
Greg Wilson's avatar
Greg Wilson committed
410
~~~
411

412
Often, we want to do more than add, subtract, multiply, and divide values of data.
413
NumPy knows how to do more complex operations on arrays.
414
415
If we want to find the average inflammation for all patients on all days,
for example,
416
we can ask NumPy to compute `data`'s mean value:
417

Greg Wilson's avatar
Greg Wilson committed
418
~~~ {.python}
419
print(numpy.mean(data))
Greg Wilson's avatar
Greg Wilson committed
420
421
422
423
~~~
~~~ {.output}
6.14875
~~~
424

425
426
427
428
429
430
431
432
433
`mean` is a [function](reference.html#function) that takes
an array as an [argument](reference.html#argument).
If variables are nouns, functions are verbs:
they do things with variables.

> ## Not all functions have input {.callout}
>
> Generally, a function uses inputs to produce outputs.
> However, some functions produce outputs without
434
435
> needing any input. For example, checking the current time
> doesn't require any input.
436
437
>
> ~~~ {.python}
438
439
> import time
> print(time.ctime())
440
441
> ~~~
> ~~~ {.output}
442
> 'Sat Mar 26 13:07:33 2016'
443
444
445
446
447
448
449
450
> ~~~
>
> For functions that don't take in any arguments,
> we still need parentheses (`()`)
> to tell Python to go and do something for us.

NumPy has lots of useful functions that take an array as input.
Let's use three of those functions to get some descriptive values about the dataset.
451
452
We'll also use multiple assignment,
a convenient Python feature that will enable us to do this all in one line.
453

Greg Wilson's avatar
Greg Wilson committed
454
~~~ {.python}
455
maxval, minval, stdval = numpy.max(data), numpy.min(data), numpy.std(data)
456

Alistair Walsh's avatar
Alistair Walsh committed
457
458
459
print('maximum inflammation:', maxval)
print('minimum inflammation:', minval)
print('standard deviation:', stdval)
Greg Wilson's avatar
Greg Wilson committed
460
461
462
~~~
~~~ {.output}
maximum inflammation: 20.0
463
464
minimum inflammation: 0.0
standard deviation: 4.61383319712
Greg Wilson's avatar
Greg Wilson committed
465
~~~
466

467
> ## Mystery functions in IPython {.callout}
468
>
469
470
471
472
473
474
475
476
477
> How did we know what functions NumPy has and how to use them?
> If you are working in the IPython/Jupyter Notebook there is an easy way to find out.
> If you type the name of something with a full-stop then you can use tab completion
> (e.g. type `numpy.` and then press tab)
> to see a list of all functions and attributes that you can use. After selecting one you
> can also add a question mark (e.g. `numpy.cumprod?`) and IPython will return an
> explanation of the method! This is the same as doing `help(numpy.cumprod)`.

When analyzing data, though,
478
479
480
we often want to look at partial statistics,
such as the maximum value per patient
or the average value per day.
481
One way to do this is to create a new temporary array of the data we want,
482
then ask it to do the calculation:
483

Greg Wilson's avatar
Greg Wilson committed
484
485
~~~ {.python}
patient_0 = data[0, :] # 0 on the first axis, everything on the second
486
print('maximum inflammation for patient 0:', patient_0.max())
Greg Wilson's avatar
Greg Wilson committed
487
488
489
490
~~~
~~~ {.output}
maximum inflammation for patient 0: 18.0
~~~
491

492
493
494
Everything in a line of code following the '#' symbol is a
[comment](reference.html#comment) that is ignored by the computer.
Comments allow programmers to leave explanatory notes for other
jstapleton's avatar
jstapleton committed
495
496
programmers or their future selves.

497
We don't actually need to store the row in a variable of its own.
498
Instead, we can combine the selection and the function call:
499

Greg Wilson's avatar
Greg Wilson committed
500
~~~ {.python}
501
print('maximum inflammation for patient 2:', numpy.max(data[2, :]))
Greg Wilson's avatar
Greg Wilson committed
502
503
504
505
~~~
~~~ {.output}
maximum inflammation for patient 2: 19.0
~~~
506

507
508
509
510
What if we need the maximum inflammation for *all* patients (as in the
next diagram on the left), or the average for each day (as in the
diagram on the right)? As the diagram below shows, we want to perform the
operation across an axis:
511

512
![Operations Across Axes](fig/python-operations-across-axes.png)
513

514
To support this,
515
most array functions allow us to specify the axis we want to work on.
516
If we ask for the average across axis 0 (rows in our 2D example),
517
we get:
518

Greg Wilson's avatar
Greg Wilson committed
519
~~~ {.python}
520
print(numpy.mean(data, axis=0))
Greg Wilson's avatar
Greg Wilson committed
521
522
523
~~~
~~~ {.output}
[  0.           0.45         1.11666667   1.75         2.43333333   3.15
524
525
526
527
528
529
530
   3.8          3.88333333   5.23333333   5.51666667   5.95         5.9
   8.35         7.73333333   8.36666667   9.5          9.58333333
  10.63333333  11.56666667  12.35        13.25        11.96666667
  11.03333333  10.16666667  10.           8.66666667   9.15         7.25
   7.33333333   6.58333333   6.06666667   5.95         5.11666667   3.6
   3.3          3.56666667   2.48333333   1.5          1.13333333
   0.56666667]
Greg Wilson's avatar
Greg Wilson committed
531
~~~
532

533
534
As a quick check,
we can ask this array what its shape is:
535

Greg Wilson's avatar
Greg Wilson committed
536
~~~ {.python}
537
print(numpy.mean(data, axis=0).shape)
Greg Wilson's avatar
Greg Wilson committed
538
539
540
541
~~~
~~~ {.output}
(40,)
~~~
542

543
544
The expression `(40,)` tells us we have an N&times;1 vector,
so this is the average inflammation per day for all patients.
545
If we average across axis 1 (columns in our 2D example), we get:
546

Greg Wilson's avatar
Greg Wilson committed
547
~~~ {.python}
548
print(numpy.mean(data, axis=1))
Greg Wilson's avatar
Greg Wilson committed
549
550
551
~~~
~~~ {.output}
[ 5.45   5.425  6.1    5.9    5.55   6.225  5.975  6.65   6.625  6.525
552
553
554
555
556
  6.775  5.8    6.225  5.75   5.225  6.3    6.55   5.7    5.85   6.55
  5.775  5.825  6.175  6.1    5.8    6.425  6.05   6.025  6.175  6.55
  6.175  6.35   6.725  6.125  7.075  5.725  5.925  6.15   6.075  5.75
  5.975  5.725  6.3    5.9    6.75   5.925  7.225  6.15   5.95   6.275  5.7
  6.1    6.825  5.975  6.725  5.7    6.25   6.4    7.05   5.9  ]
Greg Wilson's avatar
Greg Wilson committed
557
~~~
558

559
560
561
562
563
564
which is the average inflammation per patient across all days.

The mathematician Richard Hamming once said,
"The purpose of computing is insight, not numbers,"
and the best way to develop insight is often to visualize data.
Visualization deserves an entire lecture (or course) of its own,
565
but we can explore a few features of Python's `matplotlib` library here.
Greg Wilson's avatar
Greg Wilson committed
566
567
While there is no "official" plotting library,
this package is the de facto standard.
568
569
570
First,
we will import the `pyplot` module from `matplotlib`
and use two of its functions to create and display a heat map of our data:
571

Greg Wilson's avatar
Greg Wilson committed
572
~~~ {.python}
573
574
import matplotlib.pyplot
image  = matplotlib.pyplot.imshow(data)
575
matplotlib.pyplot.show()
Greg Wilson's avatar
Greg Wilson committed
576
~~~
577

578
![Heatmap of the Data](fig/01-numpy_71_0.png)
579

580
581
582
Blue regions in this heat map are low values, while red shows high values.
As we can see,
inflammation rises and falls over a 40-day period.
583
584
585
586
587
588

> ## Some IPython magic {.callout}
>
> If you're using an IPython / Jupyter notebook,
> you'll need to execute the following command
> in order for your matplotlib images to appear
Damien Irving's avatar
Damien Irving committed
589
> in the notebook when `show()` is called:
Damien Irving's avatar
Damien Irving committed
590
591
592
593
>
> ~~~ {.python}
> % matplotlib inline
> ~~~
594
>
595
596
> The `%` indicates an IPython magic function -
> a function that is only valid within the notebook environment.
597
598
> Note that you only have to execute this function once per notebook.

599
Let's take a look at the average inflammation over time:
600

Greg Wilson's avatar
Greg Wilson committed
601
~~~ {.python}
602
ave_inflammation = numpy.mean(data, axis=0)
603
ave_plot = matplotlib.pyplot.plot(ave_inflammation)
604
matplotlib.pyplot.show()
Greg Wilson's avatar
Greg Wilson committed
605
~~~
606

607
![Average Inflammation Over Time](fig/01-numpy_73_0.png)
608

609
610
Here,
we have put the average per day across all patients in the variable `ave_inflammation`,
611
then asked `matplotlib.pyplot` to create and display a line graph of those values.
612
613
614
615
616
The result is roughly a linear rise and fall,
which is suspicious:
based on other studies,
we expect a sharper rise and slower fall.
Let's have a look at two other statistics:
617

Greg Wilson's avatar
Greg Wilson committed
618
~~~ {.python}
619
max_plot = matplotlib.pyplot.plot(numpy.max(data, axis=0))
620
matplotlib.pyplot.show()
Greg Wilson's avatar
Greg Wilson committed
621
~~~
622

623
![Maximum Value Along The First Axis](fig/01-numpy_75_1.png)
Greg Wilson's avatar
Greg Wilson committed
624
625

~~~ {.python}
626
min_plot = matplotlib.pyplot.plot(numpy.min(data, axis=0))
627
matplotlib.pyplot.show()
Greg Wilson's avatar
Greg Wilson committed
628
~~~
629

630
![Minimum Value Along The First Axis](fig/01-numpy_75_3.png)
631

632
633
634
635
636
The maximum value rises and falls perfectly smoothly,
while the minimum seems to be a step function.
Neither result seems particularly likely,
so either there's a mistake in our calculations
or something is wrong with our data.
637
This insight would have been difficult to reach by
jstapleton's avatar
jstapleton committed
638
examining the data without visualization tools.
639

640
You can group similar plots in a single figure using subplots.
641
This script below uses a number of new commands. The function `matplotlib.pyplot.figure()`
642
creates a space into which we will place all of our plots. The parameter `figsize`
643
tells Python how big to make this space. Each subplot is placed into the figure using
644
its `add_subplot` [method](reference.html#method). The `add_subplot` method takes 3 parameters. The first denotes
645
how many total rows of subplots there are, the second parameter refers to the
valiseverywhere's avatar
valiseverywhere committed
646
total number of subplot columns, and the final parameter denotes which subplot
647
648
649
your variable is referencing (left-to-right, top-to-bottom). Each subplot is stored in a
different variable (`axes1`, `axes2`, `axes3`). Once a subplot is created, the axes can
be titled using the `set_xlabel()` command (or `set_ylabel()`).
650
Here are our three plots side by side:
651

Greg Wilson's avatar
Greg Wilson committed
652
~~~ {.python}
653
654
import numpy
import matplotlib.pyplot
655

656
data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')
657

658
fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))
659

660
661
662
axes1 = fig.add_subplot(1, 3, 1)
axes2 = fig.add_subplot(1, 3, 2)
axes3 = fig.add_subplot(1, 3, 3)
663

664
axes1.set_ylabel('average')
665
axes1.plot(numpy.mean(data, axis=0))
666

667
axes2.set_ylabel('max')
668
axes2.plot(numpy.max(data, axis=0))
669

670
axes3.set_ylabel('min')
671
axes3.plot(numpy.min(data, axis=0))
672
673

fig.tight_layout()
674

675
matplotlib.pyplot.show()
Greg Wilson's avatar
Greg Wilson committed
676
~~~
677

678
![The Previous Plots as Subplots](fig/01-numpy_80_0.png)
679

680
The [call](reference.html#function-call) to `loadtxt` reads our data,
681
682
and the rest of the program tells the plotting library
how large we want the figure to be,
683
that we're creating three subplots,
684
685
686
what to draw for each one,
and that we want a tight layout.
(Perversely,
687
if we leave out that call to `fig.tight_layout()`,
688
the graphs will actually be squeezed together more closely.)
689

690
691
692
693
694
695
696
697
698
> ## Scientists dislike typing {.callout}
>
> We will always use the syntax `import numpy` to import NumPy.
> However, in order to save typing, it is
> [often suggested](http://www.scipy.org/getting-started.html#an-example-script)
> to make a shortcut like so: `import numpy as np`.
> If you ever see Python code online using a NumPy function with `np`
> (for example, `np.loadtxt(...)`), it's because they've used this shortcut.

mboisson's avatar
mboisson committed
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
> ## Check your understanding {.challenge}
>
> Draw diagrams showing what variables refer to what values after each statement in the following program:
>
> ~~~ {.python}
> mass = 47.5
> age = 122
> mass = mass * 2.0
> age = age - 20
> ~~~

> ## Sorting out references {.challenge}
>
> What does the following program print out?
>
> ~~~ {.python}
> first, second = 'Grace', 'Hopper'
> third, fourth = second, first
717
> print(third, fourth)
mboisson's avatar
mboisson committed
718
719
720
721
722
723
724
725
726
> ~~~

> ## Slicing strings {.challenge}
>
> A section of an array is called a [slice](reference.html#slice).
> We can take slices of character strings as well:
>
> ~~~ {.python}
> element = 'oxygen'
727
728
> print('first three characters:', element[0:3])
> print('last three characters:', element[3:6])
mboisson's avatar
mboisson committed
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
> ~~~
>
> ~~~ {.output}
> first three characters: oxy
> last three characters: gen
> ~~~
>
> What is the value of `element[:4]`?
> What about `element[4:]`?
> Or `element[:]`?
>
> What is `element[-1]`?
> What is `element[-2]`?
> Given those answers,
> explain what `element[1:-1]` does.

> ## Thin slices {.challenge}
>
> The expression `element[3:3]` produces an [empty string](reference.html#empty-string),
> i.e., a string that contains no characters.
> If `data` holds our array of patient data,
> what does `data[3:3, 4:4]` produce?
> What about `data[3:3, :]`?
Greg Wilson's avatar
Greg Wilson committed
752

Pauline's avatar
Pauline committed
753
> ## Check your understanding: plot scaling {.challenge}
Greg Wilson's avatar
Greg Wilson committed
754
>
755
756
> Why do all of our plots stop just short of the upper end of our graph?
> If we want to change this, we can use the `set_ylim(min, max)` method of each 'axes', for example:
757
758
759
760
>
> ~~~ {.python}
> axes3.set_ylim(0,6)
> ~~~
761
>
762
> Update your plotting code to automatically set a more appropriate scale (hint: you can make use of the `max` and `min` methods to help)
Greg Wilson's avatar
Greg Wilson committed
763

Pauline's avatar
Pauline committed
764
> ## Check your understanding: drawing straight lines {.challenge}
Greg Wilson's avatar
Greg Wilson committed
765
766
>
> Why are the vertical lines in our plot of the minimum inflammation per day not perfectly vertical?
Greg Wilson's avatar
Greg Wilson committed
767

Pauline's avatar
Pauline committed
768
> ## Make your own plot {.challenge}
Greg Wilson's avatar
Greg Wilson committed
769
>
770
> Create a plot showing the standard deviation (`numpy.std`) of the inflammation data for each day across all patients.
771
772
773
774

> ## Moving plots around {.challenge}
>
> Modify the program to display the three plots on top of one another instead of side by side.
775

776
> ## Stacking arrays {.challenge}
777
>
778
779
> Arrays can be concatenated and stacked on top of one another,
> using NumPy's `vstack` and `hstack` functions for vertical and horizontal stacking, respectively.
780
>
781
782
> ~~~ {.python}
> import numpy
783
>
784
785
786
> A = numpy.array([[1,2,3], [4,5,6], [7, 8, 9]])
> print('A = ')
> print(A)
787
>
788
789
790
> B = numpy.hstack([A, A])
> print('B = ')
> print(B)
791
>
792
793
794
> C = numpy.vstack([A, A])
> print('C = ')
> print(C)
795
> ~~~
796
>
797
> ~~~ {.output}
798
> A =
799
800
801
> [[1 2 3]
>  [4 5 6]
>  [7 8 9]]
802
> B =
803
804
805
> [[1 2 3 1 2 3]
>  [4 5 6 4 5 6]
>  [7 8 9 7 8 9]]
806
> C =
807
808
809
810
811
812
813
> [[1 2 3]
>  [4 5 6]
>  [7 8 9]
>  [1 2 3]
>  [4 5 6]
>  [7 8 9]]
> ~~~
814
815
816
817
>
> Write some additional code that slices the first and last columns of `A`,
> and stacks them into a 3x2 array.
> Make sure to `print` the results to verify your solution.