---
title: Analyzing Patient Data
teaching: 30
exercises: 0
questions:
- "How can I process tabular data files in Python?"
objectives:
- "Explain what a library is and what libraries are used for."
- "Import a Python library and use the functions it contains."
- "Read tabular data from a file into a program."
- "Assign values to variables."
- "Select individual values and subsections from data."
- "Perform operations on arrays of data."
- "Plot simple graphs from data."
keypoints:
- "Import a library into a program using `import libraryname`."
- "Use the `numpy` library to work with arrays in Python."
- "Use `variable = value` to assign a value to a variable in order to record it in memory."
- "Variables are created on demand whenever a value is assigned to them."
- "Use `print(something)` to display the value of `something`."
- "The expression `array.shape` gives the shape of an array."
- "Use `array[x, y]` to select a single element from a 2D array."
- "Array indices start at 0, not 1."
- "Use `low:high` to specify a `slice` that includes the indices from `low` to `high-1`."
- "All the indexing and slicing that works on arrays also works on strings."
- "Use `# some kind of explanation` to add comments to programs."
- "Use `numpy.mean(array)`, `numpy.max(array)`, and `numpy.min(array)` to calculate simple statistics."
- "Use `numpy.mean(array, axis=0)` or `numpy.mean(array, axis=1)` to calculate statistics across the specified axis."
- "Use the `pyplot` library from `matplotlib` for creating simple visualizations."
---

In this lesson we will learn how to manipulate the inflammation dataset with Python. Before we discuss how to deal with many data points, we will show how to store a single value on the computer.

You can get output from Python by typing math into the console:

~~~
3 + 5
12 / 7
~~~
{: .language-python}

However, to do anything useful or interesting, we need to assign values to _variables_ (that is, link _objects_ to names).
The line below [assigns]({{ page.root }}/reference/#assign) the value `60` to a [variable]({{ page.root }}/reference/#variable) `weight_kg`:

~~~
weight_kg = 60
~~~
{: .language-python}

A variable is a name for a value,
such as `x_val`, `current_temperature`, or `subject_id`.
Python's variable names must begin with a letter or an underscore and are [case sensitive]({{ page.root }}/reference/#case-sensitive).
We can create a new variable by assigning a value to it using `=`.
When we are finished typing and press Shift+Enter,
the notebook runs our command.

Once a variable has a value, we can print it to the screen:

~~~
print(weight_kg)
~~~
{: .language-python}

~~~
60
~~~
{: .output}

We can also do arithmetic with it (remember, there are 2.2 pounds per kilogram):

~~~
print('weight in pounds:', 2.2 * weight_kg)
~~~
{: .language-python}

~~~
weight in pounds: 132.0
~~~
{: .output}

As the example above shows,
we can print several things at once by separating them with commas.

We can also change a variable's value by assigning it a new one:

~~~
weight_kg = 65.0
print('weight in kilograms is now:', weight_kg)
~~~
{: .language-python}

~~~
weight in kilograms is now: 65.0
~~~
{: .output}

If we imagine the variable as a sticky note with a name written on it,
assignment is like putting the sticky note on a particular value:

![Variables as Sticky Notes](../fig/python-sticky-note-variables-01.svg)

This means that assigning a value to one variable does *not* change the values of other variables.
For example,
let's store the subject's weight in pounds in a variable:

~~~
# There are 2.2 pounds per kilogram
weight_lb = 2.2 * weight_kg
print('weight in kilograms:', weight_kg, 'and in pounds:', weight_lb)
~~~
{: .language-python}

~~~
weight in kilograms: 65.0 and in pounds: 143.0
~~~
{: .output}

![Creating Another Variable](../fig/python-sticky-note-variables-02.svg)

and then change `weight_kg`:

~~~
weight_kg = 100.0
print('weight in kilograms is now:', weight_kg, 'and weight in pounds is still:', weight_lb)
~~~
{: .language-python}

~~~
weight in kilograms is now: 100.0 and weight in pounds is still: 143.0
~~~
{: .output}

![Updating a Variable](../fig/python-sticky-note-variables-03.svg)

Since `weight_lb` doesn't remember where its value came from,
it isn't automatically updated when `weight_kg` changes.
This is different from the way spreadsheets work.
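
A minimal sketch of this behaviour, reusing the variable names from above (the name `stale_weight_lb` is introduced here purely for illustration): the only way to bring `weight_lb` up to date is to run the calculation again.

```python
weight_kg = 65.0
weight_lb = 2.2 * weight_kg      # computed from the value weight_kg has right now

weight_kg = 100.0                # re-assigning weight_kg does not touch weight_lb
stale_weight_lb = weight_lb      # illustrative name: still the value computed from 65.0
print('weight in pounds before recalculating:', stale_weight_lb)

weight_lb = 2.2 * weight_kg      # re-run the calculation to update weight_lb
print('weight in pounds after recalculating:', weight_lb)
```

In a notebook, this amounts to re-running the cell that computes `weight_lb` after changing `weight_kg`.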

> ## Who's Who in Memory
>
> You can use the `%whos` command at any time to see what
> variables you have created and what modules you have loaded into the computer's memory.
> As this is an IPython command, it will only work if you are in an IPython terminal or the Jupyter Notebook.
>
> ~~~
> %whos
> ~~~
> {: .language-python}
>
> ~~~
> Variable    Type       Data/Info
> --------------------------------
> weight_kg   float      100.0
> weight_lb   float      143.0
> ~~~
> {: .output}
{: .callout}
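
Outside IPython, `%whos` is not available. As a rough sketch of a substitute, the helper below (`list_variables` is a hypothetical name invented for this example, not part of Python or IPython) reports the names and types stored in a dictionary of variables. A small hand-built dictionary stands in for a real session here.

```python
def list_variables(namespace):
    """Return (name, type name) pairs for the user-defined names in a namespace."""
    rows = []
    for name, value in sorted(namespace.items()):
        if name.startswith('_'):
            continue  # skip Python's internal names such as __name__
        rows.append((name, type(value).__name__))
    return rows


# A hand-built namespace stands in for the session's variables;
# in a script you could pass globals() instead.
example_namespace = {'weight_kg': 100.0, 'weight_lb': 143.0, '__name__': '__main__'}
for name, type_name in list_variables(example_namespace):
    print(name, type_name)
```

Passing `globals()` would also list any functions and imported modules, which is similar in spirit to what `%whos` reports.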

Words are useful,
but what's more useful are the sentences and stories we build with them.
Similarly,
while a lot of powerful, general tools are built into languages like Python,
specialized tools built up from these basic units live in [libraries]({{ page.root }}/reference/#library)
that can be called upon when needed.

In order to load our inflammation data,
we need to access ([import]({{ page.root }}/reference/#import) in Python terminology)
a library called [NumPy](http://docs.scipy.org/doc/numpy/ "NumPy Documentation").
In general you should use this library if you want to do fancy things with numbers,
especially if you have matrices or arrays.
We can import NumPy using:

~~~
import numpy
~~~
{: .language-python}

Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench.
Libraries provide additional functionality to the basic Python package,
much like a new piece of equipment adds functionality to a lab space. Just like in the lab, importing too many libraries
can sometimes complicate and slow down your programs, so we only import what we need for each program.
Once we've imported the library,
we can ask the library to read our data file for us:

~~~
numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')
~~~
{: .language-python}

~~~
array([[ 0.,  0.,  1., ...,  3.,  0.,  0.],
       [ 0.,  1.,  2., ...,  1.,  0.,  1.],
       [ 0.,  1.,  1., ...,  2.,  1.,  1.],
       ...,
       [ 0.,  1.,  1., ...,  1.,  1.,  1.],
       [ 0.,  0.,  0., ...,  0.,  2.,  0.],
       [ 0.,  0.,  1., ...,  1.,  1.,  0.]])
~~~
{: .output}

The expression `numpy.loadtxt(...)` is a [function call]({{ page.root }}/reference/#function-call)
that asks Python to run the [function]({{ page.root }}/reference/#function) `loadtxt` which belongs to the `numpy` library.
This [dotted notation]({{ page.root }}/reference/#dotted-notation) is used everywhere in Python:
the thing that appears before the dot contains the thing that appears after.

As an example, John Smith is the John that belongs to the Smith family.
We could use the dot notation to write his name `smith.john`,
just as `loadtxt` is a function that belongs to the `numpy` library.
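
The same notation works for any library. As a small illustration, here it is with Python's built-in `math` module, which this lesson does not otherwise use:

```python
import math

# The module name comes before the dot, the thing it contains comes after it.
print('square root of 16:', math.sqrt(16))   # a function that belongs to math
print('value of pi:', math.pi)               # a value that belongs to math
```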

In this call we give `numpy.loadtxt` two [parameters]({{ page.root }}/reference/#parameter):
the name of the file we want to read
and the [delimiter]({{ page.root }}/reference/#delimiter) that separates values on a line.
These both need to be character strings (or [strings]({{ page.root }}/reference/#string) for short),
so we put them in quotes.
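
To try these two parameters without the data file, the sketch below feeds `numpy.loadtxt` a few made-up lines of comma-separated text through `io.StringIO`, which behaves like an open file. The names `csv_text` and `example` are invented for this example.

```python
import io
import numpy

# Three made-up lines of comma-separated values standing in for a data file.
csv_text = "0,1,2,3\n4,5,6,7\n8,9,10,11\n"

# loadtxt accepts an open file-like object as well as a file name.
example = numpy.loadtxt(fname=io.StringIO(csv_text), delimiter=',')
print(example)
print(example.shape)
```

Changing `delimiter` to a character that does not match the text (for instance `';'`) should make the call fail, which is a quick way to see why the delimiter matters.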

Since we haven't told it to do anything else with the function's output,
the notebook displays it.
In this case,
that output is the data we just loaded.
By default,
only a few rows and columns are shown
(with `...` to omit elements when displaying big arrays).
To save space,
NumPy displays numbers as `1.` instead of `1.0`
when there's nothing interesting after the decimal point.

Our call to `numpy.loadtxt` read our file
but didn't save the data in memory.
To do that,
we need to assign the array to a variable. Just as we can assign a single value to a variable, we can also assign an array of values
to a variable using the same syntax. Let's re-run `numpy.loadtxt` and save the returned data:

~~~
data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')
~~~
{: .language-python}

This statement doesn't produce any output because we've assigned the output to the variable `data`.
If we want to check that the data have been loaded,
we can print the variable's value:

~~~
print(data)
~~~
{: .language-python}

~~~
[[ 0.  0.  1. ...,  3.  0.  0.]
 [ 0.  1.  2. ...,  1.  0.  1.]
 [ 0.  1.  1. ...,  2.  1.  1.]
 ...,
 [ 0.  1.  1. ...,  1.  1.  1.]
 [ 0.  0.  0. ...,  0.  2.  0.]
 [ 0.  0.  1. ...,  1.  1.  0.]]
~~~
{: .output}

Now that the data are in memory,
we can manipulate them.
First,
let's ask what [type]({{ page.root }}/reference/#type) of thing `data` refers to:

~~~
print(type(data))
~~~
{: .language-python}

~~~
<class 'numpy.ndarray'>
~~~
{: .output}

The output tells us that `data` currently refers to
an N-dimensional array, the functionality for which is provided by the NumPy library.
These data correspond to arthritis patients' inflammation.
The rows are the individual patients, and the columns
are their daily inflammation measurements.

> ## Data Type
>
> A NumPy array contains one or more elements
> of the same type. The `type` function will only tell you that
> a variable is a NumPy array but won't tell you the type of
> thing inside the array.
> We can find out the type
> of the data contained in the NumPy array.
>
> ~~~
> print(data.dtype)
> ~~~
> {: .language-python}
>
> ~~~
> float64
> ~~~
> {: .output}
>
> This tells us that the NumPy array's elements are
> [floating-point numbers]({{ page.root }}/reference/#floating-point-number).
{: .callout}

With the following command, we can see the array's [shape]({{ page.root }}/reference/#shape):

~~~
print(data.shape)
~~~
{: .language-python}

~~~
(60, 40)
~~~
{: .output}

The output tells us that the `data` array variable contains 60 rows and 40 columns. When we created the
variable `data` to store our arthritis data, we didn't just create the array; we also
created information about the array, called [members]({{ page.root }}/reference/#member) or
attributes. This extra information describes `data` in
the same way an adjective describes a noun.
`data.shape` is an attribute of `data` which describes the dimensions of `data`.
We use the same dotted notation for the attributes of variables
that we use for the functions in libraries
because they have the same part-and-whole relationship.
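
As a quick illustration on a small made-up array (named `example` here; it is not the inflammation data), several attributes can be read with the same dotted notation. Besides `shape` and `dtype`, this sketch also uses `ndim` and `size`, two other NumPy array attributes not covered above.

```python
import numpy

example = numpy.array([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0]])

print('shape:', example.shape)          # (rows, columns)
print('dimensions:', example.ndim)      # number of axes
print('elements:', example.size)        # total number of values
print('element type:', example.dtype)   # type of the values inside the array
```

Note that attributes are written without parentheses, because they are stored information rather than function calls.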

If we want to get a single number from the array,
we must provide an [index]({{ page.root }}/reference/#index) in square brackets after the variable name,
just as we do in math when referring to an element of a matrix. Our inflammation data has two dimensions, so we will need to use two indices to refer to one specific value:

~~~
print('first value in data:', data[0, 0])
~~~
{: .language-python}

~~~
first value in data: 0.0
~~~
{: .output}

~~~
print('middle value in data:', data[30, 20])
~~~
{: .language-python}

~~~
middle value in data: 13.0
~~~
{: .output}

The expression `data[30, 20]` accesses the element at row 30, column 20. While this expression may not surprise you,
`data[0, 0]` might.
Programming languages like Fortran, MATLAB and R start counting at 1
because that's what human beings have done for thousands of years.
Languages in the C family (including C++, Java, Perl, and Python) count from 0
because it represents an offset from the first value in the array (the second
value is offset by one index from the first value). This is closer to the way
that computers represent arrays (if you are interested in the historical
reasons behind counting indices from zero, you can read
[Mike Hoye's blog post](http://exple.tive.org/blarg/2013/10/22/citation-needed/)).
As a result,
if we have an M×N array in Python,
its indices go from 0 to M-1 on the first axis
and 0 to N-1 on the second.
It takes a bit of getting used to,
but one way to remember the rule is that
the index is how many steps we have to take from the start to get the item we want.
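
The rule can be checked on a small made-up array. In this sketch, `grid` is an illustrative 3×4 array (so M = 3 and N = 4), and asking for row index 3 is expected to raise an `IndexError` because the valid row indices are 0, 1 and 2.

```python
import numpy

# A made-up 3x4 array whose values are 0..11, filled row by row.
grid = numpy.arange(12).reshape(3, 4)
n_rows, n_cols = grid.shape

print('first element:', grid[0, 0])
print('last element:', grid[n_rows - 1, n_cols - 1])

# Asking for row 3 of a 3-row array goes one step too far.
try:
    grid[n_rows, 0]
except IndexError as error:
    print('IndexError:', error)
```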

![Zero Index](../fig/python-zero-index.png)

> ## In the Corner
>
> What may also surprise you is that when Python displays an array,
> it shows the element with index `[0, 0]` in the upper left corner
> rather than the lower left.
> This is consistent with the way mathematicians draw matrices
> but different from Cartesian coordinates.
> The indices are (row, column) instead of (column, row) for the same reason,
> which can be confusing when plotting data.
{: .callout}

An index like `[30, 20]` selects a single element of an array,
but we can select whole sections as well.
For example,
we can select the first ten days (columns) of values
for the first four patients (rows) like this:

~~~
print(data[0:4, 0:10])
~~~
{: .language-python}

~~~
[[ 0.  0.  1.  3.  1.  2.  4.  7.  8.  3.]
 [ 0.  1.  2.  1.  2.  1.  3.  2.  2.  6.]
 [ 0.  1.  1.  3.  3.  2.  6.  2.  5.  9.]
 [ 0.  0.  2.  0.  4.  2.  2.  1.  6.  7.]]
~~~
{: .output}

The [slice]({{ page.root }}/reference/#slice) `0:4` means,
"Start at index 0 and go up to, but not including, index 4."
Again,
the up-to-but-not-including takes a bit of getting used to,
but the rule is that the difference between the upper and lower bounds is the number of values in the slice.
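
Slicing works the same way on strings, so the rule can be tried without any data file. A minimal sketch, using an arbitrary example string and bounds chosen for illustration:

```python
element = 'oxygen'

low, high = 1, 4
piece = element[low:high]        # characters at indices 1, 2 and 3

print('slice:', piece)
print('length of slice:', len(piece))
print('high - low:', high - low)
```

The length matches `high - low` as long as both bounds fall inside the string.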

We don't have to start slices at 0:

~~~
print(data[5:10, 0:10])
~~~
{: .language-python}

~~~
[[ 0.  0.  1.  2.  2.  4.  2.  1.  6.  4.]
 [ 0.  0.  2.  2.  4.  2.  2.  5.  5.  8.]
 [ 0.  0.  1.  2.  3.  1.  2.  3.  5.  3.]
 [ 0.  0.  0.  3.  1.  5.  6.  5.  5.  8.]
 [ 0.  1.  1.  2.  1.  3.  5.  3.  5.  8.]]
~~~
{: .output}

We also don't have to include the upper and lower bound on the slice.
If we don't include the lower bound,
Python uses 0 by default;
if we don't include the upper,
the slice runs to the end of the axis,
and if we don't include either
(i.e., if we just use ':' on its own),
the slice includes everything:

~~~
small = data[:3, 36:]
print('small is:')
print(small)
~~~
{: .language-python}

The above example selects rows 0 through 2 and columns 36 through to the end of the array.

~~~
small is:
[[ 2.  3.  0.  0.]
 [ 1.  1.  0.  1.]
 [ 2.  2.  1.  1.]]
~~~
{: .output}
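
The three cases of omitted bounds can be compared side by side on a small made-up array (`grid` and the other names below are illustrative, not part of the lesson data):

```python
import numpy

grid = numpy.arange(12).reshape(3, 4)   # made-up 3x4 array, values 0..11

first_two_rows = grid[:2, :]     # lower bound omitted: start at row 0
last_two_cols = grid[:, 2:]      # upper bound omitted: run to the last column
everything = grid[:, :]          # both omitted: the whole axis

print(first_two_rows)
print(last_two_cols)
print(everything.shape)
```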

Arrays also know how to perform common mathematical operations on their values.
The simplest operations with data are arithmetic:
addition, subtraction, multiplication, and division.
When you do such operations on arrays,
the operation is done element-by-element.
Thus:

~~~
doubledata = data * 2.0
~~~
{: .language-python}

will create a new array `doubledata`,
each element of which is twice the value of the corresponding element in `data`:

~~~
print('original:')
print(data[:3, 36:])
print('doubledata:')
print(doubledata[:3, 36:])
~~~
{: .language-python}

~~~
original:
[[ 2.  3.  0.  0.]
 [ 1.  1.  0.  1.]
 [ 2.  2.  1.  1.]]
doubledata:
[[ 4.  6.  0.  0.]
 [ 2.  2.  0.  2.]
 [ 4.  4.  2.  2.]]
~~~
{: .output}

If,
instead of taking an array and doing arithmetic with a single value (as above),
we do the arithmetic operation with another array of the same shape,
the operation is done on corresponding elements of the two arrays.
Thus:

~~~
tripledata = doubledata + data
~~~
{: .language-python}

will give you an array where `tripledata[0,0]` will equal `doubledata[0,0]` plus `data[0,0]`,
and so on for all other elements of the arrays.

~~~
print('tripledata:')
print(tripledata[:3, 36:])
~~~
{: .language-python}

~~~
tripledata:
[[ 6.  9.  0.  0.]
 [ 3.  3.  0.  3.]
 [ 6.  6.  3.  3.]]
~~~
{: .output}
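
Both kinds of arithmetic, array with a single value and array with array, can be checked by hand on arrays small enough to read at a glance. The arrays `a` and `b` below are made up for this sketch.

```python
import numpy

a = numpy.array([[1.0, 2.0], [3.0, 4.0]])
b = numpy.array([[10.0, 20.0], [30.0, 40.0]])

doubled = a * 2.0      # every element of a is multiplied by 2.0
summed = a + b         # summed[i, j] is a[i, j] + b[i, j]

print(doubled)
print(summed)
print('a + a equals a * 2.0 everywhere:', numpy.array_equal(a + a, doubled))
```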

Often, we want to do more than add, subtract, multiply, and divide array elements.
NumPy knows how to do more complex operations, too.
If we want to find the average inflammation for all patients on all days,
for example,
we can ask NumPy to compute `data`'s mean value:

~~~
print(numpy.mean(data))
~~~
{: .language-python}

~~~
6.14875
~~~
{: .output}

`mean` is a [function]({{ page.root }}/reference/#function) that takes
an array as an [argument]({{ page.root }}/reference/#argument).

> ## Not All Functions Have Input
>
> Generally, a function uses inputs to produce outputs.
> However, some functions produce outputs without
> needing any input. For example, checking the current time
> doesn't require any input.
>
> ~~~
> import time
> print(time.ctime())
> ~~~
> {: .language-python}
>
> ~~~
> Sat Mar 26 13:07:33 2016
> ~~~
> {: .output}
>
> For functions that don't take in any arguments,
> we still need parentheses (`()`)
> to tell Python to go and do something for us.
{: .callout}

NumPy has lots of useful functions that take an array as input.
Let's use three of those functions to get some descriptive values about the dataset.
We'll also use multiple assignment,
a convenient Python feature that will enable us to do this all in one line.

~~~
maxval, minval, stdval = numpy.max(data), numpy.min(data), numpy.std(data)

print('maximum inflammation:', maxval)
print('minimum inflammation:', minval)
print('standard deviation:', stdval)
~~~
{: .language-python}

Here we've assigned the return value from `numpy.max(data)` to the variable `maxval`, the value from `numpy.min(data)` to `minval`, and so on.

~~~
maximum inflammation: 20.0
minimum inflammation: 0.0
standard deviation: 4.61383319712
~~~
{: .output}
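
Multiple assignment is a general Python feature rather than something specific to NumPy. A short sketch with made-up values (all names here are illustrative):

```python
readings = [3, 7, 1, 9]

# The values on the right are matched, in order, to the names on the left.
smallest, largest = min(readings), max(readings)
print('smallest:', smallest, 'largest:', largest)

# The whole right-hand side is evaluated first, so this swaps the two values.
left, right = 'L', 'R'
left, right = right, left
print('after swapping:', left, right)
```

The number of names on the left has to match the number of values on the right; otherwise Python raises a `ValueError`.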

> ## Mystery Functions in IPython
>
> How did we know what functions NumPy has and how to use them?
> If you are working in the IPython/Jupyter Notebook, there is an easy way to find out.
> If you type the name of something followed by a dot, then you can use tab completion
> (e.g. type `numpy.` and then press tab)
> to see a list of all functions and attributes that you can use. After selecting one, you
> can also add a question mark (e.g. `numpy.cumprod?`), and IPython will return an
> explanation of the method! This is the same as doing `help(numpy.cumprod)`.
{: .callout}

When analyzing data, though,
we often want to look at variations in statistical values,
such as the maximum inflammation per patient
or the average inflammation per day.
One way to do this is to create a new temporary array of the data we want,
then ask it to do the calculation:

~~~
patient_0 = data[0, :]  # 0 on the first axis (rows), everything on the second (columns)
print('maximum inflammation for patient 0:', patient_0.max())
~~~
{: .language-python}

~~~
maximum inflammation for patient 0: 18.0
~~~
{: .output}

Everything in a line of code following the '#' symbol is a
[comment]({{ page.root }}/reference/#comment) that is ignored by Python.
Comments allow programmers to leave explanatory notes for other
programmers or their future selves.

We don't actually need to store the row in a variable of its own.
Instead, we can combine the selection and the function call:

~~~
print('maximum inflammation for patient 2:', numpy.max(data[2, :]))
~~~
{: .language-python}

~~~
maximum inflammation for patient 2: 19.0
~~~
{: .output}

What if we need the maximum inflammation for each patient over all days (as in the
next diagram on the left) or the average for each day (as in the
diagram on the right)? As the diagram below shows, we want to perform the
operation across an axis:

![Operations Across Axes](../fig/python-operations-across-axes.png)

To support this functionality,
most array functions allow us to specify the axis we want to work on.
If we ask for the average across axis 0 (rows in our 2D example),
we get:

~~~
print(numpy.mean(data, axis=0))
~~~
{: .language-python}

~~~
[  0.           0.45         1.11666667   1.75         2.43333333   3.15
   3.8          3.88333333   5.23333333   5.51666667   5.95         5.9
   8.35         7.73333333   8.36666667   9.5          9.58333333
  10.63333333  11.56666667  12.35        13.25        11.96666667
  11.03333333  10.16666667  10.           8.66666667   9.15         7.25
   7.33333333   6.58333333   6.06666667   5.95         5.11666667   3.6
   3.3          3.56666667   2.48333333   1.5          1.13333333
   0.56666667]
~~~
{: .output}

As a quick check,
we can ask this array what its shape is:

~~~
print(numpy.mean(data, axis=0).shape)
~~~
{: .language-python}

~~~
(40,)
~~~
{: .output}

The expression `(40,)` tells us we have a one-dimensional array with 40 elements,
one for each column of `data`,
so this is the average inflammation per day for all patients.
If we average across axis 1 (columns in our 2D example), we get:

~~~
print(numpy.mean(data, axis=1))
~~~
{: .language-python}

~~~
[ 5.45   5.425  6.1    5.9    5.55   6.225  5.975  6.65   6.625  6.525
  6.775  5.8    6.225  5.75   5.225  6.3    6.55   5.7    5.85   6.55
  5.775  5.825  6.175  6.1    5.8    6.425  6.05   6.025  6.175  6.55
  6.175  6.35   6.725  6.125  7.075  5.725  5.925  6.15   6.075  5.75
  5.975  5.725  6.3    5.9    6.75   5.925  7.225  6.15   5.95   6.275  5.7
  6.1    6.825  5.975  6.725  5.7    6.25   6.4    7.05   5.9  ]
~~~
{: .output}

which is the average inflammation per patient across all days.
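
Which axis gives which result is easy to mix up, so it can help to try both on an array small enough to verify by hand. The sketch below uses made-up numbers for 2 patients over 3 days; the names `example`, `per_day` and `per_patient` are illustrative.

```python
import numpy

# Made-up data: 2 patients (rows) by 3 days (columns).
example = numpy.array([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0]])

per_day = numpy.mean(example, axis=0)       # collapses the rows: one value per column
per_patient = numpy.mean(example, axis=1)   # collapses the columns: one value per row

print('average per day:', per_day, 'shape:', per_day.shape)
print('average per patient:', per_patient, 'shape:', per_patient.shape)
```

One way to remember it: the axis named in the call is the one that disappears from the result's shape.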

The mathematician Richard Hamming once said,
"The purpose of computing is insight, not numbers,"
and the best way to develop insight is often to visualize data.
Visualization deserves an entire lecture of its own,
but we can explore a few features of Python's `matplotlib` library here.
While there is no official plotting library,
`matplotlib` is the de facto standard.
First,
we will import the `pyplot` module from `matplotlib`
and use two of its functions to create and display a heat map of our data:

~~~
import matplotlib.pyplot
image = matplotlib.pyplot.imshow(data)
matplotlib.pyplot.show()
~~~
{: .language-python}

![Heatmap of the Data](../fig/01-numpy_71_0.png)

Blue pixels in this heat map represent low values, while yellow pixels represent high values.
As we can see,
inflammation rises and falls over a 40-day period.

> ## Some IPython Magic
>
> If you're using an IPython / Jupyter notebook,
> you'll need to execute the following command
> in order for your matplotlib images to appear
> in the notebook when `show()` is called:
>
> ~~~
> %matplotlib inline
> ~~~
> {: .language-python}
>
> The `%` indicates an IPython magic function,
> a function that is only valid within IPython or the notebook environment.
> Note that you only have to execute this function once per notebook.
{: .callout}

Let's take a look at the average inflammation over time:

~~~
ave_inflammation = numpy.mean(data, axis=0)
ave_plot = matplotlib.pyplot.plot(ave_inflammation)
matplotlib.pyplot.show()
~~~
{: .language-python}

![Average Inflammation Over Time](../fig/01-numpy_73_0.png)

Here,
we have put the average per day across all patients in the variable `ave_inflammation`,
then asked `matplotlib.pyplot` to create and display a line graph of those values.
The result is a roughly linear rise and fall,
which is suspicious:
we might instead expect a sharper rise and slower fall.
Let's have a look at two other statistics:

~~~
max_plot = matplotlib.pyplot.plot(numpy.max(data, axis=0))
matplotlib.pyplot.show()
~~~
{: .language-python}

![Maximum Value Along The First Axis](../fig/01-numpy_75_1.png)

~~~
min_plot = matplotlib.pyplot.plot(numpy.min(data, axis=0))
matplotlib.pyplot.show()