---
title: Analyzing Patient Data
teaching: 30
exercises: 0
questions:
- "How can I process tabular data files in Python?"
objectives:
- "Explain what a library is and what libraries are used for."
- "Import a Python library and use the functions it contains."
- "Read tabular data from a file into a program."
- "Assign values to variables."
- "Select individual values and subsections from data."
- "Perform operations on arrays of data."
- "Plot simple graphs from data."
keypoints:
- "Import a library into a program using `import libraryname`."
- "Use the `numpy` library to work with arrays in Python."
- "Use `variable = value` to assign a value to a variable in order to record it in memory."
- "Variables are created on demand whenever a value is assigned to them."
- "Use `print(something)` to display the value of `something`."
- "The expression `array.shape` gives the shape of an array."
- "Use `array[x, y]` to select a single element from a 2D array."
- "Array indices start at 0, not 1."
- "Use `low:high` to specify a `slice` that includes the indices from `low` to `high-1`."
- "All the indexing and slicing that works on arrays also works on strings."
- "Use `# some kind of explanation` to add comments to programs."
- "Use `numpy.mean(array)`, `numpy.max(array)`, and `numpy.min(array)` to calculate simple statistics."
- "Use `numpy.mean(array, axis=0)` or `numpy.mean(array, axis=1)` to calculate statistics across the specified axis."
- "Use the `pyplot` library from `matplotlib` for creating simple visualizations."
---

In this lesson we will learn how to manipulate the inflammation dataset with Python. Before we
discuss how to deal with many data points, we will show how to store a single value on the computer.

You can get output from Python by typing math into the console:

~~~
3+5
12/7
~~~
{: .language-python}

However, to do anything useful and/or interesting we need to assign values to _variables_
(or link _objects_ to names/variables).
The line below [assigns]({{ page.root }}/reference/#assign) the value `60` to a
[variable]({{ page.root }}/reference/#variable) `weight_kg`:

~~~
weight_kg = 60
~~~
{: .language-python}

A variable is a name for a value,
such as `x_val`, `current_temperature`, or `subject_id`.
Python variable names must begin with a letter or an underscore and are
[case sensitive]({{ page.root }}/reference/#case-sensitive).
We can create a new variable by assigning a value to it using `=`.
When we are finished typing and press <kbd>Shift</kbd>+<kbd>Return</kbd>,
the notebook runs our command.
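
As a minimal sketch of the naming rules above (the names `Weight_kg` and `subject_id_2` and their values are made up for illustration), the following shows that names differing only in case refer to different variables:

```python
# Names are case sensitive: these are two distinct variables.
weight_kg = 60
Weight_kg = 75

# Letters, digits and underscores are allowed,
# but a name cannot start with a digit.
subject_id_2 = 'patient-002'  # hypothetical identifier, for illustration only

print(weight_kg, Weight_kg, subject_id_2)
```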

Once a variable has a value, we can print it to the screen:

~~~
print(weight_kg)
~~~
{: .language-python}

~~~
60
~~~
{: .output}

and do arithmetic with it (remember, there are 2.2 pounds per kilogram):

~~~
print('weight in pounds:', 2.2 * weight_kg)
~~~
{: .language-python}

~~~
weight in pounds: 132.0
~~~
{: .output}

As the example above shows,
we can print several things at once by separating them with commas.

We can also change a variable's value by assigning it a new one:

~~~
weight_kg = 65.0
print('weight in kilograms is now:', weight_kg)
~~~
{: .language-python}

~~~
weight in kilograms is now: 65.0
~~~
{: .output}

If we imagine the variable as a sticky note with a name written on it,
assignment is like putting the sticky note on a particular value:

![Variables as Sticky Notes](../fig/python-sticky-note-variables-01.svg)

This means that assigning a value to one variable does *not* change the values of other variables.
For example,
let's store the subject's weight in pounds in a variable:

~~~
# There are 2.2 pounds per kilogram
weight_lb = 2.2 * weight_kg
print('weight in kilograms:', weight_kg, 'and in pounds:', weight_lb)
~~~
{: .language-python}

~~~
weight in kilograms: 65.0 and in pounds: 143.0
~~~
{: .output}

![Creating Another Variable](../fig/python-sticky-note-variables-02.svg)

and then change `weight_kg`:

~~~
weight_kg = 100.0
print('weight in kilograms is now:', weight_kg, 'and weight in pounds is still:', weight_lb)
~~~
{: .language-python}

~~~
weight in kilograms is now: 100.0 and weight in pounds is still: 143.0
~~~
{: .output}

![Updating a Variable](../fig/python-sticky-note-variables-03.svg)

Since `weight_lb` doesn't remember where its value came from,
it isn't automatically updated when `weight_kg` changes.
This is different from the way spreadsheets work.
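
If we do want `weight_lb` to reflect the new weight, we have to recompute it ourselves. The sketch below illustrates this; the name `stale_lb` is introduced here only to keep the old value around for comparison:

```python
weight_kg = 65.0
weight_lb = 2.2 * weight_kg   # computed once, from the current value of weight_kg

weight_kg = 100.0             # move the name weight_kg to a new value
stale_lb = weight_lb          # weight_lb is still based on 65.0 kg

weight_lb = 2.2 * weight_kg   # recompute explicitly to pick up the change
print('before recomputing:', stale_lb, 'after recomputing:', weight_lb)
```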

> ## Who's Who in Memory
>
> You can use the `%whos` command at any time to see what
> variables you have created and what modules you have loaded into the computer's memory.
> As this is an IPython command, it will only work if you are in an IPython terminal or the
> Jupyter Notebook.
>
> ~~~
> %whos
> ~~~
> {: .language-python}
>
> ~~~
> Variable    Type       Data/Info
> --------------------------------
> weight_kg   float      100.0
> weight_lb   float      143.0
> ~~~
> {: .output}
{: .callout}

Words are useful, but what's more useful are the sentences and stories we build with them.
Similarly, while a lot of powerful, general tools are built into languages like Python,
specialized tools built up from these basic units live in
[libraries]({{ page.root }}/reference/#library)
that can be called upon when needed.

In order to load our inflammation data,
we need to access ([import]({{ page.root }}/reference/#import) in Python terminology)
a library called [NumPy](http://docs.scipy.org/doc/numpy/ "NumPy Documentation").
In general you should use this library if you want to do fancy things with numbers,
especially if you have matrices or arrays.
We can import NumPy using:

~~~
import numpy
~~~
{: .language-python}

Importing a library is like getting a piece of lab equipment out of a storage locker and setting it
up on the bench. Libraries provide additional functionality to the basic Python package, much like
a new piece of equipment adds functionality to a lab space. Just like in the lab, importing too
many libraries can sometimes complicate and slow down your programs - so we only import what we
need for each program. Once we've imported the library, we can ask the library to read our data
file for us:

~~~
numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')
~~~
{: .language-python}

~~~
array([[ 0.,  0.,  1., ...,  3.,  0.,  0.],
       [ 0.,  1.,  2., ...,  1.,  0.,  1.],
       [ 0.,  1.,  1., ...,  2.,  1.,  1.],
       ...,
       [ 0.,  1.,  1., ...,  1.,  1.,  1.],
       [ 0.,  0.,  0., ...,  0.,  2.,  0.],
       [ 0.,  0.,  1., ...,  1.,  1.,  0.]])
~~~
{: .output}

The expression `numpy.loadtxt(...)` is a [function call]({{ page.root }}/reference/#function-call)
that asks Python to run the [function]({{ page.root }}/reference/#function) `loadtxt` which
belongs to the `numpy` library. This [dotted notation]({{ page.root }}/reference/#dotted-notation)
is used everywhere in Python: the thing that appears before the dot contains the thing that
appears after.

As an example, John Smith is the John that belongs to the Smith family.
We could use the dot notation to write his name `smith.john`,
just as `loadtxt` is a function that belongs to the `numpy` library.

Here we give `numpy.loadtxt` two [parameters]({{ page.root }}/reference/#parameter): the name of the file
we want to read and the [delimiter]({{ page.root }}/reference/#delimiter) that separates values on
a line. These both need to be character strings (or [strings]({{ page.root }}/reference/#string)
for short), so we put them in quotes.

Since we haven't told it to do anything else with the function's output,
the notebook displays it.
In this case,
that output is the data we just loaded.
By default,
only a few rows and columns are shown
(with `...` to omit elements when displaying big arrays).
To save space,
NumPy displays numbers as `1.` instead of `1.0`
when there's nothing interesting after the decimal point.
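
To experiment with `numpy.loadtxt` without the inflammation file at hand, one option is to feed it text held in memory. This is only a sketch: the three rows in `csv_text` are made up, and it relies on `loadtxt` accepting a file-like object (here an `io.StringIO`) in place of a filename.

```python
import io
import numpy

# Three made-up rows standing in for a small CSV file
# (these are NOT values from the real inflammation data).
csv_text = "0,1,2\n3,4,5\n6,7,8\n"

# loadtxt accepts an open file-like object as well as a filename;
# the delimiter is still given as a string.
sample = numpy.loadtxt(fname=io.StringIO(csv_text), delimiter=',')
print(sample)
```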

Our call to `numpy.loadtxt` read our file
but didn't save the data in memory.
To do that,
we need to assign the array to a variable. Just as we can assign a single value to a variable, we
can also assign an array of values to a variable using the same syntax.  Let's re-run
`numpy.loadtxt` and save the returned data:

~~~
data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')
~~~
{: .language-python}

This statement doesn't produce any output because we've assigned the output to the variable `data`.
If we want to check that the data have been loaded,
we can print the variable's value:

~~~
print(data)
~~~
{: .language-python}

~~~
[[ 0.  0.  1. ...,  3.  0.  0.]
 [ 0.  1.  2. ...,  1.  0.  1.]
 [ 0.  1.  1. ...,  2.  1.  1.]
 ...,
 [ 0.  1.  1. ...,  1.  1.  1.]
 [ 0.  0.  0. ...,  0.  2.  0.]
 [ 0.  0.  1. ...,  1.  1.  0.]]
~~~
{: .output}

Now that the data are in memory,
we can manipulate them.
First,
let's ask what [type]({{ page.root }}/reference/#type) of thing `data` refers to:

~~~
print(type(data))
~~~
{: .language-python}

~~~
<class 'numpy.ndarray'>
~~~
{: .output}

The output tells us that `data` currently refers to
an N-dimensional array, the functionality for which is provided by the NumPy library.
These data correspond to arthritis patients' inflammation.
The rows are the individual patients, and the columns
are their daily inflammation measurements.

> ## Data Type
>
> A NumPy array contains one or more elements
> of the same type. The `type` function will only tell you that
> a variable is a NumPy array but won't tell you the type of
> thing inside the array.
> We can find out the type
> of the data contained in the NumPy array.
>
> ~~~
> print(data.dtype)
> ~~~
> {: .language-python}
>
> ~~~
> float64
> ~~~
> {: .output}
>
> This tells us that the NumPy array's elements are
> [floating-point numbers]({{ page.root }}/reference/#floating-point number).
{: .callout}

With the following command, we can see the array's [shape]({{ page.root }}/reference/#shape):

~~~
print(data.shape)
~~~
{: .language-python}

~~~
(60, 40)
~~~
{: .output}

The output tells us that the `data` array variable contains 60 rows and 40 columns. When we
created the variable `data` to store our arthritis data, we didn't just create the array; we also
created information about the array, called [members]({{ page.root }}/reference/#member) or
attributes. This extra information describes `data` in the same way an adjective describes a noun.
`data.shape` is an attribute of `data` which describes the dimensions of `data`. We use the same
dotted notation for the attributes of variables that we use for the functions in libraries because
they have the same part-and-whole relationship.
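
`shape` is one of several attributes that every NumPy array carries. As a small illustration on a made-up array (`grid` is a hypothetical name, not part of the lesson's data), the sketch below reads a few of them using the same dotted notation:

```python
import numpy

# A hypothetical 3-row, 5-column array filled with zeros.
grid = numpy.zeros((3, 5))

print(grid.shape)  # (3, 5)  -> rows, columns
print(grid.ndim)   # 2       -> number of axes
print(grid.size)   # 15      -> total number of elements
print(grid.dtype)  # float64 -> type of each element
```

Note that attributes are written without parentheses, because we are looking something up rather than asking Python to run a function.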

If we want to get a single number from the array, we must provide an
[index]({{ page.root }}/reference/#index) in square brackets after the variable name, just as we
do in math when referring to an element of a matrix.  Our inflammation data has two dimensions, so
we will need to use two indices to refer to one specific value:

~~~
print('first value in data:', data[0, 0])
~~~
{: .language-python}

~~~
first value in data: 0.0
~~~
{: .output}

~~~
print('middle value in data:', data[30, 20])
~~~
{: .language-python}

~~~
middle value in data: 13.0
~~~
{: .output}

The expression `data[30, 20]` accesses the element at row 30, column 20. While this expression may
not surprise you,
`data[0, 0]` might.
Programming languages like Fortran, MATLAB and R start counting at 1
because that's what human beings have done for thousands of years.
Languages in the C family (including C++, Java, Perl, and Python) count from 0
because it represents an offset from the first value in the array (the second
value is offset by one index from the first value). This is closer to the way
that computers represent arrays (if you are interested in the historical
reasons behind counting indices from zero, you can read
[Mike Hoye's blog post](http://exple.tive.org/blarg/2013/10/22/citation-needed/)).
As a result,
if we have an M×N array in Python,
its indices go from 0 to M-1 on the first axis
and 0 to N-1 on the second.
It takes a bit of getting used to,
but one way to remember the rule is that
the index is how many steps we have to take from the start to get the item we want.
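
The index rule can be checked on a tiny array. The sketch below uses a made-up 2×3 array called `table` (so M = 2 and N = 3); the valid row indices are 0 to 1 and the valid column indices are 0 to 2:

```python
import numpy

# A hypothetical array with 2 rows (M = 2) and 3 columns (N = 3).
table = numpy.array([[10, 11, 12],
                     [20, 21, 22]])

print(table[0, 0])  # 10: zero steps down, zero steps across
print(table[1, 2])  # 22: last row (M-1 = 1), last column (N-1 = 2)

# Asking for row index 2 goes one step past the end and raises IndexError.
try:
    table[2, 0]
except IndexError as error:
    print('out of range:', error)
```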

![Zero Index](../fig/python-zero-index.png)

> ## In the Corner
>
> What may also surprise you is that when Python displays an array,
> it shows the element with index `[0, 0]` in the upper left corner
> rather than the lower left.
> This is consistent with the way mathematicians draw matrices
> but different from Cartesian coordinates.
> The indices are (row, column) instead of (column, row) for the same reason,
> which can be confusing when plotting data.
{: .callout}

An index like `[30, 20]` selects a single element of an array,
but we can select whole sections as well.
For example,
we can select the first ten days (columns) of values
for the first four patients (rows) like this:

~~~
print(data[0:4, 0:10])
~~~
{: .language-python}

~~~
[[ 0.  0.  1.  3.  1.  2.  4.  7.  8.  3.]
 [ 0.  1.  2.  1.  2.  1.  3.  2.  2.  6.]
 [ 0.  1.  1.  3.  3.  2.  6.  2.  5.  9.]
 [ 0.  0.  2.  0.  4.  2.  2.  1.  6.  7.]]
~~~
{: .output}

The [slice]({{ page.root }}/reference/#slice) `0:4` means, "Start at index 0 and go up to, but not
including, index 4." Again, the up-to-but-not-including takes a bit of getting used to, but the
rule is that the difference between the upper and lower bounds is the number of values in the slice.
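
The "upper minus lower" rule is easy to verify on a short one-dimensional array. In this sketch, `values` and `chunk` are illustrative names only:

```python
import numpy

values = numpy.arange(10)   # the integers 0, 1, ..., 9
chunk = values[3:7]         # indices 3, 4, 5 and 6; index 7 is excluded

print(chunk)        # [3 4 5 6]
print(len(chunk))   # 4, which is 7 - 3
```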

We don't have to start slices at 0:

~~~
print(data[5:10, 0:10])
~~~
{: .language-python}

~~~
[[ 0.  0.  1.  2.  2.  4.  2.  1.  6.  4.]
 [ 0.  0.  2.  2.  4.  2.  2.  5.  5.  8.]
 [ 0.  0.  1.  2.  3.  1.  2.  3.  5.  3.]
 [ 0.  0.  0.  3.  1.  5.  6.  5.  5.  8.]
 [ 0.  1.  1.  2.  1.  3.  5.  3.  5.  8.]]
~~~
{: .output}

We also don't have to include the upper and lower bound on the slice.
If we don't include the lower bound,
Python uses 0 by default;
if we don't include the upper,
the slice runs to the end of the axis,
and if we don't include either
(i.e., if we just use ':' on its own),
the slice includes everything:

~~~
small = data[:3, 36:]
print('small is:')
print(small)
~~~
{: .language-python}

The above example selects rows 0 through 2 and columns 36 through to the end of the array.

~~~
small is:
[[ 2.  3.  0.  0.]
 [ 1.  1.  0.  1.]
 [ 2.  2.  1.  1.]]
~~~
{: .output}
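
The same defaults apply to one-dimensional arrays and, as the key points note, to strings. A brief sketch with made-up values (`values` and `element` are illustrative names):

```python
import numpy

values = numpy.arange(10)   # the integers 0, 1, ..., 9

print(values[:3])   # [0 1 2]  lower bound defaults to 0
print(values[7:])   # [7 8 9]  upper bound defaults to the end
print(values[:])    # [0 1 2 3 4 5 6 7 8 9]  everything

# Slicing works the same way on strings.
element = 'oxygen'
print(element[:3])  # oxy
print(element[3:])  # gen
```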

Arrays also know how to perform common mathematical operations on their values.
The simplest operations with data are arithmetic:
addition, subtraction, multiplication, and division.
When you do such operations on arrays,
the operation is done element-by-element.
Thus:

~~~
doubledata = data * 2.0
~~~
{: .language-python}

will create a new array `doubledata`,
each element of which is twice the value of the corresponding element in `data`:

~~~
print('original:')
print(data[:3, 36:])
print('doubledata:')
print(doubledata[:3, 36:])
~~~
{: .language-python}

~~~
original:
[[ 2.  3.  0.  0.]
 [ 1.  1.  0.  1.]
 [ 2.  2.  1.  1.]]
doubledata:
[[ 4.  6.  0.  0.]
 [ 2.  2.  0.  2.]
 [ 4.  4.  2.  2.]]
~~~
{: .output}

If,
instead of taking an array and doing arithmetic with a single value (as above),
you do the arithmetic operation with another array of the same shape,
the operation will be done on corresponding elements of the two arrays.
Thus:

~~~
tripledata = doubledata + data
~~~
{: .language-python}

will give you an array where `tripledata[0,0]` will equal `doubledata[0,0]` plus `data[0,0]`,
and so on for all other elements of the arrays.

~~~
print('tripledata:')
print(tripledata[:3, 36:])
~~~
{: .language-python}

~~~
tripledata:
[[ 6.  9.  0.  0.]
 [ 3.  3.  0.  3.]
 [ 6.  6.  3.  3.]]
~~~
{: .output}
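
Element-by-element arithmetic is easiest to follow on arrays small enough to check by hand. In the sketch below, `a` and `b` are made-up 2×2 arrays; note that `*` between two arrays multiplies matching elements and is not matrix multiplication:

```python
import numpy

a = numpy.array([[1.0, 2.0],
                 [3.0, 4.0]])
b = numpy.array([[10.0, 20.0],
                 [30.0, 40.0]])

doubled = a * 2.0   # every element times 2: 2, 4, 6, 8
summed = a + b      # matching elements added: 11, 22, 33, 44
product = a * b     # matching elements multiplied: 10, 40, 90, 160

print(doubled)
print(summed)
print(product)
```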

Often, we want to do more than add, subtract, multiply, and divide array elements.
NumPy knows how to do more complex operations, too.
If we want to find the average inflammation for all patients on all days,
for example,
we can ask NumPy to compute `data`'s mean value:

~~~
print(numpy.mean(data))
~~~
{: .language-python}

~~~
6.14875
~~~
{: .output}

`mean` is a [function]({{ page.root }}/reference/#function) that takes
an array as an [argument]({{ page.root }}/reference/#argument).

> ## Not All Functions Have Input
>
> Generally, a function uses inputs to produce outputs.
> However, some functions produce outputs without
> needing any input. For example, checking the current time
> doesn't require any input.
>
> ~~~
> import time
> print(time.ctime())
> ~~~
> {: .language-python}
>
> ~~~
> Sat Mar 26 13:07:33 2016
> ~~~
> {: .output}
>
> For functions that don't take in any arguments,
> we still need parentheses (`()`)
> to tell Python to go and do something for us.
{: .callout}

NumPy has lots of useful functions that take an array as input.
Let's use three of those functions to get some descriptive values about the dataset.
We'll also use multiple assignment,
a convenient Python feature that will enable us to do this all in one line.

~~~
maxval, minval, stdval = numpy.max(data), numpy.min(data), numpy.std(data)

print('maximum inflammation:', maxval)
print('minimum inflammation:', minval)
print('standard deviation:', stdval)
~~~
{: .language-python}

Here we've assigned the return value from `numpy.max(data)` to the variable `maxval`, the value
from `numpy.min(data)` to `minval`, and so on.

~~~
maximum inflammation: 20.0
minimum inflammation: 0.0
standard deviation: 4.61383319712
~~~
{: .output}
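
Multiple assignment pairs names and values by position: the first name gets the first value, the second name the second, and so on. A small sketch on made-up numbers (`readings` is a hypothetical array, chosen so the results are easy to verify by hand):

```python
import numpy

# Eight made-up readings with mean 5.0.
readings = numpy.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Three names on the left, three values on the right, matched in order.
maxval, minval, stdval = numpy.max(readings), numpy.min(readings), numpy.std(readings)

print('maximum:', maxval)
print('minimum:', minval)
print('standard deviation:', stdval)
```

By default `numpy.std` computes the population standard deviation (it divides by the number of values), which for these readings is exactly 2.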

> ## Mystery Functions in IPython
>
> How did we know what functions NumPy has and how to use them?
> If you are working in the IPython/Jupyter Notebook, there is an easy way to find out.
> If you type the name of something followed by a dot, then you can use tab completion
> (e.g. type `numpy.` and then press tab)
> to see a list of all functions and attributes that you can use. After selecting one, you
> can also add a question mark (e.g. `numpy.cumprod?`), and IPython will return an
> explanation of the method! This is the same as doing `help(numpy.cumprod)`.
{: .callout}

When analyzing data, though,
we often want to look at variations in statistical values,
such as the maximum inflammation per patient
or the average inflammation per day.
One way to do this is to create a new temporary array of the data we want,
then ask it to do the calculation:

~~~
patient_0 = data[0, :] # 0 on the first axis (rows), everything on the second (columns)
print('maximum inflammation for patient 0:', patient_0.max())
~~~
{: .language-python}

~~~
maximum inflammation for patient 0: 18.0
~~~
{: .output}

Everything in a line of code following the '#' symbol is a
[comment]({{ page.root }}/reference/#comment) that is ignored by Python.
Comments allow programmers to leave explanatory notes for other
programmers or their future selves.

We don't actually need to store the row in a variable of its own.
Instead, we can combine the selection and the function call:

~~~
print('maximum inflammation for patient 2:', numpy.max(data[2, :]))
~~~
{: .language-python}

~~~
maximum inflammation for patient 2: 19.0
~~~
{: .output}

What if we need the maximum inflammation for each patient over all days (as in the
next diagram on the left) or the average for each day (as in the
diagram on the right)? As the diagram below shows, we want to perform the
operation across an axis:

![Operations Across Axes](../fig/python-operations-across-axes.png)

To support this functionality,
most array functions allow us to specify the axis we want to work on.
If we ask for the average across axis 0 (rows in our 2D example),
we get:

~~~
print(numpy.mean(data, axis=0))
~~~
{: .language-python}

~~~
[  0.           0.45         1.11666667   1.75         2.43333333   3.15
   3.8          3.88333333   5.23333333   5.51666667   5.95         5.9
   8.35         7.73333333   8.36666667   9.5          9.58333333
  10.63333333  11.56666667  12.35        13.25        11.96666667
  11.03333333  10.16666667  10.           8.66666667   9.15         7.25
   7.33333333   6.58333333   6.06666667   5.95         5.11666667   3.6
   3.3          3.56666667   2.48333333   1.5          1.13333333
   0.56666667]
~~~
{: .output}

As a quick check,
we can ask this array what its shape is:

~~~
print(numpy.mean(data, axis=0).shape)
~~~
{: .language-python}

~~~
(40,)
~~~
{: .output}

The expression `(40,)` tells us we have a one-dimensional array (a vector) with 40 elements,
so this is the average inflammation per day for all patients.
If we average across axis 1 (columns in our 2D example), we get:

~~~
print(numpy.mean(data, axis=1))
~~~
{: .language-python}

~~~
[ 5.45   5.425  6.1    5.9    5.55   6.225  5.975  6.65   6.625  6.525
  6.775  5.8    6.225  5.75   5.225  6.3    6.55   5.7    5.85   6.55
  5.775  5.825  6.175  6.1    5.8    6.425  6.05   6.025  6.175  6.55
  6.175  6.35   6.725  6.125  7.075  5.725  5.925  6.15   6.075  5.75
  5.975  5.725  6.3    5.9    6.75   5.925  7.225  6.15   5.95   6.275  5.7
  6.1    6.825  5.975  6.725  5.7    6.25   6.4    7.05   5.9  ]
~~~
{: .output}

which is the average inflammation per patient across all days.
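
Which axis is which is easier to see on an array small enough to average by hand. In this sketch, `scores` is a made-up array with 2 rows and 3 columns: averaging across axis 0 collapses the rows and leaves one value per column, while averaging across axis 1 collapses the columns and leaves one value per row.

```python
import numpy

# A hypothetical array with 2 rows and 3 columns.
scores = numpy.array([[1.0, 2.0, 3.0],
                      [5.0, 6.0, 7.0]])

per_column = numpy.mean(scores, axis=0)  # averages down the rows: 3, 4, 5
per_row = numpy.mean(scores, axis=1)     # averages along the columns: 2, 6

print(per_column, per_column.shape)      # shape is (3,): one value per column
print(per_row, per_row.shape)            # shape is (2,): one value per row
```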

The mathematician Richard Hamming once said,
"The purpose of computing is insight, not numbers,"
and the best way to develop insight is often to visualize data.
Visualization deserves an entire lecture of its own,
but we can explore a few features of Python's `matplotlib` library here.
While there is no official plotting library,
`matplotlib` is the de facto standard.
First,
we will import the `pyplot` module from `matplotlib`
and use two of its functions to create and display a heat map of our data:

~~~
import matplotlib.pyplot
image = matplotlib.pyplot.imshow(data)
matplotlib.pyplot.show()
~~~
{: .language-python}

![Heatmap of the Data](../fig/01-numpy_71_0.png)

Blue pixels in this heat map represent low values, while yellow pixels represent high values.
As we can see,
inflammation rises and falls over a 40-day period.

> ## Some IPython Magic
>
> If you're using an IPython / Jupyter notebook,
> you'll need to execute the following command
> in order for your matplotlib images to appear
> in the notebook when `show()` is called:
>
> ~~~
> %matplotlib inline
> ~~~
> {: .language-python}
>
> The `%` indicates an IPython magic function -
> a function that is only valid within the notebook environment.
> Note that you only have to execute this function once per notebook.
{: .callout}

Let's take a look at the average inflammation over time:

~~~
ave_inflammation = numpy.mean(data, axis=0)
ave_plot = matplotlib.pyplot.plot(ave_inflammation)
matplotlib.pyplot.show()
~~~
{: .language-python}

![Average Inflammation Over Time](../fig/01-numpy_73_0.png)

Here,
we have put the average per day across all patients in the variable `ave_inflammation`,
then asked `matplotlib.pyplot` to create and display a line graph of those values.
The result is a roughly linear rise and fall,
which is suspicious:
we might instead expect a sharper rise and slower fall.
Let's have a look at two other statistics:

~~~
max_plot = matplotlib.pyplot.plot(numpy.max(data, axis=0))
matplotlib.pyplot.show()
~~~
{: .language-python}

![Maximum Value Along The First Axis](../fig/01-numpy_75_1.png)

~~~
min_plot = matplotlib.pyplot.plot(numpy.min(data, axis=0))
matplotlib.pyplot.show()
~~~
{: .language-python}