Add Python output for remaining chapters (#55)
* add output formatting for ch07-pandas
* remove duplicate output code in ch08
* add python output for print statements for chapter 10-aggregations
* add python output for ch11-joins
* add python output for ch12-long-and-wide
* add python output to ch14-sqlite
gboushey authored and sechilds committed May 19, 2018
1 parent dcfe235 commit 893a11c
Showing 5 changed files with 155 additions and 1 deletion.
24 changes: 24 additions & 0 deletions _episodes/09-extracting-data.md
@@ -40,6 +40,12 @@ print(df_SN7577_some_cols.columns)
~~~
{: .language-python}

~~~
(1286, 6)
Index(['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'], dtype='object')
Index(['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'], dtype='object')
~~~
{: .output}

Let us assume for now that we read in the complete file, which is now in the dataframe 'df_SN7577'. How can we now refer to specific columns?

@@ -53,13 +59,31 @@ print(df_SN7577.Q1)
~~~
{: .language-python}

~~~
0 1
1 3
2 10
3 9
...
~~~
{: .output}

If we are interested in more than one column, the 2nd method above cannot be used. However, in the first method, although we used a string with the value 'Q1', we could also have provided a list of strings. Remember that lists are enclosed in '[]'.

~~~
print(df_SN7577[['Q1', 'Q2', 'Q3']])
~~~
{: .language-python}

~~~
Q1 Q2 Q3
0 1 -1 1
1 3 -1 1
2 10 3 2
3 9 -1 10
...
~~~
{: .output}
> ## Exercise
>
> What happens if you:
62 changes: 62 additions & 0 deletions _episodes/10-aggregations.md
@@ -57,6 +57,16 @@ print(df_SAFI['B_no_membrs'].sum())
~~~
{: .language-python}

~~~
2
19
7.190839694656488
3.1722704895263734
131
942
~~~
{: .output}

Unlike the `describe` method, which converts the variable to a float (when it was originally an integer), the individual summary methods only do so for the returned result if needed.
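The difference in types can be checked directly. The following is a minimal sketch, assuming a small hand-made integer Series that stands in for the 'B_no_membrs' column; the values are hypothetical and are not taken from the SAFI data.

```python
import pandas as pd

# Hypothetical stand-in for an integer column such as 'B_no_membrs'
members = pd.Series([2, 7, 19, 5], name="B_no_membrs")

# describe() returns one Series, so every statistic is stored as a float
summary = members.describe()
print(summary.dtype)
print(summary["min"], summary["max"])

# The individual methods return integers where the result is a whole number
print(members.min(), members.max(), members.sum())

# The mean is returned as a float because the result needs one
print(members.mean())
```

Comparing `summary["min"]` with `members.min()` shows the same value held in two different types.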

We can do the same thing for the 'E19_period_use' variable
@@ -71,6 +81,16 @@ print(df_SAFI['E19_period_use'].sum())
~~~
{: .language-python}

~~~
1.0
45.0
12.043478260869565
8.583030848015385
92
1108.0
~~~
{: .output}

> ## Exercise
>
> Compare the count values returned for the 'B_no_membrs' and the 'E19_period_use' variables.
@@ -101,13 +121,27 @@ df_SAFI.isnull().sum()
~~~
{: .language-python}
~~~
Column1 0
A01_interview_date 0
A03_quest_no 0
A04_start 0
...
~~~
{: .output}
or for a specific variable
~~~
df_SAFI['E19_period_use'].isnull().sum()
~~~
{: .language-python}
~~~
39
~~~
{: .output}
Data from most sources has the potential to include missing data. Whether or not this presents a problem at all depends on what you are planning to do.
We have been using data from two very different sources.
@@ -138,6 +172,12 @@ print(df_SAFI.shape)
~~~
{: .language-python}
~~~
(131, 55)
(0, 55)
~~~
{: .output}
Because there are variables in the SAFI dataset which are all NaN, using the `dropna` method effectively deletes all of the rows from the dataframe, which is probably not what you wanted. Instead we can use the `notnull()` method as a row selection criterion and delete the rows where a specific variable has NaN values.
~~~
@@ -148,6 +188,12 @@ print(df_SAFI.shape)
~~~
{: .language-python}
~~~
(131, 55)
(39, 55)
~~~
{: .output}
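The contrast between `dropna` and a `notnull()` row filter can be reproduced on a tiny dataframe. This is a sketch under stated assumptions: the dataframe and the 'key_id' and 'all_missing' column names are made up for illustration, and only 'E19_period_use' borrows its name from the SAFI data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the SAFI data: 'all_missing' is NaN in every row
df = pd.DataFrame({
    "key_id": [1, 2, 3, 4],
    "E19_period_use": [5.0, np.nan, 12.0, np.nan],
    "all_missing": [np.nan] * 4,
})

# Every row contains at least one NaN, so dropna() leaves no rows at all
print(df.dropna().shape)

# Keep only the rows where one specific variable has a value
df_irrigated = df[df["E19_period_use"].notnull()]
print(df_irrigated.shape)
```

Filtering on a single column keeps the rows that are usable for that variable, regardless of NaN values elsewhere in the row.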
### Replace NaN with a value of our choice
The 'E19_period_use' variable answers the question: 'For how many years have you been irrigating the land?'. In some cases the land is not irrigated and these are represented by NaN in the dataset. So when we run
@@ -194,6 +240,11 @@ pd.unique(df_SAFI['C01_respondent_roof_type'])
~~~
{: .language-python}
~~~
array(['grass', 'mabatisloping', 'mabatipitched'], dtype=object)
~~~
{: .output}
Knowing all of the unique values is useful, but what is more useful is knowing how many occurrences of each there are. In order to do this we can use the `groupby` method.
Having performed the `groupby` we can then `describe()` the results. The format is similar to that which we have seen before, except that the 'grouped by' variable appears to the left and there is a set of statistics for each unique value of the variable.
@@ -220,6 +271,17 @@ A11_years_farm
~~~
{: .language-python}
~~~
C01_respondent_roof_type C02_respondent_wall_type
grass burntbricks 22
muddaub 42
sunbricks 9
mabatipitched burntbricks 6
muddaub 3
...
~~~
{: .output}
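Counting occurrences per group, and describing a variable per group, can be sketched on a handful of rows. The dataframe below is hypothetical, and the shortened column names 'roof', 'wall' and 'years_farm' are assumptions standing in for the longer SAFI names.

```python
import pandas as pd

# Hypothetical rows; the short column names are illustrative only
df = pd.DataFrame({
    "roof": ["grass", "grass", "mabatisloping", "grass", "mabatipitched"],
    "wall": ["muddaub", "burntbricks", "muddaub", "muddaub", "burntbricks"],
    "years_farm": [10, 4, 22, 7, 15],
})

# Number of rows for each combination of roof and wall type
counts = df.groupby(["roof", "wall"])["years_farm"].count()
print(counts)

# One row of summary statistics for each unique roof type
stats = df.groupby("roof")["years_farm"].describe()
print(stats)
```

Grouping by a list of columns produces a result indexed by every combination that occurs in the data, which is why the grouped-by variables appear on the left of the output.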
> ## Exercise
>
> 1. Read in the SAFI_results.csv dataset.
18 changes: 18 additions & 0 deletions _episodes/11-joins.md
@@ -47,6 +47,24 @@ print(df_SN7577i_b)
~~~
{: .language-python}


~~~
Id Q1 Q2 Q3 Q4
0 1 1 -1 1 8
1 2 3 -1 1 4
2 3 10 3 2 6
3 4 9 -1 10 10
...
Id Q1 Q2 Q3 Q4
0 1277 10 10 4 6
1 1278 2 -1 5 4
2 1279 2 -1 4 5
3 1280 1 -1 2 3
...
~~~
{: .output}

The `concat` method appends the rows from the two dataframes to create the df_all_rows dataframe. When you list this out you can see that all of the data rows are there; however, there is a problem with the `index`.
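The index problem is easy to see on two very small dataframes. This is a minimal sketch with hypothetical data; `ignore_index=True` is one possible way of renumbering the rows and may not be the approach the lesson itself goes on to use.

```python
import pandas as pd

# Two hypothetical fragments, each with its own index starting at 0
top = pd.DataFrame({"Id": [1, 2], "Q1": [1, 3]})
bottom = pd.DataFrame({"Id": [1277, 1278], "Q1": [10, 2]})

# Plain concat keeps the original index labels, so they repeat
stacked = pd.concat([top, bottom])
print(stacked.index.tolist())

# ignore_index=True discards the old labels and numbers the rows afresh
renumbered = pd.concat([top, bottom], ignore_index=True)
print(renumbered.index.tolist())
```

With repeated labels, a lookup such as `stacked.loc[0]` returns more than one row, which is rarely what is intended.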

~~~
24 changes: 24 additions & 0 deletions _episodes/12-long-and-wide.md
@@ -53,6 +53,11 @@ len(df_SN7577.index) +1
~~~
{: .language-python}

~~~
1287
~~~
{: .output}

We will create a 2nd dataframe, based on SN7577 but containing only the columns starting with the word 'daily'.

There are several ways of doing this.
@@ -93,6 +98,17 @@ print(df_papers.columns)
~~~
{: .language-python}

~~~
RangeIndex(start=0, stop=1286, step=1)
Index(['Id', 'daily1', 'daily2', 'daily3', 'daily4', 'daily5', 'daily6',
'daily7', 'daily8', 'daily9', 'daily10', 'daily11', 'daily12',
'daily13', 'daily14', 'daily15', 'daily16', 'daily17', 'daily18',
'daily19', 'daily20', 'daily21', 'daily22', 'daily23', 'daily24',
'daily25'],
dtype='object')
~~~
{: .output}

We use 'axis = 1' because we are joining by columns, not by rows (joining by rows is the default).
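The effect of the `axis` argument can be illustrated with two small dataframes. The data here is hypothetical; only the 'Id' and 'daily…' column names echo the lesson.

```python
import pandas as pd

# Hypothetical pieces: an Id column and two 'daily' columns
ids = pd.DataFrame({"Id": [1, 2, 3]})
papers = pd.DataFrame({"daily1": [0, 1, 0], "daily2": [1, 0, 0]})

# axis=1 places the dataframes side by side, matching rows on the index
side_by_side = pd.concat([ids, papers], axis=1)
print(side_by_side.shape)

# The default (axis=0) stacks them instead and fills the gaps with NaN
stacked = pd.concat([ids, papers])
print(stacked.shape)
```

The side-by-side result has the same number of rows as the inputs, whereas the stacked result has twice as many rows and mostly missing values.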


@@ -124,6 +140,14 @@ a
~~~
{: .language-python}

~~~
Daily_paper
daily1 0
daily2 26
daily3 52
~~~
{: .output}

## From Long to Wide

The process can be reversed by using the `pivot` method.
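As a rough illustration of the idea, the sketch below pivots a hypothetical long-format dataframe back to wide format. The 'Value' column name and the data are assumptions made for this example and may differ from the names used in the lesson.

```python
import pandas as pd

# Hypothetical long-format data: one row per respondent and paper
long_df = pd.DataFrame({
    "Id": [1, 1, 2, 2],
    "Daily_paper": ["daily1", "daily2", "daily1", "daily2"],
    "Value": [0, 1, 1, 0],
})

# Each distinct 'Daily_paper' value becomes a column, with one row per Id
wide_df = long_df.pivot(index="Id", columns="Daily_paper", values="Value")
print(wide_df)
```

`pivot` requires each combination of index and column value to occur at most once; duplicates raise an error rather than being aggregated.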
28 changes: 27 additions & 1 deletion _episodes/14-sqlite.md
@@ -58,7 +58,6 @@ The first thing we need to do is to make a connection to the database. An SQLite
The connection is assigned to a variable. You could use any variable name, but 'con' is quite commonly used for this purpose.

~~~
import sqlite3
con = sqlite3.connect('SN7577.sqlite')
~~~
{: .language-python}
@@ -79,6 +78,11 @@ cur.execute("SELECT * FROM SN7577")
~~~
{: .language-python}

~~~
<sqlite3.Cursor at 0x115e10d50>
~~~
{: .output}

The `execute` method doesn't actually return any data; it just indicates that we want the data provided by running the 'Select' statement.
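The separation between running a query and retrieving its rows can be tried without the SN7577 file. This is a minimal sketch, assuming an in-memory database and a hypothetical table called 'demo'.

```python
import sqlite3

# In-memory database with a hypothetical three-row table, so no file is needed
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (Id INTEGER, Q1 INTEGER)")
con.executemany("INSERT INTO demo VALUES (?, ?)", [(1, 1), (2, 3), (3, 10)])

cur = con.cursor()
result = cur.execute("SELECT * FROM demo")

# execute hands back the cursor itself rather than any rows
print(result is cur)

# The rows only arrive when we explicitly fetch them
rows = cur.fetchall()
print(rows)

con.close()
```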

> ## Exercise
@@ -113,6 +117,12 @@ for row in rows:
~~~
{: .language-python}
~~~
(1, -1, 1, 8, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 2, 3, 3, 4, 1, 4, 2, 2, 2, 2, 1, 0, 0, 0, 3, 2, 3, 3, 1, 4, 2, 3
...
~~~
{: .output}
The output is the data only; you do not get the column names.
The column names are available from the 'description' of the cursor.
@@ -126,6 +136,12 @@ print(colnames)
~~~
{: .language-python}
~~~
['Q1', 'Q2', 'Q3', 'Q4', 'Q5ai', 'Q5aii', 'Q5aiii', 'Q5aiv', 'Q5av', 'Q5avi', 'Q5avii', 'Q5aviii', 'Q5aix', 'Q5ax', 'Q5axi', 'Q5axii', 'Q5axiii', 'Q5axiv', 'Q5axv', 'Q5bi', 'Q5bii', 'Q5biii', 'Q5biv', 'Q5bv', 'Q5bvi', 'Q5bvii', 'Q5bviii', 'Q5bix', 'Q5bx', 'Q5bxi', 'Q5bxii', 'Q5bxiii', 'Q5bxiv', 'Q5bxv', 'Q6', 'Q7a', 'Q7b', 'Q8', 'Q9', 'Q10a', 'Q10b', 'Q10c', 'Q10d', 'Q11a',
...
~~~
{: .output}
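Each entry in the cursor's `description` is a 7-item tuple whose first item is the column name. A short sketch, assuming an in-memory database and a hypothetical empty table called 'demo':

```python
import sqlite3

# Hypothetical table; the description is available even when there are no rows
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (Q1 INTEGER, Q2 INTEGER, sex INTEGER)")

cur = con.cursor()
cur.execute("SELECT * FROM demo")

# Take the first item of each description entry to get the column names
colnames = [entry[0] for entry in cur.description]
print(colnames)

con.close()
```

The remaining six items of each tuple are `None` in the `sqlite3` module; they exist only for compatibility with the Python DB-API.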
One reason for using a database is the size of the data involved. Consequently it may not be practical to use `fetchall`, as this will return the complete result of your query.
An alternative is to use the `fetchone` method, which, as the name suggests, returns only a single row. The cursor keeps track of where you are in the results of the query, so the next call to `fetchone` will return the next record. When there are no more records it will return 'None'.
@@ -140,6 +156,11 @@ print(row)
~~~
{: .language-python}
~~~
(1, -1, 1, 8, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 2, 3, 3, 4, 1, 4, 2, 2, 2, 2, 1, 0, 0, 0, 3, 2, 3, 3, 1, 4, 2, 3, 2, 4, 4, 2, 2, 2, 4, 2, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0
~~~
{: .output}
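The way the cursor advances with each call, and signals the end with `None`, can be sketched as a loop. The in-memory database and the 'demo' table are hypothetical and are used only so that the example runs on its own.

```python
import sqlite3

# Hypothetical three-row table in an in-memory database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (Id INTEGER)")
con.executemany("INSERT INTO demo VALUES (?)", [(1,), (2,), (3,)])

cur = con.cursor()
cur.execute("SELECT Id FROM demo ORDER BY Id")

# Each call to fetchone moves the cursor on by one record
seen = []
row = cur.fetchone()
while row is not None:
    seen.append(row[0])
    row = cur.fetchone()

# After the last record, fetchone has returned None and the loop has ended
print(seen)

con.close()
```

Processing one row at a time keeps memory use constant, whatever the size of the result set.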
> ## Exercise
>
> Can you write code to return the first 5 records from the SN7577 table in two different ways?
@@ -217,6 +238,11 @@ con.close()
~~~
{: .language-python}
~~~
(335, 202)
~~~
{: .output}
## Deleting an SQLite table
If you have created tables in an SQLite database, you may also want to delete them.