Add Python output for remaining chapters (#55)

* add output formatting for ch07-pandas
* remove duplicate output code in ch08
* add python output for print statements for chapter 10-aggregations
* add python output for ch11-joins
* add python output for ch12-long-and-wide
* add python output to ch14-sqlite
datacarpentry · May 19, 2018 · 893a11c
1 parent dcfe235
Showing 5 changed files with 155 additions and 1 deletion.
diff --git a/_episodes/09-extracting-data.md b/_episodes/09-extracting-data.md
@@ -40,6 +40,12 @@ print(df_SN7577_some_cols.columns)
 ~~~
 {: .language-python}
 
+~~~
+(1286, 6)
+Index(['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'], dtype='object')
+Index(['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'], dtype='object')
+~~~
+{: .output}
 
 Let us assume for now that we read in the complete file, which is now in the dataframe 'df_SN7577'. How can we now refer to specific columns?
 
@@ -53,13 +59,31 @@ print(df_SN7577.Q1)
 ~~~
 {: .language-python}
 
+~~~
+0        1
+1        3
+2       10
+3        9
+...
+~~~
+{: .output}
+
 If we are interested in more than one column, the second method above cannot be used. In the first method, however, although we used a string with the value 'Q1', we could also have provided a list of strings. Remember that lists are enclosed in '[]'.
 
 ~~~
 print(df_SN7577[['Q1', 'Q2', 'Q3']])
 ~~~
 {: .language-python}
 
+~~~
+      Q1  Q2  Q3
+0      1  -1   1
+1      3  -1   1
+2     10   3   2
+3      9  -1  10
+...
+~~~
+{: .output}
 > ## Exercise  
 > 
 > What happens if you:

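The two column-selection forms discussed in the 09-extracting-data hunks can be tried on a small frame. This is a minimal sketch only: the dataframe `df` below is a hypothetical stand-in for `df_SN7577`, and its values are illustrative rather than taken from the real file.

```python
import pandas as pd

# Hypothetical stand-in for df_SN7577; values are illustrative only.
df = pd.DataFrame({
    "Q1": [1, 3, 10, 9],
    "Q2": [-1, -1, 3, -1],
    "Q3": [1, 1, 2, 10],
})

single = df["Q1"]            # a string selects one column and returns a Series
also_single = df.Q1          # attribute access also returns one column
several = df[["Q1", "Q3"]]   # a list of strings returns a DataFrame

print(type(single).__name__)
print(type(several).__name__)
print(list(several.columns))
```

Attribute access has no equivalent for several columns, which is why the list form is needed.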
diff --git a/_episodes/10-aggregations.md b/_episodes/10-aggregations.md
@@ -57,6 +57,16 @@ print(df_SAFI['B_no_membrs'].sum())
 ~~~
 {: .language-python}
 
+~~~
+2
+19
+7.190839694656488
+3.1722704895263734
+131
+942
+~~~
+{: .output}
+
 Unlike the `describe` method, which converts the variable to a float (when it was originally an integer), the individual summary methods only do so for the returned result if needed. 
 
 We can do the same thing for the 'E19_period_use' variable
@@ -71,6 +81,16 @@ print(df_SAFI['E19_period_use'].sum())
 ~~~
 {: .language-python}
 
+~~~
+1.0
+45.0
+12.043478260869565
+8.583030848015385
+92
+1108.0
+~~~
+{: .output}
+
 > ## Exercise
 > 
 > Compare the count values returned for the 'B_no_membrs' and the 'E19_period_use' variables. 
@@ -101,13 +121,27 @@ df_SAFI.isnull().sum()
 ~~~
 {: .language-python}
 
+~~~
+Column1                             0
+A01_interview_date                  0
+A03_quest_no                        0
+A04_start                           0
+...
+~~~
+{: .output}
+
 or for a specific variable 
 
 ~~~
 df_SAFI['E19_period_use'].isnull().sum()
 ~~~
 {: .language-python}
 
+~~~
+39
+~~~
+{: .output}
+
 Data from most sources has the potential to include missing data. Whether or not this presents a problem at all depends on what you are planning to do. 
 
 We have been using data from two very different sources. 
@@ -138,6 +172,12 @@ print(df_SAFI.shape)
 ~~~
 {: .language-python}
 
+~~~
+(131, 55)
+(0, 55)
+~~~
+{: .output}
+
 Because there are variables in the SAFI dataset which are all NaN, using the `dropna` method effectively deletes all of the rows from the dataframe, which is probably not what you wanted. Instead we can use the `notnull()` method as a row selection criterion and delete the rows where a specific variable has NaN values.
 
 ~~~
@@ -148,6 +188,12 @@ print(df_SAFI.shape)
 ~~~
 {: .language-python}
 
+~~~
+(131, 55)
+(39, 55)
+~~~
+{: .output}
+
 ### Replace NaN with a value of our choice
 
 The 'E19_period_use' variable answers the question: 'For how many years have you been irrigating the land?'. In some cases the land is not irrigated and these are represented by NaN in the dataset. So when we run 
@@ -194,6 +240,11 @@ pd.unique(df_SAFI['C01_respondent_roof_type'])
 ~~~
 {: .language-python}
 
+~~~
+array(['grass', 'mabatisloping', 'mabatipitched'], dtype=object)
+~~~
+{: .output}
+
 Knowing all of the unique values is useful, but what is more useful is knowing how many occurrences of each there are. In order to do this we can use the `groupby` method. 
 
 Having performed the `groupby` we can then `describe()` the results. The format is similar to that which we have seen before except that the 'grouped by' variable appears to the left and there is a set of statistics for each unique value of the variable.
@@ -220,6 +271,17 @@ A11_years_farm
 ~~~
 {: .language-python}
 
+~~~
+C01_respondent_roof_type  C02_respondent_wall_type
+grass                     burntbricks                 22
+                          muddaub                     42
+                          sunbricks                    9
+mabatipitched             burntbricks                  6
+                          muddaub                      3
+...
+~~~
+{: .output}
+
 > ## Exercise
 > 
 > 1. Read in the SAFI_results.csv dataset.

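The difference between `dropna` and a `notnull()` row filter, described in the 10-aggregations hunks, can be reproduced on a tiny frame. This is a sketch under stated assumptions: the column names `id`, `period_use` and `always_empty` are hypothetical and the data is made up; only the pandas calls themselves come from the lesson text.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the SAFI situation: one column is entirely NaN.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "period_use": [5.0, np.nan, 12.0, np.nan],
    "always_empty": [np.nan, np.nan, np.nan, np.nan],
})

# dropna() removes every row that has at least one NaN in any column,
# so the all-NaN column causes every row to be dropped.
dropped = df.dropna()

# Filtering on one variable keeps the rows where that variable has a value.
kept = df[df["period_use"].notnull()]

print(dropped.shape)                      # (0, 3)
print(kept.shape)                         # (2, 3)
print(df["period_use"].isnull().sum())    # 2
```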
diff --git a/_episodes/11-joins.md b/_episodes/11-joins.md
@@ -47,6 +47,24 @@ print(df_SN7577i_b)
 ~~~
 {: .language-python}
 
+
+~~~
+  Id  Q1  Q2  Q3  Q4
+0   1   1  -1   1   8
+1   2   3  -1   1   4
+2   3  10   3   2   6
+3   4   9  -1  10  10
+...
+
+  Id  Q1  Q2  Q3  Q4
+0  1277  10  10   4   6
+1  1278   2  -1   5   4
+2  1279   2  -1   4   5
+3  1280   1  -1   2   3
+...
+~~~
+{: .output}
+
 The `concat` method appends the rows from the two dataframes to create the df_all_rows dataframe. When you list this out you can see that all of the data rows are there, however there is a problem with the `index`.
 
 ~~~

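The `concat` index issue mentioned in the 11-joins hunk is not spelled out in the lines shown here. Assuming it refers to repeated index labels after appending rows, the following sketch shows the effect on two hypothetical two-row frames, along with the `ignore_index=True` option of `pd.concat` as one possible way to renumber the rows.

```python
import pandas as pd

# Hypothetical stand-ins for df_SN7577i_a and df_SN7577i_b.
a = pd.DataFrame({"Id": [1, 2], "Q1": [1, 3]})
b = pd.DataFrame({"Id": [1277, 1278], "Q1": [10, 2]})

# Each frame keeps its own 0-based index, so labels repeat after concat.
all_rows = pd.concat([a, b])
print(list(all_rows.index))      # [0, 1, 0, 1]

# ignore_index=True discards the original labels and renumbers from 0.
renumbered = pd.concat([a, b], ignore_index=True)
print(list(renumbered.index))    # [0, 1, 2, 3]
```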
diff --git a/_episodes/12-long-and-wide.md b/_episodes/12-long-and-wide.md
@@ -53,6 +53,11 @@ len(df_SN7577.index) +1
 ~~~
 {: .language-python}
 
+~~~
+1287
+~~~
+{: .output}
+
 We will create a 2nd dataframe, based on SN7577 but containing only the columns starting with the word 'daily'. 
 
 There are several ways of doing this.
@@ -93,6 +98,17 @@ print(df_papers.columns)
 ~~~
 {: .language-python}
 
+~~~
+RangeIndex(start=0, stop=1286, step=1)
+Index(['Id', 'daily1', 'daily2', 'daily3', 'daily4', 'daily5', 'daily6',
+       'daily7', 'daily8', 'daily9', 'daily10', 'daily11', 'daily12',
+       'daily13', 'daily14', 'daily15', 'daily16', 'daily17', 'daily18',
+       'daily19', 'daily20', 'daily21', 'daily22', 'daily23', 'daily24',
+       'daily25'],
+      dtype='object')
+~~~
+{: .output}
+
 We use 'axis = 1' because we are joining by columns, not by rows, which is the default.
 
 
@@ -124,6 +140,14 @@ a
 ~~~
 {: .language-python}
 
+~~~
+Daily_paper
+daily1     0
+daily2    26
+daily3    52
+~~~
+{: .output}
+
 ## From Long to Wide 
 
 The process can be reversed by using the `pivot` method. 

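The wide-to-long-to-wide round trip referred to in the 12-long-and-wide hunks can be sketched as follows. The long form here is produced with `DataFrame.melt`, which is an assumption, since the lesson's own reshaping code is not visible in this diff. The frame `wide` and the column name `value` are hypothetical; `Daily_paper` mirrors the name in the output shown above.

```python
import pandas as pd

# Hypothetical two-respondent extract with two 'daily' columns.
wide = pd.DataFrame({"Id": [1, 2], "daily1": [0, 1], "daily2": [1, 0]})

# Wide to long: one row per (Id, Daily_paper) pair.
long_df = wide.melt(id_vars="Id", var_name="Daily_paper", value_name="value")
print(long_df.shape)             # (4, 3)

# Long back to wide: pivot turns each Daily_paper value into a column again.
back = long_df.pivot(index="Id", columns="Daily_paper", values="value")
print(list(back.columns))        # ['daily1', 'daily2']
```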
diff --git a/_episodes/14-sqlite.md b/_episodes/14-sqlite.md
@@ -58,7 +58,6 @@ The first thing we need to do is to make a connection to the database. An SQLite
 The connection is assigned to a variable. You could use any variable name, but 'con' is quite commonly used for this purpose
 
 ~~~
-import sqlite3
 con = sqlite3.connect('SN7577.sqlite')
 ~~~
 {: .language-python}
@@ -79,6 +78,11 @@ cur.execute("SELECT * FROM SN7577")
 ~~~
 {: .language-python}
 
+~~~
+<sqlite3.Cursor at 0x115e10d50>
+~~~
+{: .output}
+
 The `execute` method doesn't actually return any data; it just indicates that we want the data provided by running the 'SELECT' statement.
 
 > ## Exercise 
@@ -113,6 +117,12 @@ for row in rows:
 ~~~
 {: .language-python}
 
+~~~
+(1, -1, 1, 8, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 2, 3, 3, 4, 1, 4, 2, 2, 2, 2, 1, 0, 0, 0, 3, 2, 3, 3, 1, 4, 2, 3
+...
+~~~
+{: .output}
+
 The output is the data only, you do not get the column names.
 
 The column names are available from the 'description' of the cursor.
@@ -126,6 +136,12 @@ print(colnames)
 ~~~
 {: .language-python}
 
+~~~
+['Q1', 'Q2', 'Q3', 'Q4', 'Q5ai', 'Q5aii', 'Q5aiii', 'Q5aiv', 'Q5av', 'Q5avi', 'Q5avii', 'Q5aviii', 'Q5aix', 'Q5ax', 'Q5axi', 'Q5axii', 'Q5axiii', 'Q5axiv', 'Q5axv', 'Q5bi', 'Q5bii', 'Q5biii', 'Q5biv', 'Q5bv', 'Q5bvi', 'Q5bvii', 'Q5bviii', 'Q5bix', 'Q5bx', 'Q5bxi', 'Q5bxii', 'Q5bxiii', 'Q5bxiv', 'Q5bxv', 'Q6', 'Q7a', 'Q7b', 'Q8', 'Q9', 'Q10a', 'Q10b', 'Q10c', 'Q10d', 'Q11a',
+...
+~~~
+{: .output}
+
 One reason for using a database is the size of the data involved. Consequently it may not be practical to use `fetchall`, as this will return the complete result of your query.
 
 An alternative is to use the `fetchone` method, which as the name suggests returns only a single row. The cursor keeps track of where you are in the results of the query, so the next call to `fetchone` will return the next record. When there are no more records it will return 'None'.
@@ -140,6 +156,11 @@ print(row)
 ~~~
 {: .language-python}
 
+~~~
+(1, -1, 1, 8, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 2, 3, 3, 4, 1, 4, 2, 2, 2, 2, 1, 0, 0, 0, 3, 2, 3, 3, 1, 4, 2, 3, 2, 4, 4, 2, 2, 2, 4, 2, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0
+~~~
+{: .output}
+
 > ## Exercise 
 > 
 > Can you write code to return the first 5 records from the SN7577 table in two different ways?
@@ -217,6 +238,11 @@ con.close()
 ~~~
 {: .language-python}
 
+~~~
+(335, 202)
+~~~
+{: .output}
+
 ## Deleting an SQLite table
 
 If you have created tables in an SQLite database, you may also want to delete them.
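Returning to the cursor behaviour described in the 14-sqlite hunks above (column names from `description`, `fetchone` versus `fetchall`, and `None` at the end of the results), here is a minimal self-contained sketch. It uses an in-memory database and a hypothetical `demo` table rather than `SN7577.sqlite`, so the table name and values are assumptions for illustration.

```python
import sqlite3

# In-memory database with a hypothetical table, so no file is needed.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE demo (Q1 INTEGER, Q2 INTEGER)")
cur.executemany("INSERT INTO demo VALUES (?, ?)", [(1, -1), (3, -1), (10, 3)])

cur.execute("SELECT * FROM demo ORDER BY Q1")

# The first item of each entry in cursor.description is the column name.
colnames = [d[0] for d in cur.description]
print(colnames)      # ['Q1', 'Q2']

first = cur.fetchone()      # one row; the cursor moves on to the next
rest = cur.fetchall()       # everything that has not been fetched yet
after_end = cur.fetchone()  # nothing left, so None is returned

print(first)         # (1, -1)
print(len(rest))     # 2
print(after_end)     # None

con.close()
```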