From b493dad2222dfbcc3c5d2c096581e25068eb54c9 Mon Sep 17 00:00:00 2001
From: Chanchal Kumar Maji <31502077+ChanchalKumarMaji@users.noreply.github.com>
Date: Sun, 3 Feb 2019 22:39:37 +0530
Subject: [PATCH 1/5] Update README.md
---
README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index a2a09d1..3a2ea3c 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# Assignment 5
+# Assignment 5
This is the final Assignment of MLCC Study Jam, DSC Kolkata. In this Assignment, you're asked to solve the official MLCC Notebooks. These notebooks are a property of Google Inc.
To get accepted for final evaluation, complete all the notebooks and commit them to the branches having your github ID.
From 0fa63fdd118d772247aef08af5c07fdc7d3427c2 Mon Sep 17 00:00:00 2001
From: Chanchal Kumar Maji <31502077+ChanchalKumarMaji@users.noreply.github.com>
Date: Sun, 3 Feb 2019 22:51:15 +0530
Subject: [PATCH 2/5] Created using Colaboratory
---
intro_to_pandas.ipynb | 660 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 660 insertions(+)
create mode 100644 intro_to_pandas.ipynb
diff --git a/intro_to_pandas.ipynb b/intro_to_pandas.ipynb
new file mode 100644
index 0000000..942ea63
--- /dev/null
+++ b/intro_to_pandas.ipynb
@@ -0,0 +1,660 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "intro_to_pandas.ipynb",
+ "version": "0.3.2",
+ "provenance": [],
+ "collapsed_sections": [
+ "JndnmDMp66FL",
+ "YHIWvc9Ms-Ll",
+ "TJffr5_Jwqvd"
+ ],
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python2",
+ "display_name": "Python 2"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ " "
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "JndnmDMp66FL"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "#### Copyright 2017 Google LLC."
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "hMqWDc_m6rUC",
+ "cellView": "both",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+ "# you may not use this file except in compliance with the License.\n",
+ "# You may obtain a copy of the License at\n",
+ "#\n",
+ "# https://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing, software\n",
+ "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+ "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+ "# See the License for the specific language governing permissions and\n",
+ "# limitations under the License."
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "rHLcriKWLRe4"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "# Intro to pandas"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "QvJBqX8_Bctk"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "**Learning Objectives:**\n",
+ " * Gain an introduction to the `DataFrame` and `Series` data structures of the *pandas* library\n",
+ " * Access and manipulate data within a `DataFrame` and `Series`\n",
+ " * Import CSV data into a *pandas* `DataFrame`\n",
+ " * Reindex a `DataFrame` to shuffle data"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "TIFJ83ZTBctl"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "[*pandas*](http://pandas.pydata.org/) is a column-oriented data analysis API. It's a great tool for handling and analyzing input data, and many ML frameworks support *pandas* data structures as inputs.\n",
+ "Although a comprehensive introduction to the *pandas* API would span many pages, the core concepts are fairly straightforward, and we'll present them below. For a more complete reference, the [*pandas* docs site](http://pandas.pydata.org/pandas-docs/stable/index.html) contains extensive documentation and many tutorials."
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "s_JOISVgmn9v"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Basic Concepts\n",
+ "\n",
+ "The following line imports the *pandas* API and prints the API version:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "aSRYu62xUi3g",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "from __future__ import print_function\n",
+ "\n",
+ "import pandas as pd\n",
+ "pd.__version__"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "daQreKXIUslr"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "The primary data structures in *pandas* are implemented as two classes:\n",
+ "\n",
+ " * **`DataFrame`**, which you can imagine as a relational data table, with rows and named columns.\n",
+ " * **`Series`**, which is a single column. A `DataFrame` contains one or more `Series` and a name for each `Series`.\n",
+ "\n",
+ "The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in [Spark](https://spark.apache.org/) and [R](https://www.r-project.org/about.html)."
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "fjnAk1xcU0yc"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "One way to create a `Series` is to construct a `Series` object. For example:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "DFZ42Uq7UFDj",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "pd.Series(['San Francisco', 'San Jose', 'Sacramento'])"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "U5ouUp1cU6pC"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "`DataFrame` objects can be created by passing a `dict` mapping `string` column names to their respective `Series`. If the `Series` don't match in length, missing values are filled with special [NA/NaN](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) values. Example:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "avgr6GfiUh8t",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])\n",
+ "population = pd.Series([852469, 1015785, 485199])\n",
+ "\n",
+ "pd.DataFrame({ 'City name': city_names, 'Population': population })"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "oa5wfZT7VHJl"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "But most of the time, you load an entire file into a `DataFrame`. The following example loads a file with California housing data. Run the following cell to load the data and display summary statistics:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "av6RYOraVG1V",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "california_housing_dataframe = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\", sep=\",\")\n",
+ "california_housing_dataframe.describe()"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "WrkBjfz5kEQu"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "The example above used `DataFrame.describe` to show summary statistics (such as count, mean, and quartiles) for each numeric column of a `DataFrame`. Another useful function is `DataFrame.head`, which displays the first few records of a `DataFrame`:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "s3ND3bgOkB5k",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "california_housing_dataframe.head()"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "w9-Es5Y6laGd"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Another powerful feature of *pandas* is graphing. For example, `DataFrame.hist` lets you quickly study the distribution of values in a column:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "nqndFVXVlbPN",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "california_housing_dataframe.hist('housing_median_age')"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "XtYZ7114n3b-"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Accessing Data\n",
+ "\n",
+ "You can access `DataFrame` data using familiar Python dict/list operations:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "_TFm7-looBFF",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "cities = pd.DataFrame({ 'City name': city_names, 'Population': population })\n",
+ "print(type(cities['City name']))\n",
+ "cities['City name']"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "V5L6xacLoxyv",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "print(type(cities['City name'][1]))\n",
+ "cities['City name'][1]"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "gcYX1tBPugZl",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "print(type(cities[0:2]))\n",
+ "cities[0:2]"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "65g1ZdGVjXsQ"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "In addition, *pandas* provides an extremely rich API for advanced [indexing and selection](http://pandas.pydata.org/pandas-docs/stable/indexing.html) that is too extensive to be covered here."
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "RM1iaD-ka3Y1"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Manipulating Data\n",
+ "\n",
+ "You may apply Python's basic arithmetic operations to `Series`. For example:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "XWmyCFJ5bOv-",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "population / 1000."
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "TQzIVnbnmWGM"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "[NumPy](http://www.numpy.org/) is a popular toolkit for scientific computing. *pandas* `Series` can be used as arguments to most NumPy functions:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "ko6pLK6JmkYP",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "np.log(population)"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "xmxFuQmurr6d"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "For more complex single-column transformations, you can use `Series.apply`. Like the Python [map function](https://docs.python.org/2/library/functions.html#map), \n",
+ "`Series.apply` accepts as an argument a [lambda function](https://docs.python.org/2/tutorial/controlflow.html#lambda-expressions), which is applied to each value.\n",
+ "\n",
+ "The example below creates a new `Series` that indicates whether `population` is over one million:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "Fc1DvPAbstjI",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "population.apply(lambda val: val > 1000000)"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "ZeYYLoV9b9fB"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "Modifying `DataFrames` is also straightforward. For example, the following code adds two `Series` to an existing `DataFrame`:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "0gCEX99Hb8LR",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])\n",
+ "cities['Population density'] = cities['Population'] / cities['Area square miles']\n",
+ "cities"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "6qh63m-ayb-c"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Exercise #1\n",
+ "\n",
+ "Modify the `cities` table by adding a new boolean column that is True if and only if *both* of the following are True:\n",
+ "\n",
+ " * The city is named after a saint.\n",
+ " * The city has an area greater than 50 square miles.\n",
+ "\n",
+ "**Note:** Boolean `Series` are combined using the bitwise, rather than the traditional boolean, operators. For example, when performing *logical and*, use `&` instead of `and`.\n",
+ "\n",
+ "**Hint:** \"San\" in Spanish means \"saint.\""
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "zCOn8ftSyddH",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "# Your code here"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "YHIWvc9Ms-Ll"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "### Solution\n",
+ "\n",
+ "Click below for a solution."
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "T5OlrqtdtCIb",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "cities['Is wide and has saint name'] = (cities['Area square miles'] > 50) & cities['City name'].apply(lambda name: name.startswith('San'))\n",
+ "cities"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "f-xAOJeMiXFB"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Indexes\n",
+ "Both `Series` and `DataFrame` objects also define an `index` property that assigns an identifier value to each `Series` item or `DataFrame` row. \n",
+ "\n",
+ "By default, at construction, *pandas* assigns index values that reflect the ordering of the source data. Once created, the index values are stable; that is, they do not change when data is reordered."
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "2684gsWNinq9",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "city_names.index"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "F_qPe2TBjfWd",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "cities.index"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "hp2oWY9Slo_h"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Call `DataFrame.reindex` to manually reorder the rows. For example, the following has the same effect as sorting by city name:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "sN0zUzSAj-U1",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "cities.reindex([2, 0, 1])"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "-GQFz8NZuS06"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Reindexing is a great way to shuffle (randomize) a `DataFrame`. In the example below, we take the index, which is array-like, and pass it to NumPy's `random.permutation` function, which returns a shuffled copy of its values. Calling `reindex` with this shuffled array causes the `DataFrame` rows to be shuffled in the same way.\n",
+ "Try running the following cell multiple times!"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "mF8GC0k8uYhz",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "cities.reindex(np.random.permutation(cities.index))"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "fSso35fQmGKb"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "For more information, see the [Index documentation](http://pandas.pydata.org/pandas-docs/stable/indexing.html#index-objects)."
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "8UngIdVhz8C0"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Exercise #2\n",
+ "\n",
+ "The `reindex` method allows index values that are not in the original `DataFrame`'s index values. Try it and see what happens if you use such values! Why do you think this is allowed?"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "PN55GrDX0jzO",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "# Your code here"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "TJffr5_Jwqvd"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "### Solution\n",
+ "\n",
+ "Click below for the solution."
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "8oSvi2QWwuDH"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "If your `reindex` input array includes values not in the original `DataFrame` index values, `reindex` will add new rows for these \"missing\" indices and populate all corresponding columns with `NaN` values:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "yBdkucKCwy4x",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "cities.reindex([0, 4, 5, 2])"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "2l82PhPbwz7g"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "This behavior is desirable because indexes are often strings pulled from the actual data (see the [*pandas* reindex\n",
+ "documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html) for an example\n",
+ "in which the index values are browser names).\n",
+ "\n",
+ "In this case, allowing \"missing\" indices makes it easy to reindex using an external list, as you don't have to worry about\n",
+ "sanitizing the input."
+ ]
+ }
+ ]
+}
\ No newline at end of file
From 50fe890274003fa469ef03d3c44a6f02556f253a Mon Sep 17 00:00:00 2001
From: Chanchal Kumar Maji <31502077+ChanchalKumarMaji@users.noreply.github.com>
Date: Sun, 3 Feb 2019 23:00:18 +0530
Subject: [PATCH 3/5] Created using Colaboratory
From f282d373f15c0d27ff7f8fdf341bba5bfe30782f Mon Sep 17 00:00:00 2001
From: Chanchal Kumar Maji <31502077+ChanchalKumarMaji@users.noreply.github.com>
Date: Sun, 3 Feb 2019 23:01:02 +0530
Subject: [PATCH 4/5] Delete intro_to_pandas.ipynb
---
intro_to_pandas.ipynb | 660 ------------------------------------------
1 file changed, 660 deletions(-)
delete mode 100644 intro_to_pandas.ipynb
diff --git a/intro_to_pandas.ipynb b/intro_to_pandas.ipynb
deleted file mode 100644
index 942ea63..0000000
--- a/intro_to_pandas.ipynb
+++ /dev/null
@@ -1,660 +0,0 @@
-{
- "nbformat": 4,
- "nbformat_minor": 0,
- "metadata": {
- "colab": {
- "name": "intro_to_pandas.ipynb",
- "version": "0.3.2",
- "provenance": [],
- "collapsed_sections": [
- "JndnmDMp66FL",
- "YHIWvc9Ms-Ll",
- "TJffr5_Jwqvd"
- ],
- "include_colab_link": true
- },
- "kernelspec": {
- "name": "python2",
- "display_name": "Python 2"
- }
- },
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "view-in-github",
- "colab_type": "text"
- },
- "source": [
- " "
- ]
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "JndnmDMp66FL"
- },
- "cell_type": "markdown",
- "source": [
- "#### Copyright 2017 Google LLC."
- ]
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "hMqWDc_m6rUC",
- "cellView": "both",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# https://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License."
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "rHLcriKWLRe4"
- },
- "cell_type": "markdown",
- "source": [
- "# Intro to pandas"
- ]
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "QvJBqX8_Bctk"
- },
- "cell_type": "markdown",
- "source": [
- "**Learning Objectives:**\n",
- " * Gain an introduction to the `DataFrame` and `Series` data structures of the *pandas* library\n",
- " * Access and manipulate data within a `DataFrame` and `Series`\n",
- " * Import CSV data into a *pandas* `DataFrame`\n",
- " * Reindex a `DataFrame` to shuffle data"
- ]
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "TIFJ83ZTBctl"
- },
- "cell_type": "markdown",
- "source": [
- "[*pandas*](http://pandas.pydata.org/) is a column-oriented data analysis API. It's a great tool for handling and analyzing input data, and many ML frameworks support *pandas* data structures as inputs.\n",
- "Although a comprehensive introduction to the *pandas* API would span many pages, the core concepts are fairly straightforward, and we'll present them below. For a more complete reference, the [*pandas* docs site](http://pandas.pydata.org/pandas-docs/stable/index.html) contains extensive documentation and many tutorials."
- ]
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "s_JOISVgmn9v"
- },
- "cell_type": "markdown",
- "source": [
- "## Basic Concepts\n",
- "\n",
- "The following line imports the *pandas* API and prints the API version:"
- ]
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "aSRYu62xUi3g",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "from __future__ import print_function\n",
- "\n",
- "import pandas as pd\n",
- "pd.__version__"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "daQreKXIUslr"
- },
- "cell_type": "markdown",
- "source": [
- "The primary data structures in *pandas* are implemented as two classes:\n",
- "\n",
- " * **`DataFrame`**, which you can imagine as a relational data table, with rows and named columns.\n",
- " * **`Series`**, which is a single column. A `DataFrame` contains one or more `Series` and a name for each `Series`.\n",
- "\n",
- "The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in [Spark](https://spark.apache.org/) and [R](https://www.r-project.org/about.html)."
- ]
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "fjnAk1xcU0yc"
- },
- "cell_type": "markdown",
- "source": [
- "One way to create a `Series` is to construct a `Series` object. For example:"
- ]
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "DFZ42Uq7UFDj",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "pd.Series(['San Francisco', 'San Jose', 'Sacramento'])"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "U5ouUp1cU6pC"
- },
- "cell_type": "markdown",
- "source": [
- "`DataFrame` objects can be created by passing a `dict` mapping `string` column names to their respective `Series`. If the `Series` don't match in length, missing values are filled with special [NA/NaN](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) values. Example:"
- ]
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "avgr6GfiUh8t",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])\n",
- "population = pd.Series([852469, 1015785, 485199])\n",
- "\n",
- "pd.DataFrame({ 'City name': city_names, 'Population': population })"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "oa5wfZT7VHJl"
- },
- "cell_type": "markdown",
- "source": [
- "But most of the time, you load an entire file into a `DataFrame`. The following example loads a file with California housing data. Run the following cell to load the data and display summary statistics:"
- ]
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "av6RYOraVG1V",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "california_housing_dataframe = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\", sep=\",\")\n",
- "california_housing_dataframe.describe()"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "WrkBjfz5kEQu"
- },
- "cell_type": "markdown",
- "source": [
- "The example above used `DataFrame.describe` to show summary statistics (such as count, mean, and quartiles) for each numeric column of a `DataFrame`. Another useful function is `DataFrame.head`, which displays the first few records of a `DataFrame`:"
- ]
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "s3ND3bgOkB5k",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "california_housing_dataframe.head()"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "w9-Es5Y6laGd"
- },
- "cell_type": "markdown",
- "source": [
- "Another powerful feature of *pandas* is graphing. For example, `DataFrame.hist` lets you quickly study the distribution of values in a column:"
- ]
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "nqndFVXVlbPN",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "california_housing_dataframe.hist('housing_median_age')"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "XtYZ7114n3b-"
- },
- "cell_type": "markdown",
- "source": [
- "## Accessing Data\n",
- "\n",
- "You can access `DataFrame` data using familiar Python dict/list operations:"
- ]
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "_TFm7-looBFF",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "cities = pd.DataFrame({ 'City name': city_names, 'Population': population })\n",
- "print(type(cities['City name']))\n",
- "cities['City name']"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "V5L6xacLoxyv",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "print(type(cities['City name'][1]))\n",
- "cities['City name'][1]"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "gcYX1tBPugZl",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "print(type(cities[0:2]))\n",
- "cities[0:2]"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "65g1ZdGVjXsQ"
- },
- "cell_type": "markdown",
- "source": [
- "In addition, *pandas* provides an extremely rich API for advanced [indexing and selection](http://pandas.pydata.org/pandas-docs/stable/indexing.html) that is too extensive to be covered here."
- ]
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "RM1iaD-ka3Y1"
- },
- "cell_type": "markdown",
- "source": [
- "## Manipulating Data\n",
- "\n",
- "You may apply Python's basic arithmetic operations to `Series`. For example:"
- ]
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "XWmyCFJ5bOv-",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "population / 1000."
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "TQzIVnbnmWGM"
- },
- "cell_type": "markdown",
- "source": [
- "[NumPy](http://www.numpy.org/) is a popular toolkit for scientific computing. *pandas* `Series` can be used as arguments to most NumPy functions:"
- ]
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "ko6pLK6JmkYP",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "import numpy as np\n",
- "\n",
- "np.log(population)"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "xmxFuQmurr6d"
- },
- "cell_type": "markdown",
- "source": [
- "For more complex single-column transformations, you can use `Series.apply`. Like the Python [map function](https://docs.python.org/2/library/functions.html#map), \n",
- "`Series.apply` accepts as an argument a [lambda function](https://docs.python.org/2/tutorial/controlflow.html#lambda-expressions), which is applied to each value.\n",
- "\n",
- "The example below creates a new `Series` that indicates whether `population` is over one million:"
- ]
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "Fc1DvPAbstjI",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "population.apply(lambda val: val > 1000000)"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "ZeYYLoV9b9fB"
- },
- "cell_type": "markdown",
- "source": [
- "\n",
- "Modifying `DataFrames` is also straightforward. For example, the following code adds two `Series` to an existing `DataFrame`:"
- ]
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "0gCEX99Hb8LR",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])\n",
- "cities['Population density'] = cities['Population'] / cities['Area square miles']\n",
- "cities"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "6qh63m-ayb-c"
- },
- "cell_type": "markdown",
- "source": [
- "## Exercise #1\n",
- "\n",
- "Modify the `cities` table by adding a new boolean column that is True if and only if *both* of the following are True:\n",
- "\n",
- " * The city is named after a saint.\n",
- " * The city has an area greater than 50 square miles.\n",
- "\n",
- "**Note:** Boolean `Series` are combined using the bitwise, rather than the traditional boolean, operators. For example, when performing *logical and*, use `&` instead of `and`.\n",
- "\n",
- "**Hint:** \"San\" in Spanish means \"saint.\""
- ]
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "zCOn8ftSyddH",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "# Your code here"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "YHIWvc9Ms-Ll"
- },
- "cell_type": "markdown",
- "source": [
- "### Solution\n",
- "\n",
- "Click below for a solution."
- ]
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "T5OlrqtdtCIb",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "cities['Is wide and has saint name'] = (cities['Area square miles'] > 50) & cities['City name'].apply(lambda name: name.startswith('San'))\n",
- "cities"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "f-xAOJeMiXFB"
- },
- "cell_type": "markdown",
- "source": [
- "## Indexes\n",
- "Both `Series` and `DataFrame` objects also define an `index` property that assigns an identifier value to each `Series` item or `DataFrame` row. \n",
- "\n",
- "By default, at construction, *pandas* assigns index values that reflect the ordering of the source data. Once created, the index values are stable; that is, they do not change when data is reordered."
- ]
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "2684gsWNinq9",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "city_names.index"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "F_qPe2TBjfWd",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "cities.index"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "hp2oWY9Slo_h"
- },
- "cell_type": "markdown",
- "source": [
- "Call `DataFrame.reindex` to manually reorder the rows. For example, the following has the same effect as sorting by city name:"
- ]
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "sN0zUzSAj-U1",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "cities.reindex([2, 0, 1])"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "-GQFz8NZuS06"
- },
- "cell_type": "markdown",
- "source": [
- "Reindexing is a great way to shuffle (randomize) a `DataFrame`. In the example below, we take the index, which is array-like, and pass it to NumPy's `random.permutation` function, which returns a shuffled copy of its values. Calling `reindex` with this shuffled array causes the `DataFrame` rows to be shuffled in the same way.\n",
- "Try running the following cell multiple times!"
- ]
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "mF8GC0k8uYhz",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "cities.reindex(np.random.permutation(cities.index))"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "fSso35fQmGKb"
- },
- "cell_type": "markdown",
- "source": [
- "For more information, see the [Index documentation](http://pandas.pydata.org/pandas-docs/stable/indexing.html#index-objects)."
- ]
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "8UngIdVhz8C0"
- },
- "cell_type": "markdown",
- "source": [
- "## Exercise #2\n",
- "\n",
- "The `reindex` method allows index values that are not in the original `DataFrame`'s index values. Try it and see what happens if you use such values! Why do you think this is allowed?"
- ]
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "PN55GrDX0jzO",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "# Your code here"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "TJffr5_Jwqvd"
- },
- "cell_type": "markdown",
- "source": [
- "### Solution\n",
- "\n",
- "Click below for the solution."
- ]
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "8oSvi2QWwuDH"
- },
- "cell_type": "markdown",
- "source": [
- "If your `reindex` input array includes values not in the original `DataFrame` index values, `reindex` will add new rows for these \"missing\" indices and populate all corresponding columns with `NaN` values:"
- ]
- },
- {
- "metadata": {
- "colab_type": "code",
- "id": "yBdkucKCwy4x",
- "colab": {}
- },
- "cell_type": "code",
- "source": [
- "cities.reindex([0, 4, 5, 2])"
- ],
- "execution_count": 0,
- "outputs": []
- },
- {
- "metadata": {
- "colab_type": "text",
- "id": "2l82PhPbwz7g"
- },
- "cell_type": "markdown",
- "source": [
- "This behavior is desirable because indexes are often strings pulled from the actual data (see the [*pandas* reindex\n",
- "documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html) for an example\n",
- "in which the index values are browser names).\n",
- "\n",
- "In this case, allowing \"missing\" indices makes it easy to reindex using an external list, as you don't have to worry about\n",
- "sanitizing the input."
- ]
- }
- ]
-}
\ No newline at end of file
From 693b53a3e972c663557431f4c4ce4fa25f99042c Mon Sep 17 00:00:00 2001
From: Chanchal Kumar Maji
<31502077+ChanchalKumarMaji@users.noreply.github.com>
Date: Mon, 4 Feb 2019 00:41:09 +0530
Subject: [PATCH 5/5] Created using Colaboratory
---
intro_to_pandas.ipynb | 1870 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 1870 insertions(+)
create mode 100644 intro_to_pandas.ipynb
diff --git a/intro_to_pandas.ipynb b/intro_to_pandas.ipynb
new file mode 100644
index 0000000..472d30f
--- /dev/null
+++ b/intro_to_pandas.ipynb
@@ -0,0 +1,1870 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "intro_to_pandas.ipynb",
+ "version": "0.3.2",
+ "provenance": [],
+ "collapsed_sections": [],
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "accelerator": "GPU"
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ " "
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "JndnmDMp66FL"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "#### Copyright 2017 Google LLC."
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "hMqWDc_m6rUC",
+ "cellView": "both",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+ "# you may not use this file except in compliance with the License.\n",
+ "# You may obtain a copy of the License at\n",
+ "#\n",
+ "# https://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing, software\n",
+ "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+ "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+ "# See the License for the specific language governing permissions and\n",
+ "# limitations under the License."
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "rHLcriKWLRe4"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "# Intro to pandas"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "QvJBqX8_Bctk"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "**Learning Objectives:**\n",
+ " * Gain an introduction to the `DataFrame` and `Series` data structures of the *pandas* library\n",
+ " * Access and manipulate data within a `DataFrame` and `Series`\n",
+ " * Import CSV data into a *pandas* `DataFrame`\n",
+ " * Reindex a `DataFrame` to shuffle data"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "TIFJ83ZTBctl"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "[*pandas*](http://pandas.pydata.org/) is a column-oriented data analysis API. It's a great tool for handling and analyzing input data, and many ML frameworks support *pandas* data structures as inputs.\n",
+ "Although a comprehensive introduction to the *pandas* API would span many pages, the core concepts are fairly straightforward, and we'll present them below. For a more complete reference, the [*pandas* docs site](http://pandas.pydata.org/pandas-docs/stable/index.html) contains extensive documentation and many tutorials."
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "s_JOISVgmn9v"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Basic Concepts\n",
+ "\n",
+ "The following cell imports the *pandas* API and displays the API version:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "aSRYu62xUi3g",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ },
+ "outputId": "b69fa038-4215-44f9-c5dd-2acc82452a93"
+ },
+ "cell_type": "code",
+ "source": [
+ "from __future__ import print_function\n",
+ "\n",
+ "import pandas as pd\n",
+ "pd.__version__"
+ ],
+ "execution_count": 2,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "'0.22.0'"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 2
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "daQreKXIUslr"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "The primary data structures in *pandas* are implemented as two classes:\n",
+ "\n",
+ " * **`DataFrame`**, which you can imagine as a relational data table, with rows and named columns.\n",
+ " * **`Series`**, which is a single column. A `DataFrame` contains one or more `Series` and a name for each `Series`.\n",
+ "\n",
+ "The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in [Spark](https://spark.apache.org/) and [R](https://www.r-project.org/about.html)."
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "fjnAk1xcU0yc"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "One way to create a `Series` is to construct a `Series` object. For example:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "DFZ42Uq7UFDj",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 86
+ },
+ "outputId": "8acbe90f-6151-48ef-9870-ce0a5cc32a9c"
+ },
+ "cell_type": "code",
+ "source": [
+ "pd.Series(['San Francisco', 'San Jose', 'Sacramento'])"
+ ],
+ "execution_count": 3,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "0 San Francisco\n",
+ "1 San Jose\n",
+ "2 Sacramento\n",
+ "dtype: object"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 3
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "U5ouUp1cU6pC"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "`DataFrame` objects can be created by passing a `dict` mapping `string` column names to their respective `Series`. If the `Series` don't match in length, missing values are filled with special [NA/NaN](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) values. Example:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "avgr6GfiUh8t",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 143
+ },
+ "outputId": "53e13300-254c-4d87-c4ca-5e021d787a12"
+ },
+ "cell_type": "code",
+ "source": [
+ "city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])\n",
+ "population = pd.Series([852469, 1015785, 485199])\n",
+ "\n",
+ "pd.DataFrame({ 'City name': city_names, 'Population': population })"
+ ],
+ "execution_count": 4,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " City name Population\n",
+ "0 San Francisco 852469\n",
+ "1 San Jose 1015785\n",
+ "2 Sacramento 485199"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 4
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "oa5wfZT7VHJl"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "But most of the time, you load an entire file into a `DataFrame`. The following example loads a file with California housing data. Run the following cell to load the data and display summary statistics:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "av6RYOraVG1V",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 320
+ },
+ "outputId": "4b8ef629-7bcd-449f-b0d6-c13bce927f2e"
+ },
+ "cell_type": "code",
+ "source": [
+ "california_housing_dataframe = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\", sep=\",\")\n",
+ "california_housing_dataframe.describe()"
+ ],
+ "execution_count": 5,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " longitude latitude housing_median_age total_rooms \\\n",
+ "count 17000.000000 17000.000000 17000.000000 17000.000000 \n",
+ "mean -119.562108 35.625225 28.589353 2643.664412 \n",
+ "std 2.005166 2.137340 12.586937 2179.947071 \n",
+ "min -124.350000 32.540000 1.000000 2.000000 \n",
+ "25% -121.790000 33.930000 18.000000 1462.000000 \n",
+ "50% -118.490000 34.250000 29.000000 2127.000000 \n",
+ "75% -118.000000 37.720000 37.000000 3151.250000 \n",
+ "max -114.310000 41.950000 52.000000 37937.000000 \n",
+ "\n",
+ " total_bedrooms population households median_income \\\n",
+ "count 17000.000000 17000.000000 17000.000000 17000.000000 \n",
+ "mean 539.410824 1429.573941 501.221941 3.883578 \n",
+ "std 421.499452 1147.852959 384.520841 1.908157 \n",
+ "min 1.000000 3.000000 1.000000 0.499900 \n",
+ "25% 297.000000 790.000000 282.000000 2.566375 \n",
+ "50% 434.000000 1167.000000 409.000000 3.544600 \n",
+ "75% 648.250000 1721.000000 605.250000 4.767000 \n",
+ "max 6445.000000 35682.000000 6082.000000 15.000100 \n",
+ "\n",
+ " median_house_value \n",
+ "count 17000.000000 \n",
+ "mean 207300.912353 \n",
+ "std 115983.764387 \n",
+ "min 14999.000000 \n",
+ "25% 119400.000000 \n",
+ "50% 180400.000000 \n",
+ "75% 265000.000000 \n",
+ "max 500001.000000 "
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 5
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "WrkBjfz5kEQu"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "The example above used `DataFrame.describe` to show interesting statistics about a `DataFrame`. Another useful function is `DataFrame.head`, which displays the first few records of a `DataFrame`:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "s3ND3bgOkB5k",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 226
+ },
+ "outputId": "a0227da6-863c-453e-c83e-4e69b4657896"
+ },
+ "cell_type": "code",
+ "source": [
+ "california_housing_dataframe.head()"
+ ],
+ "execution_count": 6,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n",
+ "0 -114.31 34.19 15.0 5612.0 1283.0 \n",
+ "1 -114.47 34.40 19.0 7650.0 1901.0 \n",
+ "2 -114.56 33.69 17.0 720.0 174.0 \n",
+ "3 -114.57 33.64 14.0 1501.0 337.0 \n",
+ "4 -114.57 33.57 20.0 1454.0 326.0 \n",
+ "\n",
+ " population households median_income median_house_value \n",
+ "0 1015.0 472.0 1.4936 66900.0 \n",
+ "1 1129.0 463.0 1.8200 80100.0 \n",
+ "2 333.0 117.0 1.6509 85700.0 \n",
+ "3 515.0 226.0 3.1917 73400.0 \n",
+ "4 624.0 262.0 1.9250 65500.0 "
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 6
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "w9-Es5Y6laGd"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Another powerful feature of *pandas* is graphing. For example, `DataFrame.hist` lets you quickly study the distribution of values in a column:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "nqndFVXVlbPN",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 397
+ },
+ "outputId": "edaf5f93-5bce-44a6-c30f-4435f896ea23"
+ },
+ "cell_type": "code",
+ "source": [
+ "california_housing_dataframe.hist('housing_median_age')"
+ ],
+ "execution_count": 7,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[]],\n",
+ " dtype=object)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 7
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAeoAAAFZCAYAAABXM2zhAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3X1UlHX+//HXMDAH0UEEGTfLarf0\naEmaa5l4U0Iokp7IVRPWdU3q6Iqtlql499WTlajRmmZZmunRU7GNtofcAjJxyyRanT0uuu0p2VOr\neTejKCqgSPP7o9Os/FRguP1Az8dfcTEz1+d6H+3pdQ1zYfF6vV4BAAAjBTT3AgAAwPURagAADEao\nAQAwGKEGAMBghBoAAIMRagAADEaogVo6cuSI7rjjjkbdxz//+U+lpKQ06j4a0h133KEjR47o448/\n1ty5c5t7OUCrZOFz1EDtHDlyREOHDtW//vWv5l6KMe644w7l5ubqpptuau6lAK0WZ9SAn5xOp0aO\nHKn7779f27dv1w8//KA//elPio+PV3x8vNLS0lRaWipJiomJ0d69e33P/enry5cva/78+Ro2bJji\n4uI0bdo0nT9/XgUFBYqLi5MkrV69Ws8++6xSU1MVGxur0aNH6+TJk5KkgwcPaujQoRo6dKheeeUV\njRw5UgUFBdWue/Xq1Vq0aJEmT56sgQMHatasWcrLy9OoUaM0cOBA5eXlSZIuXbqk5557TsOGDVNM\nTIzWrl3re42//e1viouL0/Dhw7V+/Xrf9m3btmnixImSJI/Ho5SUFMXHxysmJkZvvfVWleN/9913\nNXr0aA0cOFDp6ek1zrusrEwzZszwrWfZsmW+71U3hx07dmjkyJGKjY3VpEmTdPr06Rr3BZiIUAN+\n+OGHH1RRUaEPPvhAc+fO1cqVK/XRRx/p008/1bZt2/TXv/5VJSUl2rhxY7Wvs3v3bh05ckTZ2dnK\nzc3V7bffrn/84x9XPS47O1vz5s3Tjh07FBERoa1bt0qSFi5cqIkTJyo3N1ft2rXTt99+W6v179q1\nSy+88II++OADZWdn+9Y9ZcoUrVu3TpK0bt06HTp0SB988IG2b9+unJwc5eXlqbKyUvPnz9eiRYv0\n0UcfKSAgQJWVlVft47XXXtNNN92k7Oxsbdq0SRkZGTp27Jjv+3//+9+VmZmprVu3asuWLTp+/Hi1\na37nnXd04cIFZWdn6/3339e2bdt8//i53hwOHz6s2bNnKyMjQ5988on69eunxYsX12pGgGkINeAH\nr9erxMREST9e9j1+/Lh27dqlxMREhYSEyGq1atSoUfr888+rfZ3w8HAVFRXp448/9p0xDho06KrH\n9e3bVzfeeKMsFot69OihY8eOqby8XAcPHtSIESMkSb/97W9V23ew7r77bkVERKhDhw6KjIzU4MGD\nJUndunXzna3n5eUpOTlZNptNISEhevjhh5Wbm6tvv/1Wly5d0sCBAyVJjzzyyDX3sWDBAi1cuFCS\n1KVLF0VGRurIkSO+748cOVJWq1WdOnVSRERElYhfy6RJk/Tqq6/KYrGoffv26tq1q44cOVLtHD79\n9FPde++96tatmyRp3Lhx2rlz5zX/YQGYLrC5FwC0JFarVW3atJEkBQQE6IcfftDp06fVvn1732Pa\nt2+vU6dOVfs6d911lxYsWKDNmzdrzpw5iomJ0aJFi656nN1ur7LvyspKnT17VhaLRaGhoZKkoKAg\nRURE1Gr9bdu2rfJ6ISEhVY5Fks6dO6elS5fqpZdekvTjpfC77rpLZ8+eVbt27aoc57UUFhb6zqID\nAgLkdrt9ry2pymv8dEzV+fbbb5Wenq7//Oc/CggI0PHjxzVq1Khq53Du3Dnt3btX8fHxVfZ75syZ\nWs8KMAWhBuqpY8eOOnPmjO/rM2fOqGPHjp
KqBlCSzp496/vvn97TPnPmjObNm6c333xT0dHRNe6v\nXbt28nq9KisrU5s2bXT58uUGff/V4XBo0qRJGjJkSJXtRUVFOn/+vO/r6+1z1qxZ+v3vf6+kpCRZ\nLJZrXinwx7PPPqs777xTa9askdVq1bhx4yRVPweHw6Ho6GitWrWqXvsGTMClb6CeHnjgAWVlZams\nrEyXL1+W0+nU/fffL0mKjIzUv//9b0nShx9+qIsXL0qStm7dqjVr1kiSwsLC9Ktf/arW+2vbtq1u\nu+02ffTRR5KkzMxMWSyWBjue2NhYvffee6qsrJTX69Wrr76qTz/9VDfffLOsVqvvh7W2bdt2zf2e\nOnVKPXv2lMVi0fvvv6+ysjLfD9fVxalTp9SjRw9ZrVZ9/vnn+u6771RaWlrtHAYOHKi9e/fq8OHD\nkn782Ntzzz1X5zUAzYlQA/UUHx+vwYMHa9SoURoxYoR+8YtfaMKECZKkqVOnauPGjRoxYoSKiop0\n++23S/oxhj/9xPLw4cN16NAhPfbYY7Xe56JFi7R27Vo99NBDKi0tVadOnRos1snJyercubMeeugh\nxcfHq6ioSL/+9a8VFBSkJUuWaN68eRo+fLgsFovv0vmVpk+frtTUVI0cOVKlpaV69NFHtXDhQv33\nv/+t03r+8Ic/aNmyZRoxYoS+/PJLTZs2TatXr9a+ffuuOweHw6ElS5YoNTVVw4cP17PPPquEhIT6\njgZoFnyOGmihvF6vL8733XefNm7cqO7duzfzqpoec0Brxxk10AL98Y9/9H2cKj8/X16vV7feemvz\nLqoZMAf8HHBGDbRARUVFmjt3rs6ePaugoCDNmjVLN910k1JTU6/5+Ntuu833nrhpioqK6rzua83h\np58PAFoLQg0AgMG49A0AgMEINQAABjPyhidu9zm/Ht+hQ4iKi+v+Oc2fO+ZXd8yufphf3TG7+jFt\nfpGR9ut+r1WcUQcGWpt7CS0a86s7Zlc/zK/umF39tKT5tYpQAwDQWhFqAAAMRqgBADBYjT9MVlZW\nprS0NJ06dUoXL17U1KlT1b17d82ePVuVlZWKjIzUihUrZLPZlJWVpU2bNikgIEBjx47VmDFjVFFR\nobS0NB09elRWq1VLly5Vly5dmuLYAABo8Wo8o87Ly1PPnj21ZcsWrVy5Uunp6Vq1apWSk5P19ttv\n65ZbbpHT6VRpaanWrFmjjRs3avPmzdq0aZPOnDmj7du3KzQ0VO+8846mTJmijIyMpjguAABahRpD\nnZCQoCeeeEKSdOzYMXXq1EkFBQWKjY2VJA0ZMkT5+fnav3+/oqKiZLfbFRwcrD59+sjlcik/P19x\ncXGSpOjoaLlcrkY8HAAAWpdaf4563LhxOn78uNauXavHHntMNptNkhQRESG32y2Px6Pw8HDf48PD\nw6/aHhAQIIvFokuXLvmeDwAArq/WoX733Xf11VdfadasWbry9uDXu1W4v9uv1KFDiN+fcavuw+Ko\nGfOrO2ZXP8yv7phd/bSU+dUY6gMHDigiIkI33HCDevToocrKSrVt21bl5eUKDg7WiRMn5HA45HA4\n5PF4fM87efKkevfuLYfDIbfbre7du6uiokJer7fGs2l/7xYTGWn3+25m+B/mV3fMrn6YX90xu/ox\nbX71ujPZ3r17tWHDBkmSx+NRaWmpoqOjlZOTI0nKzc3VoEGD1KtXLxUWFqqkpEQXLlyQy+VS3759\nNWDAAGVnZ0v68QfT+vXr1xDHBADAz0KNZ9Tjxo3T/PnzlZycrPLycv3f//2fevbsqTlz5igzM1Od\nO3dWYmKigoKCNHPmTKWkpMhisSg1NVV2u10JCQnas2ePkpKSZLPZlJ6e3hTHBQBAq2Dk76P293KE\naZcwWhrmV3fMrn6YX90xu/oxbX7VXfo28rdnAcC1TErf2dxLqNGGtJjmXgJaGW4hCgCAwQg1AAAG\nI9QAAB
iMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCA\nwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMA\nYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QA\nABiMUAMAYDBCDQCAwQg1AAAGC6zNg5YvX659+/bp8uXLmjx5snbu3KmDBw8qLCxMkpSSkqIHHnhA\nWVlZ2rRpkwICAjR27FiNGTNGFRUVSktL09GjR2W1WrV06VJ16dKlUQ8KAIDWosZQf/HFF/rmm2+U\nmZmp4uJiPfLII7rvvvv09NNPa8iQIb7HlZaWas2aNXI6nQoKCtLo0aMVFxenvLw8hYaGKiMjQ7t3\n71ZGRoZWrlzZqAcFAEBrUeOl73vuuUcvv/yyJCk0NFRlZWWqrKy86nH79+9XVFSU7Ha7goOD1adP\nH7lcLuXn5ysuLk6SFB0dLZfL1cCHAABA61VjqK1Wq0JCQiRJTqdTgwcPltVq1ZYtWzRhwgQ99dRT\nOn36tDwej8LDw33PCw8Pl9vtrrI9ICBAFotFly5daqTDAQCgdanVe9SStGPHDjmdTm3YsEEHDhxQ\nWFiYevTooTfeeEOvvPKK7r777iqP93q913yd622/UocOIQoMtNZ2aZKkyEi7X49HVcyv7phd/bS2\n+TXl8bS22TW1ljK/WoX6s88+09q1a7V+/XrZ7Xb179/f972YmBgtXrxYw4YNk8fj8W0/efKkevfu\nLYfDIbfbre7du6uiokJer1c2m63a/RUXl/p1EJGRdrnd5/x6Dv6H+dUds6uf1ji/pjqe1ji7pmTa\n/Kr7R0ONl77PnTun5cuX6/XXX/f9lPeTTz6pw4cPS5IKCgrUtWtX9erVS4WFhSopKdGFCxfkcrnU\nt29fDRgwQNnZ2ZKkvLw89evXryGOCQCAn4Uaz6g//PBDFRcXa8aMGb5to0aN0owZM9SmTRuFhIRo\n6dKlCg4O1syZM5WSkiKLxaLU1FTZ7XYlJCRoz549SkpKks1mU3p6eqMeEAAArYnFW5s3jZuYv5cj\nTLuE0dIwv7pjdvXj7/wmpe9sxNU0jA1pMU2yH/7s1Y9p86vXpW8AANB8CDUAAAYj1AAAGIxQAwBg\nMEINAIDBCDUAAAYj1AAAGIxQAwBgMEINAIDBCDUAAAYj1AAAGIxQAwBgMEINAIDBCDUAAAYj1AAA\nGIxQAwBgMEINAIDBCDUAAAYj1AAAGIxQAwBgMEINAIDBCDUAAAYLbO4FAA1lUvrO5l5CtTakxTT3\nEgC0QJxRAwBgMEINAIDBCDUAAAYj1AAAGIxQAwBgMEINAIDBCDUAAAYj1AAAGIxQAwBgMEINAIDB\nCDUAAAYj1AAAGIxQAwBgMEINAIDBCDUAAAbj91EDTcT035ct8TuzARNxRg0AgMFqdUa9fPly7du3\nT5cvX9bkyZMVFRWl2bNnq7KyUpGRkVqxYoVsNpuysrK0adMmBQQEaOzYsRozZowqKiqUlpamo0eP\nymq1aunSperSpUtjHxcAAK1CjaH+4osv9M033ygzM1PFxcV65JFH1L9/fyUnJ2v48OF66aWX5HQ6\nlZiYqDVr1sjpdCooKEijR49WXFyc8vLyFBoaqoyMDO3evVsZGRlauXJlUxwbAAAtXo2Xvu+55x69\n/PLLkqTQ0FCVlZWpoKBAsbGxkqQhQ4YoPz9f+/fvV1RUlOx2u4KDg9WnTx+5XC7l5+crLi5OkhQd\nHS2Xy9WIhwMAQOtS4xm11WpVSEiIJMnpdGrw4MHavXu3bDabJCkiIkJut1sej0fh4eG+54WHh1+1\nPSAgQBaLRZcuXfI9/1o6dAhRYKDVrwOJjLT79XhUxfwgNc+fg9b2Z68p
j6e1za6ptZT51fqnvnfs\n2CGn06kNGzZo6NChvu1er/eaj/d3+5WKi0truyxJPw7b7T7n13PwP8wPP2nqPwet8c9eUx1Pa5xd\nUzJtftX9o6FWP/X92Wefae3atVq3bp3sdrtCQkJUXl4uSTpx4oQcDoccDoc8Ho/vOSdPnvRtd7vd\nkqSKigp5vd5qz6YBAMD/1Bjqc+fOafny5Xr99dcVFhYm6cf3mnNyciRJubm5GjRokHr16qXCwkKV\nlJTowoULcrlc6tu3rwYMGKDs7GxJUl5envr169eIhwMAQOtS46XvDz/8UMXFxZoxY4ZvW3p6uhYs\nWKDMzEx17txZiYmJCgoK0syZM5WSkiKLxaLU1FTZ7XYlJCRoz549SkpKks1mU3p6eqMeEAAArUmN\noX700Uf16KOPXrX9rbfeumpbfHy84uPjq2z76bPTAADAf9xCFIBPS7jNKfBzwy1EAQAwGKEGAMBg\nhBoAAIMRagAADEaoAQAwGKEGAMBghBoAAIMRagAADEaoAQAwGHcmQ61wxyoAaB6cUQMAYDBCDQCA\nwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMA\nYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABgssLkXAADAlSal72zuJdRoQ1pM\nk+2LM2oAAAxGqAEAMBihBgDAYIQaAACDEWoAAAxGqAEAMBihBgDAYLX6HPXXX3+tqVOnauLEiRo/\nfrzS0tJ08OBBhYWFSZJSUlL0wAMPKCsrS5s2bVJAQIDGjh2rMWPGqKKiQmlpaTp69KisVquWLl2q\nLl26NOpBAUBz4TPAaGg1hrq0tFRLlixR//79q2x/+umnNWTIkCqPW7NmjZxOp4KCgjR69GjFxcUp\nLy9PoaGhysjI0O7du5WRkaGVK1c2/JEAANAK1Xjp22azad26dXI4HNU+bv/+/YqKipLdbldwcLD6\n9Okjl8ul/Px8xcXFSZKio6PlcrkaZuUAAPwM1BjqwMBABQcHX7V9y5YtmjBhgp566imdPn1aHo9H\n4eHhvu+Hh4fL7XZX2R4QECCLxaJLly414CEAANB61ele3w8//LDCwsLUo0cPvfHGG3rllVd09913\nV3mM1+u95nOvt/1KHTqEKDDQ6teaIiPtfj0eVTE/4OeDv+/115QzrFOor3y/OiYmRosXL9awYcPk\n8Xh820+ePKnevXvL4XDI7Xare/fuqqiokNfrlc1mq/b1i4tL/VpPZKRdbvc5/w4CPswP+Hnh73v9\nNfQMqwt/nT6e9eSTT+rw4cOSpIKCAnXt2lW9evVSYWGhSkpKdOHCBblcLvXt21cDBgxQdna2JCkv\nL0/9+vWryy4BAPhZqvGM+sCBA1q2bJm+//57BQYGKicnR+PHj9eMGTPUpk0bhYSEaOnSpQoODtbM\nmTOVkpIii8Wi1NRU2e12JSQkaM+ePUpKSpLNZlN6enpTHBcAAK1CjaHu2bOnNm/efNX2YcOGXbUt\nPj5e8fHxVbb99NlpAADgP+5MBgCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMA\nYDBCDQCAwQg1AAAGI9QAABiMUAMAYLA6/T5qAEDLNSl9Z3MvAX7gjBoAAIMRagAADEaoAQAwGKEG\nAMBghBoAAIMRagAADEaoAQAwGKEGAMBghBoAAIMRagAADEaoAQAwGKEGAMBghBoAAIMRagAADEao\nAQAwGKEGAMBghBoAAIMRagAADEaoAQAwGKEGAMBghBoAAIMRagAADEaoAQAwGKEGAMBghBoAAIMR\nagAADFarUH/99dd68MEHtWXLFknSsWPH9Lvf/U7JycmaPn26Ll26JEnKysrSb37zG40ZM0bvvfee\nJKmiokIzZ85UUlKSxo8fr8OHDzfS
oQAA0PrUGOrS0lItWbJE/fv3921btWqVkpOT9fbbb+uWW26R\n0+lUaWmp1qxZo40bN2rz5s3atGmTzpw5o+3btys0NFTvvPOOpkyZooyMjEY9IAAAWpMaQ22z2bRu\n3To5HA7ftoKCAsXGxkqShgwZovz8fO3fv19RUVGy2+0KDg5Wnz595HK5lJ+fr7i4OElSdHS0XC5X\nIx0KAACtT42hDgwMVHBwcJVtZWVlstlskqSIiAi53W55PB6Fh4f7HhMeHn7V9oCAAFksFt+lcgAA\nUL3A+r6A1+ttkO1X6tAhRIGBVr/WERlp9+vxqIr5AUDtNeX/M+sU6pCQEJWXlys4OFgnTpyQw+GQ\nw+GQx+PxPebkyZPq3bu3HA6H3G63unfvroqKCnm9Xt/Z+PUUF5f6tZ7ISLvc7nN1ORSI+QGAvxr6\n/5nVhb9OH8+Kjo5WTk6OJCk3N1eDBg1Sr169VFhYqJKSEl24cEEul0t9+/bVgAEDlJ2dLUnKy8tT\nv3796rJLAAB+lmo8oz5w4ICWLVum77//XoGBgcrJydGLL76otLQ0ZWZmqnPnzkpMTFRQUJBmzpyp\nlJQUWSwWpaamym63KyEhQXv27FFSUpJsNpvS09Ob4rgAAGgVLN7avGncxPy9pMCl2/qpzfwmpe9s\notUAgPk2pMU06Os1+KVvAADQNOr9U99oGJyxAgCuhTNqAAAMRqgBADAYoQYAwGCEGgAAgxFqAAAM\nRqgBADAYoQYAwGCEGgAAgxFqAAAMRqgBADAYoQYAwGCEGgAAgxFqAAAMRqgBADAYoQYAwGCEGgAA\ngxFqAAAMRqgBADAYoQYAwGCEGgAAgxFqAAAMRqgBADAYoQYAwGCEGgAAgxFqAAAMRqgBADAYoQYA\nwGCEGgAAgxFqAAAMRqgBADAYoQYAwGCEGgAAgxFqAAAMFtjcC2gKk9J3NvcSAACoE86oAQAwGKEG\nAMBghBoAAIMRagAADFanHyYrKCjQ9OnT1bVrV0lSt27d9Pjjj2v27NmqrKxUZGSkVqxYIZvNpqys\nLG3atEkBAQEaO3asxowZ06AHAABAa1bnn/q+9957tWrVKt/Xc+fOVXJysoYPH66XXnpJTqdTiYmJ\nWrNmjZxOp4KCgjR69GjFxcUpLCysQRYPAEBr12CXvgsKChQbGytJGjJkiPLz87V//35FRUXJbrcr\nODhYffr0kcvlaqhdAgDQ6tX5jPrQoUOaMmWKzp49q2nTpqmsrEw2m02SFBERIbfbLY/Ho/DwcN9z\nwsPD5Xa7a3ztDh1CFBho9Ws9kZF2/w4AAIA6asrm1CnUt956q6ZNm6bhw4fr8OHDmjBhgiorK33f\n93q913ze9bb//4qLS/1aT2SkXW73Ob+eAwBAXTV0c6oLf50ufXfq1EkJCQmyWCy6+eab1bFjR509\ne1bl5eWSpBMnTsjhcMjhcMjj8fied/LkSTkcjrrsEgCAn6U6hTorK0tvvvmmJMntduvUqVMaNWqU\ncnJyJEm5ubkaNGiQevXqpcLCQpWUlOjChQtyuVzq27dvw60eAIBWrk6XvmNiYvTMM8/ok08+UUVF\nhRYvXqwePXpozpw5yszMVOfOnZWYmKigoCDNnDlTKSkpslgsSk1Nld3Oe8kAANSWxVvbN46bkL/X\n/mt6j5pfygEAaEgb0mIa9PUa/D1qAADQNAg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiM\nUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAG\nI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCA\nwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABgssCl2\n
8sILL2j//v2yWCyaN2+e7rrrrqbYLQAALV6jh/rLL7/Ud999p8zMTBUVFWnevHnKzMxs7N0CANAq\nNPql7/z8fD344IOSpNtuu01nz57V+fPnG3u3AAC0Co0eao/How4dOvi+Dg8Pl9vtbuzdAgDQKjTJ\ne9RX8nq9NT4mMtLu9+tW95wPMh72+/UAADBBo59ROxwOeTwe39cnT55UZGRkY+8WAIBWodFDPWDA\nAOXk5EiSDh48KIfDoXbt2jX2bgEAaBUa/dJ3nz59dOedd2rcuHGyWCxatGhRY+8SAIBWw+KtzZvG\nAACgWXBnMgAADEaoAQAwWJN/PKuhcXtS/3399deaOnWqJk6cqPHjx+vYsWOaPXu2KisrFRkZqRUr\nVshmszX3Mo20fPly7du3T5cvX9bkyZMVFRXF7GqhrKxMaWlpOnXqlC5evKipU6eqe/fuzM5P5eXl\nGjFihKZOnar+/fszv1oqKCjQ9OnT1bVrV0lSt27d9Pjjj7eY+bXoM+orb0/6/PPP6/nnn2/uJRmv\ntLRUS5YsUf/+/X3bVq1apeTkZL399tu65ZZb5HQ6m3GF5vriiy/0zTffKDMzU+vXr9cLL7zA7Gop\nLy9PPXv21JYtW7Ry5Uqlp6czuzp47bXX1L59e0n8vfXXvffeq82bN2vz5s1auHBhi5pfiw41tyf1\nn81m07p16+RwOHzbCgoKFBsbK0kaMmSI8vPzm2t5Rrvnnnv08ssvS5JCQ0NVVlbG7GopISFBTzzx\nhCTp2LFj6tSpE7PzU1FRkQ4dOqQHHnhAEn9v66slza9Fh5rbk/ovMDBQwcHBVbaVlZX5LvlEREQw\nw+uwWq0KCQmRJDmdTg0ePJjZ+WncuHF65plnNG/ePGbnp2XLliktLc33NfPzz6FDhzRlyhQlJSXp\n888/b1Hza/HvUV+JT5rVHzOs2Y4dO+R0OrVhwwYNHTrUt53Z1ezdd9/VV199pVmzZlWZF7Or3l/+\n8hf17t1bXbp0ueb3mV/1br31Vk2bNk3Dhw/X4cOHNWHCBFVWVvq+b/r8WnSouT1pwwgJCVF5ebmC\ng4N14sSJKpfFUdVnn32mtWvXav369bLb7cyulg4cOKCIiAjdcMMN6tGjhyorK9W2bVtmV0u7du3S\n4cOHtWvXLh0/flw2m40/e37o1KmTEhISJEk333yzOnbsqMLCwhYzvxZ96ZvbkzaM6Oho3xxzc3M1\naNCgZl6Rmc6dO6fly5fr9ddfV1hYmCRmV1t79+7Vhg0bJP34llVpaSmz88PKlSu1detW/fnPf9aY\nMWM0depU5ueHrKwsvfnmm5Ikt9utU6dOadSoUS1mfi3+zmQvvvii9u7d67s9affu3Zt7SUY7cOCA\nli1bpu+//16BgYHq1KmTXnytKYqYAAAArElEQVTxRaWlpenixYvq3Lmzli5dqqCgoOZeqnEyMzO1\nevVq/fKXv/RtS09P14IFC5hdDcrLyzV//nwdO3ZM5eXlmjZtmnr27Kk5c+YwOz+tXr1aN954owYO\nHMj8aun8+fN65plnVFJSooqKCk2bNk09evRoMfNr8aEGAKA1a9GXvgEAaO0INQAABiPUAAAYjFAD\nAGAwQg0AgMEINQAABiPUAAAYjFADAGCw/wdkB5RjykY3PgAAAABJRU5ErkJggg==\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": []
+ }
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "XtYZ7114n3b-"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Accessing Data\n",
+ "\n",
+ "You can access `DataFrame` data using familiar Python dict/list operations:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "_TFm7-looBFF",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 104
+ },
+ "outputId": "57aede33-3c05-4f1c-8509-78060101057d"
+ },
+ "cell_type": "code",
+ "source": [
+ "cities = pd.DataFrame({ 'City name': city_names, 'Population': population })\n",
+ "print(type(cities['City name']))\n",
+ "cities['City name']"
+ ],
+ "execution_count": 8,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "<class 'pandas.core.series.Series'>\n"
+ ],
+ "name": "stdout"
+ },
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "0 San Francisco\n",
+ "1 San Jose\n",
+ "2 Sacramento\n",
+ "Name: City name, dtype: object"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 8
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "V5L6xacLoxyv",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 52
+ },
+ "outputId": "c6c3f30a-067e-4fef-ca63-51aae3939f62"
+ },
+ "cell_type": "code",
+ "source": [
+ "print(type(cities['City name'][1]))\n",
+ "cities['City name'][1]"
+ ],
+ "execution_count": 9,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "<class 'str'>\n"
+ ],
+ "name": "stdout"
+ },
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "'San Jose'"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 9
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "gcYX1tBPugZl",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 129
+ },
+ "outputId": "41ecc9c0-2935-4afe-a530-97b64431fc29"
+ },
+ "cell_type": "code",
+ "source": [
+ "print(type(cities[0:2]))\n",
+ "cities[0:2]"
+ ],
+ "execution_count": 10,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "\n"
+ "<class 'pandas.core.frame.DataFrame'>\n"
+ "name": "stdout"
+ },
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " City name Population\n",
+ "0 San Francisco 852469\n",
+ "1 San Jose 1015785"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 10
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "65g1ZdGVjXsQ"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "In addition, *pandas* provides an extremely rich API for advanced [indexing and selection](http://pandas.pydata.org/pandas-docs/stable/indexing.html) that is too extensive to be covered here."
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "RM1iaD-ka3Y1"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Manipulating Data\n",
+ "\n",
+ "You may apply Python's basic arithmetic operations to `Series`. For example:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "XWmyCFJ5bOv-",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 86
+ },
+ "outputId": "daaa5c1b-3eda-4b83-df5c-0c25a24f3b89"
+ },
+ "cell_type": "code",
+ "source": [
+ "population / 1000."
+ ],
+ "execution_count": 11,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "0 852.469\n",
+ "1 1015.785\n",
+ "2 485.199\n",
+ "dtype: float64"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 11
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "TQzIVnbnmWGM"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "[NumPy](http://www.numpy.org/) is a popular toolkit for scientific computing. *pandas* `Series` can be used as arguments to most NumPy functions:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "ko6pLK6JmkYP",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 86
+ },
+ "outputId": "f4e256ae-81a6-4f43-eed7-80a0cb371482"
+ },
+ "cell_type": "code",
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "np.log(population)"
+ ],
+ "execution_count": 12,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "0 13.655892\n",
+ "1 13.831172\n",
+ "2 13.092314\n",
+ "dtype: float64"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 12
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "xmxFuQmurr6d"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "For more complex single-column transformations, you can use `Series.apply`. Like the Python [map function](https://docs.python.org/2/library/functions.html#map),\n",
+ "`Series.apply` accepts a function as its argument, often a [lambda function](https://docs.python.org/2/tutorial/controlflow.html#lambda-expressions), and applies it to each value.\n",
+ "\n",
+ "The example below creates a new `Series` that indicates whether `population` is over one million:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "Fc1DvPAbstjI",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 86
+ },
+ "outputId": "a53dfe27-2832-4927-acb4-c097eefc16a5"
+ },
+ "cell_type": "code",
+ "source": [
+ "population.apply(lambda val: val > 1000000)"
+ ],
+ "execution_count": 13,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "0 False\n",
+ "1 True\n",
+ "2 False\n",
+ "dtype: bool"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 13
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "ZeYYLoV9b9fB"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "Modifying a `DataFrame` is also straightforward. For example, the following code adds two `Series` to an existing `DataFrame`:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "0gCEX99Hb8LR",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 143
+ },
+ "outputId": "b093cd85-f4e2-4c35-a936-8cce335552c3"
+ },
+ "cell_type": "code",
+ "source": [
+ "cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])\n",
+ "cities['Population density'] = cities['Population'] / cities['Area square miles']\n",
+ "cities"
+ ],
+ "execution_count": 14,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " City name Population Area square miles Population density\n",
+ "0 San Francisco 852469 46.87 18187.945381\n",
+ "1 San Jose 1015785 176.53 5754.177760\n",
+ "2 Sacramento 485199 97.92 4955.055147"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 14
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "6qh63m-ayb-c"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Exercise #1\n",
+ "\n",
+ "Modify the `cities` table by adding a new boolean column that is True if and only if *both* of the following are True:\n",
+ "\n",
+ " * The city is named after a saint.\n",
+ " * The city has an area greater than 50 square miles.\n",
+ "\n",
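+ "**Note:** Boolean `Series` are combined using the bitwise operators rather than the traditional boolean operators. For example, when performing *logical and*, use `&` instead of `and`. Because `&` binds more tightly than comparison operators such as `>`, wrap each comparison in parentheses.\n",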
+ "\n",
+ "**Hint:** \"San\" in Spanish means \"saint.\""
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "zCOn8ftSyddH",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 143
+ },
+ "outputId": "59689928-0c91-43d2-963f-6e5d7c16a527"
+ },
+ "cell_type": "code",
+ "source": [
+ "# Your code here\n",
+ "cities['Is wide and has saint name'] = (cities['Area square miles'] > 50) & cities['City name'].apply(lambda name: name.startswith('San'))\n",
+ "cities"
+ ],
+ "execution_count": 15,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " City name Population Area square miles Population density \\\n",
+ "0 San Francisco 852469 46.87 18187.945381 \n",
+ "1 San Jose 1015785 176.53 5754.177760 \n",
+ "2 Sacramento 485199 97.92 4955.055147 \n",
+ "\n",
+ " Is wide and has saint name \n",
+ "0 False \n",
+ "1 True \n",
+ "2 False "
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 15
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "YHIWvc9Ms-Ll"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "### Solution\n",
+ "\n",
+ "Click below for a solution."
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "T5OlrqtdtCIb",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 143
+ },
+ "outputId": "68302cf2-5454-4e78-c2a4-de3617cfd167"
+ },
+ "cell_type": "code",
+ "source": [
+ "cities['Is wide and has saint name'] = (cities['Area square miles'] > 50) & cities['City name'].apply(lambda name: name.startswith('San'))\n",
+ "cities"
+ ],
+ "execution_count": 16,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " City name Population Area square miles Population density \\\n",
+ "0 San Francisco 852469 46.87 18187.945381 \n",
+ "1 San Jose 1015785 176.53 5754.177760 \n",
+ "2 Sacramento 485199 97.92 4955.055147 \n",
+ "\n",
+ " Is wide and has saint name \n",
+ "0 False \n",
+ "1 True \n",
+ "2 False "
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 16
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "f-xAOJeMiXFB"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Indexes\n",
+ "Both `Series` and `DataFrame` objects also define an `index` property that assigns an identifier value to each `Series` item or `DataFrame` row. \n",
+ "\n",
+ "By default, at construction, *pandas* assigns index values that reflect the ordering of the source data. Once created, the index values are stable; that is, they do not change when data is reordered."
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "2684gsWNinq9",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ },
+ "outputId": "b11186b3-1c6a-4aa4-ebd7-00818a7394e7"
+ },
+ "cell_type": "code",
+ "source": [
+ "city_names.index"
+ ],
+ "execution_count": 17,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "RangeIndex(start=0, stop=3, step=1)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 17
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "F_qPe2TBjfWd",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ },
+ "outputId": "be2d6e8d-d7ed-4259-e79b-7011b121c3ea"
+ },
+ "cell_type": "code",
+ "source": [
+ "cities.index"
+ ],
+ "execution_count": 18,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "RangeIndex(start=0, stop=3, step=1)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 18
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "hp2oWY9Slo_h"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Call `DataFrame.reindex` to manually reorder the rows. For example, the following has the same effect as sorting by city name:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "sN0zUzSAj-U1",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 143
+ },
+ "outputId": "a2f5d075-3435-4209-9e7b-c6bd3a1317b7"
+ },
+ "cell_type": "code",
+ "source": [
+ "cities.reindex([2, 0, 1])"
+ ],
+ "execution_count": 19,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " City name Population Area square miles Population density \\\n",
+ "2 Sacramento 485199 97.92 4955.055147 \n",
+ "0 San Francisco 852469 46.87 18187.945381 \n",
+ "1 San Jose 1015785 176.53 5754.177760 \n",
+ "\n",
+ " Is wide and has saint name \n",
+ "2 False \n",
+ "0 False \n",
+ "1 True "
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 19
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "-GQFz8NZuS06"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Reindexing is a great way to shuffle (randomize) a `DataFrame`. In the example below, we take the index, which is array-like, and pass it to NumPy's `random.permutation` function, which returns a shuffled copy of its values (the index itself is left unchanged). Calling `reindex` with this shuffled array returns a new `DataFrame` whose rows are shuffled in the same way.\n",
+ "Try running the following cell multiple times!"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "mF8GC0k8uYhz",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 143
+ },
+ "outputId": "d073772b-41d3-4260-d9d4-12520da345ec"
+ },
+ "cell_type": "code",
+ "source": [
+ "cities.reindex(np.random.permutation(cities.index))"
+ ],
+ "execution_count": 20,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " City name Population Area square miles Population density \\\n",
+ "0 San Francisco 852469 46.87 18187.945381 \n",
+ "1 San Jose 1015785 176.53 5754.177760 \n",
+ "2 Sacramento 485199 97.92 4955.055147 \n",
+ "\n",
+ " Is wide and has saint name \n",
+ "0 False \n",
+ "1 True \n",
+ "2 False "
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 20
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "fSso35fQmGKb"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "For more information, see the [Index documentation](http://pandas.pydata.org/pandas-docs/stable/indexing.html#index-objects)."
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "8UngIdVhz8C0"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Exercise #2\n",
+ "\n",
+ "The `reindex` method allows index values that are not in the original `DataFrame`'s index values. Try it and see what happens if you use such values! Why do you think this is allowed?"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "PN55GrDX0jzO",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 175
+ },
+ "outputId": "6d125adc-a36f-4717-b964-8b204a63b250"
+ },
+ "cell_type": "code",
+ "source": [
+ "# Your code here\n",
+ "cities.reindex([0, 4, 5, 2])"
+ ],
+ "execution_count": 21,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " City name Population Area square miles Population density \\\n",
+ "0 San Francisco 852469.0 46.87 18187.945381 \n",
+ "4 NaN NaN NaN NaN \n",
+ "5 NaN NaN NaN NaN \n",
+ "2 Sacramento 485199.0 97.92 4955.055147 \n",
+ "\n",
+ " Is wide and has saint name \n",
+ "0 False \n",
+ "4 NaN \n",
+ "5 NaN \n",
+ "2 False "
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 21
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "TJffr5_Jwqvd"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "### Solution\n",
+ "\n",
+ "Click below for the solution."
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "8oSvi2QWwuDH"
+ },
+ "cell_type": "markdown",
+ "source": [
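+ "If your `reindex` input array includes values not in the original `DataFrame` index values, `reindex` will add new rows for these \"missing\" indices and populate all corresponding columns with `NaN` values. Because `NaN` is a floating-point value, an integer column such as `Population` is converted to floating point in the result:"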
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "yBdkucKCwy4x",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 175
+ },
+ "outputId": "6ef4d8ea-5e17-4387-c0f5-3f5ff8f28cb6"
+ },
+ "cell_type": "code",
+ "source": [
+ "cities.reindex([0, 4, 5, 2])"
+ ],
+ "execution_count": 22,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " City name Population Area square miles Population density \\\n",
+ "0 San Francisco 852469.0 46.87 18187.945381 \n",
+ "4 NaN NaN NaN NaN \n",
+ "5 NaN NaN NaN NaN \n",
+ "2 Sacramento 485199.0 97.92 4955.055147 \n",
+ "\n",
+ " Is wide and has saint name \n",
+ "0 False \n",
+ "4 NaN \n",
+ "5 NaN \n",
+ "2 False "
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 22
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "2l82PhPbwz7g"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "This behavior is desirable because indexes are often strings pulled from the actual data (see the [*pandas* reindex\n",
+ "documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html) for an example\n",
+ "in which the index values are browser names).\n",
+ "\n",
+ "In this case, allowing \"missing\" indices makes it easy to reindex using an external list, as you don't have to worry about\n",
+ "sanitizing the input."
+ ]
+ }
+ ]
+}
\ No newline at end of file
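
The `reindex` behaviour exercised in `intro_to_pandas.ipynb` can also be checked outside the notebook. The following is a minimal sketch, assuming a recent pandas and NumPy under Python 3 (the notebook itself declares a Python 2 kernel). It rebuilds the three-city table from the values shown in the cell outputs, then checks that shuffling through `reindex` leaves the original frame untouched and that unknown labels produce `NaN` rows. The helper names `build_cities` and `shuffle_rows` and the `seed` parameter are hypothetical and do not appear in the notebook.

```python
import numpy as np
import pandas as pd


def build_cities():
    """Rebuild the notebook's three-city table from the values in its outputs."""
    return pd.DataFrame({
        'City name': ['San Francisco', 'San Jose', 'Sacramento'],
        'Population': [852469, 1015785, 485199],
        'Area square miles': [46.87, 176.53, 97.92],
    })


def shuffle_rows(df, seed=None):
    """Return a row-shuffled copy of df (hypothetical helper, not in the notebook)."""
    rng = np.random.RandomState(seed)
    # permutation() returns a new shuffled array; df.index itself is untouched.
    return df.reindex(rng.permutation(df.index))


cities = build_cities()

shuffled = shuffle_rows(cities, seed=0)
# The original frame keeps its 0, 1, 2 ordering after the shuffle.
original_order = list(cities.index)
# Each row keeps its original label, so sorting by label undoes the shuffle.
restored = shuffled.sort_index()

# Labels 4 and 5 do not exist, so reindex adds all-NaN rows for them.
padded = cities.reindex([0, 4, 5, 2])
missing_rows = int(padded['City name'].isna().sum())
# NaN cannot be stored in an integer column, so Population becomes float.
population_dtype = str(padded['Population'].dtype)

print(original_order, missing_rows, population_dtype)
```

Passing a fixed `seed` makes the shuffle repeatable, which the notebook cell deliberately does not do so that each run gives a different order.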