From b493dad2222dfbcc3c5d2c096581e25068eb54c9 Mon Sep 17 00:00:00 2001 From: Chanchal Kumar Maji <31502077+ChanchalKumarMaji@users.noreply.github.com> Date: Sun, 3 Feb 2019 22:39:37 +0530 Subject: [PATCH 1/5] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index a2a09d1..3a2ea3c 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -# Assignment 5 +# Assignment 5 This is the final Assignment of MLCC Study Jam, DSC Kolkata. In this Assignment, you're asked to solve the official MLCC Notebooks. These notebooks are a property of Google Inc. To get accepted for final evaluation, complete all the notebooks and commit them to the branches having your github ID. From 0fa63fdd118d772247aef08af5c07fdc7d3427c2 Mon Sep 17 00:00:00 2001 From: Chanchal Kumar Maji <31502077+ChanchalKumarMaji@users.noreply.github.com> Date: Sun, 3 Feb 2019 22:51:15 +0530 Subject: [PATCH 2/5] Created using Colaboratory --- intro_to_pandas.ipynb | 660 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 660 insertions(+) create mode 100644 intro_to_pandas.ipynb diff --git a/intro_to_pandas.ipynb b/intro_to_pandas.ipynb new file mode 100644 index 0000000..942ea63 --- /dev/null +++ b/intro_to_pandas.ipynb @@ -0,0 +1,660 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "intro_to_pandas.ipynb", + "version": "0.3.2", + "provenance": [], + "collapsed_sections": [ + "JndnmDMp66FL", + "YHIWvc9Ms-Ll", + "TJffr5_Jwqvd" + ], + "include_colab_link": true + }, + "kernelspec": { + "name": "python2", + "display_name": "Python 2" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "JndnmDMp66FL" + }, + "cell_type": "markdown", + "source": [ + "#### Copyright 2017 Google LLC." 
+ ] + }, + { + "metadata": { + "colab_type": "code", + "id": "hMqWDc_m6rUC", + "cellView": "both", + "colab": {} + }, + "cell_type": "code", + "source": [ + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "rHLcriKWLRe4" + }, + "cell_type": "markdown", + "source": [ + "# Intro to pandas" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "QvJBqX8_Bctk" + }, + "cell_type": "markdown", + "source": [ + "**Learning Objectives:**\n", + " * Gain an introduction to the `DataFrame` and `Series` data structures of the *pandas* library\n", + " * Access and manipulate data within a `DataFrame` and `Series`\n", + " * Import CSV data into a *pandas* `DataFrame`\n", + " * Reindex a `DataFrame` to shuffle data" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "TIFJ83ZTBctl" + }, + "cell_type": "markdown", + "source": [ + "[*pandas*](http://pandas.pydata.org/) is a column-oriented data analysis API. It's a great tool for handling and analyzing input data, and many ML frameworks support *pandas* data structures as inputs.\n", + "Although a comprehensive introduction to the *pandas* API would span many pages, the core concepts are fairly straightforward, and we'll present them below. 
For a more complete reference, the [*pandas* docs site](http://pandas.pydata.org/pandas-docs/stable/index.html) contains extensive documentation and many tutorials." + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "s_JOISVgmn9v" + }, + "cell_type": "markdown", + "source": [ + "## Basic Concepts\n", + "\n", + "The following line imports the *pandas* API and prints the API version:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "aSRYu62xUi3g", + "colab": {} + }, + "cell_type": "code", + "source": [ + "from __future__ import print_function\n", + "\n", + "import pandas as pd\n", + "pd.__version__" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "daQreKXIUslr" + }, + "cell_type": "markdown", + "source": [ + "The primary data structures in *pandas* are implemented as two classes:\n", + "\n", + " * **`DataFrame`**, which you can imagine as a relational data table, with rows and named columns.\n", + " * **`Series`**, which is a single column. A `DataFrame` contains one or more `Series` and a name for each `Series`.\n", + "\n", + "The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in [Spark](https://spark.apache.org/) and [R](https://www.r-project.org/about.html)." + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "fjnAk1xcU0yc" + }, + "cell_type": "markdown", + "source": [ + "One way to create a `Series` is to construct a `Series` object. For example:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "DFZ42Uq7UFDj", + "colab": {} + }, + "cell_type": "code", + "source": [ + "pd.Series(['San Francisco', 'San Jose', 'Sacramento'])" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "U5ouUp1cU6pC" + }, + "cell_type": "markdown", + "source": [ + "`DataFrame` objects can be created by passing a `dict` mapping `string` column names to their respective `Series`. 
If the `Series` don't match in length, missing values are filled with special [NA/NaN](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) values. Example:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "avgr6GfiUh8t", + "colab": {} + }, + "cell_type": "code", + "source": [ + "city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])\n", + "population = pd.Series([852469, 1015785, 485199])\n", + "\n", + "pd.DataFrame({ 'City name': city_names, 'Population': population })" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "oa5wfZT7VHJl" + }, + "cell_type": "markdown", + "source": [ + "But most of the time, you load an entire file into a `DataFrame`. The following example loads a file with California housing data. Run the following cell to load the data and create feature definitions:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "av6RYOraVG1V", + "colab": {} + }, + "cell_type": "code", + "source": [ + "california_housing_dataframe = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\", sep=\",\")\n", + "california_housing_dataframe.describe()" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "WrkBjfz5kEQu" + }, + "cell_type": "markdown", + "source": [ + "The example above used `DataFrame.describe` to show interesting statistics about a `DataFrame`. Another useful function is `DataFrame.head`, which displays the first few records of a `DataFrame`:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "s3ND3bgOkB5k", + "colab": {} + }, + "cell_type": "code", + "source": [ + "california_housing_dataframe.head()" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "w9-Es5Y6laGd" + }, + "cell_type": "markdown", + "source": [ + "Another powerful feature of *pandas* is graphing. 
For example, `DataFrame.hist` lets you quickly study the distribution of values in a column:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "nqndFVXVlbPN", + "colab": {} + }, + "cell_type": "code", + "source": [ + "california_housing_dataframe.hist('housing_median_age')" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "XtYZ7114n3b-" + }, + "cell_type": "markdown", + "source": [ + "## Accessing Data\n", + "\n", + "You can access `DataFrame` data using familiar Python dict/list operations:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "_TFm7-looBFF", + "colab": {} + }, + "cell_type": "code", + "source": [ + "cities = pd.DataFrame({ 'City name': city_names, 'Population': population })\n", + "print(type(cities['City name']))\n", + "cities['City name']" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "code", + "id": "V5L6xacLoxyv", + "colab": {} + }, + "cell_type": "code", + "source": [ + "print(type(cities['City name'][1]))\n", + "cities['City name'][1]" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "code", + "id": "gcYX1tBPugZl", + "colab": {} + }, + "cell_type": "code", + "source": [ + "print(type(cities[0:2]))\n", + "cities[0:2]" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "65g1ZdGVjXsQ" + }, + "cell_type": "markdown", + "source": [ + "In addition, *pandas* provides an extremely rich API for advanced [indexing and selection](http://pandas.pydata.org/pandas-docs/stable/indexing.html) that is too extensive to be covered here." + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "RM1iaD-ka3Y1" + }, + "cell_type": "markdown", + "source": [ + "## Manipulating Data\n", + "\n", + "You may apply Python's basic arithmetic operations to `Series`. 
For example:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "XWmyCFJ5bOv-", + "colab": {} + }, + "cell_type": "code", + "source": [ + "population / 1000." + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "TQzIVnbnmWGM" + }, + "cell_type": "markdown", + "source": [ + "[NumPy](http://www.numpy.org/) is a popular toolkit for scientific computing. *pandas* `Series` can be used as arguments to most NumPy functions:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "ko6pLK6JmkYP", + "colab": {} + }, + "cell_type": "code", + "source": [ + "import numpy as np\n", + "\n", + "np.log(population)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "xmxFuQmurr6d" + }, + "cell_type": "markdown", + "source": [ + "For more complex single-column transformations, you can use `Series.apply`. Like the Python [map function](https://docs.python.org/2/library/functions.html#map), \n", + "`Series.apply` accepts as an argument a [lambda function](https://docs.python.org/2/tutorial/controlflow.html#lambda-expressions), which is applied to each value.\n", + "\n", + "The example below creates a new `Series` that indicates whether `population` is over one million:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "Fc1DvPAbstjI", + "colab": {} + }, + "cell_type": "code", + "source": [ + "population.apply(lambda val: val > 1000000)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "ZeYYLoV9b9fB" + }, + "cell_type": "markdown", + "source": [ + "\n", + "Modifying `DataFrames` is also straightforward. 
For example, the following code adds two `Series` to an existing `DataFrame`:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "0gCEX99Hb8LR", + "colab": {} + }, + "cell_type": "code", + "source": [ + "cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])\n", + "cities['Population density'] = cities['Population'] / cities['Area square miles']\n", + "cities" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "6qh63m-ayb-c" + }, + "cell_type": "markdown", + "source": [ + "## Exercise #1\n", + "\n", + "Modify the `cities` table by adding a new boolean column that is True if and only if *both* of the following are True:\n", + "\n", + " * The city is named after a saint.\n", + " * The city has an area greater than 50 square miles.\n", + "\n", + "**Note:** Boolean `Series` are combined using the bitwise, rather than the traditional boolean, operators. For example, when performing *logical and*, use `&` instead of `and`.\n", + "\n", + "**Hint:** \"San\" in Spanish means \"saint.\"" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "zCOn8ftSyddH", + "colab": {} + }, + "cell_type": "code", + "source": [ + "# Your code here" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "YHIWvc9Ms-Ll" + }, + "cell_type": "markdown", + "source": [ + "### Solution\n", + "\n", + "Click below for a solution." 
+ ] + }, + { + "metadata": { + "colab_type": "code", + "id": "T5OlrqtdtCIb", + "colab": {} + }, + "cell_type": "code", + "source": [ + "cities['Is wide and has saint name'] = (cities['Area square miles'] > 50) & cities['City name'].apply(lambda name: name.startswith('San'))\n", + "cities" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "f-xAOJeMiXFB" + }, + "cell_type": "markdown", + "source": [ + "## Indexes\n", + "Both `Series` and `DataFrame` objects also define an `index` property that assigns an identifier value to each `Series` item or `DataFrame` row. \n", + "\n", + "By default, at construction, *pandas* assigns index values that reflect the ordering of the source data. Once created, the index values are stable; that is, they do not change when data is reordered." + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "2684gsWNinq9", + "colab": {} + }, + "cell_type": "code", + "source": [ + "city_names.index" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "code", + "id": "F_qPe2TBjfWd", + "colab": {} + }, + "cell_type": "code", + "source": [ + "cities.index" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "hp2oWY9Slo_h" + }, + "cell_type": "markdown", + "source": [ + "Call `DataFrame.reindex` to manually reorder the rows. For example, the following has the same effect as sorting by city name:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "sN0zUzSAj-U1", + "colab": {} + }, + "cell_type": "code", + "source": [ + "cities.reindex([2, 0, 1])" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "-GQFz8NZuS06" + }, + "cell_type": "markdown", + "source": [ + "Reindexing is a great way to shuffle (randomize) a `DataFrame`. 
In the example below, we take the index, which is array-like, and pass it to NumPy's `random.permutation` function, which returns a shuffled copy of its values (the original index is left unchanged). Calling `reindex` with this shuffled array causes the `DataFrame` rows to be shuffled in the same way.\n", + "Try running the following cell multiple times!" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "mF8GC0k8uYhz", + "colab": {} + }, + "cell_type": "code", + "source": [ + "cities.reindex(np.random.permutation(cities.index))" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "fSso35fQmGKb" + }, + "cell_type": "markdown", + "source": [ + "For more information, see the [Index documentation](http://pandas.pydata.org/pandas-docs/stable/indexing.html#index-objects)." + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "8UngIdVhz8C0" + }, + "cell_type": "markdown", + "source": [ + "## Exercise #2\n", + "\n", + "The `reindex` method allows index values that are not in the original `DataFrame`'s index values. Try it and see what happens if you use such values! Why do you think this is allowed?" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "PN55GrDX0jzO", + "colab": {} + }, + "cell_type": "code", + "source": [ + "# Your code here" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "TJffr5_Jwqvd" + }, + "cell_type": "markdown", + "source": [ + "### Solution\n", + "\n", + "Click below for the solution." 
+ ] + }, + { + "metadata": { + "colab_type": "text", + "id": "8oSvi2QWwuDH" + }, + "cell_type": "markdown", + "source": [ + "If your `reindex` input array includes values not in the original `DataFrame` index values, `reindex` will add new rows for these \"missing\" indices and populate all corresponding columns with `NaN` values:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "yBdkucKCwy4x", + "colab": {} + }, + "cell_type": "code", + "source": [ + "cities.reindex([0, 4, 5, 2])" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "2l82PhPbwz7g" + }, + "cell_type": "markdown", + "source": [ + "This behavior is desirable because indexes are often strings pulled from the actual data (see the [*pandas* reindex\n", + "documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html) for an example\n", + "in which the index values are browser names).\n", + "\n", + "In this case, allowing \"missing\" indices makes it easy to reindex using an external list, as you don't have to worry about\n", + "sanitizing the input." 
+ ] + } + ] +} \ No newline at end of file From 50fe890274003fa469ef03d3c44a6f02556f253a Mon Sep 17 00:00:00 2001 From: Chanchal Kumar Maji <31502077+ChanchalKumarMaji@users.noreply.github.com> Date: Sun, 3 Feb 2019 23:00:18 +0530 Subject: [PATCH 3/5] Created using Colaboratory From f282d373f15c0d27ff7f8fdf341bba5bfe30782f Mon Sep 17 00:00:00 2001 From: Chanchal Kumar Maji <31502077+ChanchalKumarMaji@users.noreply.github.com> Date: Sun, 3 Feb 2019 23:01:02 +0530 Subject: [PATCH 4/5] Delete intro_to_pandas.ipynb --- intro_to_pandas.ipynb | 660 ------------------------------------------ 1 file changed, 660 deletions(-) delete mode 100644 intro_to_pandas.ipynb diff --git a/intro_to_pandas.ipynb b/intro_to_pandas.ipynb deleted file mode 100644 index 942ea63..0000000 --- a/intro_to_pandas.ipynb +++ /dev/null @@ -1,660 +0,0 @@ -{ - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { - "colab": { - "name": "intro_to_pandas.ipynb", - "version": "0.3.2", - "provenance": [], - "collapsed_sections": [ - "JndnmDMp66FL", - "YHIWvc9Ms-Ll", - "TJffr5_Jwqvd" - ], - "include_colab_link": true - }, - "kernelspec": { - "name": "python2", - "display_name": "Python 2" - } - }, - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "view-in-github", - "colab_type": "text" - }, - "source": [ - "\"Open" - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "JndnmDMp66FL" - }, - "cell_type": "markdown", - "source": [ - "#### Copyright 2017 Google LLC." 
- ] - }, - { - "metadata": { - "colab_type": "code", - "id": "hMqWDc_m6rUC", - "cellView": "both", - "colab": {} - }, - "cell_type": "code", - "source": [ - "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", - "# you may not use this file except in compliance with the License.\n", - "# You may obtain a copy of the License at\n", - "#\n", - "# https://www.apache.org/licenses/LICENSE-2.0\n", - "#\n", - "# Unless required by applicable law or agreed to in writing, software\n", - "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", - "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", - "# See the License for the specific language governing permissions and\n", - "# limitations under the License." - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "rHLcriKWLRe4" - }, - "cell_type": "markdown", - "source": [ - "# Intro to pandas" - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "QvJBqX8_Bctk" - }, - "cell_type": "markdown", - "source": [ - "**Learning Objectives:**\n", - " * Gain an introduction to the `DataFrame` and `Series` data structures of the *pandas* library\n", - " * Access and manipulate data within a `DataFrame` and `Series`\n", - " * Import CSV data into a *pandas* `DataFrame`\n", - " * Reindex a `DataFrame` to shuffle data" - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "TIFJ83ZTBctl" - }, - "cell_type": "markdown", - "source": [ - "[*pandas*](http://pandas.pydata.org/) is a column-oriented data analysis API. It's a great tool for handling and analyzing input data, and many ML frameworks support *pandas* data structures as inputs.\n", - "Although a comprehensive introduction to the *pandas* API would span many pages, the core concepts are fairly straightforward, and we'll present them below. 
For a more complete reference, the [*pandas* docs site](http://pandas.pydata.org/pandas-docs/stable/index.html) contains extensive documentation and many tutorials." - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "s_JOISVgmn9v" - }, - "cell_type": "markdown", - "source": [ - "## Basic Concepts\n", - "\n", - "The following line imports the *pandas* API and prints the API version:" - ] - }, - { - "metadata": { - "colab_type": "code", - "id": "aSRYu62xUi3g", - "colab": {} - }, - "cell_type": "code", - "source": [ - "from __future__ import print_function\n", - "\n", - "import pandas as pd\n", - "pd.__version__" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "daQreKXIUslr" - }, - "cell_type": "markdown", - "source": [ - "The primary data structures in *pandas* are implemented as two classes:\n", - "\n", - " * **`DataFrame`**, which you can imagine as a relational data table, with rows and named columns.\n", - " * **`Series`**, which is a single column. A `DataFrame` contains one or more `Series` and a name for each `Series`.\n", - "\n", - "The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in [Spark](https://spark.apache.org/) and [R](https://www.r-project.org/about.html)." - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "fjnAk1xcU0yc" - }, - "cell_type": "markdown", - "source": [ - "One way to create a `Series` is to construct a `Series` object. For example:" - ] - }, - { - "metadata": { - "colab_type": "code", - "id": "DFZ42Uq7UFDj", - "colab": {} - }, - "cell_type": "code", - "source": [ - "pd.Series(['San Francisco', 'San Jose', 'Sacramento'])" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "U5ouUp1cU6pC" - }, - "cell_type": "markdown", - "source": [ - "`DataFrame` objects can be created by passing a `dict` mapping `string` column names to their respective `Series`. 
If the `Series` don't match in length, missing values are filled with special [NA/NaN](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) values. Example:" - ] - }, - { - "metadata": { - "colab_type": "code", - "id": "avgr6GfiUh8t", - "colab": {} - }, - "cell_type": "code", - "source": [ - "city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])\n", - "population = pd.Series([852469, 1015785, 485199])\n", - "\n", - "pd.DataFrame({ 'City name': city_names, 'Population': population })" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "oa5wfZT7VHJl" - }, - "cell_type": "markdown", - "source": [ - "But most of the time, you load an entire file into a `DataFrame`. The following example loads a file with California housing data. Run the following cell to load the data and create feature definitions:" - ] - }, - { - "metadata": { - "colab_type": "code", - "id": "av6RYOraVG1V", - "colab": {} - }, - "cell_type": "code", - "source": [ - "california_housing_dataframe = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\", sep=\",\")\n", - "california_housing_dataframe.describe()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "WrkBjfz5kEQu" - }, - "cell_type": "markdown", - "source": [ - "The example above used `DataFrame.describe` to show interesting statistics about a `DataFrame`. Another useful function is `DataFrame.head`, which displays the first few records of a `DataFrame`:" - ] - }, - { - "metadata": { - "colab_type": "code", - "id": "s3ND3bgOkB5k", - "colab": {} - }, - "cell_type": "code", - "source": [ - "california_housing_dataframe.head()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "w9-Es5Y6laGd" - }, - "cell_type": "markdown", - "source": [ - "Another powerful feature of *pandas* is graphing. 
For example, `DataFrame.hist` lets you quickly study the distribution of values in a column:" - ] - }, - { - "metadata": { - "colab_type": "code", - "id": "nqndFVXVlbPN", - "colab": {} - }, - "cell_type": "code", - "source": [ - "california_housing_dataframe.hist('housing_median_age')" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "XtYZ7114n3b-" - }, - "cell_type": "markdown", - "source": [ - "## Accessing Data\n", - "\n", - "You can access `DataFrame` data using familiar Python dict/list operations:" - ] - }, - { - "metadata": { - "colab_type": "code", - "id": "_TFm7-looBFF", - "colab": {} - }, - "cell_type": "code", - "source": [ - "cities = pd.DataFrame({ 'City name': city_names, 'Population': population })\n", - "print(type(cities['City name']))\n", - "cities['City name']" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "code", - "id": "V5L6xacLoxyv", - "colab": {} - }, - "cell_type": "code", - "source": [ - "print(type(cities['City name'][1]))\n", - "cities['City name'][1]" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "code", - "id": "gcYX1tBPugZl", - "colab": {} - }, - "cell_type": "code", - "source": [ - "print(type(cities[0:2]))\n", - "cities[0:2]" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "65g1ZdGVjXsQ" - }, - "cell_type": "markdown", - "source": [ - "In addition, *pandas* provides an extremely rich API for advanced [indexing and selection](http://pandas.pydata.org/pandas-docs/stable/indexing.html) that is too extensive to be covered here." - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "RM1iaD-ka3Y1" - }, - "cell_type": "markdown", - "source": [ - "## Manipulating Data\n", - "\n", - "You may apply Python's basic arithmetic operations to `Series`. 
For example:" - ] - }, - { - "metadata": { - "colab_type": "code", - "id": "XWmyCFJ5bOv-", - "colab": {} - }, - "cell_type": "code", - "source": [ - "population / 1000." - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "TQzIVnbnmWGM" - }, - "cell_type": "markdown", - "source": [ - "[NumPy](http://www.numpy.org/) is a popular toolkit for scientific computing. *pandas* `Series` can be used as arguments to most NumPy functions:" - ] - }, - { - "metadata": { - "colab_type": "code", - "id": "ko6pLK6JmkYP", - "colab": {} - }, - "cell_type": "code", - "source": [ - "import numpy as np\n", - "\n", - "np.log(population)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "xmxFuQmurr6d" - }, - "cell_type": "markdown", - "source": [ - "For more complex single-column transformations, you can use `Series.apply`. Like the Python [map function](https://docs.python.org/2/library/functions.html#map), \n", - "`Series.apply` accepts as an argument a [lambda function](https://docs.python.org/2/tutorial/controlflow.html#lambda-expressions), which is applied to each value.\n", - "\n", - "The example below creates a new `Series` that indicates whether `population` is over one million:" - ] - }, - { - "metadata": { - "colab_type": "code", - "id": "Fc1DvPAbstjI", - "colab": {} - }, - "cell_type": "code", - "source": [ - "population.apply(lambda val: val > 1000000)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "ZeYYLoV9b9fB" - }, - "cell_type": "markdown", - "source": [ - "\n", - "Modifying `DataFrames` is also straightforward. 
For example, the following code adds two `Series` to an existing `DataFrame`:" - ] - }, - { - "metadata": { - "colab_type": "code", - "id": "0gCEX99Hb8LR", - "colab": {} - }, - "cell_type": "code", - "source": [ - "cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])\n", - "cities['Population density'] = cities['Population'] / cities['Area square miles']\n", - "cities" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "6qh63m-ayb-c" - }, - "cell_type": "markdown", - "source": [ - "## Exercise #1\n", - "\n", - "Modify the `cities` table by adding a new boolean column that is True if and only if *both* of the following are True:\n", - "\n", - " * The city is named after a saint.\n", - " * The city has an area greater than 50 square miles.\n", - "\n", - "**Note:** Boolean `Series` are combined using the bitwise, rather than the traditional boolean, operators. For example, when performing *logical and*, use `&` instead of `and`.\n", - "\n", - "**Hint:** \"San\" in Spanish means \"saint.\"" - ] - }, - { - "metadata": { - "colab_type": "code", - "id": "zCOn8ftSyddH", - "colab": {} - }, - "cell_type": "code", - "source": [ - "# Your code here" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "YHIWvc9Ms-Ll" - }, - "cell_type": "markdown", - "source": [ - "### Solution\n", - "\n", - "Click below for a solution." 
- ] - }, - { - "metadata": { - "colab_type": "code", - "id": "T5OlrqtdtCIb", - "colab": {} - }, - "cell_type": "code", - "source": [ - "cities['Is wide and has saint name'] = (cities['Area square miles'] > 50) & cities['City name'].apply(lambda name: name.startswith('San'))\n", - "cities" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "f-xAOJeMiXFB" - }, - "cell_type": "markdown", - "source": [ - "## Indexes\n", - "Both `Series` and `DataFrame` objects also define an `index` property that assigns an identifier value to each `Series` item or `DataFrame` row. \n", - "\n", - "By default, at construction, *pandas* assigns index values that reflect the ordering of the source data. Once created, the index values are stable; that is, they do not change when data is reordered." - ] - }, - { - "metadata": { - "colab_type": "code", - "id": "2684gsWNinq9", - "colab": {} - }, - "cell_type": "code", - "source": [ - "city_names.index" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "code", - "id": "F_qPe2TBjfWd", - "colab": {} - }, - "cell_type": "code", - "source": [ - "cities.index" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "hp2oWY9Slo_h" - }, - "cell_type": "markdown", - "source": [ - "Call `DataFrame.reindex` to manually reorder the rows. For example, the following has the same effect as sorting by city name:" - ] - }, - { - "metadata": { - "colab_type": "code", - "id": "sN0zUzSAj-U1", - "colab": {} - }, - "cell_type": "code", - "source": [ - "cities.reindex([2, 0, 1])" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "-GQFz8NZuS06" - }, - "cell_type": "markdown", - "source": [ - "Reindexing is a great way to shuffle (randomize) a `DataFrame`. 
In the example below, we take the index, which is array-like, and pass it to NumPy's `random.permutation` function, which returns a shuffled copy of its values (the original index is left unchanged). Calling `reindex` with this shuffled array causes the `DataFrame` rows to be shuffled in the same way.\n", - "Try running the following cell multiple times!" - ] - }, - { - "metadata": { - "colab_type": "code", - "id": "mF8GC0k8uYhz", - "colab": {} - }, - "cell_type": "code", - "source": [ - "cities.reindex(np.random.permutation(cities.index))" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "fSso35fQmGKb" - }, - "cell_type": "markdown", - "source": [ - "For more information, see the [Index documentation](http://pandas.pydata.org/pandas-docs/stable/indexing.html#index-objects)." - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "8UngIdVhz8C0" - }, - "cell_type": "markdown", - "source": [ - "## Exercise #2\n", - "\n", - "The `reindex` method allows index values that are not in the original `DataFrame`'s index values. Try it and see what happens if you use such values! Why do you think this is allowed?" - ] - }, - { - "metadata": { - "colab_type": "code", - "id": "PN55GrDX0jzO", - "colab": {} - }, - "cell_type": "code", - "source": [ - "# Your code here" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "TJffr5_Jwqvd" - }, - "cell_type": "markdown", - "source": [ - "### Solution\n", - "\n", - "Click below for the solution." 
- ] - }, - { - "metadata": { - "colab_type": "text", - "id": "8oSvi2QWwuDH" - }, - "cell_type": "markdown", - "source": [ - "If your `reindex` input array includes values not in the original `DataFrame` index values, `reindex` will add new rows for these \"missing\" indices and populate all corresponding columns with `NaN` values:" - ] - }, - { - "metadata": { - "colab_type": "code", - "id": "yBdkucKCwy4x", - "colab": {} - }, - "cell_type": "code", - "source": [ - "cities.reindex([0, 4, 5, 2])" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "2l82PhPbwz7g" - }, - "cell_type": "markdown", - "source": [ - "This behavior is desirable because indexes are often strings pulled from the actual data (see the [*pandas* reindex\n", - "documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html) for an example\n", - "in which the index values are browser names).\n", - "\n", - "In this case, allowing \"missing\" indices makes it easy to reindex using an external list, as you don't have to worry about\n", - "sanitizing the input." 
- ] - } - ] -} \ No newline at end of file From 693b53a3e972c663557431f4c4ce4fa25f99042c Mon Sep 17 00:00:00 2001 From: Chanchal Kumar Maji <31502077+ChanchalKumarMaji@users.noreply.github.com> Date: Mon, 4 Feb 2019 00:41:09 +0530 Subject: [PATCH 5/5] Created using Colaboratory --- intro_to_pandas.ipynb | 1870 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1870 insertions(+) create mode 100644 intro_to_pandas.ipynb diff --git a/intro_to_pandas.ipynb b/intro_to_pandas.ipynb new file mode 100644 index 0000000..472d30f --- /dev/null +++ b/intro_to_pandas.ipynb @@ -0,0 +1,1870 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "intro_to_pandas.ipynb", + "version": "0.3.2", + "provenance": [], + "collapsed_sections": [], + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "accelerator": "GPU" + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "JndnmDMp66FL" + }, + "cell_type": "markdown", + "source": [ + "#### Copyright 2017 Google LLC." + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "hMqWDc_m6rUC", + "cellView": "both", + "colab": {} + }, + "cell_type": "code", + "source": [ + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." 
+ ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "rHLcriKWLRe4" + }, + "cell_type": "markdown", + "source": [ + "# Intro to pandas" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "QvJBqX8_Bctk" + }, + "cell_type": "markdown", + "source": [ + "**Learning Objectives:**\n", + " * Gain an introduction to the `DataFrame` and `Series` data structures of the *pandas* library\n", + " * Access and manipulate data within a `DataFrame` and `Series`\n", + " * Import CSV data into a *pandas* `DataFrame`\n", + " * Reindex a `DataFrame` to shuffle data" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "TIFJ83ZTBctl" + }, + "cell_type": "markdown", + "source": [ + "[*pandas*](http://pandas.pydata.org/) is a column-oriented data analysis API. It's a great tool for handling and analyzing input data, and many ML frameworks support *pandas* data structures as inputs.\n", + "Although a comprehensive introduction to the *pandas* API would span many pages, the core concepts are fairly straightforward, and we'll present them below. For a more complete reference, the [*pandas* docs site](http://pandas.pydata.org/pandas-docs/stable/index.html) contains extensive documentation and many tutorials." 
+ ] + }, + { + "metadata": { + "colab_type": "text", + "id": "s_JOISVgmn9v" + }, + "cell_type": "markdown", + "source": [ + "## Basic Concepts\n", + "\n", + "The following line imports the *pandas* API and prints the API version:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "aSRYu62xUi3g", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "b69fa038-4215-44f9-c5dd-2acc82452a93" + }, + "cell_type": "code", + "source": [ + "from __future__ import print_function\n", + "\n", + "import pandas as pd\n", + "pd.__version__" + ], + "execution_count": 2, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "'0.22.0'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 2 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "daQreKXIUslr" + }, + "cell_type": "markdown", + "source": [ + "The primary data structures in *pandas* are implemented as two classes:\n", + "\n", + " * **`DataFrame`**, which you can imagine as a relational data table, with rows and named columns.\n", + " * **`Series`**, which is a single column. A `DataFrame` contains one or more `Series` and a name for each `Series`.\n", + "\n", + "The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in [Spark](https://spark.apache.org/) and [R](https://www.r-project.org/about.html)." + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "fjnAk1xcU0yc" + }, + "cell_type": "markdown", + "source": [ + "One way to create a `Series` is to construct a `Series` object. 
For example:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "DFZ42Uq7UFDj", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 86 + }, + "outputId": "8acbe90f-6151-48ef-9870-ce0a5cc32a9c" + }, + "cell_type": "code", + "source": [ + "pd.Series(['San Francisco', 'San Jose', 'Sacramento'])" + ], + "execution_count": 3, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 San Francisco\n", + "1 San Jose\n", + "2 Sacramento\n", + "dtype: object" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 3 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "U5ouUp1cU6pC" + }, + "cell_type": "markdown", + "source": [ + "`DataFrame` objects can be created by passing a `dict` mapping `string` column names to their respective `Series`. If the `Series` don't match in length, missing values are filled with special [NA/NaN](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) values. Example:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "avgr6GfiUh8t", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 143 + }, + "outputId": "53e13300-254c-4d87-c4ca-5e021d787a12" + }, + "cell_type": "code", + "source": [ + "city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])\n", + "population = pd.Series([852469, 1015785, 485199])\n", + "\n", + "pd.DataFrame({ 'City name': city_names, 'Population': population })" + ], + "execution_count": 4, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " City name Population\n", + "0 San Francisco 852469\n", + "1 San Jose 1015785\n", + "2 Sacramento 485199" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 4 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "oa5wfZT7VHJl" + }, + "cell_type": "markdown", + "source": [ + "But most of the time, you load an entire file into a `DataFrame`. The following example loads a file with California housing data. Run the following cell to load the data and create feature definitions:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "av6RYOraVG1V", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 320 + }, + "outputId": "4b8ef629-7bcd-449f-b0d6-c13bce927f2e" + }, + "cell_type": "code", + "source": [ + "california_housing_dataframe = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\", sep=\",\")\n", + "california_housing_dataframe.describe()" + ], + "execution_count": 5, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " longitude latitude housing_median_age total_rooms \\\n", + "count 17000.000000 17000.000000 17000.000000 17000.000000 \n", + "mean -119.562108 35.625225 28.589353 2643.664412 \n", + "std 2.005166 2.137340 12.586937 2179.947071 \n", + "min -124.350000 32.540000 1.000000 2.000000 \n", + "25% -121.790000 33.930000 18.000000 1462.000000 \n", + "50% -118.490000 34.250000 29.000000 2127.000000 \n", + "75% -118.000000 37.720000 37.000000 3151.250000 \n", + "max -114.310000 41.950000 52.000000 37937.000000 \n", + "\n", + " total_bedrooms population households median_income \\\n", + "count 17000.000000 17000.000000 17000.000000 17000.000000 \n", + "mean 539.410824 1429.573941 501.221941 3.883578 \n", + "std 421.499452 1147.852959 384.520841 1.908157 \n", + "min 1.000000 3.000000 1.000000 0.499900 \n", + "25% 297.000000 790.000000 282.000000 2.566375 \n", + "50% 434.000000 1167.000000 409.000000 3.544600 \n", + "75% 648.250000 1721.000000 605.250000 4.767000 \n", + "max 6445.000000 35682.000000 6082.000000 15.000100 \n", + "\n", + " median_house_value \n", + "count 17000.000000 \n", + "mean 207300.912353 \n", + "std 115983.764387 \n", + "min 14999.000000 \n", + "25% 119400.000000 \n", + "50% 180400.000000 \n", + "75% 265000.000000 \n", + "max 500001.000000 " + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 5 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "WrkBjfz5kEQu" + }, + "cell_type": "markdown", + "source": [ + "The example above used `DataFrame.describe` to show interesting statistics about a `DataFrame`. 
Another useful function is `DataFrame.head`, which displays the first few records of a `DataFrame`:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "s3ND3bgOkB5k", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 226 + }, + "outputId": "a0227da6-863c-453e-c83e-4e69b4657896" + }, + "cell_type": "code", + "source": [ + "california_housing_dataframe.head()" + ], + "execution_count": 6, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", + "0 -114.31 34.19 15.0 5612.0 1283.0 \n", + "1 -114.47 34.40 19.0 7650.0 1901.0 \n", + "2 -114.56 33.69 17.0 720.0 174.0 \n", + "3 -114.57 33.64 14.0 1501.0 337.0 \n", + "4 -114.57 33.57 20.0 1454.0 326.0 \n", + "\n", + " population households median_income median_house_value \n", + "0 1015.0 472.0 1.4936 66900.0 \n", + "1 1129.0 463.0 1.8200 80100.0 \n", + "2 333.0 117.0 1.6509 85700.0 \n", + "3 515.0 226.0 3.1917 73400.0 \n", + "4 624.0 262.0 1.9250 65500.0 " + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 6 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "w9-Es5Y6laGd" + }, + "cell_type": "markdown", + "source": [ + "Another powerful feature of *pandas* is graphing. For example, `DataFrame.hist` lets you quickly study the distribution of values in a column:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "nqndFVXVlbPN", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 397 + }, + "outputId": "edaf5f93-5bce-44a6-c30f-4435f896ea23" + }, + "cell_type": "code", + "source": [ + "california_housing_dataframe.hist('housing_median_age')" + ], + "execution_count": 7, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[]],\n", + " dtype=object)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 7 + }, + { + "output_type": "display_data", + "data": { + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAeoAAAFZCAYAAABXM2zhAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3X1UlHX+//HXMDAH0UEEGTfLarf0\naEmaa5l4U0Iokp7IVRPWdU3q6Iqtlql499WTlajRmmZZmunRU7GNtofcAjJxyyRanT0uuu0p2VOr\neTejKCqgSPP7o9Os/FRguP1Az8dfcTEz1+d6H+3pdQ1zYfF6vV4BAAAjBTT3AgAAwPURagAADEao\nAQAwGKEGAMBghBoAAIMRagAADEaogVo6cuSI7rjjjkbdxz//+U+lpKQ06j4a0h133KEjR47o448/\n1ty5c5t7OUCrZOFz1EDtHDlyREOHDtW//vWv5l6KMe644w7l5ubqpptuau6lAK0WZ9SAn5xOp0aO\nHKn7779f27dv1w8//KA//elPio+PV3x8vNLS0lRaWipJiomJ0d69e33P/enry5cva/78+Ro2bJji\n4uI0bdo0nT9/XgUFBYqLi5MkrV69Ws8++6xSU1MVGxur0aNH6+TJk5KkgwcPaujQoRo6dKheeeUV\njRw5UgUFBdWue/Xq1Vq0aJEmT56sgQMHatasWcrLy9OoUaM0cOBA5eXlSZIuXbqk5557TsOGDVNM\nTIzWrl3re42//e1viouL0/Dhw7V+/Xrf9m3btmnixImSJI/Ho5SUFMXHxysmJkZvvfVWleN/9913\nNXr0aA0cOFDp6ek1zrusrEwzZszwrWfZsmW+71U3hx07dmjkyJGKjY3VpEmTdPr06Rr3BZiIUAN+\n+OGHH1RRUaEPPvhAc+fO1cqVK/XRRx/p008/1bZt2/TXv/5VJSUl2rhxY7Wvs3v3bh05ckTZ2dnK\nzc3V7bffrn/84x9XPS47O1vz5s3Tjh07FBERoa1bt0qSFi5cqIkTJyo3N1ft2rXTt99+W6v179q1\nSy+88II++OADZWdn+9Y9ZcoUrVu3TpK0bt06HTp0SB988IG2b9+unJwc5eXlqbKyUvPnz9eiRYv0\n0UcfKSAgQJWVlVft47XXXtNNN92k7Oxsbdq0SRkZGTp27Jjv+3//+9+VmZmprVu3asuWLTp+/Hi1\na37nnXd04cIFZWdn6/3339e2bdt8//i53hwOHz6s2bNnKyMjQ5988on69eunxYsX12pGgGkINeAH\nr9erxMREST9e9j1+/Lh27dqlxMREhYSEyGq1atSoUfr888+rfZ3w8HAVFRXp448/9p0xDho06KrH\n9e3bVzfeeKMsFot69OihY8eOqby8XAcPHtSIESMkSb/97W9V23ew7r77bkVERKhDhw6KjIzU4MGD\nJUndunXzna3n5eUpOTlZNptNISEhevjhh5Wbm6tvv/1Wly5d0sCBAyVJjzzyyDX3sWDBAi1cuFCS\n1KVLF0VGRurIkSO+748cOVJWq1WdOnVSRERElYhfy6RJk/Tqq6/KYrGoffv26tq1q44cOVLtHD79\n9FPde++96tatmyRp3Lhx2rlz5zX/YQGYLrC5FwC0JFarVW3atJEkBQQE6IcfftDp06fVvn1732Pa\nt2+vU6dOVfs6d911lxYsWKDNmzdrzpw5iomJ0aJFi656nN1ur7LvyspKnT17VhaLRaGhoZKkoKAg\nRURE1Gr9bdu2rfJ6ISEhVY5Fks6dO6elS5fqpZdekvTjpfC77rpLZ8+eVbt27aoc57UUFhb6zqID\nAgLkdrt9ry2pymv8dEzV+fbbb5Wenq7//Oc/CggI0PHjxzVq1Khq53Du3Dnt3btX8fHxVfZ75syZ\nWs8KMAWhBuqpY8eOOnPmjO/rM2fOqGPHjpKqBlCSzp496/vvn
97TPnPmjObNm6c333xT0dHRNe6v\nXbt28nq9KisrU5s2bXT58uUGff/V4XBo0qRJGjJkSJXtRUVFOn/+vO/r6+1z1qxZ+v3vf6+kpCRZ\nLJZrXinwx7PPPqs777xTa9askdVq1bhx4yRVPweHw6Ho6GitWrWqXvsGTMClb6CeHnjgAWVlZams\nrEyXL1+W0+nU/fffL0mKjIzUv//9b0nShx9+qIsXL0qStm7dqjVr1kiSwsLC9Ktf/arW+2vbtq1u\nu+02ffTRR5KkzMxMWSyWBjue2NhYvffee6qsrJTX69Wrr76qTz/9VDfffLOsVqvvh7W2bdt2zf2e\nOnVKPXv2lMVi0fvvv6+ysjLfD9fVxalTp9SjRw9ZrVZ9/vnn+u6771RaWlrtHAYOHKi9e/fq8OHD\nkn782Ntzzz1X5zUAzYlQA/UUHx+vwYMHa9SoURoxYoR+8YtfaMKECZKkqVOnauPGjRoxYoSKiop0\n++23S/oxhj/9xPLw4cN16NAhPfbYY7Xe56JFi7R27Vo99NBDKi0tVadOnRos1snJyercubMeeugh\nxcfHq6ioSL/+9a8VFBSkJUuWaN68eRo+fLgsFovv0vmVpk+frtTUVI0cOVKlpaV69NFHtXDhQv33\nv/+t03r+8Ic/aNmyZRoxYoS+/PJLTZs2TatXr9a+ffuuOweHw6ElS5YoNTVVw4cP17PPPquEhIT6\njgZoFnyOGmihvF6vL8733XefNm7cqO7duzfzqpoec0Brxxk10AL98Y9/9H2cKj8/X16vV7feemvz\nLqoZMAf8HHBGDbRARUVFmjt3rs6ePaugoCDNmjVLN910k1JTU6/5+Ntuu833nrhpioqK6rzua83h\np58PAFoLQg0AgMG49A0AgMEINQAABjPyhidu9zm/Ht+hQ4iKi+v+Oc2fO+ZXd8yufphf3TG7+jFt\nfpGR9ut+r1WcUQcGWpt7CS0a86s7Zlc/zK/umF39tKT5tYpQAwDQWhFqAAAMRqgBADBYjT9MVlZW\nprS0NJ06dUoXL17U1KlT1b17d82ePVuVlZWKjIzUihUrZLPZlJWVpU2bNikgIEBjx47VmDFjVFFR\nobS0NB09elRWq1VLly5Vly5dmuLYAABo8Wo8o87Ly1PPnj21ZcsWrVy5Uunp6Vq1apWSk5P19ttv\n65ZbbpHT6VRpaanWrFmjjRs3avPmzdq0aZPOnDmj7du3KzQ0VO+8846mTJmijIyMpjguAABahRpD\nnZCQoCeeeEKSdOzYMXXq1EkFBQWKjY2VJA0ZMkT5+fnav3+/oqKiZLfbFRwcrD59+sjlcik/P19x\ncXGSpOjoaLlcrkY8HAAAWpdaf4563LhxOn78uNauXavHHntMNptNkhQRESG32y2Px6Pw8HDf48PD\nw6/aHhAQIIvFokuXLvmeDwAArq/WoX733Xf11VdfadasWbry9uDXu1W4v9uv1KFDiN+fcavuw+Ko\nGfOrO2ZXP8yv7phd/bSU+dUY6gMHDigiIkI33HCDevToocrKSrVt21bl5eUKDg7WiRMn5HA45HA4\n5PF4fM87efKkevfuLYfDIbfbre7du6uiokJer7fGs2l/7xYTGWn3+25m+B/mV3fMrn6YX90xu/ox\nbX71ujPZ3r17tWHDBkmSx+NRaWmpoqOjlZOTI0nKzc3VoEGD1KtXLxUWFqqkpEQXLlyQy+VS3759\nNWDAAGVnZ0v68QfT+vXr1xDHBADAz0KNZ9Tjxo3T/PnzlZycrPLycv3f//2fevbsqTlz5igzM1Od\nO3dWYmKigoKCNHPmTKWkpMhisSg1NVV2u10JCQnas2ePkpKSZLPZlJ6e3hTHBQBAq2Dk76P293KE\naZcwWhrmV3fMrn6YX90xu/oxbX7VXfo28rdnAcC1TErf2dxLqNGGtJjmXgJaGW4hCgCAwQg1AAAG\nI9QAABiMUAMAYDBCDQCAw
Qg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCA\nwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMA\nYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QA\nABiMUAMAYDBCDQCAwQg1AAAGC6zNg5YvX659+/bp8uXLmjx5snbu3KmDBw8qLCxMkpSSkqIHHnhA\nWVlZ2rRpkwICAjR27FiNGTNGFRUVSktL09GjR2W1WrV06VJ16dKlUQ8KAIDWosZQf/HFF/rmm2+U\nmZmp4uJiPfLII7rvvvv09NNPa8iQIb7HlZaWas2aNXI6nQoKCtLo0aMVFxenvLw8hYaGKiMjQ7t3\n71ZGRoZWrlzZqAcFAEBrUeOl73vuuUcvv/yyJCk0NFRlZWWqrKy86nH79+9XVFSU7Ha7goOD1adP\nH7lcLuXn5ysuLk6SFB0dLZfL1cCHAABA61VjqK1Wq0JCQiRJTqdTgwcPltVq1ZYtWzRhwgQ99dRT\nOn36tDwej8LDw33PCw8Pl9vtrrI9ICBAFotFly5daqTDAQCgdanVe9SStGPHDjmdTm3YsEEHDhxQ\nWFiYevTooTfeeEOvvPKK7r777iqP93q913yd622/UocOIQoMtNZ2aZKkyEi7X49HVcyv7phd/bS2\n+TXl8bS22TW1ljK/WoX6s88+09q1a7V+/XrZ7Xb179/f972YmBgtXrxYw4YNk8fj8W0/efKkevfu\nLYfDIbfbre7du6uiokJer1c2m63a/RUXl/p1EJGRdrnd5/x6Dv6H+dUds6uf1ji/pjqe1ji7pmTa\n/Kr7R0ONl77PnTun5cuX6/XXX/f9lPeTTz6pw4cPS5IKCgrUtWtX9erVS4WFhSopKdGFCxfkcrnU\nt29fDRgwQNnZ2ZKkvLw89evXryGOCQCAn4Uaz6g//PBDFRcXa8aMGb5to0aN0owZM9SmTRuFhIRo\n6dKlCg4O1syZM5WSkiKLxaLU1FTZ7XYlJCRoz549SkpKks1mU3p6eqMeEAAArYnFW5s3jZuYv5cj\nTLuE0dIwv7pjdvXj7/wmpe9sxNU0jA1pMU2yH/7s1Y9p86vXpW8AANB8CDUAAAYj1AAAGIxQAwBg\nMEINAIDBCDUAAAYj1AAAGIxQAwBgMEINAIDBCDUAAAYj1AAAGIxQAwBgMEINAIDBCDUAAAYj1AAA\nGIxQAwBgMEINAIDBCDUAAAYj1AAAGIxQAwBgMEINAIDBCDUAAAYLbO4FAA1lUvrO5l5CtTakxTT3\nEgC0QJxRAwBgMEINAIDBCDUAAAYj1AAAGIxQAwBgMEINAIDBCDUAAAYj1AAAGIxQAwBgMEINAIDB\nCDUAAAYj1AAAGIxQAwBgMEINAIDBCDUAAAbj91EDTcT035ct8TuzARNxRg0AgMFqdUa9fPly7du3\nT5cvX9bkyZMVFRWl2bNnq7KyUpGRkVqxYoVsNpuysrK0adMmBQQEaOzYsRozZowqKiqUlpamo0eP\nymq1aunSperSpUtjHxcAAK1CjaH+4osv9M033ygzM1PFxcV65JFH1L9/fyUnJ2v48OF66aWX5HQ6\nlZiYqDVr1sjpdCooKEijR49WXFyc8vLyFBoaqoyMDO3evVsZGRlauXJlUxwbAAAtXo2Xvu+55x69\n/PLLkqTQ0FCVlZWpoKBAsbGxkqQhQ4YoPz9f+/fvV1RUlOx2u4KDg9WnTx+5XC7l5+crLi5OkhQd\nHS2Xy9WIhwMAQOtS4xm11WpVSEiIJMnpdGrw4MHavXu3bDabJCkiIkJut1sej0fh4eG+54WHh1+1\nPSAgQBaLRZcuXfI9/1o6dAhRYKDVrwOJjLT79XhUxfwgNc+fg9b2Z68pj6e1za6ptZT51fq
nvnfs\n2CGn06kNGzZo6NChvu1er/eaj/d3+5WKi0truyxJPw7b7T7n13PwP8wPP2nqPwet8c9eUx1Pa5xd\nUzJtftX9o6FWP/X92Wefae3atVq3bp3sdrtCQkJUXl4uSTpx4oQcDoccDoc8Ho/vOSdPnvRtd7vd\nkqSKigp5vd5qz6YBAMD/1Bjqc+fOafny5Xr99dcVFhYm6cf3mnNyciRJubm5GjRokHr16qXCwkKV\nlJTowoULcrlc6tu3rwYMGKDs7GxJUl5envr169eIhwMAQOtS46XvDz/8UMXFxZoxY4ZvW3p6uhYs\nWKDMzEx17txZiYmJCgoK0syZM5WSkiKLxaLU1FTZ7XYlJCRoz549SkpKks1mU3p6eqMeEAAArUmN\noX700Uf16KOPXrX9rbfeumpbfHy84uPjq2z76bPTAADAf9xCFIBPS7jNKfBzwy1EAQAwGKEGAMBg\nhBoAAIMRagAADEaoAQAwGKEGAMBghBoAAIMRagAADEaoAQAwGHcmQ61wxyoAaB6cUQMAYDBCDQCA\nwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMA\nYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABgssLkXAADAlSal72zuJdRoQ1pM\nk+2LM2oAAAxGqAEAMBihBgDAYIQaAACDEWoAAAxGqAEAMBihBgDAYLX6HPXXX3+tqVOnauLEiRo/\nfrzS0tJ08OBBhYWFSZJSUlL0wAMPKCsrS5s2bVJAQIDGjh2rMWPGqKKiQmlpaTp69KisVquWLl2q\nLl26NOpBAUBz4TPAaGg1hrq0tFRLlixR//79q2x/+umnNWTIkCqPW7NmjZxOp4KCgjR69GjFxcUp\nLy9PoaGhysjI0O7du5WRkaGVK1c2/JEAANAK1Xjp22azad26dXI4HNU+bv/+/YqKipLdbldwcLD6\n9Okjl8ul/Px8xcXFSZKio6PlcrkaZuUAAPwM1BjqwMBABQcHX7V9y5YtmjBhgp566imdPn1aHo9H\n4eHhvu+Hh4fL7XZX2R4QECCLxaJLly414CEAANB61ele3w8//LDCwsLUo0cPvfHGG3rllVd09913\nV3mM1+u95nOvt/1KHTqEKDDQ6teaIiPtfj0eVTE/4OeDv+/115QzrFOor3y/OiYmRosXL9awYcPk\n8Xh820+ePKnevXvL4XDI7Xare/fuqqiokNfrlc1mq/b1i4tL/VpPZKRdbvc5/w4CPswP+Hnh73v9\nNfQMqwt/nT6e9eSTT+rw4cOSpIKCAnXt2lW9evVSYWGhSkpKdOHCBblcLvXt21cDBgxQdna2JCkv\nL0/9+vWryy4BAPhZqvGM+sCBA1q2bJm+//57BQYGKicnR+PHj9eMGTPUpk0bhYSEaOnSpQoODtbM\nmTOVkpIii8Wi1NRU2e12JSQkaM+ePUpKSpLNZlN6enpTHBcAAK1CjaHu2bOnNm/efNX2YcOGXbUt\nPj5e8fHxVbb99NlpAADgP+5MBgCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMA\nYDBCDQCAwQg1AAAGI9QAABiMUAMAYLA6/T5qAEDLNSl9Z3MvAX7gjBoAAIMRagAADEaoAQAwGKEG\nAMBghBoAAIMRagAADEaoAQAwGKEGAMBghBoAAIMRagAADEaoAQAwGKEGAMBghBoAAIMRagAADEao\nAQAwGKEGAMBghBoAAIMRagAADEaoAQAwGKEGAMBghBoAAIMRagAADEaoAQAwGKEGAMBghBoAAIMR\nagAADFarUH/99dd68MEHtWXLFknSsWPH9Lvf/U7JycmaPn26Ll26JEnKysrSb37zG40ZM0bvvfee\nJKmiokIzZ85UUlKSxo8fr8OHDzfSoQAA0PrUGOrS0lI
tWbJE/fv3921btWqVkpOT9fbbb+uWW26R\n0+lUaWmp1qxZo40bN2rz5s3atGmTzpw5o+3btys0NFTvvPOOpkyZooyMjEY9IAAAWpMaQ22z2bRu\n3To5HA7ftoKCAsXGxkqShgwZovz8fO3fv19RUVGy2+0KDg5Wnz595HK5lJ+fr7i4OElSdHS0XC5X\nIx0KAACtT42hDgwMVHBwcJVtZWVlstlskqSIiAi53W55PB6Fh4f7HhMeHn7V9oCAAFksFt+lcgAA\nUL3A+r6A1+ttkO1X6tAhRIGBVr/WERlp9+vxqIr5AUDtNeX/M+sU6pCQEJWXlys4OFgnTpyQw+GQ\nw+GQx+PxPebkyZPq3bu3HA6H3G63unfvroqKCnm9Xt/Z+PUUF5f6tZ7ISLvc7nN1ORSI+QGAvxr6\n/5nVhb9OH8+Kjo5WTk6OJCk3N1eDBg1Sr169VFhYqJKSEl24cEEul0t9+/bVgAEDlJ2dLUnKy8tT\nv3796rJLAAB+lmo8oz5w4ICWLVum77//XoGBgcrJydGLL76otLQ0ZWZmqnPnzkpMTFRQUJBmzpyp\nlJQUWSwWpaamym63KyEhQXv27FFSUpJsNpvS09Ob4rgAAGgVLN7avGncxPy9pMCl2/qpzfwmpe9s\notUAgPk2pMU06Os1+KVvAADQNOr9U99oGJyxAgCuhTNqAAAMRqgBADAYoQYAwGCEGgAAgxFqAAAM\nRqgBADAYoQYAwGCEGgAAgxFqAAAMRqgBADAYoQYAwGCEGgAAgxFqAAAMRqgBADAYoQYAwGCEGgAA\ngxFqAAAMRqgBADAYoQYAwGCEGgAAgxFqAAAMRqgBADAYoQYAwGCEGgAAgxFqAAAMRqgBADAYoQYA\nwGCEGgAAgxFqAAAMRqgBADAYoQYAwGCEGgAAgxFqAAAMFtjcC2gKk9J3NvcSAACoE86oAQAwGKEG\nAMBghBoAAIMRagAADFanHyYrKCjQ9OnT1bVrV0lSt27d9Pjjj2v27NmqrKxUZGSkVqxYIZvNpqys\nLG3atEkBAQEaO3asxowZ06AHAABAa1bnn/q+9957tWrVKt/Xc+fOVXJysoYPH66XXnpJTqdTiYmJ\nWrNmjZxOp4KCgjR69GjFxcUpLCysQRYPAEBr12CXvgsKChQbGytJGjJkiPLz87V//35FRUXJbrcr\nODhYffr0kcvlaqhdAgDQ6tX5jPrQoUOaMmWKzp49q2nTpqmsrEw2m02SFBERIbfbLY/Ho/DwcN9z\nwsPD5Xa7a3ztDh1CFBho9Ws9kZF2/w4AAIA6asrm1CnUt956q6ZNm6bhw4fr8OHDmjBhgiorK33f\n93q913ze9bb//4qLS/1aT2SkXW73Ob+eAwBAXTV0c6oLf50ufXfq1EkJCQmyWCy6+eab1bFjR509\ne1bl5eWSpBMnTsjhcMjhcMjj8fied/LkSTkcjrrsEgCAn6U6hTorK0tvvvmmJMntduvUqVMaNWqU\ncnJyJEm5ubkaNGiQevXqpcLCQpWUlOjChQtyuVzq27dvw60eAIBWrk6XvmNiYvTMM8/ok08+UUVF\nhRYvXqwePXpozpw5yszMVOfOnZWYmKigoCDNnDlTKSkpslgsSk1Nld3Oe8kAANSWxVvbN46bkL/X\n/mt6j5pfygEAaEgb0mIa9PUa/D1qAADQNAg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiM\nUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAG\nI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCA\nwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABgssCl2\n8sILL2j//v2yWCy
aN2+e7rrrrqbYLQAALV6jh/rLL7/Ud999p8zMTBUVFWnevHnKzMxs7N0CANAq\nNPql7/z8fD344IOSpNtuu01nz57V+fPnG3u3AAC0Co0eao/How4dOvi+Dg8Pl9vtbuzdAgDQKjTJ\ne9RX8nq9NT4mMtLu9+tW95wPMh72+/UAADBBo59ROxwOeTwe39cnT55UZGRkY+8WAIBWodFDPWDA\nAOXk5EiSDh48KIfDoXbt2jX2bgEAaBUa/dJ3nz59dOedd2rcuHGyWCxatGhRY+8SAIBWw+KtzZvG\nAACgWXBnMgAADEaoAQAwWJN/PKuhcXtS/3399deaOnWqJk6cqPHjx+vYsWOaPXu2KisrFRkZqRUr\nVshmszX3Mo20fPly7du3T5cvX9bkyZMVFRXF7GqhrKxMaWlpOnXqlC5evKipU6eqe/fuzM5P5eXl\nGjFihKZOnar+/fszv1oqKCjQ9OnT1bVrV0lSt27d9Pjjj7eY+bXoM+orb0/6/PPP6/nnn2/uJRmv\ntLRUS5YsUf/+/X3bVq1apeTkZL399tu65ZZb5HQ6m3GF5vriiy/0zTffKDMzU+vXr9cLL7zA7Gop\nLy9PPXv21JYtW7Ry5Uqlp6czuzp47bXX1L59e0n8vfXXvffeq82bN2vz5s1auHBhi5pfiw41tyf1\nn81m07p16+RwOHzbCgoKFBsbK0kaMmSI8vPzm2t5Rrvnnnv08ssvS5JCQ0NVVlbG7GopISFBTzzx\nhCTp2LFj6tSpE7PzU1FRkQ4dOqQHHnhAEn9v66slza9Fh5rbk/ovMDBQwcHBVbaVlZX5LvlEREQw\nw+uwWq0KCQmRJDmdTg0ePJjZ+WncuHF65plnNG/ePGbnp2XLliktLc33NfPzz6FDhzRlyhQlJSXp\n888/b1Hza/HvUV+JT5rVHzOs2Y4dO+R0OrVhwwYNHTrUt53Z1ezdd9/VV199pVmzZlWZF7Or3l/+\n8hf17t1bXbp0ueb3mV/1br31Vk2bNk3Dhw/X4cOHNWHCBFVWVvq+b/r8WnSouT1pwwgJCVF5ebmC\ng4N14sSJKpfFUdVnn32mtWvXav369bLb7cyulg4cOKCIiAjdcMMN6tGjhyorK9W2bVtmV0u7du3S\n4cOHtWvXLh0/flw2m40/e37o1KmTEhISJEk333yzOnbsqMLCwhYzvxZ96ZvbkzaM6Oho3xxzc3M1\naNCgZl6Rmc6dO6fly5fr9ddfV1hYmCRmV1t79+7Vhg0bJP34llVpaSmz88PKlSu1detW/fnPf9aY\nMWM0depU5ueHrKwsvfnmm5Ikt9utU6dOadSoUS1mfi3+zmQvvvii9u7d67s9affu3Zt7SUY7cOCA\nli1bpu+//16BgYHq1KmTXnytKYqYAAAArElEQVTxRaWlpenixYvq3Lmzli5dqqCgoOZeqnEyMzO1\nevVq/fKXv/RtS09P14IFC5hdDcrLyzV//nwdO3ZM5eXlmjZtmnr27Kk5c+YwOz+tXr1aN954owYO\nHMj8aun8+fN65plnVFJSooqKCk2bNk09evRoMfNr8aEGAKA1a9GXvgEAaO0INQAABiPUAAAYjFAD\nAGAwQg0AgMEINQAABiPUAAAYjFADAGCw/wdkB5RjykY3PgAAAABJRU5ErkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "XtYZ7114n3b-" + }, + "cell_type": "markdown", + "source": [ + "## Accessing Data\n", + "\n", + "You can access `DataFrame` data using familiar Python dict/list operations:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "_TFm7-looBFF", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 104 + }, + "outputId": "57aede33-3c05-4f1c-8509-78060101057d" + }, + "cell_type": "code", + "source": [ + "cities = pd.DataFrame({ 'City name': city_names, 'Population': population })\n", + "print(type(cities['City name']))\n", + "cities['City name']" + ], + "execution_count": 8, + "outputs": [ + { + "output_type": "stream", + "text": [ + "\n" + ], + "name": "stdout" + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 San Francisco\n", + "1 San Jose\n", + "2 Sacramento\n", + "Name: City name, dtype: object" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 8 + } + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "V5L6xacLoxyv", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 52 + }, + "outputId": "c6c3f30a-067e-4fef-ca63-51aae3939f62" + }, + "cell_type": "code", + "source": [ + "print(type(cities['City name'][1]))\n", + "cities['City name'][1]" + ], + "execution_count": 9, + "outputs": [ + { + "output_type": "stream", + "text": [ + "\n" + ], + "name": "stdout" + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "'San Jose'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 9 + } + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "gcYX1tBPugZl", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 129 + }, + "outputId": "41ecc9c0-2935-4afe-a530-97b64431fc29" + }, + "cell_type": "code", + "source": [ + "print(type(cities[0:2]))\n", + "cities[0:2]" + ], + "execution_count": 10, + "outputs": [ + { + "output_type": "stream", 
+ "text": [ + "<class 'pandas.core.frame.DataFrame'>\n" + ], + "name": "stdout" + }, + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " City name Population\n", + "0 San Francisco 852469\n", + "1 San Jose 1015785" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 10 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "65g1ZdGVjXsQ" + }, + "cell_type": "markdown", + "source": [ + "In addition, *pandas* provides an extremely rich API for advanced [indexing and selection](http://pandas.pydata.org/pandas-docs/stable/indexing.html) that is too extensive to be covered here." + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "RM1iaD-ka3Y1" + }, + "cell_type": "markdown", + "source": [ + "## Manipulating Data\n", + "\n", + "You may apply Python's basic arithmetic operations to `Series`. For example:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "XWmyCFJ5bOv-", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 86 + }, + "outputId": "daaa5c1b-3eda-4b83-df5c-0c25a24f3b89" + }, + "cell_type": "code", + "source": [ + "population / 1000." + ], + "execution_count": 11, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 852.469\n", + "1 1015.785\n", + "2 485.199\n", + "dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 11 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "TQzIVnbnmWGM" + }, + "cell_type": "markdown", + "source": [ + "[NumPy](http://www.numpy.org/) is a popular toolkit for scientific computing. 
*pandas* `Series` can be used as arguments to most NumPy functions:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "ko6pLK6JmkYP", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 86 + }, + "outputId": "f4e256ae-81a6-4f43-eed7-80a0cb371482" + }, + "cell_type": "code", + "source": [ + "import numpy as np\n", + "\n", + "np.log(population)" + ], + "execution_count": 12, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 13.655892\n", + "1 13.831172\n", + "2 13.092314\n", + "dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 12 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "xmxFuQmurr6d" + }, + "cell_type": "markdown", + "source": [ + "For more complex single-column transformations, you can use `Series.apply`. Like the Python [map function](https://docs.python.org/2/library/functions.html#map), \n", + "`Series.apply` accepts as an argument a [lambda function](https://docs.python.org/2/tutorial/controlflow.html#lambda-expressions), which is applied to each value.\n", + "\n", + "The example below creates a new `Series` that indicates whether `population` is over one million:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "Fc1DvPAbstjI", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 86 + }, + "outputId": "a53dfe27-2832-4927-acb4-c097eefc16a5" + }, + "cell_type": "code", + "source": [ + "population.apply(lambda val: val > 1000000)" + ], + "execution_count": 13, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 False\n", + "1 True\n", + "2 False\n", + "dtype: bool" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 13 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "ZeYYLoV9b9fB" + }, + "cell_type": "markdown", + "source": [ + "\n", + "Modifying `DataFrames` is also straightforward. 
For example, the following code adds two `Series` to an existing `DataFrame`:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "0gCEX99Hb8LR", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 143 + }, + "outputId": "b093cd85-f4e2-4c35-a936-8cce335552c3" + }, + "cell_type": "code", + "source": [ + "cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])\n", + "cities['Population density'] = cities['Population'] / cities['Area square miles']\n", + "cities" + ], + "execution_count": 14, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " City name Population Area square miles Population density\n", + "0 San Francisco 852469 46.87 18187.945381\n", + "1 San Jose 1015785 176.53 5754.177760\n", + "2 Sacramento 485199 97.92 4955.055147" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 14 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "6qh63m-ayb-c" + }, + "cell_type": "markdown", + "source": [ + "## Exercise #1\n", + "\n", + "Modify the `cities` table by adding a new boolean column that is True if and only if *both* of the following are True:\n", + "\n", + " * The city is named after a saint.\n", + " * The city has an area greater than 50 square miles.\n", + "\n", + "**Note:** Boolean `Series` are combined using the bitwise, rather than the traditional boolean, operators. For example, when performing *logical and*, use `&` instead of `and`.\n", + "\n", + "**Hint:** \"San\" in Spanish means \"saint.\"" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "zCOn8ftSyddH", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 143 + }, + "outputId": "59689928-0c91-43d2-963f-6e5d7c16a527" + }, + "cell_type": "code", + "source": [ + "# Your code here\n", + "cities['Is wide and has saint name'] = (cities['Area square miles'] > 50) & cities['City name'].apply(lambda name: name.startswith('San'))\n", + "cities" + ], + "execution_count": 15, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " City name Population Area square miles Population density \\\n", + "0 San Francisco 852469 46.87 18187.945381 \n", + "1 San Jose 1015785 176.53 5754.177760 \n", + "2 Sacramento 485199 97.92 4955.055147 \n", + "\n", + " Is wide and has saint name \n", + "0 False \n", + "1 True \n", + "2 False " + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 15 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "YHIWvc9Ms-Ll" + }, + "cell_type": "markdown", + "source": [ + "### Solution\n", + "\n", + "Click below for a solution." + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "T5OlrqtdtCIb", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 143 + }, + "outputId": "68302cf2-5454-4e78-c2a4-de3617cfd167" + }, + "cell_type": "code", + "source": [ + "cities['Is wide and has saint name'] = (cities['Area square miles'] > 50) & cities['City name'].apply(lambda name: name.startswith('San'))\n", + "cities" + ], + "execution_count": 16, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " City name Population Area square miles Population density \\\n", + "0 San Francisco 852469 46.87 18187.945381 \n", + "1 San Jose 1015785 176.53 5754.177760 \n", + "2 Sacramento 485199 97.92 4955.055147 \n", + "\n", + " Is wide and has saint name \n", + "0 False \n", + "1 True \n", + "2 False " + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 16 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "f-xAOJeMiXFB" + }, + "cell_type": "markdown", + "source": [ + "## Indexes\n", + "Both `Series` and `DataFrame` objects also define an `index` property that assigns an identifier value to each `Series` item or `DataFrame` row. \n", + "\n", + "By default, at construction, *pandas* assigns index values that reflect the ordering of the source data. Once created, the index values are stable; that is, they do not change when data is reordered." + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "2684gsWNinq9", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "b11186b3-1c6a-4aa4-ebd7-00818a7394e7" + }, + "cell_type": "code", + "source": [ + "city_names.index" + ], + "execution_count": 17, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "RangeIndex(start=0, stop=3, step=1)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 17 + } + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "F_qPe2TBjfWd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "be2d6e8d-d7ed-4259-e79b-7011b121c3ea" + }, + "cell_type": "code", + "source": [ + "cities.index" + ], + "execution_count": 18, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "RangeIndex(start=0, stop=3, step=1)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "hp2oWY9Slo_h" + }, + "cell_type": "markdown", + "source": 
[ + "Call `DataFrame.reindex` to manually reorder the rows. For example, the following has the same effect as sorting by city name:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "sN0zUzSAj-U1", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 143 + }, + "outputId": "a2f5d075-3435-4209-9e7b-c6bd3a1317b7" + }, + "cell_type": "code", + "source": [ + "cities.reindex([2, 0, 1])" + ], + "execution_count": 19, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " City name Population Area square miles Population density \\\n", + "2 Sacramento 485199 97.92 4955.055147 \n", + "0 San Francisco 852469 46.87 18187.945381 \n", + "1 San Jose 1015785 176.53 5754.177760 \n", + "\n", + " Is wide and has saint name \n", + "2 False \n", + "0 False \n", + "1 True " + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 19 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "-GQFz8NZuS06" + }, + "cell_type": "markdown", + "source": [ + "Reindexing is a great way to shuffle (randomize) a `DataFrame`. In the example below, we take the index, which is array-like, and pass it to NumPy's `random.permutation` function, which returns a shuffled copy of its values and leaves the original index unchanged. Calling `reindex` with this shuffled array causes the `DataFrame` rows to be shuffled in the same way.\n", + "Try running the following cell multiple times!" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "mF8GC0k8uYhz", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 143 + }, + "outputId": "d073772b-41d3-4260-d9d4-12520da345ec" + }, + "cell_type": "code", + "source": [ + "cities.reindex(np.random.permutation(cities.index))" + ], + "execution_count": 20, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " City name Population Area square miles Population density \\\n", + "0 San Francisco 852469 46.87 18187.945381 \n", + "1 San Jose 1015785 176.53 5754.177760 \n", + "2 Sacramento 485199 97.92 4955.055147 \n", + "\n", + " Is wide and has saint name \n", + "0 False \n", + "1 True \n", + "2 False " + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 20 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "fSso35fQmGKb" + }, + "cell_type": "markdown", + "source": [ + "For more information, see the [Index documentation](http://pandas.pydata.org/pandas-docs/stable/indexing.html#index-objects)." + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "8UngIdVhz8C0" + }, + "cell_type": "markdown", + "source": [ + "## Exercise #2\n", + "\n", + "The `reindex` method allows index values that are not in the original `DataFrame`'s index values. Try it and see what happens if you use such values! Why do you think this is allowed?" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "PN55GrDX0jzO", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 175 + }, + "outputId": "6d125adc-a36f-4717-b964-8b204a63b250" + }, + "cell_type": "code", + "source": [ + "# Your code here\n", + "cities.reindex([0, 4, 5, 2])" + ], + "execution_count": 21, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " City name Population Area square miles Population density \\\n", + "0 San Francisco 852469.0 46.87 18187.945381 \n", + "4 NaN NaN NaN NaN \n", + "5 NaN NaN NaN NaN \n", + "2 Sacramento 485199.0 97.92 4955.055147 \n", + "\n", + " Is wide and has saint name \n", + "0 False \n", + "4 NaN \n", + "5 NaN \n", + "2 False " + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 21 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "TJffr5_Jwqvd" + }, + "cell_type": "markdown", + "source": [ + "### Solution\n", + "\n", + "Click below for the solution." + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "8oSvi2QWwuDH" + }, + "cell_type": "markdown", + "source": [ + "If your `reindex` input array includes values not in the original `DataFrame` index values, `reindex` will add new rows for these \"missing\" indices and populate all corresponding columns with `NaN` values:" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "yBdkucKCwy4x", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 175 + }, + "outputId": "6ef4d8ea-5e17-4387-c0f5-3f5ff8f28cb6" + }, + "cell_type": "code", + "source": [ + "cities.reindex([0, 4, 5, 2])" + ], + "execution_count": 22, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " City name Population Area square miles Population density \\\n", + "0 San Francisco 852469.0 46.87 18187.945381 \n", + "4 NaN NaN NaN NaN \n", + "5 NaN NaN NaN NaN \n", + "2 Sacramento 485199.0 97.92 4955.055147 \n", + "\n", + " Is wide and has saint name \n", + "0 False \n", + "4 NaN \n", + "5 NaN \n", + "2 False " + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 22 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "2l82PhPbwz7g" + }, + "cell_type": "markdown", + "source": [ + "This behavior is desirable because indexes are often strings pulled from the actual data (see the [*pandas* reindex\n", + "documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html) for an example\n", + "in which the index values are browser names).\n", + "\n", + "In this case, allowing \"missing\" indices makes it easy to reindex using an external list, as you don't have to worry about\n", + "sanitizing the input." + ] + } + ] +} \ No newline at end of file