Progress bar implementation (ydataai#345)

* Progress bar implementation
  - Feature as requested in ydataai#224
  - Test for ydataai#282
  - Many thanks @marco-cardoso for your initial implementation ydataai#225
  - Display no progress bar for disabled modules (e.g. individual correlations).
  - Update requirements, notebooks, docs, examples, linting
* Decouple notebooks and notebook tests. One test hangs on issue in nbval: computationalmodelling/nbval#136
* Disable missing plots in minimal mode
* Create additional demo with Chicago employees data
* Compartmentalize column sorting in describe module
chanedwin · Feb 2, 2020 · 8dca684
1 parent a25b9db
commit 8dca684

Showing 38 changed files with 138,073 additions and 42,064 deletions.
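
The central pattern in this commit is a `tqdm` progress bar that a boolean setting can switch off. A minimal sketch of that pattern follows. It is an illustration only, not the project's code: `run_step` and `show_progress` are hypothetical names, and the sketch assumes `tqdm` is installed.

```python
from tqdm.auto import tqdm


def run_step(step, description, show_progress=True):
    """Run a zero-argument callable inside a single-step progress bar.

    `run_step` and `show_progress` are hypothetical names for this sketch.
    """
    # tqdm's `disable` flag suppresses the bar, so the user-facing
    # "show" flag has to be inverted before it is passed in.
    with tqdm(total=1, desc=description, disable=not show_progress) as pbar:
        result = step()
        pbar.update(1)  # mark the single unit of work as done
    return result


# With show_progress=False the step still runs, but nothing is drawn.
report = run_step(lambda: {"sections": 3}, "build report structure", show_progress=False)
```

Inverting the flag once, at the call site, keeps the public option phrased positively (`progress_bar: True` shows a bar) while still matching `tqdm`'s `disable` parameter.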
diff --git a/.travis.yml b/.travis.yml
@@ -19,11 +19,6 @@ env:
   - TEST=examples
   - TEST=lint
 
-jobs:
-  exclude:
-    - python: "3.5"
-      env: TEST=examples
-
 install:
   - pip install --upgrade pip six
   - pip install -r requirements.txt
@@ -33,7 +28,7 @@ install:
 script:
   - if [ $TEST == 'unit' ]; then pytest --cov=. tests/unit/; fi
   - if [ $TEST == 'issue' ]; then pytest --cov=. tests/issues/; fi
-  - if [ $TEST == 'examples' ]; then pytest --cov=. --nbval --sanitize-with tests/sanitize-notebook.cfg examples/; fi
+  - if [ $TEST == 'examples' ]; then pytest --cov=. --nbval tests/notebooks/; fi
   - if [ $TEST == 'console' ]; then pandas_profiling -h; fi
   - if [ $TEST == 'lint' ]; then pytest --black -m black src/; flake8 . --select=E9,F63,F7,F82 --show-source --statistics; fi
 

diff --git a/Makefile b/Makefile
@@ -4,8 +4,9 @@ docs:
 	rmdir docs/pandas_profiling
 
 test:
-    pytest --nbval --cov=./ --black --sanitize-with tests/sanitize-notebook.cfg tests/unit/
-    pytest --nbval --cov=./ --black --sanitize-with tests/sanitize-notebook.cfg tests/issues/
+    pytest --black tests/unit/
+    pytest --black tests/issues/
+    pytest --nbval tests/notebooks/
     flake8 . --select=E9,F63,F7,F82 --show-source --statistics
 
 install:

diff --git a/README.md b/README.md
@@ -179,6 +179,7 @@ A set of options is available in order to adapt the report generated.
 
 * `title` (`str`): Title for the report ('Pandas Profiling Report' by default).
 * `pool_size` (`int`): Number of workers in thread pool. When set to zero, it is set to the number of CPUs available (0 by default).
+* `progress_bar` (`bool`): If True, `pandas-profiling` will display a progress bar.
 
 More settings can be found in the [default configuration file](https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_default.yaml), [minimal configuration file](https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_minimal.yaml) and [dark themed configuration file](https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_dark.yaml).
 

diff --git a/docs/index.html b/docs/index.html
@@ -26,7 +26,7 @@ <h1 id="pandas-profiling">Pandas Profiling</h1>
 <p><a href="https://travis-ci.com/pandas-profiling/pandas-profiling"><img alt="Build Status" src="https://travis-ci.com/pandas-profiling/pandas-profiling.svg?branch=master"></a>
 <a href="https://codecov.io/gh/pandas-profiling/pandas-profiling"><img alt="Code Coverage" src="https://codecov.io/gh/pandas-profiling/pandas-profiling/branch/master/graph/badge.svg?token=gMptB4YUnF"></a>
 <a href="https://github.com/pandas-profiling/pandas-profiling/releases"><img alt="Release Version" src="https://img.shields.io/github/release/pandas-profiling/pandas-profiling.svg"></a>
-<a href="https://pypi.org/project/pandas-profiling/"><img alt="Python Version" src="https://img.shields.io/badge/python-3.5%20%7C%203.6%20%7C%203.7-blue.svg"></a>
+<a href="https://pypi.org/project/pandas-profiling/"><img alt="Python Version" src="https://img.shields.io/pypi/pyversions/pandas-profiling"></a>
 <a href="https://github.com/python/black"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-000000.svg"></a></p>
 <p>Generates profile reports from a pandas <code>DataFrame</code>.
 The pandas <code>df.describe()</code> function is great but a little basic for serious exploratory data analysis.
@@ -41,6 +41,7 @@ <h1 id="pandas-profiling">Pandas Profiling</h1>
 <li><strong>Histogram</strong></li>
 <li><strong>Correlations</strong> highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices</li>
 <li><strong>Missing values</strong> matrix, count, heatmap and dendrogram of missing values</li>
+<li><strong>Text analysis</strong> learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.</li>
 </ul>
 <h2 id="announcements">Announcements</h2>
 <p>With your help, we got approved for <a href="https://github.com/sponsors/sbrugman">GitHub Sponsors</a>!
@@ -71,6 +72,7 @@ <h2 id="examples">Examples</h2>
 <li><a href="http://pandas-profiling.github.io/pandas-profiling/examples/vektis/vektis_report.html">Vektis</a> (Vektis Dutch Healthcare data)</li>
 <li><a href="http://pandas-profiling.github.io/pandas-profiling/examples/website_inaccessibility/website_inaccessibility_report.html">Website Inaccessibility</a> (demonstrates the URL type)</li>
 <li><a href="http://pandas-profiling.github.io/pandas-profiling/examples/colors/colors_report.html">Colors</a> (a simple colors dataset)</li>
+<li><a href="http://pandas-profiling.github.io/pandas-profiling/examples/russian_vocabulary/russian_vocabulary.html">Russian Vocabulary</a> (demonstrates text analysis)</li>
 </ul>
 <h2 id="installation">Installation</h2>
 <h3 id="using-pip">Using pip</h3>
@@ -108,7 +110,7 @@ <h3 id="getting-started">Getting started</h3>
 )
 </code></pre>
 <p>To generate the report, run:</p>
-<pre><code class="python">profile = ProfileReport(df, title='Pandas Profiling Report', style={'full_width':True})
+<pre><code class="python">profile = ProfileReport(df, title='Pandas Profiling Report', html={'style':{'full_width':True}})
 </code></pre>
 <h4 id="jupyter-notebook">Jupyter Notebook</h4>
 <p>We recommend generating reports interactively by using the Jupyter notebook.
@@ -150,6 +152,7 @@ <h3 id="advanced-usage">Advanced usage</h3>
 <ul>
 <li><code>title</code> (<code>str</code>): Title for the report ('Pandas Profiling Report' by default).</li>
 <li><code>pool_size</code> (<code>int</code>): Number of workers in thread pool. When set to zero, it is set to the number of CPUs available (0 by default).</li>
+<li><code>progress_bar</code> (<code>bool</code>): If True, <code>pandas-profiling</code> will display a progress bar.</li>
 </ul>
 <p>More settings can be found in the <a href="https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_default.yaml">default configuration file</a>, <a href="https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_minimal.yaml">minimal configuration file</a> and <a href="https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_dark.yaml">dark themed configuration file</a>.</p>
 <p><strong>Example</strong></p>
@@ -261,13 +264,16 @@ <h2 id="dependencies">Dependencies</h2>
 
 import pandas as pd
 import numpy as np
+from tqdm.auto import tqdm
 
+from pandas_profiling.model.messages import MessageType
 from pandas_profiling.version import __version__
-from pandas_profiling.utils.dataframe import clean_column_names, rename_index
+from pandas_profiling.utils.dataframe import rename_index
 from pandas_profiling.utils.paths import get_config_default, get_config_minimal
 from pandas_profiling.config import config
 from pandas_profiling.controller import pandas_decorator
 from pandas_profiling.model.describe import describe as describe_df
+from pandas_profiling.model.messages import MessageType
 from pandas_profiling.report import get_report_structure
 
 
@@ -305,12 +311,8 @@ <h2 id="dependencies">Dependencies</h2>
         # Rename reserved column names
         df = rename_index(df)
 
-        # Remove spaces and colons from column names
-        df = clean_column_names(df)
-
-        # Sort names according to config (asc, desc, no sort)
-        df = self.sort_column_names(df)
-        config[&#34;column_order&#34;] = df.columns.tolist()
+        # Ensure that columns are strings
+        df.columns = df.columns.astype(&#34;str&#34;)
 
         # Get dataset statistics
         description_set = describe_df(df)
@@ -319,26 +321,17 @@ <h2 id="dependencies">Dependencies</h2>
         self.sample = self.get_sample(df)
         self.title = config[&#34;title&#34;].get(str)
         self.description_set = description_set
-
         self.date_end = datetime.utcnow()
-        self.report = get_report_structure(
-            self.date_start, self.date_end, self.sample, description_set
-        )
 
-    def sort_column_names(self, df):
-        sort = config[&#34;sort&#34;].get(str)
-        if sys.version_info[1] &lt;= 5 and sort != &#34;None&#34;:
-            warnings.warn(&#34;Sorting is supported from Python 3.6+&#34;)
+        disable_progress_bar = not config[&#34;progress_bar&#34;].get(bool)
 
-        if sort in [&#34;asc&#34;, &#34;ascending&#34;]:
-            df = df.reindex(sorted(df.columns, key=lambda s: s.casefold()), axis=1)
-        elif sort in [&#34;desc&#34;, &#34;descending&#34;]:
-            df = df.reindex(
-                reversed(sorted(df.columns, key=lambda s: s.casefold())), axis=1
+        with tqdm(
+            total=1, desc=&#34;build report structure&#34;, disable=disable_progress_bar
+        ) as pbar:
+            self.report = get_report_structure(
+                self.date_start, self.date_end, self.sample, description_set
             )
-        elif sort != &#34;None&#34;:
-            raise ValueError(&#39;&#34;sort&#34; should be &#34;ascending&#34;, &#34;descending&#34; or None.&#39;)
-        return df
+            pbar.update(1)
 
     def get_sample(self, df: pd.DataFrame) -&gt; dict:
         sample = {}
@@ -360,7 +353,7 @@ <h2 id="dependencies">Dependencies</h2>
         &#34;&#34;&#34;
         return self.description_set
 
-    def get_rejected_variables() -&gt; list:
+    def get_rejected_variables(self) -&gt; list:
         return [
             message.column_name
             for message in self.description_set[&#34;messages&#34;]
@@ -592,12 +585,8 @@ <h2 class="section-title" id="header-classes">Classes</h2>
         # Rename reserved column names
         df = rename_index(df)
 
-        # Remove spaces and colons from column names
-        df = clean_column_names(df)
-
-        # Sort names according to config (asc, desc, no sort)
-        df = self.sort_column_names(df)
-        config[&#34;column_order&#34;] = df.columns.tolist()
+        # Ensure that columns are strings
+        df.columns = df.columns.astype(&#34;str&#34;)
 
         # Get dataset statistics
         description_set = describe_df(df)
@@ -606,26 +595,17 @@ <h2 class="section-title" id="header-classes">Classes</h2>
         self.sample = self.get_sample(df)
         self.title = config[&#34;title&#34;].get(str)
         self.description_set = description_set
-
         self.date_end = datetime.utcnow()
-        self.report = get_report_structure(
-            self.date_start, self.date_end, self.sample, description_set
-        )
 
-    def sort_column_names(self, df):
-        sort = config[&#34;sort&#34;].get(str)
-        if sys.version_info[1] &lt;= 5 and sort != &#34;None&#34;:
-            warnings.warn(&#34;Sorting is supported from Python 3.6+&#34;)
+        disable_progress_bar = not config[&#34;progress_bar&#34;].get(bool)
 
-        if sort in [&#34;asc&#34;, &#34;ascending&#34;]:
-            df = df.reindex(sorted(df.columns, key=lambda s: s.casefold()), axis=1)
-        elif sort in [&#34;desc&#34;, &#34;descending&#34;]:
-            df = df.reindex(
-                reversed(sorted(df.columns, key=lambda s: s.casefold())), axis=1
+        with tqdm(
+            total=1, desc=&#34;build report structure&#34;, disable=disable_progress_bar
+        ) as pbar:
+            self.report = get_report_structure(
+                self.date_start, self.date_end, self.sample, description_set
             )
-        elif sort != &#34;None&#34;:
-            raise ValueError(&#39;&#34;sort&#34; should be &#34;ascending&#34;, &#34;descending&#34; or None.&#39;)
-        return df
+            pbar.update(1)
 
     def get_sample(self, df: pd.DataFrame) -&gt; dict:
         sample = {}
@@ -647,7 +627,7 @@ <h2 class="section-title" id="header-classes">Classes</h2>
         &#34;&#34;&#34;
         return self.description_set
 
-    def get_rejected_variables() -&gt; list:
+    def get_rejected_variables(self) -&gt; list:
         return [
             message.column_name
             for message in self.description_set[&#34;messages&#34;]
@@ -823,15 +803,15 @@ <h2 id="returns">Returns</h2>
 </details>
 </dd>
 <dt id="pandas_profiling.ProfileReport.get_rejected_variables"><code class="name flex">
-<span>def <span class="ident">get_rejected_variables</span></span>(<span>)</span>
+<span>def <span class="ident">get_rejected_variables</span></span>(<span>self)</span>
 </code></dt>
 <dd>
 <section class="desc"></section>
 <details class="source">
 <summary>
 <span>Expand source code</span>
 </summary>
-<pre><code class="python">def get_rejected_variables() -&gt; list:
+<pre><code class="python">def get_rejected_variables(self) -&gt; list:
     return [
         message.column_name
         for message in self.description_set[&#34;messages&#34;]
@@ -861,31 +841,6 @@ <h2 id="returns">Returns</h2>
     return sample</code></pre>
 </details>
 </dd>
-<dt id="pandas_profiling.ProfileReport.sort_column_names"><code class="name flex">
-<span>def <span class="ident">sort_column_names</span></span>(<span>self, df)</span>
-</code></dt>
-<dd>
-<section class="desc"></section>
-<details class="source">
-<summary>
-<span>Expand source code</span>
-</summary>
-<pre><code class="python">def sort_column_names(self, df):
-    sort = config[&#34;sort&#34;].get(str)
-    if sys.version_info[1] &lt;= 5 and sort != &#34;None&#34;:
-        warnings.warn(&#34;Sorting is supported from Python 3.6+&#34;)
-
-    if sort in [&#34;asc&#34;, &#34;ascending&#34;]:
-        df = df.reindex(sorted(df.columns, key=lambda s: s.casefold()), axis=1)
-    elif sort in [&#34;desc&#34;, &#34;descending&#34;]:
-        df = df.reindex(
-            reversed(sorted(df.columns, key=lambda s: s.casefold())), axis=1
-        )
-    elif sort != &#34;None&#34;:
-        raise ValueError(&#39;&#34;sort&#34; should be &#34;ascending&#34;, &#34;descending&#34; or None.&#39;)
-    return df</code></pre>
-</details>
-</dd>
 <dt id="pandas_profiling.ProfileReport.to_app"><code class="name flex">
 <span>def <span class="ident">to_app</span></span>(<span>self)</span>
 </code></dt>
@@ -1157,7 +1112,6 @@ <h4><code><a title="pandas_profiling.ProfileReport" href="#pandas_profiling.Prof
 <li><code><a title="pandas_profiling.ProfileReport.get_rejected_variables" href="#pandas_profiling.ProfileReport.get_rejected_variables">get_rejected_variables</a></code></li>
 <li><code><a title="pandas_profiling.ProfileReport.get_sample" href="#pandas_profiling.ProfileReport.get_sample">get_sample</a></code></li>
 <li><code><a title="pandas_profiling.ProfileReport.html" href="#pandas_profiling.ProfileReport.html">html</a></code></li>
-<li><code><a title="pandas_profiling.ProfileReport.sort_column_names" href="#pandas_profiling.ProfileReport.sort_column_names">sort_column_names</a></code></li>
 <li><code><a title="pandas_profiling.ProfileReport.to_app" href="#pandas_profiling.ProfileReport.to_app">to_app</a></code></li>
 <li><code><a title="pandas_profiling.ProfileReport.to_file" href="#pandas_profiling.ProfileReport.to_file">to_file</a></code></li>
 <li><code><a title="pandas_profiling.ProfileReport.to_html" href="#pandas_profiling.ProfileReport.to_html">to_html</a></code></li>
@@ -1177,4 +1131,4 @@ <h4><code><a title="pandas_profiling.ProfileReport" href="#pandas_profiling.Prof
 <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/highlight.min.js"></script>
 <script>hljs.initHighlightingOnLoad()</script>
 </body>
-</html>
+</html>
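
The `get_rejected_variables` hunks above add a missing `self` parameter. The sketch below shows why the original signature could not work when called on an instance. The `Report` class and its method names are hypothetical stand-ins, not the library's API.

```python
class Report:
    """Hypothetical stand-in for a report object holding (column, rejected) pairs."""

    def __init__(self, messages):
        self.messages = messages

    # Missing `self`: Python still passes the instance as the first
    # positional argument, so calling this on an instance raises TypeError.
    def rejected_broken() -> list:
        return []

    # With `self`, the method can read the instance's state.
    def rejected_fixed(self) -> list:
        return [name for name, rejected in self.messages if rejected]


report = Report([("a", True), ("b", False)])

try:
    report.rejected_broken()
    broken_error = None
except TypeError as exc:
    broken_error = exc

fixed = report.rejected_fixed()
```

Here `broken_error` ends up holding the `TypeError` raised by the call, while `fixed` contains only the column flagged as rejected.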