ENH adding write_html to TableReport #1190

Open
wants to merge 15 commits into
base: main
3 changes: 3 additions & 0 deletions CHANGES.rst
@@ -29,6 +29,9 @@ Release 0.4.1

Changes
-------
* :class:`TableReport` has a ``write_html`` method.
  :pr:`1190` by :user:`Mojdeh Rastgoo<mrastgoo>`.

* A new parameter ``verbose`` has been added to the :class:`TableReport` to toggle on or off the
  printing of progress information when a report is being generated.
  :pr:`1182` by :user:`Priscilla Baah<priscilla-b>`.
38 changes: 38 additions & 0 deletions skrub/_reporting/_table_report.py
@@ -1,5 +1,7 @@
import codecs
import functools
import json
from pathlib import Path

from ._html import to_html
from ._serve import open_in_browser
@@ -197,6 +199,42 @@ def _repr_mimebundle_(self, include=None, exclude=None):
    def _repr_html_(self):
        return self._repr_mimebundle_()["text/html"]

    def write_html(self, file):
        """Store the report into an HTML file.

        Parameters
        ----------
        file : str, pathlib.Path or file object
            The file object or path of the file to store the HTML output.
        """
        html = self.html()
        if isinstance(file, (str, Path)):
            with open(file, "w", encoding="utf8") as stream:
                stream.write(html)
            return

        try:
            # We don't have information about the write mode of the provided
            # file object. We start by writing bytes into it.
            file.write(html.encode("utf-8"))
            return
        except TypeError:
            # We end up here if the file object was opened in text mode.
            # Let's give it another chance in this mode.
            pass

        if (encoding := getattr(file, "encoding", None)) is not None:
            try:
                assert codecs.lookup(encoding).name == "utf-8"
            except (AssertionError, LookupError):
                raise ValueError(
                    "If `file` is a text file it should use utf-8 encoding; got:"
                    f" {encoding!r}"
                )
        # We write into the file object expecting it to be in text mode at this
        # stage and with a UTF-8 encoding.
        file.write(html)
Member

Could you add a comment explaining what html is expected (or not expected) to be at line 230?
Additionally, light inline documentation/comments on the steps above would help readability :)
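
For reference, the three branches above can be read as a standalone sketch. This is not the PR's code: the helper name `write_text_or_bytes` and the sample HTML string are assumptions made for illustration. It also swaps the `assert` for an explicit comparison, since assertions are skipped when Python runs with `-O`.

```python
import codecs
import io
from pathlib import Path


def write_text_or_bytes(html, file):
    """Hypothetical standalone version of the branching in write_html."""
    # 1. A path: open it ourselves, always as UTF-8 text.
    if isinstance(file, (str, Path)):
        with open(file, "w", encoding="utf-8") as stream:
            stream.write(html)
        return

    # 2. A file object of unknown mode: try writing bytes first.
    try:
        file.write(html.encode("utf-8"))
        return
    except TypeError:
        # Text-mode streams reject bytes, so fall through to text mode.
        pass

    # 3. A text stream: check the declared encoding, if there is one.
    encoding = getattr(file, "encoding", None)
    if encoding is not None:
        try:
            # codecs.lookup normalizes aliases such as "utf8" or "UTF-8".
            is_utf8 = codecs.lookup(encoding).name == "utf-8"
        except LookupError:
            is_utf8 = False
        if not is_utf8:
            raise ValueError(
                f"If `file` is a text file it should use utf-8 encoding; got: {encoding!r}"
            )
    file.write(html)


# In-memory streams stand in for files opened in "wb" and "w" mode.
binary_stream = io.BytesIO()
write_text_or_bytes("<html>é</html>", binary_stream)

text_stream = io.StringIO()
write_text_or_bytes("<html>é</html>", text_stream)
```

In this sketch `html` is always a `str`; the bytes are only produced at the point where the target turns out to accept them.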


    def open(self):
        """Open the HTML report in a web browser."""
        open_in_browser(self.html())
55 changes: 55 additions & 0 deletions skrub/_reporting/tests/test_table_report.py
@@ -1,7 +1,11 @@
import contextlib
import datetime
import json
import re
import warnings
from pathlib import Path

import pytest

from skrub import TableReport, ToDatetime
from skrub import _dataframe as sbd
@@ -123,6 +127,57 @@ def test_duration(df_module):
    assert re.search(r"2(\.0)?\s+days", TableReport(df).html())


@pytest.mark.parametrize(
    "filename_type",
    ["str", "Path", "text_file_object", "binary_file_object"],
)
def test_write_html(tmp_path, pd_module, filename_type):
Member

This test is looking great! The last thing we need to take care of is closing the file if we opened it. Otherwise we "leak" a resource, the open file handle, which can for example prevent cleaning up the temp directory on Windows, because Windows refuses to remove files that still have an open handle. Usually we ensure that with a simple context manager like:

with open(tmp_file_path, 'w', encoding='utf-8') as file:
    file.write('hello')

But here the situation is trickier: in some cases we have a string or path (which requires no closing) and in others a file object (which does).

The standard library module contextlib provides 2 ways to deal with that situation easily. The first is ExitStack: it creates a context and we can push as many context managers as we want to its stack; when it exits it unwinds the stack, calling each manager's __exit__ when it is popped. So we could use it like:

with contextlib.ExitStack() as stack:
    if file_type == 'str':
        file = str(tmp_file_path)
    elif file_type == 'text_file_object':
        file = stack.enter_context(open(tmp_file_path, 'w', encoding='utf-8'))
    # ...

    report.write_html(file)

# if we opened it the file is closed here when we exit the `with` block

This option using ExitStack is my favorite because the file is being managed by a context manager as soon as it is opened.
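
A self-contained way to see this behavior, as a sketch only: the helper `open_target` and the use of `tempfile` are assumptions for illustration, not part of the PR. It records whether each target is closed once the `ExitStack` block has exited.

```python
import contextlib
import tempfile
from pathlib import Path


def open_target(kind, path, stack):
    """Return a str, a Path, or an open file registered on `stack`."""
    if kind == "str":
        return str(path)
    if kind == "text_file_object":
        return stack.enter_context(open(path, "w", encoding="utf-8"))
    if kind == "binary_file_object":
        return stack.enter_context(open(path, "wb"))
    return path


closed_after_exit = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for kind in ["str", "Path", "text_file_object", "binary_file_object"]:
        path = Path(tmp_dir) / f"{kind}.html"
        with contextlib.ExitStack() as stack:
            target = open_target(kind, path, stack)
        # Only real file objects expose a `closed` attribute;
        # for str and Path we record None.
        closed_after_exit[kind] = getattr(target, "closed", None)
```

Both file objects report `closed` as true after the block, while the `str` and `Path` cases pass through without anything being registered on the stack.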

Another way is to use nullcontext in the cases where we do not open the file, so that later we can treat all options as if they were open files that implement the context manager protocol. nullcontext returns an object that implements the context manager protocol but whose __enter__ just returns the object we gave it and __exit__ does nothing:

if file_type == 'str':
    file = contextlib.nullcontext(str(tmp_file_path))
elif file_type == 'text_file_object':
    file = open(tmp_file_path, 'w', encoding='utf-8')

with file as f:
    report.write_html(f)
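
A small runnable sketch of the `nullcontext` pattern, with a hypothetical `describe` function standing in for anything that accepts either a path or a stream. The value bound by `as` is what `__enter__` returns: the wrapped string for `nullcontext`, and the stream itself for a file-like object.

```python
import contextlib
import io


def describe(target):
    """Hypothetical stand-in for a function accepting a path or a stream."""
    return "path" if isinstance(target, str) else "stream"


results = []
candidates = [contextlib.nullcontext("report.html"), io.StringIO()]
for candidate in candidates:
    # `target` is the wrapped str for nullcontext, the stream otherwise.
    with candidate as target:
        results.append(describe(target))

# The stream was closed on exit; nullcontext had nothing to close.
stream_closed = candidates[1].closed
```

The `as` target is the value to pass on, because the `nullcontext` object itself is neither a string nor a file.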

Member

wow contextlib.ExitStack() is very nice

Member

Maybe you could add a comment on L139 to explain why it's a good idea to use contextlib here?

    df = pd_module.make_dataframe({"a": [1, 2], "b": [3, 4]})
    report = TableReport(df)

    tmp_file_path = tmp_path / Path("report.html")

    # making sure we are closing the open files, and dealing with the first
    # condition which doesn't require opening any file
    with contextlib.ExitStack() as stack:
        if filename_type == "str":
            filename = str(tmp_file_path)
        elif filename_type == "text_file_object":
            filename = stack.enter_context(open(tmp_file_path, "w", encoding="utf-8"))
        elif filename_type == "binary_file_object":
            filename = stack.enter_context(open(tmp_file_path, "wb"))
        else:
            filename = tmp_file_path

        report.write_html(filename)
        assert tmp_file_path.exists()
Member

We don't need to check the full content, but maybe we could check that the file contains "</html>" to confirm that the full report has been written.


    with open(tmp_file_path, "r", encoding="utf-8") as file:
        saved_content = file.read()
    assert "</html>" in saved_content


def test_write_html_with_not_utf8_encoding(tmp_path, pd_module):
    df = pd_module.make_dataframe({"a": [1, 2], "b": [3, 4]})
    report = TableReport(df)
    tmp_file_path = tmp_path / Path("report.html")

    with open(tmp_file_path, "w", encoding="latin-1") as file:
        encoding = getattr(file, "encoding", None)
        with pytest.raises(
            ValueError,
            match=(
                "If `file` is a text file it should use utf-8 encoding; got:"
                f" {encoding!r}"
            ),
        ):
            report.write_html(file)

    with open(tmp_file_path, "r", encoding="latin-1") as file:
        saved_content = file.read()
    assert "</html>" not in saved_content


def test_verbosity_parameter(df_module, capsys):
    df = df_module.make_dataframe(
        dict(