ENH adding `write_html` to `TableReport` #1190

mrastgoo · 2024-12-07T20:44:01Z

Adding write_html to TableReport.
The added test only checks for the raise error and if the file exist
The following code for checking the content failed and needs further investigation

    with open(f_name, 'r') as file:
        saved_content = file.read()  
    assert saved_content == report.html()

skrub/_reporting/_table_report.py

Co-authored-by: Guillaume Lemaitre <[email protected]>

skrub/_reporting/_table_report.py

skrub/_reporting/tests/test_table_report.py

glemaitre · 2024-12-07T20:52:33Z

skrub/_reporting/tests/test_table_report.py

+    report = TableReport(df)
+
+    with TemporaryDirectory() as td:
+        f_name = Path(td) / Path("report.html")


Let's call it if filename

skrub/_reporting/tests/test_table_report.py

glemaitre · 2024-12-07T20:53:52Z

skrub/_reporting/tests/test_table_report.py

+        f_name = Path(td) / Path("report.html")
+
+        if filename_path == "str":
+            report.write_html(f_name.absolute())


Why do we need to pass .absolute here?

glemaitre · 2024-12-07T20:59:04Z

skrub/_reporting/tests/test_table_report.py

+        if filename_path == "str":
+            report.write_html(f_name.absolute())
+
+        if filename_path == "Path":
+            report.write_html(f_name)
+
+        if filename_path == "file_object":
+            file_object = open(f_name, "w", encoding="utf-8")
+            report.write_html(file_object)


I would prefer something like:

tmp_file_path = ... if filename_type == "str": filename = str(tmp_file_path) elif filename_type == "file_object": filename = open(tmp_file_path, "w", encoding="utf-8") else: filename = tmp_file_path report.write_html(filename) assert tmp_file_path.exists()

Co-authored-by: Guillaume Lemaitre <[email protected]>

jeromedockes

Hi @mrastgoo , thanks a lot for working on this! It's a great addition 🎉 🚀 . Even in the skrub tests and documentation scripts there are a few places we could use it :)

also the fact that the reports are self-contained and can be seen after the python process finishes is one of their main advantages (which required some compromises) and by adding this method you'll make that feature more visible and help users take advantage of it :)

jeromedockes · 2024-12-08T09:46:00Z

skrub/_reporting/_table_report.py

@@ -197,6 +198,30 @@ def _repr_mimebundle_(self, include=None, exclude=None):
    def _repr_html_(self):
        return self._repr_mimebundle_()["text/html"]

+    def write_html(self, filename):


This looks great! In the issue I forgot to mention a couple of things:

Sometimes generating the report can fail (unfortunately!). We want to make sure we close the file (if we opened it) in this case; also we may want to avoid truncating or creating it before we know we have something to write.

The report will always set the charset to utf-8 in the page's metadata. So if the user passes an open file we may optionally check the file's encoding if we want to be extra cautious. Also note unless the doctring we write forbids it the user could pass a binary stream so we may want to handle that case, too.

I'm not sure we need to enforce the filename suffix. I could imagine someone wanting to save to "report.htm" or "report.html.tmp" or something so I'd rather let the user name the file however they want.

If we allow both a path and a file object maybe the parameter could be just "file" instead of "filename"?

So taking those details into account I guess the function could be modified slightly to look similar to this:

def write_html(self, file): html = self.html() if isinstance(file, (str, Path)): with open(file, "w", encoding="utf8") as stream: stream.write(html) return try: file.write(html.encode("utf-8")) return except TypeError: pass if (encoding := getattr(file, "encoding", None)) is not None: try: assert codecs.lookup(encoding).name == "utf-8" except (AssertionError, LookupError): raise ValueError( f"If `file` is a text file it should use utf-8 encoding; got: {encoding!r}" ) file.write(html)

thank you comments @jeromedockes , I have couple of questions.

What are the reasons behind TypeError

if the encoding is None it seems the locale.encoding will be used, so we want to check for that too and raise an error if it is not utf-8 ?

Hi @mrastgoo ,

What are the reasons behind TypeError

Trying to write bytes and catching the TypeError is an easy way to detect if we are dealing with a text stream or a binary stream.

A file object (a.k.a. stream) can be:

In binary mode, in which case it expects bytes that can be written to a file without transformation. If we want to write text we must encode it into bytes ourselves. Passing a str to write raises a TypeError.

In text mode, in which case it expects strings and it will transparently encode them into bytes (and optionally translate newline characters) before writing the resulting bytes to the underlying binary buffer. Passing a bytes to write raises a TypeError (the one we catch).

>>> 'é' 'é' >>> 'é'.encode('utf-8') b'\xc3\xa9'

Binary mode:

>>> binary_stream = open('thefile', 'wb') # 'wb' = 'write + binary' >>> binary_stream.write('é'.encode('utf-8')) # we can write bytes 2 >>> binary_stream.write(b'\xc3\xa9') 2 >>> binary_stream.write('é') # trying to write a string directly is a TypeError Traceback (most recent call last): ... TypeError: a bytes-like object is required, not 'str' >>> binary_stream.close()

Conversely, if we use text mode (the default):

>>> text_stream = open('thefile', 'wt', encoding='utf-8') # 'write + text', or 'w' as text is the default >>> text_stream.write('é') # we can write strings 1 >>> text_stream.write(b'\xc3\xa9'.decode('utf-8')) 1

and this is the error we catch in the current version of TableReport.write_html:

>>> text_stream.write(b'\xc3\xa9') # trying to write bytes to a text stream is a TypeError Traceback (most recent call last): ... TypeError: write() argument must be str, not bytes >>> text_stream.close()

So we need to know if the file object the user gave us is a text or binary stream. If we knew it was created with open we could inspect its mode:

>>> with open('thefile', 'w') as stream: ... print(stream.mode) w >>> with open('thefile', 'wb') as stream: ... print(stream.mode) wb

But that may not always work with other kinds of file objects:

>>> import io >>> stream = io.BytesIO() >>> stream.write(b'\xc3\xa9') 2 >>> stream.mode Traceback (most recent call last): ... AttributeError: '_io.BytesIO' object has no attribute 'mode'

All well-behaved file objects however will respect the rule of raising a TypeError if we give them the wrong type of data. So what we can do is first suppose it is in binary mode and try to write bytes, and if we get a TypeError try again with a string (or we could do it the other way around, too).

if the encoding is None it seems the locale.encoding will be used, so we want to check for that too and raise an error if it is not utf-8 ?

If we pass encoding=None to open it will default to using locale.encoding, but AFAIK the resulting file object will always have an explicit encoding (not None).

>>> with open('thefile', 'w', encoding=None) as f: ... print(f.encoding) UTF-8

So I think it should be ok to inspect the encoding of the file object in this case.

Some other kinds of text stream may not have an explicit encoding though. For example io.StringIO writes to an in-memory python string, so it accepts strings but does not need to encode them (it doesn't need to convert to bytes because the backing storage is a unicode string)

>>> stream = io.StringIO() >>> stream.encoding is None True >>> stream.write('é') # ok 1

In the case of StringIO it's fine to copy the report string to the stream's buffer even if we couldn't figure out what encoding it uses (because it doesn't use any). In some other cases I guess looking at the encoding attribute may not give a result even if a (possibly wrong) encoding is being used. But I guess at this point we have to trust the user to know what they are doing and at least we have made a reasonable effort to prevent the most likely issue, which is doing this on windows:

>>> with open('thefile', 'w') as f: ... report.write_html(f)

(Even that will go away in python 3.15 when utf-8 becomes the default).
So my suggestion is that if we easily detect a wrong encoding is being used by checking the encoding attribute, we raise an error because we are sure that writing will result in a file with an incorrect <meta charset='utf-8'> tag which is likely to cause it to display incorrectly in a browser. If we cannot figure out for sure that a wrong encoding was used we just go ahead and write the report and hope for the best.

sorry for the long answer, hope this makes sense!

I fail to see the use case for allowing the user to pass a file instead of a filename. According to some path, this feature should address the simple use case of writing an HTML file to disk. Can you motivate this a little bit? It looks like plain over-engineering to me TBH.

I believe it was a suggestion from @glemaitre .

but if you look at functions offering to save some object as text such as matplotlib.savefig, pandas or polars to_csv etc, xml.etree.write, plotly write_html and others it's hard to find an example that does not allow passing a file object.

one example use case would be using a temporary file, testing, situations in which you may need to write either to an actual file or an in-memory buffer, ...

from personal experience a while ago in a completely unrelated context NiftiImages from the nibabel library could only be saved to a filesystem path and I can't remember what were the exact situation where it was a problem for me but I remember it was annoying

Thank you @jeromedockes for the explanation. it helped a lot.

jeromedockes · 2024-12-08T09:47:57Z

skrub/_reporting/tests/test_table_report.py

@@ -123,6 +127,42 @@ def test_duration(df_module):
    assert re.search(r"2(\.0)?\s+days", TableReport(df).html())


+@pytest.mark.parametrize("filename_type", ["str", "Path", "file_object"])


if we want to be super exhaustive maybe we could test files both in text and binary modes?

jeromedockes · 2024-12-08T09:49:10Z

skrub/_reporting/tests/test_table_report.py

+    df = pd_module.make_dataframe({"a": [1, 2], "b": [3, 4]})
+    report = TableReport(df)
+
+    with TemporaryDirectory() as td:


maybe we could use the pytest tmp_path fixture?

jeromedockes · 2024-12-08T09:51:58Z

skrub/_reporting/tests/test_table_report.py

+            filename = tmp_file_path
+
+        report.write_html(filename)
+        assert tmp_file_path.exists()


we don't need to check the full content but maybe we could see that the file contains "</html>" to see that the full report has been written

jeromedockes · 2024-12-08T10:01:30Z

Users will be interested in this feature so we also need a changelog entry :) thanks!

mrastgoo · 2024-12-09T22:17:54Z

I am block to make the test for platform encoding. The following test doesn't creates the error message for the moment.

def test_write_html_platform_encoding_not_utf8(tmp_path, pd_module, monkeypatch):
    df = pd_module.make_dataframe({"a": [1, 2], "b": [3, 4]})
    report = TableReport(df)
    monkeypatch.setattr(locale, "getencoding", lambda: "ascii")

    filename = open(tmp_path / Path("report.html"), "w", encoding=None)

    err_msg = f"Platform encoding is not utf-8; got {locale.getencoding()}"
    with pytest.raises(ValueError, match=err_msg):
        report.write_html(filename)

In addition the CI fails for the binary mode. Any suggestion @jeromedockes

jeromedockes · 2024-12-10T07:13:31Z

I am block to make the test for platform encoding. The following test doesn't creates the error message for the moment.

I don't think we can set the default encoding just by patching locale.getencoding:

>>> import locale
>>> locale.getencoding = lambda: 'ascii'

>>> with open("thefile", "w", encoding=None) as f:
...     print(f.encoding)
UTF-8

I think it's fine not to test this, just testing when we open a file for which we explicitly set a wrong encoding is enough IMO.

In addition the CI fails for the binary mode. Any suggestion @jeromedockes

from the log it seems it fails when trying to read back the report to check the content, right? I think it is because we saved the report in utf-8 (as we should) but when we open it for reading we don't set an explicit encoding so on windows it ends up using the incorrect default encoding. I think if we replace

    with open(tmp_file_path, "r") as file:

with

    with open(tmp_file_path, "r", encoding="utf-8") as file:

it should fix the issue

jeromedockes

all the tests are passing 🎉 a couple more details but we are almost there :)

skrub/_reporting/_table_report.py

CHANGES.rst

jeromedockes · 2024-12-16T06:56:13Z

skrub/_reporting/tests/test_table_report.py

+    "filename_type",
+    ["str", "Path", "file_object", "binary_mode"],
+)
+def test_write_html(tmp_path, pd_module, filename_type):


this test is looking great! the last thing we need to take care of is to make sure we close the file if we opened it (otherwise we "leak" a resource: the opened file handle. that could be a problem for example to clean up the temp directory on windows as it refuses to remove files that have an open file handle). Usually we ensure that with a simple context manager like:

with open(tmp_file_path, 'w', encoding='utf-8') as file: file.write('hello')

but here we have a tricky situation because in some cases we have a string or path (which require no closing) and sometimes we have a file object which does require closing.

The standard library module contextlib provides 2 ways to deal with that situation easily. The first is ExitStack: it creates a context and we can push as many context managers as we want to its stack; when it exits it unwinds the stack, calling each manager's __exit__ when it is popped. So we could use it like:

with contextlib.ExitStack() as stack: if file_type == 'str': file = str(tmp_file_path) elif file_type == 'text_file_object': file = stack.enter_context(open(tmp_file_path, 'w', encoding='utf-8')) # ... report.write_html(file) # if we opened it the file is closed here when we exit the `with` block

This option using ExitStack is my favorite because the file is being managed by a context manager as soon as it is opened.

Another way is to use nullcontext in the cases where we do not open the file, so that later we can treat all options as if they were open files that implement the context manager protocol. nullcontext returns an object that implements the context manager protocol but whose __enter__ just returns the object we gave it and __exit__ does nothing:

if file_type == 'str': file = contextlib.nullcontext(str(tmp_file_path)) elif file_type == 'text_file_object': file = open(tmp_file_path, 'w', encoding='utf-8') with file: report.write_html(file)

wow contextlib.ExitStack() is very nice

Maybe you could add a comment on L139 to explain why it's a good idea to use contextlib here?

jeromedockes · 2024-12-16T06:56:54Z

skrub/_reporting/tests/test_table_report.py

+    df = pd_module.make_dataframe({"a": [1, 2], "b": [3, 4]})
+    report = TableReport(df)
+
+    filename = open(tmp_path / Path("report.html"), "w", encoding="latin-1")


here also we want to use a context manager to close the file

jeromedockes · 2024-12-16T06:57:47Z

skrub/_reporting/tests/test_table_report.py

+        report.write_html(filename)
+        assert not filename.exists()


Suggested change

report.write_html(filename)

assert not filename.exists()

report.write_html(filename)

assert not filename.exists()

otherwise the assert is never executed because of the exception raised on the line above

I actually change this to check if the file doesn't contain html to make sure we don't modify it.

jeromedockes · 2024-12-16T07:03:02Z

The following code for checking the content failed and needs further investigation

sorry I forgot to answer that. The reason is that every time we generate the report with report.html() or report.html_snippet(), the report generates a new unique id attribute for the html div it creates, so that we don't have problems with duplicate IDs in the same html page if we insert the same report several times in a page (for example by having it as the output of several notebook cells). so the html will not be exactly the same because an id attribute changes but just checking the </html> tag as you do now in the text is enough

actually I'm thinking now that we don't need to vary the id in the report.html() case, only for html_snippet() but in any case if we decide to change that it will be in a different pull request

Co-authored-by: Jérôme Dockès <[email protected]>

skrub/_reporting/_table_report.py

jeromedockes

LGTM! thanks so much for this very nice addition @mrastgoo 🚀

@glemaitre and @Vincent-Maladiere LMK if you want to have another look otherwise I'll merge it

Vincent-Maladiere

Let's add a few lines of comment to explain the rational behind the steps of the function, from what has been discussed here

Vincent-Maladiere · 2024-12-17T09:03:06Z

skrub/_reporting/_table_report.py

+                    f" {encoding!r}"
+                )
+
+        file.write(html)


Could you add a comment explaining what html is expected (or not expected) to be at line 230?
Additionally, light inline documentation/comments on the steps above would help readability :)

Vincent-Maladiere · 2024-12-17T09:04:47Z

skrub/_reporting/tests/test_table_report.py

+    "filename_type",
+    ["str", "Path", "file_object", "binary_mode"],
+)
+def test_write_html(tmp_path, pd_module, filename_type):


Maybe you could add a comment on L139 to explain why it's a good idea to use contextlib here?

mrastgoo · 2024-12-18T23:18:01Z

Let's add a few lines of comment to explain the rational behind the steps of the function, from what has been discussed here

I added some comments in the function, Let me know if this is ok

mrastgoo added 2 commits December 7, 2024 21:27

adding write_html for table_report

e8dfaf6

Merge branch 'main' into issue_1174

58b8a65

glemaitre reviewed Dec 7, 2024

View reviewed changes

skrub/_reporting/_table_report.py Outdated Show resolved Hide resolved

glemaitre reviewed Dec 7, 2024

View reviewed changes

skrub/_reporting/_table_report.py Outdated Show resolved Hide resolved

Update skrub/_reporting/_table_report.py

048cec5

Co-authored-by: Guillaume Lemaitre <[email protected]>

glemaitre reviewed Dec 7, 2024

View reviewed changes

skrub/_reporting/_table_report.py Outdated Show resolved Hide resolved

glemaitre reviewed Dec 7, 2024

View reviewed changes

mrastgoo and others added 2 commits December 7, 2024 22:03

Apply suggestions from code review

15c7b06

Co-authored-by: Guillaume Lemaitre <[email protected]>

addressing the reviews

0debd0b

jeromedockes reviewed Dec 8, 2024

View reviewed changes

mrastgoo added 3 commits December 9, 2024 19:42

Merge branch 'main' into issue_1174

260d710

checking the encoding and TypeError

684a9c1

ignoring test on platform encoding

597bc0e

mrastgoo added 2 commits December 15, 2024 11:52

Merge branch 'main' into issue_1174

ef50a50

fixing the test

bbb50d9

jeromedockes reviewed Dec 16, 2024

View reviewed changes

mrastgoo and others added 3 commits December 16, 2024 18:50

Update skrub/_reporting/_table_report.py

f3c1e8b

Co-authored-by: Jérôme Dockès <[email protected]>

Update CHANGES.rst

760b7b3

Co-authored-by: Jérôme Dockès <[email protected]>

test modification - context manager

0789ac5

jeromedockes reviewed Dec 17, 2024

View reviewed changes

skrub/_reporting/_table_report.py Outdated Show resolved Hide resolved

Update skrub/_reporting/_table_report.py

911eb5a

jeromedockes approved these changes Dec 17, 2024

View reviewed changes

Vincent-Maladiere reviewed Dec 17, 2024

View reviewed changes

adding comment

036db59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH adding `write_html` to `TableReport` #1190

ENH adding `write_html` to `TableReport` #1190

mrastgoo commented Dec 7, 2024

glemaitre Dec 7, 2024

glemaitre Dec 7, 2024

glemaitre Dec 7, 2024

jeromedockes left a comment

jeromedockes Dec 8, 2024

mrastgoo Dec 9, 2024

jeromedockes Dec 10, 2024 •

edited

Loading

Vincent-Maladiere Dec 10, 2024 •

edited

Loading

jeromedockes Dec 10, 2024 •

edited

Loading

mrastgoo Dec 15, 2024

jeromedockes Dec 8, 2024

jeromedockes Dec 8, 2024

jeromedockes Dec 8, 2024

jeromedockes commented Dec 8, 2024

mrastgoo commented Dec 9, 2024

jeromedockes commented Dec 10, 2024

jeromedockes left a comment

jeromedockes Dec 16, 2024

Vincent-Maladiere Dec 16, 2024

Vincent-Maladiere Dec 17, 2024

jeromedockes Dec 16, 2024

jeromedockes Dec 16, 2024

mrastgoo Dec 16, 2024

jeromedockes commented Dec 16, 2024

jeromedockes left a comment

Vincent-Maladiere left a comment

Vincent-Maladiere Dec 17, 2024

Vincent-Maladiere Dec 17, 2024

mrastgoo commented Dec 18, 2024

		@@ -123,6 +127,42 @@ def test_duration(df_module):
		assert re.search(r"2(\.0)?\s+days", TableReport(df).html())


		@pytest.mark.parametrize("filename_type", ["str", "Path", "file_object"])

ENH adding write_html to TableReport #1190

Are you sure you want to change the base?

ENH adding write_html to TableReport #1190

Conversation

mrastgoo commented Dec 7, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jeromedockes left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jeromedockes Dec 10, 2024 • edited Loading

Choose a reason for hiding this comment

Vincent-Maladiere Dec 10, 2024 • edited Loading

Choose a reason for hiding this comment

jeromedockes Dec 10, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jeromedockes commented Dec 8, 2024

mrastgoo commented Dec 9, 2024

jeromedockes commented Dec 10, 2024

jeromedockes left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jeromedockes commented Dec 16, 2024

jeromedockes left a comment

Choose a reason for hiding this comment

Vincent-Maladiere left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mrastgoo commented Dec 18, 2024

ENH adding `write_html` to `TableReport` #1190

ENH adding `write_html` to `TableReport` #1190

jeromedockes Dec 10, 2024 •

edited

Loading

Vincent-Maladiere Dec 10, 2024 •

edited

Loading

jeromedockes Dec 10, 2024 •

edited

Loading