`zipfile.BadZipFile: File is not a zip file` when loading an `.xlsx` file #54

craigastill · 2023-05-15T16:12:26Z

Not sure if this is a misconfiguration on my side, but I get the error: zipfile.BadZipFile: File is not a zip file both from reading an .xlsx from local/S3, or a pytest testcase saving and reloading an openpyxl created workbook.

Debugging/Workaround

My current workaround is to modify: https://github.com/ets/tap-spreadsheets-anywhere/blob/main/tap_spreadsheets_anywhere/excel_handler.py#L64-L65 from:

def get_row_iterator(table_spec, file_handle):
    workbook = openpyxl.load_workbook(file_handle, read_only=True)

to:

def get_row_iterator(table_spec, file_handle):
    workbook = openpyxl.load_workbook(file_handle.name, read_only=True)

Reproduction:

Created an example testcase:

from openpyxl import Workbook
from tap_spreadsheets_anywhere.format_handler import get_row_iterator


def get_worksheet():
    """Create a basic workbook that can be manipulated for tests.
    See: https://openpyxl.readthedocs.io/en/stable/usage.html.
    """
    wb = Workbook()
    ws = wb.active
    tree_data = [
        ["Type", "Leaf Color", "Height"],
        ["Maple", "Red", 549],
        ["Oak", "Green", 783],
        ["Pine", "Green", 1204]
    ]
    exp_tree_data = [
        {'type': 'Maple', 'leaf_color': 'Red', 'height': 549},
        {'type': 'Oak', 'leaf_color': 'Green', 'height': 783},
        {'type': 'Pine', 'leaf_color': 'Green', 'height': 1204},
    ]
    [ws.append(row) for row in tree_data]
    return ws, wb, tree_data, exp_tree_data


class TestFormatHandlerExcelSkipInitial:
    """pytests to validate Skip Initial for Excel files works as expected."""
    def test_parse_data(self, tmpdir):
        xlsx = tmpdir / "fake_test.xlsx"
        uri = f"file://{xlsx}"
        worksheet, workbook, _, exp = get_worksheet()
        workbook.save(xlsx)

        iterator = get_row_iterator({"format": "excel"}, uri)
        assert next(iterator) == exp[0]

Current output:

cd ~/github_forks/tap-spreadsheets-anywhere/ && source .venv/bin/activate && pytest -vvvv -p no:warnings -k TestFormatHandlerExcelSkipInitial
============================================= test session starts ==============================================
platform darwin -- Python 3.11.3, pytest-7.3.1, pluggy-1.0.0 -- /Users/craigastill/github_forks/tap-spreadsheets-anywhere/.venv/bin/python3.11
cachedir: .pytest_cache
rootdir: /Users/craigastill/github_forks/tap-spreadsheets-anywhere
collected 30 items / 29 deselected / 1 selected                                                                

tap_spreadsheets_anywhere/test/test_format.py::TestFormatHandlerExcelSkipInitial::test_parse_data FAILED [100%]

=================================================== FAILURES ===================================================
______________________________ TestFormatHandlerExcelSkipInitial.test_parse_data _______________________________

self = <tap_spreadsheets_anywhere.test.test_format.TestFormatHandlerExcelSkipInitial object at 0x1055e9e50>
tmpdir = local('/private/var/folders/yx/p09k0hh165scp5p_ntthr29c0000gn/T/pytest-of-craigastill/pytest-6/test_parse_data0')

    def test_parse_data(self, tmpdir):
        xlsx = tmpdir / "fake_test.xlsx"
        uri = f"file://{xlsx}"
        worksheet, workbook, _, exp = get_worksheet()
        workbook.save(xlsx)

>       iterator = get_row_iterator({"format": "excel"}, uri)

tap_spreadsheets_anywhere/test/test_format.py:218: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tap_spreadsheets_anywhere/format_handler.py:164: in get_row_iterator
    iterator = tap_spreadsheets_anywhere.excel_handler.get_row_iterator(table_spec, reader)
tap_spreadsheets_anywhere/excel_handler.py:66: in get_row_iterator
    workbook = openpyxl.load_workbook(file_handle, read_only=True)
.venv/lib/python3.11/site-packages/openpyxl/reader/excel.py:344: in load_workbook
    reader = ExcelReader(filename, read_only, keep_vba,
.venv/lib/python3.11/site-packages/openpyxl/reader/excel.py:123: in __init__
    self.archive = _validate_archive(fn)
.venv/lib/python3.11/site-packages/openpyxl/reader/excel.py:95: in _validate_archive
    archive = ZipFile(filename, 'r')
/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/zipfile.py:1301: in __init__
    self._RealGetContents()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <zipfile.ZipFile [closed]>

    def _RealGetContents(self):
        """Read in the table of contents for the ZIP file."""
        fp = self.fp
        try:
            endrec = _EndRecData(fp)
        except OSError:
            raise BadZipFile("File is not a zip file")
        if not endrec:
>           raise BadZipFile("File is not a zip file")
E           zipfile.BadZipFile: File is not a zip file

/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/zipfile.py:1368: BadZipFile
--------------------------------------------- Captured stdout call ---------------------------------------------
<_io.TextIOWrapper name='/private/var/folders/yx/p09k0hh165scp5p_ntthr29c0000gn/T/pytest-of-craigastill/pytest-6/test_parse_data0/fake_test.xlsx' mode='r' encoding='utf-8'>
=========================================== short test summary info ============================================
FAILED tap_spreadsheets_anywhere/test/test_format.py::TestFormatHandlerExcelSkipInitial::test_parse_data - zi...
======================================= 1 failed, 29 deselected in 1.22s =======================================

Workaround:

Adding the above mentioned workaround results in success, but not tested this in anger.

cd ~/github_forks/tap-spreadsheets-anywhere/ && source .venv/bin/activate && pytest -vvvv -p no:warnings -k TestFormatHandlerExcelSkipInitial
============================= test session starts ==============================
platform darwin -- Python 3.11.3, pytest-7.3.1, pluggy-1.0.0 -- /Users/craigastill/github_forks/tap-spreadsheets-anywhere/.venv/bin/python3.11
cachedir: .pytest_cache
rootdir: /Users/craigastill/github_forks/tap-spreadsheets-anywhere
collected 30 items / 29 deselected / 1 selected                                

tap_spreadsheets_anywhere/test/test_format.py::TestFormatHandlerExcelSkipInitial::test_parse_data PASSED [100%]

====================== 1 passed, 29 deselected in 23.37s =======================

The text was updated successfully, but these errors were encountered:

Use the name of the `.xlsx`, when opening the workbook, since the following error is thrown when attempting to read the `io.TextIOWrapper` instance: `zipfile.BadZipFile: File is not a zip file when loading an .xlsx file` Issue: ets#54.

craigastill · 2023-05-23T12:17:28Z

My quick .name hack works for local files, but is throwing: FileNotFoundError: [Errno 2] No such file or directory: <sub-folder>/<file>.xlsx, when running meltano invoke tap-<custom_tap> --dev when attempting to extract the same file from an S3 bucket.

Edit: redacted output:

  INFO Using supplied catalog /path/to/meltano_project/.meltano/run/tap-custom-tap/tap.properties.json.
  INFO Processing 1 selected streams from Catalog
  INFO Syncing stream:account_transactions
  {"type": "SCHEMA", "stream": "account_transactions", "schema": {"properties": {"date": {"type": ["null", "string"]}, "contact": {"type": ["null", "string"]}, "description": {"type": ["null", "string"]}, "invoice_number": {"type": ["null", "string"]}, "reference": {"type": ["null", "string"]}, "debit_gbp": {"type": ["null", "string"]}, "credit_gbp": {"type": ["null", "string"]}, "gross_gbp": {"type": ["null", "string"]}, "net_gbp": {"type": ["null", "string"]}, "vat_gbp": {"type": ["null", "string"]}, "account_code": {"type": ["null", "integer"]}, "account": {"type": ["null", "string"]}, "account_type": {"type": ["null", "string"]}, "revenue_type": {"type": ["null", "string"]}, "source": {"type": ["null", "string"]}, "contact_group": {"type": ["null", "string"]}, "debit": {"type": ["null", "string"]}, "credit": {"type": ["null", "string"]}, "gross": {"type": ["null", "string"]}, "net": {"type": ["null", "string"]}, "vat": {"type": ["null", "string"]}, "vat_rate": {"type": ["null", "integer"]}, "vat_rate_name": {"type": ["null", "string"]}, "region": {"type": ["null", "string"]}, "related_account": {"type": ["null", "string"]}, "_smart_source_bucket": {"type": "string"}, "_smart_source_file": {"type": "string"}, "_smart_source_lineno": {"type": "integer"}}, "selected": true, "type": "object"}, "key_properties": []}
  INFO Loading cached SSO token for default
  INFO Found 2 files.
  INFO Checking 2 resolved objects for any that match regular expression "account_transactions/.*xlsx$" and were modified since 1970-01-01 00:00:00+00:00
  INFO Processing 1 resolved objects that met our criteria. Enable debug verbosity logging for more details.
  INFO Syncing file "account_transactions/issue_52_bad_sample_file.xlsx".
  CRITICAL [Errno 2] No such file or directory: 'account_transactions/issue_52_bad_sample_file.xlsx'
  Traceback (most recent call last):
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/bin/tap-spreadsheets-anywhere", line 8, in <module>
      sys.exit(main())
               ^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/singer/utils.py", line 235, in wrapped
      return fnc(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/__init__.py", line 162, in main
      sync(tables_config, args.state, catalog)
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/__init__.py", line 117, in sync
      records_streamed += file_utils.write_file(t_file['key'], table_spec, merged_schema, max_records=max_records_per_run-records_streamed)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/file_utils.py", line 46, in write_file
      iterator = tap_spreadsheets_anywhere.format_handler.get_row_iterator(table_spec, target_uri)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/format_handler.py", line 164, in get_row_iterator
      iterator = tap_spreadsheets_anywhere.excel_handler.get_row_iterator(table_spec, reader)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/excel_handler.py", line 72, in get_row_iterator
      workbook = openpyxl.load_workbook(file_handle.name, read_only=True)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 344, in load_workbook
      reader = ExcelReader(filename, read_only, keep_vba,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 123, in __init__
      self.archive = _validate_archive(fn)
                     ^^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 95, in _validate_archive
      archive = ZipFile(filename, 'r')
                ^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/zipfile.py", line 1283, in __init__
      self.fp = io.open(file, filemode)
                ^^^^^^^^^^^^^^^^^^^^^^^
  FileNotFoundError: [Errno 2] No such file or directory: 'account_transactions/issue_52_bad_sample_file.xlsx'

NOTE: Without this change, I get: zipfile.BadZipFile: File is not a zip file, in both local/S3 scenarios.

craigastill · 2023-05-23T12:17:48Z

Diving in with pdb. With the code in main and using my sample file in S3, this is hitting the exception in the builtin: zipfile (https://github.com/python/cpython/blob/3.11/Lib/zipfile.py#L247). Which is throwing:

(Pdb) fpin.seek(-sizeEndCentDir, 2)
fpin.seek(-sizeEndCentDir, 2)
*** io.UnsupportedOperation: can't do nonzero end-relative seeks

Which is then re-raised on: https://github.com/python/cpython/blob/3.11/Lib/zipfile.py#L1367 as: BadZipFile("File is not a zip file"). zipfile Code:

    def _RealGetContents(self):
        """Read in the table of contents for the ZIP file."""
        fp = self.fp
        try:
            endrec = _EndRecData(fp)
        except OSError:
            raise BadZipFile("File is not a zip file")
        if not endrec:
            raise BadZipFile("File is not a zip file")

Was worried that my S3 sourced file was not completely streamed, but did the same check with the local .xlsx file as done by the built-in zipfile._EndRecData(fpin) (https://github.com/python/cpython/blob/3.11/Lib/zipfile.py#L285-L293). ie.

>>> f = open("/path/to/file.xlsx", errors="surrogateescape")  # `errors` is used to match current code + avoid: `*** UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd7 in position 10: invalid continuation byte` when read.
>>> f.seek(0, 2)
>>> f.tell()
<int>  # Confirmed same for local vs S3 file pulled file during pdb session.

craigastill · 2023-05-23T12:21:29Z

I don't know from the lack of details, but could this potentially be a duplicate of #26 ? With the problem being around: zipfile.ZipFile(<_io.TextWrapper>), returned by smart_open.open(<s3_uri>) ??

craigastill · 2023-05-26T09:22:45Z

I've got a dirty workaround that reads the file contents from the smart_open.open(), to a temporary local file before passing into: openpyxl.load_workbook(). I'm not happy with having a second copy of the file in memory, but it might be a starting point for a cleaner fix:

from pathlib import Path
from tempfile import TemporaryDirectory

import smart_open
from openpyxl import load_workbook

S3_URI = "s3://<s3_bucket>/<prefix>/<file>.xlsx"


# Pull down `.xlsx` from S3 and work around `BadZipFile` error thrown by
# the internals of `openpyxl.load_workbook()` when passed the
# `_io.TextIOWrapper` that `smart_open.open()` returns, by dumping to a
# temporary file.
with TemporaryDirectory() as tmp_dir:
    _tmp_dir_path = Path(tmp_dir)
    _tmp_file = _tmp_dir_path / "tmp.xlsx"
    with open(_tmp_file, "wb") as tf:
        with smart_open.open(S3_URI, "rb") as f:
            for line in f:
                tf.write(line)
    # Following no longer throws: `BadZipFile`.
    wb = load_workbook(ttf)

aminebeh · 2023-07-26T13:27:23Z

Hi @craigastill, stuble upon the same issue when using the tap for local files.

I wanted to extract a local .xlsx file, but I got the File is not a zip file error.

After digging in the code, I found that the smart_open lib overwrite the binary mode when reading a file when an encoding is specified:

PR with changes: Resolve Issue #145, add warnings for Issue #146 piskvorky/smart_open#147
Lines of code

A fix can be to update the code here with:

elif format == 'excel':
    if uri.lower().endswith(".xls"):
        reader = get_streamreader(uri, universal_newlines=universal_newlines,newline=None, open_mode='rb')
        iterator = tap_spreadsheets_anywhere.excel_handler.get_legacy_row_iterator(table_spec, reader)
    else:
        reader = get_streamreader(uri, universal_newlines=universal_newlines,newline=None, open_mode='rb', encoding=None) # Adding encoding `None` to ensure smart_open will use binary mode
        iterator = tap_spreadsheets_anywhere.excel_handler.get_row_iterator(table_spec, reader)

We can add a check to ensure file extension is xlsx as this format require to open the file in binary mode.

@craigastill Can you test this fix with your S3 use case ?

I'll write a PR as soon as you've tested it 😉

craigastill · 2023-07-27T08:05:16Z

Hi Amine. When I roll back onto the data project that is using tap-spreadsheets-anywhere, I'll try and give your fix a go.

Interesting find on the smart_open side!

radbrt · 2023-09-21T08:01:44Z

@aminebeh I stumbled across this fix (after I made a much less elegant fix myself), it works excellently for me. My own branch is https://github.com/radbrt/tap-spreadsheets-anywhere/tree/excelbinary, I don't see any reason not to make a PR with the fix. Any thoughts @menzenski ?

craigastill linked a pull request May 16, 2023 that will close this issue

Fix zipfile.BadZipFile error when reading .xlsx files. #56

Open

aminebeh mentioned this issue Jul 26, 2023

fix(xlsx): Fix error issue for xlsx #62

Merged

mjsqu added a commit to mjsqu/tap-spreadsheets-anywhere that referenced this issue Oct 3, 2023

Implementing fix for: ets#54

ea0a94d

mjsqu added a commit to mjsqu/tap-spreadsheets-anywhere that referenced this issue Oct 3, 2023

Implementing fix for ets#54

ae28881

amotl mentioned this issue Dec 13, 2023

Fix reading binary files/streams by adjusting encoding to match mode #76

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`zipfile.BadZipFile: File is not a zip file` when loading an `.xlsx` file #54

`zipfile.BadZipFile: File is not a zip file` when loading an `.xlsx` file #54

craigastill commented May 15, 2023

craigastill commented May 23, 2023

craigastill commented May 23, 2023

craigastill commented May 23, 2023

craigastill commented May 26, 2023

aminebeh commented Jul 26, 2023

craigastill commented Jul 27, 2023

radbrt commented Sep 21, 2023

zipfile.BadZipFile: File is not a zip file when loading an .xlsx file #54

zipfile.BadZipFile: File is not a zip file when loading an .xlsx file #54

Comments

craigastill commented May 15, 2023

Debugging/Workaround

Reproduction:

Workaround:

craigastill commented May 23, 2023

craigastill commented May 23, 2023

craigastill commented May 23, 2023

craigastill commented May 26, 2023

aminebeh commented Jul 26, 2023

craigastill commented Jul 27, 2023

radbrt commented Sep 21, 2023

`zipfile.BadZipFile: File is not a zip file` when loading an `.xlsx` file #54

`zipfile.BadZipFile: File is not a zip file` when loading an `.xlsx` file #54