`zipfile.BadZipFile: File is not a zip file` when loading an `.xlsx` file #54
Comments
Use the name of the `.xlsx` file when opening the workbook, since the following error is thrown when attempting to read the `io.TextIOWrapper` instance: `zipfile.BadZipFile: File is not a zip file`. Issue: ets#54.
My quick Edit: redacted output:
INFO Using supplied catalog /path/to/meltano_project/.meltano/run/tap-custom-tap/tap.properties.json.
INFO Processing 1 selected streams from Catalog
INFO Syncing stream:account_transactions
{"type": "SCHEMA", "stream": "account_transactions", "schema": {"properties": {"date": {"type": ["null", "string"]}, "contact": {"type": ["null", "string"]}, "description": {"type": ["null", "string"]}, "invoice_number": {"type": ["null", "string"]}, "reference": {"type": ["null", "string"]}, "debit_gbp": {"type": ["null", "string"]}, "credit_gbp": {"type": ["null", "string"]}, "gross_gbp": {"type": ["null", "string"]}, "net_gbp": {"type": ["null", "string"]}, "vat_gbp": {"type": ["null", "string"]}, "account_code": {"type": ["null", "integer"]}, "account": {"type": ["null", "string"]}, "account_type": {"type": ["null", "string"]}, "revenue_type": {"type": ["null", "string"]}, "source": {"type": ["null", "string"]}, "contact_group": {"type": ["null", "string"]}, "debit": {"type": ["null", "string"]}, "credit": {"type": ["null", "string"]}, "gross": {"type": ["null", "string"]}, "net": {"type": ["null", "string"]}, "vat": {"type": ["null", "string"]}, "vat_rate": {"type": ["null", "integer"]}, "vat_rate_name": {"type": ["null", "string"]}, "region": {"type": ["null", "string"]}, "related_account": {"type": ["null", "string"]}, "_smart_source_bucket": {"type": "string"}, "_smart_source_file": {"type": "string"}, "_smart_source_lineno": {"type": "integer"}}, "selected": true, "type": "object"}, "key_properties": []}
INFO Loading cached SSO token for default
INFO Found 2 files.
INFO Checking 2 resolved objects for any that match regular expression "account_transactions/.*xlsx$" and were modified since 1970-01-01 00:00:00+00:00
INFO Processing 1 resolved objects that met our criteria. Enable debug verbosity logging for more details.
INFO Syncing file "account_transactions/issue_52_bad_sample_file.xlsx".
CRITICAL [Errno 2] No such file or directory: 'account_transactions/issue_52_bad_sample_file.xlsx'
Traceback (most recent call last):
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/bin/tap-spreadsheets-anywhere", line 8, in <module>
sys.exit(main())
^^^^^^
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/singer/utils.py", line 235, in wrapped
return fnc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/__init__.py", line 162, in main
sync(tables_config, args.state, catalog)
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/__init__.py", line 117, in sync
records_streamed += file_utils.write_file(t_file['key'], table_spec, merged_schema, max_records=max_records_per_run-records_streamed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/file_utils.py", line 46, in write_file
iterator = tap_spreadsheets_anywhere.format_handler.get_row_iterator(table_spec, target_uri)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/format_handler.py", line 164, in get_row_iterator
iterator = tap_spreadsheets_anywhere.excel_handler.get_row_iterator(table_spec, reader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/excel_handler.py", line 72, in get_row_iterator
workbook = openpyxl.load_workbook(file_handle.name, read_only=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 344, in load_workbook
reader = ExcelReader(filename, read_only, keep_vba,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 123, in __init__
self.archive = _validate_archive(fn)
^^^^^^^^^^^^^^^^^^^^^
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 95, in _validate_archive
archive = ZipFile(filename, 'r')
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/zipfile.py", line 1283, in __init__
self.fp = io.open(file, filemode)
^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'account_transactions/issue_52_bad_sample_file.xlsx'
NOTE: Without this change, I get:
Diving in with

    (Pdb) fpin.seek(-sizeEndCentDir, 2)
    fpin.seek(-sizeEndCentDir, 2)
    *** io.UnsupportedOperation: can't do nonzero end-relative seeks

Which is then re-raised on: https://github.com/python/cpython/blob/3.11/Lib/zipfile.py#L1367 as:

    def _RealGetContents(self):
        """Read in the table of contents for the ZIP file."""
        fp = self.fp
        try:
            endrec = _EndRecData(fp)
        except OSError:
            raise BadZipFile("File is not a zip file")
        if not endrec:
            raise BadZipFile("File is not a zip file")

Was worried that my S3 sourced file was not completely streamed, but did the same check with the local

    >>> f = open("/path/to/file.xlsx", errors="surrogateescape")  # `errors` is used to match current code + avoid: `*** UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd7 in position 10: invalid continuation byte` when read.
    >>> f.seek(0, 2)
    >>> f.tell()
    <int>  # Confirmed same for local vs S3 file pulled file during pdb session.
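The pdb session above suggests the failure comes from the handle being opened in text mode, not from the file contents. A minimal standard-library sketch of that behaviour follows, assuming a throwaway ZIP archive stands in for the `.xlsx` (which is itself a ZIP container); the helper name `opens_as_zip` is hypothetical and not part of the tap.

```python
import tempfile
import zipfile
from pathlib import Path


def opens_as_zip(path, **open_kwargs):
    """Return True if zipfile can read `path` through a handle opened with `open_kwargs`."""
    with open(path, **open_kwargs) as handle:
        try:
            with zipfile.ZipFile(handle) as archive:
                archive.namelist()
            return True
        except zipfile.BadZipFile:
            # Text-mode handles reject nonzero end-relative seeks, which
            # zipfile reports as "File is not a zip file".
            return False


with tempfile.TemporaryDirectory() as tmp_dir:
    # Assumption: a tiny ZIP archive is enough to stand in for a workbook.
    sample = Path(tmp_dir) / "sample.zip"
    with zipfile.ZipFile(sample, "w") as archive:
        archive.writestr("hello.txt", "hello")

    text_mode_ok = opens_as_zip(sample, mode="r", errors="surrogateescape")
    binary_mode_ok = opens_as_zip(sample, mode="rb")

print("text mode:", text_mode_ok, "binary mode:", binary_mode_ok)
```

The same file is used for both calls, so any difference in the result comes from the open mode alone.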
I don't know from the lack of details, but could this potentially be a duplicate of #26? With the problem being around:
I've got a dirty workaround that reads the file contents from the

    from pathlib import Path
    from tempfile import TemporaryDirectory

    import smart_open
    from openpyxl import load_workbook

    S3_URI = "s3://<s3_bucket>/<prefix>/<file>.xlsx"

    # Pull down `.xlsx` from S3 and work around `BadZipFile` error thrown by
    # the internals of `openpyxl.load_workbook()` when passed the
    # `_io.TextIOWrapper` that `smart_open.open()` returns, by dumping to a
    # temporary file.
    with TemporaryDirectory() as tmp_dir:
        _tmp_dir_path = Path(tmp_dir)
        _tmp_file = _tmp_dir_path / "tmp.xlsx"
        with open(_tmp_file, "wb") as tf:
            with smart_open.open(S3_URI, "rb") as f:
                for line in f:
                    tf.write(line)
        # Following no longer throws: `BadZipFile`.
        # Load inside the `with` block, while the temporary file still exists.
        wb = load_workbook(_tmp_file)
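The temporary-file workaround above could probably also be done in memory, since `openpyxl.load_workbook()` is documented to accept a file-like object opened in binary mode. The sketch below is a standard-library-only illustration under that assumption: `spool_to_memory` and `ForwardOnlyStream` are hypothetical names, and `zipfile.ZipFile` stands in for openpyxl's archive validation.

```python
import io
import shutil
import zipfile


def spool_to_memory(binary_source):
    """Copy a binary stream into a seekable in-memory buffer."""
    buffer = io.BytesIO()
    shutil.copyfileobj(binary_source, buffer)  # chunked copy, not line-based
    buffer.seek(0)
    return buffer


class ForwardOnlyStream:
    """Hypothetical stand-in for a remote stream that only supports read()."""

    def __init__(self, payload):
        self._inner = io.BytesIO(payload)

    def read(self, size=-1):
        return self._inner.read(size)


# Build a small ZIP payload to play the role of the downloaded `.xlsx`.
payload = io.BytesIO()
with zipfile.ZipFile(payload, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")

buffer = spool_to_memory(ForwardOnlyStream(payload.getvalue()))
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()

print(names)
```

This trades disk I/O for memory, so it is likely only reasonable for workbooks that comfortably fit in RAM.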
Hi @craigastill, I stumbled upon the same issue when using the tap for local files. I wanted to extract a local After digging in the code, I found that the
A fix can be to update the code here with:

    elif format == 'excel':
        if uri.lower().endswith(".xls"):
            reader = get_streamreader(uri, universal_newlines=universal_newlines, newline=None, open_mode='rb')
            iterator = tap_spreadsheets_anywhere.excel_handler.get_legacy_row_iterator(table_spec, reader)
        else:
            reader = get_streamreader(uri, universal_newlines=universal_newlines, newline=None, open_mode='rb', encoding=None)  # Adding encoding `None` to ensure smart_open will use binary mode
            iterator = tap_spreadsheets_anywhere.excel_handler.get_row_iterator(table_spec, reader)

We can add a check to ensure file extension is

@craigastill Can you test this fix with your S3 use case? I'll write a PR as soon as you've tested it 😉
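Because the fix above depends on the reader really being opened in binary mode, a small guard could make a regression fail early with a clearer message than `BadZipFile`. This is only a sketch: `require_binary` is a hypothetical helper, not something the tap currently provides.

```python
import io


def require_binary(handle):
    """Return `handle` unchanged, or raise if it is a text-mode stream.

    Hypothetical helper: text-mode wrappers such as io.TextIOWrapper
    derive from io.TextIOBase, so they can be detected before the
    handle reaches a ZIP-based reader.
    """
    if isinstance(handle, io.TextIOBase):
        raise TypeError(
            "expected a binary file handle for an .xlsx workbook, "
            f"got text-mode {type(handle).__name__}"
        )
    return handle


binary_stream = io.BytesIO(b"PK")
checked = require_binary(binary_stream)  # passes through untouched

try:
    require_binary(io.StringIO("PK"))
    rejected = False
except TypeError:
    rejected = True

print(checked is binary_stream, rejected)
```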
Hi Amine. When I roll back onto the data project that is using Interesting find on the
@aminebeh I stumbled across this fix (after I made a much less elegant fix myself), it works excellently for me. My own branch is https://github.com/radbrt/tap-spreadsheets-anywhere/tree/excelbinary, I don't see any reason not to make a PR with the fix. Any thoughts @menzenski?
Not sure if this is a misconfiguration on my side, but I get the error:
`zipfile.BadZipFile: File is not a zip file`
both from reading an `.xlsx` from local/S3, or a pytest testcase saving and reloading an `openpyxl` created workbook.
Debugging/Workaround
My current workaround is to modify: https://github.com/ets/tap-spreadsheets-anywhere/blob/main/tap_spreadsheets_anywhere/excel_handler.py#L64-L65 from:
to:
Reproduction:
Workaround:
Adding the above-mentioned workaround results in success, but I have not tested this in anger.