Fix `zipfile.BadZipFile` error when reading `.xlsx` files. #56
Use the name of the `.xlsx` file when opening the workbook, because passing the `io.TextIOWrapper` instance to the reader raises `zipfile.BadZipFile: File is not a zip file` when loading an `.xlsx` file. Issue: ets#54.
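For reference, the behaviour described above can be reproduced with the standard library alone. This is a minimal sketch, not the tap's code: the helper names (`make_sample_zip`, `open_from_text_handle`, `open_from_name`) and the sample archive are hypothetical, and a plain zip archive stands in for an `.xlsx` file (which is itself a zip container).

```python
import os
import tempfile
import zipfile


def make_sample_zip(path):
    """Write a tiny zip archive to stand in for an .xlsx file (hypothetical helper)."""
    with zipfile.ZipFile(path, "w") as archive:
        archive.writestr("hello.txt", "hello")


def open_from_text_handle(path):
    """Return True if ZipFile accepts a text-mode handle, False on BadZipFile."""
    with open(path, errors="surrogateescape") as text_handle:
        try:
            with zipfile.ZipFile(text_handle):
                return True
        except zipfile.BadZipFile:
            return False


def open_from_name(path):
    """Open the archive through the handle's .name attribute instead of the handle."""
    with open(path, errors="surrogateescape") as text_handle:
        with zipfile.ZipFile(text_handle.name) as archive:
            return archive.namelist()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        sample = os.path.join(tmp_dir, "sample.zip")
        make_sample_zip(sample)
        print("text handle accepted:", open_from_text_handle(sample))
        print("members via name:", open_from_name(sample))
```

Note that opening by name only works when the name resolves to a path on the local filesystem.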
My quick Edit: redacted output:
INFO Using supplied catalog /path/to/meltano_project/.meltano/run/tap-custom-tap/tap.properties.json.
INFO Processing 1 selected streams from Catalog
INFO Syncing stream:account_transactions
{"type": "SCHEMA", "stream": "account_transactions", "schema": {"properties": {"date": {"type": ["null", "string"]}, "contact": {"type": ["null", "string"]}, "description": {"type": ["null", "string"]}, "invoice_number": {"type": ["null", "string"]}, "reference": {"type": ["null", "string"]}, "debit_gbp": {"type": ["null", "string"]}, "credit_gbp": {"type": ["null", "string"]}, "gross_gbp": {"type": ["null", "string"]}, "net_gbp": {"type": ["null", "string"]}, "vat_gbp": {"type": ["null", "string"]}, "account_code": {"type": ["null", "integer"]}, "account": {"type": ["null", "string"]}, "account_type": {"type": ["null", "string"]}, "revenue_type": {"type": ["null", "string"]}, "source": {"type": ["null", "string"]}, "contact_group": {"type": ["null", "string"]}, "debit": {"type": ["null", "string"]}, "credit": {"type": ["null", "string"]}, "gross": {"type": ["null", "string"]}, "net": {"type": ["null", "string"]}, "vat": {"type": ["null", "string"]}, "vat_rate": {"type": ["null", "integer"]}, "vat_rate_name": {"type": ["null", "string"]}, "region": {"type": ["null", "string"]}, "related_account": {"type": ["null", "string"]}, "_smart_source_bucket": {"type": "string"}, "_smart_source_file": {"type": "string"}, "_smart_source_lineno": {"type": "integer"}}, "selected": true, "type": "object"}, "key_properties": []}
INFO Loading cached SSO token for default
INFO Found 2 files.
INFO Checking 2 resolved objects for any that match regular expression "account_transactions/.*xlsx$" and were modified since 1970-01-01 00:00:00+00:00
INFO Processing 1 resolved objects that met our criteria. Enable debug verbosity logging for more details.
INFO Syncing file "account_transactions/issue_52_bad_sample_file.xlsx".
CRITICAL [Errno 2] No such file or directory: 'account_transactions/issue_52_bad_sample_file.xlsx'
Traceback (most recent call last):
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/bin/tap-spreadsheets-anywhere", line 8, in <module>
sys.exit(main())
^^^^^^
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/singer/utils.py", line 235, in wrapped
return fnc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/__init__.py", line 162, in main
sync(tables_config, args.state, catalog)
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/__init__.py", line 117, in sync
records_streamed += file_utils.write_file(t_file['key'], table_spec, merged_schema, max_records=max_records_per_run-records_streamed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/file_utils.py", line 46, in write_file
iterator = tap_spreadsheets_anywhere.format_handler.get_row_iterator(table_spec, target_uri)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/format_handler.py", line 164, in get_row_iterator
iterator = tap_spreadsheets_anywhere.excel_handler.get_row_iterator(table_spec, reader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/excel_handler.py", line 72, in get_row_iterator
workbook = openpyxl.load_workbook(file_handle.name, read_only=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 344, in load_workbook
reader = ExcelReader(filename, read_only, keep_vba,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 123, in __init__
self.archive = _validate_archive(fn)
^^^^^^^^^^^^^^^^^^^^^
File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 95, in _validate_archive
archive = ZipFile(filename, 'r')
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/zipfile.py", line 1283, in __init__
self.fp = io.open(file, filemode)
^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'account_transactions/issue_52_bad_sample_file.xlsx'

NOTE: Without this change, I get:
Diving in with
(Pdb) fpin.seek(-sizeEndCentDir, 2)
fpin.seek(-sizeEndCentDir, 2)
*** io.UnsupportedOperation: can't do nonzero end-relative seeks

Which is then re-raised on: https://github.com/python/cpython/blob/3.11/Lib/zipfile.py#L1367 as:

def _RealGetContents(self):
    """Read in the table of contents for the ZIP file."""
    fp = self.fp
    try:
        endrec = _EndRecData(fp)
    except OSError:
        raise BadZipFile("File is not a zip file")
    if not endrec:
        raise BadZipFile("File is not a zip file")

Was worried that my S3-sourced file was not completely streamed, but did the same check with the local file:

>>> f = open("/path/to/file.xlsx", errors="surrogateescape")  # `errors` is used to match current code + avoid: `*** UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd7 in position 10: invalid continuation byte` when read.
>>> f.seek(0, 2)
>>> f.tell()
<int>  # Confirmed the same size for the local file vs. the file pulled from S3 during the pdb session.
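The end-relative seek from the pdb session can be checked in isolation. A rough sketch, assuming a throwaway 100-byte file; the helper name `end_relative_seek` is made up for illustration, and the offset of 22 bytes is the size of the zip end-of-central-directory record that `zipfile` seeks back over.

```python
import io
import os
import tempfile


def end_relative_seek(handle, offset=-22):
    """Try handle.seek(offset, SEEK_END).

    Returns the new absolute position on success, or the error message
    if the handle does not support nonzero end-relative seeks.
    """
    try:
        return handle.seek(offset, io.SEEK_END)
    except io.UnsupportedOperation as exc:
        return str(exc)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "payload.bin")
        with open(path, "wb") as out:
            out.write(b"\x00" * 100)

        # Text mode, opened the same way as in the session above.
        with open(path, errors="surrogateescape") as text_handle:
            print("text mode:", end_relative_seek(text_handle))

        # Binary mode supports the same seek.
        with open(path, "rb") as binary_handle:
            print("binary mode:", end_relative_seek(binary_handle))
```

Text-mode handles only allow `seek(0, 2)`, which is why the size check above succeeds while the seek inside `zipfile` does not.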
Hi there, the same error also happens when running the test cases.
With kind regards,