As previously addressed in e.g. #2417, #2996, #3003, #3208, and #3211, SQLite can't handle multiple concurrent writes. Our SQLite IO Manager has worked around this for the PUDL DB, but we seem to be hitting a new limit of some kind with Dagster's event logging DB, which still uses SQLite locally by default.
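(For reference, the usual single-writer mitigations look something like the sketch below: a lock timeout plus WAL journaling, applied when each connection is created. This is a hedged illustration of the general technique, not necessarily what our SQLite IO Manager actually does, and the database path is a placeholder.)

```python
# Sketch of standard SQLite concurrency mitigations on a SQLAlchemy engine.
# The path and timeout are placeholders, not our actual settings.
import sqlalchemy as sa
from sqlalchemy import event

engine = sa.create_engine(
    "sqlite:///pudl.sqlite",
    # Wait up to 30s for a write lock instead of failing immediately:
    connect_args={"timeout": 30},
)

@event.listens_for(engine, "connect")
def set_sqlite_pragmas(dbapi_connection, connection_record):
    cursor = dbapi_connection.cursor()
    # WAL mode lets a single writer coexist with concurrent readers:
    cursor.execute("PRAGMA journal_mode=WAL")
    cursor.close()
```

None of that helps here, though, because the failing writes are going to Dagster's own event log DB, not ours.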
@cmgosnell recently encountered the issue while attempting to debug some integration test failures through the Dagster UI, where it failed basically 100% of the time. I have had the problem on and off, with maybe a 50-75% failure rate on the full ETL.
For me, the failure always seems to happen right after the execution of the `raw_eia860m__all_dfs` asset.
Full stack trace here:
```
2024-09-25 21:43:12 -0600 - dagster - DEBUG - etl_full - 3162fc18-e15b-4284-81dd-b46c063d8d35 - 8873 - raw_eia860m__all_dfs - STEP_SUCCESS - Finished execution of step "raw_eia860m__all_dfs" in 2m35s.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 941, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: unable to open database file

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_grpc/impl.py", line 266, in start_run_in_subprocess
    _run_in_subprocess(
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_grpc/impl.py", line 250, in _run_in_subprocess
    instance.report_engine_event(
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_core/instance/__init__.py", line 2497, in report_engine_event
    self.report_dagster_event(dagster_event, run_id=run_id, log_level=log_level)
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_core/instance/__init__.py", line 2519, in report_dagster_event
    self.handle_new_event(event_record)
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_core/instance/__init__.py", line 2417, in handle_new_event
    self._event_storage.store_event(events[0])
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_core/storage/event_log/sqlite/sqlite_event_log.py", line 242, in store_event
    conn.execute(insert_event_statement)
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1418, in execute
    return meth(
           ^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/sql/elements.py", line 515, in _execute_on_connection
    return connection._execute_clauseelement(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1640, in _execute_clauseelement
    ret = self._execute_context(
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1846, in _execute_context
    return self._exec_single_context(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1986, in _exec_single_context
    self._handle_dbapi_exception(
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 2355, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 941, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) unable to open database file
[SQL: INSERT INTO event_logs (run_id, event, dagster_event_type, timestamp, step_key, asset_key, partition) VALUES (?, ?, ?, ?, ?, ?, ?)]
[parameters: ('3162fc18-e15b-4284-81dd-b46c063d8d35', '{"__class__": "EventLogEntry", "dagster_event": {"__class__": "DagsterEvent", "event_specific_data": {"__class__": "EngineEventData", "error": null, ... (350 characters truncated) ... "pipeline_name": "etl_full", "run_id": "3162fc18-e15b-4284-81dd-b46c063d8d35", "step_key": null, "timestamp": 1727322130.5879638, "user_message": ""}', 'ENGINE_EVENT', '2024-09-26 03:42:10.587964', None, None, None)]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
```
This "unable to open database file" error seems somewhat different than the locked DB error that we were getting before due to attempted concurrent writes.
So, some open questions:

- Is there a new workaround to squeeze some more life out of SQLite? Or, what is the easiest way to use Postgres locally for development? @jdangerx suggested OrbStack as a containerized Postgres solution for local use that doesn't run something like 10x slower on macOS/arm64. (See the configuration sketch after this list.)
- Is there some way to make Dagster log fewer events? Can we turn off DEBUG level logging? I don't think we really look at it!
- How does this intersect with our development environment setup, which is already challenging for new contributors? Is there an "easy" way to use dev containers and integrate this Postgres setup into them?
- If we're messing around with containers anyway, could we also stop making our tests use the in-process (single-threaded) executor?
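On the Postgres question: if I'm reading the Dagster docs right, instance storage can be pointed at Postgres via dagster.yaml in $DAGSTER_HOME, using the dagster-postgres package. A sketch with placeholder credentials, assuming a Postgres server in a local container:

```yaml
# dagster.yaml (in $DAGSTER_HOME) -- all values here are placeholders.
# Assumes the dagster-postgres package and a Postgres server, e.g. one
# running in a local OrbStack/Docker container on the default port.
storage:
  postgres:
    postgres_db:
      username: dagster
      password: dagster_dev_password  # don't commit real credentials
      hostname: localhost
      db_name: dagster
      port: 5432

# Possibly relevant to the logging question too: captured Python logs can
# be throttled here instead of being left at DEBUG.
python_logs:
  python_log_level: INFO
```

That would move run, event log, and schedule storage off SQLite in one shot.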
I may just be a superstitious pigeon on this, but it seems like sometimes I get this failure 100% reliably, and other times it's maybe just a 50% chance, so if it fails and I run the ETL again, it might work. But I don't know what influences the probability. Last night I tried to run it 10 times in a row, and they all failed within 2.5 minutes.
Some things I've explored in the past without success:
- Does the probability vary with the Python environment? That is, is there some combination of packages that don't play well together that makes the problem more likely? Sometimes updating the pudl-dev environment seems to fix things, but it's not clear whether that's real or just me imagining things.
- Are there too many files & sockets open at once? Using lsof I can see that Dagster often has thousands of "files" open at once, and my system as a whole might have 20,000 open while the ETL is running. (A quick diagnostic sketch follows.)
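Related to the open-files hypothesis, here's a quick diagnostic sketch for checking and raising the per-process limit (POSIX-only; the 10,000 target is an arbitrary placeholder):

```python
# Hypothetical diagnostic: inspect (and optionally raise) the per-process
# open-file limit, which "unable to open database file" can run into.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# macOS soft limits are often low (e.g. 256); raise it toward the hard cap.
resource.setrlimit(resource.RLIMIT_NOFILE, (min(10_000, hard), hard))
```

If the soft limit turns out to be tiny relative to what lsof reports, that would at least make this hypothesis testable.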