Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: [ingest-file]: database-related errors when processing PDF files #4002

Closed
sjinks opened this issue Nov 13, 2024 · 4 comments
Closed

BUG: [ingest-file]: database-related errors when processing PDF files #4002

sjinks opened this issue Nov 13, 2024 · 4 comments
Labels
bug Things that should work, but don’t Moderate Issue that may require attention

Comments

@sjinks
Copy link

sjinks commented Nov 13, 2024

Describe the bug
docker compose logs ingest-file shows errors like

{
  "logger": "sqlalchemy.pool.impl.SingletonThreadPool",
  "timestamp": "2024-11-13 00:00:02.935097",
  "exception": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py\", line 376, in _close_connection\n    self._dialect.do_close(connection)\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py\", line 708, in do_close\n    dbapi_connection.close()\nsqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 140639094179584 and this is thread id 140639274534656.",
  "dataset": "1",
  "trace_id": "ed8f2d0dcb5d4fe0ac1aa29bea62fdec",
  "v": "4.0.0",
  "start_time": 1731456002.9097664,
  "stage": "ingest",
  "job_id": "4:553b21da-a2d0-4704-bb79-0b7a68df3c1f",
  "message": "Exception closing connection <sqlite3.Connection object at 0x7fe8fe072a80>",
  "severity": "ERROR"
}

Human-readable backtrace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 376, in _close_connection
    self._dialect.do_close(connection)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py", line 708, in do_close
    dbapi_connection.close()
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 140639094179584 and this is thread id 140639274534656.
{
  "logger": "servicelayer.taskqueue",
  "timestamp": "2024-11-13 00:06:36.803818",
  "exception": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py\", line 1967, in _exec_single_context\n    self.dialect.do_execute(\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py\", line 941, in do_execute\n    cursor.execute(statement, parameters)\nsqlite3.OperationalError: no such table: ingest_cache\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.8/dist-packages/servicelayer/taskqueue.py\", line 576, in handle\n    task = self.dispatch_task(task)\n  File \"/ingestors/ingestors/worker.py\", line 63, in dispatch_task\n    entity_ids = self._ingest(ftmstore_dataset, task)\n  File \"/ingestors/ingestors/worker.py\", line 36, in _ingest\n    manager.ingest_entity(entity)\n  File \"/ingestors/ingestors/manager.py\", line 189, in ingest_entity\n    self.ingest(file_path, entity)\n  File \"/ingestors/ingestors/manager.py\", line 220, in ingest\n    self.delegate(ingestor_class, file_path, entity)\n  File \"/ingestors/ingestors/manager.py\", line 245, in delegate\n    ingestor_class(self).ingest(file_path, entity)\n  File \"/ingestors/ingestors/documents/office.py\", line 171, in ingest\n    pdf_path = self.document_to_pdf(unique_tmpdir, file_path, entity)\n  File \"/ingestors/ingestors/support/convert.py\", line 29, in document_to_pdf\n    pdf_hash = self.tags.get(key)\n  File \"/usr/local/lib/python3.8/dist-packages/servicelayer/tags.py\", line 52, in get\n    rp = conn.execute(stmt)\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py\", line 1418, in execute\n    return meth(\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/sql/elements.py\", line 515, in _execute_on_connection\n    return connection._execute_clauseelement(\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py\", line 1640, in _execute_clauseelement\n    ret = self._execute_context(\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py\", line 1846, in _execute_context\n    return self._exec_single_context(\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py\", line 1986, in _exec_single_context\n    self._handle_dbapi_exception(\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py\", line 2355, in _handle_dbapi_exception\n    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py\", line 1967, in _exec_single_context\n    self.dialect.do_execute(\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py\", line 941, in do_execute\n    cursor.execute(statement, parameters)\nsqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: ingest_cache\n[SQL: SELECT ingest_cache.value \nFROM ingest_cache \nWHERE ingest_cache.\"key\" = ?]\n[parameters: (pdf:9dc3d66e9a33042a94a7a7aeda434fcdeb828ce3,)]\n(Background on this error at: https://sqlalche.me/e/20/e3q8)",
  "dataset": "1",
  "trace_id": "ed8f2d0dcb5d4fe0ac1aa29bea62fdec",
  "v": "4.0.0",
  "start_time": 1731456396.776213,
  "stage": "ingest",
  "job_id": "4:553b21da-a2d0-4704-bb79-0b7a68df3c1f",
  "message": "Error in task handling",
  "severity": "ERROR"
}

Human-readable backtrace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py", line 941, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: no such table: ingest_cache

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/servicelayer/taskqueue.py", line 576, in handle
    task = self.dispatch_task(task)
  File "/ingestors/ingestors/worker.py", line 63, in dispatch_task
    entity_ids = self._ingest(ftmstore_dataset, task)
  File "/ingestors/ingestors/worker.py", line 36, in _ingest
    manager.ingest_entity(entity)
  File "/ingestors/ingestors/manager.py", line 189, in ingest_entity
    self.ingest(file_path, entity)
  File "/ingestors/ingestors/manager.py", line 220, in ingest
    self.delegate(ingestor_class, file_path, entity)
  File "/ingestors/ingestors/manager.py", line 245, in delegate
    ingestor_class(self).ingest(file_path, entity)
  File "/ingestors/ingestors/documents/office.py", line 171, in ingest
    pdf_path = self.document_to_pdf(unique_tmpdir, file_path, entity)
  File "/ingestors/ingestors/support/convert.py", line 29, in document_to_pdf
    pdf_hash = self.tags.get(key)
  File "/usr/local/lib/python3.8/dist-packages/servicelayer/tags.py", line 52, in get
    rp = conn.execute(stmt)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1418, in execute
    return meth(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/sql/elements.py", line 515, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1640, in _execute_clauseelement
    ret = self._execute_context(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1846, in _execute_context
    return self._exec_single_context(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1986, in _exec_single_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 2355, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py", line 941, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: ingest_cache
[SQL: SELECT ingest_cache.value 
FROM ingest_cache 
WHERE ingest_cache."key" = ?]
[parameters: (pdf:9dc3d66e9a33042a94a7a7aeda434fcdeb828ce3,)]
(Background on this error at: https://sqlalche.me/e/20/e3q8)

To Reproduce

  1. Download docker-compose.yml and aleph.env.tmpl, configure alpeh.env
  2. Run docker compose run --rm shell aleph upgrade
  3. Run docker compose up -d
  4. Create an admin user
  5. Log in, create a new investigation, upload a directory. As far as I understand, it should contain PDF files.
  6. While the files are being uploaded, run docker compose logs -f in a terminal and watch.

Expected behavior
Documents are processed without any errors, and I can see their content in the UI.

Aleph version
4.0.1

Screenshots
Screenshot_20241113_021352

Additional context
"no such table: ingest_cache" is not a new error, it also happens in the stable 3.17.0 branch:
Screenshot_20241113_021558

Related: #3816; I use the default ALEPH_DATABASE_URI

@sjinks sjinks added bug Things that should work, but don’t triage These issues need to be reviewed by the Aleph team labels Nov 13, 2024
@catileptic catileptic added Moderate Issue that may require attention and removed triage These issues need to be reviewed by the Aleph team labels Nov 19, 2024
@brassy-endomorph
Copy link
Contributor

Noting that we are also getting this error in version ingest-file:3.22.0

@brassy-endomorph
Copy link
Contributor

Relates to #3872 and tagging @stchris as per the request in the comment here #3872 (comment)

stchris added a commit that referenced this issue Jan 24, 2025
ingest-file uses the Tags functionality from servicelayer to cache certain document types
during ingest to speed up its operation. Normally the settings file defaults to trying
the ALEPH_DATABASE_URI followed by the FTM_STORE_URI, which in most cases should be set
to non-default values.

But if one were to just follow our prod environment docs blindly and not uncomment these
two settings then the TAGS_DATABASE_URI would be unset, causing servicelayer to pick the
default (sqlite:///) which breaks now that we have multiple threads writing to the tags
database.

This change documents how this setting works and encourages one to set it if the above
two settings are left to defaults.

Relates to #4002, #3816
@stchris
Copy link
Contributor

stchris commented Jan 24, 2025

Looking at the setup everyone who is reporting this issue the one common thing is following a typical environment setup (little or no changes to aleph.env.tmpl) so the issue at hand is that servicelayer tries to set the TAGS_DATABASE_URI env var by defaulting to either ALEPH_DATABASE_URI or FTM_STORE_URI which means that having those variable commented out (so not set) leaves you with an empty/unset Tags db path, which then defaults to sqlite:///. This in turn does not work as of recently, because we are writing from at least two threads to that db.

Long story short, to fix this issue:

  • set your ALEPH_DATABASE_URI and FTM_DATABASE_URI env vars (even if they just point to the defaults)

OR

stchris added a commit that referenced this issue Jan 24, 2025
ingest-file uses the Tags functionality from servicelayer to cache certain document types
during ingest to speed up its operation. Normally the settings file defaults to trying
the ALEPH_DATABASE_URI followed by the FTM_STORE_URI, which in most cases should be set
to non-default values.

But if one were to just follow our prod environment docs blindly and not uncomment these
two settings then the TAGS_DATABASE_URI would be unset, causing servicelayer to pick the
default (sqlite:///) which breaks now that we have multiple threads writing to the tags
database.

This change documents how this setting works and encourages one to set it if the above
two settings are left to defaults.

Relates to #4002, #3816
@stchris stchris closed this as completed Jan 24, 2025
@zzzepezzz
Copy link

I can confirm that it works for my instance. Thanks for that !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Things that should work, but don’t Moderate Issue that may require attention
Projects
None yet
Development

No branches or pull requests

5 participants