488 corpus cli #491

nenb · 2024-08-14T21:47:13Z

PoC that addresses #488.

Based on a gist produced by @pmeier, as well as comments from #488.

An example usage from the current branch is:

ragna corpus ingest $(find MY_DOCUMENT_DIRECTORY -type f) --metadata-fields metadata.json --corpus-name MY_CORPUS_NAME --user UNAME --report-failures 2>> failures.txt

cat metadata.json
# {"MY_DOCUMENT_DIRECTORY/MY_FILE1": {"test_field": true}, "MY_DOCUMENT_DIRECTORY/MY_FILE2": {"test_field": false}}

nenb · 2024-08-14T21:49:59Z

ragna/deploy/_cli/core.py

+        typer.Option(
+            "--document-filesystem-handler",
+            metavar="DOCUMENT_FILESYSTEM_HANDLER",
+            default_factory=lambda: "ragna.core.LocalDocument",


From #488: default to ingesting files from local disk unless the user supplies a different document filesystem handler.
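
Since the default is given as a dotted string rather than a class, something presumably has to resolve it to a class at runtime. A minimal sketch of one way to do that is shown below. The helper name `import_from_dotted_path` is hypothetical and not taken from Ragna, and the standard-library `pathlib.PurePosixPath` stands in for `ragna.core.LocalDocument` so that the snippet runs without Ragna installed.

```python
import importlib


def import_from_dotted_path(dotted_path: str) -> object:
    """Resolve "package.module.Name" to the object it names.

    Hypothetical helper for illustration only, not Ragna's implementation.
    """
    module_name, _, attribute_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.Name', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    try:
        return getattr(module, attribute_name)
    except AttributeError as error:
        raise ImportError(
            f"Module {module_name!r} has no attribute {attribute_name!r}"
        ) from error


# Stand-in for the default "ragna.core.LocalDocument".
handler_class = import_from_dotted_path("pathlib.PurePosixPath")
```

Raising early with a clear message keeps a typo in `--document-filesystem-handler` from surfacing only after ingestion has started.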

nenb · 2024-08-14T21:50:37Z

ragna/deploy/_cli/core.py

+            help="Name of the corpus to ingest the documents into.",
+        ),
+    ],
+    documents: list[Path],


From #488: Allow the user to pass an arbitrary number of files.

nenb · 2024-08-14T21:51:13Z

ragna/deploy/_cli/core.py

+        ),
+    ],
+    documents: list[Path],
+    verbose: Annotated[bool, typer.Option("--verbose")] = False,


At the moment this is only used to report files that were not ingested, but it can be extended in various ways.

nenb · 2024-08-14T21:52:06Z

ragna/deploy/_cli/core.py

+        )
+
+    try:
+        getattr(document_filesystem_handler_class, "from_path")


From #488: Enforce that every Document class has a from_path classmethod.
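
The hunk above only shows the `getattr` call, so the following is a sketch of what a stricter check could look like, not the code in this PR. `ExampleDocument`, `InstanceMethodDocument`, and `has_from_path_classmethod` are hypothetical names. The check relies on the fact that a classmethod accessed through its class is a bound method whose `__self__` is that class.

```python
import inspect
from pathlib import Path


class ExampleDocument:
    """Hypothetical stand-in for a document class such as LocalDocument."""

    def __init__(self, path: Path) -> None:
        self.path = path

    @classmethod
    def from_path(cls, path: Path) -> "ExampleDocument":
        return cls(Path(path))


class InstanceMethodDocument:
    """Hypothetical class whose from_path is not a classmethod."""

    def from_path(self, path: Path) -> None:
        return None


def has_from_path_classmethod(document_class: type) -> bool:
    attribute = getattr(document_class, "from_path", None)
    # Plain functions and staticmethods are not bound methods, so they fail here.
    return inspect.ismethod(attribute) and attribute.__self__ is document_class
```

A bare `getattr` would accept `InstanceMethodDocument` as well, and the failure would then only show up when `from_path` is called on the class.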

nenb · 2024-08-14T21:52:36Z

ragna/deploy/_cli/core.py

+    with tqdm(total=len(documents) // BATCH_SIZE) as pbar:
+        for batch_number in range(0, len(documents), BATCH_SIZE):
+            document_collection = []
+            document_metadata_collection = []


Messy, and can probably be tidied a bit.
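
One possible tidy-up, sketched under assumptions: the batching is pulled into a small generator and the progress total is computed with a ceiling division. `len(documents) // BATCH_SIZE` rounds down, so if the bar advances once per batch, its total is one short whenever the number of documents is not a multiple of the batch size. The names `iter_batches` and `number_of_batches` are hypothetical, and the value of `BATCH_SIZE` is assumed because the hunk does not show it.

```python
import math
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

BATCH_SIZE = 10  # assumed value; the hunk above does not show the real one


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def number_of_batches(n_items: int, batch_size: int) -> int:
    """Number of batches iter_batches yields, counting a partial last batch."""
    return math.ceil(n_items / batch_size)


documents = [f"doc_{index}.txt" for index in range(25)]
batch_lengths = [len(batch) for batch in iter_batches(documents, BATCH_SIZE)]
```

With 25 documents and a batch size of 10, the loop runs three times while the floor division gives two.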

nenb · 2024-08-14T21:53:17Z

ragna/deploy/_cli/core.py

+            if not document_collection:
+                continue
+
+            # TODO: add a check to see if the documents already exist in the database


Flagging a TODO

Update: Relevant if the user aborts the ingestion and returns later, or if there is a fatal error at some point, and the user needs to restart the ingestion.

pmeier · 2024-08-14T22:16:30Z

Good point. The question is how we check whether a document is already there. By default, we generate a random ID, so by ID alone we would never see a duplicate. We could key on the path instead, but I'd put that behind a feature flag, e.g. unique_paths=True. That is, by default we'd get duplicate detection by path, but the user would have the option to disable it.

Or am I being paranoid here?

nenb · 2024-08-16

I've added a short-term checkpointing solution that stores all paths in a local file.

nenb · 2024-08-16T18:31:06Z

ragna/deploy/_cli/core.py

+    corpus_name: Optional[str] = typer.Option(
+        None, help="Name of the corpus to ingest the documents into."
+    ),
+    config: ConfigOption = "./ragna.toml",  # type: ignore[assignment]


I have now added the config option, as you suggested in #488.

nenb · 2024-08-16T18:33:40Z

ragna/deploy/_cli/core.py

+    verbose: bool = typer.Option(
+        False, help="Print the documents that could not be ingested."
+    ),
+    ignore_log: bool = typer.Option(


Short-term solution for checkpointing files that have already been ingested. This option bypasses the checkpoint logic.

Longer-term, I think this needs to be incorporated into the ragna database. Otherwise, it's very easy to add the same file to the vector database multiple times, which reduces performance (instead of returning 10 sources, you just get 10 copies of the same source), unless we have some logic to deal with this already.

nenb · 2024-08-16T18:34:15Z

ragna/deploy/_cli/core.py

+    with make_session() as session:
+        user_id = database._get_user_id(session, user)
+
+    # Log (JSONL) for recording which files previously added to vector database.


Checkpoint logic, stored in a JSONL file on the filesystem where the CLI is run.
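
To illustrate the idea, here is a minimal sketch of a JSONL checkpoint log. It is not the code from this PR: the file name `ingested.jsonl`, the `"path"` key, and the helper names are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def load_ingested_paths(log_path: Path) -> set[str]:
    """Return the paths recorded in the log, or an empty set if it is missing."""
    if not log_path.exists():
        return set()
    with log_path.open() as log_file:
        return {json.loads(line)["path"] for line in log_file if line.strip()}


def record_ingested_paths(log_path: Path, paths: list[str]) -> None:
    """Append one JSON object per ingested path."""
    with log_path.open("a") as log_file:
        for path in paths:
            log_file.write(json.dumps({"path": path}) + "\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = Path(tmp_dir) / "ingested.jsonl"  # assumed file name
    # Pretend an earlier run ingested two files before being interrupted.
    record_ingested_paths(log_path, ["docs/a.txt", "docs/b.txt"])

    candidates = ["docs/a.txt", "docs/b.txt", "docs/c.txt"]
    already_ingested = load_ingested_paths(log_path)
    remaining = [path for path in candidates if path not in already_ingested]
```

Appending after each successful batch means an aborted run loses at most the batch that was in flight. An option such as `ignore_log` would simply skip the `load_ingested_paths` step.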

nenb · 2024-08-26T18:41:58Z

ragna/deploy/_cli/core.py

+            f"Could not connect to the database: {config.api.database_url}"
+        )
+
+    if metadata_fields:


This file should contain a mapping from individual filepaths to a dictionary of metadata fields for each filepath.
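
As a sketch of how such a mapping could be consumed, using the example content from the PR description: the helper `metadata_for` is hypothetical, and falling back to an empty dictionary for files that are not listed is an assumption, not confirmed behavior.

```python
import json

# Same shape as the metadata.json shown in the PR description.
metadata_json = """
{
  "MY_DOCUMENT_DIRECTORY/MY_FILE1": {"test_field": true},
  "MY_DOCUMENT_DIRECTORY/MY_FILE2": {"test_field": false}
}
"""

metadata_by_path = json.loads(metadata_json)


def metadata_for(path: str) -> dict:
    # Assumption: files missing from the mapping get no extra metadata.
    return metadata_by_path.get(path, {})
```

The keys have to match the paths exactly as they are passed on the command line, so a mapping written with relative paths will not match absolute paths produced by `find`.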

pmeier

I don't love it, but I don't want to spend more time on this initial feature. I added a few small changes:

- Factor out the corpus commands into a separate module.
- Switch the input arguments to use typing.Annotated, as is recommended and as we already do elsewhere in our CLI.
- Add a banner whenever someone uses a subcommand of ragna corpus, informing them that this part of the CLI is experimental and subject to change.

I'll open an issue with all of the comments I have about the functionality that we don't need to do here. Thanks Nick!

pmeier · 2024-08-27T08:27:53Z

> I'll open an issue with all of the comments I have about the functionality that we don't need to do here.

Actually, I think most of it is already documented in #488.

nenb requested a review from pmeier August 14, 2024 21:48

nenb marked this pull request as ready for review August 14, 2024 21:48

nenb changed the base branch from corpus-dev to 487-corpus-name-as-protocol August 14, 2024 21:48

pmeier reviewed Aug 14, 2024

Base automatically changed from 487-corpus-name-as-protocol to corpus-dev August 14, 2024 22:58

nenb force-pushed the 488-corpus-cli-poc branch from f20eb6b to ac5c9ea Compare August 16, 2024 18:26

PoC for ingestion with CLI

198f35d

nenb force-pushed the 488-corpus-cli-poc branch from ac5c9ea to 198f35d Compare August 16, 2024 18:30

nenb and others added 4 commits August 16, 2024 15:44

Fix

c32e4c1

Tidy

fa8068c

Merge branch 'corpus-dev' into 488-corpus-cli-poc

b29ec83

Add option to pass metadata file

208b69d

pmeier changed the title ~~488 corpus cli poc~~ 488 corpus cli Aug 27, 2024

pmeier added 4 commits August 27, 2024 09:10

factor out corpus CLI in separate module

2ab2bb4

switch to annotated

6dfe29b

minor cleanup

573ca91

add experimental note

76d1080

pmeier approved these changes Aug 27, 2024

pmeier added 2 commits August 27, 2024 10:18

make banner more visible

5932c36

mypy

6035c2b

pmeier merged commit 6ecb23a into corpus-dev Aug 27, 2024
21 checks passed

pmeier deleted the 488-corpus-cli-poc branch August 27, 2024 08:28
