-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
488 corpus cli #491
488 corpus cli #491
Conversation
ragna/deploy/_cli/core.py
Outdated
typer.Option( | ||
"--document-filesystem-handler", | ||
metavar="DOCUMENT_FILESYSTEM_HANDLER", | ||
default_factory=lambda: "ragna.core.LocalDocument", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From #488: default to ingest files from local disk unless user supplies a different document filesystem handler.
ragna/deploy/_cli/core.py
Outdated
help="Name of the corpus to ingest the documents into.", | ||
), | ||
], | ||
documents: list[Path], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From #488: Allow the user to pass an arbitrary amount of files.
ragna/deploy/_cli/core.py
Outdated
), | ||
], | ||
documents: list[Path], | ||
verbose: Annotated[bool, typer.Option("--verbose")] = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only used at the moment to capture files that were not ingested, but can be extended in various ways.
ragna/deploy/_cli/core.py
Outdated
) | ||
|
||
try: | ||
getattr(document_filesystem_handler_class, "from_path") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From #488: Enforce that every Document
class has a from_path
classmethod.
ragna/deploy/_cli/core.py
Outdated
with tqdm(total=len(documents) // BATCH_SIZE) as pbar: | ||
for batch_number in range(0, len(documents), BATCH_SIZE): | ||
document_collection = [] | ||
document_metadata_collection = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Messy, and can probably be tidied a bit.
ragna/deploy/_cli/core.py
Outdated
if not document_collection: | ||
continue | ||
|
||
# TODO: add a check to see if the documents already exist in the database |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flagging a TODO
Update: Relevant if the user aborts the ingestion and returns later, or if there is a fatal error at some point, and the user needs to restart the ingestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Question is how do we check if a document is already there? By default, we generate a random ID and with that we'll never have a duplication. We could go for the path, but I'd put that in a feature flag, i.e. unique_paths=True
. Meaning, by default we'd get duplicate detection by path, but the user has an option to disable it.
Or am I being paranoid here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'v added a short-term checkpointing solution that stores all paths in a local file.
ragna/deploy/_cli/core.py
Outdated
if not document_collection: | ||
continue | ||
|
||
# TODO: add a check to see if the documents already exist in the database |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Question is how do we check if a document is already there? By default, we generate a random ID and with that we'll never have a duplication. We could go for the path, but I'd put that in a feature flag, i.e. unique_paths=True
. Meaning, by default we'd get duplicate detection by path, but the user has an option to disable it.
Or am I being paranoid here?
f20eb6b
to
ac5c9ea
Compare
ac5c9ea
to
198f35d
Compare
ragna/deploy/_cli/core.py
Outdated
corpus_name: Optional[str] = typer.Option( | ||
None, help="Name of the corpus to ingest the documents into." | ||
), | ||
config: ConfigOption = "./ragna.toml", # type: ignore[assignment] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have now added config
option as you suggested in #488
ragna/deploy/_cli/core.py
Outdated
verbose: bool = typer.Option( | ||
False, help="Print the documents that could not be ingested." | ||
), | ||
ignore_log: bool = typer.Option( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Short-term solution for checkpointing on files which have already been ingested. This option ignores the checkpoint logic.
Longer-term, I think this needs to be incorporated into the ragna
database. Otherwise, it's very easy to add the same file to the vector database multiple times, which reduces performance (instead of returning 10 sources, you just get 10 copies of the same source), unless we have some logic to deal with this already.
ragna/deploy/_cli/core.py
Outdated
with make_session() as session: | ||
user_id = database._get_user_id(session, user) | ||
|
||
# Log (JSONL) for recording which files previously added to vector database. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checkpoint logic, stored in a JSONL file on filesystem where CLI is run
ragna/deploy/_cli/core.py
Outdated
f"Could not connect to the database: {config.api.database_url}" | ||
) | ||
|
||
if metadata_fields: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file should contain a mapping from individual filepaths to a dictionary of metadata fields for each filepath.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love it, but I don't want to spend more time on this initial feature. I added a few small changes:
- Factor out the corpus commands in a separate module
- Switch the input arguments to use
typing.Annotated
as is recommended and we are doing so in our CLI already - Add a banner whenever some uses a subcommand of
ragna corpus
informing them that this part of the CLI is experimental and subject to change.
I'll open an issue with all of the comments I have about the functionality that we don't need to do here. Thanks Nick!
Actually, I think most of it is already documented in #488. |
PoC that addresses #488.
Based on a gist produced by @pmeier, as well as comments from #488.
An example usage from the current branch is: