488 corpus cli #491

Merged
merged 11 commits into from
Aug 27, 2024

Conversation

@nenb (Contributor) commented Aug 14, 2024

PoC that addresses #488.

Based on a gist produced by @pmeier, as well as comments from #488.

An example usage from the current branch is:

ragna corpus ingest $(find MY_DOCUMENT_DIRECTORY -type f) --metadata-fields metadata.json --corpus-name MY_CORPUS_NAME --user UNAME --report-failures 2>> failures.txt

cat metadata.json
# {"MY_DOCUMENT_DIRECTORY/MY_FILE1": {"test_field": true}, "MY_DOCUMENT_DIRECTORY/MY_FILE2": {"test_field": false}}

@nenb requested a review from pmeier August 14, 2024 21:48
@nenb marked this pull request as ready for review August 14, 2024 21:48
@nenb changed the base branch from corpus-dev to 487-corpus-name-as-protocol August 14, 2024 21:48

typer.Option(
"--document-filesystem-handler",
metavar="DOCUMENT_FILESYSTEM_HANDLER",
default_factory=lambda: "ragna.core.LocalDocument",
Contributor Author

From #488: default to ingest files from local disk unless user supplies a different document filesystem handler.
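
A minimal sketch of how a dotted-path default such as "ragna.core.LocalDocument" could be resolved to a class at runtime. The helper name import_from_dotted_path is hypothetical and not part of ragna's API, and a standard-library class stands in for the ragna handler so the snippet runs without ragna installed.

```python
import importlib


def import_from_dotted_path(dotted_path: str) -> type:
    """Resolve a string such as "package.module.ClassName" to the object it names."""
    module_name, _, attribute_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    try:
        return getattr(module, attribute_name)
    except AttributeError:
        raise ValueError(
            f"Module {module_name!r} has no attribute {attribute_name!r}"
        ) from None


# Stand-in for the default handler string; any importable class works the same way.
handler_class = import_from_dotted_path("pathlib.PurePosixPath")
```

Passing the handler as a string keeps the CLI signature simple while still letting a user point at their own document class.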

help="Name of the corpus to ingest the documents into.",
),
],
documents: list[Path],
Contributor Author

From #488: Allow the user to pass an arbitrary number of files.

),
],
documents: list[Path],
verbose: Annotated[bool, typer.Option("--verbose")] = False,
Contributor Author

At the moment this is only used to report files that were not ingested, but it can be extended in various ways.

)

try:
getattr(document_filesystem_handler_class, "from_path")
Contributor Author

From #488: Enforce that every Document class has a from_path classmethod.
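
A sketch of how such a check could fail early with a readable error, assuming the only requirement is a callable from_path attribute on the class. The two handler classes and the require_from_path helper are hypothetical stand-ins, not ragna code.

```python
from pathlib import Path


class HandlerWithFromPath:
    """Hypothetical document class that satisfies the requirement."""

    def __init__(self, path: Path) -> None:
        self.path = path

    @classmethod
    def from_path(cls, path: Path) -> "HandlerWithFromPath":
        return cls(path)


class HandlerWithoutFromPath:
    """Hypothetical document class that is missing from_path."""


def require_from_path(handler_class: type) -> None:
    """Raise TypeError unless the class exposes a callable from_path."""
    from_path = getattr(handler_class, "from_path", None)
    if not callable(from_path):
        raise TypeError(
            f"{handler_class.__name__} must define a 'from_path' classmethod"
        )


require_from_path(HandlerWithFromPath)  # passes silently
```

Using getattr with a default avoids a bare AttributeError and lets the CLI report which class is at fault.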

with tqdm(total=len(documents) // BATCH_SIZE) as pbar:
for batch_number in range(0, len(documents), BATCH_SIZE):
document_collection = []
document_metadata_collection = []
Contributor Author

Messy, and can probably be tidied a bit.
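
One way the batching could be tidied, as a sketch only: a small generator yields the slices, and the progress total is computed with a ceiling division. A floor division for the total counts one batch too few whenever the number of documents is not a multiple of the batch size (25 documents in batches of 10 is three batches, not two). The helper names and the batch size of 10 are assumptions for illustration.

```python
import math
from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def number_of_batches(n_items: int, batch_size: int) -> int:
    """Number of batches iter_batches yields, suitable as a progress-bar total."""
    return math.ceil(n_items / batch_size)


documents = [f"doc_{i}.txt" for i in range(25)]  # hypothetical file names
batch_sizes = [len(batch) for batch in iter_batches(documents, 10)]
```

The resulting total can be passed to the progress bar, with one update per yielded batch.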

if not document_collection:
continue

# TODO: add a check to see if the documents already exist in the database
Contributor Author (@nenb, Aug 14, 2024)

Flagging a TODO

Update: Relevant if the user aborts the ingestion and returns later, or if there is a fatal error at some point, and the user needs to restart the ingestion.

Member

Good point. Question is how do we check if a document is already there? By default, we generate a random ID and with that we'll never have a duplication. We could go for the path, but I'd put that in a feature flag, i.e. unique_paths=True. Meaning, by default we'd get duplicate detection by path, but the user has an option to disable it.

Or am I being paranoid here?

Contributor Author

I've added a short-term checkpointing solution that stores all paths in a local file.

Base automatically changed from 487-corpus-name-as-protocol to corpus-dev August 14, 2024 22:58
corpus_name: Optional[str] = typer.Option(
None, help="Name of the corpus to ingest the documents into."
),
config: ConfigOption = "./ragna.toml", # type: ignore[assignment]
Contributor Author

I have now added the config option, as you suggested in #488.

verbose: bool = typer.Option(
False, help="Print the documents that could not be ingested."
),
ignore_log: bool = typer.Option(
Contributor Author

Short-term solution for checkpointing files that have already been ingested. Setting this option bypasses the checkpoint logic.

Longer-term, I think this needs to be incorporated into the ragna database. Otherwise, it's very easy to add the same file to the vector database multiple times, which reduces performance (instead of returning 10 sources, you just get 10 copies of the same source), unless we have some logic to deal with this already.

with make_session() as session:
user_id = database._get_user_id(session, user)

# Log (JSONL) for recording which files previously added to vector database.
Contributor Author

Checkpoint logic: the log is stored in a JSONL file on the filesystem where the CLI is run.
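
A rough sketch of what a JSONL checkpoint of this kind could look like, not the code from this PR. It assumes one JSON object per line with a single "path" key. The function names and that record layout are illustrative assumptions. Only the ignore_log flag mirrors the option discussed above.

```python
import json
from pathlib import Path


def load_ingested_paths(log_path: Path) -> set[str]:
    """Return the paths recorded in the JSONL log, or an empty set if it is missing."""
    if not log_path.exists():
        return set()
    with log_path.open(encoding="utf-8") as log_file:
        return {json.loads(line)["path"] for line in log_file if line.strip()}


def record_ingested_path(log_path: Path, document_path: Path) -> None:
    """Append one record to the log after a document was ingested."""
    with log_path.open("a", encoding="utf-8") as log_file:
        log_file.write(json.dumps({"path": str(document_path)}) + "\n")


def select_pending(
    documents: list[Path], log_path: Path, ignore_log: bool = False
) -> list[Path]:
    """Drop documents already present in the log unless ignore_log is set."""
    if ignore_log:
        return list(documents)
    already_ingested = load_ingested_paths(log_path)
    return [doc for doc in documents if str(doc) not in already_ingested]
```

Appending one line per document means an aborted run leaves a usable log behind, which is the restart scenario raised earlier in the thread. Matching on the path string alone will not detect the same file reached through a different path.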

f"Could not connect to the database: {config.api.database_url}"
)

if metadata_fields:
Contributor Author

This file should contain a mapping from individual filepaths to a dictionary of metadata fields for each filepath.
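
To illustrate the expected shape, here is a sketch of loading and validating such a file. It assumes the file is JSON, as in the metadata.json example in the PR description, and that keys match the document paths exactly as passed on the command line. The helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any


def load_metadata_fields(metadata_file: Path) -> dict[str, dict[str, Any]]:
    """Load a JSON object mapping file paths to per-file metadata dictionaries."""
    with metadata_file.open(encoding="utf-8") as file:
        mapping = json.load(file)
    if not isinstance(mapping, dict) or not all(
        isinstance(fields, dict) for fields in mapping.values()
    ):
        raise ValueError(
            "Expected a JSON object mapping each file path to an object of metadata fields"
        )
    return mapping


def metadata_for(document: Path, mapping: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Return the metadata for one document, or an empty dict if none was given."""
    # Assumption: keys are compared as plain strings, with no path normalization.
    return mapping.get(str(document), {})
```

Returning an empty dictionary for unlisted files lets the metadata file cover only a subset of the documents.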

@pmeier changed the title from "488 corpus cli poc" to "488 corpus cli" on Aug 27, 2024
@pmeier (Member) left a comment

I don't love it, but I don't want to spend more time on this initial feature. I added a few small changes:

  • Factor out the corpus commands into a separate module
  • Switch the input arguments to use typing.Annotated, as is recommended and as we already do elsewhere in our CLI
  • Add a banner whenever someone uses a subcommand of ragna corpus, informing them that this part of the CLI is experimental and subject to change

I'll open an issue with all of the comments I have about the functionality that we don't need to do here. Thanks Nick!

@pmeier (Member) commented Aug 27, 2024

I'll open an issue with all of the comments I have about the functionality that we don't need to do here.

Actually, I think most of it is already documented in #488.

@pmeier merged commit 6ecb23a into corpus-dev on Aug 27, 2024
21 checks passed
@pmeier deleted the 488-corpus-cli-poc branch August 27, 2024 08:28