checksum support, improve control with new cmd line switches #2134

Open · wants to merge 22 commits into main
Conversation

cforce (Contributor) commented on Nov 10, 2024:

This PR leverages the MD5 checksum available through local, blob, or Data Lake storage metadata fields, based on the discussion in #2113.

By default, the PR adds a check during ingestion (for both local and Azure Data Lake Storage sources) to see whether each file already exists in the target storage. The MD5 checksum is stored in the search index and used as the verification key: a file counts as already ingested when a lookup finds at least one record with the same category and filename. This also permits the same file to be copied into multiple categories if needed.
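
A minimal sketch of that existence lookup, assuming an Azure AI Search index that exposes sourcefile, category, and md5 fields (the field names, index name, and helper are illustrative, not taken from this PR's diff):

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Illustrative client setup; endpoint, key, and index name are placeholders.
search_client = SearchClient(
    endpoint="https://<service>.search.windows.net",
    index_name="gptkbindex",
    credential=AzureKeyCredential("<key>"),
)

def file_already_ingested(filename: str, category: str, md5: str) -> bool:
    """Return True if at least one indexed record matches category/filename/md5."""
    results = search_client.search(
        search_text="*",
        filter=f"sourcefile eq '{filename}' and category eq '{category}' and md5 eq '{md5}'",
        top=1,
    )
    return any(True for _ in results)
```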

Additionally, a new command-line parameter, --folder <name>, allows specifying a dedicated subfolder within the data folder or Data Lake container for each script run. This enables isolated runs with explicitly controlled parameters and a specific subset of files.

The new --enablevision true/false option allows per-folder and per-category control over whether vision embeddings should be generated. This option is beneficial when dealing with folders containing predominantly heavy image content, providing control over resource use.

Example usage:

./.venv/bin/python ./app/backend/prepdocs.py './data/createindex_dummy/*' --category newtest --folder test --enablevision false --verbose
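
For illustration, the new switches might be wired up roughly like this with argparse (a sketch; the actual prepdocs.py argument handling may differ, and the true/false form of --enablevision implies a string-to-bool conversion):

```python
import argparse

def str_to_bool(value: str) -> bool:
    # Accept the "true"/"false" style shown in the example usage above.
    return value.strip().lower() in ("true", "1", "yes")

parser = argparse.ArgumentParser()
parser.add_argument("files", help="Glob pattern of files to ingest")
parser.add_argument("--category", help="Category to tag ingested documents with")
parser.add_argument("--folder", help="Dedicated subfolder within the data folder or Data Lake container")
parser.add_argument("--enablevision", type=str_to_bool, default=True,
                    help="Whether to generate vision embeddings for this run (true/false)")
parser.add_argument("--verbose", action="store_true")
args = parser.parse_args()
```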

The PR also improves progress reporting by displaying the number of files processed, with an update every 10 files. This gives better visibility into processing speed, which is particularly useful when working with partially processed data. Additionally, the log level of some messages has been lowered from INFO to DEBUG, reducing console noise so that progress updates stay visible longer.
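
A sketch of what an every-10-files progress report can look like (names are illustrative):

```python
import logging

logger = logging.getLogger("scripts")

def report_progress(processed: int, total: int) -> None:
    # Emit a progress line every 10 files, plus one final line at the end.
    if processed % 10 == 0 or processed == total:
        logger.info("Processed %d of %d files", processed, total)
```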

Finally, MD5 checksum comparisons are introduced at various stages: a top-level check (to avoid redundant processing), and checks for blobs, text, and image embeddings. A flag to bypass the top-level check for known MD5 values in the index may be introduced later (e.g., --ignore_checksum).
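
A minimal sketch of such a checksum gate, assuming the stored MD5 comes from blob/file metadata or the index (helper names are hypothetical):

```python
import hashlib
from typing import Optional

def compute_md5(path: str) -> str:
    # Hash in chunks so large files do not need to fit in memory.
    hasher = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

def needs_processing(path: str, stored_md5: Optional[str]) -> bool:
    # Skip the file when its current checksum matches the one already stored.
    return stored_md5 is None or compute_md5(path) != stored_md5
```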


cforce changed the title from "checksum support , improve control with new cm line swicthes" to "checksum support , improve control with new cmd line swicthes" on Nov 10, 2024
cforce changed the title from "checksum support , improve control with new cmd line swicthes" to "checksum support , improve control with new cmd line switches" on Nov 10, 2024
file_md5 = file.metadata.get('md5')
blob_md5 = blob_metadata.get('md5')

# Remove md5 from file metadata if it matches the blob metadata
Collaborator:

This comment doesn't seem to describe the next line?
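
For reference, code that would actually match the quoted comment might look like this (a hypothetical sketch, not taken from the PR):

```python
# Remove md5 from file metadata if it matches the blob metadata.
if file_md5 is not None and file_md5 == blob_md5:
    file.metadata.pop("md5", None)
```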

return await container_client.upload_blob(blob_name, reopened_file, overwrite=True, metadata=file.metadata)

async def _file_blob_update_needed(self, blob_client: BlobClient, file : File) -> bool:
md5_check: int = 0  # 0 = not done, 1 = positive, 2 = negative
Collaborator:

Did you end up not using this variable?

@@ -45,29 +45,63 @@ def __init__(
self.subscriptionId = subscriptionId
self.user_delegation_key: Optional[UserDelegationKey] = None

#async def upload_blob(self, file: File, container_client:ContainerClient) -> Optional[List[str]]:
Collaborator:

This line can be rm'ed now?

# Re-open and upload the original file
md5_check: int = 0  # 0 = not done, 1 = positive, 2 = negative
Collaborator:

I think a Python Enum would be clearer here?
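
For illustration, a minimal sketch of the suggested Enum (naming is hypothetical):

```python
from enum import Enum

class Md5Check(Enum):
    NOT_DONE = 0  # checksum comparison has not run yet
    POSITIVE = 1  # checksums matched
    NEGATIVE = 2  # checksums differed

md5_check = Md5Check.NOT_DONE
# e.g. `if md5_check is not Md5Check.POSITIVE:` would replace `if md5_check != 1:`
```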

# Re-open and upload the original file
md5_check: int = 0  # 0 = not done, 1 = positive, 2 = negative

# upload the file local storage zu azure storage
Collaborator:

Suggested change:
- # upload the file local storage zu azure storage
+ # upload the file from local storage to Azure storage

md5_check = 1
file.url = blob_client.url

if md5_check!=1 and self.store_page_images:
Collaborator:

Please install the pre-commit (see CONTRIBUTING guide) as it'll fix Python formatting issues for you, like the whitespace around operators. Thanks!

pamelafox (Collaborator) left a comment:

Thank you! I've started a review but am OOO today, will continue review later.

logger.info("Converting page %s to image and uploading -> %s", i, blob_name)

blob_client = container_client.get_blob_client(blob_name)
if await blob_client.exists():
Collaborator:

Do we need this additional check here, given the check that happens in the function that calls this code?

if not self.user_delegation_key:
self.user_delegation_key = await service_client.get_user_delegation_key(start_time, expiry_time)

- if blob_client.account_name is not None:
+ if container_client.account_name is not None:
Collaborator:

Am curious why the need to go from blob_client to container_client here? Does it matter?

@@ -22,11 +22,11 @@ async def parse_file(
if processor is None:
logger.info("Skipping '%s', no parser found.", file.filename())
return []
logger.info("Ingesting '%s'", file.filename())
logger.debug("Ingesting '%s'", file.filename())
Collaborator:

Looks like you changed the levels to debug, does the current output feel too verbose? I find it helpful.

elif self.document_action == DocumentAction.RemoveAll:
await self.blob_manager.remove_blob()
await search_manager.remove_content()

async def process_file(self, file, search_manager):
Collaborator:

These two functions seem to be unused currently- perhaps your thought was to refactor the code above to call these two functions. I'll remove them for now to make the diff smaller.

"""

# Create a BlobServiceClient using account URL and credentials
service_client = BlobServiceClient(
Collaborator:

Can you use BlobServiceClient to interact with a DataLake Storage account? That's what you seem to be doing here, but I didn't realize that was possible.

blob_metadata = blob_properties.metadata

# Check if the md5 values are the same
file_md5 = file.metadata.get("md5")
Collaborator:

I don't see logic in the main prepdocs for setting the md5 of the local file metadata, I only see that in ADLS2. How does this work when not using the ADLS2 strategy?
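
A hypothetical sketch of how local files could get an md5 in their metadata, mirroring what the ADLS2 path reads from storage metadata (assumes the File wrapper exposes a seekable content stream and a metadata dict; the helper name is invented):

```python
import hashlib

def attach_local_md5(file) -> None:
    file.content.seek(0)
    digest = hashlib.md5(file.content.read()).hexdigest()
    file.content.seek(0)  # rewind so later readers still see the full file
    file.metadata = {**(file.metadata or {}), "md5": digest}
```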

),
SearchableField(
name="content",
fields = [
Collaborator:

How necessary is it to store md5 as a field in the search index? I was thinking we'd also need to store it in the blob/file metadata? @mattgotteiner What do you think of storing md5 in search index, have you seen this approach before?
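
For context, if the checksum does live in the index, a plain filterable string field would suffice for the existence lookup; a sketch using azure-search-documents models (illustrative, not the PR's exact definition):

```python
from azure.search.documents.indexes.models import SearchFieldDataType, SimpleField

# A simple, filterable string field is enough to look records up by checksum.
md5_field = SimpleField(
    name="md5",
    type=SearchFieldDataType.String,
    filterable=True,
)
```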

pamelafox (Collaborator) commented:
I've gone in and made the changes I recommended, but I have some higher-level questions about the approach, which I've left as comments.
