scandir and md5 for adlsgen2setup.py #2113
base: main
Conversation
@microsoft-github-policy-service agree
scripts/adlsgen2setup.py
Outdated
```python
for directory, directory_client in directories.items():
    directory_path = os.path.join(self.data_directory, directory)

    # Überprüfen, ob 'scandir' existiert und auf False gesetzt ist
```
German comment?
Ja! will translate that ;)
scripts/adlsgen2setup.py
Outdated
```python
for file in files:
    # Checks to see if the root is '.' and changes it to the correct current
    # working directory by calling os.getcwd(). Otherwise root_path will just be the root variable value.
    if root == '.':
```
Do we only need to set this once, or does it change for every file?
Yes, good point.
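A minimal sketch of the hoisting discussed above, assuming the surrounding code walks the local directory with os.walk (the variable names here are illustrative, not the PR's exact code):

```python
import os

data_directory = "./data"  # illustrative path

for root, dirs, files in os.walk(data_directory):
    # Resolve '.' once per directory rather than once per file:
    # 'root' does not change inside the inner loop over files.
    root_path = os.getcwd() if root == "." else root
    for file in files:
        file_path = os.path.join(root_path, file)
        # ... hash / upload file_path here ...
```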
scripts/adlsgen2setup.py
Outdated
logger.info(f"Skipping directory {directory} as 'scandir' is set to False") | ||
continue | ||
|
||
groups = self.data_access_control_format["directories"][directory].get("groups", []) |
groups is not used?
scripts/adlsgen2setup.py
Outdated
```python
    async def check_md5(self, path: str, md5_hash: str) -> bool:
        # if filename ends in .md5 skip
        if path.endswith(".md5"):
```
We actually want to move the storage of the MD5 into the Blob storage metadata for our main prepdocs strategy, since the local MD5 is problematic when you switch environments. Would that be possible with ADLS Gen2? I assume it also has the ability to store metadata, given it's a type of Blob storage?
What do you think of that approach, remote MD5?
I've already implemented that and will include it, along with some additional metadata, in the next commit. I read the file directly from the source (local storage or blob/data lake storage), create a new checksum on the fly, and keep it in memory. Then I compare it with the persistent checksum on the target storage (data lake) and update the blob, as well as the checksum in the metadata, only if they differ.
I've also started on prepdocs.sh to make use of the persisted MD5 on the backend once it's "injected" from the data lake using the datalake strategy, which currently seems unstable when exceptions occur (it exits the whole loop, which is even more painful if you don't have a resume mechanism). The MD5 avoids duplicating the indexing and tokenizing work, which is fine, but ingestion still has to iterate again over everything up to the offset where it died abnormally. For external systems I use queries with timestamps to find a proper point to resume, but I have no idea whether the current data lake file walker produces any stable order or if it is completely random.
Maybe there is a way to walk/iterate along a query based on a metadata field (which I have added), like the last change date in metadata? That way I could persist the last ingestion date seen in another blob and use it as a resume point.
I'm unsure where best to store this. Two options I had in mind:
a) In the index, where it would be associated with each chunk, meaning we'd need to locate chunks where there are more than zero results for the fingerprint and update multiple rows accordingly.
b) In /content (blob storage) for inline rendering of the source document as a citation in the browser. This has the benefit of only requiring one update per change, though we'd still need to locate and remove all matching chunks in the index by primary key for consistency.
Option (a) seems more logical, doesn't it?
By the way, I remember having trouble with MD5 duplicates for larger amounts of data, so maybe it's better to use SHA digests?
And even more important: how would the above efforts fit into integrated vectorization? Will that work with whatever fingerprint is carried over into the index? I assume it requires a new skill, doesn't it?
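For what it's worth, here is a minimal sketch of the compare-and-upload flow described above, using the azure-storage-file-datalake SDK; the metadata key name, helper function, and use of SHA-256 are illustrative assumptions, not the PR's actual implementation:

```python
import hashlib
from azure.storage.filedatalake import DataLakeFileClient


def upload_if_changed(file_client: DataLakeFileClient, local_path: str,
                      hash_key: str = "content_sha256") -> bool:
    """Upload a local file only when its hash differs from the hash stored in the file's metadata."""
    with open(local_path, "rb") as f:
        data = f.read()
    local_hash = hashlib.sha256(data).hexdigest()

    try:
        remote_hash = file_client.get_file_properties().metadata.get(hash_key)
    except Exception:
        remote_hash = None  # file does not exist yet (or properties are unavailable)

    if remote_hash == local_hash:
        return False  # unchanged, skip upload

    # Upload the new content and persist the checksum next to it as metadata.
    file_client.upload_data(data, overwrite=True)
    file_client.set_metadata({hash_key: local_hash})
    return True
```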
Blob storage seems to have an embedded MD5 already:
```python
from azure.storage.blob import BlobServiceClient

# Your connection string
connection_string = "your_connection_string"

# Your blob container name and blob name
container_name = "your_container_name"
blob_name = "your_blob_name"

# Initialize a BlobServiceClient
blob_service_client = BlobServiceClient.from_connection_string(connection_string)

# Get a reference to the blob
blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

# Get the properties of the blob
blob_properties = blob_client.get_blob_properties()

# Retrieve the MD5 hash value from the properties
md5_hash = blob_properties.content_settings.content_md5
print(f"MD5 Hash of the blob: {md5_hash}")
```
Whether that MD5 has the same value as an MD5 computed locally from the file is the question.
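A minimal sketch of that comparison, assuming the blob's Content-MD5 property is populated at all: content_md5 is the raw digest bytes, so it should be compared against hashlib.md5(...).digest() rather than hexdigest() (the helper name is illustrative):

```python
import hashlib


def local_md5_matches_blob(local_path: str, blob_client) -> bool:
    """Compare a locally computed MD5 with the blob's Content-MD5 property."""
    with open(local_path, "rb") as f:
        local_digest = hashlib.md5(f.read()).digest()  # raw 16-byte digest

    content_md5 = blob_client.get_blob_properties().content_settings.content_md5
    if content_md5 is None:
        return False  # the service did not store an MD5 for this blob
    return bytes(content_md5) == local_digest
```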
Also, as long as we use some filename as the blob name (instead of the MD5), we still need to rely on that and can't handle rename scenarios. We also can't easily find out in O(1) whether we already have a blob up there for the MD5 of the file in our hands. If we used the MD5 as the blob name it would make the code much slimmer, but debugging and browsing files in the Azure portal UI would be a bit more cumbersome. The MD5 would also really need to be unique; in the end I would only trust SHA-256 for that.
The Blob service does have its own MD5, but it only computes it for small files, so we would need to compute our own hash. Using SHA-256 also seems fine if we can store that in the blob metadata.
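A minimal sketch of that approach with the azure-storage-blob SDK; the metadata key name and helper functions are illustrative assumptions:

```python
import hashlib
from azure.storage.blob import BlobClient


def upload_with_sha256(blob_client: BlobClient, local_path: str) -> str:
    """Upload a file and record its SHA-256 in the blob's metadata."""
    with open(local_path, "rb") as f:
        data = f.read()
    digest = hashlib.sha256(data).hexdigest()
    blob_client.upload_blob(data, overwrite=True, metadata={"content_sha256": digest})
    return digest


def blob_sha256(blob_client: BlobClient):
    """Read the previously stored SHA-256 back from the blob's metadata, if any."""
    return blob_client.get_blob_properties().metadata.get("content_sha256")
```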
Check Broken URLs: We have automatically detected the following broken URLs in your files. Review and fix the paths to resolve this issue. Check the file paths and associated broken URLs inside them. For more details, check our Contributing Guide.
Check Broken Paths: We have automatically detected the following broken relative paths in your files. Check the file paths and associated broken paths inside them. For more details, check our Contributing Guide.
@pamelafox @mattgotteiner do you still wish for this PR?
see #2065