-
Notifications
You must be signed in to change notification settings - Fork 229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide file hashes in the URLs to avoid unnecessary file downloads (bandwidth saver) #1433
Changes from 3 commits
e7e9bce
713cfea
b3799a8
089e93d
a604491
984f7dd
9966096
8312f09
b65a945
1d11042
8895930
5d132b9
744503f
f7e6d7f
004fa70
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,6 +11,7 @@ | |
from packaging.version import parse | ||
|
||
import boto3 | ||
from botocore.exceptions import NoCredentialsError | ||
|
||
|
||
S3 = boto3.resource('s3') | ||
|
@@ -212,6 +213,23 @@ def normalize_package_version(self: S3IndexType, obj: str) -> str: | |
def obj_to_package_name(self, obj: str) -> str: | ||
return path.basename(obj).split('-', 1)[0] | ||
|
||
def fetch_checksum_from_s3(self, s3_key): | ||
s3_key = s3_key.replace("%2B", "+") | ||
try: | ||
response = CLIENT.get_object_attributes( | ||
Bucket=BUCKET, | ||
Key=s3_key, | ||
ObjectAttributes=['Checksum'] | ||
) | ||
checksum = response['Checksum']['ChecksumSHA256'] | ||
return checksum | ||
except NoCredentialsError: | ||
print("No AWS credentials found") | ||
return None | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why not just let this propogate? Otherwise it'll just repeatedly print the error and generate a sub-optimal index. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My thinking was that the index page should still generate as it does now even if there is no hash available to provide. I should definitely remove the print statements. I am not sure what exceptions would be raised before the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (Did this mean to get commented below? on https://github.com/pytorch/builder/pull/1433/files/b3799a8eb42b56118a4bc0658902875734888721#r1261472864? If not I'm not following the thread here. If the credentials aren't found we really can't know if there's a checksum or not) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I mean its the same code block in my mind -- I am just saying I don't know what exceptions we want to catch and handle for the case of the checksum isn't available yet -- maybe we don't care about There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it's safe to assume:
Because otherwise if these assumptions aren't held, there's a chance the script generates an So I think the only thing to handle here is "this object has no checksum attribute". Which means the code looks roughly like: response = CLIENT.get_object_attributes(
Bucket=BUCKET,
Key=s3_key,
ObjectAttributes=["Checksum"]
)
checksum = response.get("Checksum", {}).get("ChecksumSHA256", None) without any FWIW I tested your code against a work bucket and got So probably use Here's some local testing... >>> import boto3
>>> BUCKET=S3.Bucket("<bucketname>")
>>> CLIENT = boto3.client("s3")
>>> response = CLIENT.get_object_attributes(Bucket=BUCKET.name,Key="<object-without-checksum>",ObjectAttributes=['Checksum'])
>>> list(response)
['ResponseMetadata', 'LastModified', 'VersionId']
>>> response.get("Checksum", {}).get("ChecksumSHA256", None)
>>> response = CLIENT.get_object_attributes(Bucket=BUCKET.name,Key="<object-WITH-checksum>",ObjectAttributes=['Checksum'])
>>> list(response)
['ResponseMetadata', 'LastModified', 'Checksum']
>>> response.get("Checksum", {}).get("ChecksumSHA256", None)
'7Uc3AtM1Dlc4XiQ8A6NkQdV9JMU4liPtz7wOd+RRCTE=' There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Resolved. |
||
except Exception as e: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a wide exception. I think the Otherwise, a simple bug could manifest as an index not having hashes. When the alternative is an error with a stacktrace given to the user, and no index generated. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Resolved. |
||
print(f"Unable to retrieve checksum due to {e}") | ||
return None | ||
|
||
def to_legacy_html( | ||
self, | ||
subdir: Optional[str]=None | ||
|
@@ -255,7 +273,11 @@ def to_simple_package_html( | |
out.append(' <body>') | ||
out.append(' <h1>Links for {}</h1>'.format(package_name.lower().replace("_","-"))) | ||
for obj in sorted(self.gen_file_list(subdir, package_name)): | ||
out.append(f' <a href="/{obj}">{path.basename(obj).replace("%2B","+")}</a><br/>') | ||
checksum = self.fetch_checksum_from_s3(obj) | ||
if checksum: | ||
out.append(f' <a href="/{obj}#sha256={checksum}">{path.basename(obj).replace("%2B","+")}</a><br/>') | ||
else: | ||
out.append(f' <a href="/{obj}">{path.basename(obj).replace("%2B","+")}</a><br/>') | ||
matteius marked this conversation as resolved.
Show resolved
Hide resolved
|
||
# Adding html footer | ||
out.append(' </body>') | ||
out.append('</html>') | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way to leverage the call in
from_S3
to also include this information?If not, perhaps this should be done there so the returned objects already have the checksum attached and the downstream code can choose to use it or not?
That would isolate the S3 retrieval code to be in one logical place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can tell they are different clients/endpoints that serve different purposes.
From my perspective, the
from_S3
appears to just be tracking a list of strings which is the object s3_key withList[str]
so I am not sure it makes sense to tackle a larger refactor just to add all the additional calls to pre-lookup these checksums and store them in memory until their point of use.Isn't it already isolated to this one file manage.py where the clients are in scope for the whole file?
Maybe I am being dense and don't see the obvious way to accomplish this without spending a couple more hours on it, plus the more complexity I add, well I have no real way of testing it. Looking for any guidance on the right exceptions to catch or not catch above, but I think we need to catch something to cover the case that s3 isn't ready or able to provide the hashes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so looks like you should be able to do this in
from_s3
(which is ideal, because we'll want the checksums for metadata files in #1457 as well:And then for an object without a checksum:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Build is failing with:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#1548
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tagging @malfet again, as these inner comments are notoriously hard to discover. Hope it's no big deal 😇