optimize crypto #5715
Conversation
Hi @geyaning, thank you for your contribution to Avocado. Before we do a review, can you please squash your commits into one and write a commit message with a description of your changes and your signature? You can follow our contribution guidelines. You can make the changes with a force-push, and if you need any help, don't hesitate to ask.
Force-pushed from 62502fd to 101ceee
Signed-off-by: geyaning <[email protected]>
Hi @geyaning, thanks for this. It is definitely an improvement. I wrote down some comments for you. But unfortunately I wasn't able to try your changes; I got a `TypeError: Strings must be encoded before hashing` error. Can you please check that? Thank you.
I created a small reproducer:

```python
#!/usr/bin/env python3
from avocado.utils.crypto import hash_file


def main():
    print(hash_file("./examples/tests/passtest.py"))
    print(hash_file("./examples/tests/passtest.py", num_threads=2))


if __name__ == "__main__":
    main()
```
output:

```
$ python /tmp/test.py
c186ebee0d058eda87590de512ed1128
Traceback (most recent call last):
  File "/tmp/test.py", line 10, in <module>
    main()
  File "/tmp/test.py", line 7, in main
    print(hash_file("./examples/tests/passtest.py", num_threads=2))
  File "/home/janrichter/Avocado/avocado/avocado/utils/crypto.py", line 72, in hash_file
    hash_obj.update(future.result())
TypeError: Strings must be encoded before hashing
```
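For context on the error above: `hashlib` digest objects only accept bytes, so any `str` chunk must be encoded first (or the file must be read in binary mode). A minimal sketch of the failure mode:

```python
import hashlib

h = hashlib.md5()
try:
    h.update("chunk")  # passing str raises TypeError (must be bytes)
except TypeError as exc:
    print(exc)
h.update(b"chunk")  # bytes are accepted
print(h.hexdigest())
```

Reading the file with `open(filename, "rb")` keeps every chunk as bytes and avoids the error entirely.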
```diff
-def hash_file(filename, size=None, algorithm="md5"):
+def hash_file(filename, size=None, algorithm="md5", buffer_size=65536, num_threads=1):
```
Can you please make the default `buffer_size` `io.DEFAULT_BUFFER_SIZE`, so the default behavior won't change.
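The suggested default could look like this (a sketch of the signature only, not the final patch):

```python
import inspect
import io


def hash_file(filename, size=None, algorithm="md5",
              buffer_size=io.DEFAULT_BUFFER_SIZE, num_threads=1):
    """Sketch of the suggested signature; body omitted."""


# The default now tracks io.DEFAULT_BUFFER_SIZE instead of a hard-coded 65536
print(inspect.signature(hash_file).parameters["buffer_size"].default)
```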
```python
LOG = logging.getLogger(__name__)


def calculate_hash(data, algorithm):
```
Can you please add a docstring with a description of this function and its parameters.
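A docstring along these lines would do (a sketch; the exact wording and body are my assumption, not the PR's code):

```python
import hashlib


def calculate_hash(data, algorithm):
    """
    Compute the hash of a single block of data.

    :param data: block of bytes to hash
    :param algorithm: hash algorithm name accepted by hashlib, e.g. "md5"
    :return: digest of the block as bytes
    """
    hash_obj = hashlib.new(algorithm)
    hash_obj.update(data)
    return hash_obj.digest()
```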
```python
                break
            hash_obj.update(data)
            size -= len(data)
    if num_threads > 1:
```
Correct me if I am wrong, but you should be able to use `ThreadPoolExecutor` even when `num_threads=1`. Therefore, the `if` is not necessary and you should be able to remove the code repetition.
```python
    :return: Hash of the file; if something goes wrong, return None.
    """
    chunksize = io.DEFAULT_BUFFER_SIZE
```
Can you please split the work into two commits, one for the buffer size change and a second for the concurrent computing, and write the description of those changes into the commit messages? You can use what you wrote in the PR description. With such changes it will be much easier to search the commit history and understand what the changes do.
@geyaning also, can you please fix the style, so the static checks will be happy? You can run the checks locally by
optimize crypto.py

buffer_size: increasing the buffer size reduces the number of I/O operations, improving read speed.

calculate_hash: a helper function that computes the hash of a given block of data. It takes a block of data and a hash algorithm as input and returns the hash value. In parallel processing, this function is called from different threads to compute hashes concurrently.
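The buffer-size claim can be sanity-checked with a small script (`hash_with_buffer` is a hypothetical helper, not part of the PR; absolute timings vary by system, and the digest must not depend on the buffer size):

```python
import hashlib
import io
import os
import tempfile
import time


def hash_with_buffer(path, buffer_size):
    """md5 of a file read in buffer_size chunks (binary mode, so bytes)."""
    hash_obj = hashlib.md5()
    with open(path, "rb") as stream:
        while True:
            data = stream.read(buffer_size)
            if not data:
                break
            hash_obj.update(data)
    return hash_obj.hexdigest()


fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as stream:
    stream.write(os.urandom(1 << 20))  # 1 MiB of random data

for size in (4096, io.DEFAULT_BUFFER_SIZE, 65536):
    start = time.perf_counter()
    digest = hash_with_buffer(path, size)
    print(size, f"{time.perf_counter() - start:.4f}s", digest)
os.remove(path)
```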