Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize crypto #5715

Closed
wants to merge 1 commit into from
Closed

optimize crypto #5715

wants to merge 1 commit into from

Conversation

geyaning
Copy link
Contributor

@geyaning geyaning commented Jul 10, 2023

optimize crypto.py
buffer_size: Increasing the buffer size can reduce the number of I/O operations, thus improving the read speed

calculate_hash: This is a helper function that computes the hash value of a given block of data. It takes a block of data and a hash algorithm as input and returns a hash value. In parallel processing, this function is called in different threads to compute the hash value concurrently.

@richtja
Copy link
Contributor

richtja commented Jul 10, 2023

Hi @geyaning, thank you for your contribution in Avocado. Before we will do review, can you please squash your commits into the one and write down a commit message with description of your changes and your signature. You can follow our contribution guidelines. You can do the changes with force-push and If you need any help, don't hesitate to ask.

@geyaning geyaning force-pushed the master branch 3 times, most recently from 62502fd to 101ceee Compare July 12, 2023 06:16
Copy link
Contributor

@richtja richtja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @geyaning for this. It is definitely an improvement. I wrote down some comments for you. But unfortunately I wasn't able to try your changes, I got TypeError: Strings must be encoded before hashing error. Can you please check that? Thank you.

I created a small reproducer:

#!/usr/bin/env python3

from avocado.utils.crypto import hash_file

def main():
    print(hash_file("./examples/tests/passtest.py"))
    print(hash_file("./examples/tests/passtest.py", num_threads=2))

if __name__ == "__main__":
    main()

output:

$ python /tmp/test.py
c186ebee0d058eda87590de512ed1128
Traceback (most recent call last):
  File "/tmp/test.py", line 10, in <module>
    main()
  File "/tmp/test.py", line 7, in main
    print(hash_file("./examples/tests/passtest.py", num_threads=2))
  File "/home/janrichter/Avocado/avocado/avocado/utils/crypto.py", line 72, in hash_file
    hash_obj.update(future.result())
TypeError: Strings must be encoded before hashing


def hash_file(filename, size=None, algorithm="md5"):
def hash_file(filename, size=None, algorithm="md5", buffer_size=65536, num_threads=1):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please make default buffer_size to io.DEFAULT_BUFFER_SIZE, so the default behavior won't be changed.


LOG = logging.getLogger(__name__)

def calculate_hash(data, algorithm):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a docstring with description of this method and its parameters.

break
hash_obj.update(data)
size -= len(data)
if num_threads > 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I am wrong, but you should be able to use ThreadPoolExecutor even then the num_threads=1. Therefore, the if is not necessary and you should be able to remove the code repetition.

:return: Hash of the file, if something goes wrong, return None.
"""
chunksize = io.DEFAULT_BUFFER_SIZE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please split the work into two commits, one for buffer size change and the second for concurrent computing, and write down the description of those changes into commit messages. You can use what you wrote in the PR description. With such changes will be much easier to search in commit history and understand what the changes do.

@richtja
Copy link
Contributor

richtja commented Jul 12, 2023

@geyaning also, can you please fix the style, so the static checks would be happy. You can run the checks locally by python setup.py test --select=static-checks Thank you.

@richtja richtja added this to the #104 - Codename TBD milestone Jul 12, 2023
@geyaning geyaning closed this Jul 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants