-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ORM: Add get_size_on_disk
method to RemoteData
#6584
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #6584 +/- ##
==========================================
+ Coverage 77.92% 77.93% +0.02%
==========================================
Files 563 563
Lines 41668 41755 +87
==========================================
+ Hits 32464 32537 +73
- Misses 9204 9218 +14 ☔ View full report in Codecov by Sentry. |
get_total_size_on_disk
method to RemoteData
.get_total_size_on_disk
method to RemoteData
This is maybe machine-dependent, but rather than going via our API (that is more robust, but definitely going to be slower, I think) have a first "fast" option just running |
Note to self to run |
e801a78
to
e0be575
Compare
get_total_size_on_disk
method to RemoteData
get_size_on_disk
method to RemoteData
e0be575
to
b1a0560
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @GeigerJ2 please go through more pains that I imposed 😈
|
||
return format_directory_size(size_in_bytes=total_size) | ||
|
||
def _get_size_on_disk_du(self, full_path: Path, transport: 'Transport') -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this whole function, doesn't really need to be separated I feel.. It's basically only executes a command, and you don't re-use it, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found the design to have just one top-level method get_size_on_disk
nice, which tries to call the two private methods, in terms of separation of concerns.
Though, it is true that this pollutes the API of RemoteData
somewhat... So I'm also fine of either moving the private methods to some utils
module, or merging them. Maybe @unkcpz can comment on good coding practices here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which tries to call the two private methods, in terms of separation of concerns.
I'm not sure if I understand, sorry 😬
Since it's not really re-usable by other method, I'd vote for merging it, and avoid over-fictionalizing
tests/orm/nodes/data/test_remote.py
Outdated
@pytest.fixture | ||
def remote_data_ssh(tmp_path, aiida_computer_ssh): | ||
"""Return a non-empty ``RemoteData`` instance.""" | ||
# Compared to `aiida_localhost`, `aiida_computer_ssh` doesn't return an actual `Computer`, but just a factory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙉 aiida_computer_ssh
should be compared with aiida_computer_local
, which in that sense they are similar.. the issue is we don't have "aiida_ssh
" that would return the actual computer instance, lol 🙉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, exactly. We could add a fixture that actually calls the factory and returns the Computer
instance, though I'd rather call it aiida_localhost_ssh
, and don't see the need for it right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yews, so maybe please fix this comment, which I found confusing:
# Compared to `aiida_localhost`, `aiida_computer_ssh` doesn't return an actual `Computer`, but just a factory | |
# `aiida_computer_ssh` doesn't return an actual `Computer`, but just a factory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now use the aiida_computer_local
fixture instead of aiida_localhost
, and actually create the Computer
from the factory myself (basically the same single line, which is also the only content of the aiida_localhost
fixture), such that the behavior between the two remote_data_[local,ssh]
fixtures is exactly the same, and no comment in the code is needed anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why my replies via the files
GH tab appear as a review, but whatever... ^^
Thanks for the review @khsrali. I implemented most of your proposed changes!
As to our in-person discussion if du
or lstat
is preferred, I think none of the two is ideal... lstat
giving the actual byte-sized content, which is neat and all, but won't correspond to the disk space that will actually be occupied locally by a file, due to the use of blocks for the file system. And du
giving the actual occupied disk space on the remote, which, however, might be different from the local file system (due to having a different file system on the local machine, different formatting, different block size, etc.). Hence, the big difference in the file size check in the test. For real-world use cases, with more and larger files, the difference is likely much smaller, and won't matter too much, I think. I'll add a test for a larger file, as well as modify the message that the given size is just an estimate, so that user are aware they should take the value with a grain of salt.
Maybe also @agoscinski with his actual computer science background can weigh in 🫶
|
||
return format_directory_size(size_in_bytes=total_size) | ||
|
||
def _get_size_on_disk_du(self, full_path: Path, transport: 'Transport') -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found the design to have just one top-level method get_size_on_disk
nice, which tries to call the two private methods, in terms of separation of concerns.
Though, it is true that this pollutes the API of RemoteData
somewhat... So I'm also fine of either moving the private methods to some utils
module, or merging them. Maybe @unkcpz can comment on good coding practices here?
f2f90f9
to
4fcb1ba
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @GeigerJ2! Just a few minor comments..
tests/orm/nodes/data/test_remote.py
Outdated
@pytest.fixture | ||
def remote_data_ssh(tmp_path, aiida_computer_ssh): | ||
"""Return a non-empty ``RemoteData`` instance.""" | ||
# Compared to `aiida_localhost`, `aiida_computer_ssh` doesn't return an actual `Computer`, but just a factory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yews, so maybe please fix this comment, which I found confusing:
# Compared to `aiida_localhost`, `aiida_computer_ssh` doesn't return an actual `Computer`, but just a factory | |
# `aiida_computer_ssh` doesn't return an actual `Computer`, but just a factory |
return format_directory_size(size_in_bytes=total_size) | ||
|
||
def _get_size_on_disk_du(self, full_path: Path, transport: 'Transport') -> int: | ||
"""Connects to the remote folder and returns the total size of all files in the directory recursively in bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Connects to the remote folder and returns the total size of all files in the directory recursively in bytes | |
"""Connects to the remote folder and returns the total size of all files in the directory in bytes |
:param transport: Open transport instance | ||
:type transport: Transport | ||
:raises RuntimeError: When `du` command cannot be successfully executed | ||
:return: Total size of directory recursively in bytes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I found the term recursively confusing.. (thought the function does yield
)
:return: Total size of directory recursively in bytes. | |
:return: Total size of directory in bytes (including all it's contents) |
:param transport: Open transport instance. | ||
:type transport: Transport | ||
:raises RuntimeError: When `du` command cannot be successfully executed. | ||
:return: Total size of directory recursively in bytes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same:
:return: Total size of directory recursively in bytes. | |
:return: Total size of directory in bytes (including all it's contents) |
|
||
return format_directory_size(size_in_bytes=total_size) | ||
|
||
def _get_size_on_disk_du(self, full_path: Path, transport: 'Transport') -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which tries to call the two private methods, in terms of separation of concerns.
I'm not sure if I understand, sorry 😬
Since it's not really re-usable by other method, I'd vote for merging it, and avoid over-fictionalizing
f7dfe7e
to
ecfa53b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes seems like we should split the function into two or provide optional arguments. These two concepts seem to be distinguished by the terms "disk usage" and "apparent size" (see https://stackoverflow.com/a/569485). I think the output of du --apparent-size
is the same as with lstat
, so you do not need to use lstat
""" | ||
try: | ||
total_size = 0 | ||
contents = self.listdir_withattributes(full_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this work recursively? what if there are nested subfolders does this still work correctly?
""" | ||
|
||
try: | ||
retval, stdout, stderr = transport.exec_command_wait(f'du --bytes {full_path}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not work on macos. It seems like macos has no easy way do get the size in bytes (one can get the number of blocks but I don't know how to get the block size). One can get the size in kilobytes du -k
. Anyway we do not seem to run CI on macos at the moment so I don't think it is important for this PR
Thanks for the review, @agoscinski! I'm currently still working on this, will ping you once it's again ready for review.
The reason I'm providing |
In addition, extend the tests for `RemoteData` in `test_remote.py` for the methods added in this PR, as well as parametrize them to run on a `RemoteData` via local and ssh transport.
In addition, extend the tests for `RemoteData` in `test_remote.py` for the methods added in this PR, as well as parametrize them to run on a `RemoteData` via local and ssh transport.
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
… nested directory
for more information, see https://pre-commit.ci
58b3248
to
8bcf0f8
Compare
OK, this should be ready for a final review, @agoscinski and @khsrali. Also pinging, @mikibonacci, if you want to provide some feedback on the CLI/API for use in AiiDAlab? |
get_size_on_disk
method to RemoteData
get_size_on_disk
method to RemoteData
Required for PR #6578.
By default, the
get_size_on_disk
method calls the methodget_size_on_disk_du
that usesdu
to obtain the total directory size in bytes. The output will eventually be formatted in a human-readable way. If the call todu
fails for whatever reason, recursivelstat
is being used, though, that is discouraged due to (copied from_get_size_on_disk_lstat
docstring):"Note that even if a file is only 1 byte, on disk, it still occupies one full disk block size. As such, getting accurate measures of the total expected size on disk when retrieving a
RemoteData
is not straightforward withlstat
, as one would need to consider the occupied block sizes for each file, as well as repository metadata. Thus, this function only serves as a fallback in the absence of thedu
command."I further extended the existing tests for
RemoteData
to use both,LocalTransport
, as well asSshTransport
, and test for the functionality added in this PR.Pinging also @npaulish, as she's currently working on retrieving
RemoteData
objects.