-
Notifications
You must be signed in to change notification settings - Fork 26
Fix error when /tmp
is a symlink
#221
base: main
Are you sure you want to change the base?
Conversation
On macOS, `/tmp` is a symlink to `/private/tmp`. When creating the tarball, the symlink will be resolved, making the file paths on server are different from those on client. This results in a `FileNotFoundError`. This patch fixes the issue by resolving the local path to the real one if it is a symlink.
greenplumpython/experimental/file.py
Outdated
@@ -121,7 +121,7 @@ def _install_on_server(pkg_dir: str, requirements: str) -> str: | |||
def _install_packages(db: gp.Database, requirements: str): | |||
tmp_archive_name = f"tar_{uuid.uuid4().hex}" | |||
# FIXME: Windows client is not supported yet. | |||
local_dir = pathlib.Path("/") / "tmp" / tmp_archive_name / "pip" | |||
local_dir = (pathlib.Path("/") / "tmp").resolve() / tmp_archive_name / "pip" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of hard code /tmp
path, libs like tempfile
is preferred to be used. https://docs.python.org/3/library/tempfile.html
/tmp
is a convention, and users can actually set the temp directory to something else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to using tempfile
per your suggestion.
While I am OK with the switch, I don't quite understand why tempfile
is preferred:
- If, as you said, users can actually set the temp directory to something else, then
/tmp
is not a hard-coded path, but can be set to point to anywhere by user. - Based on my testing ,
tempfile
also creates files and directories under/tmp
, the same as the previous scheme. tempfile
will delete the files upon exiting the context. Would this make debugging harder?
Could you please share your thoughts on these issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If, as you said, users can actually set the temp directory to something else, then /tmp is not a hard-coded path, but can be set to point to anywhere by user.
Yes, see https://www.baeldung.com/linux/change-tmp-directory-path
Based on my testing , tempfile also creates files and directories under /tmp, the same as the previous scheme.
Not quite sure about the question.
tempfile will delete the files upon exiting the context. Would this make debugging harder?
There is argument for this
class tempfile.TemporaryDirectory(suffix=None, prefix=None, dir=None, ignore_cleanup_errors=False, *, delete=True)
The delete parameter can be used to disable cleanup of the directory tree upon exiting the context. While it may seem unusual for a context manager to disable the action taken when exiting the context, it can be useful during debugging or when you need your cleanup behavior to be conditional based on other logic.
[¶](https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryDirectory)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To confirm, are you OK with using directory under /tmp
for storing the files here?
Why tempfile
is preferred over the previous scheme?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To confirm, are you OK with using directory under /tmp for storing the files here?
I prefer to have a subdir to store temporary files, for example, store tmp files in /tmp/sOmEUuid/
actually, won't this be the same with previouse code? previously, tmp_archive_name
is a uuid and it is the subdir name in /tmp
. tempfile.TemporaryDirectory()
does the same thing.
Why tempfile is preferred over the previous scheme?
- it handles all platforms temp directory
- it handles customized
TMPDIR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. The previous code stores temp files in file named /tmp/tar_<uuid>
. I think it's similar to what tempfile does without setting something like TMPDIR
.
BTW, I think all files handled in this module are temp files and can be deleted at the end of the function:
- For
_from_files()
, the data will eventually be parsed from files and written to tables in databases; - For
_install_packages()
, the packages will eventually be installed to a Python environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the doc https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryDirectory:
Changed in version 3.12: Added the delete parameter.
We will need to use GD
to "pin" the temp directory to make it available to multiple UDFs.
greenplumpython/experimental/file.py
Outdated
) | ||
assert len(list(extracted)) == 1 | ||
server_dir = ( | ||
pathlib.Path("/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this need to be changed as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed per your suggestion.
However, while CI is good so far, I have some concerns:
- The code gets much more complicated. Due to lacking of the
delete
parameter intempfile.TemporaryDirectory()
, we need to pin the directory object to ensure that it will not be deconstructed and then pass the necessary info around to quite a few UDFs. - Debugging is harder. Before the change, randomness only exists on client side. That means, after the SQL statement is built, running it multiple times anywhere will give the same result. This is especially good for debugging. But with
tempfile.TemporaryDirectory()
, the directory path will change for every run and it is hard to tell which directory is created in which run. - Because the returned DataFrame is lazily evaluated, we cannot remove the temp directory at the end of
from_files()
, this can cause flaky issue if the directory is removed by Python's GC before the DataFrame is evaluated. - Depending on a env variable means that user need to restart the database server for the changes to take effect. This is not easy if the server continues ingesting data. Would it be better to customize with a GUC? This does not requires creating an extension on server. For example, on Greenplum we can set an undefined GUC and read it like this:
[gpadmin@localhost GreenplumPython]$ gpconfig -c abc.defg -v hello --skipvalidation
20231023:21:20:17:3893841 gpconfig:localhost:gpadmin-[INFO]:-completed successfully with parameters '-c abc.defg -v hello --skipvalidation'
[gpadmin@localhost GreenplumPython]$ gpstop -afr
[gpadmin@localhost GreenplumPython]$ psql
psql (12.12)
Type "help" for help.
gpadmin=# show abc.defg;
abc.defg
----------
hello
(1 row)
Is this ready for the next round of review? I saw CI failed. |
Quote from #221 (comment)
Not yet, still fixing. The switch on server is not easy. |
On macOS,
/tmp
is a symlink to/private/tmp
. When creating the tarball, the symlink will be resolved, making the file paths on server are different from those on client. This results in aFileNotFoundError
.This patch fixes the issue by resolving the local path to the real one if it is a symlink.