Skip to content
This repository has been archived by the owner on Jul 16, 2024. It is now read-only.

Fix error when /tmp is a symlink #221

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open

Fix error when /tmp is a symlink #221

wants to merge 9 commits into from

Conversation

xuebinsu
Copy link
Contributor

On macOS, /tmp is a symlink to /private/tmp. When creating the tarball, the symlink will be resolved, making the file paths on server are different from those on client. This results in a FileNotFoundError.

This patch fixes the issue by resolving the local path to the real one if it is a symlink.

On macOS, `/tmp` is a symlink to `/private/tmp`. When creating the
tarball, the symlink will be resolved, making the file paths on server
are different from those on client. This results in a
`FileNotFoundError`.

This patch fixes the issue by resolving the local path to the real one
if it is a symlink.
@@ -121,7 +121,7 @@ def _install_on_server(pkg_dir: str, requirements: str) -> str:
def _install_packages(db: gp.Database, requirements: str):
tmp_archive_name = f"tar_{uuid.uuid4().hex}"
# FIXME: Windows client is not supported yet.
local_dir = pathlib.Path("/") / "tmp" / tmp_archive_name / "pip"
local_dir = (pathlib.Path("/") / "tmp").resolve() / tmp_archive_name / "pip"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of hard code /tmp path, libs like tempfile is preferred to be used. https://docs.python.org/3/library/tempfile.html

/tmp is a convention, and users can actually set the temp directory to something else.

Copy link
Contributor Author

@xuebinsu xuebinsu Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to using tempfile per your suggestion.

While I am OK with the switch, I don't quite understand why tempfile is preferred:

  1. If, as you said, users can actually set the temp directory to something else, then /tmp is not a hard-coded path, but can be set to point to anywhere by user.
  2. Based on my testing , tempfile also creates files and directories under /tmp, the same as the previous scheme.
  3. tempfile will delete the files upon exiting the context. Would this make debugging harder?

Could you please share your thoughts on these issues?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If, as you said, users can actually set the temp directory to something else, then /tmp is not a hard-coded path, but can be set to point to anywhere by user.

Yes, see https://www.baeldung.com/linux/change-tmp-directory-path

Based on my testing , tempfile also creates files and directories under /tmp, the same as the previous scheme.

Not quite sure about the question.

tempfile will delete the files upon exiting the context. Would this make debugging harder?

There is argument for this

class tempfile.TemporaryDirectory(suffix=None, prefix=None, dir=None, ignore_cleanup_errors=False, *, delete=True)
The delete parameter can be used to disable cleanup of the directory tree upon exiting the context. While it may seem unusual for a context manager to disable the action taken when exiting the context, it can be useful during debugging or when you need your cleanup behavior to be conditional based on other logic.


[¶](https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryDirectory)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To confirm, are you OK with using directory under /tmp for storing the files here?

Why tempfile is preferred over the previous scheme?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To confirm, are you OK with using directory under /tmp for storing the files here?

I prefer to have a subdir to store temporary files, for example, store tmp files in /tmp/sOmEUuid/ actually, won't this be the same with previouse code? previously, tmp_archive_name is a uuid and it is the subdir name in /tmp. tempfile.TemporaryDirectory() does the same thing.

Why tempfile is preferred over the previous scheme?

  • it handles all platforms temp directory
  • it handles customized TMPDIR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. The previous code stores temp files in file named /tmp/tar_<uuid>. I think it's similar to what tempfile does without setting something like TMPDIR.

BTW, I think all files handled in this module are temp files and can be deleted at the end of the function:

  • For _from_files(), the data will eventually be parsed from files and written to tables in databases;
  • For _install_packages(), the packages will eventually be installed to a Python environment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the doc https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryDirectory:

Changed in version 3.12: Added the delete parameter.

We will need to use GD to "pin" the temp directory to make it available to multiple UDFs.

)
assert len(list(extracted)) == 1
server_dir = (
pathlib.Path("/")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this need to be changed as well?

Copy link
Contributor Author

@xuebinsu xuebinsu Oct 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed per your suggestion.

However, while CI is good so far, I have some concerns:

  • The code gets much more complicated. Due to lacking of the delete parameter in tempfile.TemporaryDirectory(), we need to pin the directory object to ensure that it will not be deconstructed and then pass the necessary info around to quite a few UDFs.
  • Debugging is harder. Before the change, randomness only exists on client side. That means, after the SQL statement is built, running it multiple times anywhere will give the same result. This is especially good for debugging. But with tempfile.TemporaryDirectory(), the directory path will change for every run and it is hard to tell which directory is created in which run.
  • Because the returned DataFrame is lazily evaluated, we cannot remove the temp directory at the end of from_files(), this can cause flaky issue if the directory is removed by Python's GC before the DataFrame is evaluated.
  • Depending on a env variable means that user need to restart the database server for the changes to take effect. This is not easy if the server continues ingesting data. Would it be better to customize with a GUC? This does not requires creating an extension on server. For example, on Greenplum we can set an undefined GUC and read it like this:
[gpadmin@localhost GreenplumPython]$ gpconfig -c abc.defg -v hello --skipvalidation
20231023:21:20:17:3893841 gpconfig:localhost:gpadmin-[INFO]:-completed successfully with parameters '-c abc.defg -v hello --skipvalidation'
[gpadmin@localhost GreenplumPython]$ gpstop -afr
[gpadmin@localhost GreenplumPython]$ psql
psql (12.12)
Type "help" for help.

gpadmin=# show abc.defg;
 abc.defg 
----------
 hello
(1 row)

@beeender
Copy link
Collaborator

Is this ready for the next round of review? I saw CI failed.

@xuebinsu
Copy link
Contributor Author

Quote from #221 (comment)

Is this ready for the next round of review? I saw CI failed.

Not yet, still fixing. The switch on server is not easy.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants