
[BUG] After upgrade to 3005.3, failure to fetch latest pillar data after being updated in git pillar repo branch #65467

Closed
Ikramromdhani opened this issue Oct 26, 2023 · 75 comments
Labels: Bug (broken, incorrect, or confusing behavior), Pillar, Regression (the issue is a bug that breaks functionality known to work in previous releases)

Comments

@Ikramromdhani

Ikramromdhani commented Oct 26, 2023

Description of Issue

A minion/master is unable to get new pillar data from an alternative branch after it was updated on the remote git pillar repository. This was working with 3005.1, but broke after upgrading to 3005.3. It also seems to be the case for those using 3006.3. Please refer to the community discussion: https://saltstackcommunity.slack.com/archives/C7K04SEJC/p1697576330136199

Setup

Amazon Linux 2 salt master, configured with git repository for states and pillar data.

  • VM (Virtualbox, KVM, etc. please specify)
  • on-prem machine
  • VM running on a cloud service, please be explicit and add details
  • container (Kubernetes, Docker, containerd, etc. please specify)
  • jails if it is FreeBSD
  • classic packaging
  • onedir packaging
  • used bootstrap to install

Steps to Reproduce Issue

Master config

ext_pillar:
- git:
  - __env__ ssh://git@xxxxxxxxxxxxxxxxx/repo.git:
    - root: pillar
git_pillar_pubkey: xxxxxxxxxxxxxxxx
git_pillar_privkey: xxxxxxxxxxxxxxxxxxx
git_pillar_update_interval: 60
pillar_merge_lists: True

Push an update to an existing pillar called test in repo.git on branch testpillar. On a minion, run salt-call pillar.get test pillarenv=testpillar
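
For anyone scripting this check, a rough Python equivalent of the salt-call above (a sketch only; it must run on the minion and assumes the default minion config location):

import salt.client

caller = salt.client.Caller()
# pillar.get accepts a pillarenv argument, mirroring pillarenv=testpillar on the CLI
value = caller.cmd("pillar.get", "test", pillarenv="testpillar")
print(value)  # expected: the newly pushed value; observed here: the stale cached value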

Expected behavior

The new value of pillar test should be visible in the output of the salt call on the minion.

Actual behavior

The minion gets the old data for pillar test. Please note that when executing the same command after deleting /var/cache/salt/master/git_pillar, the minion gets the correct value.

Versions Report

Salt Version:
          Salt: 3005.3

Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.0
       libgit2: 1.5.0
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: 1.10.1
        Python: 3.9.18 (main, Sep 18 2023, 18:18:39)
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: amzn 2
        locale: utf-8
       machine: x86_64
       release: 4.14.318-241.531.amzn2.x86_64
        system: Linux
       version: Amazon Linux 2

@OrangeDog
Contributor

3005.2 changed how git pillars were cached to fix a security issue. Having to clear the cache after upgrading is probably expected.

OrangeDog added the Bug, Regression, Pillar, and needs-triage labels on Oct 30, 2023
@HL-SaltBae

Having to clear the cache after upgrading is probably expected.

This does NOT only affect the pillar cache across upgrades.

With a running 3006.3 master, the master will NOT refresh a git_pillar branch after the first time that branch is fetched.

Steps to test:

  1. Have git_pillar configured with __env__ like in the bug report
  2. Create a new branch in the git_pillar repo
  3. Have any minion pull the pillar data using pillar.items pillarenv=<branchname>
  4. Notice that the data aligns with what's in the branch
  5. Commit a new change to the branch in the git_pillar repo
  6. Do any or all of the following:
    • Wait for the master loop_interval to expire and observe that the master log file indicates that the git_pillar remote is up-to-date
    • run sudo salt-run git_pillar.update on the master and observe that the return for the git_pillar repo in question is "None"
    • restart the master
    • run saltutil.refresh_pillar on the minion in question
  7. Repeat pulling pillar data with any minion using pillar.items pillarenv=<branchname>
  8. Notice that the minion does NOT receive new pillar data from the new commit on the branch, and instead returns the same pillar data that was on the branch when the master first fetched the branch
  9. Repeat 5,6,7 as many times as you want
  10. Ultimately stop the master process, rm -f /var/cache/salt/master/git_pillar, restart the master process
  11. Go to step 3 and repeat forever

@Ikramromdhani
Author

Ikramromdhani commented Nov 3, 2023

Are there any chances we may get a fix for this soon? It affects both 3006 and 3005, and it was introduced by the CVE patches, so it makes sense to fix it in both versions. This kind of issue makes it harder to upgrade Salt away from versions with security issues while keeping everything working as desired.

@HL-SaltBae

@cmcmarrow do you have any insight into this? I think this regression happened as a result of your CVE patch 6120bca

@ITJamie
Contributor

ITJamie commented Jan 31, 2024

There's been no GitHub activity on @cmcmarrow's account since September. Safe to say he's no longer involved in the project ☹️

@Ikramromdhani
Author

I would like to add another detail: we tried the newly released version 3005.5, which contains additional CVE patches, and we noticed that the pillar refresh issue is partially fixed; it now happens only on branches containing /, for example feature/test/pillar. I hope this helps with fixing the issue.

@HL-SaltBae

I had wondered if / in the branch name might be a factor, but someone on the Slack said they tested that and it happened with or without the /. Good news if true!

@dmurphy18
Contributor

@Ikramromdhani Can you upgrade to the latest Salt 3006 and retry? Salt 3005 has passed its bug-fix phase (last August 2023) and is about to run out of CVE support on Feb 25th 2024, 24 days from now.

At this stage nothing will be getting done for Salt 3005, given that it is dead in 24 days; that is hardly enough time for a debug/test/fix/release cycle, especially since we would have to build classic packages, and there are a couple of days in just getting that done.

dmurphy18 self-assigned this on Feb 1, 2024
dmurphy18 added this to the Sulfur v3006.7 milestone on Feb 1, 2024
@dmurphy18
Contributor

dmurphy18 commented Feb 2, 2024

@Ikramromdhani Rereading this, I remember Chad talking about how caching was changed so that the cache is now only updated when it should be, rather than, as before, being needlessly updated as a side effect of other commands being run. I need to go back and find the actual commit, etc.; it was last year, around autumn.

And yup, Chad is gone, moved on to greener pastures, so I am reading up on GitFS, etc. to get fully familiar (having to cover more areas since there are fewer people on the core team now). I will update next week with the code lines that changed, but I expect you were relying on a side effect of an operation that has since been fixed (namely, incorrectly updating the cache). I will get the full details for you.

That said, doesn't mean there isn't an error too.

@donkopotamus

donkopotamus commented Feb 15, 2024

We are also observing this error after recently upgrading from salt-3004 to salt-3006.4:

We have observed that when we use __env__ to effectively map pillarenv to a branch name:

  • if we create a new branch feature/testing that is not already in the cache, it will be pulled correctly; and
  • if we then add a new commit on it, or amend HEAD, the changes will not be pulled.

However, if we create a new branch named testing then it will be pulled, and subsequent changes to it will also be pulled. That is, it seems that only branches with a / in them do not update. Unfortunately we use branch names like release/* or feature/*, as many people do, so it's a severe issue for us at the moment!

(This evidence conflicts with #65467 (comment), although the commenter had not actually confirmed it themselves)

Note:

  • We are using GitPython rather than pygit2
  • We do not see this issue with gitfs ... changes to our state files are pulled ... it is only git_pillar

@Ikramromdhani
Author

Ikramromdhani commented Feb 15, 2024

@donkopotamus, that is what I was about to confirm. I tested today with 3006.6 using pygit2 and still have issues pulling changes for pillars when relying on a pillarenv branch containing /.
@dmurphy18 any clues whether this can make it into 3006.7 or 3007?

@HL-SaltBae

I observed that the cache directory organizes cached branches by foldername=branchname, and that branches with / in the branch name do result in a subfolder being created.

For example, with three pillar branches:

  • foo
  • feature/bar
  • feature/boo

you might see a structure like this:

cache_dir/
    foo/top.sls
    feature/
        bar/top.sls
        boo/top.sls

If this turns out to be the actual issue, then the obvious solution would appear to be escaping the / in branch names or converting it to something like - (a rough sketch of such a mapping follows the examples below).

cache_dir/
    foo/top.sls
    feature\/bar/top.sls
    feature\/boo/top.sls

or

cache_dir/
    foo/top.sls
    feature-bar/top.sls
    feature-boo/top.sls

or

cache_dir/
    foo/top.sls
    feature-SLASH-bar/top.sls
    feature-SLASH-boo/top.sls
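
For illustration only, a mapping along those lines could look like the sketch below; encode_branch/decode_branch are hypothetical helpers, not anything Salt ships, and a reversible encoding avoids the collisions that a plain '/' -> '-' swap would allow (e.g. a/b vs a-b):

# Hypothetical helpers sketching the "escape the /" idea above; not Salt code.
def encode_branch(branch):
    # escape the escape character first, then the path separator
    return branch.replace("%", "%25").replace("/", "%2F")

def decode_branch(name):
    return name.replace("%2F", "/").replace("%25", "%")

print(encode_branch("feature/bar"))  # feature%2Fbar -> a single cache directory
print(encode_branch("feature-bar"))  # feature-bar   -> cannot collide with the above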

@dmurphy18
Contributor

dmurphy18 commented Feb 15, 2024

@Ikramromdhani It won't make it into Salt 3006.7; that is close to release, waiting on clean tests in the pipeline. Possibly for 3007, but it is on my backlog: I am dealing with #65816, which may be related (similar code changes for pillarenv went in at the same time for the gitfs locks), and am writing tests for that fix now.
So it is on the block, but there are limited hands since the buyout. I will get to it when I can.

@max-arnold
Contributor

Not sure if it helps, but it looks like the CVE fix in 3005.2, and then #65017 in 3005.3, which mitigated a negative effect of that fix, probably contributed to this issue.

@dmurphy18
Contributor

dmurphy18 commented Feb 15, 2024

Yup, the set of changes that Chad did, and which then had to be adjusted, applied both to his gitfs changes and to the pillarenv and cache changes, etc. That is why I expect some linkage to the gitfs changes.
Those changes are in the 3006.x branches too.

@Ikramromdhani
Author

Ikramromdhani commented Mar 6, 2024

Just a heads up that we tested 3006.7 and the problem is still there. Any chance of getting a fix for this into Salt 3007? This problem is holding us back from performing Salt upgrades.

@Ikramromdhani
Author

@dmurphy18 since the fix did not make it into 3007, I would just like to ask whether there is a specific version to target? Thank you.

@dmurphy18
Contributor

@Ikramromdhani I am working on a fix for GitFS, and Git Pillar is right after it. The GitFS fix is done, but I have to refactor a new test after PR review (#65937); after that I will work on Git Pillar, since it will possibly be affected by the GitFS changes.

dwoz modified the milestones: Sulfur v3006.8, Sulfur v3006.9 on May 1, 2024
@dmurphy18
Contributor

@donkopotamus Just making sure: are all of your changes on the branch committed and pushed? I am doing the same and seeing the changes, regardless of pillarenv.

@donkopotamus

@dmurphy18 I understand why such questions need to be asked but yes, changes are definitely committed and pushed! (I've also repeated it this morning)

This isn't an isolated test really ... I've tested each minor 3006.x release as we've rolled it out on our systems (after we first observed this when bumping from 3004 -> 3006.4). We also regularly observe it as a "gotcha" where a developer will push changes on a feature/... branch (or merge changes into a release/... branch) and then become confused about why their pillar changes are not coming into effect! When observed, we stop the master, destroy /var/cache/salt/master/git_pillar and start the master, after which changes then come into effect.

What else can we provide that might help close the gap between your test and our production systems? The config snippets I posted above are our only pillar-related config changes on the master, with all other pillar-related settings being the defaults.

Here is our salt-run --versions

┌╴XXX@XXX - fish - subnet: XXX
└╴/e/s/master.d> sudo salt-run --versions
Salt Version:
          Salt: 3006.9
 
Python Version:
        Python: 3.11.7 (main, Aug 23 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
 
Dependency Versions:
          cffi: 1.17.1
      cherrypy: Not Installed
  cryptography: 43.0.1
      dateutil: 2.9.0.post0
     docker-py: Not Installed
         gitdb: 4.0.11
     gitpython: 3.1.43
        Jinja2: 3.1.4
       libgit2: Not Installed
  looseversion: 1.3.0
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.1.0
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 24.1
     pycparser: 2.22
      pycrypto: Not Installed
  pycryptodome: 3.21.0
        pygit2: Not Installed
  python-gnupg: Not Installed
        PyYAML: 6.0.2
         PyZMQ: 26.2.0
        relenv: Not Installed
         smmap: 5.0.1
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.3.5
 
System Versions:
          dist: rhel 9.4 Plow
        locale: utf-8
       machine: x86_64
       release: 5.14.0-427.37.1.el9_4.x86_64
        system: Linux
       version: Red Hat Enterprise Linux 9.4 Plow

@donkopotamus

donkopotamus commented Oct 31, 2024

@dmurphy18 I believe I have located a symptom of the problem that might help. It's all about the location of the marker file fetch_request in the work directory.

Take my test case above, where I created both a branch some-branch and a branch some/branch. Then we check the state of the cache on the master:

# change into the relevant **work** directory on the master
master$ cd /var/cache/salt/master/git_pillar/work/<opaque-hash>

master$ ls
some/    some-branch/

# look at the contents of those ... `some-branch` has a `fetch_request` marker file in it
master$ ls some-branch/
fetch_request

# but the directory corresponding to `some/branch` does **not**
master$ ls some/branch
<empty>

# instead the marker is in the parent `some/` (which is not a valid branch in and of itself)
master$ ls some/
branch/    fetch_request

Now ... suppose I push a change that affects the pillar, as in my previous example

# the `git_pillar_update_interval` is 60 so this is very conservative
minion$ sleep 120

# check the value of `some-var` ... its still `100`, should be `200`
minion$ sudo salt-call pillar.items | grep -A 1 some-var
    some-var:
        100

# now ... put a `fetch_request` into `some/branch` **on the `master`**
master$ sudo touch /var/cache/salt/master/git_pillar/work/<opaque-hash>/some/branch/fetch_request

# back to the minion ... sleep again
minion$ sleep 120

# check the value of `some-var` ... its changed!
minion$ sudo salt-call pillar.items | grep -A 1 some-var
    some-var:
        200   <== MAGIC!

With a fetch_request in some/branch rather than some/, it updated the branch. However, having updated it, it removed the fetch_request file.

# our file has gone ...
master$ sudo ls /var/cache/salt/master/git_pillar/work/<opaque-hash>/some/branch/fetch_request
<no such file>

@donkopotamus

A further refinement of the above:

# we push a change, altering `some-var: 200 -> 300` 
minion$ sudo salt-call pillar.items | grep -A 1 some-var
    some-var:
        200  <== change not propagated

# took a break for an hour, confirmed its still not propagated
minion$ sudo salt-call pillar.items | grep -A 1 some-var
    some-var:
        200  <== change not propagated

# on the master, we **move** the `fetch_request` out of `some/` and into `some/branch`
master$ cd /var/cache/salt/master/git_pillar/work/<opaque-hash>/some
master$ sudo mv fetch_request branch/

# some time later, maybe 30 seconds or so, its reappeared in `some/` and still exists in `some/branch`
master$ ls . branch/
.:
branch/    fetch_request

branch/:
fetch_request

Meanwhile, on the minion we sleep for five minutes ...

# five minutes later, the state of the `work` cache on the master still looks the same
master$ ls . branch/
.:
branch/    fetch_request

branch/:
fetch_request

# on the minion, we now ask for the pillar, having ensured that `some/branch/fetch_request` exists
minion$ sudo salt-call pillar.items | grep -A 1 some-var
    some-var:
        300   <== Hooray!

# on the master, the `fetch_request` has been removed from `some/branch` as a result of our pillar query
master$ ls . branch/
.:
branch/    fetch_request

branch/:

As fetch_request has been removed from some/branch, we're back to the situation where we won't see updates on the branch until we create such a file there.

@donkopotamus

donkopotamus commented Nov 1, 2024

Summary

  • Something decides where to put fetch_request marker files in the work directory;
  • It is only putting them in the topmost directories, not the final directories corresponding to the actual branches;
    • Thus branches without a / do have a fetch_request, while branches with a / do not have such a file;
  • Directories that do not have a fetch_request are not re-fetched when we ask for their pillar;
  • If we manually create such a file via e.g. touch, then they will be fetched, and the marker file removed.

Addendum:

No idea if it's relevant ... but the analysis above is for the disk cache for the ext_pillar definition for __env__, as in the below config:

ext_pillar:
  - git:
    ...

    # map other environments to the corresponding branch
    - __env__  git@.../some-repo.git
      - root: salt/pillar

I currently have no idea if they affect a cache for specific named branches

@donkopotamus

I currently have no idea if they affect a cache for specific named branches

It does not ... if we replace the ext_pillar definition with one that does not use __env__ eg:
ext_pillar:

  - git:
    ...

    # 
    - another/branch  git@.../some-repo.git
      - root: salt/pillar

then the cache directory looks slightly different, and does not use the branch name at all. (This may explain why you cannot reproduce it ... are you using __env__ or a specific named branch?). Instead we get a different file layout:

# change into the relevant **work** directory on the master
master$ cd /var/cache/salt/master/git_pillar/work/<opaque-hash>

master$ ls
_/

The Problem?

So, looking at salt/utils/gitfs.py:

  • If we use __env__ then we will end up with gitfs having repo.get_salt_working_dir() being eg <cache-dir>/work/<opaque-hash>/some/branch
  • If we do not, then we will have gitfs having repo.get_salt_working_dir() being eg <cache-dir>/work/<opaque-hash>/_
  • This is due to this snippet in GitProvider.__init__
          self._cache_basename = "_"
          if self.id.startswith("__env__"):
              try:
                  self._cache_basename = self.get_checkout_target()
    

Now ... look at what's dropping these fetch_request files (GitBase.fetch_remotes)

        for repo in self.remotes:
            name = getattr(repo, "name", None)
            if not remotes or (repo.id, name) in remotes or name in remotes:
                try:
                    # Find and place fetch_request file for all the other branches for this repo
                    repo_work_hash = os.path.split(repo.get_salt_working_dir())[0] <= THIS IS HASH DIR
                    for branch in os.listdir(repo_work_hash):  <== PROBLEM HERE!!
                        # Don't place fetch request in current branch being updated
                        if branch == repo.get_cache_basename():  <== PROBABLY TRUE WHEN NOT __env__
                            continue
                        branch_salt_dir = salt.utils.path.join(repo_work_hash, branch)
                        fetch_path = salt.utils.path.join(
                            branch_salt_dir, "fetch_request"
                        )
                        if os.path.isdir(branch_salt_dir):
                            try:
                                with salt.utils.files.fopen(fetch_path, "w"):    <== CREATED `work/<opaque-id>/some/fetch_request`
                                    pass

Assuming that repo_work_hash is e.g. <cache-dir>/work/<opaque-hash>, then this only iterates over the topmost directories in the working directory, meaning we will visit some/ but not some/branch. (I've now injected logging there and confirmed that.)

Next looking at fetch_request_check

    def fetch_request_check(self):
        fetch_request = salt.utils.path.join(self._salt_working_dir, "fetch_request")  <=== LOOKING FOR `work/<opaque-id>/some/branch/fetch_request`
        if os.path.isfile(fetch_request):  <== THIS TEST FAILS SO WE DO NOT FETCH
            log.debug("Fetch request: %s", self._salt_working_dir)
            try:
                os.remove(fetch_request)
            except OSError as exc:
                log.error(
                    "Failed to remove Fetch request: %s %s",
                    self._salt_working_dir,
                    exc,
                    exc_info=True,
                )
            self.fetch() <== WE DON"T DO THIS
            return True
        return False

@OrangeDog
Contributor

Related, but I wonder what happens if you name a branch ../../../../../../../etc? New CVE perhaps?

@dmurphy18
Contributor

@OrangeDog I'll take you out behind the bike shed 🤣 , joking.

Some interesting takes on that parameter input; I'm beginning to think we might need some sanity checking on valid input, etc. It brings to mind the horrors of Perl (Halloween/Day of the Dead/All Saints Day, after all) and sanity checking from a couple of decades ago.
Will have to consider it, and bring that up in standup.

Good Point.

@OrangeDog
Contributor

I'm pretty sure it would only be an issue if you never ran a minion on the master, as if you can control the repo you can already add states to do whatever you want.

@donkopotamus

Related, but I wonder what happens if you name a branch ../../../../../../../etc

@OrangeDog Thankfully a git ref cannot contain ..:

Git imposes the following rules on how references are named:
...
3. They cannot have two consecutive dots .. anywhere.
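
As a purely illustrative aside, a simple validity test along those lines could look like the sketch below; it covers only a subset of the rules git enforces via git check-ref-format and is independent of anything Salt itself does:

def looks_like_valid_branch(name):
    # partial subset of git's ref-name rules, for illustration only
    forbidden = (" ", "~", "^", ":", "?", "*", "[", "\\")
    return (
        ".." not in name                 # rule 3: no two consecutive dots
        and not name.startswith("/")
        and not name.endswith("/")
        and not name.endswith(".lock")
        and not any(c in name for c in forbidden)
    )

print(looks_like_valid_branch("feature/test/pillar-refresh"))  # True
print(looks_like_valid_branch("../../../../../../../etc"))     # False -- contains ".."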

@Ikramromdhani
Author

@dmurphy18, I tested the new version of the salt master, 3006.9, with pygit2 version 1.13.1, and the issue still persists. I pushed a change and the pillar value was not updated; I had to clear the git_pillar cache folder to get the new value:

[master] ~]# salt --versions-report
Salt Version:
          Salt: 3006.9
 
Python Version:
        Python: 3.10.14 (main, Jun 26 2024, 11:44:37) [GCC 11.2.0]
 
Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
  cryptography: 42.0.5
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.4
       libgit2: 1.7.1
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: 1.13.1
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.17.0
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4
 
System Versions:
          dist: amzn 2 
        locale: utf-8
       machine: x86_64
       release: 5.10.227-219.884.amzn2.x86_64
        system: Linux
       version: Amazon Linux 2 

[master] ~]# salt '*'  pillar.get message pillarenv=feature/test/pillar-refresh
minion1:
    ----------
    key1:
        Hello World 2!
[master] ~]# salt '*'  pillar.get message pillarenv=feature/test/pillar-refresh
minion1:
    ----------
    key1:
        Hello World 2!
[master] ~]# rm -rf /var/cache/salt/master/git_pillar/
[master] ~]# salt '*'  pillar.get message pillarenv=feature/test/pillar-refresh
minion1:
    ----------
    key1:
        Hello World 3!

@Ikramromdhani
Author

Ikramromdhani commented Dec 4, 2024

@dmurphy18 any updates on this issue? We would like to move to 3006, and we were wondering whether a fix could be included in 3006.10 or whether it may be something we need to live with.

@aPollO2k

I'm affected by this issue. The only workaround I found is to run the following on the salt master:

rm -rf /var/cache/salt/master/git_pillar/

I would be happy to assist with testing or providing feedback if needed. However, I am not a developer, so I won't be able to contribute directly to fixing the code.

Let me know if there's anything I can do to help further.

@dmurphy18
Contributor

@Ikramromdhani I have been unable to reproduce this with GitPython and pygit2 < 1.15.0 (1.15.0 and above has its own problems, see #67017).

I am going to be gone on Christmas break and then PTO, so I won't be working on this till the new year, 2025.
Note: I have been unable to work on it due to CI/CD changes with the loss of AWS (Broadcom infrastructure changes), so I have been busy on bootstrap.

So apart from the later versions of pygit2, there has to be something else in your environment which is affecting functionality, since I am unable to reproduce the issue. Is there any other detail that could be added? I will also be rereading this to see if there is something I missed.

@donkopotamus

donkopotamus commented Dec 17, 2024

@dmurphy18 @Ikramromdhani Can you review again this sequence of comments on this issue, which I believe identifies the exact cause of the issue:

The latter in particular identifies the problem as the piece of code that is responsible for dropping fetch_request marker sentinels on the filesystem (see my comments embedded in it such as <== PROBLEM HERE!!).

I have been unable to reproduce this with GitPython

@dmurphy18 Also relevant in there is the following:

  • this problem only occurs if you are using an __env__ definition for git_pillar. i.e. this breaks:

    ext_pillar:
      - git:
        ...
    
        # map other environments to the corresponding branch
        - __env__  git@.../some-repo.git
          - root: salt/pillar
    

    but this does not

    - git:
     ...
    
     # 
     - another/branch  git@.../some-repo.git
       - root: salt/pillar
    

Can you confirm that you are using an __env__ definition when attempting to reproduce it?

A WORKING FIX

If I alter the mentioned code to instead walk the tree at repo_work_hash, rather than simply using listdir to drop sentinels at the top level, then a fetch_request is placed at all directory nodes, and branches such as some/branch are then definitely updated.
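
A minimal, standalone sketch of that idea (using plain os/open rather than Salt's wrappers; repo_work_hash and current_basename stand in for the values derived from repo.get_salt_working_dir() and repo.get_cache_basename() in the snippet quoted earlier):

import os

def drop_fetch_requests(repo_work_hash, current_basename):
    # Sketch: place a fetch_request marker in every leaf directory under
    # repo_work_hash; each leaf corresponds to a checked-out branch, including
    # nested ones such as some/branch.
    for path, subdirs, _files in os.walk(repo_work_hash):
        if subdirs:
            continue  # only leaf directories map to branches
        if os.path.relpath(path, repo_work_hash) == current_basename:
            continue  # skip the branch currently being updated
        with open(os.path.join(path, "fetch_request"), "w"):
            pass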

@Ikramromdhani I, for example, no longer blow away the entire cache directory ... instead I simply

sudo touch /var/cache/salt/master/git_pillar/work/<opaque-hash>/some/branch/fetch_request

on the branch I need fixed and 60 seconds later that branch will have been updated. (Clearly I don't like having to do this manually all the time though!)

@Ikramromdhani
Author

Ikramromdhani commented Jan 2, 2025

@dmurphy18 can you confirm you still cannot reproduce this when using a configuration like the one below:

ext_pillar:
- git:
  - __env__ <pillar_repository>:
    - root: saltpillar
git_pillar_pubkey: <key_pub>
git_pillar_privkey: <key_priv>
git_pillar_update_interval: 60
pillar_merge_lists: True

@dmurphy18
Contributor

@Ikramromdhani Back and looking at this again, it was a long PTO.
@donkopotamus Thanks for that info, examining it now

@dmurphy18
Contributor

@Ikramromdhani I did do the __env__ test with GitPython and will try with pygit2; see:

david@david-XPS-15-9570:~/dev/fix_65467/notes$ cat master_conf.txt 

ext_pillar:
  - git:
    - __env__ https://github.com/dmurphy18/salt-test-pillar-gitfs.git
    - branch https://github.com/dmurphy18/salt-test-pillar-gitfs.git


git_pillar_provider: gitpython

Once I get my test environment set back up.

@dmurphy18
Contributor

dmurphy18 commented Jan 31, 2025

@donkopotamus Thanks for those comments about __env__ and <cache-dir>/work/<opaque-hash>/some/branch with os.listdir, and about how to tell apart a real directory structure and a name with a directory separator in it; I need to go away and figure out a way to do that.

But I am wondering why I am not seeing this with pygit2 < 1.15 or with GitPython, given the code examples are in the base classes.

I am getting a test environment set up to check that, and getting back up to speed after a long vacation.

@Ikramromdhani
Author

Ikramromdhani commented Feb 3, 2025

@dmurphy18, we are still seeing the issue with pygit2==1.14.0 and salt master 3006.9.

@dmurphy18
Contributor

@Ikramromdhani @donkopotamus I have a solution which worked in hand testing: converting '/' in branch names used with __env__ to a '-' (dash). Now to write automated tests to confirm the fix, but I thought I would let you know.

@OrangeDog
Contributor

That doesn't sound like a great solution, as branch names can also contain -.
Is there something wrong with @donkopotamus's fix?

@donkopotamus

@dmurphy18 As @OrangeDog notes, substituting '/' -> '-' precludes the possibility of having branches named both a/b and a-b.

Git ref names are specifically designed to generally map well, and be safe, on filesystems. The current method of mapping them onto a directory structure does not seem wrong in any sense ... it's simply the traversing of that structure that is wrong.

i.e. I think:

  • the branch name a/b/c should continue to be mapped to the directory a/b/c (that's exactly how git itself would represent the branch in a checked-out repository, using .git/refs/heads/a/b/c)
  • instead of using os.listdir to traverse that structure, it should simply be walked with os.walk

@dmurphy18
Contributor

@donkopotamus @OrangeDog Did I mention I hate gitfs? 🤣 I was duplicating this:

lc_path_adj = load["saltenv"].replace(os.path.sep, "_|-")

and yeah, I refactored that line, but the replacement of '/' with '-' was originally there for 5 years previously.
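
For reference, a quick standalone check of what that quoted line produces (a sketch with a made-up saltenv value; on POSIX, os.path.sep is '/'):

import os

saltenv = "feature/test"  # made-up saltenv containing a path separator
lc_path_adj = saltenv.replace(os.path.sep, "_|-")
print(lc_path_adj)  # on POSIX this prints: feature_|-test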

I am wondering why os.walk would be any better given

os.walk(top, topdown=True, onerror=None, followlinks=False) (https://docs.python.org/3.10/library/os.html#os.walk)
Generate the file names in a directory tree by walking the tree either top-down or bottom-up. For each ...

So would it not also treat the '/' in a git branch name as an os.sep too, similar to os.listdir?

I understand the replacement of '/' with a dash '-' does preclude having both a/b and a-b, but that would also be the same with the use of an underscore, as used in the cache base hash; see

salt/salt/utils/gitfs.py

Lines 483 to 488 in 2b44a6c

self._cache_basehash = str(
    base64.b64encode(hash_type(self.id.encode("utf-8")).digest()),
    encoding="ascii",  # base64 only outputs ascii
).replace(
    "/", "_"
)  # replace "/" with "_" to not cause trouble with file system

Noting the a/b and a-b clash would be for use with __env__ or pillarenv, but the outcome would be an extra cache refresh, which should not be too bad a side effect.

Still to put the changes out for team review, so I am sure they will have some comments too.

Please let me know if I am wrong in my logic about this being a minor side effect.

@idenkov-tc

idenkov-tc commented Feb 6, 2025

We are trying to upgrade from 3002.2 to something newer (onedir) and stumbled upon this issue. We are using __env__/gitfs.

However we do not have / in our branch name, but we do have - and _. Tested on 3006.2 and 3006.3 so far.

Edit: It seems I am not experiencing this issue with 3006.9, or I might have introduced some bad config when I was testing.

@donkopotamus

donkopotamus commented Feb 6, 2025

I understand the replacement of '/' with a dash '-' does preclude having both a/b and a-b, but that would also be the same with the use of an underscore, as used in the cache base hash; see

salt/salt/utils/gitfs.py

Lines 483 to 488 in 2b44a6c

self._cache_basehash = str(
    base64.b64encode(hash_type(self.id.encode("utf-8")).digest()),
    encoding="ascii",  # base64 only outputs ascii
).replace(
    "/", "_"
)  # replace "/" with "_" to not cause trouble with file system

@dmurphy18 Just a side comment on this first before talking about os.walk ... I may be misunderstanding what is happening here, but my reading of this code is:

  • id is the branch name, and it may or may not contain / ... eg a/b/c
  • it is not having any kind of / -> _ replacement done on it (thereby potentially clashing with a_b_c) ... rather it is being hashed
    • and then since the hash may contain / (but not _ as that isn't in the standard b64 alphabet), that is replaced to ensure the hashed branch name is a valid directory name.

So there is no potential for a clash here beyond the extremely minuscule possibility of a hash collision.
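
A quick standalone check of that reading (a sketch assuming sha1 as hash_type and a made-up remote id) shows that only the base64 text of the digest is munged, never a branch name:

import base64
import hashlib

repo_id = "__env__ https://example.com/some-repo.git"  # made-up remote id
digest = hashlib.sha1(repo_id.encode("utf-8")).digest()  # assuming hash_type is sha1
cache_basehash = str(base64.b64encode(digest), encoding="ascii").replace("/", "_")
print(cache_basehash)  # a 28-character base64 string (with '=' padding); no branch name involved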

@donkopotamus

Summary

@dmurphy18 @OrangeDog

This is a long and detailed response but this might be a good summary of my thoughts:

  • The current structure of checkouts is fine, there is no need to munge branch names with / -> -
  • The current structure of the work directory is fine, there is no need to munge branch names with / -> -
  • The current way of deciding whether a checkout needs updating, based on a subdirectory of work is fine
  • The problem is tiny and lies simply in how fetch_request is dropped inside the subdirectories of work

My proposed solution is just a few lines of code, but I may well be missing something altogether.

Details

I am wondering why os.walk would be any better given

os.walk(top, topdown=True, onerror=None, followlinks=False) (https://docs.python.org/3.10/library/os.html#os.walk)
Generate the file names in a directory tree by walking the tree either top-down or bottom-up. For each ...

So would it not also treat the '/' in a git branch name as an os.sep too, similar to os.listdir?

Yes ... but in this case that is exactly the point. Currently, branch names that contain a / are directly mapped into a directory structure. Thus if we are using __env__ and we have a repo with branches a/b/c, d/e and f, then git_pillar will aim to check these out at eg

/var/cache/salt/master/git_pillar/<opaque-hash>/a/b/c
/var/cache/salt/master/git_pillar/<opaque-hash>/d/e
/var/cache/salt/master/git_pillar/<opaque-hash>/f

These are the working directories, with their git repos being in:

/var/cache/salt/master/git_pillar/<opaque-hash>/a/b/c/.git
/var/cache/salt/master/git_pillar/<opaque-hash>/d/e/.git
/var/cache/salt/master/git_pillar/<opaque-hash>/f/.git

This is the current situation for how class GitPillar(GitBase) does checkouts of particular branches when __env__ is in effect, and:

  • this works fine;
  • git rules around branch names mean the branch names are all valid directory paths, without any need to recode / etc
    • further, it is impossible to have a situation where you have a branch named a as well as a/b/c

Next, class GitPillar(GitBase) has created a mirror of this directory structure in <cache-root>/work/<opaque-hash> as:

/var/cache/salt/master/git_pillar/work/<opaque-hash>/a/b/c
/var/cache/salt/master/git_pillar/work/<opaque-hash>/d/e
/var/cache/salt/master/git_pillar/work/<opaque-hash>/f

This also is absolutely fine for all the reasons given above.

When the pillar is fetched, how does it decide if it needs refetching? If the pillar for a/b/c is fetched then fetch_request_check is called:

salt/salt/utils/gitfs.py

Lines 1307 to 1309 in 2b44a6c

def fetch_request_check(self):
    fetch_request = salt.utils.path.join(self._salt_working_dir, "fetch_request")
    if os.path.isfile(fetch_request):

i.e. it looks for <cache-root>/work/<opaque-hash>/a/b/c/fetch_request ... and if that exists then it updates the checkout in <cache-root>/git_pillar/<opaque-hash>/a/b/c. That is, it is looking for the following files:

/var/cache/salt/master/git_pillar/work/<opaque-hash>/a/b/c/fetch_request
/var/cache/salt/master/git_pillar/work/<opaque-hash>/d/e/fetch_request
/var/cache/salt/master/git_pillar/work/<opaque-hash>/f/fetch_request

This logic is also fine, and works perfectly well.

But where do those fetch_request files come from? These are generated in fetch_remotes with this snippet:

salt/salt/utils/gitfs.py

Lines 2791 to 2803 in 2b44a6c

repo_work_hash = os.path.split(repo.get_salt_working_dir())[0]
for branch in os.listdir(repo_work_hash):
    # Don't place fetch request in current branch being updated
    if branch == repo.get_cache_basename():
        continue
    branch_salt_dir = salt.utils.path.join(repo_work_hash, branch)
    fetch_path = salt.utils.path.join(
        branch_salt_dir, "fetch_request"
    )
    if os.path.isdir(branch_salt_dir):
        try:
            with salt.utils.files.fopen(fetch_path, "w"):
                pass

The intent of this code should be to drop a fetch_request marker file in every branch directory that is in the work directory, i.e.:

/var/cache/salt/master/git_pillar/work/<opaque-hash>/a/b/c
/var/cache/salt/master/git_pillar/work/<opaque-hash>/d/e
/var/cache/salt/master/git_pillar/work/<opaque-hash>/f

Instead, as it uses os.listdir, it drops files in:

/var/cache/salt/master/git_pillar/work/<opaque-hash>/a    [this doesn't even correspond to a branch]
/var/cache/salt/master/git_pillar/work/<opaque-hash>/d    [this doesn't even correspond to a branch]
/var/cache/salt/master/git_pillar/work/<opaque-hash>/f

Now, requests for the pillar corresponding to f will cause it to be refreshed before it is returned. But requests for e.g. a/b/c will not (as it checks for <cache-root>/work/<opaque-hash>/a/b/c/fetch_request, which does not exist).

How do we drop a fetch_request in every terminal subdirectory of <cache-root>/<opaque-hash>? We can use os.walk:

  • it generates tuples of the form (path, sub_dirs, files) ... we can find terminal directories by filtering for those where sub_dirs is empty
  • the path is the full path, including eg <cache-root>/<opaque-hash> so we then normalise for that using os.relpath

For example:

repo_work_hash = os.path.split(repo.get_salt_working_dir())[0]
branches = [
    os.relpath(path, repo_work_hash) 
    for (path, subdirs, files) in os.walk(repo_work_hash) if not subdirs
]
for branch in branches:
    # Don't place fetch request in current branch being updated
    if branch == repo.get_cache_basename():
        continue
    ...

Note that this logic all still works fine for the GitPillar when __env__ is not in effect, as the branch is then stored simply in a subdirectory called _ and os.walk will still find it. (I have not looked deeply into what will happen with gitfs itself, as opposed to git_pillar, but the former also seems to simply have _ in the work directory so again os.walk should be fine here)

@dmurphy18
Contributor

@donkopotamus I had the PR passing and it got merged just as I was about to edit it to try the os.walk approach (I had played around with os.walk and similar on the git branch name 'doggy/moggy' in a test file).

The PR was passing all tests and had the milestone set for 3006.10, which is close to release, hence it made sense to merge it (by the merge master; I had requested that this morning before I saw your comment, was on PTO Friday, and only saw your comment after lunch).

Hence I have opened a new tech-debt issue to fix this up to use the os.walk approach after the 3006.x branch is released, which is imminent; see #67722.

Hence closing this issue, since PR #66954 has been merged.
