Faster GitIgnore directory check #3007

Open
wants to merge 2 commits into
base: master

fellhorn
Contributor

Why are the changes needed?

flytekit's GitIgnore performs a recursive file check to detect empty or ignored folders instead of checking the folder's status directly. For folders with many files (e.g. a Python .venv), this can be unnecessarily slow.

An extreme example with 1M files in an ignored folder:

Old ignore folder: 5.39s
New ignore folder: 0.000158s
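
To illustrate why the recursive approach scales with the number of files, here is a minimal sketch of such a check. It is not flytekit's actual code: the function name `is_dir_ignored_recursive` and its parameters are hypothetical, and `ignored_files` is assumed to be a set of POSIX-style paths relative to `root`.

```python
import os
from pathlib import Path
from typing import Set


def is_dir_ignored_recursive(root: str, rel_path: str, ignored_files: Set[str]) -> bool:
    """Hypothetical recursive check: a directory counts as ignored only if
    every entry below it is ignored (an empty directory therefore counts too)."""
    # Ignored paths are assumed to be stored in POSIX form.
    if Path(rel_path).as_posix() in ignored_files:
        return True
    abs_path = os.path.join(root, rel_path)
    if os.path.isdir(abs_path):
        # Every child is visited, so the cost grows with the size of the tree.
        return all(
            is_dir_ignored_recursive(root, os.path.join(rel_path, name), ignored_files)
            for name in os.listdir(abs_path)
        )
    return False
```

With a check of this shape, a folder holding 1M ignored files needs on the order of 1M set lookups plus the directory listings before it can be declared ignored, which is consistent with the multi-second timing above.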
Code

The measurements were created using this script:

import logging
import os
import shutil
import subprocess
import time
from pathlib import Path

from flytekit.tools.ignore import (
    GitIgnore as GitIgnoreOld,
    Ignore,
)


class GitIgnoreNew(Ignore):
    # The implementation from this PR
    ...

start = time.perf_counter()
newIgnore = GitIgnoreNew(Path.cwd())
print(f"New ignore setup: {time.perf_counter() - start}")

start = time.perf_counter()
oldIgnore = GitIgnoreOld(Path.cwd())
print(f"Old ignore setup: {time.perf_counter() - start}")

start = time.perf_counter()
assert newIgnore.is_ignored("large-file-collection")
print(f"New ignore folder: {time.perf_counter() - start}")

start = time.perf_counter()
assert newIgnore.is_ignored("large-file-collection/1.txt")
print(f"New ignore file: {time.perf_counter() - start}")

start = time.perf_counter()
assert oldIgnore.is_ignored("large-file-collection")
print(f"Old ignore folder: {time.perf_counter() - start}")


start = time.perf_counter()
assert oldIgnore.is_ignored("large-file-collection/1.txt")
print(f"Old ignore file: {time.perf_counter() - start}")

What changes were proposed in this pull request?

  1. Use git ls-files to also check against the list of ignored directories
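
As a rough illustration of this idea (not the code from this PR), the sketch below asks git for ignored directories once and then answers directory queries with set lookups. The helper names `parse_ls_files_output`, `list_ignored_dirs` and `is_ignored_dir` are hypothetical. It assumes that `git ls-files -io --exclude-standard --directory` reports a wholly ignored directory as a single entry with a trailing slash, and that paths contain no characters git would quote.

```python
import shutil
import subprocess
from pathlib import Path, PurePosixPath
from typing import Set


def parse_ls_files_output(stdout: str) -> Set[str]:
    """Turn newline-separated `git ls-files` output into a set of paths."""
    return {line for line in stdout.splitlines() if line}


def list_ignored_dirs(root: Path) -> Set[str]:
    """Ask git once for ignored directories; empty set if git is unavailable."""
    if shutil.which("git") is None:
        return set()
    result = subprocess.run(
        ["git", "ls-files", "-io", "--exclude-standard", "--directory"],
        cwd=root,
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Not a git repository, or git reported an error.
        return set()
    return parse_ls_files_output(result.stdout)


def is_ignored_dir(path: str, ignored_dirs: Set[str]) -> bool:
    """Check the directory and its parents against the precomputed set."""
    posix = PurePosixPath(Path(path).as_posix())
    # A nested directory is covered when one of its parents is listed.
    candidates = [posix, *posix.parents]
    return any(str(candidate) + "/" in ignored_dirs for candidate in candidates)
```

The subprocess call happens once at setup time, so each later directory check costs a handful of set lookups regardless of how many files the directory contains. Walking the parents is a design choice of this sketch, made because a nested directory inside an ignored one is assumed not to be listed separately.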

How was this patch tested?

Existing unit tests appear to cover the changed code already; performance benchmarks were run manually as described above.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Signed-off-by: Dennis Keck <[email protected]>

codecov bot commented Dec 16, 2024

Codecov Report

Attention: Patch coverage is 64.70588% with 6 lines in your changes missing coverage. Please review.

Project coverage is 50.91%. Comparing base (f99d50e) to head (4fa0c40).
Report is 5 commits behind head on master.

Files with missing lines    Patch %   Lines
flytekit/tools/ignore.py    64.70%    4 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3007      +/-   ##
==========================================
- Coverage   51.08%   50.91%   -0.18%     
==========================================
  Files         201      201              
  Lines       21231    21173      -58     
  Branches     2731     2728       -3     
==========================================
- Hits        10846    10780      -66     
- Misses       9787     9797      +10     
+ Partials      598      596       -2     

