-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add git to filesystem source 301 #312
Conversation
- was loading same file twice - it now matches the behaviour in the equivalent pytest.
…em-source-301' into add-git-to-filesystem-source-301
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is good piece of work, thanks for the docstrings and tests for the filesystem! see my review for improvement.
now your questions
modified_date is valuable for ETL, but how valuable?
it allows for incremental load - to retrieve only files that were modified from the last run. our use case: we have GPT-4 help in our docs with now we reindex from scratch even if a single doc page in our docs changes. with that we will reindex only changed pages. and doing embedding costs and takes time
Do we also want ref (sha, tag, branch) as an optional parameter? The existing github and git fsspec implementations have that parameter.
yes - we need ref
, we should be able to checkout branch, tag or commit
Better name than gitpythonfs? git is already taken by the implementation based on pygit2
cool name!
I am not sure best way to instruct or warn user that they need git installed.
git python does it nicely! you can look at dlt.common.git - we import gitpython lately (per particular function) so lack of the git binary does not kill the whole process too early
The package's tests uses a repo dynamically built as a pytest fixture. I was wondering if, for verified-sources tests/filesystem/test_filesystem.py, we could try to reference the verified-sources repo - , rather than create a new repo just for sample files. Were there other reasons for a separate repo for test samples?
in those tests we have a standard set of directories and files. so the same tests (without any modifications) can pass for a new file system we add. so what you need is to create a repo with those files (and a few commits I'd say to test timestamps) and we can distribute the bare repo in cases
folder. also see later
Does not fully support putting arguments, or omitting arguments, in fsspec urls. For example, the repo path and ref in gitpythonfs://[path-to-repo:][ref@]path/to/file causes errors and strange output
I need to see the errors. If those are valid URIs we should be able to parse them
Requires custom code tweaks for each fsspec implementation, for example, to get modified_date. Conversely, uses several inline if statements to detect and support file fsspec.
true. so we should fallback to some standard name ie mtime
for unknown protocols. and we should use it for our own. feel free to do that as part of this topic
Are there any plans to expose fsspec's higher level methods - open() or open_files() ?
we do that in FIleDict. you can also access the origina, authenticated file system
Conflates credentials and file system configuration. - rules and structures that work for bucket-like services ( s3, az, gcs ) and files don’t work other fsspec implementations.
yeah possibly... any examples :)
the idea is that: bucket_url
contains all required settings to open url and credentials really contain credentials.
also I noticed that we do not use the host and port which is actually quite cool. we could use them to clone (bare) remote repo from github/gitlab right?
results = [] | ||
|
||
# For traversal, always start at the root of repo. | ||
tree = self.repo.tree() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be cached in init. 100% it spawns git command and creates a process per every ls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 14eaa82 (link is not in PR context)
Since git refs feature was added the trees differ by commit, so the cache is a little more complex than setting a variable in init.
There is some danger of the user getting unexpected results if they move a pointer (branch, HEAD) during the lifetime of an fsspec instance (noting that fsspec caches instances by defaul). For now, I have simply mentioned that in docstrings as I believe it's an edge case. If it's a common case, the cache could be indexed by commit sha, with conversion functions between ref <--> sha. But I'm wary of spawning more git commands just to manage the cache :) !
# "committed_date": commit.committed_date, | ||
} | ||
|
||
if include_committed_date: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah this will create a process for each file to run git rev parse or something similar. this is indeed a no go... what we should do is to index and cache a whole tree or a path when doing ls
with a single git command.
ie. look here https://stackoverflow.com/questions/1964470/whats-the-equivalent-of-subversions-use-commit-times-for-git
git --no-pager whatchanged --pretty=%at
this is really fast
then here you'll just look into cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First cut of this is at 9daaf9a
It is not as fast as I had hoped.
- I used
git log --raw
instead ofgit whatchanged
.log
has better documentation. Apparently they both usegit rev-list
behind the scenes. Also using a git pathspec to limit what is returned to the scope of whatls()
is doing. - The caching is very conservative - limited the single tree (not recursive) that
ls()
is operating on. So one call togit log --raw
for every folder in the repo. - Could do one
git log --raw
for the entire folder hierarchy, for given ref. It starts to push the architecture of fsspec though. As far as I can see, we would need to override (or examine at runtime) something up the hierarchy (or call stack) - like glob(), walk() - to detect when they're about to deeply traverse the repo. The glob pattern could then be applied in the pathspec of the git command so it only pre-fetches revisions that will be used. - Note complexity that any cache needs to be indexed by ref
II will try to find where it is slow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ref (commit sha, tag, branch) can be supplied in url or function parameter.
This parameter value should only be the path to the git repo, not to folders/files in the repo.
twice as fast as previous method, but still 4s to walk() dlt repo.
walk() times by repo: dlt: 0.5s wine: ~50s
- Does not pass tests yet because dlt fsspec adapter doesn't understand netloc component with `/`. Following commits will fix that. - s3, gs, az test parameters disabled for local developmet. Plan to have s3 working on local
Interdependent code chages are being made
- includes comments about code issues. - for more discussion see main PR at dlt-hub/verified-sources#312
Just sharing WIP in dlt main repo to complement this PR. See these branches:
Those branches have @Pipboyguy 's filesystem and client args features merged in. I"m ready to discuss when suits you. I can provide more written analysis if needed. |
By passing it a ready-to-use fs client, we're not relying on the value of of FileItem.file_url to construct one. Specifically, if the fsspec client requires instantiation with some parameters in the url netloc field, those parameters could be missing from FileItem.file_url. Generally, FileItem.file_url has conflicting uses.
They aren't valid URIs unless the netloc params are very carefully quoted/unquoted - that is definitely part of the problem. There are other issues with putting fssspec parameters in the netloc, which I have highlighted in a draft PR in the main dlt repo: dlt-hub/dlt#893 . It includes some ideas for restructuring the dlt.common.storages code in general, that might help with future fssspec requirements. But even after that investment I think the urls parameters could still be troublesome. They are not very consistently implemented in fsspec itself so it is hard to deal with them in an abstract way. As an experiment, I dropped the requirement for supporting fssspec parameters in the netloc, and:
|
closing in favor of #350 |
Add experimental fsspec implementation for git, based on GitPython.
@rudolfix please comment and give direction.
Relevant issue
Issue #301
More PR info.
gitpythonfs
fsspec implementation with fsspec. It could stay in verified-sources (not as project), be in a fork of fsspec, some day integrated into fsspec as standard implementation.ref
(sha, tag, branch) as an optional parameter? The existing github and git fsspec implementations have that parameter.committed_date
- the best git proxy for modifed_date. But this is an expensive lookup in git db, slowsls()
by x20 andls()
gets called a lot - in find, walk, glob etc. Takes 8 seconds to walk() dlt-hub/dlt repo, which is just small/medium size? So there is optioninclude_committed_date=False
to omit gettingcommitted_date
. modified_date is valuable for ETL, but how valuable? An alternative is to read all commits (or recent commits, with append disposition) as a separate dlt source/resource and JOIN with the files in code or in destination db.gitpythonfs
?git
is already taken by the implementation based on pygit2tests/filesystem/test_filesystem.py
, we could try to reference theverified-sources
repo - , rather than create a new repo just for sample files. Were there other reasons for a separate repo for test samples?Next step is to make it work with filesystem verified source. See skeleton example below. There are some barriers to that working...
Observations about dlt.common.storages fsspec handling:
gitpythonfs://[path-to-repo:][ref@]path/to/file
causes errors and strange output. Should it support that? Or should we always use keyword args (coming in issue #814)if
statements to detect and supportfile
fsspec.open()
oropen_files()
?Uncommitted example function for
sources/filesystem_pipeline.py
:That would be the first example where the file contents are kept intact, not parsed like for csv, parquet etc.
Could do a similar example that reads *.md doco from dlt repo and loads it for GPT4 docs chatbot. Was that an original use case for git fsspec?