Vendor or scrap lazy mode? #79
Comments
Is there a particular reason why you need to use the lazy API? And have you tried replacing usage of
The main difference that Of course, if you don't want the weight of Dagger, it would be a reasonable time to remove it as a dependency. But I do think that there could also be an opportunity to have a "generic"
I'm all for making use of Dagger in the eager loading as well; I just haven't gotten around to it. To me, that is a bit of a separate issue.
I use the lazy mode of FileTrees quite a lot in my workflow when doing exploratory data analysis. This often involves lazy-loading data that does not fit in memory, so I absolutely do not want to load the entire data set before slicing and combining things in the tree interactively.
One simple-to-explain case when the lazy loading is very useful is when using This could clearly be done without the lazy mode, but I fear that it will not feel as good to use. Also note that this does not really need Dagger's lazy mode scheduling (I would never ask you to keep it alive just for this), but I think it can be worth having a lightweight lazy mode in FileTrees. To me, this is independent of whether Dagger is used to enable parallelism. For me, the usefulness of lazy mode has more to do with interactivity than with parallelism, which is why I also think it makes some sense to vendor it.
Another common use case is to supply a FileTree to a long function which plots various aspects of the data and generates a report. The fact that it does not matter whether the tree is lazy or not is extremely convenient: one can develop the function with a small dataset loaded into memory (e.g. trying out the preprocessing for each plot in the REPL) and then just supply a lazy FileTree with the larger data that does not fit, and it just works. With eager mode it would start trying to load all the data (which would fail), whereas with lazy mode only the data needed for each plot is loaded (and can then be garbage-collected after the plot is produced and put in the report).
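The "function does not care whether the tree is lazy" point can be illustrated with a minimal sketch. This is not FileTrees' actual API: `LazyValue`, `materialize`, `mean_of`, and `fake_load` are hypothetical names invented for the illustration, and a plain `Dict` stands in for a FileTree.

```julia
# Hypothetical stand-in for a lazily loaded file value; not FileTrees' own type.
struct LazyValue
    load::Function   # zero-argument function that produces the data
end

materialize(x) = x                      # eager values are returned as-is
materialize(x::LazyValue) = x.load()    # lazy values are loaded on demand

# A "report" step that only touches the one file it needs.
# `tree` is a plain dictionary here, standing in for a FileTree.
function mean_of(tree::AbstractDict, path::AbstractString)
    data = materialize(tree[path])
    return sum(data) / length(data)
end

loads = Ref(0)  # counts how many lazy loads actually ran
fake_load(v) = LazyValue(() -> (loads[] += 1; v))

eager_tree = Dict("a/x.csv" => [1.0, 2.0, 3.0],
                  "b/y.csv" => [10.0, 20.0])
lazy_tree  = Dict("a/x.csv" => fake_load([1.0, 2.0, 3.0]),
                  "b/y.csv" => fake_load([10.0, 20.0]))

# The same function works on both trees; with the lazy tree,
# only "a/x.csv" is loaded and "b/y.csv" is never touched.
mean_of(eager_tree, "a/x.csv")
mean_of(lazy_tree, "a/x.csv")
```

After the two calls, `loads[]` is 1, which is the property described above: the function can be developed against the small in-memory tree and then handed the lazy one unchanged.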
It sounds like I might have missed the features that you rely on when considering my response - if you wouldn't mind, could you provide an example or two (with code) of things that the lazy API lets you do that you're concerned that the eager API would fail on? That makes it more concrete for me and might help me see where the eager API is lacking.
Also, the eager API fully supports lazy-loading of data, but of course the first task that uses a piece of lazy-loaded data as an argument will force that data to be materialized, at least for some amount of time (the data can be swapped back to disk automatically). It might be that we need more tools for defining lazy/batched operations over lazy-loaded data, which might look very close to the lazy task API, so we could consider what that would look like.
In the end, it is also possible to build a full lazy API on top of the eager API - this probably seems redundant, but the lazy API exists in a way that makes the eager API harder to implement, so removing and re-adding it could lead to an overall better design that still benefits from everything the eager API offers.
Also I missed the example you posted - I'll read into it and let you know if I can come up with any solutions! Please point me to any others that you think could help me!
Thanks a lot for your responses and for investing time in this issue. I really appreciate it!
Let me know if you can't make sense of the example I provided and I'll try to construct another one. Just imagine that the file structure in the example is much larger, deeper and more irregular, that the data does not fit into memory, and that the main thing you want to do is to combine data based on patterns in the paths. Also imagine that you make use of FileTrees to find said patterns (e.g. by just looking at its output). :)
Just to try to be perfectly clear: I don't think anything is needed from Dagger, neither in terms of maintaining the lazy API nor in terms of adding capabilities to the eager API. My point of view is that FileTrees provides a lazy mode which happens to be quite convenient in everyday use of what FileTrees does. Up to this point, this functionality was just half-accidentally freeriding on the fact that Dagger happened to have it (this is probably not how FileTrees was conceived, but this is how it looks to me as a user of FileTrees). Now that time has ended and I/we need to do something about it.
I saw Dagger.File in the docs, and if Dagger could be given the lazy-ish capabilities you speak of, that sounds like it could be quite useful though.
Everything is really clear from the example you posted and from the documentation as a whole - it's all a great introduction to FileTrees! I definitely see a clear benefit from using the lazy API for FileTrees - primarily the chainable nature of it is quite powerful for building up a transformation. To be clear, it seems like there are two modes - the default simulates the movement/transformation of files within the tree (allowing users to preview the result of a transformation), while adding in I do agree that Dagger maybe is not necessary for the core of FileTrees for many users. Implementing a more basic chainable, function call-based DAG which can be built and then evaluated all at once would probably suffice for 95% of needs, it seems.
Still, if it's not too much of a burden, and some of Dagger's other features can prove useful for users, I think it could make sense to keep the maintenance burden of supporting such a runtime system on Dagger, since that's what Dagger tries to do best 😄
If we do agree that keeping Dagger as the FileTrees core is a reasonable idea, then let me outline some of the features that I personally see as being key to FileTrees' utility to users (whether they're currently used or might be used in the future):
Awesome! I'm happy that you think the use case is worthwhile to have in Dagger. The main reason I started this issue was just because I'm also aware of how much of a maintenance burden the current lazy API is. I can see how adding some laziness on top of the eager API could be much easier.
I think it is pretty much correct, except that I agree with the entire list above, and I do make use of the other stuff from time to time. The only thing is the lazy API. Minor question: What is the reason that Another minor comment is that I often find that threads slow things down since the tasks are memory bound so I often run with
I would not want to use a Dagger.File here. A philosophy of FileTrees.load is that it just takes a path and returns anything it likes: the most parameterized thing imaginable.
I am a little bit saddened that Dagger's original goal of being out-of-core, and necessarily lazy for that reason, now needs to be rediscovered and added back as a nice-to-have. It might just make sense to have a smaller package that does this well. (I would look into @tanmaykm's scheduler from 2018, since that was written with the lazy graph in mind and based on the best research available then. See more here https://www.youtube.com/watch?v=2G4ptA5J1bk) But to begin with, a simple work-stealing scheduler would be good enough for most workloads FileTrees can run on.
Can you clarify what you mean by this, maybe with an example? If you mean that you can just pass a path and Dagger would return an object of the appropriate data type, you could always use
Have you tried the new out-of-core support in Dagger? It's provided via MemPool and implements a tunable LRU/MRU strategy, and allows working with data of any type. Any Dagger
It's not ideal, that's true, but Dagger's core APIs are basically maintained by only me, and my focus has been on supporting APIs that compose with Julia's inherently sequential, eager interfaces. If you wanted to help me figure out how to rebuild
I also wish that this scheduler would have made its way into Dagger's core codebase. At the time, it seemed like that was the goal, but it didn't appear that anyone was actively working on either integrating it into Dagger's core, or getting the ecosystem interested in using it (to help spur development). In the absence of being able to understand it myself, I had to basically implement things from scratch within Dagger's core. In the end, we now have a quite capable scheduler built into Dagger, and I have a large overhaul planned that will make it ever more programmable. I also want to improve its reliability and robustness in the face of various failures and stalls, but I will need a lot of help from the ecosystem to find use cases (with locally reproducible code) to help drive that development.
The writing has been on the wall for a long time for Dagger's lazy API (which is what powers FileTrees' lazy mode), and in the latest release it gives deprecation warnings when used.
I personally use the lazy mode quite a bit in my workflows so I wouldn't mind adding a simple lazy computation framework in FileTrees. I'm not sure however if it is just because 1) I'm used to it and 2) I underestimate how much effort it would be to maintain it.
I haven't given it much thought, but it seems that a lightweight lazy computation framework which just recursively executes the thunks would not be much extra work. Getting parallelism on this could then just be a matter of putting Dagger.@spawn in front of every call when executing, or something like that.
I guess this also gives the opportunity to have Dagger as a weak dependency. In about 90% of my use cases I don't have any use for parallelism, and if this is similar for other users it could reduce the weight of the package by quite a bit.
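A minimal sketch of what such a "recursively execute the thunks" framework could look like is given below. All names (`Thunk`, `lazy`, `exec`, `pexec`) are hypothetical and are neither FileTrees' nor Dagger's API. To keep the example self-contained, the parallel variant uses Base's `Threads.@spawn` as a stand-in for the Dagger.@spawn idea; whether Dagger could be dropped in at that spot unchanged is an assumption, not something verified here.

```julia
# Hypothetical minimal lazy framework; not FileTrees' or Dagger's API.
struct Thunk
    f::Any          # the function to call
    args::Tuple     # arguments, which may themselves be Thunks
end

# Record a call instead of running it.
lazy(f, args...) = Thunk(f, args)

# Sequential execution: evaluate argument thunks first, then the call itself.
exec(x) = x
exec(t::Thunk) = t.f(map(exec, t.args)...)

# Parallel variant: every argument thunk runs as its own task.
# Threads.@spawn stands in for the Dagger.@spawn suggestion.
pexec(x) = x
function pexec(t::Thunk)
    started = map(a -> a isa Thunk ? Threads.@spawn(pexec(a)) : a, t.args)
    args = map(a -> a isa Task ? fetch(a) : a, started)
    return t.f(args...)
end

a = lazy(+, 1, 2)      # nothing is computed yet
b = lazy(*, a, 10)
c = lazy(-, b, a)

exec(c)    # runs the whole graph sequentially
pexec(c)   # same result, with argument thunks evaluated as tasks
```

Note that this sketch does not memoize: `a` is shared by `b` and `c` and is therefore evaluated twice. A real implementation would probably cache results per thunk, which is part of the maintenance cost being weighed in this issue.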
Maybe one could also have a weak dependency for Distributed, although I guess that would be a slippery slope towards reimplementing the entire lazy machinery in Dagger which should probably be avoided.
@shashi and @jpsamaroo: I guess this will be a larger change to the package in either case, so I'd like to get your opinions if you have the time.