✨ Adding ln.track() and ln.Artifact(filepath).save() equivalents #76
Comments
It turns out that the important calls already work via reticulate. Executing the following script:
library(reticulate)
# Import lamindb
ln <- import("lamindb")
ln$track("EPnfDtJz8qbE0000", path="test-laminr.R") # <-- unique id for the script, script path
# Create a sample R dataframe
r_df <- data.frame(
id = 1:5,
value = c(10.5, 20.3, 15.7, 25.1, 30.2),
category = c("A", "B", "A", "C", "B")
)
# Save the dataframe as RDS
storage_path <- "example_data.rds"
saveRDS(r_df, storage_path)
ln$Artifact(storage_path, description="Example dataframe")$save() # save an artifact
ln$finish() # mark the script run as finished
gives rise to these logs:
% Rscript test-laminr.R
→ connected lamindb: falexwolf/docsbuild
→ created Transform('EPnfDtJz'), started new Run('ZEauiBpc') at 2024-11-12 18:52:06 UTC
→ returning existing artifact with same hash: Artifact(uid='fkOF8rmyP06cA4lQ0000', is_latest=True, description='Example dataframe', suffix='.rds', size=227, hash='R94kXc7AXWPd2zJgkpHwpg', _hash_type='md5', visibility=1, _key_is_virtual=True, storage_id=1, created_by_id=1, created_at=2024-11-12 18:49:24 UTC)
Artifact(uid='fkOF8rmyP06cA4lQ0000', is_latest=True, description='Example dataframe', suffix='.rds', size=227, hash='R94kXc7AXWPd2zJgkpHwpg', _hash_type='md5', visibility=1, _key_is_virtual=True, storage_id=1, transform_id=1, run_id=1, created_by_id=1, created_at=2024-11-12 18:49:24 UTC)
→ finished Run('ZEauiBpc') after 0h 0m 1s at 2024-11-12 18:52:08 UTC
Five TODOs. (1) Generating the
@Zethson mentioned that the mlflow R package is a rather shallow wrapper around the Python mlflow package and relies heavily on reticulate for it. We should think through all the pros and cons of building certain features in R, hitting the REST API (and building logic server-side) vs. using reticulate. It might be that the current path is, in fact, a sweet spot: leverage the REST API for queries, leverage reticulate for basic Python logic. But I need to think more and we should have a dedicated meeting to discuss.
I think this should be doable but it might take some messing around. I'm not exactly sure how to catch the Python error but it should be possible and then we can output the matching R code to run. I think in RStudio we can also make it so you can click to run the code.
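One way the error translation could look, as a minimal sketch: reticulate surfaces Python exceptions as R errors, so a plain tryCatch() can intercept them and rewrite any Python-style call in the message into its reticulate form. The helper names `to_r_syntax()` and `with_r_hint()` are hypothetical, and the sketch assumes the Python message contains the suggested call as literal text such as `ln.track("...")`; the real message format would need to be checked.

```r
# Hypothetical helper: rewrite Python method-call syntax (`.name(`) in a
# message into the reticulate form (`$name(`).
# Limitation: any `.name(` pattern is rewritten, including R functions with
# dots in their names, so this only suits messages that contain Python calls.
to_r_syntax <- function(msg) {
  gsub("\\.([A-Za-z_][A-Za-z0-9_]*)\\(", "$\\1(", msg, perl = TRUE)
}

# Hypothetical wrapper: evaluate `expr` and re-raise any error with the
# message translated to R syntax.
with_r_hint <- function(expr) {
  tryCatch(
    expr,
    error = function(e) {
      stop(to_r_syntax(conditionMessage(e)), call. = FALSE)
    }
  )
}
```

A call such as `with_r_hint(ln$track())` would then show the user a line they can paste into R directly.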
We can check. I'm not sure if there is a way to get the current running path or not (possibly for Rmd/qmd but not R scripts?).
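A rough sketch of how the path lookup might work, trying each context in turn: Rscript passes the script as a `--file=` argument, knitr exposes the document being rendered through `knitr::current_input()`, and RStudio exposes the active editor document through rstudioapi. The function names are hypothetical, and the fallback order is an assumption rather than a tested design.

```r
# Extract the script path from the arguments that Rscript passes to R.
# Returns NULL when no `--file=` argument is present (e.g. interactive use).
script_path_from_args <- function(args = commandArgs(trailingOnly = FALSE)) {
  file_arg <- grep("^--file=", args, value = TRUE)
  if (length(file_arg) == 0) {
    return(NULL)
  }
  sub("^--file=", "", file_arg[[1]])
}

# Hypothetical lookup: Rscript first, then knitr (.Rmd/.qmd), then RStudio.
detect_script_path <- function() {
  path <- script_path_from_args()
  if (!is.null(path)) {
    return(path)
  }
  if (requireNamespace("knitr", quietly = TRUE)) {
    input <- knitr::current_input() # NULL when not rendering a document
    if (!is.null(input)) {
      return(input)
    }
  }
  if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
    path <- rstudioapi::getSourceEditorContext()$path
    if (length(path) == 1 && nzchar(path)) {
      return(path)
    }
  }
  NULL # no path could be determined; the caller has to ask the user
}
```

If nothing is found, the caller would still need the explicit `path` argument used in the script above.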
In theory, yes, but RStudio does mess around with paths and the order things are run. Yesterday I was having issues connecting to a Python environment in RStudio that worked fine in the R terminal. That's fairly rare though, and if the environment connects it should work. There are ways to help manage environments for users but I'm not sure if they fit here because they also need to be able to run things on the command line (currently at least).
We should be able to wrap them so that they have a similar interface to Python. Usually that's fairly easy once you get code that works.
This is a problem I have already run into trying to make it possible to create artifacts from data frames. In R we don't have issues connecting to multiple instances (if they have APIs, I'll add more about that in a second). But as far as I can tell, you can't do that in a Python session. I haven't fully tested it yet but I think this will be an issue with using {reticulate} as anything that imports Python lamindb will connect to an instance in that session and block connections to any other instances. I also had to make sure the Python auto connect setting was turned off so we make sure we connect to the same instance in R and Python.
Maybe we need to have the same limitation of only connecting to one instance in a session? This would mean we can't have the suggested workflow of connecting to CELLxGENE, doing something, and saving results to a local instance but I'm not sure that is possible in Python either.
API vs {reticulate}
On the point of using the API vs {reticulate}. There are pros and cons either way but I have already run into one fairly major issue. We designed the current implementation around the API which unfortunately means we can't connect to a local instance where that API doesn't exist. @rcannood and I can think about if there is a workaround but if not the only solution I can think of is to either have an API for a local instance or rewrite the package to wrap Python instead. Neither of which would be quick/easy. A temporary solution might be to have a public API instance that we can mess around with. This is probably something we need to have a meeting about though.
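If the one-instance-per-session limitation raised above were adopted on the R side, it could be enforced with a small guard like the sketch below. Everything here is hypothetical (the environment, the function name, and identifying instances by a slug string); it only illustrates the rule, not how laminr actually connects.

```r
# Hypothetical session state; not part of laminr or lamindb.
.lamin_session <- new.env(parent = emptyenv())

# Record the first instance connected in this session and refuse any
# attempt to connect to a different one afterwards.
connect_single_instance <- function(slug) {
  current <- .lamin_session$slug
  if (!is.null(current) && !identical(current, slug)) {
    stop(
      sprintf(
        "Already connected to '%s' in this session; cannot also connect to '%s'.",
        current, slug
      ),
      call. = FALSE
    )
  }
  .lamin_session$slug <- slug
  invisible(slug)
}
```

Reconnecting to the same instance is a no-op, while a second, different instance fails early with a clear message instead of silently diverging between R and Python.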
Thanks for the in-depth response, Luke! All your points make sense to me. Would you have time for a brief call to come up with solutions and clarify some questions that I read between the lines? Either at 10 am, 11:30 am, or 1 pm, or, if none of these work, https://calendly.com/falexwolf/45flex?
I'm now convinced that we also need to build ln.track() right away and can't delay it by several weeks. I'll help.
What we'll do is the simplest possible way of implementing data lineage tracking:
What db$track("EPnfDtJz8qbE0000") does under the hood is simply call ln.track() in Python via reticulate.
Then, upon db$Artifact.create("filepath", description="description") (if I remember correctly how we wanted to call this function), the artifact will be linked against a run and there will be an automatic registration of the output of the script.
Does this sound reasonable?
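A minimal sketch of what such a thin wrapper could look like, assuming the Python calls behave as in the script earlier in this thread (`ln$track(uid, path = ...)`, `ln$Artifact(...)$save()`, `ln$finish()`). The constructor `make_db()` and the method name `create_artifact()` are hypothetical stand-ins, since the final name of the artifact function is still open; the module is passed in as an argument so that it can be swapped out without a Python session.

```r
# Hypothetical constructor: returns a list of functions that forward to the
# Python lamindb module. `ln` defaults to the reticulate import and is only
# evaluated when one of the functions is first called.
make_db <- function(ln = reticulate::import("lamindb")) {
  list(
    # Start tracking the current script; forwards to Python ln.track().
    track = function(uid, path = NULL) {
      ln$track(uid, path = path)
      invisible(NULL)
    },
    # Register a file as an artifact; it is linked to the tracked run on the
    # Python side, as in the logs shown earlier in this thread.
    create_artifact = function(filepath, description = NULL) {
      ln$Artifact(filepath, description = description)$save()
    },
    # Mark the run as finished; forwards to Python ln.finish().
    finish = function() {
      ln$finish()
      invisible(NULL)
    }
  )
}
```

Usage would mirror the Python workflow: `db <- make_db()`, then `db$track("EPnfDtJz8qbE0000", path = "test-laminr.R")`, `db$create_artifact("example_data.rds", description = "Example dataframe")`, and `db$finish()`.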
Background
I've started to work on enabling saving R scripts, .qmd, and .Rmd here: lamin-cli#95
In the process of doing so, it became clear that without an ln.track() equivalent, we'd start to support a whole new way of dealing with transforms which wouldn't even merit the name "Transform" anymore. We'd have no way of associating output artifacts with the R script.
While supporting all of this is possible, it would both introduce strange patterns in the code base and introduce a way of doing something that we're upfront not OK with. A BIG argument for using lamindb is that you get data lineage and no longer need to wonder "where the hell my dataset came from". That's why I'm concluding we should invest in getting this to work in R, too, right away.