
✨ Adding ln.track() and ln.Artifact(filepath).save() equivalents #76

Closed
falexwolf opened this issue Nov 12, 2024 · 6 comments

falexwolf commented Nov 12, 2024

I'm now convinced that we also need to build ln.track() right away and can't delay it by several weeks. I'll help.

What we'll do is the simplest possible way of implementing data lineage tracking:

library(laminr)

db <- connect()  # <-- connect to your default db instance, configured via CLI

db$track("EPnfDtJz8qbE0000")  # <-- unique id for your script

# do things

db$finish()  # <-- finish run of script

What db$track("EPnfDtJz8qbE0000") does under the hood is simply call ln.track() in Python via reticulate.

Then, upon db$Artifact.create("filepath", description="description") (if I remember correctly how we wanted to call this function), the artifact will be linked against a run, and the output of the script will be registered automatically.

Does this sound reasonable?

Background

I've started to work on enabling saving R scripts here

In the process of doing so, it became clear that without an ln.track() equivalent, we'd start to support a whole new way of dealing with transforms which wouldn't even merit the name "Transform" anymore. We'd have no way of associating output artifacts with the R script.

While supporting all of this is possible, it would both introduce strange patterns into the code base and introduce a way of doing things that we're not OK with up front. A BIG argument for using lamindb is that you get data lineage and no longer need to wonder "where the hell my dataset came from". That's why I'm concluding we should invest in getting this to work in R, too, right away.


falexwolf commented Nov 12, 2024

It turns out that the important calls work already via reticulate.

Executing the following script:

library(reticulate)

# Import lamindb
ln <- import("lamindb")

ln$track("EPnfDtJz8qbE0000", path="test-laminr.R")   # <-- unique id for the script, script path

# Create a sample R dataframe
r_df <- data.frame(
  id = 1:5,
  value = c(10.5, 20.3, 15.7, 25.1, 30.2),
  category = c("A", "B", "A", "C", "B")
)

# Save the dataframe as RDS
storage_path <- "example_data.rds"
saveRDS(r_df, storage_path)

ln$Artifact(storage_path, description="Example dataframe")$save()  # save an artifact
ln$finish()  # mark the script run as finished

gives rise to these logs:

% Rscript test-laminr.R
→ connected lamindb: falexwolf/docsbuild
→ created Transform('EPnfDtJz'), started new Run('ZEauiBpc') at 2024-11-12 18:52:06 UTC
→ returning existing artifact with same hash: Artifact(uid='fkOF8rmyP06cA4lQ0000', is_latest=True, description='Example dataframe', suffix='.rds', size=227, hash='R94kXc7AXWPd2zJgkpHwpg', _hash_type='md5', visibility=1, _key_is_virtual=True, storage_id=1, created_by_id=1, created_at=2024-11-12 18:49:24 UTC)
Artifact(uid='fkOF8rmyP06cA4lQ0000', is_latest=True, description='Example dataframe', suffix='.rds', size=227, hash='R94kXc7AXWPd2zJgkpHwpg', _hash_type='md5', visibility=1, _key_is_virtual=True, storage_id=1, transform_id=1, run_id=1, created_by_id=1, created_at=2024-11-12 18:49:24 UTC)
→ finished Run('ZEauiBpc') after 0h 0m 1s at 2024-11-12 18:52:08 UTC


falexwolf commented Nov 12, 2024

Five TODOs.

(1) Generating the uid

The most tricky part of tracking transforms is generating the unique id for the transform (its uid).

This works via reticulate, too:

# file: test-laminr1.R
library(reticulate)

ln <- import("lamindb")

ln$track(path="test-laminr1.R")   # <-- create a unique id for the script, pass the script path

Executing it gives rise to these logs:

% Rscript test-laminr1.R
→ connected lamindb: falexwolf/docsbuild
Error in py_call_impl(callable, call_args$unnamed, call_args$named) : 
  lamindb.core.exceptions.MissingContextUID: ✗ to track this script, copy & paste `ln.track("C5ARZjmB4E750000")` and re-run

This tells the user that they need to copy & paste ln.track("C5ARZjmB4E750000").

Now we need to think through the easiest way to make this error message more accessible for R users. 🤔

It shouldn't even throw an error but rather open an interactive dialogue that tells the user: "Please copy and paste db$track("...")"

Can one of you achieve this, @rcannood or @lazappi?
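One possible approach, sketched below: catch the Python error raised through reticulate and rewrite the hint into laminr syntax. The `rewrite_track_hint()` helper and the wrapper are hypothetical, not laminr API, and the regex assumes the hint always appears as `ln.track("...")` in the error message.

```r
# Hypothetical sketch: translate the Python MissingContextUID hint into R syntax.
rewrite_track_hint <- function(msg) {
  # Pull the suggested uid out of `ln.track("...")` in the Python error message
  m <- regmatches(msg, regexpr('ln\\.track\\("([A-Za-z0-9]+)"\\)', msg))
  if (length(m) == 0) return(msg)
  uid <- sub('ln\\.track\\("([A-Za-z0-9]+)"\\)', "\\1", m)
  sprintf('To track this script, copy & paste db$track("%s") and re-run.', uid)
}

# Wrap the reticulate call so users see an R-flavoured message, not a traceback
track <- function(db, uid = NULL, path = NULL) {
  tryCatch(
    db$track(uid, path = path),
    error = function(e) {
      message(rewrite_track_hint(conditionMessage(e)))
      invisible(NULL)
    }
  )
}
```

In RStudio one could go further and emit a clickable snippet, but plain `message()` keeps it working on the command line too.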

(2) Passing the path of the file that's being run

As you see from the code examples, there is a path argument that's being passed to ln.track(). In Python code, this is auto-inferred from the Python environment or the notebook. It wasn't so easy to get this to work robustly in Python across editors and ways of running things.

Is there a way to do this in R? If people need to manually pass the path to their script, that's not a show stopper, but it would be worse UX.
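There are a few heuristics on the R side, though none is universal. A sketch that tries the common cases in order (the `detect_script_path()` helper is hypothetical; each branch covers one way the code might be running):

```r
# Sketch of path auto-detection in R; heuristics, not a laminr API.
detect_script_path <- function(args = commandArgs(trailingOnly = FALSE)) {
  # 1. Rscript passes the script via a --file=... argument
  file_arg <- grep("^--file=", args, value = TRUE)
  if (length(file_arg) > 0) return(sub("^--file=", "", file_arg[1]))
  # 2. knitr (.Rmd/.qmd) knows the input document while rendering
  if (requireNamespace("knitr", quietly = TRUE)) {
    input <- tryCatch(knitr::current_input(), error = function(e) NULL)
    if (!is.null(input)) return(input)
  }
  # 3. RStudio exposes the active document
  if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
    return(rstudioapi::getActiveDocumentContext()$path)
  }
  NULL  # plain interactive session: caller must pass `path` manually
}
```

This mirrors the situation in Python: robust across the common runners, with a manual `path` argument as the fallback.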

(3) Trying in RStudio and with .Rmd or .qmd

I don't have RStudio, so I've only tried this with a simple R script. I hope it behaves the same in RStudio.

(4) Wrapping the reticulate calls into the R API of the package

Evidently, we don't want users to call reticulate directly; otherwise we wouldn't even need the laminr project. So I hope there is a low-complexity way to wrap these calls into a nice API.

I think what'd be pretty important is to introduce the notion of a "default instance" during an R session. Otherwise it'll be confusing for users.

Meaning, the following should be allowed:

library(laminr)

db <- connect()  # this is the writeable default instance that tracks everything the user does (inputs, outputs, script, run)
dbcxg <- connect("laminlabs/cellxgene")  # this is a readonly instance the user wants to query

db$track()

# note: robrecht added something:
db$Artifact("path/to/example.rds", description="Example dataframe")$save()
db$finish()  # mark the script run as finished

# note from robrecht: would this also work?
dbcxg$track()
dbcxg$Artifact("path/to/example.rds", description="Example dataframe")$save()
dbcxg$finish()  # mark the script run as finished
# -> apparently the python client would throw an error in this case

But if somebody tries to call dbcxg$track(), that wouldn't make any sense.

In Python lamindb this problem doesn't occur because ln.track() sits at the root level of the API and requires that the user has configured a default instance via the CLI.
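A minimal sketch of such a wrapper, assuming only the default instance is backed by reticulate and everything else is query-only (the object layout, `connect()` behavior, and error messages here are assumptions, not the actual laminr design):

```r
# Hypothetical sketch: one writeable default instance per session.
connect <- function(slug = NULL) {
  if (is.null(slug)) {
    # The default instance wraps Python lamindb via reticulate and is the
    # only one that may track/finish, mirroring Python's CLI-configured default.
    ln <- reticulate::import("lamindb")
    list(
      track    = function(uid = NULL, path = NULL) ln$track(uid, path = path),
      finish   = function() ln$finish(),
      Artifact = function(path, ...) ln$Artifact(path, ...)
    )
  } else {
    # Non-default instances (e.g. "laminlabs/cellxgene") are query-only:
    # track()/finish() fail loudly instead of silently mis-tracking.
    list(
      track  = function(...) stop("track() is only available on the default instance"),
      finish = function(...) stop("finish() is only available on the default instance")
      # query methods (via the REST API) would go here
    )
  }
}
```

With this shape, `dbcxg <- connect("laminlabs/cellxgene")` followed by `dbcxg$track()` gives an immediate, explicit error rather than relying on the Python client to refuse.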

(5) Adapting the lamin CLI

I need to finalize this PR.

The important assumption I'll make is that I can get the transform uid from a function call track("XXXXXXXXXXXX" ...). This will be done via regex, just as it already happens for Python code.
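For illustration, the uid extraction could look like this in R (the `extract_track_uid()` helper and the assumption of a fixed 16-character uid are mine, not part of lamin-cli):

```r
# Sketch: extract the transform uid from a track("...") call via regex,
# analogous to what the CLI does for `ln.track("...")` in Python files.
extract_track_uid <- function(code) {
  m <- regexpr('track\\(\\s*"([A-Za-z0-9]{16})"', code)
  if (m == -1) return(NA_character_)
  sub('.*track\\(\\s*"([A-Za-z0-9]{16})".*', "\\1", regmatches(code, m))
}

extract_track_uid('db$track("EPnfDtJz8qbE0000")')
#> "EPnfDtJz8qbE0000"
```

The same pattern matches both `ln$track("...")` and `db$track("...")`, since it only anchors on the `track(` call itself.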

@falexwolf falexwolf changed the title ln.track() in R ✨ Adding capability for ln.track() and ln.Artifact(filepath).save() features Nov 12, 2024
@falexwolf falexwolf changed the title ✨ Adding capability for ln.track() and ln.Artifact(filepath).save() features ✨ Adding ln.track() and ln.Artifact(filepath).save() equivalents Nov 12, 2024
@falexwolf
Member Author

@Zethson mentioned that the mlflow R package is a rather shallow wrapper around the Python mlflow package and relies heavily on reticulate.

We should think through all the pros and cons of building certain features in R, hitting the REST API (and building logic server-side), vs. using reticulate.

It might be that the current path is, in fact, a sweet spot: leverage the REST API for queries, leverage reticulate for basic Python logic. But I need to think more and we should have a dedicated meeting to discuss.


lazappi commented Nov 13, 2024

(1) Generating the uid

I think this should be doable, but it might take some messing around. I'm not exactly sure how to catch the Python error, but it should be possible, and then we can output the matching R code to run. In RStudio, I think we can also make it so you can click to run the code.

(2) Passing the path of the file that's being run

We can check. I'm not sure if there is a way to get the path of the currently running script or not (possibly for Rmd/qmd but not R scripts?).

(3) Trying in RStudio and with .Rmd or .qmd

In theory, yes, but RStudio does some messing around with paths and the order in which things are run. Yesterday I was having issues connecting to a Python environment in RStudio that worked fine in the R terminal. That's fairly rare though, and if the environment connects it should work.

There are ways to help manage environments for users but I'm not sure if they fit here because they also need to be able to run things on the command line (currently at least).

(4) Wrapping the reticulate calls into the R API of the package

We should be able to wrap them so that they have a similar interface to Python. Usually that's fairly easy once you get code that works.

I think what'd be pretty important is to introduce the notion of a "default instance" during an R session. Otherwise it'll be confusing for users.

This is a problem I have already run into when trying to make it possible to create artifacts from data frames. In R we don't have issues connecting to multiple instances (if they have APIs; I'll add more about that in a second), but as far as I can tell you can't do that in a Python session. I haven't fully tested it yet, but I think this will be an issue with using {reticulate}: anything that imports Python lamindb will connect to an instance in that session and block connections to any other instances. I also had to make sure the Python auto-connect setting was turned off so that we connect to the same instance in R and Python.

Maybe we need to have the same limitation of only connecting to one instance in a session? This would mean we can't have the suggested workflow of connecting to CELLxGENE, doing something, and saving results to a local instance but I'm not sure that is possible in Python either.

API vs {reticulate}

On the point of using the API vs {reticulate}: there are pros and cons either way, but I have already run into one fairly major issue. We designed the current implementation around the API, which unfortunately means we can't connect to a local instance where that API doesn't exist. @rcannood and I can think about whether there is a workaround, but if not, the only solution I can think of is either to have an API for local instances or to rewrite the package to wrap Python instead, neither of which would be quick/easy. A temporary solution might be to have a public API instance that we can mess around with. This is probably something we need to have a meeting about, though.


falexwolf commented Nov 13, 2024

Thanks for the in-depth response, Luke! All your points make sense to me.

Would you have time for a brief call to come up with solutions and clarify some questions that I read between the lines?

Either at 10 am, 11:30 am, or 1 pm, or, if none of these work, https://calendly.com/falexwolf/45flex?


lazappi commented Nov 21, 2024

Fixed by #78, #83
