Add tracking of runs #83

Merged
merged 7 commits into from
Nov 19, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -7,6 +7,7 @@
- Add a `from_df()` method to the `Registry` class to create new artifacts from data frames (PR #78)
- Create `TemporaryRecord` classes for new artifacts before they have been saved to the database (PR #78)
- Add a `delete()` method to the `Record` class (PR #78)
- Add `track()` and `finish()` methods to the `Instance` class (PR #83)

## MAJOR CHANGES

65 changes: 64 additions & 1 deletion R/Instance.R
@@ -191,10 +191,73 @@ Instance <- R6::R6Class( # nolint object_name_linter
},
#' @description Get the Python lamindb module
#'
#' @param check Logical, whether to perform checks
#' @param what What the python module is being requested for, used in check
#' messages
#'
#' @return Python lamindb module.
get_py_lamin = function() {
get_py_lamin = function(check = FALSE, what = "This functionality") {
if (check && isFALSE(self$is_default)) {
cli::cli_abort(c(
"{what} can only be performed by the default instance",
"i" = "Use {.code connect(slug = NULL)} to connect to the default instance"
))
}

if (check && is.null(self$get_py_lamin())) {
cli::cli_abort(c(
"{what} requires the Python lamindb package",
"i" = "Check the output of {.code connect()} for warnings"
))
}

private$.py_lamin
},
#' @description Start a run with tracked data lineage
#'
#' @details
#' Calling `track()` with `transform = NULL` will return a UID. Providing
#' that UID with the same path will start a run.
#'
#' @param path Path to the R script or document to track
#' @param transform UID specifying the data transformation
track = function(path, transform = NULL) {
py_lamin <- self$get_py_lamin(check = TRUE, what = "Tracking")

if (is.null(transform)) {
transform <- tryCatch(
py_lamin$track(path = path),
error = function(err) {
py_err <- reticulate::py_last_error()
if (py_err$type != "MissingContextUID") {
cli::cli_abort(c(
"Python error {.val {py_err$type}}",
"i" = "Run {.run reticulate::py_last_error()} for details"
))
}

uid <- gsub(".*\\(\"(.*?)\"\\).*", "\\1", py_err$value)
cli::cli_inform(paste(
"Got UID {.val {uid}} for path {.file {path}}.",
"Run this function with {.code transform = \"{uid}\"} to track this path."
))
}
)
} else {
if (is.character(transform) && nchar(transform) != 16) {
cli::cli_abort(
"The transform UID must be exactly 16 characters, got {nchar(transform)}"
)
}

py_lamin$track(transform = transform, path = path)
}
},
#' @description Finish a tracked run
finish = function() {
py_lamin <- self$get_py_lamin(check = TRUE, what = "Tracking")
py_lamin$finish()
},
#' @description
#' Print an `Instance`
#'
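The UID handling in `track()` can be tried in isolation. The sketch below reuses the regular expression and the 16-character check from the diff above. The error message text and the helper names `extract_uid()` and `is_valid_transform_uid()` are assumptions made for illustration; the real `MissingContextUID` message may be worded differently.

```r
# Hypothetical error text; the real MissingContextUID message may differ.
msg <- "Transform is not tracked yet, call track(\"abcdefgh12345678\") to start"

# Same pattern as in track(): keep only the text between (" and ").
extract_uid <- function(message) {
  gsub(".*\\(\"(.*?)\"\\).*", "\\1", message)
}

# Mirrors the length check applied when a transform UID is supplied.
# The length(uid) == 1 condition is an extra guard added in this sketch.
is_valid_transform_uid <- function(uid) {
  is.character(uid) && length(uid) == 1 && nchar(uid) == 16
}

uid <- extract_uid(msg)
```

Because the pattern keys on a quoted string inside parentheses, it depends on the wording of the Python error, so a change to that message upstream would likely require updating the pattern.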
18 changes: 3 additions & 15 deletions R/Registry.R
@@ -154,27 +154,15 @@ Registry <- R6::R6Class( # nolint object_name_linter
#' @return A `TemporaryRecord` object containing the new record. This is not
#' saved to the database until `temp_record$save()` is called.
from_df = function(dataframe, key = NULL, description = NULL, run = NULL) {
if (isFALSE(private$.instance$is_default)) {
cli::cli_abort(c(
"Only the default instance can create records",
"i" = "Use {.code connect(slug = NULL)} to connect to the default instance"
))
}

if (is.null(private$.instance$get_py_lamin())) {
cli::cli_abort(c(
"Creating records requires the Python lamindb package",
"i" = "Check the output of {.code connect()} for warnings"
))
}

if (private$.registry_name != "artifact") {
cli::cli_abort(
"Creating records from data frames is only supported for the Artifact registry"
)
}

py_lamin <- private$.instance$get_py_lamin()
py_lamin <- private$.instance$get_py_lamin(
check = TRUE, what = "Creating records"
)

py_record <- py_lamin$Artifact$from_df(
dataframe,
1 change: 1 addition & 0 deletions README.md
@@ -32,6 +32,7 @@ LaminDB is accompanied by LaminHub which is a data collaboration hub built on La
- Planned: `.fcs`, `.h5mu`, `.zarr`.
- Create records from data frames.
- Delete records.
- Track code in R scripts and notebooks.

See the development roadmap for more details (`vignette("development", package = "laminr")`).
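As a rough illustration of the tracking workflow added in this PR, the sketch below wraps a unit of work between `track()` and `finish()`. The helper `run_tracked()`, its `work` argument, and `mock_instance` are hypothetical names that are not part of the package; the mock only stands in for an `Instance` so the example runs without a LaminDB connection.

```r
# Hypothetical helper: wraps a unit of work between track() and finish().
run_tracked <- function(instance, path, transform, work) {
  instance$track(path, transform = transform)
  result <- work()
  instance$finish()
  result
}

# Stand-in object so the sketch runs without a LaminDB connection.
calls <- character()
mock_instance <- list(
  track = function(path, transform = NULL) {
    calls <<- c(calls, paste("track", path, transform))
  },
  finish = function() {
    calls <<- c(calls, "finish")
  }
)

total <- run_tracked(
  mock_instance,
  path = "analysis.R",
  transform = "abcdefgh12345678",
  work = function() sum(1:10)
)
```

With a real default instance, the object returned by `connect()` would presumably take the place of `mock_instance`, and the transform UID would be the one reported by an initial `track()` call made without `transform`.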

48 changes: 47 additions & 1 deletion man/Instance.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

80 changes: 31 additions & 49 deletions vignettes/architecture.qmd
@@ -127,7 +127,9 @@ classDiagram
Instance --> RelatedRecords
InstanceAPI --> RelatedRecords

%% Use #emsp; to create indents in the rendered diagram when necessary
%% Methods must be on one line to be shown in the right diagram section
%% Use \n for newlines and #emsp; to create indents in the rendered
%% diagram when necessary

class laminr{
+connect(String slug): RichInstance
@@ -150,13 +152,16 @@ classDiagram
+api_url: String
}
class Instance{
+initialize(
#emsp;InstanceSettings Instance_settings, API api,
#emsp;Map<String, any> schema
): Instance
+initialize(\n#emsp;InstanceSettings Instance_settings, API api, \n#emsp;Map<String, any> schema\n): Instance
+get_modules(): Module[]
+get_module(String module_name): Module
+get_module_names(): String[]
+get_api(): InstanceAPI
+get_settings(): InstanceSettings
+get_py_lamin(Boolean check, String what): PythonModule
+track(String path, String transform): NULL
+finish(): NULL
+is_default: Boolean
}
class InstanceAPI{
+initialize(InstanceSettings Instance_settings)
@@ -166,38 +171,28 @@ classDiagram
+delete_record(...): NULL
}
class Module{
+initialize(
#emsp;Instance Instance, API api, String module_name,
#emsp;Map<String, any> module_schema
): Module
+initialize(\n#emsp;Instance Instance, API api, String module_name,\n#emsp;Map<String, any> module_schema\n): Module
+name: String
+get_registries(): Registry[]
+get_registry(String registry_name): Registry
+get_registry_names(): String[]
}
class Registry{
+initialize(
#emsp;Instance Instance, Module module, API api,
#emsp;String registry_name, Map<String, Any> registry_schema
): Registry
+initialize(\n#emsp;Instance Instance, Module module, API api,\n#emsp;String registry_name, Map<String, Any> registry_schema\n): Registry
+name: String
+class_name: String
+is_link_table: Bool
+get_fields(): Field[]
+get_field(String field_name): Field
+get_field_names(): String[]
+get(String id_or_uid, Bool include_foreign_keys, List~String~ select, Bool verbose): RichRecord
+get(\n#emsp;String id_or_uid, Bool include_foreign_keys,\n#emsp;List~String~ select, Bool verbose\n): RichRecord
+get_record_class(): RichRecordClass
+get_temporary_record_class(): TemporaryRecordClass
+df(Integer limit, Bool verbose): DataFrame
+from_df(DataFrame dataframe, String key, String description, String run)): TemporaryRecord
+from_df(\n#emsp;DataFrame dataframe, String key,\n#emsp;String description, String run\n): TemporaryRecord
}
class Field{
+initialize(
#emsp;String type, String through, String field_name, String registry_name,
#emsp;String column_name, String module_name, Bool is_link_table, String relation_type,
#emsp;String related_field_name, String related_registry_name, String related_module_name
): Field
+initialize(\n#emsp;String type, String through, String field_name,\n#emsp;String registry_name, String column_name, String module_name,\n#emsp;Bool is_link_table, String relation_type, String related_field_name,\n#emsp;String related_registry_name, String related_module_name\n): Field
+type: String
+through: Map
+field_name: String
@@ -211,15 +206,12 @@ classDiagram
+related_module_name: String
}
class Record{
+initialize(Instance Instance, Registry registry, API api, Map<String, Any> data): Record
+initialize(\n#emsp;Instance Instance, Registry registry,\n#emsp;API api, Map<String, Any> data\n): Record
+get_value(String field_name): Any
+delete(): NULL
}
class RelatedRecords{
+initialize(
#emsp;Instance instance, Registry registry, Field field,
#emsp;String related_to, API api
): RelatedRecords
+initialize(\n#emsp;Instance instance, Registry registry, Field field,\n#emsp;String related_to, API api\n): RelatedRecords
+df(): DataFrame
+field: Field
}
@@ -317,13 +309,16 @@ classDiagram
+api_url: String
}
class Instance{
+initialize(
#emsp;InstanceSettings Instance_settings, API api,
#emsp;Map<String, any> schema
): Instance
+initialize(\n#emsp;InstanceSettings Instance_settings, API api, \n#emsp;Map<String, any> schema\n): Instance
+get_modules(): Module[]
+get_module(String module_name): Module
+get_module_names(): String[]
+get_api(): InstanceAPI
+get_settings(): InstanceSettings
+get_py_lamin(Boolean check, String what): PythonModule
+track(String path, String transform): NULL
+finish(): NULL
+is_default: Boolean
}
class InstanceAPI{
+initialize(InstanceSettings Instance_settings)
Expand All @@ -333,38 +328,28 @@ classDiagram
+delete_record(...): NULL
}
class Module{
+initialize(
#emsp;Instance Instance, API api, String module_name,
#emsp;Map<String, any> module_schema
): Module
+initialize(\n#emsp;Instance Instance, API api, String module_name,\n#emsp;Map<String, any> module_schema\n): Module
+name: String
+get_registries(): Registry[]
+get_registry(String registry_name): Registry
+get_registry_names(): String[]
}
class Registry{
+initialize(
#emsp;Instance Instance, Module module, API api,
#emsp;String registry_name, Map<String, Any> registry_schema
): Registry
+initialize(\n#emsp;Instance Instance, Module module, API api,\n#emsp;String registry_name, Map<String, Any> registry_schema\n): Registry
+name: String
+class_name: String
+is_link_table: Bool
+get_fields(): Field[]
+get_field(String field_name): Field
+get_field_names(): String[]
+get(String id_or_uid, Bool include_foreign_keys, List~String~ select, Bool verbose): RichRecord
+get(\n#emsp;String id_or_uid, Bool include_foreign_keys,\n#emsp;List~String~ select, Bool verbose\n): RichRecord
+get_record_class(): RichRecordClass
+get_temporary_record_class(): TemporaryRecordClass
+df(Integer limit, Bool verbose): DataFrame
+from_df(DataFrame dataframe, String key, String description, String run)): TemporaryRecord
+from_df(\n#emsp;DataFrame dataframe, String key,\n#emsp;String description, String run\n): TemporaryRecord
}
class Field{
+initialize(
#emsp;String type, String through, String field_name, String registry_name,
#emsp;String column_name, String module_name, Bool is_link_table, String relation_type,
#emsp;String related_field_name, String related_registry_name, String related_module_name
): Field
+initialize(\n#emsp;String type, String through, String field_name,\n#emsp;String registry_name, String column_name, String module_name,\n#emsp;Bool is_link_table, String relation_type, String related_field_name,\n#emsp;String related_registry_name, String related_module_name\n): Field
+type: String
+through: Map
+field_name: String
@@ -378,15 +363,12 @@ classDiagram
+related_module_name: String
}
class Record{
+initialize(Instance Instance, Registry registry, API api, Map<String, Any> data): Record
+initialize(\n#emsp;Instance Instance, Registry registry,\n#emsp;API api, Map<String, Any> data\n): Record
+get_value(String field_name): Any
+delete(): NULL
}
class RelatedRecords{
+initialize(
#emsp;Instance instance, Registry registry, Field field,
#emsp;String related_to, API api
): RelatedRecords
+initialize(\n#emsp;Instance instance, Registry registry, Field field,\n#emsp;String related_to, API api\n): RelatedRecords
+df(): DataFrame
+field: Field
}
8 changes: 5 additions & 3 deletions vignettes/development.qmd
@@ -72,10 +72,11 @@ This document outlines the features of the **{laminr}** package and the roadmap

### Track notebooks & scripts

* [ ] **Track code execution**: Automatically track the execution of R scripts and notebooks.
* [x] **Track code execution**: Automatically track the execution of R scripts and notebooks.
* [ ] **Capture run context**: Record information about the execution environment (e.g., package versions, parameters).
* [ ] **Link code to artifacts**: Associate code execution with generated artifacts.
* [x] **Link code to artifacts**: Associate code execution with generated artifacts.
* [ ] **Visualize data lineage**: Create visualizations of data lineage and dependencies.
* [x] **Finalize tracking**: End and save a run.

### Curate datasets

@@ -126,10 +127,11 @@ A first version of the package that allows users to:
* Expand query functionality with comparators, relationships, and pagination.
* Implement basic data and metadata management features (create, save, load and delete artifacts).
* Expand support for different data formats and storage backends.
* Implement code tracking.

### Version 0.3.0

* Implement code tracking and data lineage visualization.
* Implement data lineage visualization.
* Introduce data curation features (validation, standardization, annotation).
* Enhance support for bionty registries and ontology interactions.

Expand Down