Snapshot Transformer #223

Open
wants to merge 3 commits into
base: main

@lukfor
Collaborator

lukfor commented Jun 13, 2024

This PR uses JSONPath selectors to remove, replace or map elements in snapshots.
Happy to hear your feedback! (especially @sateeshperi @nvnieuwk @maxulysse @adamrtalbot)

Related issues: #211 and #116

Snapshot Transformer

Taking snapshots of objects is an easy and effective way to create regression tests. By capturing the state of an object at a particular point in time, you can compare it against future states to detect any unintended changes. However, not every object is deterministic. Certain elements such as dates, log files, and headers can introduce variability, making direct comparisons unreliable.

Consider the following snapshot object:

{
    "id": "1234-5678-9101",
    "status": "SUCCESS",
    "start_time": "2024-06-12T14:31:16+0000",
    "end_time": "2024-06-12T14:31:16+0000",
    "chunks": [
        2,
        9,
        3,
        5,
        7,
        1,
        8,
        4,
        6
    ],
    "value": 1.3568956165789456,
    "files": [
        "output.csv:md5,3e4001700fcbfb82691fc113caed10f1",
        "output.log:md5,176337b1b6dafb59aeff4a574d3730d5"
    ]
}

This snapshot object has several non-deterministic values:

  • Dates: The start_time and end_time fields contain timestamps that vary with each execution.
  • Chunks with random order: The elements of the chunks array can appear in a different order on each run.
  • Log files with timestamps: The files array includes a log file that contains timestamps, so its MD5 hash changes between runs.

To address this, nf-test provides methods that transform and reduce snapshots so they become deterministic: replacing random values, formatting numbers, and reducing large contents to MD5 hashes.

The following example applies such transformations to the snapshot above:

assert snapshot(function.result).  
    remove('.start_time').  
    replace('.end_time', "<END_TIME>").  
    map('.chunks', chunks -> chunks.sort()).
    map('.value', value -> format("#0.00", value)).  
    traverse((key, value) -> {
        return value.toString().endsWith(".log") ? file(value).name : value;
    }).  
    match()
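
To make the intended effect of the first four steps concrete, here is a minimal sketch that mimics them on a plain Groovy map. It uses only standard Groovy and java.text.DecimalFormat, not the snapshot API from this PR, and it assumes that format("#0.00", value) behaves like a DecimalFormat with the same pattern.

```groovy
import java.text.DecimalFormat
import java.text.DecimalFormatSymbols

// The sample object from above as a plain Groovy map (not an nf-test snapshot).
def result = [
    id        : "1234-5678-9101",
    status    : "SUCCESS",
    start_time: "2024-06-12T14:31:16+0000",
    end_time  : "2024-06-12T14:31:16+0000",
    chunks    : [2, 9, 3, 5, 7, 1, 8, 4, 6],
    value     : 1.3568956165789456d
]

// Mimics remove('.start_time'): drop the key entirely.
result.remove("start_time")

// Mimics replace('.end_time', "<END_TIME>"): overwrite with a fixed placeholder.
result.end_time = "<END_TIME>"

// Mimics map('.chunks', chunks -> chunks.sort()): sort(false) returns a sorted copy.
result.chunks = result.chunks.sort(false)

// Mimics map('.value', ...): assumption that the pattern is a DecimalFormat pattern.
// Locale.ROOT keeps "." as the decimal separator on every machine.
def twoDecimals = new DecimalFormat("#0.00", DecimalFormatSymbols.getInstance(Locale.ROOT))
result.value = twoDecimals.format(result.value)

println result
```

After these steps the map no longer contains start_time, end_time holds the placeholder, chunks is in ascending order, and value is the string "1.36", so repeated runs produce the same content.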

Methods and Functions

replace

The replace function allows you to replace a value at a specific JSON path with a fixed value or pattern.

Parameters:

  • jsonPath: The path to the element in the JSON structure to replace.
  • replacement: The value or pattern to replace the element with.

Example:

replace('.end_time', "<END_TIME>")

This replaces the end_time field with a fixed string "<END_TIME>".

map

The map function transforms the elements selected by a JSON path using a provided function. The selector decides what the function receives: '.chunks' passes the array as a whole, while '.chunks[*]' passes each element individually.

Parameters:

  • jsonPath: The path to the elements in the JSON structure to transform.
  • mapperFunction: A function that defines how each element should be transformed.

Example:

map('.chunks', chunks -> chunks.sort())

This sorts the elements of the chunks array to ensure consistent ordering.

Example:

map('.chunks[*]', value -> 27)

This replaces each element in the chunks array with the value 27.

remove

The remove function removes elements from the snapshot based on a JSON path.

Parameters:

  • jsonPath: The path to the element(s) in the JSON structure to remove.

Example:

remove('.start_time')

This removes the start_time field from the snapshot.

traverse

The traverse function iterates over all key-value pairs in the snapshot and applies a transformation function.

Parameters:

  • traversalFunction: A function that defines how each key-value pair should be transformed. It takes two arguments: key and value.

Example:

traverse((key, value) -> {
    return value.toString().endsWith(".log") ? "<LOG>" : value;
})

This replaces values that end with ".log" with the string "<LOG>".
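
The idea behind such a traversal can be sketched in plain Groovy. The helper traverseValues below is hypothetical and not part of nf-test; how the PR itself handles nested maps and list elements is not stated here, so the recursion rules (descend into maps and lists, call the function on leaf values, reuse the parent key for list items) are assumptions.

```groovy
// Hypothetical helper, not part of nf-test: rebuilds a nested structure and
// passes every leaf value through fn. List items are reported under the key
// of the list that contains them.
def traverseValues(Object node, Closure fn, String key = null) {
    if (node instanceof Map) {
        return node.collectEntries { k, v -> [(k): traverseValues(v, fn, k.toString())] }
    }
    if (node instanceof List) {
        return node.collect { item -> traverseValues(item, fn, key) }
    }
    return fn(key, node)
}

// Illustrative input with plain file names (assumed, not taken from a real run).
def snapshotLike = [
    status: "SUCCESS",
    files : ["results/output.csv", "results/output.log"]
]

// Same rule as in the example above: values ending in ".log" become "<LOG>".
def cleaned = traverseValues(snapshotLike) { key, value ->
    value.toString().endsWith(".log") ? "<LOG>" : value
}

println cleaned
```

The sketch returns a new structure and leaves the input untouched, which makes it easy to compare the object before and after the transformation.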

view

The view function is used to print the final snapshot. It provides a way to see what the snapshot looks like after all transformations have been applied.

You can also view intermediate results:

assert snapshot(function.result). 
    view(). 
    remove('.start_time').  
    view().
    replace('.end_time', "<END_TIME>").  
    view().
    match()

Explanation of JSONPath Selectors

  • .start_time: Selects the start_time field at the root level of the JSON object.
  • .end_time: Selects the end_time field at the root level of the JSON object.
  • .chunks: Selects the chunks array at the root level of the JSON object.
  • .chunks[*]: Selects each element within the chunks array.
  • contents[0].value: Selects the value field of the first element in the contents array.
  • contents[0].files[?(@ == "output.log")]: Selects the entries of the files array that are exactly equal to "output.log".
  • contents[0].files[?(@ =~ /.*\\.log$/)]: Selects all log files in the files array using a regular expression.
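
The two filter selectors correspond to an equality test and a regular-expression test on each array entry. The sketch below reproduces both tests with plain Groovy findAll calls; it does not run a JSONPath engine. Whether the selectors in this PR see plain file names or the serialized "name:md5,hash" entries is an assumption either way, so both forms are shown.

```groovy
// Plain file names and the serialized entries from the sample snapshot.
def names   = ["output.csv", "output.log"]
def entries = [
    "output.csv:md5,3e4001700fcbfb82691fc113caed10f1",
    "output.log:md5,176337b1b6dafb59aeff4a574d3730d5"
]

// Counterpart of [?(@ == "output.log")]: exact string equality.
def exact = names.findAll { it == "output.log" }

// Counterpart of the regex filter: ==~ requires the whole string to match,
// so no trailing anchor is needed in the pattern.
def logs = names.findAll { it ==~ /.*\.log/ }

// The same pattern finds nothing in the serialized entries, because each
// entry continues with ":md5,<hash>" after the file name.
def logsInEntries = entries.findAll { it ==~ /.*\.log/ }

// A pattern that accounts for the suffix does match the log entry.
def logsWithHash = entries.findAll { it ==~ /.*\.log:md5,.*/ }

println([exact, logs, logsInEntries, logsWithHash])
```

If the selectors operate on the serialized entries, a pattern anchored at the end of the file name would need to allow for the ":md5," suffix, as in the last line.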

@maxulysse

That looks promising, I'll have a look

@nvnieuwk
Contributor

This looks very good! I definitely see some useful things that could simplify how we create snapshots here 🥳

@drpatelh

drpatelh commented Jul 1, 2024

Hi @lukfor ! This looks great!

From what we have seen, the biggest problem typically arises from a few "variable" files that prevent us from snapshotting, meaning we have to fall back to only checking workflow.success.

Using your example below,

"files": [
        "output.csv:md5,3e4001700fcbfb82691fc113caed10f1",
        "output.log:md5,176337b1b6dafb59aeff4a574d3730d5"
    ]

What would the relevant code snippet be to:

  1. Check file existence for output.log and ignore the md5sum match but still check the md5sum for output.csv

  2. Remove/ignore output.log from the snapshot validation

  3. If we had more entries in files, how could we combine 1. and 2. if required?

cc @stevekm since you created #211

@sateeshperi
Contributor

@GallVp Usman, could you give this feature a look and provide your feedback? In general, this might be a good feature, but I doubt it will be useful right away in nf-core, as it expects the outputs to be separated into their respective channels (which is not the case for many nf-core modules). Anyway, please give it a look and let's discuss when we meet next. Thanks

@GallVp

GallVp commented Jul 16, 2024

Hi @sateeshperi

This can be very useful in some cases. I was in a somewhat similar situation and resorted to the following logic for the orthofinder module:

import groovy.io.FileType

.
.
.

assert process.success

def all_files = []

file(process.out.orthofinder[0][1]).eachFileRecurse (FileType.FILES) { file ->
    all_files << file
}

def all_file_names = all_files.collect { it.name }.sort(false)

def stable_file_names = [
    'Statistics_PerSpecies.tsv',
    'SpeciesTree_Gene_Duplications_0.5_Support.txt',
    'SpeciesTree_rooted.txt'
]

def stable_files = all_files.findAll { it.name in stable_file_names }

assert snapshot(
    all_file_names,
    stable_files,
    process.out.versions[0]
).match()

With the proposed transformers, we may be able to make the snapshotting more groovy!

@lukfor
Collaborator Author

lukfor commented Jul 17, 2024

Thanks for your examples! They really help me get a better understanding of what is needed.

@GallVp

GallVp commented Jul 31, 2024

Thank you @lukfor

This pattern is very common (example from nf-core/modules),

{ assert snapshot(process.out.versions,
    process.out.bam.collect { bam(it[1]).getReadsMD5() },
    process.out.fastq,
    process.out.log
    ).match()
}

where all the outputs can be md5'ed except the log file or a bam file. We currently have to list all the outputs and apply a function to a specific output. Would it be possible to select a specific output by name and apply a function to it? For example,

{ assert snapshot(
    process.out.mutate('bam') { it -> bam(it[1]).getReadsMD5() }
    ).match()
}

mutate takes the process/workflow output object, selects by name, applies a closure and returns the mutated output object.

@sateeshperi
Contributor

@GallVp check out the nft-bam plugin by @nvnieuwk; it makes getting stable snapshots from bam/cram files much easier

@GallVp

GallVp commented Aug 1, 2024

@GallVp check out the nft-bam plugin by @nvnieuwk; it makes getting stable snapshots from bam/cram files much easier

Thank you @sateeshperi

The example code I pasted above is already using nft-bam: bam(it[1]).getReadsMD5()
