Snapshot Transformer #223

lukfor · 2024-06-13T13:34:10Z

This PR uses JSONPath selectors to remove, replace or map elements in snapshots.
Happy to hear your feedback! (especially @sateeshperi @nvnieuwk @maxulysse @adamrtalbot)

Related issues: #211 and #116

Snapshot Transformer

Taking snapshots of objects is an easy and effective way to create regression tests. By capturing the state of an object at a particular point in time, you can compare it against future states to detect any unintended changes. However, not every object is deterministic. Certain elements such as dates, log files, and headers can introduce variability, making direct comparisons unreliable.

Consider the following snapshot object:

{
    "id": "1234-5678-9101",
    "status": "SUCCESS",
    "start_time": "2024-06-12T14:31:16+0000",
    "end_time": "2024-06-12T14:31:16+0000",
    "chunks": [
        2,
        9,
        3,
        5,
        7,
        1,
        8,
        4,
        6
    ],
    "value": 1.3568956165789456,
    "files": [
        "output.csv:md5,3e4001700fcbfb82691fc113caed10f1",
        "output.log:md5,176337b1b6dafb59aeff4a574d3730d5"
    ]
}

This snapshot object has several non-deterministic values:

Dates: The start_time and end_time fields contain timestamps that will vary with each execution.
Chunks with Random Order: The chunks array can have elements in random order.
Log Files with Timestamps: The files array includes a log file that contains timestamps, so its MD5 hash changes with each execution.

To address this, nf-test provides methods to transform and reduce snapshots to make them deterministic. These methods include replacing random values, formatting numbers, and reducing large contents by generating MD5 hashes.

The following chain of transformations makes the snapshot above deterministic:

assert snapshot(function.result).
    remove('.start_time').
    replace('.end_time', "<END_TIME>").
    map('.chunks', chunks -> chunks.sort()).
    map('.value', value -> format("#0.00", value)).
    traverse((key, value) -> {
        return value.toString().endsWith(".log") ? file(value).name : value;
    }).
    match()
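To make the effect of each step concrete, here is a minimal plain-Groovy sketch that applies equivalent operations by hand to a map holding the same fields. It does not use the transformer API at all. It also assumes that format("#0.00", value) behaves like java.text.DecimalFormat with a fixed locale, which this PR does not confirm.

```groovy
import java.text.DecimalFormat
import java.text.DecimalFormatSymbols

// Plain-Groovy stand-in for the snapshot object shown above (no nf-test needed).
def snapshotData = [
    id        : "1234-5678-9101",
    status    : "SUCCESS",
    start_time: "2024-06-12T14:31:16+0000",
    end_time  : "2024-06-12T14:31:16+0000",
    chunks    : [2, 9, 3, 5, 7, 1, 8, 4, 6],
    value     : 1.3568956165789456d
]

// Equivalent of remove('.start_time'): drop the key entirely.
snapshotData.remove('start_time')

// Equivalent of replace('.end_time', "<END_TIME>"): overwrite with a fixed placeholder.
snapshotData.end_time = "<END_TIME>"

// Equivalent of map('.chunks', ...): transform the array as a whole.
// sort(false) returns a sorted copy instead of mutating the original list.
snapshotData.chunks = snapshotData.chunks.sort(false)

// Equivalent of map('.value', ...). Assumption: format() works like DecimalFormat.
// A fixed locale keeps the decimal separator stable across machines.
def twoDecimals = new DecimalFormat("#0.00", new DecimalFormatSymbols(Locale.US))
snapshotData.value = twoDecimals.format(snapshotData.value)

assert !snapshotData.containsKey('start_time')
assert snapshotData.chunks == [1, 2, 3, 4, 5, 6, 7, 8, 9]
assert snapshotData.value == "1.36"
```

After these steps, every remaining field in the map holds the same value on each run, which is what lets match() succeed.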

Methods and Functions

`replace`

The replace function allows you to replace a value at a specific JSON path with a fixed value or pattern.

Parameters:

jsonPath: The path to the element in the JSON structure to replace.
replacement: The value or pattern to replace the element with.

Example:

replace('.end_time', "<END_TIME>")

This replaces the end_time field with a fixed string "<END_TIME>".

`map`

The map function transforms elements of an array or an object using a provided function. You can apply this to each element individually or to the array as a whole.

Parameters:

jsonPath: The path to the elements in the JSON structure to transform.
mapperFunction: A function that defines how each element should be transformed.

Example:

map('.chunks', chunks -> chunks.sort())

This sorts the elements of the chunks array to ensure consistent ordering.

Example:

map('.chunks[*]', value -> 27)

This replaces each element in the chunks array with the value 27.

`remove`

The remove function removes elements from the snapshot based on a JSON path.

Parameters:

jsonPath: The path to the element(s) in the JSON structure to remove.

Example:

remove('.start_time')

This removes the start_time field from the snapshot.

`traverse`

The traverse function iterates over all key-value pairs in the snapshot and applies a transformation function.

Parameters:

traversalFunction: A function that defines how each key-value pair should be transformed. It takes two arguments: key and value.

Example:

traverse((key, value) -> {
    return value.toString().endsWith(".log") ? "<LOG>" : value;
})

This replaces values that end with ".log" with the string "<LOG>".
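As a rough mental model of traverse, the plain-Groovy sketch below walks a nested structure and applies a closure to every leaf value. The helper traverseSketch is hypothetical and is not the nf-test implementation. How the real function handles list elements and nested keys is an assumption here.

```groovy
// Hypothetical helper for illustration only; not part of nf-test.
def traverseSketch(node, Closure fn) {
    if (node instanceof Map) {
        // Recurse into nested containers and apply fn to leaf values.
        return node.collectEntries { key, value ->
            def isContainer = value instanceof Map || value instanceof List
            [(key): (isContainer ? traverseSketch(value, fn) : fn(key, value))]
        }
    }
    if (node instanceof List) {
        // Assumption: list elements have no key of their own, so null is passed.
        return node.collect { value ->
            def isContainer = value instanceof Map || value instanceof List
            isContainer ? traverseSketch(value, fn) : fn(null, value)
        }
    }
    return node
}

def data = [
    status: "SUCCESS",
    files : ["output.csv", "output.log"],
    meta  : [log: "run.log", count: 3]
]

// Same closure body as in the traverse example above.
def result = traverseSketch(data) { key, value ->
    value.toString().endsWith(".log") ? "<LOG>" : value
}

assert result.files == ["output.csv", "<LOG>"]
assert result.meta == [log: "<LOG>", count: 3]
```

The closure receives both key and value, so a transformation can also be restricted to particular keys instead of matching on the value alone.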

`view`

The view function is used to print the final snapshot. It provides a way to see what the snapshot looks like after all transformations have been applied.

You can also view intermediate results by calling view() between transformations:

assert snapshot(function.result). 
    view(). 
    remove('.start_time').  
    view().
    replace('.end_time', "<END_TIME>").  
    view().
    match()

Explanation of JSONPath Selectors

.start_time: Selects the start_time field at the root level of the JSON object.
.end_time: Selects the end_time field at the root level of the JSON object.
.chunks: Selects the chunks array at the root level of the JSON object.
.chunks[*]: Selects each element within the chunks array.
contents[0].value: Selects the value field of the first element in the contents array.
contents[0].files[?(@ == "output.log")]: Selects the output.log file in the files array if it matches exactly.
contents[0].files[?(@ =~ /.*\\.log$/)]: Selects all log files in the files array using a regular expression.
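The two filter selectors differ only in how they test each array element. The plain-Groovy sketch below mimics that difference with findAll on an illustrative list of file names. It assumes the filters follow the usual JSONPath semantics (exact equality for ==, regular-expression match for =~) and does not call any JSONPath library.

```groovy
// Illustrative file names; not taken from a real snapshot.
def files = ["output.csv", "output.log", "run.log"]

// Counterpart of [?(@ == "output.log")]: keep elements equal to the literal.
def exactMatch = files.findAll { it == "output.log" }

// Counterpart of [?(@ =~ /.*\\.log$/)]: keep elements matching the regular expression.
// ==~ requires the whole string to match the pattern.
def logFiles = files.findAll { it ==~ '.*\\.log$' }

assert exactMatch == ["output.log"]
assert logFiles == ["output.log", "run.log"]
```

The exact-match form is the safer choice when only one known file is unstable; the regular-expression form covers any number of files with the same extension.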

maxulysse · 2024-06-13T14:36:33Z

That looks promising, I'll have a look

nvnieuwk · 2024-06-25T08:46:36Z

This looks very good! I definitely see some useful things that could simplify how we create snapshots here 🥳

drpatelh · 2024-07-01T09:30:33Z

Hi @lukfor ! This looks great!

From what we have seen, the biggest problem typically arises from a few "variable" files that prevent us from snapshotting the output, meaning we have to fall back to only checking workflow.success.

Using your example below,

"files": [
        "output.csv:md5,3e4001700fcbfb82691fc113caed10f1",
        "output.log:md5,176337b1b6dafb59aeff4a574d3730d5"
    ]

What would the relevant code snippet be to:

1. Check file existence for output.log and ignore its md5sum, but still check the md5sum for output.csv
2. Remove/ignore output.log from the snapshot validation
3. If we had more entries in files, how could we combine 1. and 2. if required?

cc @stevekm since you created #211

sateeshperi · 2024-07-16T14:46:46Z

@GallVp Usman, could you give this feature a look and provide your feedback? In general, this might be a good feature, but I doubt it will be useful right away in nf-core, as it expects the outputs to be separated into their respective channels (which is not the case for many nf-core modules). Anyway, please give it a look and let's discuss when we meet next. Thanks

GallVp · 2024-07-16T23:07:24Z

Hi @sateeshperi

This can be very useful in some cases. I was in a somewhat similar situation and resorted to the following logic for the orthofinder module:

import groovy.io.FileType

.
.
.

assert process.success

def all_files = []

file(process.out.orthofinder[0][1]).eachFileRecurse (FileType.FILES) { file ->
    all_files << file
}

def all_file_names = all_files.collect { it.name }.sort(false)

def stable_file_names = [
    'Statistics_PerSpecies.tsv',
    'SpeciesTree_Gene_Duplications_0.5_Support.txt',
    'SpeciesTree_rooted.txt'
]

def stable_files = all_files.findAll { it.name in stable_file_names }

assert snapshot(
    all_file_names,
    stable_files,
    process.out.versions[0]
).match()

With the proposed transformers, we may be able to make the snapshotting more groovy!

lukfor · 2024-07-17T06:43:09Z

Thanks for your examples! They really help me get a better understanding of what is needed.

GallVp · 2024-07-31T01:47:11Z

Thank you @lukfor

This pattern is very common (example from nf-core/modules),

{ assert snapshot(process.out.versions,
    process.out.bam.collect { bam(it[1]).getReadsMD5() },
    process.out.fastq,
    process.out.log
    ).match()
}

where all the outputs can be md5'ed except the log file or a bam file. We currently have to list all the outputs and apply a function to a specific output. Would it be possible to select a specific output by name and apply a function to it? For example:

{ assert snapshot(
    process.out.mutate('bam') { it -> bam(it[1]).getReadsMD5() }
    ).match()
}

Here, the proposed mutate would take the process/workflow output object, select an output by name, apply a closure to it, and return the mutated output object.

sateeshperi · 2024-08-01T08:39:46Z

@GallVp check out the nft-bam plugin by @nvnieuwk, it makes getting stable snapshots from bam/cram files much easier

GallVp · 2024-08-01T21:49:40Z

> @GallVp check out the nft-bam plugin by @nvnieuwk, it makes getting stable snapshots from bam/cram files much easier

Thank you @sateeshperi

The example code I pasted above is already using nft-bam: bam(it[1]).getReadsMD5()

lukfor added 3 commits June 12, 2024 16:48

Add multiple snapshot transformers (9e01e08)
Fix issues with paths in snapshots (77cad42)
Make format numbers deterministic (635305d)
