Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Breaking changes and maintaining reproducibility #276

Open
nevrome opened this issue Sep 26, 2023 · 4 comments
Open

Breaking changes and maintaining reproducibility #276

nevrome opened this issue Sep 26, 2023 · 4 comments
Assignees

Comments

@nevrome
Copy link
Member

nevrome commented Sep 26, 2023

A recent change added in trident v1.4.0.0 (#268) and again reverted in v.1.4.0.2 (#275) uncovered an interesting issue with our current package version management concept for Poseidon:

This change boils down to a new cross-file consistency check: The genotype ploidy (Genotype_Ploidy) documented in .janno files was supposed to be checked when reading a package, and if a sample marked as haploid had a heterozygote SNP, this would throw an error. So this change in trident did not reflect a change in the Poseidon schema, but only a more strict validation of an already formalized promise in the schema.

Activating this check had unforeseen consequences, because some packages in our public community archive actually failed this validation step and had to be fixed (poseidon-framework/community-archive#135). They were, strictly speaking, not valid Poseidon packages before this update. With the new, adjusted versions of said packages, we considered this issue solved.

This was premature, though, because trident v1.4.0.0 and v1.4.0.1 were not able to read the old package versions any more. For a user this is not necessarily a problem: They can either switch to the new, updated package versions, or continue working with an old trident release, which doesn't check this property. But a big problem emerged for the Poseidon web server (trident serve): It could neither read nor serve the old package versions any more, thus breaking the reproducibility promise that motivated the entire shift to a versioned web API.

We averted chaos by deactivating the ploidy check in v.1.4.0.2. But this is only a temporary solution, because we want to include this check eventually. A similar issue would probably also emerge, if we ever decided to update the Poseidon schema in a breaking way.

There are various ways to solve this issue permanently. Here are two potential solutions (formulated in a way that clearly highlights which one I prefer 😬):

  1. Loosen the requirements on Poseidon packages in trident's reading process to a point where even broken or anachronistic packages could be loaded. Issues could be pointed out as warnings. validate could continue to be strict. This way the server could easily read and serve every somewhat functional package and even outdated versions. Users may experience less issues when using trident, but may end up with potentially broken output (as a consequence of broken input). This explicitly includes packages that do not fulfill the guarantees of the Poseidon schema.
  2. Keep the reading exactly as strict as it is now (and thus keep course towards the high quality standard we're aiming for), but allow the server to serve potentially invalid packages specifically for the /zip_file and /packages endpoints. This could, for example, be implemented by relying on meaningful directory names, which encode package name and version, or by only parsing POSEIDON.yml file partially. The server could thus serve literally anything old and outdated, because it does not have to read it.

While I think 2. is the better approach, there are some obvious drawbacks:

  • While the /packages API could be pretty clear about which packages and package versions are old and outdated, /groups and /individuals would lack information on the unreadable, old packages. This mismatch might be confusing and it propagates directly to list.
  • As a direct consequence fetch should probably not offer the option to list individuals and group entities in its selection language (-f) any more.
  • timetravel is also affected by this issue. Just as serve for zip_file it could switch to a minimal package discovery workflow.
@stschiff
Copy link
Member

OK, excellent writeup and description of the problem, and nice summary of various solutions.

I can immediately say that I would be sad to see the option to specifying groups and individuals going out, for fetch. I was dreaming of a blog post like this: There was never a faster way to obtain and analyse genotype data, for example, literally as a two-liner:

trident fetch -d ~/poseidon_repo -f 'Chimp.REF,Altai_Neanderthal.DG,Yoruba,French';
xerxes fstats -d ~/poseidon_repo --stat 'F4(Chimp.REF,Altai_Neanderthal.DG,Yoruba,French)'

which is beautiful.

I also sympathise with your solution 2, though. Perhaps we can perform two reading-runs: One package-reading with the usual strictness, and then another one which is far more lenient. Any packages that were read in the second run, but not in the first, get reported as "broken according to current standards" or something like that: Users can still list them in all APIs and even local listings, and download them, but perhaps not forge without additional --force flag?

@stschiff
Copy link
Member

We could even provide a base flag called --allowBroken, which needs to be explicitly set to read packages with cross-consistency issues. The server would then of course use that flag by default, and the various APIs would return JSON-objects that include a broken field.

@stschiff
Copy link
Member

OK, this issue became new fodder now with #283.

The more I think about it, the more I think we should implement both a lenient and a strict way of loading packages. One could throw warnings, the other one would refuse to load. The main question is what should be the default, and I think you are leaning into a clear way of making the strict reading the default, which I would sympathize with. So indeed we should implement a package reading option --allowBroken, which the server would use by default.

@stschiff
Copy link
Member

Another thought: I guess we could also codify the changing parsing strictness in the schema itself (for example by changing a SHOULD to a MUST statement), and provide a new Poseidon Version for the stricter behavior. Then, we could implement the behavior in trident contingent on the Poseidon Version number.

Concretely:
If we are reading the Janno file of a Poseidon Package with version X.1, we are lenient and provide a warning. If the version is X.2, however, we refuse to load the package.

Of course, that would be a bit of a paradigm shift, I guess. I'm just throwing it out there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants