feature: minimal full translation run #101

Open
wants to merge 15 commits into
base: main


@gordonkoehn gordonkoehn commented Feb 20, 2025

Integrating the complete translation and insertion handling into the workflow based on diamond.

This PR still targets small data, i.e. thousands of reads. With larger inputs, the current architecture will run out of memory.

For the first time, the current state generates nucleotide and amino acid sequences with proper insertion handling for both.

@gordonkoehn gordonkoehn linked an issue Feb 20, 2025 that may be closed by this pull request
@gordonkoehn gordonkoehn self-assigned this Feb 20, 2025
@gordonkoehn gordonkoehn added the enhancement New feature or request label Feb 20, 2025
@gordonkoehn
Collaborator Author

gordonkoehn commented Feb 20, 2025

Submission and processing are running fine. Some adjustments were needed in the s3/silo modules to run on bare metal, i.e. without Docker.

Next, the read objects will get the metadata they deserve.

  • Add metadata to reads
  • Give option for non-indented JSON
  • Add instructions to the README
  • Successfully start SILO on it
  • Try to run with Docker
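
Assuming the JSON option in the list above refers to compact, non-indented output (NDJSON needs one record per line), a minimal sketch of such a switch could look like this. The function name `serialize_read` and the `indent` flag are hypothetical and not taken from the PR:

```python
import json


def serialize_read(read: dict, indent: bool = False) -> str:
    """Serialize one read record to a JSON string.

    With indent=False (the default) the result is a single compact line,
    suitable as one NDJSON record. With indent=True the result is
    pretty-printed over several lines for human inspection.
    """
    if indent:
        return json.dumps(read, indent=2)
    # Compact separators drop the spaces after ',' and ':'.
    return json.dumps(read, separators=(",", ":"))


# Hypothetical record; the key names are for illustration only.
example = {"read_id": "r1", "metadata": {"nuc_reference": "ref"}}
compact = serialize_read(example)
pretty = serialize_read(example, indent=True)
```

Defaulting to the compact form keeps each record on one line, so a file of such records can be read line by line.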

Modify the README install section and the WIP run instructions in the README to cover running without Docker.

@gordonkoehn
Collaborator Author

Just ran SILO - note that the schema is wrong:

It has to match the database config, obviously. Add a check for that:

siloPreprocessing-1 | [2025-02-20 16:17:23.353] [logger] [info] [preprocessor.cpp:97] preprocessing - ndjson pipeline chosen
siloPreprocessing-1 | [2025-02-20 16:17:23.959] [logger] [warning] [metadata_info.cpp:136] The field 'nuc_reference' which is contained in the metadata file '/preprocessing/input/data.ndjson' is not contained in the database config.
siloPreprocessing-1 | [2025-02-20 16:17:23.959] [logger] [warning] [metadata_info.cpp:136] The field 'aa_reference' which is contained in the metadata file '/preprocessing/input/data.ndjson' is not contained in the database config.
siloPreprocessing-1 | [2025-02-20 16:17:23.959] [logger] [error] [metadata_info.cpp:153] The field 'nextclade_reference' which is contained in the database config is not contained in the input field '/preprocessing/input/data.ndjson'.
siloPreprocessing-1 | [2025-02-20 16:17:23.967] [logger] [error] [main.cpp:50] Preprocessing Error: The field 'nextclade_reference' which is contained in the database config is not contained in the input field '/preprocessing/input/data.ndjson'.
siloPreprocessing-1 exited with code 1
Gracefully stopping... (press Ctrl+C again to force)
service "siloPreprocessing" didn't complete successfully: exit 1
(sr2silo) (base) ➜ ww_test git:(wip/silo) ✗
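
The log above reports two kinds of mismatch: fields present in the input but not in the database config (warnings), and fields present in the config but missing from the input (a fatal error). A pre-flight check that mirrors this distinction could be sketched as follows. This is only an illustration, not the check added in this PR: the function name `compare_fields` is hypothetical, and both field lists are passed in as plain names rather than read from real files.

```python
from __future__ import annotations

from typing import Iterable


def compare_fields(
    input_fields: Iterable[str], config_fields: Iterable[str]
) -> tuple[set[str], set[str]]:
    """Compare metadata field names from the input against the database config.

    Returns (extra, missing):
      extra   - in the input but not in the config (SILO only warns here)
      missing - in the config but not in the input (SILO aborts here)
    """
    input_set, config_set = set(input_fields), set(config_fields)
    return input_set - config_set, config_set - input_set


# Field names taken from the log above.
extra, missing = compare_fields(
    input_fields=["nuc_reference", "aa_reference"],
    config_fields=["nextclade_reference"],
)
if missing:
    # In a real run this would be raised before submitting to SILO.
    message = f"Fields required by the database config are missing: {sorted(missing)}"
```

Running such a check before preprocessing would surface the `nextclade_reference` mismatch without starting the SILO container.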

@gordonkoehn gordonkoehn marked this pull request as ready for review February 21, 2025 09:46
Contributor

@Copilot Copilot AI left a comment


Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

tests/test_database_config_validation.py:17

  • Typo: 'Validateds' should be 'Validates'.
    """Validateds that the schema of the database config file matches the ReadMetadata model

tests/test_database_config_validation.py:19

  • Typo: 'nameing' should be 'naming'.
    config file matches the ReadMetadata model at least in the nameing of the fields

@gordonkoehn
Collaborator Author

@Taepper requested review just FYI.

This PR generates a full NDJSON with nucleotides and amino acids with accurate indel handling.

I am implementing a bit of validation for the format of the database schema here.
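
As a rough sketch of what a field-naming validation of this kind might look like, the snippet below compares the field names of a metadata model with those listed in a database config. Everything here is an assumption for illustration: `ReadMetadataSketch` is a stand-in dataclass (not the project's actual `ReadMetadata` model), its `read_id` field is hypothetical, and the config is assumed to be already parsed into a dict with field entries under `schema` → `metadata`, each carrying a `name`.

```python
from dataclasses import dataclass, fields


@dataclass
class ReadMetadataSketch:
    """Stand-in for the real metadata model; field set is illustrative."""

    read_id: str
    nuc_reference: str
    aa_reference: str


def config_matches_model(config: dict, model: type) -> bool:
    """Return True if the config and the model declare the same field names.

    Only names are compared, not types.
    """
    config_names = {entry["name"] for entry in config["schema"]["metadata"]}
    model_names = {f.name for f in fields(model)}
    return config_names == model_names


# Assumed layout of a parsed database config.
matching_config = {
    "schema": {
        "metadata": [
            {"name": "read_id", "type": "string"},
            {"name": "nuc_reference", "type": "string"},
            {"name": "aa_reference", "type": "string"},
        ]
    }
}
# A config that still declares 'nextclade_reference', as in the failing run.
stale_config = {
    "schema": {
        "metadata": [
            {"name": "read_id", "type": "string"},
            {"name": "nextclade_reference", "type": "string"},
        ]
    }
}
```

Comparing names only keeps the check cheap; type agreement would need a separate mapping between the model's types and the config's type strings.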

The core workflow is still in a script, but will migrate into the package in the next PRs.

Successfully merging this pull request may close these issues.

Integrate translation into full Flow in vp_transformer