-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add cosmic database download wrapper #345
base: master
Are you sure you want to change the base?
Conversation
The tests fail because no credentials (i.e. environment variables |
I have to admit I don't fully understand the solutions you propose, but is that something that would make sense to have in snakemake-wrapper-utils? |
Yes, I guess that'd be a good place to put it. |
build="GRCh38", | ||
dataset="cosmic", | ||
version="v92", | ||
file=lambda wc: wc.db # e.g. "CosmicHGNC.tsv.gz" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add some comment here explaining the meaning of file? And where to get the possible values from?
This PR was marked as stale because it has been open for 6 months with no activity. |
WalkthroughThe changes introduce new configuration files and a wrapper for the Changes
Possibly related PRs
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 9
🧹 Outside diff range and nitpick comments (7)
bio/cosmic-db/environment.yaml (1)
1-8
: Consider adding dependencies specific to the COSMIC database wrapper.The current environment setup provides a good foundation with curl, python, and requests. However, given that this environment is for a COSMIC database wrapper, consider if any additional dependencies are required for:
- Data processing (e.g., pandas for data manipulation)
- File compression/decompression (e.g., gzip, if COSMIC files are compressed)
- Logging (e.g., loguru for enhanced logging capabilities)
- Testing (e.g., pytest for unit testing)
Adding these dependencies now can ensure a more complete and robust environment for development and execution of the wrapper.
bio/cosmic-db/meta.yaml (3)
2-3
: Consider enhancing the description for clarity.While the current description is clear and concise, it could be more informative. Consider expanding it to mention that this is a wrapper and include a brief explanation of what COSMIC is.
Here's a suggested improvement:
description: | A wrapper to download COSMIC (Catalogue Of Somatic Mutations In Cancer) databases from https://cancer.sanger.ac.uk/cosmic/download. COSMIC is a comprehensive resource for exploring the impact of somatic mutations in human cancer.
8-9
: Enhance the output description for clarity.The current output description is too brief and lacks important details. Consider providing more information about the output, such as its format and expected location.
Here's a suggested improvement:
output: - database: description: Downloaded COSMIC database file format: Depending on the selected file (e.g., TSV, VCF) location: Specified by the Snakemake rule
1-9
: Consider adding sections for requirements and error handling.While the current structure provides essential information, the file could benefit from additional sections to make it more comprehensive:
Add a 'requirements' section to list necessary environment variables:
requirements: - COSMIC_EMAIL: Email for COSMIC authentication - COSMIC_PW: Password for COSMIC authenticationConsider adding an 'error_handling' section to describe potential issues and how they're managed:
error_handling: - Authentication failure: Raises an error if COSMIC credentials are invalid or missing - Invalid parameters: Checks and raises errors for invalid build, dataset, or version inputs - Download issues: Handles and reports any network or server errors during downloadThese additions would provide a more complete picture of the wrapper's functionality and requirements.
bio/cosmic-db/test/Snakefile (1)
1-3
: Add a comment explaining the environment variables.It would be helpful to add a brief comment explaining the purpose of these environment variables and how they should be set.
Consider adding a comment like this:
# Environment variables for COSMIC database authentication # These should be set in the user's environment before running the workflowtest.py (1)
128-133
: LGTM! Consider adding a comment explaining the purpose of this test.The new
test_download_cosmic_db()
function is well-structured and consistent with the existing test functions in this file. It correctly uses the@skip_if_not_modified
decorator and therun()
function to execute a Snakemake command for downloading a COSMIC database file.To improve code readability and maintainability, consider adding a brief comment explaining the purpose of this test and what it's specifically checking in the COSMIC database download process.
@skip_if_not_modified def test_download_cosmic_db(): # Test the download and processing of the COSMIC HGNC database file run( "bio/cosmic-db", ["snakemake", "--cores", "1", "--use-conda", "resources/CosmicHGNC.tsv.gz"], )bio/cosmic-db/wrapper.py (1)
69-69
: Ensure safe variable substitution in shell commandWhile the
shell
function from Snakemake handles variable substitution, it's good practice to pass variables explicitly to prevent any unintended shell injection issues.Refactor the shell command to use keyword arguments:
-shell('curl "{download_url}" -o {snakemake.output[0]} {log}') +shell( + 'curl "{url}" -o "{output}" {log}', + url=download_url, + output=snakemake.output[0], + log=log +)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- bio/cosmic-db/environment.yaml (1 hunks)
- bio/cosmic-db/meta.yaml (1 hunks)
- bio/cosmic-db/test/Snakefile (1 hunks)
- bio/cosmic-db/wrapper.py (1 hunks)
- test.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
bio/cosmic-db/wrapper.py (2)
Pattern
**/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theself
argument of methods.
Do not suggest type annotation of thecls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturn
statement.
Pattern
**/wrapper.py
: Do not complain about use of undefined variable calledsnakemake
.test.py (1)
Pattern
**/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theself
argument of methods.
Do not suggest type annotation of thecls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturn
statement.
🪛 Ruff
bio/cosmic-db/wrapper.py
45-45: Undefined name
snakemake
(F821)
46-46: Undefined name
snakemake
(F821)
47-47: Undefined name
snakemake
(F821)
48-48: Undefined name
snakemake
(F821)
49-49: Undefined name
snakemake
(F821)
🔇 Additional comments (6)
bio/cosmic-db/environment.yaml (2)
1-4
: LGTM: Channel configuration is appropriate.The channel configuration follows best practices for bioinformatics projects:
- bioconda: Specific to bioinformatics packages.
- conda-forge: Community-led channel with a wide range of packages.
- defaults: Default channel provided by Anaconda.
This order ensures that bioinformatics-specific packages are prioritized, followed by community-maintained packages, with fallback to default packages.
5-8
: Consider updating and refining dependency versions.While the current dependencies are functional, there are some potential improvements:
- curl: The current pin
=7
is quite broad. Consider specifying a more precise version, e.g.,=7.88.1
.- python: The minimum version of 3.6.2 is quite old (released in 2017). Consider updating to a more recent version, e.g.,
>=3.8
.- requests: Version 2.25 is from 2020. Consider updating to a more recent version, e.g.,
=2.31.0
.Updating these versions can provide bug fixes, performance improvements, and enhanced security.
To verify the latest versions and their compatibility, you can run:
bio/cosmic-db/meta.yaml (2)
1-1
: LGTM: Project name is appropriate and follows conventions.The project name 'cosmic-db' is concise, descriptive, and follows the naming convention of using lowercase letters and hyphens.
4-5
: LGTM: Author information is correctly formatted.The author's name is properly listed and follows the correct YAML format for lists.
bio/cosmic-db/test/Snakefile (2)
12-12
: Explain the 'file' parameter and its possible values.As suggested in a previous review, it would be helpful to add a comment explaining the meaning of the 'file' parameter and where to get the possible values from.
Consider adding a comment like this:
# 'file' parameter specifies the name of the file to download from COSMIC database # Possible values can be found in the COSMIC database file list (provide URL if available) file=lambda wc: wc.db # e.g. "CosmicHGNC.tsv.gz"
5-16
: LGTM: Overall structure of the rule looks good.The structure of the
cosmic_download
rule follows Snakemake best practices. It correctly defines output, params, log, and uses a wrapper. The use of a wildcard in the output path provides flexibility.
input: | ||
- genome build, cosmic release version, dataset and file name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Restructure the input section for clarity and completeness.
The current input description lacks structure and specificity. Consider restructuring it as a YAML list with more detailed information about each parameter, including possible values where applicable.
Here's a suggested improvement based on the information provided in the PR objectives:
input:
- build:
description: Genome build version
choices: [GRCh37, GRCh38]
- dataset:
description: COSMIC dataset to download
choices: [cosmic, cell_lines]
- version:
description: COSMIC release version
choices: [v83, v84, v85, v86, v87, v88, v89, v90, v91, v92, latest]
- file:
description: Specific file to download (uses {db} wildcard)
Also, consider mentioning the required environment variables COSMIC_EMAIL
and COSMIC_PW
in the description or in a separate 'requirements' section.
build="GRCh38", | ||
dataset="cosmic", | ||
version="v92", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consider making 'build', 'dataset', and 'version' params more flexible.
The current implementation hardcodes these values, which might limit the reusability of this rule. Consider making these parameters more flexible by using config values or wildcards.
Here's a suggestion for improvement:
params:
build=config.get("cosmic_build", "GRCh38"),
dataset=config.get("cosmic_dataset", "cosmic"),
version=config.get("cosmic_version", "v92"),
This allows users to override these values in their config file while still providing sensible defaults.
"""Snakemake wrapper for trimming paired-end reads using cutadapt.""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update the docstring to reflect the correct functionality
The current docstring mentions trimming paired-end reads using cutadapt, which doesn't match the functionality of this wrapper. It should describe downloading files from the COSMIC database.
Apply this diff to correct the docstring:
-"""Snakemake wrapper for trimming paired-end reads using cutadapt."""
+"""Snakemake wrapper for downloading files from the COSMIC database."""
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
"""Snakemake wrapper for trimming paired-end reads using cutadapt.""" | |
"""Snakemake wrapper for downloading files from the COSMIC database.""" | |
email = os.environ["COSMIC_EMAIL"] | ||
password = os.environ["COSMIC_PW"] | ||
assert email, "$COSMIC_EMAIL is not set" | ||
assert password, "$COSMIC_PW is not set" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handle missing environment variables to avoid KeyError
Accessing environment variables with os.environ["VAR"]
will raise a KeyError
if the variable is not set. Use os.environ.get("VAR")
and handle the case where it returns None
or an empty string.
Apply this diff to improve error handling:
-email = os.environ["COSMIC_EMAIL"]
-password = os.environ["COSMIC_PW"]
-assert email, "$COSMIC_EMAIL is not set"
-assert password, "$COSMIC_PW is not set"
+email = os.environ.get("COSMIC_EMAIL")
+password = os.environ.get("COSMIC_PW")
+if not email:
+ raise EnvironmentError("$COSMIC_EMAIL environment variable is not set")
+if not password:
+ raise EnvironmentError("$COSMIC_PW environment variable is not set")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
email = os.environ["COSMIC_EMAIL"] | |
password = os.environ["COSMIC_PW"] | |
assert email, "$COSMIC_EMAIL is not set" | |
assert password, "$COSMIC_PW is not set" | |
email = os.environ.get("COSMIC_EMAIL") | |
password = os.environ.get("COSMIC_PW") | |
if not email: | |
raise EnvironmentError("$COSMIC_EMAIL environment variable is not set") | |
if not password: | |
raise EnvironmentError("$COSMIC_PW environment variable is not set") |
def available_files(build: str, dataset: str, version: str) -> List[str]: | ||
files = requests.get(f"{COSMIC_URL}/{build}/{dataset}/{version}").json() | ||
return [f.rpartition("/")[-1] for f in files] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add error handling to available_files
function
Include error handling for HTTP requests in the available_files
function to handle potential request failures.
Apply this diff:
def available_files(build: str, dataset: str, version: str) -> List[str]:
- files = requests.get(f"{COSMIC_URL}/{build}/{dataset}/{version}").json()
+ response = requests.get(f"{COSMIC_URL}/{build}/{dataset}/{version}")
+ response.raise_for_status()
+ files = response.json()
return [f.rpartition("/")[-1] for f in files]
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def available_files(build: str, dataset: str, version: str) -> List[str]: | |
files = requests.get(f"{COSMIC_URL}/{build}/{dataset}/{version}").json() | |
return [f.rpartition("/")[-1] for f in files] | |
def available_files(build: str, dataset: str, version: str) -> List[str]: | |
response = requests.get(f"{COSMIC_URL}/{build}/{dataset}/{version}") | |
response.raise_for_status() | |
files = response.json() | |
return [f.rpartition("/")[-1] for f in files] |
def available_builds() -> List[str]: | ||
builds = requests.get(COSMIC_URL).json() | ||
return builds | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add error handling to available_builds
function
The available_builds
function performs an HTTP request without checking for errors. If the request fails or returns a non-200 status code, it could cause exceptions or return invalid data.
Apply this diff to handle potential errors:
def available_builds() -> List[str]:
- builds = requests.get(COSMIC_URL).json()
+ response = requests.get(COSMIC_URL)
+ response.raise_for_status()
+ builds = response.json()
return builds
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def available_builds() -> List[str]: | |
builds = requests.get(COSMIC_URL).json() | |
return builds | |
def available_builds() -> List[str]: | |
response = requests.get(COSMIC_URL) | |
response.raise_for_status() | |
builds = response.json() | |
return builds |
def available_versions(build: str, dataset: str) -> List[str]: | ||
versions = requests.get(f"{COSMIC_URL}/{build}/{dataset}").json() | ||
return [v.rpartition("/")[-1] for v in versions] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add error handling to available_versions
function
The available_versions
function should check the HTTP response for errors to prevent unexpected exceptions.
Apply this diff:
def available_versions(build: str, dataset: str) -> List[str]:
- versions = requests.get(f"{COSMIC_URL}/{build}/{dataset}").json()
+ response = requests.get(f"{COSMIC_URL}/{build}/{dataset}")
+ response.raise_for_status()
+ versions = response.json()
return [v.rpartition("/")[-1] for v in versions]
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def available_versions(build: str, dataset: str) -> List[str]: | |
versions = requests.get(f"{COSMIC_URL}/{build}/{dataset}").json() | |
return [v.rpartition("/")[-1] for v in versions] | |
def available_versions(build: str, dataset: str) -> List[str]: | |
response = requests.get(f"{COSMIC_URL}/{build}/{dataset}") | |
response.raise_for_status() | |
versions = response.json() | |
return [v.rpartition("/")[-1] for v in versions] |
download_url = ( | ||
requests.get(download_path(build, dataset, version, file), auth=(email, password)) | ||
.json()["url"] | ||
.strip() | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add error handling when retrieving the download URL
The code assumes the HTTP request is successful and the response contains a 'url'
key. If the request fails or the 'url'
key is missing, it could cause an exception. Please add error handling to manage these cases.
Apply this diff to improve robustness:
download_url = (
- requests.get(download_path(build, dataset, version, file), auth=(email, password))
- .json()["url"]
- .strip()
+ response = requests.get(download_path(build, dataset, version, file), auth=(email, password))
+ response.raise_for_status()
+ json_response = response.json()
+ download_url = json_response.get("url", "").strip()
+ if not download_url:
+ raise Exception("Failed to retrieve download URL from the response.")
)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
download_url = ( | |
requests.get(download_path(build, dataset, version, file), auth=(email, password)) | |
.json()["url"] | |
.strip() | |
) | |
download_url = ( | |
response = requests.get(download_path(build, dataset, version, file), auth=(email, password)) | |
response.raise_for_status() | |
json_response = response.json() | |
download_url = json_response.get("url", "").strip() | |
if not download_url: | |
raise Exception("Failed to retrieve download URL from the response.") | |
) |
def available_datasets(build: str) -> List[str]: | ||
datasets = requests.get(f"{COSMIC_URL}/{build}").json() | ||
return [d.rpartition("/")[-1] for d in datasets] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add error handling to available_datasets
function
Similar to available_builds
, the available_datasets
function should handle potential HTTP errors to ensure robustness.
Apply this diff:
def available_datasets(build: str) -> List[str]:
- datasets = requests.get(f"{COSMIC_URL}/{build}").json()
+ response = requests.get(f"{COSMIC_URL}/{build}")
+ response.raise_for_status()
+ datasets = response.json()
return [d.rpartition("/")[-1] for d in datasets]
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def available_datasets(build: str) -> List[str]: | |
datasets = requests.get(f"{COSMIC_URL}/{build}").json() | |
return [d.rpartition("/")[-1] for d in datasets] | |
def available_datasets(build: str) -> List[str]: | |
response = requests.get(f"{COSMIC_URL}/{build}") | |
response.raise_for_status() | |
datasets = response.json() | |
return [d.rpartition("/")[-1] for d in datasets] |
This PR adds a wrapper for downloading files from COSMIC.
It expects the following parameters to be set:
build
(at the moment either "GRCh37" or "GRCh38")dataset
(at the moment either "cosmic" or "cell_lines")version
(at the moment "v83" to "v92" or "latest")file
(which in the example Snakefile is derived from the output via the{db}
wildcard)Since new builds, datasets, versions and files may be added in the future, this wrapper will query the endpoints in order to check if the requested combination is actually available (at the time this wrapper is executed).
Since COSMIC needs authentication, email and password have to be given in the env vars
COSMIC_EMAIL
andCOSMIC_PW
.Summary by CodeRabbit
New Features
Bug Fixes
Documentation