feat: add interval logic for l2g features #812

Open
wants to merge 43 commits into
base: dev
43 commits
9c31f43
feat: add interval logic for l2g features
xyg123 Oct 3, 2024
330b79e
chore: fix docstrings
xyg123 Oct 3, 2024
183c827
chore: fix attribute errors
xyg123 Oct 3, 2024
500bae8
Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…
xyg123 Oct 3, 2024
7cb4b5f
Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…
xyg123 Oct 7, 2024
2035a52
fix: multiple input lines from merge
xyg123 Oct 7, 2024
985a901
fix: change to mean comparison, add additional interval features
xyg123 Oct 7, 2024
b01b4e8
fix: change to mean comparison, add additional interval features
xyg123 Oct 7, 2024
688c73a
Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…
xyg123 Oct 7, 2024
6837df3
Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…
xyg123 Oct 15, 2024
f194098
Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…
xyg123 Oct 16, 2024
a9c0f6b
fix: change interval schema, reorganise interval processing, begin ad…
xyg123 Oct 17, 2024
63d6db6
Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…
xyg123 Oct 17, 2024
374a7c3
Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…
xyg123 Oct 18, 2024
29ad08b
Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…
xyg123 Oct 21, 2024
42e4ce9
Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…
xyg123 Nov 21, 2024
55f947f
fix: schema fixes
xyg123 Nov 22, 2024
1de5fcf
Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…
xyg123 Dec 10, 2024
c332d93
Added working tests for interval + nbh features
xyg123 Dec 11, 2024
ee8c4f2
chore: pre-commit auto fixes [...]
pre-commit-ci[bot] Dec 11, 2024
737a827
fix: l2g_feature_matrix tests
xyg123 Dec 11, 2024
921c820
Merge branch 'xg1_l2g_intervals' of https://github.com/opentargets/ge…
xyg123 Dec 11, 2024
0e23427
chore: pre-commit auto fixes [...]
pre-commit-ci[bot] Dec 11, 2024
6ac2d12
fix l2g_feature_matrix tests
xyg123 Dec 11, 2024
b1b2aa5
Merge branch 'xg1_l2g_intervals' of https://github.com/opentargets/ge…
xyg123 Dec 11, 2024
4f893fb
fix l2g_feature_matrix tests
xyg123 Dec 11, 2024
2bbf69c
Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…
xyg123 Dec 16, 2024
aed12ec
fix l2g step for intervals
xyg123 Dec 17, 2024
37109e3
Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…
xyg123 Dec 17, 2024
054eaa3
chore: pre-commit auto fixes [...]
pre-commit-ci[bot] Dec 17, 2024
ad934c4
generate features by overlapping studyLocus variants
xyg123 Dec 17, 2024
0eea3aa
Merge branch 'xg1_l2g_intervals' of https://github.com/opentargets/ge…
xyg123 Dec 17, 2024
8140d5a
fix on l2g step mypy
xyg123 Dec 17, 2024
24dc8c3
type hint issue
xyg123 Dec 17, 2024
9aeb302
add datasource step to process intervals
xyg123 Dec 17, 2024
155fcdb
chore: pre-commit auto fixes [...]
pre-commit-ci[bot] Dec 17, 2024
53a6ff3
add interval doc .md
xyg123 Dec 19, 2024
b8914a7
Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…
xyg123 Dec 19, 2024
78f661b
changes to config
xyg123 Dec 19, 2024
cf8b260
Merge branch 'xg1_l2g_intervals' of https://github.com/opentargets/ge…
xyg123 Dec 19, 2024
880cacf
chore: pre-commit auto fixes [...]
pre-commit-ci[bot] Dec 19, 2024
c076e17
address feature name comments and tests
xyg123 Dec 19, 2024
b074bc4
Merge branch 'xg1_l2g_intervals' of https://github.com/opentargets/ge…
xyg123 Dec 19, 2024
17 changes: 17 additions & 0 deletions docs/python_api/datasets/l2g_features/intervals.md
@@ -0,0 +1,17 @@
+---
+title: From intervals
+---
+
+## List of features
+
+::: gentropy.dataset.l2g_features.intervals.PchicMeanFeature
+::: gentropy.dataset.l2g_features.intervals.PchicMeanNeighbourhoodFeature
+::: gentropy.dataset.l2g_features.intervals.EnhTssCorrelationMeanFeature
+::: gentropy.dataset.l2g_features.intervals.EnhTssCorrelationMeanNeighbourhoodFeature
+::: gentropy.dataset.l2g_features.intervals.DhsPmtrCorrelationMeanFeature
+::: gentropy.dataset.l2g_features.intervals.DhsPmtrCorrelationMeanNeighbourhoodFeature
+
+## Common logic
+
+::: gentropy.dataset.l2g_features.intervals.common_interval_feature_logic
+::: gentropy.dataset.l2g_features.intervals.common_neighbourhood_interval_feature_logic
8 changes: 7 additions & 1 deletion src/gentropy/assets/schemas/intervals.json
@@ -18,10 +18,16 @@
       "nullable": false,
       "type": "string"
     },
+    {
+      "metadata": {},
+      "name": "variantId",
+      "nullable": true,
+      "type": "string"
+    },
     {
       "metadata": {},
       "name": "geneId",
-      "nullable": false,
+      "nullable": true,
       "type": "string"
     },
     {
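Note on the schema hunk above: it adds a nullable `variantId` field and relaxes `geneId` from non-nullable to nullable. A minimal standard-library sketch of what those flags mean for a single row follows. The `validate_row` helper and the sample row are hypothetical and not part of gentropy, and only the two fields visible in this hunk are modelled.

```python
import json

# Fragment mirroring the two fields touched by this hunk (other fields omitted).
SCHEMA_FRAGMENT = json.loads("""
[
  {"metadata": {}, "name": "variantId", "nullable": true, "type": "string"},
  {"metadata": {}, "name": "geneId", "nullable": true, "type": "string"}
]
""")


def validate_row(row: dict, fields: list) -> list:
    """Return the names of non-nullable fields that are missing or None in `row`."""
    return [
        field["name"]
        for field in fields
        if not field["nullable"] and row.get(field["name"]) is None
    ]


# Hypothetical row: a variant-level interval record with no gene assigned.
interval_without_gene = {"variantId": "1_100_A_T", "geneId": None}

# With both fields nullable, the row passes.
violations = validate_row(interval_without_gene, SCHEMA_FRAGMENT)

# Under the previous schema (geneId non-nullable) the same row would be flagged.
old_fields = [
    dict(f, nullable=False) if f["name"] == "geneId" else f for f in SCHEMA_FRAGMENT
]
old_violations = validate_row(interval_without_gene, old_fields)
```

Under these assumptions the row is accepted by the new schema and flagged for `geneId` by the old one.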
51 changes: 35 additions & 16 deletions src/gentropy/config.py
@@ -17,8 +17,7 @@ class SessionConfig:
     write_mode: str = "errorifexists"
     spark_uri: str = "local[*]"
     hail_home: str = os.path.dirname(hail_location)
-    extended_spark_conf: dict[str, str] | None = field(
-        default_factory=dict[str, str])
+    extended_spark_conf: dict[str, str] | None = field(default_factory=dict[str, str])
     output_partitions: int = 200
     _target_: str = "gentropy.common.session.Session"

@@ -40,8 +39,7 @@ class ColocalisationConfig(StepConfig):
     credible_set_path: str = MISSING
     coloc_path: str = MISSING
     colocalisation_method: str = MISSING
-    colocalisation_method_params: dict[str, Any] = field(
-        default_factory=dict[str, Any])
+    colocalisation_method_params: dict[str, Any] = field(default_factory=dict[str, Any])
     _target_: str = "gentropy.colocalisation.ColocalisationStep"


@@ -126,8 +124,7 @@ class EqtlCatalogueConfig(StepConfig):
     eqtl_catalogue_paths_imported: str = MISSING
     eqtl_catalogue_study_index_out: str = MISSING
     eqtl_catalogue_credible_sets_out: str = MISSING
-    mqtl_quantification_methods_blacklist: list[str] = field(
-        default_factory=lambda: [])
+    mqtl_quantification_methods_blacklist: list[str] = field(default_factory=lambda: [])
     eqtl_lead_pvalue_threshold: float = 1e-3
     _target_: str = "gentropy.eqtl_catalogue.EqtlCatalogueStep"

@@ -217,6 +214,18 @@ class LDBasedClumpingConfig(StepConfig):
     _target_: str = "gentropy.ld_based_clumping.LDBasedClumpingStep"


+@dataclass
+class IntervalConfig(StepConfig):
+    """Interval step configuration."""
+
+    gene_index_path: str = MISSING
+    liftover_chain_file_path: str = MISSING
+    max_distance: int = 250_000
+    interval_sources: dict[str, str] = MISSING
+    processed_interval_path: str = MISSING
+    _target_: str = "gentropy.intervals.IntervalStep"
+
+
 @dataclass
 class LocusToGeneConfig(StepConfig):
     """Locus to gene step configuration."""
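Note on `IntervalConfig`: four of its fields default to `MISSING` and must be supplied at run time, while `max_distance` has a usable default of 250,000. The sketch below is a stand-alone approximation using only the standard library. `IntervalConfigSketch`, `unset_fields`, the string stand-in for `MISSING`, and every path and source name are assumptions made for illustration, not gentropy code.

```python
from dataclasses import dataclass, field, fields

# Stand-in for omegaconf.MISSING, which the PR uses to mark mandatory values.
MISSING = "???"


@dataclass
class IntervalConfigSketch:
    """Hypothetical stand-alone copy of the fields shown in the diff."""

    gene_index_path: str = MISSING
    liftover_chain_file_path: str = MISSING
    max_distance: int = 250_000
    # The PR declares this as MISSING; a factory is used here so the sketch
    # stays valid without omegaconf.
    interval_sources: dict = field(default_factory=dict)
    processed_interval_path: str = MISSING


def unset_fields(cfg) -> list:
    """List the fields that still hold the MISSING marker."""
    return [f.name for f in fields(cfg) if getattr(cfg, f.name) == MISSING]


cfg = IntervalConfigSketch(
    gene_index_path="path/to/gene_index",  # placeholder path
    interval_sources={"example_source": "path/to/source_file"},  # placeholder
)
still_missing = unset_fields(cfg)
```

Here `still_missing` lists the two paths that were not supplied, which is the kind of check a config loader would perform before running the step.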
@@ -263,6 +272,13 @@ class LocusToGeneConfig(StepConfig):
             "vepMaximumNeighbourhood",
             "vepMean",
             "vepMeanNeighbourhood",
+            # intervals
+            "pchicMean",
+            "pchicMeanNeighbourhood",
+            "enhTssCorrelationMean",
+            "enhTssCorrelationMeanNeighbourhood",
+            "dhsPmtrCorrelationMean",
+            "dhsPmtrCorrelationMeanNeighbourhood",
             # other
             "geneCount500kb",
             "proteinGeneCount500kb",
@@ -306,6 +322,7 @@ class LocusToGeneFeatureMatrixConfig(StepConfig):
     colocalisation_path: str | None = None
     study_index_path: str | None = None
     gene_index_path: str | None = None
+    interval_path: str | None = None
     feature_matrix_path: str = MISSING
     features_list: list[str] = field(
         default_factory=lambda: [
@@ -340,6 +357,13 @@ class LocusToGeneFeatureMatrixConfig(StepConfig):
             "vepMaximumNeighbourhood",
             "vepMean",
             "vepMeanNeighbourhood",
+            # intervals
+            "pchicMean",
+            "pchicMeanNeighbourhood",
+            "enhTssCorrelationMean",
+            "enhTssCorrelationMeanNeighbourhood",
+            "dhsPmtrCorrelationMean",
+            "dhsPmtrCorrelationMeanNeighbourhood",
             # other
             "geneCount500kb",
             "proteinGeneCount500kb",
@@ -681,8 +705,7 @@ class Config:
     """Application configuration."""

     # this is unfortunately verbose due to @dataclass limitations
-    defaults: List[Any] = field(default_factory=lambda: [
-        "_self_", {"step": MISSING}])
+    defaults: List[Any] = field(default_factory=lambda: ["_self_", {"step": MISSING}])
     step: StepConfig = MISSING
     datasets: dict[str, str] = field(default_factory=dict)
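
Note on the "verbose due to @dataclass limitations" comment: it presumably refers to the rule that a dataclass field cannot take a mutable object such as a list as its default, so the value has to be wrapped in `field(default_factory=...)`. A minimal sketch of that behaviour follows. The `Broken` and `Working` classes are hypothetical, and the string `"???"` stands in for `MISSING`.

```python
from dataclasses import dataclass, field

# A bare list default is rejected at class-creation time.
try:
    @dataclass
    class Broken:
        defaults: list = []
    error_message = None
except ValueError as exc:
    error_message = str(exc)


@dataclass
class Working:
    # The factory runs once per instance, so instances never share the list.
    defaults: list = field(default_factory=lambda: ["_self_", {"step": "???"}])


a, b = Working(), Working()
a.defaults.append("extra")  # mutating one instance leaves the other untouched
```

The lambda form is needed because `default_factory` must be a zero-argument callable that returns the default, not the default itself.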

@@ -716,8 +739,7 @@ def register_config() -> None:
         name="gwas_catalog_top_hit_ingestion",
         node=GWASCatalogTopHitIngestionConfig,
     )
-    cs.store(group="step", name="ld_based_clumping",
-             node=LDBasedClumpingConfig)
+    cs.store(group="step", name="ld_based_clumping", node=LDBasedClumpingConfig)
     cs.store(group="step", name="ld_index", node=LDIndexConfig)
     cs.store(group="step", name="locus_to_gene", node=LocusToGeneConfig)
     cs.store(
@@ -735,8 +757,7 @@ def register_config() -> None:

     cs.store(group="step", name="pics", node=PICSConfig)
     cs.store(group="step", name="gnomad_variants", node=GnomadVariantConfig)
-    cs.store(group="step", name="ukb_ppp_eur_sumstat_preprocess",
-             node=UkbPppEurConfig)
+    cs.store(group="step", name="ukb_ppp_eur_sumstat_preprocess", node=UkbPppEurConfig)
     cs.store(group="step", name="variant_index", node=VariantIndexConfig)
     cs.store(group="step", name="variant_to_vcf", node=ConvertToVcfStepConfig)
     cs.store(
@@ -769,7 +790,5 @@ def register_config() -> None:
         name="locus_to_gene_associations",
         node=LocusToGeneAssociationsStepConfig,
     )
-    cs.store(group="step", name="finngen_ukb_meta_ingestion",
-             node=FinngenUkbMetaConfig)
-    cs.store(group="step", name="credible_set_qc",
-             node=CredibleSetQCStepConfig)
+    cs.store(group="step", name="finngen_ukb_meta_ingestion", node=FinngenUkbMetaConfig)
+    cs.store(group="step", name="credible_set_qc", node=CredibleSetQCStepConfig)
36 changes: 35 additions & 1 deletion src/gentropy/dataset/intervals.py
@@ -5,17 +5,19 @@
 from dataclasses import dataclass
 from typing import TYPE_CHECKING

+import pyspark.sql.functions as f
+
 from gentropy.common.Liftover import LiftOverSpark
 from gentropy.common.schemas import parse_spark_schema
 from gentropy.dataset.dataset import Dataset
 from gentropy.dataset.gene_index import GeneIndex
+from gentropy.dataset.variant_index import VariantIndex

 if TYPE_CHECKING:
     from pyspark.sql import SparkSession
     from pyspark.sql.types import StructType

-

 @dataclass
 class Intervals(Dataset):
     """Intervals dataset links genes to genomic regions based on genome interaction studies."""
@@ -71,3 +73,35 @@ def from_source(
         source_class = source_to_class[source_name]
         data = source_class.read(spark, source_path)  # type: ignore
         return source_class.parse(data, gene_index, lift)  # type: ignore
+
+    def overlap_variant_index(
+        self: Intervals, variant_index: VariantIndex
+    ) -> Intervals:
+        """Overlaps intervals with a variant index.
+
+        Args:
+            variant_index (VariantIndex): Variant index dataset
+
+        Returns:
+            Intervals: Variant-to-gene intervals dataset
+        """
+        return Intervals(
+            _df=(
+                self.df.alias("interval")
+                .join(
+                    variant_index.df.selectExpr(
+                        "chromosome as vi_chromosome", "variantId", "position"
+                    ).alias("vi"),
+                    on=[
+                        f.col("vi.vi_chromosome") == f.col("interval.chromosome"),
+                        f.col("vi.position").between(
+                            f.col("interval.start"), f.col("interval.end")
+                        ),
+                    ],
+                    how="inner",
+                )
+                .drop("vi_chromosome", "position")
+                # .drop("start", "end", "vi_chromosome", "position")
+            ),
+            _schema=Intervals.get_schema(),
+        )
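
Note on `overlap_variant_index`: the join keeps a variant when it sits on the same chromosome as an interval and its position falls between `start` and `end`. Spark's `Column.between` is inclusive at both bounds, and the inner join drops intervals with no variant and variants with no interval. The pure-Python sketch below mirrors that logic on toy data. The `Interval` and `Variant` tuples, the `overlap` helper, and all sample values are hypothetical and not gentropy code.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chromosome: str
    start: int
    end: int
    geneId: str


class Variant(NamedTuple):
    chromosome: str
    position: int
    variantId: str


def overlap(intervals, variants):
    """Inner join: pair each interval with every variant inside it (bounds inclusive)."""
    return [
        # start and end are kept, matching the active .drop(...) call in the diff
        (iv.chromosome, iv.start, iv.end, iv.geneId, v.variantId)
        for iv in intervals
        for v in variants
        if v.chromosome == iv.chromosome and iv.start <= v.position <= iv.end
    ]


intervals = [Interval("1", 100, 200, "ENSG_A"), Interval("2", 100, 200, "ENSG_B")]
variants = [
    Variant("1", 100, "1_100_A_T"),  # on the start boundary: kept
    Variant("1", 201, "1_201_C_G"),  # just past the end: dropped
    Variant("2", 150, "2_150_G_A"),  # inside the chromosome 2 interval: kept
    Variant("3", 150, "3_150_T_C"),  # no interval on chromosome 3: dropped
]
rows = overlap(intervals, variants)
```

An interval containing several variants yields one output row per variant, so the result can be larger than the input intervals dataset.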