
v0.2 #311

Merged Sep 26, 2024 (56 commits)

Commits
3799d7d
[FIX] lowercase letters break existing() alignment protocol, fixes #258
thomashopf Mar 8, 2021
741a789
[FIX] wrong sequence index in error message, fixes #259
thomashopf Mar 8, 2021
f6ee1a8
[FIX] fix extract_annotation crash, fixes #249
thomashopf Mar 8, 2021
d6627a2
[FIX] lowercase columns in frequency stable still led to crash
thomashopf Mar 13, 2021
380752e
[FEATURE] local submitter can now run jobs in parallel depending on u…
thomashopf Mar 13, 2021
22be035
Merge pull request #260 from debbiemarkslab/fix/alignment_protocols
thomashopf Mar 19, 2021
f3b733a
Update README.md
aggreen May 11, 2021
594b45a
Fix typo in complex_probability()
kpgbrock Sep 30, 2021
5de7bab
Update requirements.txt
aaronkollasch May 23, 2022
791c5fa
[FIX] import Iterable/Mapping from collections.abc, closes #275
thomashopf Aug 4, 2022
a50bd0f
[FIX] Update single sequence retrieval Uniprot API endpoint
thomashopf Aug 4, 2022
2495dc7
[FIX] Apply sklearn->scikit-learn update also in setup, not just requ…
thomashopf Aug 4, 2022
792150e
[FIX] Upgrade fetch_uniprot_mapping and SIFTS sequence table creation…
thomashopf Aug 5, 2022
1f85d74
Version bump
thomashopf Aug 5, 2022
22761e5
[FIX] pandas deprecation and warning
thomashopf Aug 5, 2022
90a4296
[FIX] safeguard insertion code checking
thomashopf Aug 5, 2022
7e0076d
Merge pull request #280 from debbiemarkslab/fix/uniprot-api
thomashopf Aug 5, 2022
d299279
change name inputs to bokeh
Nov 11, 2022
2590f9e
Merge pull request #281 from zachcp/20221111-bokeh
thomashopf Nov 14, 2022
721875a
Update build_and_test.yml
thomashopf Mar 16, 2023
af27842
Update build_test_and_push.yml
thomashopf Mar 16, 2023
d472e4b
Support a3m format in existing alignment protocol
aaronkollasch Mar 2, 2023
7192161
Prepare tests for new version, update readme and setup
thomashopf May 10, 2023
b4fc441
Keep as is due to github security restrictions
thomashopf May 10, 2023
e136240
Fix numpy dtype deprecation, fixes #291
thomashopf May 10, 2023
7863312
fix deprecation in pandas df equality checking in tests
thomashopf May 10, 2023
7eab82c
Merge pull request #286 from aaronkollasch/a3m-support
thomashopf May 10, 2023
a880fea
pin yaml due to upcoming API change
thomashopf May 11, 2023
5cd4033
Fix alignment identifier memory usage, add header splitting, fixes #289
thomashopf May 11, 2023
2324ed6
Correct wrong exception raise, fixes #199
thomashopf May 11, 2023
9d34d5e
better handler around occasional plmc segfaults, fixes #292
thomashopf May 11, 2023
5b3b3f0
Numpy deprecation fix, closes #248
thomashopf May 12, 2023
b9951e4
bump to python3.10
Oct 3, 2023
55ccdf5
add yml files
Nov 24, 2023
5a50ab9
build
Nov 26, 2023
686ca3c
add build
Nov 26, 2023
b4eb674
Fix PDB chain retrieval regression by pandas/numpy updates
thomashopf Jan 24, 2024
8fbf92d
Merge pull request #301 from zachcp/v0.2-patch
thomashopf Jan 24, 2024
c2a766d
Replace MMTF with bCIF
thomashopf Jul 5, 2024
cde87ae
Make model handling consistent to previous class
thomashopf Jul 6, 2024
2872d3d
Replace pd.append with pd.concat
thomashopf Jul 6, 2024
4778267
fix overlapping secondary structure elements
thomashopf Jul 7, 2024
cccc1ca
fix numpy warning (#306)
thomashopf Jul 7, 2024
05c6423
fix scikit-learn warning, cf. #306
thomashopf Jul 7, 2024
6567251
better backward compatibility for chain extraction
thomashopf Jul 8, 2024
7e2c0bb
fix charge handling
thomashopf Jul 8, 2024
58aa2c0
fix charge handling pt2
thomashopf Jul 8, 2024
9bfa113
fix alt_loc
thomashopf Jul 8, 2024
476700b
fix yet another pandas deprecation
thomashopf Jul 8, 2024
374d4c5
Merge pull request #293 from debbiemarkslab/v0.2
thomashopf Jul 8, 2024
f32491f
fix KeyError due to absence of secondary structure sections
thomashopf Jul 23, 2024
b7abe35
Merge pull request #308 from debbiemarkslab/fix_mmcif_parser
thomashopf Jul 23, 2024
35d93d0
fix atom_id overwrite by coord_id
thomashopf Aug 27, 2024
595472a
Merge branch 'master' into develop
thomashopf Sep 26, 2024
522903b
ensure chain IDs are always str
thomashopf Sep 26, 2024
188db86
Merge pull request #312 from debbiemarkslab/fix_auth_id_type
thomashopf Sep 26, 2024
13 changes: 8 additions & 5 deletions .github/workflows/build_and_test.yml
@@ -8,12 +8,12 @@ jobs:

strategy:
matrix:
python-version: [3.6]
python-version: ['3.10', '3.11']

steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Install and configure conda
@@ -29,14 +29,17 @@ jobs:
source activate test-environment
- name: Run setup.py
run: |
python setup.py sdist --formats=zip -k
pip install build
python setup.py sdist --formats=zip -k
python -m build
find ./dist -iname "*.zip" -print0 | xargs -0 pip install
pip install codecov
- name: Download test files
run: |
wget https://marks.hms.harvard.edu/evcouplings_test_cases/data/evcouplings_test_cases.tar.gz
tar -xf evcouplings_test_cases.tar.gz -C $HOME/
- name: Run tests in headless xvfb environment
uses: GabrielBB/xvfb-action@v1
uses: coactions/setup-xvfb@v1
with:
run: coverage run -m unittest discover -s test -p "Test*.py"
working-directory: ./ #optional
10 changes: 6 additions & 4 deletions .github/workflows/build_test_and_push.yml
@@ -9,12 +9,12 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: [3.6]
python-version: ['3.10', '3.11']

steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Install and configure conda
@@ -31,16 +31,18 @@
- name: Run setup.py
run: |
python setup.py sdist --formats=zip -k
python setup.py bdist_wheel
find ./dist -iname "*.zip" -print0 | xargs -0 pip install
pip install codecov
- name: Download test files
run: |
wget https://marks.hms.harvard.edu/evcouplings_test_cases/data/evcouplings_test_cases.tar.gz
tar -xf evcouplings_test_cases.tar.gz -C $HOME/
- name: Run tests in headless xvfb environment
uses: GabrielBB/xvfb-action@v1
uses: coactions/setup-xvfb@v1
with:
run: coverage run -m unittest discover -s test -p "Test*.py"
working-directory: ./ #optional
- name: Publish evcouplings to test PyPI
if: startsWith(github.ref, 'refs/tags')
uses: pypa/gh-action-pypi-publish@master
9 changes: 4 additions & 5 deletions README.md
@@ -7,11 +7,12 @@ Predict protein structure, function and mutations using evolutionary sequence co

### Installing the Python package

If you are simply interested in using EVcouplings as a library, installing the Python package is all you need to do (unless you use functions that depend on external tools). If you want to run the *evcouplings* application (alignment generation, model parameter inference, structure prediction, etc.) you will also need to follow the sections on installing external tools and databases.
* If you are simply interested in using EVcouplings as a library, installing the Python package is all you need to do (unless you use functions that depend on external tools).
* If you want to run the *evcouplings* application (alignment generation, model parameter inference, structure prediction, etc.) you will also need to follow the sections on installing external tools and databases.

#### Requirements

EVcouplings requires a Python >= 3.5 installation. Since it depends on some packages that can be tricky to install using pip (numba, numpy, ...), we recommend using the [Anaconda Python distribution](https://www.continuum.io/downloads). In case you are creating a new conda environment or using miniconda, please make sure to run `conda install anaconda` before running pip, or otherwise the required packages will not be present.
EVcouplings actively supports Python >= 3.10 installations.

#### Installation

@@ -27,8 +28,6 @@ and to update to the latest version after previously installing EVcouplings from

pip install -U --no-deps https://github.com/debbiemarkslab/EVcouplings/archive/develop.zip

Installation will take seconds.

### External software tools

*After installation and before running compute jobs, the paths to the respective binaries of the following external tools have to be set in your EVcouplings job configuration file(s).*
@@ -141,7 +140,7 @@ Hopf, T. A., Schärfe, C. P. I., Rodrigues, J. P. G. L. M., Green, A. G., Kohlba

Hopf, T. A., Ingraham, J. B., Poelwijk, F.J., Schärfe, C.P.I., Springer, M., Sander, C., & Marks, D. S. (2017). Mutation effects predicted from sequence co-variation. *Nature Biotechnology* **35**, 128–135 doi:10.1038/nbt.3769

Green, A. G. and Elhabashy, H., Brock, K. P., Maddamsetti, R., Kohlbacher, O., Marks, D. S. (2019) Proteom-scale discovery of protein interactions with residue-level resolution using sequence coevolution. BioRxiv (in review). https://doi.org/10.1101/791293
Green, A. G. and Elhabashy, H., Brock, K. P., Maddamsetti, R., Kohlbacher, O., Marks, D. S. (2021) Large-scale discovery of protein interactions at residue resolution using co-evolution calculated from genomic sequences. *Nature Communications* **12**, 1396. https://doi.org/10.1038/s41467-021-21636-z

## Contributors

8 changes: 7 additions & 1 deletion config/sample_config_complex.txt
@@ -487,6 +487,12 @@ environment:
memory: 15000
time: 2-0:0:0

# Special setting for "local" engine to define number of workers running in parallel
# (note that "cores" has to be defined above to make sure each job only uses a defined
# number of cores). If not defined or None, will default to number of cores / cores per job;
# otherwise specify integer to limit number of workers (1 for serial execution of subjobs)
# parallel_workers: 1

# command that will be executed before running actual computation (can be used to set up environment)
configuration:

@@ -500,7 +506,7 @@ databases:
uniref90: /n/groups/marks/databases/jackhmmer/uniref90/uniref90_current.o2.fasta

# URL to download sequences if sequence_file is not given. {} will be replaced by sequence_id.
sequence_download_url: http://www.uniprot.org/uniprot/{}.fasta
sequence_download_url: http://rest.uniprot.org/uniprot/{}.fasta

# Directory with PDB MMTF structures (leave blank to fetch structures from web)
pdb_mmtf_dir:
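The `parallel_workers` comment above describes a fallback rule: when the option is unset, the local engine derives the worker count from the number of available cores divided by cores per job. A minimal sketch of that rule, assuming a hypothetical helper name (not the actual evcouplings implementation):

```python
import os


def resolve_worker_count(parallel_workers=None, cores_per_job=1):
    # Hypothetical helper mirroring the documented fallback: an explicit
    # setting wins; otherwise divide the machine's cores among subjobs.
    total_cores = os.cpu_count() or 1
    if parallel_workers is None:
        return max(1, total_cores // max(1, cores_per_job))
    # parallel_workers=1 corresponds to serial execution of subjobs
    return max(1, int(parallel_workers))
```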
8 changes: 7 additions & 1 deletion config/sample_config_monomer.txt
@@ -384,6 +384,12 @@ environment:
memory: 15000
time: 2-0:0:0

# Special setting for "local" engine to define number of workers running in parallel
# (note that "cores" has to be defined above to make sure each job only uses a defined
# number of cores). If not defined or None, will default to number of cores / cores per job;
# otherwise specify integer to limit number of workers (1 for serial execution of subjobs)
# parallel_workers: 1

# command that will be executed before running actual computation (can be used to set up environment)
configuration:

@@ -397,7 +403,7 @@ databases:
uniref90: /n/groups/marks/databases/jackhmmer/uniref90/uniref90_current.o2.fasta

# URL to download sequences if sequence_file is not given. {} will be replaced by sequence_id.
sequence_download_url: http://www.uniprot.org/uniprot/{}.fasta
sequence_download_url: http://rest.uniprot.org/uniprot/{}.fasta

# Directory with PDB MMTF structures (leave blank to fetch structures from web)
pdb_mmtf_dir:
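Both sample configs now point `sequence_download_url` at UniProt's REST endpoint, with `{}` substituted by the sequence identifier. A sketch of how such a template is typically expanded and fetched, assuming the `requests` library (illustrative, not the actual evcouplings download helper):

```python
import requests

# template taken from the sample configs above
SEQUENCE_DOWNLOAD_URL = "http://rest.uniprot.org/uniprot/{}.fasta"


def fetch_sequence(sequence_id):
    # hypothetical helper: retrieve a single FASTA record from UniProt
    url = SEQUENCE_DOWNLOAD_URL.format(sequence_id)
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text
```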
35 changes: 28 additions & 7 deletions evcouplings/align/alignment.py
@@ -9,6 +9,7 @@
import re
from collections import namedtuple, OrderedDict, defaultdict
from copy import deepcopy
from pathlib import Path

import numpy as np
from numba import jit
@@ -326,7 +327,7 @@ def write_a3m(sequences, fileobj, insert_gap=INSERT_GAP, width=80):
fileobj.write(seq.replace(insert_gap, "") + "\n")


def detect_format(fileobj):
def detect_format(fileobj, filepath=""):
"""
Detect if an alignment file is in FASTA or
Stockholm format.
@@ -335,10 +336,12 @@ def detect_format(fileobj):
----------
fileobj : file-like obj
Alignment file for which to detect format
filepath : string or path-like obj
Path of alignment file

Returns
-------
format : {"fasta", "stockholm", None}
format : {"fasta", "a3m", "stockholm", None}
Format of alignment, None if not detectable
"""
for i, line in enumerate(fileobj):
@@ -348,6 +351,9 @@

# This indicates a FASTA file
if line.startswith(">"):
# A3M files have extension .a3m
if Path(filepath).suffix.lower() == ".a3m":
return "a3m"
return "fasta"

# Skip comment lines and empty lines for FASTA detection
@@ -422,7 +428,7 @@ def sequences_to_matrix(sequences):

N = len(sequences)
L = len(next(iter(sequences)))
matrix = np.empty((N, L), dtype=np.str)
matrix = np.empty((N, L), dtype=str)

for i, seq in enumerate(sequences):
if len(seq) != L:
@@ -569,7 +575,12 @@ def __init__(self, sequence_matrix, sequence_ids=None, annotation=None,
)

# make sure we get rid of iterators etc.
self.ids = np.array(list(sequence_ids))
self.ids = list(sequence_ids)

# turn identifiers into numpy array for consistency with previous implementation;
# but use dtype object to avoid memory usage issues of numpy string datatypes (longest
# sequence defines memory usage otherwise)
self.ids = np.array(self.ids, dtype=np.object_)

self.id_to_index = {
id_: i for i, id_ in enumerate(self.ids)
@@ -607,7 +618,7 @@ def from_dict(cls, sequences, **kwargs):
@classmethod
def from_file(cls, fileobj, format="fasta",
a3m_inserts="first", raise_hmmer_prefixes=True,
**kwargs):
split_header=False, **kwargs):
"""
Construct an alignment object by reading in an
alignment file.
@@ -625,6 +636,9 @@ def from_file(cls, fileobj, format="fasta",
HMMER adds number prefixes to sequence identifiers in Stockholm
files if identifiers are not unique. If True, the parser will
raise an exception if a Stockholm alignment has such prefixes.
split_header: bool, optional (default: False)
Only store identifier portion of each header (before first whitespace)
in identifier list, rather than full header line
**kwargs
Additional arguments to be passed to class constructor

@@ -664,6 +678,12 @@
else:
raise ValueError("Invalid alignment format: {}".format(format))

# reduce header lines to identifiers if requested
if split_header:
seqs = {
header.split()[0]: seq for header, seq in seqs.items()
}

return cls.from_dict(seqs, **kwargs)

def __getitem__(self, index):
@@ -777,7 +797,8 @@ def select(self, columns=None, sequences=None):
def apply(self, columns=None, sequences=None, func=np.char.lower):
"""
Apply a function along columns and/or rows of alignment matrix,
or to entire matrix.
or to entire matrix. Note that column and row selections are
applied independently in this particular order.

Parameters
----------
@@ -811,7 +832,7 @@ def apply(self, columns=None, sequences=None, func=np.char.lower):
mod_matrix[sequences, :] = func(mod_matrix[sequences, :])

return Alignment(
mod_matrix, np.copy(self.ids), deepcopy(self.annotation),
mod_matrix, deepcopy(self.ids), deepcopy(self.annotation),
alphabet=self.alphabet
)

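Taken together, the alignment.py changes add extension-based A3M detection (`detect_format` now accepts a `filepath`), optional header splitting in `from_file`, and object-dtype identifier arrays, so a single very long header no longer sets the storage cost of every identifier. A usage sketch under the new signatures shown in this diff; the input file name is hypothetical:

```python
import numpy as np

from evcouplings.align.alignment import Alignment, detect_format

filepath = "example.a3m"  # hypothetical input alignment

with open(filepath) as f:
    # passing the path lets detect_format recognize A3M files by extension
    fmt = detect_format(f, filepath=filepath)

with open(filepath) as f:
    # split_header=True keeps only the identifier before the first whitespace
    ali = Alignment.from_file(f, format=fmt, split_header=True)

# identifiers are stored with dtype=object rather than a fixed-width string type
print(ali.ids.dtype, ali.ids[:5])

# apply() transforms selected columns and rows independently, in that order
lowered = ali.apply(columns=np.array([0, 1]), func=np.char.lower)
```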
24 changes: 19 additions & 5 deletions evcouplings/align/protocol.py
@@ -8,7 +8,8 @@

"""

from collections import OrderedDict, Iterable
from collections import OrderedDict
from collections.abc import Iterable
import re
from shutil import copy
import os
@@ -522,11 +523,15 @@ def describe_frequencies(alignment, first_index, target_seq_index=None):
fi = alignment.frequencies
conservation = alignment.conservation()

fi_cols = {c: fi[:, i] for c, i in alignment.alphabet_map.items()}
# careful not to include any characters that are non-match state (e.g. lowercase letters)
fi_cols = {
c: fi[:, alignment.alphabet_map[c]] for c in alignment.alphabet
}

if target_seq_index is not None:
target_seq = alignment[target_seq_index]
else:
target_seq = np.full((alignment.L), np.nan)
target_seq = np.full((alignment.L, ), np.nan)

info = pd.DataFrame(
{
@@ -539,6 +544,11 @@
# reorder columns
info = info.loc[:, ["i", "A_i", "conservation"] + list(alignment.alphabet)]

# do not report values for lowercase columns
info.loc[
info.A_i.str.lower() == info.A_i, ["conservation"] + list(alignment.alphabet)
] = np.nan

return info


@@ -679,7 +689,7 @@ def existing(**kwargs):

# first try to autodetect format of alignment
with open(input_alignment) as f:
format = detect_format(f)
format = detect_format(f, filepath=input_alignment)
if format is None:
raise InvalidParameterError(
"Format of input alignment {} could not be "
@@ -1502,6 +1512,8 @@ def standard(**kwargs):
annotation_file = prefix + "_annotation.csv"
annotation = extract_header_annotation(ali_raw)
annotation.to_csv(annotation_file, index=False)
else:
annotation_file = None

# center alignment around focus/search sequence
focus_cols = np.array([c != "-" for c in ali_raw[0]])
@@ -1516,9 +1528,11 @@
outcfg = {
**jackhmmer_outcfg,
**mod_outcfg,
"annotation_file": annotation_file
}

if annotation_file is not None:
outcfg["annotation_file"] = annotation_file

# dump output config to YAML file for debugging/logging
write_config_file(prefix + ".align_standard.outcfg", outcfg)

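The `describe_frequencies` changes iterate over `alignment.alphabet` rather than the full `alphabet_map`, which also contains non-match characters such as lowercase letters, and then blank out statistics for lowercase target columns. A self-contained sketch of that masking step, using a made-up table:

```python
import numpy as np
import pandas as pd

alphabet = ["A", "C"]  # toy alphabet for illustration
info = pd.DataFrame({
    "i": [1, 2, 3],
    "A_i": ["A", "c", "C"],  # lowercase = non-match (insert) state
    "conservation": [0.9, 0.5, 0.7],
    "A": [0.8, 0.3, 0.1],
    "C": [0.2, 0.7, 0.9],
})

# do not report values for lowercase columns, as in the diff above
info.loc[
    info.A_i.str.lower() == info.A_i, ["conservation"] + alphabet
] = np.nan
print(info)
```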
8 changes: 3 additions & 5 deletions evcouplings/compare/distances.py
@@ -323,7 +323,7 @@ def _add_axis(df, axis):
else:
res_i = _add_axis(self.residues_i, "i")
res_j = _add_axis(self.residues_j, "j")
residues = res_i.append(res_j)
residues = pd.concat([res_i, res_j])

# save residue table
residue_table_filename = filename + ".csv"
@@ -770,7 +770,7 @@ def _get_col_name(col_name):
# extract coverage segments for all individual structures
segments = {
_get_col_name(col_name): find_segments(series.dropna().sort_index().index)
for col_name, series in coverage_cols.iteritems()
for col_name, series in coverage_cols.items()
}

return segments
@@ -1273,9 +1273,7 @@ def _get_chains(sifts_result):
# if no structures given, or path to files, load first
structures = _prepare_structures(
structures,
sifts_result_i.hits.pdb_id.append(
sifts_result_j.hits.pdb_id
),
set(sifts_result_i.hits.pdb_id) | set(sifts_result_j.hits.pdb_id),
raise_missing
)

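The distances.py edits are a pandas 2.x migration: `DataFrame.append` and `iteritems` were removed upstream, so the code switches to `pd.concat` and `.items()`. A minimal before/after sketch with invented frames:

```python
import pandas as pd

res_i = pd.DataFrame({"id": [1, 2], "axis": ["i", "i"]})
res_j = pd.DataFrame({"id": [3], "axis": ["j"]})

# pandas < 2.0: residues = res_i.append(res_j)
residues = pd.concat([res_i, res_j])

coverage_cols = pd.DataFrame({"1abc:A": [1.0, None], "2xyz:B": [None, 2.0]})
# pandas < 2.0: for col_name, series in coverage_cols.iteritems():
for col_name, series in coverage_cols.items():
    print(col_name, series.dropna().index.tolist())
```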