Implement node assembly and standardize identifiers #24

bgyori · 2021-05-13T20:02:21Z

This PR makes the following changes:

Implements a simple node assembler that aggregates nodes by ID and merges their labels and properties.
Includes node assembly in the sources CLI. For now, the individual processor node files are also dumped for traceability, but these are redundant and could in principle be removed or made optional. Fixes "New importer doesn't take ordering into account, properties missing" (#21).
Adds source as one of the properties on every relation. Fixes "Add source info to each relation" (#11).
Adds a generic Statement label to every INDRA Statement relation. Fixes "Tag statements with common tag" (#20).
Uses INDRA standard IDs in every source and in the Node and Relation classes, which now split each identifier into separate ns and id parts to make conversions easier. CURIEs are created only when exporting Nodes and Relations. The Neo4jClient will be adapted to these changes in another PR. Fixes "Pathway sources should use INDRA-standard gene/protein IDs" (#22). Partially fixes "Identifier standards need to be enforced when querying" (#23).
Adds validation when dumping nodes and relations, and skips invalid nodes and relations. Fixes "Namespace-identifier pairs should be validated from each source" (#25).
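
As a rough illustration of the assembly step described above, here is a minimal sketch. It is not the code in this PR: the `Node` class below is a hypothetical stand-in for `indra_cogex.representation.Node`, the field names `db_ns`, `db_id`, `labels`, and `data` are assumptions, and the "first value wins" rule for conflicting properties is only one possible policy.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for indra_cogex.representation.Node."""

    db_ns: str  # namespace part of the identifier (assumed field name)
    db_id: str  # ID part of the identifier (assumed field name)
    labels: List[str]
    data: Dict[str, Any] = field(default_factory=dict)

    def curie(self) -> str:
        # The CURIE string is only built at export time
        return f"{self.db_ns}:{self.db_id}"


def assemble_nodes(nodes: Iterable[Node]) -> List[Node]:
    """Group nodes by (namespace, ID) and merge labels and properties."""
    groups: Dict[Tuple[str, str], List[Node]] = defaultdict(list)
    for node in nodes:
        groups[(node.db_ns, node.db_id)].append(node)

    assembled = []
    for (db_ns, db_id), group in groups.items():
        # Union of all labels, sorted for a deterministic result
        labels = sorted({label for n in group for label in n.labels})
        data: Dict[str, Any] = {}
        for n in group:
            for key, value in n.data.items():
                # Assumed conflict policy: keep the first value seen
                data.setdefault(key, value)
        assembled.append(Node(db_ns, db_id, labels, data))
    return assembled


# Illustrative input; the labels and property values are made up
merged = assemble_nodes(
    [
        Node("HGNC", "6871", ["BioEntity"], {"name": "MAPK1"}),
        Node("HGNC", "6871", ["Gene"], {"source": "example"}),
        Node("HGNC", "1097", ["BioEntity"], {"name": "BRAF"}),
    ]
)
```

Grouping on the `(db_ns, db_id)` pair rather than on a CURIE string keeps the merge independent of whatever string format is chosen at export time.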

bgyori · 2021-05-13T20:04:11Z

I will now test this in practice by making a new build and will push any fixes here. Until then, feel free to review/comment, @cthoyt.

bgyori · 2021-05-13T23:41:42Z

src/indra_cogex/sources/pathways/__main__.py

+
+if __name__ == "__main__":
+    ReactomeProcessor.cli()
+    WikipathwaysProcessor.cli()


Because of the way cli is implemented, the second call here, to WikipathwaysProcessor.cli(), is somehow never reached. @cthoyt, if you know why, please push a fix.

Hmm, I don't know why this isn't working.
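
One possible explanation, offered as an assumption because the thread does not show how `cli` is implemented: if `cli` is a click command, calling it in click's default standalone mode ends with `sys.exit()`, so nothing after the first call runs. The stdlib-only sketch below imitates that behaviour with two hypothetical entry points (`reactome_cli` and `wikipathways_cli` are made-up names) and shows one way to run both by catching `SystemExit`.

```python
import sys
from typing import Callable, Iterable, List

calls: List[str] = []  # records which entry points actually ran


def make_cli(name: str) -> Callable[[], None]:
    """Build a hypothetical entry point that exits the interpreter when done."""

    def cli() -> None:
        calls.append(name)
        sys.exit(0)  # mimics a CLI that never returns to its caller

    return cli


reactome_cli = make_cli("reactome")
wikipathways_cli = make_cli("wikipathways")


def run_all(clis: Iterable[Callable[[], None]]) -> None:
    """Run each entry point in turn, swallowing only successful exits."""
    for cli in clis:
        try:
            cli()
        except SystemExit as exc:
            # Re-raise if the entry point reported a failure
            if exc.code not in (0, None):
                raise


run_all([reactome_cli, wikipathways_cli])
```

If the commands are indeed click commands, passing `standalone_mode=False` when calling them is another option, since click then returns to the caller instead of exiting.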

bgyori · 2021-05-14T02:16:27Z

This is basically working now. One thing I haven't figured out how best to refactor is the integration of node assembly with cached node generation from individual processors. Currently, indra_cogex.representation.Node objects use INDRA-standardized namespace and ID pairs, whereas the dumped tables use a single CURIE string with a different namespace and ID standard. Assembly works directly on Node objects, which means that assembly can only happen if nodes are collected from processors. In principle, we could re-read the dumped node tables into Node objects, assemble them, and dump them again into an assembled table, but that seems a bit "dirty" to me compared with just working with the original Node objects that the processors generate.

src/indra_cogex/sources/indra_db/__init__.py

cthoyt · 2021-05-14T21:36:54Z

Overall, this looks good. I think we should have started by keeping the prefix/id separate for each node. I made a comment on the code, but validating prefix/id pairs and canonicalizing them should be operations implemented in INDRA, right?
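
To make the validate-and-canonicalize idea concrete, here is a small sketch under stated assumptions. The pattern table, the function names, and the canonicalization rules are all illustrative; they are not INDRA's API and not an actual identifier registry.

```python
import re
from typing import Dict, Iterable, List, Pattern, Tuple

# Illustrative patterns only, not an authoritative registry
ID_PATTERNS: Dict[str, Pattern[str]] = {
    "HGNC": re.compile(r"\d+"),
    "CHEBI": re.compile(r"CHEBI:\d+"),
}


def canonicalize(db_ns: str, db_id: str) -> Tuple[str, str]:
    """Normalize a prefix/id pair to the (assumed) standard form."""
    db_ns = db_ns.strip().upper()
    db_id = db_id.strip()
    # Assumed convention: CHEBI IDs carry the prefix inside the ID itself
    if db_ns == "CHEBI" and not db_id.startswith("CHEBI:"):
        db_id = f"CHEBI:{db_id}"
    return db_ns, db_id


def validate(db_ns: str, db_id: str) -> bool:
    """Return True if the namespace is known and the ID matches its pattern."""
    pattern = ID_PATTERNS.get(db_ns)
    return pattern is not None and pattern.fullmatch(db_id) is not None


def clean_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Canonicalize each pair and skip the ones that fail validation."""
    kept = []
    for db_ns, db_id in pairs:
        db_ns, db_id = canonicalize(db_ns, db_id)
        if validate(db_ns, db_id):
            kept.append((db_ns, db_id))
    return kept


cleaned = clean_pairs(
    [("hgnc", "6871"), ("chebi", "15377"), ("HGNC", "HGNC:6871"), ("UNKNOWN", "1")]
)
```

Keeping both steps behind two small functions would make it straightforward to swap in a shared implementation later.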

Also, now that we have the nice CI tests for code docs, we have to do all of them ;)

bgyori · 2021-05-16T02:41:55Z

The tests are erroring on GHA with ModuleNotFoundError: No module named 'indra_cogex'. With this setup, I'm not sure what to change for the package to be visible.

bgyori · 2021-05-16T02:50:27Z

I'm honestly not sure that this much strict linting and mypy checking is productive; we need to be able to move on and focus on functionality.

bgyori · 2021-05-16T20:45:01Z

> The tests are erroring on GHA with ModuleNotFoundError: No module named 'indra_cogex'. With this setup, I'm not sure what to change for the package to be visible.

@cthoyt I'm not sure why this happens. I haven't used tox with pytest in this way before, so I'm a bit lost. What would we need to change to avoid the "ModuleNotFoundError: No module named 'indra_cogex'" errors on GHA? See, e.g., https://github.com/bgyori/indra_cogex/runs/2593007487?check_suite_focus=true.

I thought the `__init__.py` wasn't required anymore but maybe it still is

bgyori added 10 commits May 13, 2021 11:31

Implement initial node assembler

0acc17d

Split namespace and ID in representation

1fe8a0c

Update pathway source identifiers

9c1d8e6

Update namespace/ID standards in sources

ab08e7d

Restructure node generation

2a92bed

Restructure relation generation

2e48902

Reformat processor with new ID structure

e554bcf

Test ID normalization

1eee55a

Implement node assembly upon import

5950634

Use new representation in assembler

482ddd7

bgyori requested a review from cthoyt May 13, 2021 20:04

bgyori mentioned this pull request May 13, 2021

Namespace-identifier pairs should be validated from each source #25

Closed

bgyori added 6 commits May 13, 2021 16:29

Fix Bgee source identifiers

e36e966

Fix a corner case for identifiers mapping

d6174fb

Implement validation and fix Relation str

104e087

Fix GO node construction

877e41d

Fix pathways prefix and add main

857d4e4

Handle UP isoforms

259d75a

bgyori commented May 13, 2021

bgyori added 6 commits May 13, 2021 21:17

Implement ID fixing for SIF dump

58b3b92

Improve the import approach, still issues to fix

86df4c2

Switch to f-string

1d57332

Fix assembly code

3b7b2cf

Change implementation of conflicts and test

434f9a0

Fix assembled node dumping

b890977

bgyori and others added 3 commits May 14, 2021 00:40

Change labels to rel_type and fix node label separator

a93c41a

Fix more ID issues on import

84fa3f4

Update meta

ea9286c

cthoyt reviewed May 14, 2021

src/indra_cogex/sources/indra_db/__init__.py (outdated)

cthoyt reviewed May 14, 2021

src/indra_cogex/sources/indra_db/__init__.py (outdated)

bgyori added 2 commits May 15, 2021 22:34

Generalize ID fixing to name spaces

5571b78

Add total and description

ba185df

Fix mypy issues

3538529

Just test the code

c3802d2

cthoyt and others added 5 commits May 16, 2021 16:52

Address issue with install

736550e

I thought the `__init__.py` wasn't required anymore but maybe it still is

Fix import

7d876c0

Add pyobo as requirement

be3300e

Install INDRA from github

983d576

Fix ID fixing test

e7bb294

This was referenced May 17, 2021

python -m indra_cogex.sources.pathways misses second cli #26

Closed

Improve processing/import CLI workflow #27

Closed

bgyori merged commit 477df79 into main May 17, 2021

bgyori deleted the inputs branch May 17, 2021 02:03
