Dumping updates and data validation #149

kkaris · 2024-01-10T06:10:31Z

This PR adds the latest iteration of updates related to running the full dump after dependencies have been updated. It also adds data validation to the processors.

Dump and content updates:

PublicationType tags are added to Publication nodes, following updates in INDRA
A boolean is added to Publication and Evidence nodes and to indra_rel relationships; it is true if the Publication/Evidence is retracted or, for indra_rel, if the statement has at least one Evidence from a retracted source
Add a helper function to get the proper Neo4j boolean values from a Python boolean or condition and use it in all places that generate boolean metadata
Move the processor file locations into their own files in the indra_db and pubmed submodules to avoid circular imports
A typo in a key previously led to INDRA evidence JSONs in Evidence nodes not being validated; the typo is now fixed (see assert_valid_node())
Beliefs are now also set in the statement JSON of indra_rel relations. Previously, only the belief property of the indra_rel relations was set
Various improvements to logging and comments
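
As a rough illustration of the boolean helper and the retraction flag described in the list above, here is a minimal sketch. The names neo4j_bool and statement_is_retracted and the "retracted" key are hypothetical and are not taken from the PR; the sketch assumes that the Neo4j bulk importer expects the lowercase literals "true" and "false".

```python
from typing import Iterable, Mapping


def neo4j_bool(condition: object) -> str:
    """Return the lowercase literal read as a boolean by Neo4j's bulk importer.

    Hypothetical helper: any truthy Python value maps to "true".
    """
    return "true" if condition else "false"


def statement_is_retracted(evidences: Iterable[Mapping[str, object]]) -> str:
    """Flag a statement if at least one evidence comes from a retracted source.

    The "retracted" key is an assumption made for this sketch.
    """
    return neo4j_bool(any(ev.get("retracted", False) for ev in evidences))


evidences = [{"pmid": "1", "retracted": False}, {"pmid": "2", "retracted": True}]
print(statement_is_retracted(evidences))  # true
print(statement_is_retracted([]))  # false
```

Routing every boolean export through one helper keeps the serialized values consistent across processors, which is the motivation stated in the list above.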

Data validation:

The data validation checks that a value and its type are as expected for a given Neo4j data type as specified in the ingestion tsv file header. See the documentation for more info on Neo4j data types.
The data validation is done at dump time and is added in assert_valid_node(), which is called from both validate_nodes() and validate_relations().
Tests are added for the data validator
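
The kind of check described above can be sketched as follows. This is not the code in assert_valid_node(); split_header, validate_value and the PARSERS table are hypothetical names, only a subset of Neo4j data types is covered, special columns such as ID or LABEL are out of scope, and ";" is assumed as the array delimiter.

```python
from typing import Optional, Tuple


def parse_bool(value: str) -> bool:
    """Accept only the lowercase boolean literals."""
    if value not in ("true", "false"):
        raise ValueError(f"not a boolean literal: {value!r}")
    return value == "true"


# Subset of Neo4j import data types mapped to a parser that raises ValueError
PARSERS = {
    "int": int,
    "long": int,
    "float": float,
    "double": float,
    "boolean": parse_bool,
    "string": str,
}


def split_header(column: str) -> Tuple[str, str, bool]:
    """Split 'name:type[]' into (name, type, is_array); untyped means string."""
    name, _, dtype = column.partition(":")
    dtype = dtype or "string"
    is_array = dtype.endswith("[]")
    return name, (dtype[:-2] if is_array else dtype), is_array


def validate_value(
    column: str, value: Optional[str], array_delimiter: str = ";"
) -> None:
    """Raise ValueError if value does not match the type declared in column."""
    name, dtype, is_array = split_header(column)
    if value is None or value == "":
        return  # missing values are treated as null and skipped
    if dtype not in PARSERS:
        raise ValueError(f"{name}: unsupported data type {dtype!r}")
    items = value.split(array_delimiter) if is_array else [value]
    for item in items:
        try:
            PARSERS[dtype](item)
        except ValueError as err:
            raise ValueError(f"{name}: {item!r} is not a valid {dtype}") from err


validate_value("year:int", "2024")  # passes silently
validate_value("evidence_codes:string[]", "IDA;IEA")  # passes silently
```

Under these assumptions, a call such as validate_value("year:int", "20x4") raises a ValueError that names the offending column, so a malformed value can be caught at dump time rather than at import time.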

todo:

Process and dump nodes and edges from indra_db
Fix missing data for Publication nodes
Rectify inconsistencies in return type (source counts vs evidence counts) in curator_blueprint.py
Fix instances where duplicate nodes are generated (check import.report for hints). Duplication can happen either at the processor level (e.g. a node appears multiple times) or at the assembly level (e.g. the same node is produced by different processors).
Check unicode issue (see Unicode fix #141 and Test frontend after unicode cleaning update #142).
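
For the duplicate-node item in the list above, one possible way to separate the two cases is sketched below. The function find_duplicates, the (namespace, identifier) node key and the processor labels in the example are all assumptions made for illustration and are not part of the PR.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

NodeKey = Tuple[str, str]  # (namespace, identifier), an assumed node key


def find_duplicates(
    nodes_by_processor: Dict[str, Iterable[NodeKey]],
) -> Tuple[Dict[str, Set[NodeKey]], Dict[NodeKey, List[str]]]:
    """Return duplicates within each processor and duplicates across processors."""
    within: Dict[str, Set[NodeKey]] = {}
    producers: Dict[NodeKey, Set[str]] = defaultdict(set)
    for processor, nodes in nodes_by_processor.items():
        seen: Set[NodeKey] = set()
        repeated: Set[NodeKey] = set()
        for key in nodes:
            if key in seen:
                repeated.add(key)  # processor-level duplicate
            seen.add(key)
            producers[key].add(processor)
        if repeated:
            within[processor] = repeated
    # Assembly-level duplicates: the same key produced by several processors
    across = {
        key: sorted(names) for key, names in producers.items() if len(names) > 1
    }
    return within, across


within, across = find_duplicates(
    {
        "proc_a": [("HGNC", "1100"), ("HGNC", "1100")],
        "proc_b": [("HGNC", "1100"), ("GO", "0005634")],
    }
)
print(within)  # {'proc_a': {('HGNC', '1100')}}
print(across)  # {('HGNC', '1100'): ['proc_a', 'proc_b']}
```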

src/indra_cogex/sources/indra_db/__init__.py

cthoyt · 2024-01-23T10:11:12Z

I already fixed the bioregistry error in biopragmatics/bioregistry#1030 but the autorelease broke because another data source stopped working... fixing now in biopragmatics/bioregistry#1031.

kkaris requested review from cthoyt and bgyori January 10, 2024 06:10

kkaris force-pushed the validate-data branch 5 times, most recently from 84c3d28 to 8b1bea7 January 22, 2024 19:52

cthoyt reviewed Jan 22, 2024

src/indra_cogex/sources/indra_db/__init__.py (resolved)

kkaris self-assigned this Jan 22, 2024

kkaris marked this pull request as ready for review January 24, 2024 01:48

kkaris linked an issue Jan 24, 2024 that may be closed by this pull request

Test frontend after unicode cleaning update #142

Closed

2 tasks

bgyori reviewed Jan 31, 2024

src/indra_cogex/sources/indra_ontology/__init__.py (outdated, resolved)

bgyori reviewed Jan 31, 2024

src/indra_cogex/sources/goa/__init__.py (outdated, resolved)

bgyori reviewed Jan 31, 2024

src/indra_cogex/sources/clinicaltrials/__init__.py (outdated, resolved)

kkaris force-pushed the validate-data branch from 763466a to 3c658bb February 5, 2024 15:58

kkaris added 14 commits February 5, 2024 07:59

Make Py 3.7, 3.8 compatible

4809093

Extract article PublicationType

6f33bdc

Put PublicationType tags in pmid year file. Update file name.

c7bfe69

Add PublicationType tags to Publication Nodes

faaa266

Fix file path in wikidata processor

30ba860

Add retraction boolean to Publication Nodes

1ab476c

Check for principal DB connection before starting raw export script

469a864

Log successful detection of prerequisite resources

22bf10a

Make pmid year types file tab delimited

ce42148

Handle missing main issn value

dc9b4fe

Add shorthand for getting neo4j boolean

07dc37d

Use boolean helper in all boolean exports

d170939

Update docstring for helper

9fa8c53

Circular import fix for indra_db, pubmed

f651905

kkaris and others added 25 commits February 5, 2024 07:59

Update so data display can be run on its own

e818d87

Remove now unnecessary unicode escaping

b2ed050

Remove unused functions, tests related to old unicode escaping

7145b71

Filter out duplicates for ResearchProject nodes in NihReporter

af0150f

Fix warning

e6d29dd

Set node labels to assemble automatically

0fdf26c

Fix f-string

efd0419

Add pusher checks

1f960e6

Set evidence codes as string array

fc96467

Remove redundant name setting

28c17e2

Restore to un-standardized nodes in GoaProcessor

e325615

Make ec-codes string array

1f2d561

Catch and raise error when no nodes/relations are generated from processor

fcdc00b

Straighten out node standardization in ClinicaltrialsProcessor

fc40e08

Unstandardize GO, HGNC in InterproProcessor

fb6b100

Unstandardize HGNC nodes for CCLE Mutations and Cna processors

f03d012

Unstandardize drug mapping nodes, use provided name instead

a51d706

Handle null values in data validation

caaf350

Handle non-string values in data validation

b948836

Filter out null data in arrays

05d5494

Add mock nodes and relation in mock processor testing

05cfbc0

Return hgnc entries only

8778435

Update test for get genes for go term

620a94f

Update drug in clinicaltrial

8364132

Compress isni in relations as well

7ef4314

kkaris force-pushed the validate-data branch from 3c658bb to 7ef4314 February 5, 2024 16:00

kkaris mentioned this pull request Feb 5, 2024

Remove scipy dependency constraint #156

Merged

bgyori approved these changes Feb 5, 2024

bgyori merged commit fa9a217 into gyorilab:main Feb 5, 2024
4 checks passed

kkaris deleted the validate-data branch February 5, 2024 16:56
