Dumping updates and data validation #149

Merged: 110 commits, Feb 5, 2024

kkaris (Collaborator) commented Jan 10, 2024

This PR adds the latest iteration of updates related to running the full dump after dependencies have been updated. It also adds data validation to the processors.

Dump and content updates:

  • PublicationType tags are added to Publication nodes, following updates in INDRA
  • A boolean property is added to Publication nodes, Evidence nodes and indra_rel relationships; it is true if the Publication/Evidence is retracted or if the statement has at least one Evidence from a retracted source
  • Add a helper function that converts a Python boolean or condition to the proper Neo4j boolean value, and use it everywhere boolean metadata is generated
  • Put the locations of processor files in their own files in the indra_db and pubmed submodules to avoid circular imports
  • A typo in a key previously led to the INDRA evidence JSONs in Evidence nodes not being validated; the typo is now fixed (see assert_valid_node())
  • Beliefs are now also set in the statement JSON of indra_rel relations. Previously, this was only done for the belief property of the indra_rel relations
  • Various improvements to logging and comments
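
As a rough illustration of the boolean helper and the retraction flag described in the list above (this is not the code in this PR): the names `neo4j_bool` and `statement_retracted` and the `"retracted"` dictionary key are hypothetical. The sketch assumes the importer expects the lowercase literals `true`/`false` in the TSV, which `str(True)` in Python does not produce.

```python
from typing import Iterable, Literal, Mapping

Neo4jBool = Literal["true", "false"]


def neo4j_bool(condition: object) -> Neo4jBool:
    """Map any truthy/falsy Python value to a lowercase boolean literal."""
    # str(True) would give "True", so the literal is spelled out explicitly.
    return "true" if condition else "false"


def statement_retracted(evidences: Iterable[Mapping[str, object]]) -> Neo4jBool:
    """Flag a statement if at least one evidence comes from a retracted source.

    The "retracted" key is an assumed field name used only for illustration.
    """
    return neo4j_bool(any(ev.get("retracted", False) for ev in evidences))
```

Routing every boolean through one helper keeps the serialized form consistent across processors, so a stray `"True"` or `1` cannot slip into a dump file.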

Data validation:

  • The data validation checks that a value and its type are as expected for a given Neo4j data type, as specified in the header of the ingestion TSV file. See the documentation for more info on Neo4j data types.
  • The validation runs at dump time and is implemented in assert_valid_node(), which is called from both validate_nodes() and validate_relations().
  • Tests are added for the data validator
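
The kind of check described above could look roughly like the following. This is a minimal sketch, not the validator added in this PR: the function name `validate_value` and the `CHECKS` table are hypothetical, values are assumed to be the string cells of the TSV, only a subset of Neo4j data types is covered, and special columns such as `:ID` or `:LABEL` are out of scope. It assumes headers of the form `name:type`, with `[]` marking an array and `;` as the array delimiter.

```python
from typing import Callable, Dict


def _is_int(value: str) -> bool:
    try:
        int(value)
    except ValueError:
        return False
    return True


def _is_float(value: str) -> bool:
    try:
        float(value)
    except ValueError:
        return False
    return True


# Per-type checks for a subset of Neo4j import data types (illustrative only).
CHECKS: Dict[str, Callable[[str], bool]] = {
    "int": _is_int,
    "long": _is_int,
    "float": _is_float,
    "double": _is_float,
    "boolean": lambda v: v in {"true", "false"},
    "string": lambda v: True,
}


def validate_value(header: str, value: str, array_delimiter: str = ";") -> None:
    """Raise if ``value`` does not match the type declared in ``header``."""
    name, sep, dtype = header.partition(":")
    if not sep:
        dtype = "string"  # a column without a declared type is treated as string
    is_array = dtype.endswith("[]")
    base = dtype[:-2] if is_array else dtype
    check = CHECKS.get(base)
    if check is None:
        raise TypeError(f"Unsupported data type {dtype!r} in header {header!r}")
    # Arrays are validated element by element.
    items = value.split(array_delimiter) if is_array else [value]
    for item in items:
        if not check(item):
            raise ValueError(f"{name}: {item!r} is not a valid {base}")
```

For example, `validate_value("belief:float", "0.87")` passes silently, while `validate_value("retracted:boolean", "True")` raises a `ValueError` because only the lowercase literals are accepted in this sketch.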

todo:

  • Process and dump nodes and edges from indra_db
  • Fix missing data for Publication nodes
  • Rectify inconsistencies in return type (source counts vs evidence counts) in curator_blueprint.py
  • Fix instances where duplicate nodes are generated (check import.report for hints). Duplication can happen either at the processor level (e.g. a node appears multiple times) or at the assembly level (e.g. the same node is produced by different processors).
  • Check unicode issue (see Unicode fix #141 and Test frontend after unicode cleaning update #142).
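
For the duplicate-node item above, one possible way to tell the two cases apart is sketched below. Everything here is an assumption made for illustration: the function `find_duplicates`, the idea of identifying a node by a `(label, identifier)` pair, and the input shape (a mapping from processor name to the node keys it emits) are not taken from the codebase.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

NodeKey = Tuple[str, str]  # (label, identifier): an assumed way to identify a node


def find_duplicates(
    nodes_by_processor: Dict[str, Iterable[NodeKey]],
) -> Tuple[Dict[str, Set[NodeKey]], Dict[NodeKey, List[str]]]:
    """Return (processor-level duplicates, assembly-level duplicates)."""
    within: Dict[str, Set[NodeKey]] = {}
    producers: Dict[NodeKey, Set[str]] = defaultdict(set)
    for processor, nodes in nodes_by_processor.items():
        seen: Set[NodeKey] = set()
        repeated: Set[NodeKey] = set()
        for key in nodes:
            if key in seen:
                # Same node emitted more than once by one processor.
                repeated.add(key)
            seen.add(key)
            producers[key].add(processor)
        if repeated:
            within[processor] = repeated
    # Same node emitted by more than one processor.
    across = {key: sorted(procs) for key, procs in producers.items() if len(procs) > 1}
    return within, across
```

The first returned mapping covers the processor-level case, the second the assembly-level case; a node can appear in both.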

kkaris requested review from cthoyt and bgyori on January 10, 2024 06:10
kkaris force-pushed the validate-data branch 5 times, most recently from 84c3d28 to 8b1bea7, on January 22, 2024 19:52
kkaris self-assigned this on Jan 22, 2024

cthoyt (Member) commented Jan 23, 2024

I already fixed the bioregistry error in biopragmatics/bioregistry#1030 but the autorelease broke because another data source stopped working... fixing now in biopragmatics/bioregistry#1031.

kkaris marked this pull request as ready for review on January 24, 2024 01:48
kkaris linked an issue on Jan 24, 2024 that may be closed by this pull request
2 tasks
bgyori merged commit fa9a217 into gyorilab:main on Feb 5, 2024
4 checks passed
kkaris deleted the validate-data branch on February 5, 2024 16:56