SGD IBA influx at source causes false positive(?) in sanity checks #2371

Open
kltm opened this issue Oct 7, 2024 · 4 comments

Comments

@kltm
Member

kltm commented Oct 7, 2024

During the most recent snapshot run, we failed on an SGD sanity check.

Essentially, the SGD source file had 152076 annotations and the final file had 65609, a reduction of about 86k (roughly 57%). This large reduction triggered a failsafe (good!).

Looking into it, I currently believe the issue is with IBAs.

The line count of filtered incoming IBAs is about 100477 and the line count of injected IBAs is about 15330, a net loss of roughly 85k lines; that would account for the bulk of the drop.
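To make the accounting explicit, here is a small sketch of the arithmetic using only the counts quoted in this comment. The variable names are made up for illustration, and the IBA figures are the approximate line counts given above, not values read from the pipeline.

```python
# Counts quoted in this issue; the two IBA figures are approximate line counts.
source_total = 152076   # annotations in the SGD source file
final_total = 65609     # annotations in the final file
iba_filtered = 100477   # incoming (upstream) IBA lines that were filtered out
iba_injected = 15330    # GOC IBA lines that were injected

observed_drop = source_total - final_total   # 86467
net_iba_loss = iba_filtered - iba_injected   # 85147

# Fraction of the observed drop that the IBA swap would explain.
share_of_drop = net_iba_loss / observed_drop

print(f"observed drop: {observed_drop}")
print(f"net IBA loss:  {net_iba_loss}")
print(f"share of drop explained by IBAs: {share_of_drop:.1%}")
```

Under these numbers, the IBA filter-and-inject step alone would explain nearly all of the reduction.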

As GOC is the canonical source, we're doing the right thing here and we can (and temporarily will) suppress the SGD sanity check, but the IBA noise does limit the use of this primitive check.

Tagging @pgaudet @suzialeksander

@kltm
Member Author

kltm commented Oct 7, 2024

Looking around, this has been an "issue" since around the last release; I'm guessing it is related to new code in one way or another. There has also been a reduction in the SGD upstream size.
I'm honestly not sure how the sanity checks have not been triggered before this. I'm going to pause the current snapshot attempt for the moment, waiting for feedback.

@kltm
Member Author

kltm commented Oct 7, 2024

The reason it specifically seems to have ticked over into failure is that it crossed over the 50% reduction mark.
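For illustration, a check of this kind could look roughly like the following. This is a minimal sketch, not the pipeline's actual code: the function names and the `max_reduction` parameter are hypothetical, and only the 50% threshold and the two counts come from this thread.

```python
def reduction_fraction(source_count: int, final_count: int) -> float:
    """Fraction of source annotations missing from the final file."""
    if source_count <= 0:
        raise ValueError("source_count must be positive")
    return (source_count - final_count) / source_count


def passes_sanity_check(source_count: int, final_count: int,
                        max_reduction: float = 0.5) -> bool:
    """Hypothetical check: fail when more than max_reduction is lost."""
    return reduction_fraction(source_count, final_count) <= max_reduction


# SGD counts quoted earlier in this issue.
sgd_reduction = reduction_fraction(152076, 65609)
print(f"SGD reduction: {sgd_reduction:.1%}")
print("passes:", passes_sanity_check(152076, 65609))
```

With the SGD counts from this issue the reduction comes out a little under 57%, which is over the 50% mark.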

@dustine32
Contributor

Checking the SGD report, this could be due to recent changes in ID checking code:

WARNING - Invalid identifier: GORULE:0000027: 2144215 does not match any id_syntax patterns for MGI in dbxrefs (MGI:MGI:2144215) -- SGD S000005027 SAL1 enables GO:0005347 GO_REF:0000033 IBA MGI:MGI:2144215 F ADP/ATP transporter YNL083W|Ca(2+)-binding ATP:ADP antiporter SAL1 protein taxon:559292 20231109 GO_Central UniProtKB:D6W196

The warning message suggests that the bare 2144215 is being matched against the MGI regex pattern MGI:[0-9]{5,}, which it would not satisfy:

id_syntax: MGI:[0-9]{5,}
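A quick way to see the mismatch is to test the pattern against both forms of the local ID. This is only a sketch of the suspected behavior: the `split_curie` helper is hypothetical, and how the real ID-checking code splits `MGI:MGI:2144215` has not been confirmed here.

```python
import re

# id_syntax pattern quoted above for MGI.
MGI_ID_SYNTAX = re.compile(r"MGI:[0-9]{5,}")


def split_curie(curie: str) -> tuple[str, str]:
    """Hypothetical helper: split on the first colon only."""
    prefix, _, local_id = curie.partition(":")
    return prefix, local_id


dbxref = "MGI:MGI:2144215"

# Splitting on the first colon keeps the inner "MGI:" in the local ID.
prefix, local_id = split_curie(dbxref)
first_colon_ok = bool(MGI_ID_SYNTAX.fullmatch(local_id))

# Splitting on the last colon leaves only the digits, as in the warning.
bare_id = dbxref.rsplit(":", 1)[-1]
bare_ok = bool(MGI_ID_SYNTAX.fullmatch(bare_id))

print(local_id, first_colon_ok)  # MGI:2144215 True
print(bare_id, bare_ok)          # 2144215 False
```

If the checker ends up with the bare digits, the warning above is what one would expect to see.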

That said, the report lists only 8361 such lines, and they are just WARNINGs, so they are not necessarily dropped lines. I also haven't really confirmed this in a debugger. @mugitty would you be able to debug these SGD IBA lines? This may not be the cause, mainly a hunch. What do you think?

@kltm
Member Author

kltm commented Oct 9, 2024

@dustine32 It's not so much the warnings (which aren't great), but the fact that there are soooo many upstream IBAs that we get significantly closer to the sanity check trigger just by filtering them and injecting our own, as desired.
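As a rough illustration of how much the IBAs skew the check, the same reduction can be recomputed with IBA lines left out of both sides. This is a sketch under two loud assumptions: that the roughly 100477 filtered lines are all of the IBAs in the source file, and that the roughly 15330 injected lines are all of the IBAs in the final file. The names are hypothetical and none of this is the pipeline's actual logic.

```python
# Counts quoted earlier in this issue.
source_total, final_total = 152076, 65609
iba_filtered, iba_injected = 100477, 15330

# ASSUMPTION: every source IBA was filtered, every final IBA was injected.
source_non_iba = source_total - iba_filtered   # 51599
final_non_iba = final_total - iba_injected     # 50279

raw_reduction = (source_total - final_total) / source_total
non_iba_reduction = (source_non_iba - final_non_iba) / source_non_iba

print(f"reduction over all lines:     {raw_reduction:.1%}")
print(f"reduction excluding IBAs:     {non_iba_reduction:.1%}")
```

Under those assumptions the non-IBA reduction is only a few percent, so almost all of the distance to the trigger comes from the IBA swap.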
