Explore use of RDF store on back-end #39
Comments
I'm assuming there are pieces I do not understand here about scaling and size of models, but I'm wondering whether the added complexity of a triple-store is worth it, unless we get to move some of the API services minerva provides into the triple-store service. Is that the idea? As a developer, I like the use of the filesystem to store models. Under the current architecture, Minerva is presumed to 'own' (not share) the models directory, and it only needs to load an 'active' model that is being edited, as well as maintain an index of the models on disk. It currently provides enough of an API to let Noctua get and save the models, while isolating clients like Noctua from the storage format. I think @cmungall is proposing that we isolate the model-serving capability that Minerva provides into a separate server; that this server will provide SPARQL-over-REST instead of the proprietary Minerva model-serving API; and that it will know about NGs instead of Models. Minerva will be a client of this new service, and Noctua may talk directly to the new service for index/metadata purposes, leaving Minerva to do what is left (reasoning and semantically correct model manipulation?). Please let me know about flawed assumptions/conclusions above. |
I thought the owlapi automatically imported from stored RDF triples (I have done it before):
If these are small, storing them in the file-system (with some sort of locking mechanism) or whatever you happen to have lying around is fine. Is there a reason not to load the RDF into SciGraph? Creating an RDF loader (if it doesn’t exist) might not be worth it, but you could probably create one off of the OWLAPI. |
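As an aside, here is a rough sketch of what "create one off of the OWLAPI" could look like: the OWL API already parses RDF serializations into axioms, so an RDF loader is mostly a thin wrapper around a load call. This is only an illustration, not existing Minerva or SciGraph code; the class name and file path are placeholders, and it assumes OWL API 4.x on the classpath.

```java
// Hypothetical sketch: using the OWL API as an RDF loader for a stored model.
// Class name and file path are placeholders; assumes OWL API 4.x.
import java.io.File;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyCreationException;
import org.semanticweb.owlapi.model.OWLOntologyManager;

public class ModelLoaderSketch {
    public static OWLOntology loadModel(File modelFile) throws OWLOntologyCreationException {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        // The OWL API parses RDF serializations (RDF/XML, Turtle, ...) into OWL axioms,
        // so a "loader" is essentially this one call plus whatever post-processing you need.
        return manager.loadOntologyFromOntologyDocument(modelFile);
    }
}
```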
All said, SG is still on the cards as a possible solution here |
I have a working implementation of this now for Blazegraph. You can test it using the blazegraph-store branch. At the moment the Blazegraph store has replaced the file store, but I am planning to make it a choice. I need to do some refactoring to clean up duplicate code between the Blazegraph store and the file store. This first implementation is the most minimal usage of a triplestore backend: it reads and writes entire models at a time, simply replacing the file read and write steps. Things to know:
|
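For orientation (this is not the blazegraph-store branch code itself), a minimal sketch of the read/write pattern described above: each model lives in its own named graph, and a save flushes and replaces that whole graph, much like rewriting a file. It assumes Blazegraph's openrdf (Sesame) bindings; the property key, journal path, and graph IRI handling are placeholders.

```java
// Sketch only: replacing a whole model, stored as one named graph,
// in an embedded Blazegraph journal. Paths and setup are placeholders.
import java.io.File;
import java.util.Properties;
import org.openrdf.model.URI;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.rio.RDFFormat;
import com.bigdata.rdf.sail.BigdataSail;
import com.bigdata.rdf.sail.BigdataSailRepository;

public class BlazegraphStoreSketch {
    public static void replaceModel(File journal, URI modelGraph, File turtleFile) throws Exception {
        Properties props = new Properties();
        props.setProperty(BigdataSail.Options.FILE, journal.getAbsolutePath());
        BigdataSailRepository repo = new BigdataSailRepository(new BigdataSail(props));
        repo.initialize();
        RepositoryConnection conn = repo.getConnection();
        try {
            conn.begin();
            conn.clear(modelGraph); // flush the old named graph, analogous to deleting the file
            conn.add(turtleFile, modelGraph.stringValue(), RDFFormat.TURTLE, modelGraph); // write the new contents
            conn.commit();
        } finally {
            conn.close();
            repo.shutDown();
        }
    }
}
```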
From our conversation, as you explore blazegraph and company for the backend, there are a few items to keep in mind when thinking about how minerva currently operates. Once we have a better understanding of the backend, these may be better off as separate tickets for the experimental branch.
|
Users will eventually want more powerful queries for model finding, which ties in with geneontology/noctua#1 -- but the fields mentioned should be good for now
Yes, for model versioning, model per file is useful. For resolution, I'd imagine amigo pages. Also, the issue with a single file is that you either have to explicitly model how each axiom belongs to an ontology (using reification, ugh), or use a quad format like trig
Rejoice! |
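To illustrate the quad-format point above: once each statement carries a named-graph context, the "which model does this triple belong to" question is answered by the fourth component, with no per-axiom reification. A hedged sketch using the openrdf API follows; the IRIs are placeholders, not real model identifiers.

```java
// Sketch: the named-graph context records which model a triple belongs to,
// so a TriG dump keeps model membership that a single Turtle file would lose.
// All IRIs below are placeholders.
import org.openrdf.model.Statement;
import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.model.impl.ValueFactoryImpl;

public class QuadSketch {
    public static Statement example() {
        ValueFactory vf = ValueFactoryImpl.getInstance();
        URI model = vf.createURI("http://model.geneontology.org/EXAMPLE");
        URI subj  = vf.createURI("http://model.geneontology.org/EXAMPLE/individual-1");
        URI pred  = vf.createURI("http://purl.obolibrary.org/obo/RO_0002333");
        URI obj   = vf.createURI("http://model.geneontology.org/EXAMPLE/individual-2");
        // The fourth argument is the context (named graph), i.e. the model IRI.
        return vf.createStatement(subj, pred, obj, model);
    }
}
```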
@cmungall There are two use cases for the first item ("complete metadata get"), and while they have a lot of overlap, I think the divergence will eventually (not now, but later on) point to two implementations. I totally agree that for now we want epione gone ASAP and that all metadata should be gotten directly from the graph backend, via minerva. However, I have my doubts that this will scale long-term, and I suspect it will start making architecture and communication needlessly complicated. So looking down the road a bit, I just wanted to clarify what I think the use cases are. (I think we may already have a discussion of this in the tracker somewhere? I couldn't find it...)
I guess that's all to say that we should keep in mind that the increased number of fields is a (much better/saner) workaround for something that will need to be revisited for the long-term solution. |
@kltm I don't completely follow:
There is a lot of overlap in your lists. But anyway I don't see that the current metadata service is limiting the annotations returned at all. I did reimplement it so that it uses a SPARQL query to get annotations on all the stored models rather than loading each one from a file: fcffda1 Can you let me know if this version is sufficient for work to proceed on updating Noctua to rely on Minerva for this? Next I'll work on importing and exporting models to/from the database. |
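For readers following along, the kind of query alluded to above (fetch annotations for all stored models in one pass, instead of loading each file) could look roughly like the sketch below. This is not the fcffda1 code, just an illustration; it assumes the convention that each model's named graph IRI equals the model IRI, and that model-level annotations hang off an owl:Ontology node.

```java
// Sketch (not the actual fcffda1 implementation): one SPARQL query over all
// named graphs to collect ontology-level annotations for every stored model.
import org.openrdf.query.BindingSet;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;

public class ModelMetadataSketch {
    private static final String QUERY =
        "PREFIX owl: <http://www.w3.org/2002/07/owl#> " +
        "SELECT ?model ?property ?value WHERE { " +
        "  GRAPH ?model { ?model a owl:Ontology ; ?property ?value . } " +
        "}";

    public static void printMetadata(Repository repo) throws Exception {
        RepositoryConnection conn = repo.getConnection();
        try {
            TupleQueryResult result =
                conn.prepareTupleQuery(QueryLanguage.SPARQL, QUERY).evaluate();
            while (result.hasNext()) {
                BindingSet row = result.next();
                System.out.println(row.getValue("model") + "\t"
                    + row.getValue("property") + "\t" + row.getValue("value"));
            }
            result.close();
        } finally {
            conn.close();
        }
    }
}
```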
@balhoff Sorry--I just listed everything in both sets; a replacement would be the union of the two. The fields needed for a replacement, and the removal of epione from service, would be: |
Remind me where the docs for the over-the-wire JSON are again? |
@cmungall which direction? |
@balhoff This is actually kinda interesting. My druthers would be that the Noctua production machine is not the same machine that is grinding and producing the GAF (a different Jenkins profile, etc.). That would mean dumping to a repo so somebody else could produce the GAF. Any thoughts on this? |
Both, but any is better than none. Good point; we need to make sure this continues to run.
We can leave the job unaltered if we continue to dump model OWL from blazegraph. But that may not be the optimal way to distribute things moving forward. See also: geneontology/go-site#172 This is the command that makes the GAFs, from the build above:
I don't believe it actually needs things split into model-per-file; a single triple dump would work as well |
Not to drag this issue further afield, but documentation for barista/minerva is mostly tests and the request API overview: |
For a single file dump, you don't think that versioning would be a little weird? I think we might also hit repo limits faster that way... |
FWIW the multi-file dump is implemented now. |
But it uses the Sesame Turtle writer and not OWL API. So we may need to evaluate how consistent the output order is, and configuration of prefixes. |
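On the consistency question, prefixes can be set explicitly on the Sesame writer, and one blunt way to make repeated dumps diff cleanly is to sort the statements before handing them to the writer. The sketch below is only an illustration of that idea, not the dump code in the branch; the prefix choices and the toString-based sort are placeholders.

```java
// Sketch: configuring prefixes on the Sesame Turtle writer and sorting statements
// so repeated dumps of the same model come out in a stable order.
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import org.openrdf.model.Statement;
import org.openrdf.rio.RDFFormat;
import org.openrdf.rio.RDFHandlerException;
import org.openrdf.rio.RDFWriter;
import org.openrdf.rio.Rio;

public class TurtleDumpSketch {
    public static void dump(List<Statement> statements, Writer out) throws RDFHandlerException {
        List<Statement> sorted = new ArrayList<Statement>(statements);
        sorted.sort((a, b) -> a.toString().compareTo(b.toString())); // crude but deterministic ordering
        RDFWriter writer = Rio.createWriter(RDFFormat.TURTLE, out);
        writer.startRDF();
        writer.handleNamespace("obo", "http://purl.obolibrary.org/obo/");   // example prefixes only
        writer.handleNamespace("dc", "http://purl.org/dc/elements/1.1/");
        for (Statement st : sorted) {
            writer.handleStatement(st);
        }
        writer.endRDF();
    }
}
```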
That's great. What is the workflow for this? Has it been implemented as an internal command, maybe added to or taking the place of the current save wire command? |
Right now you can use command |
@kltm @cmungall I've put up instructions for Blazegraph mode here: https://github.com/geneontology/minerva/blob/blazegraph-store/INSTRUCTIONS.md#using-the-blazegraph-model-store Let me know if anything is missing. |
All good this far:
Then:
|
What's |
No idea, blew it away. |
What is the correct catalog that we should be using?
|
I think it would be worthwhile to just check this against the GO, OBO, and Monarch models, if that hasn't been done already. Also, could you add the proper command(s) to just run minerva in standard server mode? It looks like the current examples are for journal creation and dumps--what would the command be (the one I'm assuming we'd use after we create the first journal) to just run Minerva as normal? |
We should never rely on catalogs. An ontology import chain should always resolve in the absence of a catalog. However, catalogs are useful for caching purposes. That's an odd error message, I'd expect a list of output from each parser. I don't know if this was just an I/O error (which will hopefully be helped after we cloudfront more) or something to do with the fact that you're using what is likely a stale abandoned catalog in the experimental folder of svn. We'll want jenkins jobs that simulate minerva startup (at least for GO and Monarch instances) to help track down things like this |
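One way to read "catalogs are useful for caching purposes" in OWL API terms: an IRI mapper can redirect well-known ontology IRIs to locally cached copies, and dropping the mapper should simply fall back to resolving the import chain over the network. The sketch below is just an illustration of that pattern, assuming OWL API 4.x (older versions use addIRIMapper); the cached path is a placeholder.

```java
// Sketch: treating a catalog purely as a cache. Without the mapper the import
// chain still resolves from the web; the local path below is a placeholder.
import java.io.File;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.IRI;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;
import org.semanticweb.owlapi.util.SimpleIRIMapper;

public class CatalogAsCacheSketch {
    public static OWLOntology load() throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        // Optional caching: map a purl to a locally cached file. Removing this line
        // should still work; the import is just fetched over the network instead.
        manager.getIRIMappers().add(new SimpleIRIMapper(
            IRI.create("http://purl.obolibrary.org/obo/go.owl"),
            IRI.create(new File("/tmp/cache/go.owl"))));
        return manager.loadOntology(
            IRI.create("http://purl.obolibrary.org/obo/go/extensions/go-lego.owl"));
    }
}
```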
@kltm the section here was meant to be the instructions for running in standard server mode. In particular the Did you get past the ontology loading problem? |
Ah, I think I see what you're saying. Duh. I read "Start the Minerva Server with configuration for Blazegraph journal and model dump folder" as a temporal sequence--I thought it was going to run and dump. If that's the actual command for running the server now, I can easily map it into what we have (and extend the gulpfiles). I have not gotten past the loading issue yet. @cmungall ? |
So the thought here with migration then is that we'll have a one-time creation of the journal and TTL files, and then the file dumps are just backups/rollbacks. Is the rest of the pipeline (GAF gen, etc.) okay with this? |
Looking at the client API docs a little and pondering a bit, maybe to break out into other items later, if needed
|
From @cmungall, we are going to give up on the obo-models side of things for now; he also said that he would take care of testing against the monarch models. Has that been done? Moving forward, to make sure we can upgrade all servers at more or less the same time, it will be worthwhile to just check this against the GO, OBO, and Monarch models. @balhoff Assuming that #39 (comment) is addressed, I think we're getting close. I've done some more testing, but could not make your command work for actually starting the server in several variations. For example, the minimal
gives me:
I've tried variations removing the catalog, etc. However, when I model the command more closely on what we've been running with, things seem fine:
I was wondering if the command you give works for you? |
Maybe it's the different catalog in use? |
|
@kltm and I looked at this today. I think it's something peculiar to |
Hm. I think that the minerva-server was a more generalized code set that we kept around for some reason? I have nothing in my notes about it, but I remember that minerva-cli was a superset of functionality? |
I thought you were running with |
I have not actually been running from a jar locally, instead just running the class out of Eclipse. I'm taking a look at the jar outputs now though. |
So Inside
Inside
|
… optional groups; work on geneontology/noctua#350, geneontology/noctua#371, and geneontology/minerva#39
@balhoff I was wondering about something while playing with the API (geneontology/noctua#371 (comment)) and looking at #70 . The "export-all" functionality will be the first operation that we have that operates on all models at the same time. Unlike the usual imprint left by uid (and I assume provided-by) on "action" operations, we do not want an imprint left in this case. I'm not sure if the fact that it is a meta/query will prevent that from happening; IIRC, the "store-model" op did leave an imprint as it was an action/rebuild op. I'll try and test for it once I can complete the op (see geneontology/noctua/issues/371), but I was wondering if you might have any insight beforehand? |
I don't think there should be any metadata added. The uid is not passed on to the model dump functions. |
@balhoff Okay, I found a missing bit in the protocol. When doing the "export-all" operation: message and message-type must be defined for all non-error packets. In this case "success" would probably be fine. |
@balhoff Note that I've since corrected some typos in the above comment. As another request, would it be possible to add store-all in addition to or as a replacement for export-all? For consistency, the "store" functions have been server facing, while the "export" functions have been client facing. Things seem to be working pretty well so far, which is great. The one tricky thing that we'll need to iron out at some point (probably not now) is how queries to the backend are signaled. With export-all, it's a meta/query operation that only returns a message. This is different from typical meta/query operations, which have special data stored in the data['meta'] area that helps signal what was asked for, and then drive the display. Without extra structure, there is no way for the client to really know why it's getting a message (unless the client happens to have a message text lookup). In the future (again, not now, but as we do more stuff with the backend), I would suggest a new category of operation like "backend", so that the message can be sorted out properly. |
@balhoff I'm getting close to the end of the testing (I have finally wired-up the noctua-repl to deal with all the new stuff), but have run into a couple of (possibly related) problems. The first is that the store model operation (not bulk, but a single model) seems to fail on modified models. If I store an unmodified model, all is good. But if I add an individual and try storing, I get an error like:
The second issue may actually be a feature, but I'm thinking "not" right now. |
By "store model", do you mean "Save" from the Model menu? That is working for me. How did you add the individual? For the second question (export-all), do you mean that unsaved modifications are lost from memory when the database is written to turtle files? I hope not. If you mean that unsaved modifications are not written out to the turtle files, I would say that is what I expect. |
|
Re: (2), I think everything is working correctly. Think of |
@balhoff As far as copying all of the saved stuff to disk, that's good and expected. It's the losing everything that's _un_saved that's worrying. It feels like there's a complete reload of the internal state after the flush. |
@balhoff I've thrown everything out and rebuilt everything from scratch and I can no longer replicate the store-model and export-all errors. So great there; I'll try it again tomorrow and keep an eye on it to see if I can replicate it, but hopefully that was all something at my end. That leaves only #39 (comment) . I'm not adding any UI for the operation (right now), but it still needs to follow the protocol for the response instantiation and the REPL (where early testing and scripting happens). |
@kltm glad the errors are gone for now. I added |
Great. I'm doing some final testing now. If it goes well, I will merge it on to master, then use that to start closing the tickets on the Noctua side (@DoctorBud I'll do a merge branch off of noctua master with a summary as I collapse the issue branches back on, then merge that to master after testing), with an eye to get this all out this evening. |
@balhoff It looks like there is another protocol issue: "signal" needs to be defined for successful returns. In this case "signal": "meta". |
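Pulling together the protocol points from the last few comments, a successful meta/query response to something like export-all appears to need at least message, message-type, and signal. The sketch below only restates those named fields in code form; the value of "message" and the shape of "data" are placeholders, and the authoritative definition remains the barista/minerva request API docs referenced earlier.

```java
// Sketch of the minimal success envelope discussed above for a meta/query op
// such as export-all. Field names follow the comments in this thread; values
// other than message-type and signal are placeholders.
import java.util.LinkedHashMap;
import java.util.Map;

public class ResponseEnvelopeSketch {
    public static Map<String, Object> minimalSuccess() {
        Map<String, Object> resp = new LinkedHashMap<String, Object>();
        resp.put("message-type", "success");              // required for all non-error packets
        resp.put("message", "export-all complete");       // placeholder text
        resp.put("signal", "meta");                       // required for successful returns
        resp.put("data", new LinkedHashMap<String, Object>()); // typically carries a "meta" block
        return resp;
    }
}
```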
With @balhoff on Skype; I think we now have a passing set, ready to be merged back onto master? |
Done via #71. |
The choice is not so important. Blazegraph has nice features, Jena TDB may be easier.
The mapping from what we currently implement to a triplestore should be simple.
Currently we store each model in its own file. Here, each model would be in its own Named Graph (NG), which would have an IRI the same as the model IRI. Note with a triplestore it's easy to flush and replace a whole NG (just as we currently do for models on the filesystem).
There may be an existing owlapi<->triplestore bridge. If not, this should be relatively simple, as we are dealing with a subset of owl.
We could maintain the github repo as a primarily write-only backup and an additional way to access the models. But for most future operations, the flexibility of sparql will be better than per-file operations.
Note this would obviate the need for epione. Any server-side or client-side component would be able to fetch any kind of model metadata using a simple SPARQL-over-REST call.
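For illustration, a "simple SPARQL-over-REST call" from any client could be as plain as an HTTP GET against the store's SPARQL endpoint. The sketch below assumes a Blazegraph-style endpoint URL, which is a placeholder; the query just lists model graphs and is not an agreed-upon metadata query.

```java
// Sketch: fetching model metadata straight from the triplestore over HTTP.
// The endpoint URL and query are placeholders for illustration only.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SparqlOverRestSketch {
    public static String listModels() throws Exception {
        String query = "SELECT ?model WHERE { GRAPH ?model { ?model a <http://www.w3.org/2002/07/owl#Ontology> } }";
        URL url = new URL("http://localhost:9999/blazegraph/sparql?query="
            + URLEncoder.encode(query, "UTF-8"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/sparql-results+json");
        StringBuilder body = new StringBuilder();
        BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            body.append(line).append('\n');
        }
        reader.close();
        return body.toString(); // SPARQL results as JSON
    }
}
```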
cc @hdietze @kltm @DoctorBud @balhoff