Extend documentation by describing reuse options #159

Open
apiskunovs opened this issue Feb 20, 2022 · 3 comments

Comments

@apiskunovs

It would be helpful if the documentation described what needs to be changed, and which commands need to be run, to reuse the tool in other scenarios, such as:

  • How can the tool be fed with data from other endpoints?
  • How can it be integrated with a client other than RDFExplorer?
  • How can the indexes be cleaned up or updated?
  • How can indexes from two or more endpoints be kept side by side?

P.S. I am just a beginner in this area, so apologies if some of these have already been addressed. So far I have not been able to find clear answers to these questions.

@gabrieldelaparra
Owner

Hi @apiskunovs,

Thanks for your interest in the project.
This project works with an RDF dump (N-Triples) and not directly with an endpoint. You would need to download a dump first. Is one available to you?
To integrate it with a client other than RDFExplorer, once the index is built, there is an API that can be queried (see https://github.com/gabrieldelaparra/SPARQLforHumans/blob/master/SparqlForHumans.Server/Controllers/QueryGraphController.cs), or you can query the index directly (see https://github.com/gabrieldelaparra/SPARQLforHumans/blob/master/SparqlForHumans.UnitTests/Query/QueryGraphResultsTests.cs).
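
On the dump requirement: if only a SPARQL endpoint is available, one possible way to obtain N-Triples is to page through a `CONSTRUCT` query and append the results to a file. The following is a minimal sketch and not part of this project. It assumes the endpoint follows the SPARQL 1.1 Protocol and can return `application/n-triples`, which not every endpoint does. The function names and page size are hypothetical, and whether the indexer accepts dumps other than the ones it was developed against is not confirmed in this thread.

```python
import urllib.parse
import urllib.request


def build_construct_request(endpoint_url, limit, offset):
    """Build a GET request for one page of triples, asking for N-Triples."""
    query = (
        "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } "
        f"LIMIT {limit} OFFSET {offset}"
    )
    url = endpoint_url + "?" + urllib.parse.urlencode({"query": query})
    # Assumption: the endpoint honours this Accept header for CONSTRUCT results.
    return urllib.request.Request(url, headers={"Accept": "application/n-triples"})


def dump_page(endpoint_url, out_path, limit=10000, offset=0):
    """Fetch one page of triples and append it to the dump file at out_path."""
    request = build_construct_request(endpoint_url, limit, offset)
    with urllib.request.urlopen(request) as response, open(out_path, "ab") as out:
        out.write(response.read())
```

Calling `dump_page` repeatedly with an increasing `offset` would build up the file. Paging with `LIMIT`/`OFFSET` without an `ORDER BY` is not guaranteed to be stable, so a dump published by the data provider is preferable when one exists.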

My recommendation would be to convert your model to a graph and send it to the API.
Alternatively, you can pass in a SPARQL query (see https://github.com/gabrieldelaparra/SPARQLforHumans/blob/master/SparqlForHumans.Benchmark/BenchmarkRunner.cs for an example).

Currently it is not possible to update the index with a new dump. This was outside the scope of the project, and I understand it to be a research topic in its own right: because of the size of the dumps, it is not as trivial as doing a diff.

To handle 2 endpoints, I would recommend running 2 instances. The index paths cannot be set externally, so they must be hardcoded (see https://github.com/gabrieldelaparra/SPARQLforHumans/blob/master/SparqlForHumans.Lucene/LuceneDirectoryDefaults.cs).
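
To illustrate the general idea of making such paths configurable per instance, here is a language-neutral sketch written in Python. It is not the project's code (the project is C#), and the environment variable name, default root, and subdirectory names are all hypothetical. In C#, the equivalent lookup could use `Environment.GetEnvironmentVariable`.

```python
import os
from pathlib import Path

# Hypothetical default, standing in for a hardcoded location.
DEFAULT_INDEX_ROOT = Path("LuceneIndex")


def resolve_index_paths(env=None):
    """Return index directories, preferring an externally supplied root."""
    env = os.environ if env is None else env
    # SFH_INDEX_ROOT is a made-up variable name for this sketch.
    root = Path(env.get("SFH_INDEX_ROOT", DEFAULT_INDEX_ROOT))
    return {
        "entities": root / "EntityIndex",      # hypothetical subdirectory
        "properties": root / "PropertyIndex",  # hypothetical subdirectory
    }
```

Each instance would then be started with its own value for the variable, so that two instances never read from or write to the same index directory.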

Let me know if this answers your questions.

@apiskunovs
Author

Hi @gabrieldelaparra ,

Following your suggestions, I made some improvements. I use your project to provide query suggestions, e.g. finding "human (Q5) instances with certain words in labels", plus a few more scenarios. I also changed the code to accept custom paths to the Lucene index directories, so running 2 instances in parallel now looks possible.

I have a few more questions and would appreciate further guidance :)

  • Have any performance tests been run with simultaneous requests? I am not an expert here, but I ran a JMeter test simulating 100 parallel requests and quite quickly saw RAM usage jump from 15 GB to the full 32 GB (and response times increased significantly). Do you have any suggestions as to what might cause this?
  • The LuceneBinaries folder contains some *.dll files, and a comment says: "Manually added dependencies for Lucene.Net, based on https://github.com/gabrieldelaparra/lucenenet. Nuget pckg is not netstandard2.0 compatible for Lucene .Queries and .SandBox". The GitHub project mentioned there is not accessible, so could you elaborate on what the problem was and how those *.dll files could be updated from the latest releases/prereleases?

Sorry if I am troubling you. This should be over soon :)
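
For anyone wanting to reproduce the parallel-request test without JMeter, a small script along these lines could work. This is a minimal sketch, not the test that was actually run: `send_request` is a hypothetical callable supplied by the caller (for example, a function that issues one HTTP request to the server), and the reported statistics are an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(send_request, total_requests=100, max_workers=100):
    """Call send_request(i) concurrently and summarise per-call latency."""

    def timed_call(i):
        start = time.perf_counter()
        send_request(i)
        return time.perf_counter() - start

    # Up to max_workers calls are in flight at the same time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        latencies = sorted(pool.map(timed_call, range(total_requests)))

    return {
        "count": len(latencies),
        "median": latencies[len(latencies) // 2],
        "max": latencies[-1],
    }
```

Repeating the run with different `max_workers` values (say 1, 10, 50, 100) while watching the server's memory usage may help show at which level of concurrency the growth begins.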

@gabrieldelaparra
Owner

Hi @apiskunovs,

Sorry for the late reply.

Regarding Lucene.Net: at the time of development, lucenenet had not seen much activity or many updates. The existing version was not compatible with netstandard2.0; as I recall, it was compatible with .NET Framework only. I cloned the repo and modified it to be compatible. The latest version seems to be compatible, but I am not using it, and I am not sure what would break when updating.

To update to the latest version, you should be able to remove the .dll dependencies (v3.0.3) and reference the NuGet package (v4.8.0). As I mentioned, I am not sure whether this will introduce breaking changes, but I believe it will.

Regarding the performance tests: I am not sure, but since you are running multiple instances, keep in mind that the system loads the data into memory (see the usages of InMemoryQueryEngine), which may explain the memory growth.

As per our previous messages, I never considered having multiple threads running in parallel, so I did not optimize for this. One option would be to use a different backend (either a graph database or a relational database would work), but that is not in place.
