Submit topology to remote server #896
Replies: 3 comments 1 reply
-
Great! Can you tell us more about what you will be crawling?
What would change from one source to the other?
Neither is meant to be used on a large scale or continuously. The FileSpout is mainly for injecting seeds, the MemorySpout for demos and playing around. Which one you should use depends on question 1 above.
Does anyone like YAML? No particular reason, to be honest. Did you look at JSONResource and JSONURLFilterWrapper? It is an example of how filters could be stored e.g. in ES and loaded dynamically; the JSON files would then only reference the dynamic components. Please note that some filters can be configured via the 'normal' YAML config, while others, like max depth, can be defined per seed. In general I tend to avoid having multiple crawlers and prefer a single one handling the different sources where possible.
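For readers not familiar with the format, a urlfilters.json entry looks roughly like this (the BasicURLFilter entry mirrors the default StormCrawler configuration; the exact shape of a JSONURLFilterWrapper entry is not shown, as it depends on the wrapped resource):

```json
{
  "com.digitalpebble.stormcrawler.filtering.URLFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter",
      "name": "BasicURLFilter",
      "params": {
        "maxPathRepetition": 3,
        "maxLength": 1024
      }
    }
  ]
}
```

A dynamic filter would appear as just another entry in that list, with the actual rules living elsewhere (e.g. in ES) rather than in the file itself.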
-
Makes sense to have different crawlers when the use cases are very different.
-
So to complete this question: the only workaround we found to submit topologies remotely and still have an external configuration for parse/URL filters is the following:
Apart from this workaround, we looked into possible options to solve this in a better way and came up with some ideas:
@jnioche can you share your thoughts on that, and could you please suggest your preferred way for us to contribute directly?
-
Hello again! I will start with the good news: as you can guess, StormCrawler has ended up being the selected crawling technology for our project. 😀
The background to my question is that we keep our configuration in a microservice, which can produce a YAML file and submit the topology remotely using Flux. We do this because we submit the same jar (our crawling code doesn't change often) with a different configuration each time, as we need a different configuration per set of sources.
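For reference, submission then boils down to something like `storm jar crawler.jar org.apache.storm.flux.Flux --remote crawler.flux`, where the generated file looks roughly like this (names and values are illustrative, not our actual config):

```yaml
# crawler.flux — generated per set of sources; bolts and streams omitted
name: "sources-set-A"
config:
  topology.workers: 2
  http.agent.name: "our-crawler"
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.MemorySpout"
    constructorArgs:
      - ["https://example.com/"]
```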
The problem is that Storm doesn't support submitting extra configuration files remotely (sources, filters and parsers files), so we will programmatically combine all of them into one big Flux file. We will have to rework the FileSpout to read from a config param, or maybe use the MemorySpout instead.
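Something like this minimal sketch is what we have in mind for the spout rework (ConfigSeedSpout and the seeds.urls key are our own inventions, not existing StormCrawler classes or settings):

```java
import java.util.List;
import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

import com.digitalpebble.stormcrawler.Metadata;

/** Emits seed URLs taken from the topology config instead of a file. */
public class ConfigSeedSpout extends BaseRichSpout {

    private SpoutOutputCollector collector;
    private List<String> seeds;
    private int index = 0;

    @Override
    @SuppressWarnings("unchecked")
    public void open(Map<String, Object> conf, TopologyContext context,
            SpoutOutputCollector collector) {
        this.collector = collector;
        // the seeds are passed in the Flux `config` section under a
        // hypothetical key, e.g. seeds.urls: ["https://example.com/"]
        this.seeds = (List<String>) conf.get("seeds.urls");
    }

    @Override
    public void nextTuple() {
        if (seeds != null && index < seeds.size()) {
            String url = seeds.get(index++);
            // same (url, metadata) stream layout as the built-in spouts
            collector.emit(new Values(url, new Metadata()), url);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("url", "metadata"));
    }
}
```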
But the Filters and Parsers always expect to read from a file, so we are thinking we will have to adapt them to read from the main config instead (roughly as sketched below). By the way, is there any particular reason why the Filters and Parsers config are in JSON and not in YAML?
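Hypothetically, inlining the filter definitions in the main config could look like this (the urlfilters.config key is invented; nothing like it exists in StormCrawler today):

```yaml
# hypothetical: filter definitions inlined in the main crawler config
# instead of being loaded from urlfilters.json
urlfilters.config:
  - class: "com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter"
    name: "BasicURLFilter"
    params:
      maxPathRepetition: 3
      maxLength: 1024
```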
Please feel free to suggest a different approach, as we may have missed something obvious at some point.