Submit topology to remote server #896
Replies: 3 comments 1 reply
-
Great! Can you tell us more about what you will be crawling?
What would change from one source to the other?
Neither is meant to be used on a large scale or continuously. The FileSpout is mainly for injecting seeds, the MemorySpout for demos and playing around. Which one you should use depends on question 1 above.
Does anyone like YAML? No particular reason, to be honest. Did you look at JSONResource and JSONURLFilterWrapper? It is an example of how filters could be stored e.g. in ES and loaded dynamically; the JSON files would then only reference the dynamic components. Please note that some filters can be configured via the 'normal' YAML config, while others, like max depth, can be defined per seed. In general I tend to avoid having multiple crawlers and prefer a single one handling the different sources where possible.
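For readers not familiar with the format, a urlfilters.json entry looks roughly like this (the BasicURLFilter entry mirrors the default StormCrawler configuration; the exact shape of a JSONURLFilterWrapper entry is not shown, as it depends on the wrapped resource):

```json
{
  "com.digitalpebble.stormcrawler.filtering.URLFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter",
      "name": "BasicURLFilter",
      "params": {
        "maxPathRepetition": 3,
        "maxLength": 1024
      }
    }
  ]
}
```

A dynamic filter would appear as just another entry in that list, with the actual rules living elsewhere (e.g. in ES) rather than in the file itself.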
-
Makes sense to have different crawlers when the use cases are very different.
-
So to complete this question: the only workaround we found to submit topologies remotely and still have an external configuration for parse/URL filters is the following:
Apart from this workaround, we looked into possible options to solve this in a better way and came up with some ideas:
@jnioche can you share your thoughts on that, and could you please suggest your preferred way for us to contribute directly?
-
Hello again! I will start with the good news: as you can guess, StormCrawler has ended up being the selected crawling technology for our project. 😀
The background to my question is that we keep our configuration in a microservice, which can produce a YAML file and submit the topology remotely using Flux. We do this because we submit the same jar (our crawling code doesn't change often) with a different configuration each time, as we need a different configuration per set of sources.
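For reference, submission then boils down to something like `storm jar crawler.jar org.apache.storm.flux.Flux --remote crawler.flux`, where the generated file looks roughly like this (names and values are illustrative, not our actual config):

```yaml
# crawler.flux — generated per set of sources; bolts and streams omitted
name: "sources-set-A"
config:
  topology.workers: 2
  http.agent.name: "our-crawler"
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.MemorySpout"
    constructorArgs:
      - ["https://example.com/"]
```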
The problem is that Storm doesn't support submitting extra configuration files remotely (sources, filters and parsers files), so we will programmatically combine all of them into one big Flux file. We will have to rework the FileSpout to read from a config param, or maybe use the MemorySpout instead.
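Something like this minimal sketch is what we have in mind for the spout rework (ConfigSeedSpout and the seeds.urls key are our own inventions, not existing StormCrawler classes or settings):

```java
import java.util.List;
import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

import com.digitalpebble.stormcrawler.Metadata;

/** Emits seed URLs taken from the topology config instead of a file. */
public class ConfigSeedSpout extends BaseRichSpout {

    private SpoutOutputCollector collector;
    private List<String> seeds;
    private int index = 0;

    @Override
    @SuppressWarnings("unchecked")
    public void open(Map<String, Object> conf, TopologyContext context,
            SpoutOutputCollector collector) {
        this.collector = collector;
        // the seeds are passed in the Flux `config` section under a
        // hypothetical key, e.g. seeds.urls: ["https://example.com/"]
        this.seeds = (List<String>) conf.get("seeds.urls");
    }

    @Override
    public void nextTuple() {
        if (seeds != null && index < seeds.size()) {
            String url = seeds.get(index++);
            // same (url, metadata) stream layout as the built-in spouts
            collector.emit(new Values(url, new Metadata()), url);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("url", "metadata"));
    }
}
```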
But the Filters and Parsers always expect to read from a file, so we are thinking we will have to adapt them to read from the main config instead (roughly as sketched below). By the way, is there any particular reason why the Filters and Parsers config are in JSON and not in YAML?
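Hypothetically, inlining the filter definitions in the main config could look like this (the urlfilters.config key is invented; nothing like it exists in StormCrawler today):

```yaml
# hypothetical: filter definitions inlined in the main crawler config
# instead of being loaded from urlfilters.json
urlfilters.config:
  - class: "com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter"
    name: "BasicURLFilter"
    params:
      maxPathRepetition: 3
      maxLength: 1024
```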
Please feel free to suggest a different approach, as we may have missed something obvious at some point.