WARCHdfsBolt forwarding WARC file path to StatusUpdaterBolt #1044
Comments
Hi @michaeldinzinger, this overlaps with #567 and recently I started to explore potential ways to implement a CDX indexer:
Given that there is more general interest, I'll continue to explore variant 1, but I cannot promise when or whether this will be successful. Any suggestions or help are welcome!
Hello @sebastian-nagel, thank you for your answer! :) Personally, I would really appreciate this, because knowing the WARC record location is an important (but not central) aspect of our use of StormCrawler. I would therefore also be willing to look into this issue at some point. It's a pity that the HdfsBolt is designed as a dead end.
Another thing that came up on our end regarding this issue: we want to trigger further processing of a WARC file once it has been completely written. So I'm wondering whether the crawler can tell us "this WARC file is now ready".
You could just check the filesystem for new files from time to time. This seems reasonable since WARC files usually hold tens of thousands of records and, consequently, aren't finished very often.
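The polling idea can be sketched in a few lines of plain Java. This is only a minimal sketch under stated assumptions, not something StormCrawler provides. The class `WarcDirectoryPoller` is hypothetical, it only looks at a local directory (S3 or HDFS would need the respective client instead of `java.nio.file`), and it treats a `.warc.gz` file as finished when its size has not changed between two consecutive polls. That is a heuristic, not a guarantee that the writer has closed the file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper, not part of StormCrawler. */
public class WarcDirectoryPoller {
    private final Path dir;
    // File sizes observed at the previous poll.
    private final Map<Path, Long> lastSizes = new HashMap<>();
    // Files already reported as ready, so each one is returned only once.
    private final Set<Path> reported = new HashSet<>();

    public WarcDirectoryPoller(Path dir) {
        this.dir = dir;
    }

    /**
     * Returns the .warc.gz files whose size is unchanged since the previous
     * poll and that have not been reported before (assumed heuristic for
     * "completely written").
     */
    public List<Path> poll() throws IOException {
        List<Path> candidates;
        try (Stream<Path> files = Files.list(dir)) {
            candidates = files
                    .filter(f -> f.getFileName().toString().endsWith(".warc.gz"))
                    .collect(Collectors.toList());
        }
        List<Path> ready = new ArrayList<>();
        for (Path file : candidates) {
            if (reported.contains(file)) {
                continue;
            }
            long size = Files.size(file);
            Long previous = lastSizes.put(file, size);
            if (previous != null && previous == size) {
                reported.add(file);
                ready.add(file);
            }
        }
        return ready;
    }
}
```

Calling `poll()` on a schedule (for example every few minutes) and handing the returned paths to the downstream processing would be one way to use it. The polling interval should be longer than any pause the writer might take between records, otherwise a file that is still open could be reported too early.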
Hello all,
As far as I understand, the WARCHdfsBolt produces a continuous stream of records in WARC format. The resulting WARC files are written to, e.g., an S3-compatible storage according to some RotationPolicy and FilenameFormat. Within the Storm topology, the WARCHdfsBolt is a dead end and does not emit any tuples.
However, we are particularly interested in knowing which file (filename) a given web page / WARC record is written to, and we would like to forward this information to the index, e.g. an OpenSearch/Elasticsearch instance.
So that we know in the end: https://stormcrawler.net/faq/ --is_stored_in--> s3://path/to/file/WARC_file_0815.warc.gz
Is this reasonable and technically possible? Probably only if the WARCHdfsBolt emits the corresponding information to the StatusUpdaterBolt and is no longer a dead end.