
New command: logreport #531

Open
jpmckinney opened this issue Oct 23, 2020 · 4 comments

jpmckinney (Member) commented Oct 23, 2020

This command would implement the steps in the logs documentation, reporting the most relevant lines from the log file: https://kingfisher-collect.readthedocs.io/en/latest/logs.html

We might want to make this a separate package, and extract the ScrapyLogFile class from Kingfisher Archive: https://github.com/open-contracting/kingfisher-archive/blob/master/ocdskingfisherarchive/scrapy_log_file.py
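For illustration, here is a minimal sketch of the kind of report such a command could produce, assuming the "most relevant lines" are ERROR-level messages and the final stats dump (the report_log function and its selection criteria are hypothetical; the linked documentation defines the actual steps):

import sys


def report_log(path):
    # Print the most relevant lines from a Scrapy log file: in this sketch,
    # ERROR-level messages and everything from the final "Dumping Scrapy stats" block onward.
    in_stats = False
    with open(path) as f:
        for line in f:
            if "Dumping Scrapy stats:" in line:
                in_stats = True
            if in_stats or "] ERROR:" in line:
                print(line, end="")


if __name__ == "__main__":
    report_log(sys.argv[1])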

jpmckinney (Member Author) commented
Note: scrapy-log-analyzer's logparser dependency is GPL-licensed. Make the command optional, as documented at https://ocp-software-handbook.readthedocs.io/en/latest/python/preferences.html#license-compliance
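A minimal sketch of how the import could be guarded so the GPL-licensed dependency stays optional (the exact packaging approach, e.g. an extras group, is an assumption; the handbook page describes the policy):

# Guard the optional, GPL-licensed dependency so the command degrades gracefully
# when scrapy-log-analyzer is not installed (e.g. if it is provided via a hypothetical
# "logreport" extra rather than as a core requirement).
try:
    from scrapyloganalyzer import ScrapyLogFile  # pulls in the GPL-licensed logparser
except ImportError:
    ScrapyLogFile = None

The command's run() method could then raise UsageError with an installation hint if ScrapyLogFile is None.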

jpmckinney (Member Author) commented
Here's the stub I had started (it was in a git stash). It would mostly call the scrapy-log-analyzer package.

from scrapy.commands import ScrapyCommand
from scrapy.exceptions import UsageError
from scrapyloganalyzer import ScrapyLogFile


class LogReport(ScrapyCommand):
    def short_desc(self):
        return "Analyze a crawl's log file to assess the quality of the crawl"

    def syntax(self):
        return '[options] <logfile>'

    def run(self, args, opts):
        if len(args) != 1:
            raise UsageError("Exactly one log file must be provided.")
        # TODO: The analysis itself would mostly call the scrapy-log-analyzer package,
        # e.g. by constructing a ScrapyLogFile for args[0] and reporting its findings.
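Assuming the project registers its commands module via Scrapy's COMMANDS_MODULE setting (as custom Scrapy commands require), invocation would presumably look like scrapy logreport path/to/logfile.log, matching the syntax() method in the stub.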

jpmckinney (Member Author) commented
Another idea from #1048. The advantage is that it can interrupt a crawl, instead of waiting for it to end. We can maybe use the same approach as #1055.

A more intensive option is to add a new feature that checks the rate of 500 errors and cancels the crawl if it is too high. This should also send a new type of message to Kingfisher Process, to cancel processing.
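For reference, a sketch of what that feature could look like as a Scrapy extension, using a simple ratio check (the setting names, defaults and close reason are hypothetical, and the message to Kingfisher Process is omitted):

from scrapy import signals
from scrapy.exceptions import NotConfigured


class ErrorRateMonitor:
    # Close the spider when the share of HTTP 500 responses gets too high.

    def __init__(self, crawler, threshold, min_responses):
        self.crawler = crawler
        self.threshold = threshold
        self.min_responses = min_responses
        self.responses = 0
        self.errors = 0

    @classmethod
    def from_crawler(cls, crawler):
        # Hypothetical settings to enable and tune the check.
        if not crawler.settings.getbool("ERROR_RATE_MONITOR_ENABLED"):
            raise NotConfigured
        extension = cls(
            crawler,
            threshold=crawler.settings.getfloat("ERROR_RATE_THRESHOLD", 0.1),
            min_responses=crawler.settings.getint("ERROR_RATE_MIN_RESPONSES", 100),
        )
        crawler.signals.connect(extension.response_received, signal=signals.response_received)
        return extension

    def response_received(self, response, request, spider):
        self.responses += 1
        if response.status == 500:
            self.errors += 1
        # Cancel the crawl once enough responses have been seen and the error rate is too high.
        if self.responses >= self.min_responses and self.errors / self.responses > self.threshold:
            self.crawler.engine.close_spider(spider, "error_rate_exceeded")

The extension would also need to be enabled in the EXTENSIONS setting; notifying Kingfisher Process would be a separate step.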
