This repository has been archived by the owner on Sep 19, 2023. It is now read-only.

Crawler contribution docs clarification #376

Open
jbothma opened this issue May 17, 2023 · 1 comment

Comments

@jbothma
Contributor

jbothma commented May 17, 2023

Include verbose logging in your crawler. Make sure that new fields or enum values introduced upstream (e.g. a new country code or sanction program) will cause a warning to be emitted.

One way to read this is that, to log new countries or sanction programs, a crawler should query for the existing countries or programs and log whenever new ones are added. Is that right? If so, the Context could be doing that for you, right?

Also, should the reader take the following as a general instruction too?

Include verbose logging in your crawler.

I'm guessing you don't mean you want log statements like this:

context.log.info(f"Scraping { first_name } { last_name }")

But I do see things that are probably interesting to log for a given scraper, like which pages are being fetched, and perhaps data that can't be parsed correctly. Is that more the intent of this?

@pudo
Member

pudo commented May 18, 2023

I think the first paragraph refers to the general idea of making crawlers as brittle as possible: if something unexpected happens, it is much better for the crawler to complain and crash than to gloss over the issue. In particular, any log message with a level >= WARN will be stored to the database and we can review it later. So having checkpoints like these is really useful:

https://github.com/opensanctions/opensanctions/blob/main/opensanctions/crawlers/gb_hmt_sanctions.py#L80-L83
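
To illustrate the kind of checkpoint meant here, a minimal sketch follows. It is not the code at the link above: it uses Python's standard logging module in place of the crawler Context, and the names KNOWN_PROGRAMS and map_program, along with the sample values, are hypothetical.

```python
import logging

log = logging.getLogger("crawler_sketch")

# Hypothetical lookup table; a real crawler would carry its own mapping.
KNOWN_PROGRAMS = {"Russia": "RUS", "Syria": "SYR"}


def map_program(raw_value):
    """Return the mapped code, or warn and return None for an unseen value."""
    code = KNOWN_PROGRAMS.get(raw_value)
    if code is None:
        # A new upstream value surfaces as a warning instead of being
        # silently dropped or passed through unchecked.
        log.warning("Unknown sanction program: %r", raw_value)
    return code
```

Because the message is emitted at warning level, it would clear the >= WARN threshold described above, assuming the crawler's logger is wired to the same storage.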

Regarding the "verbose" logging: any log message below level info is hidden by default (in practice: log.debug), but you can make them visible by calling opensanctions with the -v flag. That gets super super verbose, though, and to be very honest I do a lot of print() debugging once I know there's an issue....
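
As a rough illustration of how a verbosity flag can gate debug output, here is a generic sketch built on the standard argparse and logging modules. It is an assumption about the general pattern, not how opensanctions implements -v; configure_logging and the logger name are hypothetical.

```python
import argparse
import logging


def configure_logging(argv):
    """Return a logger that shows debug messages only when -v is passed."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-v", "--verbose", action="store_true")
    args = parser.parse_args(argv)

    logger = logging.getLogger("crawler_sketch_cli")
    # Without -v the threshold is INFO, so log.debug calls are dropped.
    logger.setLevel(logging.DEBUG if args.verbose else logging.INFO)
    return logger
```

With this setup, configure_logging([]) hides logger.debug(...) calls, while configure_logging(["-v"]) lets them through.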
