
Article checker #182

Merged: 2 commits into main, Oct 26, 2023
Conversation

@mruwnik (Collaborator) commented Sep 3, 2023

Every 4 hours this picks 100 articles that haven't been checked in the last 4 weeks and checks:

  • whether there is new/better data
  • whether the url is reachable
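
A minimal sketch of the selection step described above, assuming a SQLAlchemy Article model with a date_checked column (the import path, helper name and column name are assumptions, not taken from this PR):

from datetime import datetime, timedelta

from sqlalchemy import select
from sqlalchemy.orm import Session

from align_data.db.models import Article  # assumed import path


def articles_to_check(session: Session, batch_size: int = 100, max_age_weeks: int = 4):
    # Pick the least recently checked articles (or never-checked ones) first.
    cutoff = datetime.utcnow() - timedelta(weeks=max_age_weeks)
    query = (
        select(Article)
        .where((Article.date_checked.is_(None)) | (Article.date_checked < cutoff))
        .order_by(Article.date_checked)
        .limit(batch_size)
    )
    return session.scalars(query).all()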

@mruwnik mruwnik requested review from Thomas-Lemoine, ccstan99 and henri123lemoine and removed request for Thomas-Lemoine and ccstan99 September 3, 2023 19:37
@@ -23,6 +23,62 @@
logger = logging.getLogger(__name__)



def normalize_url(url: str | None) -> str | None:
Collaborator:

If this is here, shouldn't normalize_urls also be here?

def normalize_urls(urls: Iterable[str]) -> Set[str]:
    return {normalize_url(url) for url in urls}
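
For context, a normalize_url along these lines is purely illustrative of what such a helper usually does (the real implementation in this PR may normalize differently):

import re


def normalize_url(url: str | None) -> str | None:
    # Illustrative only: trim whitespace, drop the scheme and a leading
    # "www.", and strip any trailing slash so equivalent links compare equal.
    if not url:
        return url
    url = url.strip()
    url = re.sub(r'^https?://', '', url)
    url = re.sub(r'^www\.', '', url)
    return url.rstrip('/')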

mruwnik (Collaborator Author):

Dunno. I only wanted to extract what I actually need. I was also considering moving these functions to a separate module or something, rather than having them here. What would you prefer?

Collaborator:

Ah, the reason you need it is that you want to be able to use normalize_url and others without using AlignmentDataset at all, correct? Since normalize_urls is never used without having access to a dataset obj, it can stay in AlignmentDataset?

Thomas-Lemoine (Collaborator), Sep 3, 2023:

I have a small preference for keeping it in align_data/common/. What do you think about some format.py file in common/ that deals with common ways to reformat different objects throughout the code, like formatting url strings to a standard-ish format, formatting an authors list to a standard authors string format, formatting an entry's contents dict into some standard-but-flexible article_data_dict format which make_data_entry expects, etc.?

@@ -0,0 +1,84 @@
import logging
Collaborator:

I don't have a strong opinion on it, but would it make more sense to have this file in align_data/db, since it's used to validate db objects rather than any source in particular (afaict)?

mruwnik (Collaborator Author):

Sort of. It also makes use of the parsers, as it fetches data from the appropriate source etc., so dunno :/

for k, v in value.items():
    meta_val = article.meta.get(k)
    if not meta_val or v > meta_val:
        article.meta[k] = v
Collaborator:

If the values can't be compared, does it throw an error?

mruwnik (Collaborator Author):

Yes. I suppose metadata keys can be ignored in those cases...
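
One way to do that (a sketch of the "ignore those cases" idea, using the names from the snippet above, not the code actually merged here) is to catch the comparison error per key:

for k, v in value.items():
    try:
        meta_val = article.meta.get(k)
        if not meta_val or v > meta_val:
            article.meta[k] = v
    except TypeError:
        # e.g. comparing a str against an int - skip just this key
        logger.info('Ignoring incomparable metadata value for %s: %s', k, v)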

elif isinstance(value, datetime) and value > article_val:
    setattr(article, field, value)
elif isinstance(value, str) and len(normalize_text(value) or '') > len(normalize_text(article_val) or ''):
    setattr(article, field, normalize_text(value))
Collaborator:

The > heuristic seems fine, since it will only do anything for comparables like dates and numbers (and idk what else), but imo, in other cases it could just always update? Like, if the text changed it would always be updated, rather than only being updated when the length of the text grows, which seems to be what's going on here. Lmk if I'm wrong.

mruwnik (Collaborator Author):

Text changes seem to be:

  • slight edits for stylistic reasons etc., which often have a note attached to say when the edit happened - I'm hoping these will result in slightly larger texts
  • things being amended - this should result in a larger article
  • things being removed, or the article disappearing with only an error message displayed

I didn't want to just overwrite the text if it changed, because the changes might not be for the better. So this seemed like a heuristic that might be better than that. Albeit not by much...

Collaborator:

What are you thinking would be situations where the more recent changes are worse? If I'm understanding it right, "contents" will only be successful (no errors, non-empty) when the url corresponds to a parser-friendly url, i.e. a url that was originally fetched using the parsers, right? In that case, the newly fetched contents will be created through the same method as the original contents of the article, but will have (as the only difference) more recent data.

Collaborator:

Actually, it's meant to also work with any pdf and epub. Is that the reason why the contents may be of lesser quality? If the url is not the thing we used to get the pdf in the first place, maybe this generic pdf parser will be worse than a specialized one. Seems tricky.

mruwnik (Collaborator Author):

Exactly. Plus some websites return an error in HTML, but with a 200 code, or other such madness, so you can't really trust them :/
Though it might be better to just always replace things if possible...

setattr(article, field, normalize_text(value))


def check_article(article: Article) -> Article:
Collaborator:

It seems like this would be worth either changing the name or creating a docstring, since the name "check_article" doesn't seem to imply much of what it does.
Seems like:

  1. it tries to get the source's url
  2. tries to fetch its contents, which should only ever work if the url is from a domain that has a parser implemented specifically for it.
  3. if it gets anything and it doesn't return an error (indicating a parser exists for this url and the parse was successful), it tries to update the fields of the corresponding article.
  4. if the url is not reachable, it is noted as such.
  5. the date_checked is updated.

Maybe "update_article" would be a sufficiently good name for that?
And update_article goes through the fields and runs update_article_fields, which makes sense?
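
A rough sketch of that flow, for reference. fetch_contents and update_article_fields are hypothetical helper names (fetch_contents stands in for the parser dispatch the thread refers to as item_metadata); only the overall shape is meant to match the description above, not the PR's exact code:

from __future__ import annotations

from datetime import datetime

import requests


def check_article(article: Article) -> Article:
    url = article.url
    # 1-3: try to re-fetch the contents (parser / pdf / epub) and merge any improvements.
    contents = fetch_contents(url)  # hypothetical dispatch to the source parsers
    if contents and not contents.get('error'):
        update_article_fields(article, contents)  # hypothetical field-merging helper
    # 4: note unreachable urls.
    try:
        requests.head(url, timeout=30, allow_redirects=True).raise_for_status()
    except requests.RequestException:
        article.status = 'Unreachable url'
    # 5: always record when the check happened.
    article.date_checked = datetime.utcnow()
    return article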

return article


def check_articles(sources: List[str], batch_size=100):
Collaborator:

Similarly to the above, maybe update_articles? Since that's what it does: update them to the newest contents if possible.

Thomas-Lemoine (Collaborator) commented Sep 4, 2023:

If my understanding is right, validation's purpose is:

  1. for urls that have parsers, link to a pdf or link to an epub, it updates them somewhat.
  2. regardless of 1), check that the url works, and update the article's status accordingly.
  3. update the date_checked

It seems to me that 1) and 2+3 are pretty different tasks? Would it maybe make more sense to have the validator file have the following purpose:
It goes through all sources, and presses on some of the links to see if they work. If they don't, set article.status to "Unreachable url", but update the date_checked in any case. It could have a name like "check_article_url", to indicate that it's only really checking that the url hasn't closed or something, since it was fetched originally.

And then, we could just refetch parsers + pdfs + epubs more often?

In other words, validator.py would essentially validate that urls work, whereas regular fetching of item_metadata stuff would update some of the fields of the article if needed. In that case, validator would probably make even more sense in common/, but idk.

It's possible that you preferred that parsable urls or pdfs or epubs would benefit from being updated batch by batch, some every 4 hours, rather than being updated when fetching, every week or so. If so, I'm curious why.

mruwnik (Collaborator Author) commented Sep 4, 2023:

Regular fetching more often won't work, as the regular fetchers only get new data - they ignore things they've already seen. The basic assumption is that things won't change that often, but it's worth checking every now and then if there are updates. It would be possible to always recheck all urls, but that would take a LOT more time than the current process of only looking for new items.
Ideally, I'd like to move away from there being multiple ways of fetching data to having the list of parsers also contain the things that are currently handled in other places (e.g. have parsers for LW). This is also a mechanism that would sorta-automatically fix any articles that previously couldn't be parsed, e.g. because they didn't have a domain handler.

Thomas-Lemoine (Collaborator):

Hmmm, I see. Would something like
  • regular fetching every X days
  • fetching with --rebuild, or essentially with an empty self._outputted_items, at longer intervals
work? My sense of how much slower it is to fetch all vs. some is pretty weak. Additionally, is fetching an individual article slower than updating it? I suppose it seems pretty similar.

@@ -66,6 +71,7 @@ def check_article(article: Article) -> Article:


def check_articles(sources: List[str], batch_size=100):
Collaborator:

This is really minor, but for consistency the first and second functions should have the same name, except one plural, imo (i.e. update_article and update_articles, or check_article and check_articles), since one just calls the other on a batch of articles.

@@ -0,0 +1,60 @@
import re
Collaborator:

nice

mruwnik (Collaborator Author) commented Sep 4, 2023:

To a certain extent. It would be fine with smaller datasets, which can take a few seconds or minutes to run, but the larger ones (e.g. LW, arxiv) can take hours, which is why I wanted to have a hundred or so checked every few hours.
An issue with the --rebuild option is that datasets would always have to do upserts, rather than inserts. Though that actually might be a good thing?

            article.meta[k] = v
    except Exception as e:
        # Ignore exceptions here - the metadata isn't that important
        logger.info('Error checking metadata value: %s', value)
Collaborator:

Nit: I'm imagining seeing the error logged and having no idea what article it's for; would adding the article url, or something, be beneficial?
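
Something along these lines would do it (assuming the Article model exposes a url attribute):

logger.info('Error checking metadata value for %s: %s', article.url, value)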

Thomas-Lemoine (Collaborator) commented Sep 4, 2023:

> To a certain extent. It would be fine with smaller datasets, which can take a few seconds or minutes to run, but the larger ones (e.g. LW, arxiv) can take hours, which is why I wanted to have a hundred or so checked every few hours.

I see. I'm completely on board for doing manageable batches at a time for updating. Suppose we call updating_article the act of making a single article more up-to-date, which currently only works for parsers + pdfs + epubs, and we call validating_url the act of checking the status code of a fetch of a url and seeing if it works (and modifying .status accordingly), as well as updating the date_checked. Then, I have a slight preference for separating them since, as far as I understand, this would save some (maybe significant? unsure) fetching time, and give us control over what we do. This is an example of what I have in mind:

  • every X days, validate all the urls, or every 4 hours, validate batches of urls
  • separately, update articles from certain sources at some frequency. The idea is that, then, we can decide if one source has a parser-focused updating method, and another has a (perhaps much rarer, but can be automatic) refetching of all the sources, --rebuild style. Like, for each source s_i we decide if it is:
  1. Never updated
  2. updated as one of (parsables + pdfs + epubs), either as batches every X hours or as a whole every N days
  3. updated with refetching + --rebuild or similar, every N_i days

This way, datasets that get very little marginal value from getting updated (like arxivs; there are a lot of them and very few are being meaningfully modified) can be updated very rarely or never, whereas datasets that get a lot of value from updates (aisafety.info especially) can be updated every day or so. Finally, datasets that can be updated url-by-url, which gives the advantage of being updateable only using mysql's url data (parsable urls), can be updated in small batches, which spreads out the updates.

Maybe this makes it needlessly more complex, though. Also kind of points to "date_updated" for update_article being implied by the existence of "date_checked" for check_article, but I'm not sure; an advantage of combining those two tasks is that those two dates are combined.

> An issue with the --rebuild option is that datasets would always have to do upserts, rather than inserts. Though that actually might be a good thing?

I don't know for sure, but yeah, in my mind upserts are always better, I think? We keep old information if it's not replaced by new, but any new information is assumed to be improved or better than the original. One potential risk is that some url has its text replaced with "This article is no longer available" or something, which would be a bummer. This relates to the fact that in the validator code, I would have a slight preference for the heuristic being something more like if len(new_text) <= 0.5*len(old_text): skip, otherwise update to new_text; an update that prunes or condenses a few paragraphs seems as worthy of replacing the old text as anything.
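
That alternative heuristic, sketched out (the 0.5 threshold and the helper name are just the suggestion above, not code from this PR):

def should_replace_text(old_text: str, new_text: str, min_ratio: float = 0.5) -> bool:
    # Keep the old text only when the new version has shrunk drastically,
    # which usually signals a dead page ("This article is no longer
    # available") rather than a genuine edit.
    if not new_text:
        return False
    if old_text and len(new_text) <= min_ratio * len(old_text):
        return False
    return True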

ccstan99 (Collaborator) left a comment:

No fundamental objections from me but I also didn't test it myself.

  • Probably a rare edge case, but is there any concern that we happen to check a website that happens to be temporarily down but may be fine the next day? Like checking for X number of failures first?
  • Also just curious, do we use some kind of mysql trigger after update to make sure pinecone stays synced?

mruwnik (Collaborator Author) commented Oct 19, 2023:

I considered adding special handling for that, but decided against it - the entry won't be deleted, it'll just have its status changed. So it should be possible to then just check items with that error and remove the error status, at which point it should be reindexed. Of course, if this starts being an issue it'll have to be addressed.
There isn't a mysql trigger, but the main updater task will add unadded items to pinecone, and remove items that are in it but have error statuses.

@mruwnik merged commit 471078b into main on Oct 26, 2023 (1 check passed)
@mruwnik deleted the article-cleaner branch on October 26, 2023 16:52