
Article checker #182

Merged: 2 commits into main, Oct 26, 2023
Conversation

@mruwnik (Collaborator) commented Sep 3, 2023

Every 4 hours this picks 100 articles that haven't been checked in the last 4 weeks and checks:

  • whether there is new/better data
  • whether the url is reachable
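
A minimal sketch of the selection step described above, assuming a SQLAlchemy Article model with a date_checked column (the import path, helper name and column name are assumptions, not taken from this PR):

from datetime import datetime, timedelta

from sqlalchemy import select
from sqlalchemy.orm import Session

from align_data.db.models import Article  # assumed import path


def articles_to_check(session: Session, batch_size: int = 100, max_age_weeks: int = 4):
    # Pick the least recently checked articles (or never-checked ones) first.
    cutoff = datetime.utcnow() - timedelta(weeks=max_age_weeks)
    query = (
        select(Article)
        .where((Article.date_checked.is_(None)) | (Article.date_checked < cutoff))
        .order_by(Article.date_checked)
        .limit(batch_size)
    )
    return session.scalars(query).all()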

@mruwnik mruwnik requested review from Thomas-Lemoine, ccstan99 and henri123lemoine and removed request for Thomas-Lemoine and ccstan99 September 3, 2023 19:37
@@ -23,6 +23,62 @@
logger = logging.getLogger(__name__)



def normalize_url(url: str | None) -> str | None:
Collaborator:

If this is here, shouldn't normalize_urls also be here?

def normalize_urls(urls: Iterable[str]) -> Set[str]:
    return {normalize_url(url) for url in urls}
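
For context, a normalize_url along these lines is purely illustrative of what such a helper usually does (the real implementation in this PR may normalize differently):

import re


def normalize_url(url: str | None) -> str | None:
    # Illustrative only: trim whitespace, drop the scheme and a leading
    # "www.", and strip any trailing slash so equivalent links compare equal.
    if not url:
        return url
    url = url.strip()
    url = re.sub(r'^https?://', '', url)
    url = re.sub(r'^www\.', '', url)
    return url.rstrip('/')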

mruwnik (Collaborator Author):

Dunno. I only wanted to extract what I actually need. I was also considering moving these functions to a separate module or something, rather than having them here. What would you prefer?

Collaborator:

Ah, the reason you need it is that you want to be able to use normalize_url and others without using AlignmentDataset at all, correct? Since normalize_urls is never used without having access to a dataset obj, it can stay in AlignmentDataset?

Thomas-Lemoine (Collaborator), Sep 3, 2023:

I have a small preference for keeping it in align_data/common/. What do you think about some format.py file in common/ that deals with common ways to reformat different objects throughout the code, like formatting url strings to a standard-ish format, formatting an authors list to a standard authors string format, formatting an entry's contents dict into some standard-but-flexible article_data_dict format which make_data_entry expects, etc.?

@@ -0,0 +1,84 @@
import logging
Collaborator:

I don't have a strong opinion on it, but would it make more sense to have this file in align_data/db, since it's used to validate db objects rather than any source in particular (afaict)?

mruwnik (Collaborator Author):

Sort of. It also makes use of the parsers, as it fetches data from the appropriate source etc., so dunno :/

for k, v in value.items():
    meta_val = article.meta.get(k)
    if not meta_val or v > meta_val:
        article.meta[k] = v
Collaborator:

If the values can't be compared, does it throw an error?

mruwnik (Collaborator Author):

Yes. I suppose metadata keys can be ignored in those cases...
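
One way to do that (a sketch of the "ignore those cases" idea, using the names from the snippet above, not the code actually merged here) is to catch the comparison error per key:

for k, v in value.items():
    try:
        meta_val = article.meta.get(k)
        if not meta_val or v > meta_val:
            article.meta[k] = v
    except TypeError:
        # e.g. comparing a str against an int - skip just this key
        logger.info('Ignoring incomparable metadata value for %s: %s', k, v)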

elif isinstance(value, datetime) and value > article_val:
    setattr(article, field, value)
elif isinstance(value, str) and len(normalize_text(value) or '') > len(normalize_text(article_val) or ''):
    setattr(article, field, normalize_text(value))
Collaborator:

The > heuristic seems fine, since it will only do anything for comparables like dates and numbers (and idk what else), but imo, in other cases it could just always update? Like, if the text changed it would always be updated, rather than only being updated when the length of the text grows, which seems to be what's going on here. Lmk if I'm wrong.

mruwnik (Collaborator Author):

Text changes seem to be:

  • slight edits for stylistic reasons etc., which often have a note attached to say when the edit happened - I'm hoping these will result in slightly larger texts
  • things being amended - this should result in a larger article
  • things being removed, or the article disappearing with only an error message displayed

I didn't want to just overwrite the text if it changed, because the changes might not be for the better. So this seemed like a heuristic that might be better than that. Albeit not by much...

Collaborator:

What are you thinking would be situations where the more recent changes are worse? If I'm understanding it right, "contents" will only be successful (no errors, non-empty) when the url corresponds to a parser-friendly url, i.e. a url that was originally fetched using the parsers, right? In that case, the newly fetched contents will be created through the same method as the original contents of the article, but will have (as the only difference) more recent data.

Collaborator:

Actually, it's meant to also work with any pdf and epub. Is that the reason why the contents may be of lesser quality? If the url is not the thing we used to get the pdf in the first place, maybe this generic pdf parser will be worse than a specialized one. Seems tricky.

mruwnik (Collaborator Author):

Exactly. Plus some websites return an error in HTML, but with a 200 code, or other such madness, so you can't really trust them :/
Though it might be better to just always replace things if possible...

setattr(article, field, normalize_text(value))


def check_article(article: Article) -> Article:
Collaborator:

It seems like this would be worth either changing the name or creating a docstring, since the name "check_article" doesn't seem to imply much of what it does.
Seems like:

  1. it tries to get the source's url
  2. tries to fetch its contents, which should only ever work if the url is from a domain that has a parser implemented specifically for it.
  3. if it gets anything and it doesn't return an error (indicating a parser exists for this url and the parse was successful), it tries to update the fields of the corresponding article.
  4. if the url is not reachable, it is noted as such.
  5. the date_checked is updated.

Maybe "update_article" would be a sufficiently good name for that?
And update_article goes through the fields and runs update_article_fields, which makes sense?
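
A rough sketch of that flow, for reference. fetch_contents and update_article_fields are hypothetical helper names (fetch_contents stands in for the parser dispatch the thread refers to as item_metadata); only the overall shape is meant to match the description above, not the PR's exact code:

from __future__ import annotations

from datetime import datetime

import requests


def check_article(article: Article) -> Article:
    url = article.url
    # 1-3: try to re-fetch the contents (parser / pdf / epub) and merge any improvements.
    contents = fetch_contents(url)  # hypothetical dispatch to the source parsers
    if contents and not contents.get('error'):
        update_article_fields(article, contents)  # hypothetical field-merging helper
    # 4: note unreachable urls.
    try:
        requests.head(url, timeout=30, allow_redirects=True).raise_for_status()
    except requests.RequestException:
        article.status = 'Unreachable url'
    # 5: always record when the check happened.
    article.date_checked = datetime.utcnow()
    return article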

return article


def check_articles(sources: List[str], batch_size=100):
Collaborator:

Similarly to the above, maybe update_articles? Since that's what it does: update them to the newest contents if possible.

Thomas-Lemoine (Collaborator) commented Sep 4, 2023:

If my understanding is right, validation's purpose is:

  1. for urls that have parsers, link to a pdf or link to an epub, it updates them somewhat.
  2. regardless of 1), check that the url works, and update the article's status accordingly.
  3. update the date_checked

It seems to me that 1) and 2+3 are pretty different tasks? Would it maybe make more sense to have the validator file have the following purpose:
It goes through all sources, and presses on some of the links to see if they work. If they don't, set article.status to "Unreachable url", but update the date_checked in any case. It could have a name like "check_article_url", to indicate that it's only really checking that the url hasn't closed or something, since it was fetched originally.

And then, we could just refetch parsers + pdfs + epubs more often?

In other words, validator.py would essentially validate that urls work, whereas regular fetching of item_metadata stuff would update some of the fields of the article if needed. In that case, validator would probably make even more sense in common/, but idk.

It's possible that you preferred that parsable urls or pdfs or epubs would benefit from being updated batch by batch, some every 4 hours, rather than being updated when fetching, every week or so. If so, I'm curious why.

mruwnik (Collaborator Author) commented Sep 4, 2023:

Regular fetching more often won't work, as the regular fetchers only get new data - they ignore things they've already seen. The basic assumption is that things won't change that often, but it's worth checking every now and then if there are updates. It would be possible to always recheck all urls, but that would take a LOT more time than the current process of only looking for new items.
Ideally, I'd like to move away from there being multiple ways of fetching data to having the list of parsers also contain the things that are currently handled in other places (e.g. have parsers for LW). This is also a mechanism that would sorta-automatically fix any articles that previously couldn't be parsed, e.g. because they didn't have a domain handler.

Thomas-Lemoine (Collaborator):

Hmmm, I see. Would something like
  • regular fetching every X days
  • fetching with --rebuild, or essentially with an empty self._outputted_items, at longer intervals
work? My sense of how much slower it is to fetch all vs. some is pretty weak. Additionally, is fetching an individual article slower than updating it? I suppose it seems pretty similar.

@@ -66,6 +71,7 @@ def check_article(article: Article) -> Article:


def check_articles(sources: List[str], batch_size=100):
Collaborator:

This is really minor, but for consistency the first and second functions should have the same name, except one plural, imo (i.e. update_article and update_articles, or check_article and check_articles), since one just calls the other on a batch of articles.

@@ -0,0 +1,60 @@
import re
Collaborator:

nice

mruwnik (Collaborator Author) commented Sep 4, 2023:

To a certain extent. It would be fine with smaller datasets, which can take a few seconds or minutes to run, but the larger ones (e.g. LW, arxiv) can take hours, which is why I wanted to have a hundred or so checked every few hours.
An issue with the --rebuild option is that datasets would always have to do upserts, rather than inserts. Though that actually might be a good thing?

            article.meta[k] = v
    except Exception as e:
        # Ignore exceptions here - the metadata isn't that important
        logger.info('Error checking metadata value: %s', value)
Collaborator:

Nit: I'm imagining seeing the error logged and having no idea what article it's for; would adding the article url, or something, be beneficial?
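
Something along these lines would do it (assuming the Article model exposes a url attribute):

logger.info('Error checking metadata value for %s: %s', article.url, value)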

Thomas-Lemoine (Collaborator) commented Sep 4, 2023:

> To a certain extent. It would be fine with smaller datasets, which can take a few seconds or minutes to run, but the larger ones (e.g. LW, arxiv) can take hours, which is why I wanted to have a hundred or so checked every few hours.

I see. I'm completely on board for doing manageable batches at a time for updating. Suppose we call updating_article the act of making a single article more up-to-date, which currently only works for parsers + pdfs + epubs, and we call validating_url the act of checking the status code of a fetch of a url and seeing if it works (and modifying .status accordingly), as well as updating the date_checked. Then, I have a slight preference for separating them since, as far as I understand, this would save some (maybe significant? unsure) fetching time, and give us control over what we do. This is an example of what I have in mind:

  • every X days, validate all the urls, or every 4 hours, validate batches of urls
  • separately, update articles from certain sources at some frequency. The idea is that, then, we can decide if one source has a parser-focused updating method, and another has a (perhaps much rarer, but can be automatic) refetching of all the sources, --rebuild style. Like, for each source s_i we decide if it is:
  1. Never updated
  2. updated as one of (parsables + pdfs + epubs), either as batches every X hours or as a whole every N days
  3. updated with refetching + --rebuild or similar, every N_i days

This way, datasets that get very little marginal value from getting updated (like arxivs; there are a lot of them and very few are being meaningfully modified) can be updated very rarely or never, whereas datasets that get a lot of value from updates (aisafety.info especially) can be updated every day or so. Finally, datasets that can be updated url-by-url, which gives the advantage of being updateable only using mysql's url data (parsable urls), can be updated in small batches, which spreads out the updates.

Maybe this makes it needlessly more complex, though. Also kind of points to "date_updated" for update_article being implied by the existence of "date_checked" for check_article, but I'm not sure; an advantage of combining those two tasks is that those two dates are combined.

> An issue with the --rebuild option is that datasets would always have to do upserts, rather than inserts. Though that actually might be a good thing?

I don't know for sure, but yeah, in my mind upserts are always better, I think? We keep old information if it's not replaced by new, but any new information is assumed to be improved or better than the original. One potential risk is that some url has its text replaced with "This article is no longer available" or something, which would be a bummer. This relates to the fact that in the validator code, I would have a slight preference for the heuristic being something more like if len(new_text) <= 0.5*len(old_text): skip, otherwise update to new_text; an update that prunes or condenses a few paragraphs seems as worthy of replacing the old text as anything.
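
That alternative heuristic, sketched out (the 0.5 threshold and the helper name are just the suggestion above, not code from this PR):

def should_replace_text(old_text: str, new_text: str, min_ratio: float = 0.5) -> bool:
    # Keep the old text only when the new version has shrunk drastically,
    # which usually signals a dead page ("This article is no longer
    # available") rather than a genuine edit.
    if not new_text:
        return False
    if old_text and len(new_text) <= min_ratio * len(old_text):
        return False
    return True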

ccstan99 (Collaborator) left a comment:

No fundamental objections from me but I also didn't test it myself.

  • Probably a rare edge case, but is there any concern that we happen to check a website that happens to be temporarily down but may be fine the next day? Like checking for X number of failures first?
  • Also just curious, do we use some kind of mysql trigger after update to make sure pinecone stays synced?

mruwnik (Collaborator Author) commented Oct 19, 2023:

I considered adding special handling for that, but decided against it - the entry won't be deleted, it'll just have its status changed. So it should be possible to then just check items with that error and remove the error status, at which point it should be reindexed. Of course, if this starts being an issue it'll have to be addressed.
There isn't a mysql trigger, but the main updater task will add unadded items to pinecone, and remove items that are in it but have error statuses.

@mruwnik merged commit 471078b into main on Oct 26, 2023 (1 check passed)
@mruwnik deleted the article-cleaner branch on October 26, 2023 16:52