Article checker #182
Conversation
@@ -23,6 +23,62 @@
logger = logging.getLogger(__name__)


def normalize_url(url: str | None) -> str | None:
If this is here, shouldn't normalize_urls also be here?
def normalize_urls(urls: Iterable[str]) -> Set[str]:
return {normalize_url(url) for url in urls}
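For reference, a minimal normalize_url along these lines might look like the sketch below (hypothetical; the real implementation may normalize more or fewer things, e.g. query parameters):

```python
from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    """Hypothetical sketch: lower-case scheme/host, drop 'www.', strip a trailing slash."""
    if url is None:
        return None
    parsed = urlparse(url.strip())
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Keep the query string; drop params and fragment
    return urlunparse((parsed.scheme.lower(), host, parsed.path.rstrip("/"), "", parsed.query, ""))
```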
dunno. I only wanted to extract what I actually need. I was also considering moving these functions to a separate module or something, rather than having them here. What would you prefer?
Ah, the reason you need it is that you want to be able to use normalize_url and others without using AlignmentDataset at all, correct? Since normalize_urls is never used without having access to a dataset obj, it can stay in AlignmentDataset?
I have a small preference for keeping it in align_data/common/. What do you think about some format.py file in common/ that deals with common ways to reformat different objects throughout the code, like formatting url strings to a standard-ish format, formatting an authors list to a standard authors string format, formatting an entry's contents dict into some standard-but-flexible article_data_dict format which make_data_entry expects, etc.?
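As a sketch of what such a format.py helper could contain (the name format_authors and its behaviour are hypothetical, just to illustrate the idea):

```python
def format_authors(authors, max_authors=3):
    """Hypothetical helper: collapse a list of author names into one display string."""
    names = [name.strip() for name in authors if name and name.strip()]
    if len(names) > max_authors:
        # Truncate long author lists to keep the string bounded
        return ", ".join(names[:max_authors]) + " et al."
    return ", ".join(names)
```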
@@ -0,0 +1,84 @@
import logging
I don't have a strong opinion on it, but would it make more sense to have this file in align_data/db, since it's used to validate db objects, rather than any source in particular (afaict)?
sort of. It also makes use of the parsers, as it fetches data from the appropriate source etc., so dunno :/
align_data/sources/validate.py
Outdated
for k, v in value.items():
    meta_val = article.meta.get(k)
    if not meta_val or v > meta_val:
        article.meta[k] = v
If the values can't be compared, does it throw an error?
yes. I suppose metadata keys can be ignored in those cases...
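For the record, Python 3 does raise on ordering comparisons between unrelated types, which is what makes the guard necessary:

```python
from datetime import datetime

# Ordering comparisons between unrelated types raise TypeError in Python 3
try:
    datetime(2023, 1, 1) > "2022-01-01"
    comparable = True
except TypeError:
    comparable = False
```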
elif isinstance(value, datetime) and value > article_val:
    setattr(article, field, value)
elif isinstance(value, str) and len(normalize_text(value) or '') > len(normalize_text(article_val) or ''):
    setattr(article, field, normalize_text(value))
the > heuristic seems fine since it will only do anything in the case of comparables like dates and numbers and idk what else, but imo, in other cases it would always update it? Like, if the text changed it would always be updated, rather than only updated when the length of the text grows, which seems like what's going on here. Lmk if I'm wrong
Text changes seem to be:
- slight edits for stylistic reasons etc., which often have a note attached to say when the edit happened - I'm hoping these will result in slightly larger texts
- things being amended - this should result in a larger article
- things being removed, or the article disappearing with only an error message displayed
I didn't want to just overwrite the text if it changed, because the changes might not be for the better. So this seemed like a heuristic that might be better than that. Albeit not by much...
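The heuristic being discussed can be isolated into a small predicate (a sketch; normalize_text here is a trivial stand-in for the project's actual helper):

```python
def normalize_text(text):
    # Stand-in for the project's normalize_text: collapse runs of whitespace
    return " ".join(text.split()) if text else None

def should_replace_text(old, new):
    """Only replace when the normalized new text is strictly longer than the old one,
    on the assumption that amendments grow the article while error pages shrink it."""
    return len(normalize_text(new) or "") > len(normalize_text(old) or "")
```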
What are you thinking would be situations where the more recent changes are worse? If I'm understanding it right, "contents" will only be successful (no errors, non-empty) when the url corresponds to a parsers-friendly url, ie a url that was originally fetched using the parsers, right? In that case, the newly fetched contents will be created through the same method as the original contents of the article, but will have (as the only difference) more recent data.
Actually, it's meant to also work with any pdf and epub. Is that the reason why the contents may be of lesser quality? If the url is not the thing we used to get the pdf in the first place, maybe this generic pdf parser will be worse than a specialized one. Seems tricky.
exactly. Plus some websites return an error in HTML, but with a 200 code, or other such madness, so you can't really trust them :/
Though it might be better to just always replace things if possible...
align_data/sources/validate.py
Outdated
setattr(article, field, normalize_text(value))


def check_article(article: Article) -> Article:
It seems like this would be worth either changing the name or creating a docstring, since the name "check_article" doesn't seem to imply much of what it does.
Seems like:
- it tries to get the source's url
- tries to fetch its contents, which should only ever work if the url is from a domain that has a parser implemented specifically for it.
- if it gets anything and it doesn't return an error (indicating a parser exists for this url and the parse was successful), it tries to update the fields of the corresponding article.
- if the url is not reachable, it is noted as such.
- the date_checked is updated.
Maybe "update_article" would be a sufficiently good name for that? And update_article goes through the fields and runs update_article_fields, which makes sense?
return article


def check_articles(sources: List[str], batch_size=100):
similarly as above, maybe update_articles? since that's what it does, updates them to the newest contents if possible.
If my understanding is right, validation's purpose is:
It seems to me that 1) and 2+3 are pretty different tasks? Would it maybe make more sense to have the validator file have the following purpose: And then, we could just refetch parsers + pdfs + epubs more often? In other words, validator.py would essentially validate that urls work, whereas regular fetching of item_metadata stuff would update some of the fields of the article if needed. In that case, validator would probably make even more sense in common/, but idk. It's possible that you preferred that parsable urls or pdfs or epubs would benefit from being updated batch by batch, some every 4 hours, rather than them being updated when fetching, every week or so. If so I'm curious why.
Regular fetching more often won't work, as the regular fetchers only get new data - they ignore things they've already seen. The basic assumption is that things won't change that often, but it's worth checking every now and then if there are updates. It would be possible to always recheck all urls, but that would take a LOT more time than the current process of only looking for new items.
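A minimal illustration of why (a sketch, assuming fetchers track seen urls roughly like this):

```python
def fetch_new_items(fetched_urls, seen_urls):
    """Regular fetchers skip anything already in the dataset, so existing
    articles are never re-checked by the normal fetch pass."""
    return [url for url in fetched_urls if url not in seen_urls]
```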
Hmmm, I see. would something like
Force-pushed from b856d07 to 130056a
@@ -66,6 +71,7 @@ def check_article(article: Article) -> Article:


def check_articles(sources: List[str], batch_size=100):
this is really minor, but for consistency the first and second function should have the same name, except one plural, imo (ie update_article and update_articles, or check_article and check_articles), since one just calls the other on a set of lots of articles.
@@ -0,0 +1,60 @@
import re
nice
To a certain extent. It would be fine with smaller datasets, which can take a few seconds or minutes to run, but the larger ones (e.g. LW, arxiv) can take hours, which is why I wanted to have a hundred or so checked every few hours.
align_data/sources/validate.py
Outdated
    article.meta[k] = v
except Exception as e:
    # Ignore exceptions here - the metadata isn't that important
    logger.info('Error checking metadata value: %s', value)
NIT - I'm imagining seeing the error logged and having no idea what article it's for; would adding the article url, or something, be beneficial?
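E.g. something like the following (a sketch; the helper update_meta and the exact log format are hypothetical):

```python
import logging

logger = logging.getLogger(__name__)

def update_meta(meta, key, value, article_url):
    """Hypothetical helper: upsert one metadata key, logging the article url on failure."""
    try:
        current = meta.get(key)
        if not current or value > current:
            meta[key] = value
    except TypeError:
        # Including the url makes the log line traceable to a specific article
        logger.info("Error checking metadata value %r for article %s", value, article_url)
```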
I see. I'm completely on board for doing manageable batches at a time for updating. Suppose we call
This way, datasets that get very little marginal value from getting updated (like arxivs; there are a lot of them and very few are being meaningfully modified) can be updated very rarely or never, whereas datasets that get a lot of value from updates (aisafety.info especially) can be updated every day or so. Finally, datasets that can be updated url-by-url, which gives the advantage of being updateable only using mysql's url data (parsable urls), can be updated in small batches, which spreads out the updates. Maybe this makes it needlessly more complex, though. Also kind of points to "date_updated" for
I don't know for sure, but yeah in my mind upserts are always better, I think? We keep old information if it's not replaced by new, but any new information is assumed to be improved or better than the original. One potential risk is that some url has its text replaced with "This article is no longer available" or something, which would be a bummer. This relates to the fact that in the validator code, I would have a slight preference for the heuristic being something more like
Force-pushed from 130056a to 44c3e17
No fundamental objections from me but I also didn't test it myself.
- Probably a rare edge case, but is there any concern that we check a website that happens to be temporarily down but may be fine the next day? Like, check after X num failures?
- Also just curious, do we use some kind of mysql trigger after update to make sure pinecone stays synced?
I considered adding special handling for that, but decided against it - the entry won't be deleted, it'll just have its status changed. So it should be possible to then just check items with that error and remove the error status, at which point it should be reindexed. Of course, if this starts being an issue it'll have to be addressed.
Every 4h it picks 100 articles that haven't been checked in the last 4 weeks and checks:
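The selection step described above could be sketched as follows (a pure-Python stand-in for the actual db query; the field name date_checked is taken from the discussion, everything else is an assumption):

```python
from datetime import datetime, timedelta

def pick_batch(articles, batch_size=100, max_age=timedelta(weeks=4)):
    """Pick up to batch_size articles whose date_checked is missing or older than max_age."""
    cutoff = datetime.now() - max_age
    stale = [a for a in articles if a["date_checked"] is None or a["date_checked"] < cutoff]
    return stale[:batch_size]
```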