Manually resolve t.co Card URL if guesswork fails #981
base: master
@@ -1081,7 +1081,12 @@ def _make_tweet(self, tweet, user, retweetedTweet = None, quotedTweet = None, ca
             card.url = u.url
             break
     else:
-        _logger.warning(f'Could not translate t.co card URL on tweet {tweetId}')
+        try:
+            u = self._head(card.url)
+            assert u.status_code >= 300 and u.status_code < 400
+            card.url = u.headers["location"]
+        except:
Bare

AssertionError or any exception thrown from requests (I think it's better to continue with the t.co URL and log the warning than to crash)

In that case I'd either specify the exceptions or catch Exception (rather than the default BaseException, which catches EVERYTHING, including Ctrl-C):
or:
I'd recommend the former so you don't stifle valid errors. Might also be good to
+            _logger.warning(f'Could not translate t.co card URL on tweet {tweetId}')
 if 'bookmark_count' in tweet:
     kwargs['bookmarkCount'] = tweet['bookmark_count']
 kwargs['conversationControlPolicy'] = ConversationControlPolicy._from_policy(tweet.get('conversation_control', {'policy': None})['policy'])
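The two alternatives raised in the review of the `except:` line (naming the expected exceptions, or catching `Exception` instead of everything) might look roughly like the sketch below. This is not the code from the PR: `resolve_specific`, `resolve_broad`, `FakeResponse`, and the `head` callable are hypothetical stand-ins for `self._head` and its response. The first variant also swaps the `assert` for an explicit check, and it relies on the assumption that the HEAD helper is built on requests, whose `RequestException` derives from `OSError`.

```python
import logging

_logger = logging.getLogger(__name__)


class FakeResponse:
    """Hypothetical stand-in for the object returned by a HEAD request."""

    def __init__(self, status_code, headers=None):
        self.status_code = status_code
        self.headers = headers or {}


def resolve_specific(head, url):
    """Option 1: catch only the failures we expect, else keep the t.co URL."""
    try:
        response = head(url)
        if not 300 <= response.status_code < 400:
            raise ValueError(f'unexpected status {response.status_code}')
        return response.headers['location']
    except (OSError, KeyError, ValueError) as e:
        # OSError: network-level failures (assumed to include requests' errors).
        # KeyError: redirect without a location header. ValueError: not a redirect.
        _logger.warning(f'Could not translate t.co card URL {url}: {e!r}')
        return url


def resolve_broad(head, url):
    """Option 2: catch Exception; KeyboardInterrupt and SystemExit still propagate."""
    try:
        response = head(url)
        assert 300 <= response.status_code < 400
        return response.headers['location']
    except Exception as e:
        _logger.warning(f'Could not translate t.co card URL {url}: {e!r}')
        return url
```

With the first variant, an unrelated bug (say, a `RuntimeError` raised inside `head`) surfaces instead of being logged as a failed translation, which is the point of the reviewer's recommendation.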
Won't this slow down the scraper significantly?
It would, but I don't think it'd matter much. HEAD requests are very fast. Would it be better if this were opt-in?
Most tweets I scraped could have the t.co URL translated fine; I didn't hit the warning often.
This is probably the big thing to figure out here first. (I'll have other remarks later.)
It should definitely be configurable. I'm not sure whether it should be on or off by default, though I'm leaning towards off (i.e. opt-in).
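One way the opt-in behaviour discussed above could be shaped is sketched below. It is only an illustration under assumptions: `make_card_url_resolver`, its `enabled` flag, and `fake_head` are hypothetical names, not part of the project or of this PR. The flag defaults to off, so no extra request is made unless the caller asks for it.

```python
from types import SimpleNamespace


def make_card_url_resolver(head, enabled=False):
    """Return a function that maps a t.co card URL to its redirect target.

    `head` is any callable performing a HEAD request (hypothetical stand-in
    for the scraper's own helper). When `enabled` is False, the default, the
    URL is returned untouched and `head` is never called.
    """
    def resolve(url):
        if not enabled:
            return url
        try:
            response = head(url)
        except OSError:
            # Network-level failure: keep the t.co URL rather than crash.
            return url
        if 300 <= response.status_code < 400 and 'location' in response.headers:
            return response.headers['location']
        return url
    return resolve


# Usage with a fake HEAD helper that records every call it receives.
calls = []

def fake_head(url):
    calls.append(url)
    return SimpleNamespace(status_code=301, headers={'location': 'https://example.com/article'})

off = make_card_url_resolver(fake_head)                # opt-in not requested
on = make_card_url_resolver(fake_head, enabled=True)   # opt-in requested

off_result = off('https://t.co/abc')   # no request is made
on_result = on('https://t.co/abc')     # exactly one HEAD request
```

Keeping the switch in one place means the slowdown raised in the review is only paid by users who explicitly enable the lookup.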