Manually resolve t.co Card URL if guesswork fails #981

Open
wants to merge 1 commit into
base: master
3 changes: 3 additions & 0 deletions snscrape/base.py
@@ -274,6 +274,9 @@ def _request(self, method, url, params = None, data = None, headers = None, time
     def _get(self, *args, **kwargs):
         return self._request('GET', *args, **kwargs)

+    def _head(self, *args, **kwargs):
+        return requests.head(*args, allow_redirects=False, timeout=10)
+
     def _post(self, *args, **kwargs):
         return self._request('POST', *args, **kwargs)

7 changes: 6 additions & 1 deletion snscrape/modules/twitter.py
@@ -1081,7 +1081,12 @@ def _make_tweet(self, tweet, user, retweetedTweet = None, quotedTweet = None, ca
                         card.url = u.url
                         break
                 else:
-                    _logger.warning(f'Could not translate t.co card URL on tweet {tweetId}')
+                    try:
+                        u = self._head(card.url)
Contributor

Won't this slow down the scraper significantly?

Author

> Won't this slow down the scraper significantly?

It would, but I don't think it'd matter much: HEAD requests are very fast. Would it be better if this were opt-in?
Most tweets I scraped could have the t.co URL translated fine; I didn't hit the warning often.

This is probably the big thing to figure out here first. (I'll have other remarks later.)

It should definitely be configurable. I'm not sure whether it should be on or off by default, though I'm leaning towards off (i.e. opt-in).
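
One way the opt-in behaviour discussed here could look, as a rough sketch only. The names `translate_card_url`, `known_links`, `resolver`, and `resolve_unmatched` are hypothetical and are not part of snscrape. The network lookup is passed in as a callable so that the default path never makes an extra request.

```python
from typing import Callable, Mapping, Optional


def translate_card_url(
    card_url: str,
    known_links: Mapping[str, str],
    resolver: Optional[Callable[[str], Optional[str]]] = None,
    resolve_unmatched: bool = False,  # hypothetical flag, off by default (opt-in)
) -> str:
    """Map a t.co card URL to its target, optionally falling back to a lookup."""
    # Cheap path first: match against links already present in the tweet data.
    if card_url in known_links:
        return known_links[card_url]
    # Expensive path only when the caller explicitly asked for it.
    if resolve_unmatched and resolver is not None:
        resolved = resolver(card_url)
        if resolved:
            return resolved
    # Fall back to the unresolved t.co URL rather than failing.
    return card_url
```

With the flag left at its default, an unmatched URL is returned unchanged and the resolver is never called, so scraping speed is unaffected unless the user opts in.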

+                        assert u.status_code >= 300 and u.status_code < 400
+                        card.url = u.headers["location"]
+                    except:
Contributor

Bare excepts aren't a great idea: they catch everything, including Ctrl-C. Do you just want to catch AssertionError? If so, why not just use an if statement?

Author

> do you just want to catch AssertionError?

AssertionError, or any exception thrown from requests. I think it's better to continue with the t.co URL and log the warning than to crash.

Contributor

@TheTechRobo, Jun 26, 2023

In that case I'd either specify the exceptions or catch Exception (rather than the default BaseException, which catches EVERYTHING, including Ctrl-C):

except (AssertionError, WhateverElseYouWantToHandle, ...):
    ...

or:

except Exception:
    ...

I'd recommend the former so you don't stifle valid errors.

It might also be good to log the actual exception with logging.debug.
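
A sketch of the suggestion in this thread, with the status check written as an `if` instead of an `assert` and only named exception types caught. It uses the standard library's `http.client` rather than `requests` so that it runs without third-party packages; with `requests`, the type to name in the `except` clause would be `requests.RequestException`. The function name `resolve_redirect` is hypothetical and not part of snscrape.

```python
import http.client
import logging
from typing import Optional
from urllib.parse import urlsplit

_logger = logging.getLogger(__name__)


def resolve_redirect(url: str, timeout: float = 10) -> Optional[str]:
    """Send one HEAD request and return the redirect target, or None."""
    parts = urlsplit(url)
    if parts.scheme == 'https':
        conn_cls = http.client.HTTPSConnection
    else:
        conn_cls = http.client.HTTPConnection
    try:
        conn = conn_cls(parts.netloc, timeout=timeout)
        try:
            path = parts.path or '/'
            if parts.query:
                path += '?' + parts.query
            # http.client never follows redirects, so this is a single hop.
            conn.request('HEAD', path)
            response = conn.getresponse()
            status = response.status
            location = response.getheader('Location')
        finally:
            conn.close()
    except (http.client.HTTPException, OSError) as e:
        # Only network and protocol errors are handled here; KeyboardInterrupt
        # and programming errors still propagate to the caller.
        _logger.debug('HEAD %s failed: %r', url, e)
        return None
    # Plain condition instead of assert: asserts are stripped under python -O.
    if 300 <= status < 400 and location:
        return location
    _logger.debug('HEAD %s returned status %d without a usable Location', url, status)
    return None
```

A caller can keep the existing warning for the `None` case and continue with the unresolved t.co URL, which matches the "log and carry on" behaviour the author asked for.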

+                        _logger.warning(f'Could not translate t.co card URL on tweet {tweetId}')
         if 'bookmark_count' in tweet:
             kwargs['bookmarkCount'] = tweet['bookmark_count']
         kwargs['conversationControlPolicy'] = ConversationControlPolicy._from_policy(tweet.get('conversation_control', {'policy': None})['policy'])