About

Author

This script was written by Dominik Rappaport. You can contact me via email: [email protected].

Introduction

The Strava Segment Downloader is a Python script that downloads the full leaderboard of a given Strava segment. The data is stored in a CSV file, which is the de facto standard for exchanging statistical data.

Why do you want to use this script?

Strava does not provide its users with advanced analysis methods for segment leaderboards. You cannot apply advanced filters or calculate statistical values like mean, median, or standard deviation. All of this is easy to do with software like R or Excel, and the CSV file generated by this script can be imported directly into these tools.
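As a rough illustration of the kind of analysis this enables, the following sketch computes mean, median, and standard deviation with the Python standard library. The column name time_seconds and the sample rows are assumptions made up for this example; the columns of the generated CSV may be named differently.

```python
import csv
import io
import statistics


def summarize(csv_text, column="time_seconds"):
    """Return mean, median and sample standard deviation of one numeric column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Skip rows where the value is missing or empty.
    values = [float(row[column]) for row in reader if row.get(column)]
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


# Hypothetical sample data; the real file may use different column names.
SAMPLE = """rank,name,time_seconds
1,Athlete A,600
2,Athlete B,630
3,Athlete C,660
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

To analyze a real file, read its contents and pass them to summarize together with the name of the column you are interested in.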

Background details

Strava offers a public API to interact with their data programmatically. That would be the most natural way of fetching the leaderboard data. Unfortunately, Strava deprecated the API endpoint for downloading leaderboards in 2020. This link provides more information:

https://developers.strava.com/docs/segment-changes/

The background of that controversial decision is described in an article by the well-known cycling blogger DC Rainmaker:

https://www.dcrainmaker.com/2020/05/strava-leaderboard-reduces.html

As a consequence, screen scraping is the only remaining way to get that data. Because Strava's website makes extensive use of JavaScript, the leaderboard is not present in the raw HTML, so parsing libraries like BeautifulSoup cannot extract it on their own. We have to use Selenium to remote control a browser instead.

Challenges that come with screen scraping

Screen scraping is a fragile method of getting data from a website. The website's structure may change at any time, and the script may break as a consequence.

In addition, Strava imposes measures to prevent people from doing exactly that. In particular, they apply a rate limit to the number of requests you can make to their website. If you exceed that limit, you will be blocked from accessing the leaderboard data for a certain period of time (typically 24 hours).

To make the script work under these conditions, the user can interrupt the script using Ctrl+C (SIGINT) and continue another day. With the switch --resume, it continues where it left off. Obviously, that may introduce inconsistencies in the data, as the leaderboard may have changed in the meantime.
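The interrupt-and-resume idea can be sketched as follows. This is not the script's actual implementation, only a minimal illustration under assumptions: fetch_page is a hypothetical callable that returns the rows of one leaderboard page, and progress is kept in a hypothetical JSON state file.

```python
import json
import os


def download(fetch_page, total_pages, state_path, resume=False):
    """Fetch pages 1..total_pages, saving progress so a later run can resume.

    fetch_page is a hypothetical callable returning the rows of one page.
    """
    state = {"next_page": 1, "rows": []}
    if resume and os.path.exists(state_path):
        with open(state_path, encoding="utf-8") as fh:
            state = json.load(fh)

    try:
        while state["next_page"] <= total_pages:
            rows = fetch_page(state["next_page"])
            state["rows"].extend(rows)
            state["next_page"] += 1
    except KeyboardInterrupt:
        # Ctrl+C (SIGINT) surfaces as KeyboardInterrupt; keep what we have.
        pass
    finally:
        # Always write the state so the next run knows where to continue.
        with open(state_path, "w", encoding="utf-8") as fh:
            json.dump(state, fh)

    return state
```

A page that was being fetched when the interrupt arrived is not counted as done, so a resumed run requests it again.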

How to use the script

Installation

Clone the repository and install the required Python packages:

git clone https://github.com/dominikrappaport/SegmentDownloader.git
cd SegmentDownloader
pip install -r requirements.txt

Usage

Selenium starts the browser with a blank profile, so we have to log in to Strava first. If you use the script often, Strava may temporarily block your account. To avoid this, we use an authentication script that logs in to Strava and saves the session cookies in a file. This file is then used by the main script to authenticate.

Username and password are stored in environment variables. I decided to use environment variables instead of command line parameters to make it easier to use the script programmatically, for example in GitHub Actions together with GitHub secrets.

export STRAVA_USERNAME="your_username"
export STRAVA_PASSWORD="your_password"
python authenticate.py

The script saves the cookies in a file cookies.pkl. As of today, the filename is hardcoded.
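The sketch below shows how such a cookie round trip can look. It is an illustration, not a copy of authenticate.py: the function names are made up for this example, and driver stands for any object offering Selenium WebDriver's get_cookies() and add_cookie() methods.

```python
import os
import pickle


def read_credentials(env=os.environ):
    """Read the Strava login from the environment, failing early if unset."""
    try:
        return env["STRAVA_USERNAME"], env["STRAVA_PASSWORD"]
    except KeyError as missing:
        raise SystemExit(f"Environment variable {missing} is not set")


def save_cookies(driver, path="cookies.pkl"):
    """Persist the browser session cookies after a successful login."""
    with open(path, "wb") as fh:
        pickle.dump(driver.get_cookies(), fh)


def load_cookies(driver, path="cookies.pkl"):
    """Restore saved cookies into a browser that is already on the Strava domain."""
    with open(path, "rb") as fh:
        for cookie in pickle.load(fh):
            driver.add_cookie(cookie)
```

Note that Selenium only accepts cookies for the domain of the page currently loaded, so the browser has to open a Strava page before load_cookies is called.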

Then you can run the main script passing the segment ID as a command line parameter:

python segment_downloader.py 12345678

The script will download the leaderboard of the segment with the ID 12345678. It creates a CSV file with the name leaderboard_12345678.csv.

You can interrupt the script at any time using Ctrl+C, as described above in the section Challenges that come with screen scraping. If you want to continue where you left off, use the --resume switch:

python segment_downloader.py --resume 12345678

Notes and Warnings

  • The script uses the Chrome browser and expects the Strava pages to be in English. It may fail if the pages are in a different language because we identify elements such as buttons and categories by their labels.
  • The script tries to compile a single leaderboard with all data in one table. In Strava, attributes such as age group, sex, and weight group are not included in the full table. You can only see whether a user is male or female if the leaderboard entry is displayed while the respective filter is applied. To get the full data, the script downloads the leaderboard for each category separately and joins the tables.
  • Please note that no user is obliged to specify their sex, weight or age or keep these values up to date. You may end up with missing data or wrong data in these columns.
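
The join described in the notes above can be sketched as follows. This is a simplified illustration rather than the script's own code: the athlete_id key, the column names, and the category values are hypothetical. Athletes who appear under no filter value get None in that column, which is how missing data shows up.

```python
def join_categories(overall, filtered):
    """Annotate overall leaderboard rows with attributes seen under filters.

    overall:  list of row dicts, each with a hypothetical "athlete_id" key.
    filtered: {column: {value: [athlete_id, ...]}}, e.g. the athletes listed
              when the "sex" filter is set to "male".
    """
    merged = []
    for row in overall:
        combined = dict(row)
        for column, groups in filtered.items():
            # None marks athletes who appear under no filter value,
            # e.g. because they never specified the attribute.
            combined[column] = next(
                (value for value, ids in groups.items()
                 if row["athlete_id"] in ids),
                None,
            )
        merged.append(combined)
    return merged


if __name__ == "__main__":
    overall = [{"athlete_id": 1, "time": 600}, {"athlete_id": 2, "time": 630}]
    filtered = {"sex": {"male": [1]}}
    print(join_categories(overall, filtered))
```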