A data pipeline, built with the Luigi package, that retrieves and scores posts from the top subreddits.
The score of a subreddit is calculated by the following formula:
where the post score (PostRS) is calculated by:
This project scores the top 50 subreddits, rating each one on the basis of the top 5 comments of its top 10 posts.
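The selection step described above (top 10 posts, then top 5 comments of each) could be sketched as follows. This is only an illustration, not the project's code: the dictionary layout (`id`, `score`, `comments`) and the function name `select_top` are hypothetical, and "top" is assumed to mean "highest score". The sketch covers selection only and does not reproduce the scoring formula.

```python
from heapq import nlargest

TOP_POSTS = 10     # posts considered per subreddit (from the README)
TOP_COMMENTS = 5   # comments considered per post (from the README)


def select_top(posts, n_posts=TOP_POSTS, n_comments=TOP_COMMENTS):
    """Pick the highest-scoring posts and, for each, its highest-scoring comments.

    `posts` is assumed to be a list of dicts shaped like
    {"id": str, "score": int, "comments": [{"score": int}, ...]}.
    This shape is hypothetical and chosen only for the example.
    """
    top_posts = nlargest(n_posts, posts, key=lambda p: p["score"])
    selected = []
    for post in top_posts:
        top_comments = nlargest(
            n_comments, post["comments"], key=lambda c: c["score"]
        )
        selected.append(
            {
                "id": post["id"],
                "score": post["score"],
                # Comment scores in descending order; inputs to the formula.
                "comment_scores": [c["score"] for c in top_comments],
            }
        )
    return selected
```

The returned post and comment scores are the values a scoring step would then combine.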
Sample Output:
The Python schedule package is used to run this pipeline twice a day.
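A minimal sketch of a twice-a-day setup with the schedule package is shown here. The run times (`06:00` and `18:00`) and the names `run_pipeline`, `register_twice_daily`, and `main` are assumptions made for illustration; the actual times and entry point used in app.py may differ.

```python
import time

# Assumed run times; the README only says "twice a day".
RUN_TIMES = ("06:00", "18:00")


def run_pipeline():
    """Hypothetical placeholder for whatever starts the Luigi pipeline."""
    print("running pipeline")


def register_twice_daily(scheduler, job, run_times=RUN_TIMES):
    """Register `job` once per entry in `run_times` on the given scheduler.

    `scheduler` can be the `schedule` module itself or a
    `schedule.Scheduler` instance; both expose `.every()`.
    """
    for run_time in run_times:
        scheduler.every().day.at(run_time).do(job)


def main():
    # Third-party dependency, imported here so the helper above can be
    # imported without it.
    import schedule

    register_twice_daily(schedule, run_pipeline)
    while True:
        schedule.run_pending()  # runs any job whose time has come
        time.sleep(60)          # checking once a minute is enough for HH:MM jobs
```

Calling `main()` blocks forever, so it would typically be the last thing the entry script does.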
- Install the dependencies by running
pip install -r requirements.txt
- Add the following API credentials (you'll need to register for the Reddit API) to a .env file: a) Client ID b) Client Secret c) Username d) Password e) User agent
- Run the file:
python app.py
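Before the first run, it can help to confirm that the .env file contains all five credentials. The sketch below is a standard-library check under stated assumptions: the key names in `REQUIRED_KEYS` are hypothetical (use whatever names app.py actually reads), and the parser handles only simple `KEY=VALUE` lines.

```python
from pathlib import Path

# Hypothetical key names for the five credentials listed above;
# replace them with the names the project actually expects.
REQUIRED_KEYS = (
    "CLIENT_ID",
    "CLIENT_SECRET",
    "USERNAME",
    "PASSWORD",
    "USER_AGENT",
)


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


def check_env_file(path=".env"):
    """Read the file at `path` and list any missing credentials."""
    return missing_keys(parse_env_text(Path(path).read_text()))
```

An empty list from `check_env_file()` means every expected key has a non-empty value; it does not verify that the credentials are accepted by Reddit.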