Your GitHub feed is a great way to keep up with what's trending in the community. You can discover useful repositories by looking at what your connections star.
However, the feed may include repositories you don't care about. For example, you might be interested only in Python repositories, while the feed also shows repositories written in other languages. As a result, it takes a while to find interesting libraries.
Wouldn't it be nice if you could create a personal dashboard showing the repositories that your connections followed, filtered by your favorite language?
This repository allows you to do exactly that.
At a high level, this project uses:
- GitHub API to write scripts that pull the data from GitHub
- Streamlit to create a dashboard displaying the statistics of the processed data
- Prefect to schedule the scripts that get and process the data to run daily
- Clone this repository
git clone https://github.com/khuyentran1401/analyze_github_feed
- Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
To pull data from your GitHub feed, you will need a GitHub username and an access token. Create a .env file on your local machine and save your GitHub credentials in it:
# .env
username=YOUR_USERNAME
token=YOUR_TOKEN
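To illustrate how a script could pick up these two values, here is a minimal sketch that parses a .env-style file using only the standard library. The function name load_env is hypothetical and is not taken from this repository; the project itself may read the file differently, for example with a dedicated library.

```python
from pathlib import Path


def load_env(path=".env"):
    """Parse KEY=VALUE lines from a .env-style file into a dict (hypothetical helper)."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments such as "# .env".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no "="
            values[key.strip()] = value.strip()
    return values
```

With the file above, load_env() would return a dictionary with the keys username and token.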
Next, run get_and_process_data.py under the development directory to pull data from your GitHub feed and process it:
cd development
python get_and_process_data.py
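As a rough idea of what pulling the feed can look like, the sketch below builds (but does not send) an authenticated request to the GitHub REST API. It assumes the received_events endpoint and HTTP Basic authentication with the username and token; the function name build_feed_request is hypothetical, and the actual script may use different endpoints or an HTTP library.

```python
import base64
import urllib.request


def build_feed_request(username, token, page=1):
    """Build an authenticated request for the events a user receives (hypothetical helper)."""
    # Assumed endpoint: events received by the user, i.e. activity of accounts they follow.
    url = f"https://api.github.com/users/{username}/received_events?page={page}"
    # Basic authentication: "username:token" encoded in base64.
    credentials = base64.b64encode(f"{username}:{token}".encode()).decode()
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Basic {credentials}",
            "Accept": "application/vnd.github+json",
        },
    )
```

Passing the returned object to urllib.request.urlopen would send the request. In the JSON response, events of type WatchEvent correspond to someone starring a repository.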
By default, the script saves only Python repositories and filters out all other languages. To get repositories in another language, type:
python get_and_process_data.py --language <YOUR-FAVORITE-LANGUAGE>
For example, if you want to get all repos in Java, type:
python get_and_process_data.py --language Java
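The --language option with a Python default could be implemented along these lines. This is a sketch under assumptions, not the repository's code: parse_args and filter_by_language are hypothetical names, and each repository is assumed to be a dictionary with a language field that may be empty, as in GitHub API responses.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; --language defaults to Python."""
    parser = argparse.ArgumentParser(description="Filter feed repositories by language.")
    parser.add_argument("--language", default="Python", help="language to keep (default: Python)")
    return parser.parse_args(argv)


def filter_by_language(repos, language):
    """Keep repositories whose 'language' field matches, ignoring case."""
    wanted = language.lower()
    # Repositories without a detected language have None here, so fall back to "".
    return [repo for repo in repos if (repo.get("language") or "").lower() == wanted]
```

For example, filter_by_language(repos, parse_args(["--language", "Java"]).language) keeps only the Java repositories in repos.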
To open a dashboard showing the statistics of all saved repositories, type:
cd ../app
streamlit run Visualize.py
And you should see something like the below:
If you want to pull the data from GitHub and process that data every day, you can use a Prefect deployment.
Go to the directory that contains the file deployment.py:
cd development
Start a Prefect Orion server:
prefect orion start
Configure storage:
prefect storage create
You will see the following options in your terminal:
Found the following storage types:
0) Azure Blob Storage
Store data in an Azure blob storage container.
1) File Storage
Store data as a file on local or remote file systems.
2) Google Cloud Storage
Store data in a GCS bucket.
3) Local Storage
Store data in a run's local file system.
4) S3 Storage
Store data in an AWS S3 bucket.
5) Temporary Local Storage
Store data in a temporary directory in a run's local file system.
Select a storage type to create:
For a quick start, you can choose 5 to store your flow runs in a temporary local directory.
Create a work queue:
prefect work-queue create --tag dev dev-queue
Output:
UUID('e0e4ee25-bcff-4abb-9697-b8c7534355b2')
To run an agent, type prefect agent start followed by the ID of the work queue. Since the ID of dev-queue is e0e4ee25-bcff-4abb-9697-b8c7534355b2, we type:
prefect agent start 'e0e4ee25-bcff-4abb-9697-b8c7534355b2'
In another terminal, create a deployment:
cd development
prefect deployment create deployment.py
Now go to http://127.0.0.1:4200/, click Deployments, then click the deployment Github-repo-flow:
Then click Run in the top right corner:
Then click Flow Runs on the left menu:
And you will see that your flow is scheduled!
As long as your machine stays on and the agent keeps running, the script that pulls and processes the data will run every day.