This project processes URLs to fetch Lighthouse scores using the Google PageSpeed Insights API.
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/my_project.git
  cd my_project
  ```

- Create a virtual environment and activate it:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```

- Install the dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Add your Google API key: Update the `API_KEY` in `config.py` with your Google PageSpeed Insights API key (a minimal sketch of `config.py` follows this list).
- Place your input data: Ensure that your `cwv.csv` file is in the `data` directory.
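For reference, the script is assumed to read the key as a module-level constant from `config.py`. The layout below is only a minimal sketch; adapt it to whatever the repository's actual `config.py` contains.

```python
# config.py -- minimal sketch; keep your real key out of version control.
API_KEY = "YOUR_GOOGLE_PAGESPEED_INSIGHTS_API_KEY"
```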
- All URLs in the `cwv.csv` file must be in the format `https://www.example.com`.
- Ensure that each URL starts with `https://` and includes `www.` to avoid any issues with API requests.
- The `cwv.csv` file should include a `platform` column, which differentiates between "Carrot" and "Non-Carrot" sites.
- This data is directly gleaned from the script output at carrot-serp-compare. The `TRUE` and `FALSE` values from this script need to be turned into "Carrot" and "Non-Carrot" respectively (a rough conversion sketch follows this list).
- The reason for this differentiation is to provide mean CWV (Core Web Vitals) scores for Carrot vs Non-Carrot sites in the comparison file at the end of the processing.
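As a rough sketch of that conversion (the column names and the exact shape of the carrot-serp-compare output are assumptions, not specified here), the `TRUE`/`FALSE` flags could be mapped to the expected labels with pandas:

```python
import pandas as pd

# Hypothetical one-off conversion: map the TRUE/FALSE flags produced by
# carrot-serp-compare to the "Carrot"/"Non-Carrot" labels expected in the
# platform column of data/cwv.csv, e.g.
#   url,platform
#   https://www.example.com,Carrot
df = pd.read_csv("data/cwv.csv")
df["platform"] = (
    df["platform"].astype(str).str.upper().map({"TRUE": "Carrot", "FALSE": "Non-Carrot"})
)
df.to_csv("data/cwv.csv", index=False)
```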
- Run the main script:

  ```bash
  python main.py
  ```

- The processed Lighthouse scores will be saved in `data/lighthouse_scores.csv`.
- The comparison results will be saved in `data/comparison_results.csv` (a rough sketch of this aggregation follows this list).
- Errors and logs will be saved in `logs/errors.log`.
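The comparison step itself is handled by `main.py`; as an illustration only, assuming `lighthouse_scores.csv` keeps the `platform` column alongside numeric score columns, the aggregation is roughly a group-by mean:

```python
import pandas as pd

# Rough sketch of the comparison output: mean CWV scores per platform.
# Column names are assumptions; the real schema is defined by main.py.
scores = pd.read_csv("data/lighthouse_scores.csv")
comparison = scores.groupby("platform").mean(numeric_only=True)
comparison.to_csv("data/comparison_results.csv")
print(comparison)
```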
- The current setup uses parallel processing to speed up the fetching of Lighthouse scores.
- URLs are processed concurrently using Python's `concurrent.futures.ThreadPoolExecutor`, which significantly reduces the total processing time. That said, the script still takes a long time to execute when there are thousands of URLs. When run in your virtual environment, the console will print progress updates such as "X/Y processed".
- By default, the script uses 10 threads to handle multiple requests in parallel. This can be adjusted by modifying the `max_workers` parameter in the `main.py` script. A minimal sketch of this pattern follows.
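The real implementation lives in `main.py`; the sketch below only illustrates the `ThreadPoolExecutor` pattern and progress reporting described above. The helper names, the `requests` dependency, and the `strategy` parameter are assumptions; the endpoint is the standard PageSpeed Insights v5 endpoint.

```python
import concurrent.futures

import requests

from config import API_KEY  # the key added during setup

PSI_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"


def fetch_lighthouse_score(url):
    """Fetch the Lighthouse performance score (0-1 scale) for a single URL."""
    params = {"url": url, "key": API_KEY, "strategy": "mobile"}
    resp = requests.get(PSI_ENDPOINT, params=params, timeout=120)
    resp.raise_for_status()
    return url, resp.json()["lighthouseResult"]["categories"]["performance"]["score"]


def fetch_all(urls, max_workers=10):
    """Fetch scores concurrently, printing "X/Y processed" progress updates."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fetch_lighthouse_score, u): u for u in urls}
        for done, future in enumerate(concurrent.futures.as_completed(futures), start=1):
            url, score = future.result()
            results[url] = score
            print(f"{done}/{len(urls)} processed")
    return results
```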