https://chulinuwu-articlepopularitypre-03-data-visualizationapp-ho1xxi.streamlit.app/
pip install -r requirements.txt
This section covers converting raw data files to JSON and scraping data from the Scopus website. The scraper logs in to Scopus with the credentials you define in secret.py.
- Navigate to the project directory:
cd 00_Web_Scraping
code .
- Convert the raw data files to JSON (update the folder path in convert_file_type.py first):
python convert_file_type.py
- Create a file named secret.py with the following content:
_id = "your email"
_pass = "your pass"
- Execute scraping_script.py to scrape data from the Scopus website:
cd scobus_scraping
python scraping_script.py
- (Optional) Concatenate the CSV files from scraping (update the input file paths in concat_csv.py first):
python concat_csv.py
- Clean the CSV file:
python clean_csv.py
cd ..
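For orientation, the optional concatenation step can be sketched in a few lines of standard-library Python. This is a minimal sketch, not the contents of concat_csv.py: the function name `concat_csv`, the file names in the usage note, and the assumption that every input file shares the same header row are illustrative only.

```python
import csv
from pathlib import Path


def concat_csv(input_paths, output_path):
    """Append the rows of several CSV files into one output file.

    Assumption (not taken from the project): all inputs share the same
    header row, which is written once at the top of the output.
    Returns the number of data rows written.
    """
    header = None
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for path in input_paths:
            with open(path, newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip completely empty files
                if header is None:
                    # The first non-empty file defines the header.
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"Header mismatch in {Path(path).name}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

A call such as `concat_csv(["part_1.csv", "part_2.csv"], "scraped_all.csv")` (hypothetical file names) would then produce a single file for the cleaning step.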
This section involves extracting data provided by the professor and combining it with the data scraped from the web.
- Run the 1_ingest_Ajarn.ipynb file to extract the data provided by the professor.
- Run the 2_concat_data.ipynb file to combine the data from the professor with the data scraped from the web.
- The combined data will be saved as 2_data_combined.
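Conceptually, the combination step stacks the two sources and keeps one record per paper. The sketch below illustrates that idea with plain Python dictionaries; it is not the code in 2_concat_data.ipynb. The key field name `paper_id`, the function name, and the rule that the professor's record wins on a clash are assumptions made for illustration.

```python
def combine_records(professor_rows, scraped_rows, key="paper_id"):
    """Merge two lists of dict records, keeping one record per key.

    Assumption: when both sources contain the same key, the
    professor-provided record is kept and the scraped one is dropped.
    """
    combined = {}
    # Insert scraped rows first so professor rows overwrite them on a clash.
    for row in scraped_rows:
        combined[row[key]] = row
    for row in professor_rows:
        combined[row[key]] = row
    return list(combined.values())
```

In the notebooks the same effect may well be achieved with a DataFrame library; the dictionary version only shows the deduplication logic.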
This section covers how to store the combined dataset in a Cassandra database and interact with it using CQL (the Cassandra Query Language).
- Install Cassandra locally or use a cloud-based instance. For a local installation, start Cassandra with:
sudo service cassandra start
- Set up the database by running:
cqlsh -f 01.5_Data_Storage/structure.cql
- Write the data to Cassandra using PySpark:
python -u 01.5_Data_Storage/spark_storage.py
After completing these steps, you can query and use the stored data with PySpark.
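As a rough sketch of that last point, reading a Cassandra table into a Spark DataFrame usually goes through the Spark Cassandra Connector data source. The keyspace and table names in the usage note are placeholders (the real ones are defined in structure.cql), the helper names are hypothetical, and the sketch assumes the connector package is available to the Spark session.

```python
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def cassandra_options(keyspace, table):
    """Build the option map the connector expects for a table read."""
    return {"keyspace": keyspace, "table": table}


def load_table(spark, keyspace, table):
    """Load a Cassandra table as a Spark DataFrame.

    `spark` is an existing SparkSession that was created with the
    Spark Cassandra Connector on its classpath (an assumption here).
    """
    reader = spark.read.format(CASSANDRA_FORMAT)
    return reader.options(**cassandra_options(keyspace, table)).load()
```

With a running session, `load_table(spark, "my_keyspace", "my_table").show(5)` would print the first few stored rows; both names are placeholders.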
This section uses the 2_data_combined dataset to train a model that predicts the cited-by count of research papers.
- Run the train_model.ipynb file to train the model on the combined dataset and generate cited-by count predictions.
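When judging the predictions, it can help to compare the model's error with a trivial baseline. The following sketch is one possible way to do that and is not part of train_model.ipynb: it computes the mean absolute error (MAE) of a list of predictions, and the MAE of a baseline that always predicts the mean cited-by count of the training papers. The function names are illustrative.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted counts."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def baseline_mae(train_counts, test_counts):
    """MAE of always predicting the mean of the training counts."""
    mean_count = sum(train_counts) / len(train_counts)
    return mean_absolute_error(test_counts, [mean_count] * len(test_counts))
```

If the trained model's MAE on held-out papers is not clearly below `baseline_mae`, the features are adding little beyond the average citation level.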
This section involves visualizing the data using Streamlit.
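- Navigate to the project root directory and run the app (the path below uses Windows separators; on Linux or macOS use 03_Data_Visualization/app.py instead):
streamlit run .\03_Data_Visualization\app.py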