The goal of this project is to extract textual data from articles using provided URLs and perform text analysis to compute various metrics.
Web Scraping:
- Used the `BeautifulSoup` library for web scraping.
- Extracted data from the first URL and converted it into a string, then a list of words.
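
The scraping step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `extract_words`, the inline sample HTML, and the choice to keep only `<p>` text are assumptions. In a real run the HTML would first be downloaded, for example with `requests.get(url).text`.

```python
from bs4 import BeautifulSoup


def extract_words(html: str) -> list[str]:
    """Return the paragraph text of an HTML page as a list of words."""
    soup = BeautifulSoup(html, "html.parser")
    # Join the text of every <p> tag into one string (assumption: the
    # article body lives in paragraph tags).
    text = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
    # Splitting on whitespace turns the string into a list of words.
    return text.split()


# Stand-in for a downloaded page; a real run would fetch the URL first.
sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>Hello web scraping.</p><p>Second paragraph.</p>"
    "</body></html>"
)
words = extract_words(sample_html)
```

Here the heading text is skipped because only paragraph tags are read; a different page layout would need a different selector.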
Data Manipulation:
- Converted the extracted data into a pandas DataFrame and further into a NumPy array for manipulation.
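
As a rough sketch of this conversion, assuming the words arrive as a plain Python list (the column name `word` and the lower-casing step are illustrative choices, not taken from the project):

```python
import pandas as pd

words = ["Hello", "web", "scraping", "Hello"]

# One row per word; the column name "word" is an illustrative choice.
df = pd.DataFrame({"word": words})

# Lower-case the column in pandas, then hand the values over as a NumPy array.
word_array = df["word"].str.lower().to_numpy()
```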
Text Analysis:
- Created a `TextAnalysis` class inside the `TextAnalysis.py` file.
- Defined class attributes for `StopWords`, `PositiveWords`, and `NegativeWords` so that the same word lists are shared across all URL data.
- Methods were created to:
  - Load stop words, positive words, and negative words from files.
  - Extract, clean, and analyze the text data.
- Employed exception handling during data extraction.
- Ordered the method calls so that values other metrics depend on, such as word count, are assigned first.
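
The class design described above might look roughly like the sketch below. Only the class name and the three class attributes come from this README; the method names, the one-word-per-line file layout, and the polarity formula are assumptions made for illustration.

```python
import re


class TextAnalysis:
    # Class attributes: loaded once and shared by every instance (every URL).
    StopWords: set[str] = set()
    PositiveWords: set[str] = set()
    NegativeWords: set[str] = set()

    @staticmethod
    def load_word_list(path: str) -> set[str]:
        """Read one word per line from a file (assumed file layout)."""
        with open(path, encoding="utf-8", errors="ignore") as handle:
            return {line.strip().lower() for line in handle if line.strip()}

    def __init__(self, text: str):
        self.text = text
        self.words: list[str] = []
        self.word_count = 0

    def clean(self) -> None:
        """Tokenise, drop stop words, and set word_count for later metrics."""
        tokens = re.findall(r"[a-z']+", self.text.lower())
        self.words = [t for t in tokens if t not in self.StopWords]
        self.word_count = len(self.words)

    def analyze(self) -> dict:
        """Return a dictionary of metrics for this text."""
        self.clean()  # must run first: the scores below rely on self.words
        positive = sum(w in self.PositiveWords for w in self.words)
        negative = sum(w in self.NegativeWords for w in self.words)
        total = positive + negative
        # Assumed formula: ranges from -1 (all negative) to +1 (all positive).
        polarity = (positive - negative) / total if total else 0.0
        return {
            "word_count": self.word_count,
            "positive_score": positive,
            "negative_score": negative,
            "polarity_score": polarity,
        }


# Usage with tiny in-memory word lists instead of files.
TextAnalysis.StopWords = {"the", "is", "a"}
TextAnalysis.PositiveWords = {"good", "great"}
TextAnalysis.NegativeWords = {"bad"}
result = TextAnalysis("The movie is good, great even, but the ending is bad.").analyze()
```

Calling `clean()` at the top of `analyze()` is one way to enforce the ordering constraint: the word count and word list exist before any score reads them.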
Automation:
- Created a `Main.py` file that imports the `TextAnalysis` class and performs analysis on each URL iteratively.
- Results are stored in a dictionary and exported to an `Output.xlsx` file.
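
A driver loop along these lines could be sketched as below. The helper names `run_all`, `export`, and `fake_analyze` are hypothetical, and `fake_analyze` is only a placeholder for the real per-URL analysis; the actual `Main.py` may be organised differently.

```python
from typing import Callable

import pandas as pd


def run_all(urls: list[str], analyze: Callable[[str], dict]) -> pd.DataFrame:
    """Analyze each URL in turn and collect the metrics in a dictionary."""
    results: dict[str, dict] = {}
    for url in urls:
        try:
            results[url] = analyze(url)
        except Exception as exc:  # keep going if one URL fails
            results[url] = {"error": str(exc)}
    # One row per URL, one column per metric.
    return pd.DataFrame.from_dict(results, orient="index")


def export(frame: pd.DataFrame, path: str = "Output.xlsx") -> None:
    """Write the results to an Excel file (pandas uses openpyxl for .xlsx)."""
    frame.to_excel(path, index_label="URL")


def fake_analyze(url: str) -> dict:
    """Placeholder for the real per-URL analysis (hypothetical)."""
    return {"word_count": len(url.split("/"))}


urls = ["https://example.com/a", "https://example.com/b/c"]
frame = run_all(urls, fake_analyze)
```

Calling `export(frame)` would then write the spreadsheet to the current directory.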
- Ensure all required libraries are installed by running:
  `pip3 install -r requirements.txt`
- Requests
- Bs4 (BeautifulSoup)
- pandas
- NLTK
- Openpyxl
- macOS
- Clone the repository:
  `git clone https://github.com/samarth-jain28/Web-Scraping-and-Text-Analysis-Project/`
- Navigate to the project directory:
  `cd Web-Scraping-and-Text-Analysis-Project`
- Install the dependencies:
  `pip3 install -r requirements.txt`
- Run the `Main.py` file to start the analysis:
  `python3 Main.py`
- The results will be saved in `Output.xlsx` within the project directory.
- Add support for additional languages in text analysis.
- Implement sentiment analysis using machine learning models.