This repository contains a collection of data science projects and notebooks. Each notebook explores different data science techniques, analyses, or machine learning models applied to various datasets and problems. Below is a detailed description of each notebook.
- Student Performance Indicator
- Effect of Government Social Programs on Poverty in Kenya
- Effect of Petroleum Prices Changes on the Demand for Petroleum in Kenya
- Effect of Taxation on SME Performance
- Fine-Tuning English-Swahili Translation Model
- Lyrics Finder
- English-Kiswahili Translation Notebook
- PandemAI
- Supervised Learning with SVM
- Supervised Learning with Random Forests
- Customer Churn Prediction
- Causal Inference with Bayesian Networks
Notebook: EDA_STUDENT_PERFORMANCE_.ipynb
This notebook focuses on analyzing student performance through a comprehensive Exploratory Data Analysis (EDA). It follows the machine learning project lifecycle, starting from understanding the problem statement to data preprocessing, modeling, and choosing the best model.
- Understanding the Problem Statement: Defining the objectives of the analysis.
- Data Collection: Gathering the relevant data on student performance.
- Data Checks to Perform: Ensuring the data's integrity and suitability for analysis.
- Exploratory Data Analysis: Analyzing and visualizing data to uncover patterns and insights.
- Data Pre-Processing: Preparing data for modeling by handling missing values, encoding categorical variables, etc.
- Model Training: Training various machine learning models.
- Choose Best Model: Selecting the most effective model based on evaluation metrics.
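The "Data Checks to Perform" step can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the column names (`gender`, `math_score`, `reading_score`) and the rows are made-up placeholders, and the notebook itself presumably runs the equivalent checks with pandas.

```python
# Hypothetical rows standing in for the student performance data.
rows = [
    {"gender": "female", "math_score": 72, "reading_score": 74},
    {"gender": "male", "math_score": None, "reading_score": 61},
    {"gender": "female", "math_score": 72, "reading_score": 74},  # repeats row 0
    {"gender": "male", "math_score": 88, "reading_score": None},
]


def count_missing(records):
    """Return {column: number of rows where the value is None}."""
    missing = {}
    for record in records:
        for column, value in record.items():
            missing[column] = missing.get(column, 0) + int(value is None)
    return missing


def count_duplicates(records):
    """Count rows that repeat an earlier row exactly."""
    seen = set()
    duplicates = 0
    for record in records:
        key = tuple(sorted(record.items()))  # hashable fingerprint of the row
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


missing_per_column = count_missing(rows)
duplicate_rows = count_duplicates(rows)
print(missing_per_column, duplicate_rows)
```

Missing values and duplicates found this way would then be handled in the pre-processing step.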
Notebook: Effect_of_government_social_programs_on_poverty_in_Kenya.ipynb
This notebook performs descriptive analytics to examine the effect of government social programs on poverty in Kenya. The main focus is on understanding correlations and drawing insights from the data.
- Correlation Analysis: Identifying relationships between variables to understand how social programs may influence poverty.
- Descriptive Analytics: Summarizing and visualizing the data to gain insights into the impact of social programs.
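As a rough sketch of what a correlation analysis computes, the snippet below calculates a Pearson correlation coefficient by hand. The variable names and all values are synthetic and for illustration only; they are not taken from the notebook or its dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Synthetic values: hypothetical program spending and poverty rate per year.
program_spending = [10.0, 12.0, 15.0, 18.0, 20.0]
poverty_rate = [45.0, 44.0, 41.0, 39.0, 36.0]

r = pearson_r(program_spending, poverty_rate)
print(f"Pearson r = {r:.3f}")
```

A value of `r` near -1 or +1 indicates a strong linear association. On its own it does not show that one variable caused the change in the other.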
Notebook: Effect_of_petroleum_prices_changes_on_the_demand_for_petroleum_in_Kenya.ipynb
This notebook explores the relationship between changes in petroleum prices and the demand for petroleum in Kenya. Through descriptive analytics, it aims to uncover correlations and patterns in the data.
- Correlation Analysis: Analyzing the relationship between petroleum prices and demand.
- Descriptive Analytics: Utilizing visualizations and statistical summaries to understand market trends.
Notebook: Effect_of_taxation_on_sme_performance.ipynb
This notebook investigates the impact of taxation on the performance of Small and Medium Enterprises (SMEs). It employs frequency analysis to explore common responses and patterns in the data.
- Frequency Analysis: Identifying the most common responses and trends related to taxation and SME performance.
- Descriptive Analytics: Visualizing the data to gain insights into how taxation affects SMEs.
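Frequency analysis of survey-style answers can be sketched with the standard library. The responses below are hypothetical placeholders, not data from the notebook, which presumably works on a DataFrame instead.

```python
from collections import Counter

# Hypothetical answers to a question such as "Taxation limits my business growth".
responses = [
    "Agree", "Strongly agree", "Agree", "Neutral",
    "Disagree", "Agree", "Strongly agree", "Agree",
]

counts = Counter(responses)
total = len(responses)

# One (answer, count, percentage) row per distinct answer, most common first.
frequency_table = [
    (answer, n, round(100 * n / total, 1))
    for answer, n in counts.most_common()
]

for answer, n, pct in frequency_table:
    print(f"{answer:<15} {n:>3} {pct:>6}%")
```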
Notebook: FineTuningEngSwaModel.ipynb
This notebook demonstrates the process of fine-tuning a translation model for English to Swahili. It utilizes deep learning techniques and frameworks like TensorFlow and Keras for model training and evaluation.
- Import Libraries: Utilizing TensorFlow, Keras, Matplotlib, Seaborn, NumPy, and scikit-learn for various tasks.
- Load and Preprocess the Dataset: Working with the CIFAR-10 dataset, normalizing images, and converting labels for training.
- Model Training: Fine-tuning the translation model using deep learning techniques.
- Evaluation: Assessing the model's performance with appropriate metrics.
Notebook: LyricsFinder.ipynb
This notebook provides a tool for finding song lyrics by scraping Genius.com. It covers the process of collecting URLs and fetching lyrics for a specified number of songs by an artist.
- Get URLs: Obtaining a list of Genius.com URLs for the desired number of songs by a specific artist.
- Fetch Lyrics: Scraping the lyrics from the URLs using BeautifulSoup, including a fix for HTML parsing.
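The lyric-extraction step might look roughly like the sketch below. It parses an inline HTML string rather than fetching a live page. The `data-lyrics-container` selector and the handling of `<br>` tags are assumptions about the page markup, not details confirmed by the notebook, and `extract_lyrics` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def extract_lyrics(html: str) -> str:
    """Pull lyric text out of a page, keeping line breaks.

    Assumption: lyrics live in <div data-lyrics-container="true"> blocks.
    """
    soup = BeautifulSoup(html, "html.parser")
    containers = soup.find_all("div", attrs={"data-lyrics-container": "true"})
    parts = []
    for div in containers:
        # Turn <br> tags into newlines so lines do not run together.
        for br in div.find_all("br"):
            br.replace_with("\n")
        parts.append(div.get_text())
    return "\n".join(parts).strip()


# Inline sample standing in for a downloaded page.
sample_html = """
<html><body>
  <div class="header">Song title</div>
  <div data-lyrics-container="true">First line<br/>Second line</div>
</body></html>
"""

lyrics = extract_lyrics(sample_html)
print(lyrics)
```

If the site changes its markup, the selector will need to be updated accordingly.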
Notebook: eng_kisw_traslation_notebook.ipynb
This notebook focuses on fine-tuning a model for English-Kiswahili translation tasks. It emphasizes the importance of using GPU for accelerated computation and covers various aspects of model fine-tuning.
- Switch Runtime to GPU: Ensuring that the notebook utilizes GPU for faster processing.
- Model Fine-Tuning: Fine-tuning a translation model for improved performance on the English-Kiswahili task.
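One framework-independent way to confirm that a GPU runtime is active is to check whether `nvidia-smi` runs successfully. This is a sketch under the assumption of an NVIDIA GPU. The notebook may instead rely on its deep learning framework's own device check, and `gpu_is_visible` is a hypothetical helper name.

```python
import subprocess


def gpu_is_visible() -> bool:
    """Return True if `nvidia-smi` exists and exits successfully."""
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or hung: treat as "no GPU".
        return False
    return result.returncode == 0


has_gpu = gpu_is_visible()
print("GPU available" if has_gpu else "No GPU detected")
```

In Google Colab, the runtime type is changed under Runtime > Change runtime type.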
Notebook: pandemai.ipynb
This notebook deals with data cleaning and formatting as part of a larger project named "PandemAI." It outlines the steps involved in preparing data for analysis and modeling.
- Data Cleaning: Removing inconsistencies and preparing the dataset for analysis.
- Formatting: Structuring the data in a way that's suitable for further exploration and modeling.
Notebook: supervised_learning(SVM).ipynb
This notebook explores supervised learning techniques using Support Vector Machines (SVM). It delves into training and evaluating SVM models on various datasets.
- Model Training: Implementing SVM algorithms for supervised learning tasks.
- Evaluation: Assessing the performance of SVM models with relevant metrics.
Notebook: supervised_learning(randomForests).ipynb
This notebook examines the application of Random Forest algorithms for supervised learning. It covers the process of training models, selecting attributes, and evaluating performance.
- Attribute Selection: Identifying relevant attributes for modeling.
- Model Training: Implementing Random Forest algorithms for classification tasks.
- Evaluation: Utilizing confusion matrices and classification reports to measure accuracy and performance.
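To make the evaluation step concrete, here is a hand-rolled sketch of a binary confusion matrix with accuracy, precision, and recall. The labels are hypothetical. In the notebook these numbers would presumably come from scikit-learn's `confusion_matrix` and `classification_report` rather than from code like this.

```python
from collections import Counter

# Hypothetical true labels and model predictions (1 = positive class).
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

pairs = Counter(zip(y_true, y_pred))
tp = pairs[(1, 1)]  # predicted positive, actually positive
fn = pairs[(1, 0)]  # predicted negative, actually positive
fp = pairs[(0, 1)]  # predicted positive, actually negative
tn = pairs[(0, 0)]  # predicted negative, actually negative

# Rows are true labels, columns are predicted labels, in the order [0, 1].
confusion = [[tn, fp], [fn, tp]]

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(confusion)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```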
Notebook: Customer Churn Prediction.ipynb
This project focuses on predicting customer churn using various machine learning models to identify factors contributing to customer attrition. The analysis is performed using a dataset of retail customer information, including demographic and behavioral attributes.
i. Data Exploration:
- Loaded and explored the dataset to understand its structure and the distribution of features.
- Visualized target variable distribution, numerical and categorical features, and correlations.
ii. Feature Engineering:
- Encoded categorical variables and scaled numerical features for model training.
- Split data into training and testing sets.
iii. Model Training and Evaluation:
- Trained several classification models: Random Forest, AdaBoost, Support Vector Classifier, and XGBoost.
- Evaluated models using accuracy, classification reports, and confusion matrices.
iv. Results:
- Compared model performance to select the best-performing model for predicting customer churn.
- Generated insights into the effectiveness of different machine learning algorithms in the context of customer churn prediction.
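The feature engineering step (encoding, scaling, and splitting) can be illustrated without any third-party library. The records and the column names (`age`, `plan`, `churned`) are hypothetical, and the notebook presumably uses pandas and scikit-learn utilities for the same operations.

```python
import random
from statistics import mean, pstdev

# Hypothetical customer records; not rows from the actual dataset.
customers = [
    {"age": 25, "plan": "basic", "churned": 0},
    {"age": 40, "plan": "premium", "churned": 1},
    {"age": 31, "plan": "basic", "churned": 0},
    {"age": 52, "plan": "standard", "churned": 1},
    {"age": 37, "plan": "premium", "churned": 0},
]


def one_hot(values):
    """Return (sorted categories, one 0/1 indicator row per value)."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


plan_categories, plan_encoded = one_hot([c["plan"] for c in customers])
age_scaled = standardize([c["age"] for c in customers])

features = [[age] + flags for age, flags in zip(age_scaled, plan_encoded)]
labels = [c["churned"] for c in customers]

# Seeded shuffle so the 80/20 split is reproducible.
indices = list(range(len(features)))
random.Random(42).shuffle(indices)
cut = len(indices) * 4 // 5
train_idx, test_idx = indices[:cut], indices[cut:]

X_train = [features[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
X_test = [features[i] for i in test_idx]
y_test = [labels[i] for i in test_idx]
```

This sketch scales before splitting for brevity. A stricter pipeline would compute the scaling statistics on the training split only, so that no information from the test set leaks into training.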
- pandas: 1.5.3
- numpy: 1.24.3
- matplotlib: 3.8.0
- seaborn: 0.14.0
- scikit-learn: 1.3.0
- xgboost: 2.1.0
- Dataset: Online Retail Customer Churn Dataset
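To compare a local environment against the package versions listed above, a small standard-library check along these lines may help. The `pinned` mapping simply repeats the versions from this README, and `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions as listed in this README.
pinned = {
    "pandas": "1.5.3",
    "numpy": "1.24.3",
    "matplotlib": "3.8.0",
    "seaborn": "0.14.0",
    "scikit-learn": "1.3.0",
    "xgboost": "2.1.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


report = {name: (wanted, installed_version(name)) for name, wanted in pinned.items()}

for name, (wanted, found) in report.items():
    if found is None:
        status = "not installed"
    elif found == wanted:
        status = "ok"
    else:
        status = "mismatch"
    print(f"{name}: wanted {wanted}, found {found} ({status})")
```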
To explore these notebooks, you can open them directly in Google Colab using the provided links. Each notebook contains the necessary code and instructions to replicate the analyses and results.
- Python 3.x
- Jupyter Notebook
- Libraries: TensorFlow, Keras, Matplotlib, Seaborn, NumPy, scikit-learn, BeautifulSoup, pandas, etc.
- Clone the repository:
git clone https://github.com/aueskinj/Data-Science-Projects.git
- Navigate to the project directory:
cd Data-Science-Projects
- Open a Jupyter Notebook environment and select the desired notebook to run.
- Kimuhu Njuguna
Feel free to explore the notebooks, modify the code, and apply these techniques to your own data science projects!