Author: Kelly Lieu
BACKGROUND
My goal for this capstone is to learn and practice applying ML/AI techniques together with data visualization tools. It was important to me to choose a dataset that was easy to obtain and a topic that I could learn from and apply at home or at work. The data comes from a Kaggle dataset called "Longevity Factors."
WE WILL EXPLORE
What longevity factors provide the most gain in years of life?
What factors provide the most gain in years of life for females?
What are the top features based on the weighted impact of strength of science on years?
Which machine learning algorithm should I use with this specific dataset?
How can I improve my machine learning model?
SUMMARY OF EDA FINDINGS
The initial exploratory data analysis (EDA) involved loading and cleaning the dataset, identifying missing values, and removing irrelevant columns such as comments, notes, sources, and IDs. Histograms revealed that most longevity effects ranged from 2 to 6 years, with a few extreme negative outliers (e.g., -25 years). The majority of factors had weak scientific backing, and further statistical exploration (boxplots, pairplots) yielded limited insights due to the dataset’s low dimensionality and lack of strong correlations.
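The cleaning step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the small in-memory table stands in for the Kaggle CSV, the column names are assumptions, and dropping rows that lack a years value is one possible choice rather than a confirmed step.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle CSV; all column names are assumptions.
raw = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Factor": ["Factor A", "Factor B", "Factor C", "Factor D"],
    "Years gained / lost": [2.0, -10.0, 3.0, None],
    "Strength of science": [3, 3, 1, 2],
    "Notes": ["", "some note", None, ""],
    "Sources": ["a", "b", "c", "d"],
})


def clean_longevity(df, drop_cols=("ID", "Notes", "Sources", "Comments")):
    """Drop bookkeeping columns, then rows that have no years value."""
    # Only drop the columns that actually exist in this frame.
    present = [c for c in drop_cols if c in df.columns]
    out = df.drop(columns=present)
    # Assumption: a row without a years value cannot be analyzed, so remove it.
    out = out.dropna(subset=["Years gained / lost"])
    return out.reset_index(drop=True)


# Count missing values per column before cleaning.
missing_per_column = raw.isna().sum()
cleaned = clean_longevity(raw)
```

With the real file, `raw` would come from `pd.read_csv(...)`, and a histogram of the years column could then be drawn with `cleaned["Years gained / lost"].hist()`.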
Given the inconclusive initial results, a secondary analysis was conducted to assess the relationship between longevity factors and scientific strength, as well as to explore gender-based differences. A weighted impact score was calculated by multiplying the years gained/lost by the science strength rating (1–3), offering a clearer picture of each factor’s relative importance.
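As a rough sketch of the weighted impact score, the example below multiplies years gained or lost by the science strength rating and ranks the result. The column names (`years`, `science_strength`) and the four placeholder rows are assumptions made for illustration, not values from the dataset.

```python
import pandas as pd

# Placeholder rows; factor names and numbers are illustrative only.
factors = pd.DataFrame({
    "factor": ["Factor A", "Factor B", "Factor C", "Factor D"],
    "years": [4.0, -10.0, 6.0, 2.0],
    "science_strength": [3, 3, 1, 2],  # rating on the 1-3 scale
})


def add_weighted_impact(df):
    """Return a copy with weighted_impact = years * science_strength."""
    out = df.copy()
    out["weighted_impact"] = out["years"] * out["science_strength"]
    return out


scored = add_weighted_impact(factors)

# Highest scores: large gains with strong evidence.
top_positive = scored.sort_values("weighted_impact", ascending=False).head(3)
# Lowest scores: large losses with strong evidence.
top_negative = scored.sort_values("weighted_impact").head(1)
```

In this toy table, a 6-year gain with weak evidence (rating 1) ranks below a 4-year gain with strong evidence (rating 3), which is the reordering the score is meant to produce.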
The top science-backed positive longevity factors included:
Gender-specific analysis required additional data cleaning and normalization. Results showed substantial overlap in top factors across genders, though some differences emerged: for men, “spending time with women” ranked #7, while for women, “having a dog” appeared in the top 10.
Key negative longevity factors were consistent across both genders, highlighting universal risks:
These four factors had the greatest negative impact, with others trailing significantly.
TECHNIQUES FOR CAPSTONE
Exploratory Data Analysis (EDA) and Visualize Results
Feature Engineering
Build a Simple Model
Fit, Train, and Time Multiple Models: linear regression, ridge, lasso, random forest, and gradient boosting
Compare and Choose the Best Model
Perform Hyperparameter Tuning Using GridSearchCV
Perform Optimization with Random Search
Visualize Results of Model Performance
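The fit, time, and compare steps listed above could look roughly like the sketch below. It uses scikit-learn with default settings and synthetic data from `make_regression` as a stand-in for the engineered longevity features, so the variable names, the data, and the resulting ranking are all assumptions.

```python
import time

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real features come from the longevity dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(),
    "lasso": Lasso(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start  # wall-clock training time
    pred = model.predict(X_test)
    results[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "r2": float(r2_score(y_test, pred)),
        "fit_seconds": fit_seconds,
    }

# Pick the model with the lowest test RMSE.
best_name = min(results, key=lambda n: results[n]["rmse"])
```

Because the synthetic target here is linear by construction, the ranking this sketch produces says nothing about the capstone's results; only the comparison pattern carries over.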
SUMMARY OF FINDINGS
Among the five machine learning models evaluated, Gradient Boosting achieved the best performance, with the lowest test RMSE and highest R² score, indicating strong predictive accuracy and generalization. The Random Forest Regressor followed closely on both metrics. The Linear, Lasso, and Ridge Regression models showed weak performance: regularization brought minimal benefit, and Lasso and Ridge likely suffered from excessive coefficient shrinkage. Overall, ensemble methods (Random Forest and Gradient Boosting) outperformed linear models, highlighting their effectiveness in capturing complex patterns in the data. Further hyperparameter tuning of the best model improved its performance by almost 20%, which suggests that tuning is worthwhile for this dataset. The tuned model took about 0.1 seconds longer to run, a trivial cost that is well worth the gain. Charts are included to clearly show comparisons between the models.
In conclusion, this was an excellent exercise to reinforce what I learned about building, comparing, and fine-tuning machine learning models for better predictions, especially when new data arrives and the model is used to predict the longevity of the next subjects.
PERSONAL SUCCESS MEASURES
1. Did I use data that I was interested in?
2. Did this process answer my questions and further my understanding of the topic?
3. Did I further my understanding of Python programming, data science techniques, preparing data, building multiple machine learning models, and plotting results?
4. Did I further my understanding of data visualization techniques using matplotlib and seaborn?
DATA SOURCE AND CITATION
• Arvidsson, Joakim. Longevity Factors (2023). URL: https://www.kaggle.com/datasets/joebeachcapital/life-longevity-factors