Course: Part-Time Data Science
Instructor: Amber Yandow
- Author: Fennec C. Nightingale
Predict & map whether or not a star is likely to have an exoplanet, and make recommendations for astronomers to speed up the rate of exoplanet discovery. We focus on recall so we are unlikely to incorrectly miss any planets.
CSV & XML tree data on stars, their planets, their parent systems, and their physical characteristics.
- Our exoplanet data is from the Open Exoplanet Catalogue & includes: star names, magnitudes, radii, distances, right ascension/declination, and spectral class.
- Our additional star data is from the HYG dataset & includes: ID numbers, names, magnitudes, luminosity, x, y, z coordinates for the stars, spectral class, and some details about each star's orbit.
I used Python in Jupyter Notebook to perform the OSEMN process and logistic regression to create our model and exoplanet predictions.
We obtained our star data from the links above over at Kaggle. If you want to get started on your own classification project like this, fork this repo.
After importing all of our data, we checked it for null values, outliers, duplicates, and any other errors there might be in our dataset. We checked each column and decided what data we needed to keep or discard, what we might need to fill, or any other alterations we could make to fix up our data before we start modeling. This turned out to discard too much of our initial exoplanet dataset on its own, so I also lined up the ID numbers with the HYG dataset so I could randomly sample stars around which we have not found planets.
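The scrub-and-sample step above can be sketched roughly as below. The column names (`id`, `has_planet`) and the cleaning rules are illustrative assumptions, not the notebook's exact schema:

```python
import pandas as pd

def scrub(df: pd.DataFrame) -> pd.DataFrame:
    """Basic cleaning sketch: drop exact duplicates and rows with nulls."""
    return df.drop_duplicates().dropna()

def add_negatives(exo: pd.DataFrame, hyg: pd.DataFrame,
                  n: int, seed: int = 42) -> pd.DataFrame:
    """Label catalogue stars as positives, then randomly sample n HYG stars
    with no known planet (matched on a shared 'id' column) as negatives."""
    exo = exo.assign(has_planet=1)
    negatives = (hyg[~hyg["id"].isin(exo["id"])]
                 .sample(n, random_state=seed)
                 .assign(has_planet=0))
    return pd.concat([exo, negatives], ignore_index=True)
```

Sampling the negatives from the full HYG catalogue keeps the two classes comparable while avoiding the heavy row loss that strict null-dropping caused on the exoplanet data alone.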
We check out our data to see how our values are distributed, whether there is any strong correlation, or whether there is anything we missed in our scrubbing. Some of the categories we wanted to include had very high correlations, but our cutoff was 0.6, and there was no way to fix the multicollinearity through strategies like multiplication, so those categories were dropped.
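The correlation-based pruning can be sketched as follows; the 0.6 cutoff is the one named above, while the function name and the choice of which column in a correlated pair to drop are assumptions:

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, cutoff: float = 0.6) -> pd.DataFrame:
    """Drop one column from every pair whose absolute Pearson
    correlation exceeds the cutoff (keeps the first-seen column)."""
    corr = df.corr().abs()
    # Look only at the upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]
    return df.drop(columns=to_drop)
```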
We use the scikit-learn logistic regression module to get our best fit in this project.
To work with some of our data in this model, we also have to get dummies for our categorical variables. After an initial model including all of our variables, we used a GridSearchCV to go back through and refine our model, trying to make our predictions stronger. After modeling, we check all available evaluation metrics & compare.
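A minimal sketch of that dummies-plus-grid-search pipeline is below. The target name, feature columns, and parameter grid are placeholders, not the project's exact settings; scoring on recall reflects the stated goal of not missing planets:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def fit_model(df: pd.DataFrame, target: str = "has_planet") -> GridSearchCV:
    """One-hot encode categoricals, then grid-search a logistic
    regression's regularization strength, optimizing recall."""
    X = pd.get_dummies(df.drop(columns=[target]), drop_first=True)
    y = df[target]
    grid = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1, 10]},  # illustrative grid
        scoring="recall",
        cv=3,
    )
    grid.fit(X, y)
    return grid
```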
Here we take a deep dive into figuring out what our evaluation metrics are saying about our models & plot how our best features compare.
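Comparing the metrics side by side can be done with a small helper like this (the function name is an assumption; the metrics are the standard scikit-learn ones):

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

def report(y_true, y_pred) -> dict:
    """Collect the headline classification metrics in one place;
    recall comes first since missing a real planet is the costliest error."""
    return {
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
```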
- We were able to predict whether or not a star would have an exoplanet, based on basic information about the stars themselves, with a high degree of recall, precision & accuracy.
- Currently our biggest predictors are factors that affect how well we see stars, such as their absolute magnitude, luminosity index & distance.
- Use Kepler-labelled time series data to train deep learning algorithms to detect exoplanets based on light fluctuations in observed stars.
- Write something that is able to parse and accurately separate stellar types (as well as predict missing values) to test predictions made against more random data.
- Use additional data from the Open Exoplanet Catalogue to predict features of planets around stars & predicted stars.
- When more data is available, expand the predictor to include multi-planetary predictions.
See the full analysis in the Jupyter Notebooks or review our Presentation. For additional info, contact me here: Fennec C. Nightingale.
├── .ipynb_checkpoints
├── .virtual_documents
├── __pycache__
├── Scrubbed.csv
├── Images
│   ├── hist.png
│   ├── MilkyWay.png
│   ├── outerarmmid.png
│   ├── outerarmmiin.png
│   ├── outerarmout.png
│   ├── outerarmouter.png
│   ├── planetviolin.png
│   ├── poscoef.png
│   ├── negcoef.png
│   └── ROC.png
├── PDF
│   ├── Obtain & Scrub.pdf
│   ├── Modeling.pdf
│   └── Presentation.pdf
├── Obtain & Scrub.ipynb
└── Exoplanet Regression.ipynb