Data Science Diaries

I graduated from Penn State in 2006 with a Bachelor's degree in English. A decade plus six months later, I finally figured out what I 'wanted to be' and went back to school. I graduated from Lipscomb University with a Master's degree in Data Science in 2018.

The Master's program was a whirlwind fifteen months and I didn't enter as well prepared as I probably should have. For example, I might have been able to spend more time on advanced topics if I hadn't been learning computer programming in general, R and Python specifically, as well as how databases work and how to query them with SQL.

However, I love a challenge and have never let a little under-preparedness stop me before. That may be an exaggeration, but nevertheless I finished on time and even had a job in machine learning before graduation.

Since then, I've been working to expand and improve my data science skills. I've worked my way through multiple courses on Coursera, Udemy, edX, and Udacity, and have a plethora of books on the topic. However, I've found all these published and polished sources to be missing one very important thing: the messy details.

Jose Portilla has a lot of really great courses on Udemy and repos on GitHub. I've learned a lot from his courses; he's one of my favorite instructors. But when I struck out on my own and tried to tackle new datasets, I found myself a bit adrift.

For example, in the visualization section of one of his courses, there were great examples of two features that were related to the target variable and predicted it with relatively high accuracy. How did he find those two features? A correlation-matrix-type plot that plots every feature against every other feature? That's all well and good on a dataset with fewer than ten features, but what do you do when you get a dataset with hundreds of features? I'd end up spending weeks just creating and analyzing plots.
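As a rough sketch of one way to narrow the field (the DataFrame `df`, the column names, and the planted signal below are hypothetical stand-ins, not from his course), you can rank every feature by its absolute correlation with the target and only plot the leaders:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a wide dataset: 100 numeric features plus a target.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 100)),
                  columns=[f"feat_{i}" for i in range(100)])
df["target"] = 2 * df["feat_3"] + rng.normal(size=200)  # plant one real signal

# Rank features by absolute correlation with the target instead of
# pairplotting all 100 of them against each other.
corr_with_target = (df.corr(numeric_only=True)["target"]
                      .drop("target")
                      .abs()
                      .sort_values(ascending=False))
print(corr_with_target.head(10))  # the ten most promising features to plot
```

Correlation only catches linear, pairwise relationships, so it's a first pass rather than an answer, but it turns weeks of plotting into one sorted list.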

Another example would be any tutorial with a Jupyter Notebook. The notebook ends up with around 30 cells, but when I'm working on my own, my cell execution counts climb past 100. Where did all the iterations go? How did they get to the pretty product that got posted?

I find myself wanting to answer these questions and record the process, and so Data Science Diaries was born.

I start at the beginning and work a single problem at a time. The entries can be read in sequence or used individually as quick references, but each entry assumes knowledge of what came before it. Hopefully, fledgling data scientists will find it useful to have a series with the nitty-gritty details, and I won't just be talking to myself in an echo chamber.
