llwang8/Data_Science_Portfolio

Data Science Portfolio

Python Programming

  • Explore U.S. Births [open dir] [ipynb]
    Use Python to analyze the distribution of U.S. births and compute statistics by year, month, week, day of the month, and day of the week.

  • Explore Gun Death in U.S. [open dir] [ipynb]
    Use Python to uncover demographic patterns among gun death victims by compiling the distribution of gun deaths and homicides by year, month, date, gender, race, education, and location.

Data Structures And Algorithms

  • Investigating Airplane Accidents [open dir] [ipynb]
    Use Python to experiment with different search algorithms for finding a specific value in the aviation data. Use a dictionary to count accidents by U.S. state, and to total fatalities and serious injuries for each unique month and year.
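
As a rough illustration of the dictionary-based counting described above, here is a minimal sketch. The record layout, field names, and sample values are hypothetical and are not taken from the project's aviation data file.

```python
from collections import defaultdict

# Hypothetical records; the field names and values are made up for illustration.
accidents = [
    {"state": "TX", "date": "03/2012", "fatalities": 2, "serious_injuries": 1},
    {"state": "CA", "date": "03/2012", "fatalities": 0, "serious_injuries": 3},
    {"state": "TX", "date": "07/2013", "fatalities": 1, "serious_injuries": 0},
]

def count_by_state(records):
    """Count how many accident records fall in each state."""
    counts = {}
    for rec in records:
        counts[rec["state"]] = counts.get(rec["state"], 0) + 1
    return counts

def injuries_by_month(records):
    """Total fatalities and serious injuries for each month/year key."""
    totals = defaultdict(lambda: {"fatalities": 0, "serious_injuries": 0})
    for rec in records:
        bucket = totals[rec["date"]]
        bucket["fatalities"] += rec["fatalities"]
        bucket["serious_injuries"] += rec["serious_injuries"]
    return dict(totals)

state_counts = count_by_state(accidents)
monthly_totals = injuries_by_month(accidents)
```

Each lookup and update is a constant-time dictionary operation on average, so both summaries are built in a single pass over the records.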

The Command Line

  • Working with Data Downloads [open dir]
    Use the unzip command to extract the two CSV files from an archive. Read in the CSV files in read.py. In exploration.py, use the pandas.pivot_table() function to aggregate school enrollment by gender and by school type (magnet schools or juvenile justice facilities). In enrollment.py, compute the percentage of enrollment that each race and gender makes up by dividing the respective sums by total enrollment.

  • Transforming Data With Python [open dir]
    Use separate Python scripts to transform data and uncover insights. In read.py, create a load_data function that reads in a CSV file and returns a dataframe. In count.py, process the headline strings to find the 100 words that appear most often in headlines. In domain.py, remove subdomains and sort the domains to find the ones submitted most often. In time.py, use the parse function from the dateutil library's parser module to get the hour each article was submitted and find when the most articles are submitted.
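
The headline word count described for count.py could look roughly like the sketch below. It uses only the standard library rather than a dataframe, and the function name `top_words`, the tokenizing pattern, and the sample headlines are assumptions made for illustration.

```python
import re
from collections import Counter

def top_words(headlines, n=100):
    """Return the n most frequent lowercase words across all headlines."""
    words = []
    for headline in headlines:
        # Lowercase first so "Python" and "python" count as the same word.
        words.extend(re.findall(r"[a-z0-9']+", headline.lower()))
    return Counter(words).most_common(n)

# Made-up headlines, not taken from the project's dataset.
sample_headlines = [
    "Show HN: A faster Python parser",
    "Why Python is slow",
    "A parser for everything",
]
top = top_words(sample_headlines, n=3)
```

In the project itself the headlines would come from the dataframe returned by load_data in read.py.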

Data Analysis with Pandas and Numpy

  • Analyze Thanksgiving Dinner [open dir] [ipynb]
    Use Pandas to discover region and income-based patterns in what Americans eat for Thanksgiving dinner.

  • Exploring Ebay Car Sales Data [open dir] [ipynb]
    Use Pandas to clean the data and analyze the used car listings to find the most common brand/model combinations and how average price varies across odometer bin groups.

Data Visualization

  • Visualizing Earnings Based On College Majors [open dir] [ipynb]
    Explore data on job outcomes by college major, using pivot tables to aggregate the data and matplotlib plots such as scatter plots, histograms, and bar plots for distribution and comparison analysis.

  • Visualizing The Gender Gap in College Degrees [open dir] [ipynb]
    Use matplotlib to produce a grid of line charts comparing percentage of degrees awarded by gender and degree categories.

Data Cleaning

  • Star Wars Surveys [open dir] [ipynb]
    Use Pandas to clean up survey data, which involves removing invalid rows, converting columns to different data types, and renaming columns. Then explore the dataset to compare the total viewership and ranking of the Star Wars episodes across demographic groups and between fans and non-fans.

  • Analyzing New York City High School Data [open dir] [ipynb]
    Explore relationships between SAT scores and demographic factors in New York City public schools. Use Pandas to clean up the dataset and prepare it for analysis, then use the dataframe's corr() function for correlation studies, mpl_toolkits.basemap to map factors that differentiate schools, and bar plots and scatter plots to display demographic differences in SAT performance.

Analyzing Data Using SQL

  • Analyzing CIA Factbook Data with SQLite and Python [open dir] [ipynb]
    Use Pandas to execute SQL queries that summarize population statistics, find outliers, examine population density, and explore the ratio of water to land area.

  • Answering Business Decision Using SQL [open dir] [ipynb]
    Use Pandas to execute SQL queries that help a business answer questions such as choosing purchase strategies and evaluating sales performance. Develop SQL queries using subqueries, multiple joins, set operations, aggregate functions, views, and CASE statements.

  • Designing and Creating a Database [open dir]
    Import CSV data into a database. Design a normalized schema for a large data set stored predominantly in a single table. Create tables that match the schema design. Migrate data from unnormalized tables into normalized tables.
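
The population density and water-to-land queries mentioned for the CIA Factbook project can be sketched as follows. This version runs the SQL through the standard-library sqlite3 module directly instead of through Pandas, and the table name, column names, and rows are assumptions invented for the example.

```python
import sqlite3

# In-memory database with a hypothetical schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (name TEXT, population INTEGER, area_land REAL, area_water REAL)"
)
rows = [
    ("Alpha", 1000000, 5000.0, 500.0),
    ("Beta", 200000, 40000.0, 10000.0),
    ("Gamma", 50000, 100.0, 0.0),
]
conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)", rows)

# Population density: people per unit of land area, densest first.
density_query = """
SELECT name, population / area_land AS density
FROM facts
WHERE area_land > 0
ORDER BY density DESC
"""
densities = conn.execute(density_query).fetchall()

# Ratio of water area to land area, highest first.
ratio_query = """
SELECT name, area_water / area_land AS water_to_land
FROM facts
WHERE area_land > 0
ORDER BY water_to_land DESC
"""
ratios = conn.execute(ratio_query).fetchall()
conn.close()
```

The `WHERE area_land > 0` filter avoids dividing by zero for entries with no recorded land area.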

Probability and Statistics in Python

  • Investigating Fandango Movie Ratings [open dir] [ipynb]
    Use kernel density plots to compare the distributions of the two movie rating samples. Compute summary statistics for both samples and plot them on a bar graph for comparison. Together these analyses help determine whether Fandango's movie ratings differ before and after Hickey's findings of discrepancies.

  • Finding the Best Markets to Advertise In [open dir] [ipynb]
    Examine whether FreeCodeCamp's survey is representative of the population of interest, then compile relative frequency distributions by country to compare new coders across countries. Investigate and remove outliers, then use box plots to illustrate the distribution of money spent per month in each country and pinpoint the best markets to advertise in for a fictional e-learning company.

  • Winning Jeopardy! [open dir] [ipynb]
    Use Python, Pandas, Numpy and Scipy to analyze a Jeopardy dataset and examine two claims: that studying past questions helps prepare for future episodes, and that answers can be hidden in the questions.
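
The relative frequency distribution by country used in the advertising-markets project can be sketched in plain Python as below. The function name and the sample responses are hypothetical; the project itself works on the survey dataframe.

```python
from collections import Counter

def relative_frequencies(values):
    """Map each distinct value to its share of all values, in percent."""
    counts = Counter(values)
    total = sum(counts.values())
    # most_common() orders the result from most to least frequent.
    return {value: 100 * count / total for value, count in counts.most_common()}

# Made-up survey responses for illustration only.
countries = ["US", "India", "US", "UK", "US", "India", "Canada", "US"]
freqs = relative_frequencies(countries)
```

Expressing counts as percentages makes countries with very different numbers of respondents directly comparable.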

Machine Learning

K Nearest Neighbors Regression and Cross Validation

  • Predicting Car Prices [open dir] [ipynb]
    Explore the workflow of machine learning using the k-nearest neighbors algorithm to predict a car's market price using its attributes. Experiment with a train/test model and a k-fold cross validation model.
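
A minimal, library-free sketch of k-nearest neighbors regression with k-fold cross validation is shown below. The single feature, the made-up car data, and all function names are assumptions for illustration; the project notebook may rely on a machine learning library instead.

```python
import math

def knn_predict(train, query, k):
    """Average the targets of the k training rows closest to `query`."""
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / len(nearest)

def rmse(train, test, k):
    """Root mean squared error of k-NN predictions on the test rows."""
    squared_errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in test]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def k_fold_rmse(data, n_folds, k):
    """Hold out each fold once as the test set and average the RMSE values."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(rmse(train, test, k))
    return sum(scores) / len(scores)

# Made-up data: one feature (horsepower) and a price proportional to it.
cars = [((float(hp),), 100.0 * hp) for hp in range(60, 180, 10)]

single_prediction = knn_predict(cars, (95.0,), k=2)
cv_score = k_fold_rmse(cars, n_folds=4, k=2)
```

With several features on different scales, the columns would normally be normalized first so that no single feature dominates the distance.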

Linear Regression

  • Predicting House Sale Prices [open dir] [ipynb]
    Explore approaches to model fitting, data cleaning, feature transformation and selection, and k-fold cross-validation to build the best-performing LinearRegression model for predicting house sale prices.

Machine Learning In Python

  • Predicting the Stock Market [open dir] [ipynb]
    Use Linear Regression and historical data on the price of the S&P 500 Index from 1950 to 2012 to make predictions about future prices for 2013-2015 and evaluate the predictions.
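
The train-on-earlier-years, predict-later-years workflow described above can be sketched as follows. This is a plain-Python least-squares fit standing in for a library regression model, and the rows, the single feature, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up rows: (year, feature, target). In a price-prediction setting the
# feature might be an average of earlier closing prices; here it is invented.
rows = [
    (2010, 10.0, 21.0),
    (2011, 12.0, 25.0),
    (2012, 15.0, 31.0),
    (2013, 16.0, 33.0),
    (2014, 18.0, 37.0),
]

# Split by time so the model never sees data from the period it predicts.
train = [r for r in rows if r[0] < 2013]
test = [r for r in rows if r[0] >= 2013]

slope, intercept = fit_line([r[1] for r in train], [r[2] for r in train])
predictions = [slope * r[1] + intercept for r in test]
mae = mean_absolute_error([r[2] for r in test], predictions)
```

Splitting by date rather than at random matters for time series, since a random split would let information from the future leak into training.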

Decision Tree

  • Predicting Bike Rentals [open dir] [ipynb]
    Predict the total number of bikes rented in a given hour using all of the other feature columns. Several machine learning models are trained and their performance is evaluated.

Data Engineering

Production Database - Postgres

Handling Large Datasets in Python

Build Data Pipeline

  • Hacker News Pipeline [open dir] [ipynb]
    From a JSON API, build a pipeline of processing tasks to filter, clean, aggregate, and summarize data, including a sequence of basic natural language processing steps. The goal is to find the top 100 keywords of Hacker News posts in 2014.
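
A pipeline of this shape can be sketched with chained generator functions, as below. The filter criteria, the stop word list, the field names, and the sample posts are all assumptions for illustration and do not reflect the project's actual task definitions.

```python
import re
from collections import Counter

# Tiny hypothetical stop word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "to", "of", "for", "in", "and", "is"}

def filter_stories(posts):
    """Keep popular posts that are not 'Ask HN' threads (assumed criteria)."""
    for post in posts:
        if post["points"] > 50 and not post["title"].startswith("Ask HN"):
            yield post

def clean_titles(posts):
    """Lowercase each title and strip punctuation."""
    for post in posts:
        yield re.sub(r"[^a-z0-9 ]", "", post["title"].lower())

def count_keywords(titles):
    """Count every non-stop-word across all titles."""
    counts = Counter()
    for title in titles:
        counts.update(w for w in title.split() if w not in STOP_WORDS)
    return counts

def run_pipeline(posts, top_n=100):
    """Chain the tasks: filter, clean, aggregate, then summarize."""
    return count_keywords(clean_titles(filter_stories(posts))).most_common(top_n)

# Made-up posts standing in for the JSON API response.
sample_posts = [
    {"title": "The Rust compiler is fast!", "points": 120},
    {"title": "Ask HN: Best Rust book?", "points": 300},
    {"title": "Rust for Python programmers", "points": 75},
    {"title": "A slow compiler", "points": 10},
]
top_keywords = run_pipeline(sample_posts, top_n=1)
```

Because each stage is a generator, posts flow through one at a time and only the final counts are held in memory.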

Spark and Map-Reduce
