Project | Link | Code |
---|---|---|
NYC Taxi | Visualization for NYCTaxi | Private |
AutoComplete | Visualization for AutoComplete | GitHub Link |
PageRank | Visualization for PageRank | GitHub Link |
Movie Recommendation | | GitHub Link |
- Cleaned and filtered 1.2 billion NYC taxi trip records (300 GB) and stored them in S3 buckets.
- Built and configured the data pipeline on AWS resources automatically with Terraform and Ansible.
- Designed a MapReduce program and deployed it on an Auto Scaling group of EC2 instances to generate statistics on the taxi trip data.
- Scheduled and tracked tasks on each node with SQS, where each task read its input by date range.
- Built a Docker image of the MapReduce program and pushed it to an ECS repository as an alternative way to scale out, then compared the latencies of the two approaches.
- Aggregated the reducer output, stored it in DynamoDB, and used it as the data source for the Bokeh visualization.
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
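
The date-range task scheduling described for the NYC Taxi pipeline could look roughly like the sketch below. It only builds the message bodies that a queue such as SQS would carry; the function names `split_date_range` and `build_messages`, the JSON message layout, and the 7-day chunk size are assumptions for illustration, not details taken from the project.

```python
import json
from datetime import date, timedelta


def split_date_range(start, end, days_per_task):
    """Yield (task_start, task_end) pairs that cover [start, end] inclusive."""
    current = start
    while current <= end:
        # Each task spans at most `days_per_task` days and never passes `end`.
        task_end = min(current + timedelta(days=days_per_task - 1), end)
        yield current, task_end
        current = task_end + timedelta(days=1)


def build_messages(start, end, days_per_task=7):
    """Serialize each date range as a JSON string (hypothetical message format)."""
    return [
        json.dumps({"start": s.isoformat(), "end": e.isoformat()})
        for s, e in split_date_range(start, end, days_per_task)
    ]


if __name__ == "__main__":
    for body in build_messages(date(2016, 1, 1), date(2016, 1, 10)):
        print(body)
```

Each worker node would then receive one message, read only the trips in that date range, and delete the message once its task finishes, so unfinished ranges remain visible in the queue.
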
- Constructed an N-gram library from a Wiki data set, built a language model by computing the N-gram probabilities (frequency data), and stored the results in MySQL.
- Fetched data from MySQL using jQuery and PHP to provide real-time autocompletion for a search engine in the web browser.
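
The N-gram language model behind the autocompletion can be illustrated with a small in-memory sketch. It counts which word follows each prefix and turns the counts into conditional probabilities; the function name `build_model`, the `top_k` cutoff, and the toy sentences are assumptions, and the original project ran this as a distributed job and stored the results in a database rather than in a dictionary.

```python
from collections import Counter, defaultdict


def build_model(sentences, n=2, top_k=3):
    """Map each (n-1)-word prefix to its top_k most probable next words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            prefix = " ".join(words[i:i + n - 1])
            counts[prefix][words[i + n - 1]] += 1

    model = {}
    for prefix, followers in counts.items():
        total = sum(followers.values())
        # Probability of a word given the prefix = count / total count for that prefix.
        model[prefix] = [(w, c / total) for w, c in followers.most_common(top_k)]
    return model


if __name__ == "__main__":
    corpus = ["new york city", "new york times", "new york city"]
    print(build_model(corpus, n=3)["new york"])
```

Looking up the typed prefix in the resulting table is all the front end needs to do at query time, which is what makes the completion feel instantaneous.
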
- Implemented the PageRank algorithm on Twitter social network data sets with 11.3 million user profiles and 85 million social relations.
- Modeled the relations between users as a transition matrix and computed each user's rank value on an EMR cluster, running 30 iterations until convergence.
- Visualized the social network graph from the resulting PageRank matrix with Node.js.
http://socialcomputing.asu.edu/datasets/Twitter
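
The iterative rank computation for the PageRank project can be sketched on a single machine as plain power iteration. The damping factor of 0.85 and the even redistribution of rank from users with no outgoing relations are common conventions assumed here; the source does not state which variant the project used, and the function name `pagerank` is hypothetical.

```python
def pagerank(edges, iterations=30, damping=0.85):
    """Return a rank per node after a fixed number of power iterations."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[u] for u in nodes if not out_links[u])
        new_rank = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            if out_links[u]:
                share = damping * rank[u] / len(out_links[u])
                for v in out_links[u]:
                    new_rank[v] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    follows = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
    for user, score in sorted(pagerank(follows).items()):
        print(user, round(score, 4))
```

Each pass of the inner loop corresponds to one multiplication of the rank vector by the transition matrix, which is the step a MapReduce job would distribute across the cluster.
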
- Built a user rating matrix and a co-occurrence matrix from the raw Netflix data set with 480k users, 17k movies, and over 100 million ratings.
- Combined the two matrices using an item-based collaborative filtering algorithm to compute the movie recommendation list, and deployed the jobs on an AWS Hadoop cluster.
http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a
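
The way a co-occurrence matrix and a user rating matrix combine in item-based collaborative filtering can be shown with a toy example. This is a minimal sketch under assumptions: co-occurrence is counted as the number of users who rated both movies, scores are left unnormalized, and the names `co_occurrence` and `recommend` plus the sample ratings are made up for illustration.

```python
from collections import defaultdict
from itertools import permutations


def co_occurrence(ratings):
    """Count, for every ordered movie pair, how many users rated both."""
    co = defaultdict(lambda: defaultdict(int))
    for movies in ratings.values():
        for a, b in permutations(movies, 2):
            co[a][b] += 1
    return co


def recommend(ratings, user, top_k=2):
    """Score unseen movies as sum(co-occurrence count * the user's rating)."""
    co = co_occurrence(ratings)
    seen = ratings[user]
    scores = defaultdict(float)
    for movie, rating in seen.items():
        for other, count in co[movie].items():
            if other not in seen:
                scores[other] += count * rating
    # Highest score first; ties broken by movie name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


if __name__ == "__main__":
    sample = {
        "u1": {"A": 5, "B": 3},
        "u2": {"A": 4, "B": 4, "C": 5},
        "u3": {"B": 2, "C": 4},
    }
    print(recommend(sample, "u1"))
```

The scoring loop is the same product of co-occurrence matrix and rating vector that the Hadoop jobs would compute at scale, one user column at a time.
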