- Spreadsheets (Excel, CSV): store the data a business needs → humans analyse the data to make business decisions
- Relational DB (MySQL): a better way to organize the data → humans analyse the data to make business decisions
- Big Data (NoSQL): companies such as FB, Amazon and Twitter accumulate more and more data (e.g. user actions, user purchase history), much of it unstructured → machine learning, rather than humans alone, is needed to make business decisions
- A subset of AI: ML uses algorithms (computer programs) to learn patterns in data, then applies what it has learned to make predictions or classifications on similar data.
- Useful for tasks that are hard to describe to a computer as explicit steps, such as:
- How to ask a computer to classify cat/dog images or product reviews
- Normal algorithm: a set of instructions on how to accomplish a task. Start with given input + set of instructions → output
- ML algorithm: start with given input + given output → set of instructions between input and output
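
The contrast above can be sketched in a few lines. This is only an illustration under assumptions: the Celsius-to-Fahrenheit task, the function names and the sample values are all made up for the example, and a least-squares line fit stands in for "an ML algorithm".

```python
def normal_algorithm(celsius: float) -> float:
    """Normal algorithm: input + hand-written instructions -> output."""
    return celsius * 1.8 + 32.0


def learn_linear_rule(inputs, outputs):
    """ML-style: input + output -> instructions.

    Fits output = slope * input + intercept by ordinary least squares.
    """
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(outputs) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, outputs))
    var_x = sum((x - mean_x) ** 2 for x in inputs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical example data: known inputs with their known outputs.
celsius = [-10.0, 0.0, 10.0, 20.0, 30.0]
fahrenheit = [14.0, 32.0, 50.0, 68.0, 86.0]

slope, intercept = learn_linear_rule(celsius, fahrenheit)
print(f"learned rule: f = {slope:.2f} * c + {intercept:.2f}")
```

The first function is given the rule; the second recovers an equivalent rule from examples alone, which is the part ML automates.
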
- Supervised: data with labels
- Unsupervised: data without labels (loosely, like a CSV without column names)
- Clustering: the machine decides the clusters/groups
- Association rule learning: associates different things to predict what customers might buy in the future
- Reinforcement: teach the machine through trial and error (with rewards and penalties)
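
As a rough illustration of "the machine decides the clusters", here is a minimal 1-D k-means sketch. The algorithm choice (k-means), the function name, the spending values and the starting centroids are all assumptions for the example. The notes do not name a specific clustering method.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Tiny 1-D k-means.

    Repeats two steps: assign each value to its nearest centroid,
    then move each centroid to the mean of its assigned values.
    """
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled data: monthly spend per customer.
spend = [5, 7, 6, 48, 52, 50]
centroids, clusters = kmeans_1d(spend, centroids=[0.0, 100.0])
print(centroids)
print(clusters)
```

No labels are supplied anywhere. The two groups (low spenders and high spenders) come out of the data itself.
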
Data Analysis: analyse data to gain an understanding of your data

Data Science: running experiments on a set of data to find actionable insights within it
- Example: to build ML models
- What problem are we trying to solve?
- Supervised
- Unsupervised
- Classification
- Regression
- What kind of data do we have?
- What defines success for us? Knowing which metrics to pay attention to gives you an idea of how to evaluate your machine learning project.
- What features does your data have, and which can you use to build your model? Turning features → patterns
- Three main types of features:
- Categorical features: one or the other(s). For example, in our heart disease problem, the sex of the patient. Or for an online store, whether or not someone has made a purchase.
- Continuous (or numerical) features: a numerical value such as average heart rate or the number of times logged in.
- Derived features: features you create from the data. Often referred to as feature engineering.
- Feature engineering is how a subject matter expert takes their knowledge and encodes it into the data. You might combine the number of times logged in with timestamps to make a feature called time since last login. Or turn dates from numbers into “is a weekday (yes)” and “is a weekday (no)”.
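
The two derived features mentioned above could be computed roughly like this. The function name, the dictionary keys and the example dates are hypothetical. Only the ideas ("time since last login", "is a weekday") come from the notes.

```python
from datetime import datetime


def derive_features(last_login: datetime, now: datetime) -> dict:
    """Build derived features from a raw login timestamp."""
    days_since_last_login = (now - last_login).days
    # datetime.weekday(): Monday is 0 ... Sunday is 6
    is_weekday = last_login.weekday() < 5
    return {
        "days_since_last_login": days_since_last_login,
        "is_weekday": is_weekday,
    }


# Hypothetical record: the user last logged in on Saturday 2024-01-06.
features = derive_features(
    last_login=datetime(2024, 1, 6, 9, 30),
    now=datetime(2024, 1, 10, 9, 30),
)
print(features)
```

The raw timestamp is continuous, while the derived `is_weekday` value is categorical, so one raw column can feed both kinds of feature.
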
- Figure out the right models for your problem
- How can we improve, or what can we do better?
- (Input & output) Data + labels → classification, regression
- (Only input) Data → clustering
- (My problem is similar to others) Leverage other ML models (transfer learning)
- Punishing & rewarding the ML model by updating its score (reinforcement learning)
Classification | Regression | Recommendation |
---|---|---|
Accuracy | Mean Absolute Error (MAE) | Precision at K |
Precision | Mean Squared Error (MSE) | |
Recall | Root Mean Squared Error (RMSE) | |
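
For reference, the metrics in the table can be computed from scratch as below. This is a plain-Python sketch for binary classification. The function names and sample values are made up for the example, and in practice a library implementation would normally be used instead.

```python
import math


def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, RMSE)."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse, math.sqrt(mse)


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k


# Hypothetical predictions, for illustration only.
acc, prec, rec = classification_metrics(
    y_true=[1, 0, 1, 1, 0, 0, 1, 0],
    y_pred=[1, 0, 0, 1, 0, 1, 1, 0],
)
mae, mse, rmse = regression_metrics(
    y_true=[3.0, 5.0, 2.0, 7.0],
    y_pred=[2.0, 5.0, 4.0, 8.0],
)
p_at_3 = precision_at_k(["a", "b", "c", "d"], relevant={"a", "c", "e"}, k=3)
print(acc, prec, rec, mae, mse, rmse, p_at_3)
```

MSE penalises large errors more heavily than MAE because each error is squared. RMSE brings the result back to the units of the target.
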
- Numerical Features
- Categorical Features
- 3 sets: training, validation (model hyperparameter tuning and experimentation evaluation) & test sets (model testing and comparison)
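
One way to produce the three sets is a shuffled split, sketched below. The 70/15/15 proportions, the fixed seed and the function name are assumptions for the example. The notes do not prescribe particular ratios.

```python
import random


def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle rows and split them into (train, validation, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_train = round(len(rows) * train_frac)
    n_val = round(len(rows) * val_frac)
    train = rows[:n_train]
    validation = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]  # whatever is left over
    return train, validation, test


# Hypothetical dataset of 100 row ids.
train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Each row lands in exactly one set, so the test set stays unseen during training and tuning.
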
- Once you have chosen models that suit your problem → train the model
- Goal: minimise the time between experiments
- Start small and add complexity as needed (use a small part of your training set to start with)
- Start with the less complicated models first
- Happens on the validation or training sets
- Measure model performance on the test set
- Avoid overfitting & underfitting
- Great performance on the training data but poor performance on the test data means your model doesn’t generalize well (overfitting)
- Solution: try a simpler model, or make sure the test data is of the same style as the data your model is trained on
- Poor performance on the training data means the model hasn’t learned properly and is underfitting
- Solution: try a different model, improve the existing one through hyperparameter tuning, or collect more data.
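
The two failure modes above can be turned into a rough check on train and test scores. The function name and both thresholds (`good_enough`, `max_gap`) are arbitrary assumptions for illustration. Sensible values depend on the problem and the metric.

```python
def diagnose_fit(train_score, test_score, good_enough=0.8, max_gap=0.1):
    """Rough rule of thumb based on two scores where higher is better.

    - Low training score: the model has not learned the data (underfitting).
    - High training score but much lower test score: the model does not
      generalize (overfitting).
    """
    if train_score < good_enough:
        return "underfitting"
    if train_score - test_score > max_gap:
        return "overfitting"
    return "ok"


# Hypothetical accuracy scores.
print(diagnose_fit(0.99, 0.70))
print(diagnose_fit(0.60, 0.58))
print(diagnose_fit(0.90, 0.88))
```
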