Learn ML from first principles to fine-tuning AI
A structured, detailed, beginner-to-advanced guide covering every core concept — written simply so anyone can understand it deeply.
What is Machine Learning?
Start here. Understand what ML is, why it matters, and how it works at a high level.
How ML Actually Works
Representation, Evaluation, Optimization — the three building blocks of every algorithm.
Overfitting & Underfitting
The most critical problem in ML. Learn why your model fails and how to fix it.
Bias-Variance Tradeoff
The fundamental tension at the heart of every ML system, explained visually.
Feature Engineering
The most important skill in ML that no textbook teaches properly.
Fine-Tuning AI Models
How to take a pre-trained model like GPT and adapt it to your specific task.
Algorithm Guide
Every major ML algorithm explained — when to use it, how it works, trade-offs.
Neural Networks & Deep Learning
From perceptrons to transformers — the architecture behind modern AI.
Master Cheat Sheet
Every formula, rule, and key fact in one condensed reference sheet.
Learning Roadmap
Follow this path from zero to fine-tuning real AI models. Each phase builds on the last.
What is Machine Learning?
Understanding what ML is, where it came from, and why it's revolutionizing every industry.
The Simple Definition
Machine learning is a way of programming computers using data instead of explicit instructions. Instead of writing a program that says "if the email contains the phrase 'win money' then mark it as spam," you show the computer thousands of examples of spam and non-spam emails and let it figure out the rules itself.
Machine Learning is a field of computer science that gives computers the ability to learn from data without being explicitly programmed for every situation. The system improves its performance as it is exposed to more data over time.
The classic alternative — hand-coded rules — breaks down fast. Think about spam filters: spammers constantly change their tricks. A rules-based system needs constant manual updates. An ML system can re-train itself and adapt automatically.
A Real Example: Spam Filter
Let's make this concrete. A spam filter built with ML works like this:
- Collect training data
Gather thousands of emails labeled "spam" or "not spam" by humans.
- Extract features
Turn each email into numbers the computer can understand — e.g., does it contain "free money"? How many exclamation marks? Who is the sender?
- Train the model
Run a learning algorithm that finds patterns — combinations of features that predict spam vs not-spam.
- Evaluate and test
Check how well the model works on emails it has never seen before.
- Deploy and update
Release the model. As new spam patterns emerge, re-train on fresh data.
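As a sketch of how those five steps fit together, here is a minimal spam filter using scikit-learn's CountVectorizer and MultinomialNB. The six emails and their labels are made up for illustration, not real data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 1: labeled training data (tiny, made-up examples)
emails = [
    "win free money now", "claim your free prize", "free money waiting",
    "meeting at 3pm tomorrow", "project update attached", "lunch on friday",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam

# Step 2: extract features (word counts per email)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Step 3: train the model
model = MultinomialNB().fit(X, labels)

# Step 4: evaluate on an email the model has never seen
new = vectorizer.transform(["free prize money"])
print(model.predict(new))  # → [1] (spam)
```

A real filter would add steps 4 and 5 properly: a held-out test set, and periodic re-training on fresh labeled emails.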
Where ML Is Used Today
ML is already embedded in almost every digital product you use:
| Domain | ML Application | What it learns |
|---|---|---|
| Search Engines | Google Search ranking | Which pages are most relevant to your query |
| Social Media | TikTok/YouTube recommendations | What content you'll keep watching |
| Finance | Fraud detection, credit scoring | Patterns that indicate fraudulent transactions |
| Healthcare | Medical image diagnosis | What tumors and diseases look like in scans |
| Language | ChatGPT, Claude, Gemini | How language works and how to respond helpfully |
| Self-driving | Tesla Autopilot | How to navigate roads, detect objects |
ML vs Traditional Programming
Traditional Programming
- You write explicit rules
- Input + Rules → Output
- You must anticipate every case
- Brittle — breaks with new patterns
- Good for well-defined problems
Machine Learning
- You provide data + expected answers
- Input + Output → Rules (learned)
- Generalizes to unseen cases
- Adapts as new data arrives
- Good for pattern-heavy problems
Machine learning is not magic — it cannot get something from nothing. What it does is get more from less. Programming is like building from scratch. Learning is more like farming: you combine seeds (knowledge) with nutrients (data) to grow programs.
How ML Actually Works
Every ML algorithm is a combination of three core components: Representation, Evaluation, and Optimization.
There are thousands of ML algorithms. Choosing one seems overwhelming. But here's the secret: every single learning algorithm is just a combination of three components. Once you understand these, the whole landscape of ML makes sense.
The Three Components
Representation — "What form can the answer take?"
A classifier (or model) must be expressed in some formal language the computer can handle. Your choice of representation defines the hypothesis space — the set of all answers the model could possibly learn. If the answer isn't in this space, the model literally cannot learn it, no matter how much data you give it.
Evaluation — "How do we score a candidate answer?"
An evaluation function (also called an objective function, loss function, or scoring function) measures how good a particular model is. For example: accuracy, error rate, log-likelihood, or F1 score. The algorithm uses this to distinguish better models from worse ones.
Optimization — "How do we search for the best answer?"
Given the evaluation function, we need a method to search through all possible models and find the highest-scoring one. Common choices: gradient descent, greedy search, genetic algorithms, branch-and-bound. This determines both the quality and speed of learning.
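To see all three components in miniature, here is a toy sketch: the representation is a line y = w·x, the evaluation function is mean squared error, and the optimizer is plain gradient descent. The data and learning rate are made up:

```python
import numpy as np

# Toy data generated from the "true" rule y = 3x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

# REPRESENTATION: a line y = w*x (the hypothesis space is all slopes w)
w = 0.0

# OPTIMIZATION: gradient descent on the EVALUATION function (mean squared error)
lr = 0.05
for _ in range(200):
    grad = 2 * np.mean((w * x - y) * x)  # d(MSE)/dw
    w -= lr * grad

print(round(w, 3))  # → 3.0, the slope that minimizes the evaluation function
```

Swap any one component and you get a different algorithm: change the representation to a tree and the optimizer to greedy splitting, and you have a decision tree learner.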
The Full Landscape
| Representation | Evaluation | Optimization |
|---|---|---|
| K-Nearest Neighbor | Accuracy / Error rate | Greedy search |
| Linear / Logistic Regression | Squared error / Likelihood | Gradient descent |
| Decision Trees | Information gain, Gini | Greedy recursive split |
| Support Vector Machines | Margin | Quadratic programming |
| Neural Networks | Cross-entropy loss | Backprop + Adam/SGD |
| Naive Bayes | Posterior probability | Closed-form calculation |
| Random Forest | Gini / Entropy | Bagging + greedy splits |
The ML Pipeline — End to End
Decision Tree — A Concrete Example
Let's see all three components in action with a decision tree for spam detection:
```python
# REPRESENTATION: A tree of if/else questions
# EVALUATION: Information Gain (how much does a split reduce uncertainty?)
# OPTIMIZATION: Greedy search — pick the best split at each step
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Features: [has "free", has "win", exclamation_count, link_count]
X_train = np.array([
    [1, 1, 5, 3],  # spam
    [0, 0, 1, 0],  # not spam
    [1, 0, 3, 2],  # spam
    [0, 1, 0, 1],  # not spam
])
y_train = [1, 0, 1, 0]  # 1=spam, 0=not spam

# Train — the algorithm finds the best splits automatically
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)

# Predict new emails
new_email = [[1, 1, 4, 2]]  # suspicious features
print(model.predict(new_email))  # → [1] (spam)
```
Most ML textbooks are organized around the Representation column only. This makes it easy to miss that Evaluation and Optimization are equally important. Two models with the same representation but different optimization strategies can produce very different results.
Types of Machine Learning
The three main paradigms — and when to use each one.
1. Supervised Learning
You provide the model with labeled examples — both the inputs (features) and the correct outputs (labels). The model learns to map inputs to outputs. This is by far the most common type of ML.
Like a student learning with an answer key — they see the question AND the correct answer, and learn the pattern that connects them.
Two main subtypes:
| Task | Output Type | Example | Algorithms |
|---|---|---|---|
| Classification | Discrete class label | Spam vs Not Spam, Dog vs Cat | Logistic Regression, SVM, Decision Trees, Neural Nets |
| Regression | Continuous number | Predict house price, stock value | Linear Regression, Random Forest, Neural Nets |
2. Unsupervised Learning
You give the model unlabeled data — inputs only, no correct answers. The model must find structure, patterns, or groupings by itself.
Like organizing a pile of random photos with no instructions — you'd naturally group them by scene (beaches, cities, people). The model does the same with data.
| Task | What it does | Example | Algorithms |
|---|---|---|---|
| Clustering | Groups similar data points | Customer segmentation | K-Means, DBSCAN, Hierarchical |
| Dimensionality Reduction | Compress data while keeping structure | Visualizing high-dim data | PCA, t-SNE, UMAP |
| Anomaly Detection | Find unusual data points | Fraud detection | Isolation Forest, Autoencoders |
| Generative Models | Learn to generate new data | Image generation (GANs) | VAE, GAN, Diffusion Models |
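As a minimal illustration of clustering, here is K-Means finding two groups in a made-up 2D dataset. Note that no labels are provided; the algorithm discovers the structure on its own:

```python
import numpy as np
from sklearn.cluster import KMeans

# Six 2D points forming two obvious groups — but NO labels are given
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # the two groups receive two different cluster ids
```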
3. Reinforcement Learning
An agent learns by taking actions in an environment and receiving rewards or penalties. There's no labeled dataset — the model learns through trial and error, trying to maximize cumulative reward over time.
Like training a dog with treats. You don't show it every possible situation — you let it explore, reward good behavior, and punish bad. Over millions of tries, it figures out the optimal strategy.
Used in: Game playing (AlphaGo, chess engines), robotics, ad bidding systems, and crucially — training LLMs like ChatGPT (RLHF: Reinforcement Learning from Human Feedback).
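A toy sketch of the trial-and-error loop, using a two-armed bandit (a heavily simplified stand-in for full RL) with made-up payout probabilities and an epsilon-greedy strategy:

```python
import random

random.seed(0)

# Two slot machines ("actions") with hidden payout probabilities
true_p = [0.3, 0.8]
q = [0.0, 0.0]      # the agent's estimated value of each action
counts = [0, 0]

for step in range(2000):
    # Epsilon-greedy: explore 10% of the time, otherwise exploit
    if random.random() < 0.1:
        action = random.randrange(2)
    else:
        action = 0 if q[0] >= q[1] else 1
    # Environment returns a reward (1 = win, 0 = loss)
    reward = 1.0 if random.random() < true_p[action] else 0.0
    # Update the running-average estimate for that action
    counts[action] += 1
    q[action] += (reward - q[action]) / counts[action]

print(f"estimated values: {q[0]:.2f}, {q[1]:.2f}")
# The agent discovers that action 1 pays out more — no labels were ever given
```

Real RL adds states, sequences of actions, and delayed rewards, but the core loop of act, observe reward, update estimates is the same.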
Generalization — The True Goal
Why performing well on training data means nothing, and what actually matters in ML.
The only thing that matters is how well the model performs on data it has never seen before. This is called generalization. Getting 100% accuracy on your training data is easy — and meaningless.
Why Training Accuracy Is a Trap
Imagine you are studying for an exam. If your teacher gives you the exact exam questions in advance and you memorize all the answers, you'll get 100%. But if the real exam has slightly different questions, you'll fail — because you memorized, not understood.
This is exactly what happens when an ML model memorizes its training data. It's called overfitting — and it's the #1 problem in ML.
```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split data: 80% training, 20% testing
# CRITICAL: Never touch test data until final evaluation!
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)  # Train on training data ONLY

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.2%}")  # Might be 99% — don't celebrate
print(f"Test accuracy: {test_acc:.2%}")    # THIS is what matters
```
Cross-Validation — The Gold Standard
Holding out 20% for testing means that 20% of your data never contributes to training. Cross-validation solves this by rotating which portion is held out.
- Split data into K equal parts (folds)
Common choice: K=5 or K=10 folds.
- Train K times
Each time, use K-1 folds for training and 1 fold for validation.
- Average the results
Take the average performance across all K runs. This is your estimate of generalization.
```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=5)

# 5-fold CV: trains 5 times, tests on each held-out fold
scores = cross_val_score(model, X, y, cv=5)
print(f"CV scores: {scores}")        # e.g. [0.92, 0.88, 0.90, 0.91, 0.89]
print(f"Mean: {scores.mean():.3f}")  # e.g. 0.900
print(f"Std: {scores.std():.3f}")    # e.g. 0.013 (low = stable model)
```
Data Alone Is Never Enough
This is a deep and surprising result called the No Free Lunch Theorem (Wolpert, 1996): no algorithm can beat random guessing over all possible problems. Every learner must make assumptions — called inductive biases — about the world to generalize.
There is no universally best machine learning algorithm. An algorithm that works well on one problem must be making assumptions that fail on another. The best algorithm always depends on the problem — which is why choosing your model based on domain knowledge matters.
Overfitting & Underfitting
The two ways your model can fail — and a toolkit for fixing both.
What is Overfitting?
Overfitting happens when a model learns the training data too well — including its noise and random quirks — and fails to generalize to new data.
Training accuracy = 99%. Test accuracy = 62%. Your model has memorized the training set, not learned the underlying pattern. It's useless in the real world.
Think of a student who memorizes every exam from the past 10 years word-for-word, but has no real understanding. They'll ace those exact exams but fail any new questions.
What is Underfitting?
Underfitting is the opposite: the model is too simple to capture the real pattern in the data. Both training and test accuracy are poor.
Example: trying to fit a curved relationship with a straight line. No matter how much data you have, a line can't capture the curve.
The Visual Intuition
| Problem | Train Error | Test Error | Cause | Fix |
|---|---|---|---|---|
| Just Right ✓ | Low | Low | Balanced model complexity | — |
| Overfitting | Very Low | High | Model too complex / too little data | Regularize, get more data, simplify model |
| Underfitting | High | High | Model too simple | More complex model, more features |
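The table above can be reproduced numerically. This sketch fits polynomials of degree 1, 2, and 7 to eight noisy points drawn from a quadratic; all the numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-3, 3, 8)
y_train = x_train**2 + rng.normal(0, 1.0, size=8)  # curved pattern + noise
x_test = np.linspace(-2.9, 2.9, 50)
y_test = x_test**2                                  # noise-free ground truth

def errors(degree):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 2, 7):
    train_mse, test_mse = errors(degree)
    print(f"degree {degree}: train={train_mse:.3f}  test={test_mse:.3f}")

# degree 1 underfits: high error on BOTH sets
# degree 2 is about right: both errors stay low
# degree 7 interpolates the noise: train error ~0, test error typically worse
```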
Techniques to Fight Overfitting
1. Regularization
Add a penalty to the loss function for complexity. Forces the model to stay simple.
```python
from sklearn.linear_model import Ridge, Lasso

# Ridge (L2): penalizes large weights
# Loss = MSE + λ * Σ(weights²)
ridge = Ridge(alpha=1.0)  # alpha = λ, controls regularization strength

# Lasso (L1): pushes some weights to exactly zero (feature selection)
# Loss = MSE + λ * Σ|weights|
lasso = Lasso(alpha=0.1)

# Rule of thumb:
# - Ridge: when you think all features matter but need smaller weights
# - Lasso: when you think many features are irrelevant (automatic selection)
```
2. Dropout (for Neural Networks)
During training, randomly "switch off" a fraction of neurons. This prevents any neuron from becoming too dominant and forces the network to learn redundant representations.
```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # 50% of neurons randomly off during training
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.3),   # 30% dropout on second layer
    nn.Linear(128, 10)
)
```
3. Early Stopping
Monitor validation loss during training. Stop training when validation loss starts increasing — even if training loss keeps dropping.
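A minimal sketch of the early-stopping logic, using a made-up sequence of validation losses and a "patience" counter:

```python
best_val, patience, bad_epochs = float("inf"), 3, 0

# Hypothetical validation losses per epoch: improves, then degrades
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        # in real training you would checkpoint the model weights here
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"stopping at epoch {epoch}, best val loss {best_val}")
            break
# → stopping at epoch 6, best val loss 0.5
```

The patience parameter avoids stopping on a single noisy epoch; you restore the checkpointed weights from the best epoch, not the last one.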
4. Get More Data
Overfitting is fundamentally a problem of having too little data relative to model complexity. More data is almost always the best fix, when available.
5. Data Augmentation
Artificially expand your training set by creating variations of existing examples. For images: rotate, flip, crop, adjust brightness. For text: paraphrase, back-translate.
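A toy sketch with NumPy, using a hypothetical 3×3 "image" to stand in for a real photo:

```python
import numpy as np

# A hypothetical 3x3 "image" standing in for a real photo
img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

augmented = [
    img,                      # original
    np.fliplr(img),           # horizontal flip
    np.flipud(img),           # vertical flip
    np.rot90(img),            # 90-degree rotation
    np.clip(img + 1, 0, 9),   # brightness shift
]
print(len(augmented))  # → 5 training examples from 1 original
```

Each variant keeps the same label, so the model learns that the class is invariant to these transformations.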
The Bias-Variance Tradeoff
The deep mathematical reason behind overfitting and underfitting — and why you can never eliminate both simultaneously.
Decomposing Error
Any ML model's total error on new data can be split into three parts:
Understanding each term is key to diagnosing and fixing your model's problems.
High Bias
Model is too simple. It consistently makes the same types of mistakes, no matter what data you give it. This is underfitting.
Example: fitting a straight line to data that follows a curve
High Variance
Model is too complex. It changes drastically with small changes in training data. This is overfitting.
Example: a deep decision tree that learns every noise point
High Both
Very complex model that still misses the real pattern. Common with wrong architecture choices.
Example: polynomial regression with wrong degree on messy data
Low Both ✓
The ideal model. Complex enough to capture the real pattern, but not so complex it memorizes noise.
This is what you're always aiming for
The Dart Board Analogy (from Domingos)
Imagine your model is a dart thrower aiming at a target (the true answer). You run many trials with different training sets and observe where the darts land:
| Scenario | Darts Clustered? | Darts on Target? | Interpretation |
|---|---|---|---|
| Low Bias, Low Variance | ✓ Yes (tight cluster) | ✓ Yes (near center) | Perfect — this is the goal |
| High Bias, Low Variance | ✓ Yes (tight cluster) | ✗ No (off-center) | Consistently wrong — model too simple |
| Low Bias, High Variance | ✗ No (spread out) | ∼ On target only on average | Inconsistent — model too complex |
| High Bias, High Variance | ✗ No (spread out) | ✗ No (off-center) | Worst case |
Surprising Result: Strong Wrong Assumptions Can Beat Weak True Ones
Here is a counterintuitive but crucial insight from the paper by Domingos (2012): naive Bayes, which assumes all features are completely independent (which is almost never true), can outperform a rule learner on problems where the truth is a set of rules — because naive Bayes doesn't overfit as badly.
With limited data, a learner with strong (even wrong) assumptions can outperform one with correct but weak assumptions — because strong assumptions reduce variance at the cost of introducing some bias, and with small datasets that tradeoff is often worth it. This is why naive Bayes works surprisingly well in practice.
How to Control the Tradeoff
| Action | Effect on Bias | Effect on Variance |
|---|---|---|
| Increase model complexity | ↓ Decreases | ↑ Increases |
| Add more training data | ≈ Same | ↓ Decreases |
| Add regularization (L1/L2) | ↑ May increase | ↓ Decreases |
| Feature selection | ↑ May increase | ↓ Decreases |
| Use ensemble methods | ↓ Decreases | ↓ Decreases (bagging) |
| Reduce number of features | ↑ May increase | ↓ Decreases |
The Curse of Dimensionality
Why more features can hurt your model — and how to fight back.
The Curse of Dimensionality (coined by Bellman, 1961) refers to the exponential growth in problems that occur when working with high-dimensional data. As features (dimensions) increase, the data becomes increasingly sparse, distances lose meaning, and models become harder to train.
The Exponential Sparsity Problem
Imagine you have 1,000 training examples in a 1D space (one feature). They fill the space reasonably well. Now add a second dimension: you need roughly 1,000² = 1,000,000 examples to fill the 2D space equally well. By 10 dimensions: you'd need 10^30 examples. For 100 dimensions: you'd need more examples than atoms in the universe.
Distances Stop Working
Most ML algorithms (K-NN, SVM, clustering) rely on the idea that similar examples are nearby in feature space. In high dimensions, this completely breaks down:
- All points become approximately equidistant from each other
- The "nearest neighbor" is no more similar than the "farthest" point
- K-NN becomes effectively random in very high dimensions
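This distance concentration is easy to demonstrate with random points; the dimensions and point counts below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = {}

for dim in (2, 10_000):
    points = rng.random((500, dim))   # 500 random points in the unit cube
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    ratios[dim] = dists.min() / dists.max()
    print(f"dim={dim}: nearest/farthest distance ratio = {ratios[dim]:.2f}")

# In 2D the nearest point is far closer than the farthest (small ratio).
# In 10,000D the ratio approaches 1: "nearest" barely means anything.
```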
The "Blessing of Non-Uniformity"
Fortunately, real-world data has a saving grace: it doesn't actually fill all of high-dimensional space. Real data tends to lie on a lower-dimensional manifold — a structure with far fewer effective dimensions than the raw feature count.
An image of a handwritten digit has 784 pixels (784 dimensions). But the "space of all valid digit images" is much smaller — a small manifold within that 784D space. This is why K-NN works well on MNIST despite its theoretical problems with high dimensions.
Practical Solutions
| Technique | How it helps | When to use |
|---|---|---|
| PCA Principal Component Analysis | Projects data onto fewer dimensions that explain most variance | Preprocessing step for many algorithms |
| t-SNE / UMAP | Non-linear reduction, great for visualization | Visualizing high-dim data in 2D/3D |
| Feature Selection | Remove irrelevant features using statistical tests or model importance | When you have domain knowledge or too many features |
| Regularization | Penalizes models that use many features (Lasso forces weights to zero) | Built into your model training process |
| Deep Learning | Automatically learns compact low-dimensional representations (embeddings) | Large datasets, complex patterns |
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Always scale first — PCA is sensitive to feature scales
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce dimensionality while keeping 95% of variance
pca = PCA(n_components=0.95)  # Keep 95% explained variance
X_reduced = pca.fit_transform(X_scaled)

print(f"Original: {X.shape[1]} features")
print(f"Reduced: {X_reduced.shape[1]} features")  # Much smaller!
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.1%}")
```
Feature Engineering — The Art of ML
The single most impactful thing you can do to improve a model. More important than choosing the right algorithm.
"The most important factor [in ML project success] is the features used. Learning is easy if you have many independent features that each correlate well with the class." — Pedro Domingos. Feature engineering is where intuition, creativity, and domain expertise are as important as technical skill.
What is Feature Engineering?
Raw data is rarely in a form that algorithms can directly use. Feature engineering is the process of transforming raw data into informative, discriminative inputs for your model. It includes:
- Feature creation — constructing new features from raw data
- Feature transformation — scaling, encoding, normalizing
- Feature selection — removing irrelevant or redundant features
- Feature extraction — learning compact representations (e.g., PCA, embeddings)
Common Transformations with Code
Numerical Features
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df = pd.read_csv('houses.csv')

# 1. SCALING — essential for distance-based and gradient methods
# StandardScaler: mean=0, std=1 — use for most algorithms
scaler = StandardScaler()
df['price_scaled'] = scaler.fit_transform(df[['price']])

# MinMaxScaler: range [0,1] — use for neural networks, image data
mm = MinMaxScaler()
df['area_norm'] = mm.fit_transform(df[['area']])

# 2. LOG TRANSFORM — fix skewed distributions
df['log_income'] = np.log1p(df['income'])  # log1p handles zeros

# 3. BINNING — turn continuous into categorical
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 18, 35, 65, 100],
    labels=['minor', 'young', 'middle', 'senior']
)

# 4. INTERACTION FEATURES — combine existing features
df['price_per_sqft'] = df['price'] / df['sqft']
df['rooms_per_floor'] = df['rooms'] / df['floors']
```
Categorical Features
```python
from sklearn.preprocessing import LabelEncoder

# 1. ORDINAL ENCODING — for ordinal data (order matters)
# e.g., "low" < "medium" < "high"
le = LabelEncoder()
df['quality_enc'] = le.fit_transform(df['quality'])
# NOTE: LabelEncoder assigns codes alphabetically; when the true order
# matters, map explicitly instead:
# df['quality_enc'] = df['quality'].map({'low': 0, 'medium': 1, 'high': 2})

# 2. ONE-HOT ENCODING — for nominal data (order doesn't matter)
# e.g., "red", "blue", "green" — turns each into its own binary column
df_encoded = pd.get_dummies(df, columns=['color', 'city'])

# 3. TARGET ENCODING — for high-cardinality categories
# Replace each category with the mean of the target for that category
target_means = df.groupby('city')['price'].mean()
df['city_price_mean'] = df['city'].map(target_means)
# WARNING: Target encoding can cause leakage on small datasets
# Use cross-validation target encoding in production
```
Feature Selection Methods
| Method | How it works | Best for |
|---|---|---|
| Correlation Filter | Remove features with correlation > 0.95 to another feature | Quick preprocessing |
| Chi-Square Test | Statistical test: does this feature have a relationship with the target? | Classification with categorical features |
| Feature Importance (Trees) | Random Forest / XGBoost tell you how much each feature reduces impurity | Any prediction problem |
| Lasso Regression | L1 penalty drives unimportant weights to exactly 0 | Linear models |
| Recursive Feature Elimination | Iteratively remove least important features and re-train | When you need a specific number of features |
The Algorithm Guide
Every major ML algorithm — explained clearly, with when to use each one.
Always try the simplest algorithm first. Naive Bayes before logistic regression. K-NN before SVMs. Simpler models are faster, more interpretable, and often surprisingly competitive — especially with limited data.
Neural Networks & Deep Learning
From the single neuron to the Transformer architecture powering ChatGPT.
The Neuron — Building Block
A single artificial neuron does three things: takes weighted inputs, sums them up, and applies an activation function to produce an output.
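Those three steps in code, with made-up weights and a ReLU activation:

```python
import numpy as np

def relu(z):
    return max(0.0, z)

# A single neuron: weighted sum of inputs + bias, then activation
inputs  = np.array([0.5, -1.0, 2.0])
weights = np.array([0.8,  0.2, 0.1])  # learned during training
bias = 0.1

z = np.dot(weights, inputs) + bias    # 0.4 - 0.2 + 0.2 + 0.1 = 0.5
output = relu(z)
print(output)  # → 0.5
```

A full network is just many of these neurons, stacked in layers, with the weights and biases set by training rather than by hand.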
Activation Functions — Why They Matter
Without activation functions, any neural network — no matter how deep — collapses to a simple linear function. Activations introduce non-linearity, which is what gives deep networks their power.
| Function | Formula | Range | When to Use |
|---|---|---|---|
| ReLU | max(0, x) | [0, ∞) | Hidden layers — default choice, fast, avoids vanishing gradients |
| Sigmoid | 1/(1+e^(-x)) | (0, 1) | Binary classification output layer only |
| Softmax | e^xᵢ / Σe^xⱼ | (0,1), sums to 1 | Multi-class classification output layer |
| Tanh | (e^x - e^-x)/(e^x + e^-x) | (-1, 1) | Recurrent networks, when zero-centered output matters |
| GELU | x·Φ(x) | (-∞, ∞) | Transformers (GPT, BERT use this) |
Backpropagation — How Neural Nets Learn
Backpropagation uses the chain rule of calculus to efficiently compute how much each weight contributed to the error, then adjusts them in the direction that reduces the error.
- Forward Pass
Feed input through the network, layer by layer, to get a prediction.
- Compute Loss
Compare prediction to actual answer using a loss function (e.g., cross-entropy for classification).
- Backward Pass
Propagate the error gradient backward through the network using the chain rule. Compute ∂Loss/∂w for every weight.
- Update Weights
Adjust each weight by a small step in the direction that reduces the loss: w ← w - η·(∂Loss/∂w), where η is the learning rate.
- Repeat
Do this for thousands of mini-batches over many epochs until loss converges.
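The five steps above can be sketched with a tiny two-layer network trained by hand-written backpropagation in NumPy. The data, architecture, and hyperparameters are all toy choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))
y = X[:, :1] * X[:, 1:]        # target: the product of the two inputs

# One hidden layer of 8 tanh units, single numeric output
W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)
lr = 0.05

for step in range(500):
    # 1. Forward pass
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    # 2. Compute loss (mean squared error)
    loss = np.mean((pred - y) ** 2)
    if step == 0:
        first_loss = loss
    # 3. Backward pass: chain rule, layer by layer
    dpred = 2 * (pred - y) / len(X)
    dW2 = h.T @ dpred
    db2 = dpred.sum(axis=0)
    dh = dpred @ W2.T
    dz = dh * (1 - h ** 2)      # derivative of tanh
    dW1 = X.T @ dz
    db1 = dz.sum(axis=0)
    # 4. Update each weight a small step against its gradient
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"loss: {first_loss:.3f} -> {loss:.3f}")  # loss drops as the net learns
```

Frameworks like PyTorch automate the backward pass (autograd), but this is exactly what they compute under the hood.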
Architecture Types
| Architecture | Input Type | Key Idea | Use Case |
|---|---|---|---|
| MLP (Fully Connected) | Tabular, fixed-size | Each neuron connects to all neurons in next layer | Tabular data, classification |
| CNN (Convolutional) | Images, grids | Local filters + shared weights = spatial invariance | Image classification, object detection |
| RNN / LSTM | Sequences | Hidden state carries information across time steps | Time series, older NLP tasks |
| Transformer | Sequences | Self-attention: every position attends to every other | NLP (GPT, BERT), vision, multimodal |
| Autoencoder | Any | Compress to bottleneck then reconstruct (encoder + decoder) | Anomaly detection, representation learning |
| GAN | Noise | Generator vs Discriminator adversarial training | Image generation, data augmentation |
The Transformer — How Modern AI Works
The Transformer (Vaswani et al., 2017, "Attention Is All You Need") is the architecture behind GPT, BERT, Claude, Gemini, and nearly all modern AI. Its key innovation is self-attention.
For each word in a sentence, self-attention asks: "which other words in this sentence are most relevant to understanding THIS word?" It computes a weighted sum of all other words' representations, where the weights reflect how relevant each word is. This happens in parallel for every position simultaneously — unlike RNNs which process sequentially.
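A minimal NumPy sketch of scaled dot-product attention, using random toy embeddings and random projection matrices in place of learned ones:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# 4 tokens, each represented by a d=8 vector (toy embeddings)
rng = np.random.default_rng(0)
seq_len, d = 4, 8
X = rng.normal(size=(seq_len, d))

# Projection matrices (random here; learned in a real model)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Scaled dot-product attention: every position attends to every other
scores = Q @ K.T / np.sqrt(d)   # (4, 4) relevance matrix
weights = softmax(scores)       # each row sums to 1
output = weights @ V            # weighted sum of value vectors

print(weights.shape, output.shape)  # → (4, 4) (4, 8)
```

A real Transformer runs many such attention "heads" in parallel, adds positional information, and stacks dozens of these layers.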
Ensembles & Boosting
Why combining many diverse models usually beats one strong model.
In the Netflix Prize, the winning and runner-up submissions were both stacked ensembles of over 100 learners. Combining them improved results even further. The lesson: learn many models, not just one.
Three Main Ensemble Methods
1. Bagging (Bootstrap Aggregating)
Train many copies of the same model on different random samples of the training data (with replacement). Combine by voting (classification) or averaging (regression). Dramatically reduces variance with almost no increase in bias. Random Forest is bagging applied to decision trees.
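Bagging can be sketched by hand in a few lines, here with scikit-learn decision trees on a synthetic dataset (the sample and tree counts are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
rng = np.random.default_rng(0)

# Train 25 trees, each on its own bootstrap sample (drawn WITH replacement)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Combine by majority vote across all trees
votes = np.array([tree.predict(X) for tree in trees])   # shape (25, 200)
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
print(f"ensemble accuracy on training set: {(ensemble_pred == y).mean():.2f}")
```

Random Forest does exactly this, plus one extra trick: each split also considers only a random subset of features, which decorrelates the trees further.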
2. Boosting
Train models sequentially. Each new model focuses on the mistakes of the previous ones. Training examples that were classified incorrectly get higher weights. Final prediction is a weighted vote of all models. Reduces both bias and variance. XGBoost and LightGBM are boosting algorithms.
3. Stacking
Train several different base models (e.g., Random Forest + SVM + Logistic Regression). Then train a "meta-learner" on the predictions of the base models. The meta-learner learns how to best combine their predictions.
```python
import xgboost as xgb
from sklearn.model_selection import cross_val_score

model = xgb.XGBClassifier(
    n_estimators=500,       # number of trees
    max_depth=6,            # tree depth (controls complexity)
    learning_rate=0.05,     # step size (lower = more trees needed)
    subsample=0.8,          # fraction of data per tree (bagging)
    colsample_bytree=0.8,   # fraction of features per tree
    reg_alpha=0.1,          # L1 regularization
    reg_lambda=1.0,         # L2 regularization
    eval_metric='logloss'
)

# Evaluate with cross-validation
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"Mean CV Accuracy: {scores.mean():.3f}")
```
Fine-Tuning AI Models
How to take a massive pre-trained model like GPT-2 or LLaMA and adapt it to your specific task — with far less data and compute than training from scratch.
What is Fine-Tuning?
A pre-trained model like GPT-2 was trained on hundreds of billions of tokens of text. It has learned general language understanding — grammar, facts, reasoning patterns. Fine-tuning takes this pre-trained model and continues training it on a much smaller, task-specific dataset to adapt it to your specific use case.
Training GPT-3 from scratch cost ~$12 million in compute. Fine-tuning can cost as little as a few dollars on a single GPU. You leverage billions of dollars of existing training and adapt just the last mile for your specific task.
Transfer Learning — The Core Idea
Pre-Training (Already Done)
A large model is trained on massive data to learn general representations. For language models, this is next-token prediction on vast internet text. The model learns grammar, facts, common sense, and reasoning. This creates powerful internal representations.
Fine-Tuning (You Do This)
You take the pre-trained model and continue training it on your task-specific dataset. The model's weights shift slightly to adapt to your domain. All the general knowledge is preserved — you're just specializing it.
Deployment
The fine-tuned model performs much better on your task than the base model — and far better than training from scratch on your small dataset alone.
Full Fine-Tuning vs Parameter-Efficient Methods
| Method | What it updates | VRAM needed | Data needed | Best for |
|---|---|---|---|---|
| Full Fine-Tuning | ALL model weights | Very high (40B+ model → 80GB+) | Thousands of examples | When you have serious compute budget |
| LoRA | Low-rank adapter matrices only | Low (can do 7B on 16GB) | Hundreds of examples | Most practical fine-tuning today |
| QLoRA | LoRA adapters on quantized model | Very low (7B on 8GB!) | Hundreds of examples | Fine-tuning on consumer hardware |
| Prompt Tuning | Only a small set of "soft prompt" tokens | Minimal | Small | Light task adaptation |
| Adapter Layers | Small inserted adapter modules | Low | Medium | Multi-task models |
LoRA Deep Dive — The Modern Standard
LoRA (Low-Rank Adaptation) is the most important fine-tuning technique to understand. Here's how it works:
In a standard neural network layer, the weight matrix W has dimensions d×k, meaning d×k trainable parameters. During full fine-tuning, ALL of these change. LoRA instead adds two small matrices A (d×r) and B (r×k) where r is small (like 4, 8, or 16). The modified layer becomes W' = W + A·B: the pre-trained W stays frozen, and only A and B are trained, which is r·(d+k) parameters instead of d·k. For a 4096×4096 layer with r=8, that is roughly 0.4% of the original parameter count.
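A NumPy sketch of the parameter arithmetic, with an assumed 1024×1024 layer and rank r=16:

```python
import numpy as np

d, k, r = 1024, 1024, 16

full_params = d * k              # full fine-tuning updates every entry of W
lora_params = d * r + r * k      # LoRA trains only A (d x r) and B (r x k)

W = np.zeros((d, k))             # stand-in for frozen pre-trained weights
A = np.random.randn(d, r) * 0.01
B = np.zeros((r, k))             # B starts at zero, so A @ B is initially zero
W_adapted = W + A @ B            # a rank-r update with the same shape as W

print(f"full fine-tuning: {full_params:,} trainable params")  # 1,048,576
print(f"LoRA (r=16):      {lora_params:,} trainable params")  # 32,768 (~3%)
```

Starting B at zero means the adapted model behaves exactly like the base model at step 0, and only drifts as A and B are trained.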
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, TaskType
import torch

# Step 1: Load the base model in 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Step 2: Configure LoRA adapters
lora_config = LoraConfig(
    r=16,                     # rank — higher = more capacity, more params
    lora_alpha=32,            # scaling factor (usually 2x rank)
    target_modules=["q_proj", "v_proj"],  # which layers to apply LoRA to
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Step 3: Wrap model with LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → trainable params: 4,194,304 || all params: 6,742,609,920 || trainable: 0.06%
```
Preparing Your Fine-Tuning Dataset
Dataset quality matters far more than quantity for fine-tuning. Here's what a good fine-tuning dataset looks like for an instruction-following model:
// Each line is one training example in instruction format
{"instruction": "Explain what gradient descent is in simple terms.", "input": "", "output": "Gradient descent is like hiking down a mountain in fog..."}
{"instruction": "Convert this SQL query to Python pandas code.", "input": "SELECT name, age FROM users WHERE age > 18", "output": "df[df['age'] > 18][['name', 'age']]"}

// For chat fine-tuning (preferred for conversational models):
{"messages": [
  {"role": "system", "content": "You are a helpful ML tutor."},
  {"role": "user", "content": "What is overfitting?"},
  {"role": "assistant", "content": "Overfitting is when a model..."}
]}
• Diversity — cover all the cases you care about
• Quality over quantity — 500 excellent examples beat 5,000 mediocre ones
• Format consistency — always use the same template
• No duplicates — deduplicate your dataset
• Output quality — bad outputs teach bad behavior. Review them manually.
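A minimal sketch of enforcing that checklist before training, assuming the instruction-format JSONL shown earlier (the field names come from that example; `validate_dataset` is a hypothetical helper, not a library function):

```python
import json

def validate_dataset(path):
    """Load an instruction-format JSONL file, enforcing format
    consistency, deduplication, and non-empty outputs."""
    seen = set()
    examples = []
    with open(path) as f:
        for line_no, line in enumerate(f, 1):
            ex = json.loads(line)
            # Format consistency: every example uses the same keys
            assert set(ex) == {"instruction", "input", "output"}, f"bad keys on line {line_no}"
            # No duplicates: drop repeated prompts
            key = (ex["instruction"], ex["input"])
            if key in seen:
                continue
            seen.add(key)
            # Output quality: empty outputs teach the model to say nothing
            assert ex["output"].strip(), f"empty output on line {line_no}"
            examples.append(ex)
    return examples
```

Diversity and output quality still need human review; a script can only catch the mechanical problems.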
RLHF — How ChatGPT Was Trained
RLHF (Reinforcement Learning from Human Feedback) is how models like ChatGPT, Claude, and Gemini are aligned to be helpful, harmless, and honest. It happens in 3 phases:
1. Supervised Fine-Tuning (SFT)
Fine-tune the base model on high-quality demonstrations of the desired behavior. Human contractors write ideal responses to thousands of prompts.
2. Reward Model Training
Collect human preference data: show the same prompt to the model, get two different responses, and have humans rank which is better. Train a "reward model" to predict human preference scores.
3. RL Fine-Tuning with PPO
Use the reward model as a scoring signal and apply reinforcement learning (the PPO algorithm) to train the language model to generate responses with higher reward scores. A KL-divergence penalty keeps the model from drifting too far from the original SFT model.
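The reward signal in phase 3 can be illustrated with toy numbers. Everything below is invented (the scores, the beta value); it only shows how the KL-style penalty shrinks the reward as the policy drifts from the SFT model:

```python
def rlhf_reward(reward_model_score, policy_logprob, sft_logprob, beta=0.1):
    """Reward-model score minus a KL-style penalty that grows as the
    policy's log-probabilities drift from the SFT model's."""
    kl_penalty = policy_logprob - sft_logprob  # per-token estimate of log(pi / pi_sft)
    return reward_model_score - beta * kl_penalty

# Same reward-model score, different amounts of drift
close   = rlhf_reward(2.0, policy_logprob=-1.0, sft_logprob=-1.1)
drifted = rlhf_reward(2.0, policy_logprob=-1.0, sft_logprob=-5.0)
print(round(close, 2))    # 1.99 -> small drift, tiny penalty
print(round(drifted, 2))  # 1.6  -> large drift, large penalty
```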
Practical Hyperparameters for Fine-Tuning
| Parameter | Typical Range | Notes |
|---|---|---|
| Learning Rate | 1e-5 to 5e-4 | Much lower than pre-training. Start with 2e-4 for LoRA. |
| Batch Size | 4–32 (with gradient accumulation) | Effective batch = batch_size × gradient_accum_steps |
| Epochs | 1–5 | Very few epochs needed. More often causes overfitting. |
| LoRA Rank (r) | 4, 8, 16, 64 | Higher = more capacity. 8-16 is sweet spot for most tasks. |
| LoRA Alpha | = 2 × r | Controls scaling. Keep at 2× rank as default. |
| Max Sequence Length | 512–4096 | Longer = more memory. Match to your use case. |
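A quick sanity check on the effective-batch-size rule from the table, with illustrative values picked from the ranges above (the config dict is hypothetical, not a real trainer API):

```python
# Illustrative hyperparameters picked from the ranges in the table
config = {
    "learning_rate": 2e-4,             # LoRA starting point
    "per_device_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "num_epochs": 3,
    "lora_r": 16,
    "lora_alpha": 32,                  # 2 x rank, the default recommendation
    "max_seq_len": 2048,
}

# Effective batch = batch_size x gradient accumulation steps
effective_batch = (config["per_device_batch_size"]
                   * config["gradient_accumulation_steps"])
print(effective_batch)  # 32
```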
Evaluation & Metrics
How to actually measure whether your model is good — and which metric to use when.
Accuracy is often the worst metric to use. If 99% of emails are not spam, a model that predicts "not spam" for everything gets 99% accuracy — but catches zero spam. Always choose metrics that match your actual problem.
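That spam example can be checked in a few lines (synthetic labels, 99% negatives):

```python
# 1000 emails: 990 not-spam (0), 10 spam (1)
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000  # a "model" that always predicts not-spam

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print(accuracy)  # 0.99 -> looks great
print(recall)    # 0.0  -> catches zero spam
```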
Classification Metrics
All classification metrics derive from the confusion matrix — a table of True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN).
| Metric | Formula | Use When |
|---|---|---|
| Accuracy | (TP+TN) / All | Balanced classes, equal error costs |
| Precision | TP / (TP+FP) | False positives are costly (e.g., spam filter — don't delete real emails) |
| Recall | TP / (TP+FN) | False negatives are costly (e.g., cancer screening — don't miss cancer) |
| F1 Score | 2 × (P×R)/(P+R) | Need balance of precision & recall, imbalanced classes |
| ROC-AUC | Area under ROC curve | Ranking quality, comparing models across thresholds |
| PR-AUC | Area under Precision-Recall curve | Very imbalanced classes (better than ROC-AUC in this case) |
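The formulas in the table can be verified against a toy confusion matrix (the counts below are invented):

```python
TP, FP, TN, FN = 80, 10, 95, 20  # invented confusion-matrix counts

accuracy  = (TP + TN) / (TP + FP + TN + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * (precision * recall) / (precision + recall)

print(round(accuracy, 4))   # 0.8537
print(round(precision, 4))  # 0.8889
print(round(recall, 4))     # 0.8
print(round(f1, 4))         # 0.8421
```

Note how precision and recall tell different stories than accuracy alone: here the model misses 20 of 100 positives even though accuracy looks solid.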
Regression Metrics
| Metric | Formula | Notes |
|---|---|---|
| MSE | mean((y - ŷ)²) | Penalizes large errors heavily. In output units². |
| RMSE | √MSE | Same units as output. Most commonly reported. |
| MAE | mean(|y - ŷ|) | More robust to outliers than MSE. Easy to interpret. |
| R² | 1 - SS_res/SS_tot | % of variance explained. 1.0 = perfect. Can be negative (worse than baseline). |
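The same kind of check works for the regression metrics, on a tiny made-up set of predictions:

```python
y     = [3.0, 5.0, 2.0, 7.0]   # true values (made up)
y_hat = [2.5, 5.0, 3.0, 6.0]   # predictions (made up)

n = len(y)
mse  = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / n
rmse = mse ** 0.5
mae  = sum(abs(a - b) for a, b in zip(y, y_hat)) / n

mean_y = sum(y) / n
ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
ss_tot = sum((a - mean_y) ** 2 for a in y)
r2 = 1 - ss_res / ss_tot

print(mse)           # 0.5625 (in output units squared)
print(mae)           # 0.625  (same units as the output)
print(round(r2, 4))  # 0.8475 -> ~85% of variance explained
```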
Data Preparation
Real ML is 80% data work. Here's how to do it right.
Handling Missing Values
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# 1. DROP columns with too many missing values
df = df.dropna(thresh=0.8 * len(df), axis=1)  # drop cols with >20% missing

# 2. SIMPLE IMPUTATION
imputer = SimpleImputer(strategy='median')  # or 'mean', 'most_frequent'
X_imputed = imputer.fit_transform(X)

# 3. KNN IMPUTATION — fills missing with average of K nearest neighbors
# More accurate but slower
knn_imputer = KNNImputer(n_neighbors=5)
X_knn = knn_imputer.fit_transform(X)

# 4. ADD "was_missing" indicator column — let model learn the pattern
df['age_was_missing'] = df['age'].isna().astype(int)
df['age'] = df['age'].fillna(df['age'].median())
Handling Class Imbalance
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# OPTION 1: SMOTE — synthesize new minority class examples
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

# OPTION 2: class_weight='balanced' — let the model handle it
model = RandomForestClassifier(class_weight='balanced')

# OPTION 3: Custom weights
weights = {0: 1, 1: 10}  # 10x penalty for misclassifying minority class
model = RandomForestClassifier(class_weight=weights)
ML Master Cheat Sheet
Everything condensed. Keep this open while studying or building.
Key Formulas
Overfitting vs Underfitting
Algorithm Quick Picks
Fine-Tuning LLMs
Bias-Variance
Preprocessing Checklist
Glossary
Every key term in ML defined clearly and concisely.