Machine Learning — Complete Study Guide

Learn ML from first principles to fine-tuning AI

A structured, detailed, beginner-to-advanced guide covering every core concept — written simply so anyone can understand it deeply.

15 Chapters · 80+ Concepts Covered · 30+ Code Examples
Things to Learn
🧠

What is Machine Learning?

Start here. Understand what ML is, why it matters, and how it works at a high level.

3 lessons · Beginner
⚙️

How ML Actually Works

Representation, Evaluation, Optimization — the three building blocks of every algorithm.

4 lessons · Beginner
📈

Overfitting & Underfitting

The most critical problem in ML. Learn why your model fails and how to fix it.

5 lessons · Intermediate
⚖️

Bias-Variance Tradeoff

The fundamental tension at the heart of every ML system, explained visually.

3 lessons · Intermediate
🔧

Feature Engineering

The most important skill in ML that no textbook teaches properly.

4 lessons · Intermediate
🔬

Fine-Tuning AI Models

How to take a pre-trained model like GPT and adapt it to your specific task.

6 lessons · Advanced
📐

Algorithm Guide

Every major ML algorithm explained — when to use it, how it works, trade-offs.

8 algorithms · All levels
🕸️

Neural Networks & Deep Learning

From perceptrons to transformers — the architecture behind modern AI.

7 lessons · Advanced
📋

Master Cheat Sheet

Every formula, rule, and key fact in one condensed reference sheet.

Quick reference

Learning Roadmap

Follow this path from zero to fine-tuning real AI models. Each phase builds on the last.

Phase 1 — Foundation: "What & Why"
WEEKS 1–2 · No prerequisites
  • What is ML?
  • Types of ML
  • The 3 Components
  • Basic Python + NumPy
  • Math Review (Algebra, Stats)

Phase 2 — Core ML Concepts
WEEKS 3–5 · Requires Phase 1
  • Generalization
  • Overfitting
  • Bias-Variance
  • Curse of Dimensionality
  • Evaluation Metrics

Phase 3 — Algorithms & Practice
WEEKS 6–9 · Requires Phase 2
  • Classical Algorithms
  • Feature Engineering
  • Data Preparation
  • Ensembles
  • Scikit-Learn Projects

Phase 4 — Deep Learning
WEEKS 10–14 · Requires Phase 3
  • Neural Networks
  • CNNs (Images)
  • RNNs / LSTMs
  • Transformers
  • PyTorch Basics

Phase 5 — Fine-Tuning & LLMs
WEEKS 15–20 · Requires Phase 4
  • Transfer Learning
  • Fine-Tuning LLMs
  • LoRA & PEFT
  • RLHF Basics
  • Deploy a Model

What is Machine Learning?

Understanding what ML is, where it came from, and why it's revolutionizing every industry.

📖 Beginner · ⏱ 15 min read · Based on Domingos (2012) and other sources

The Simple Definition

Machine learning is a way of programming computers using data instead of explicit instructions. Instead of writing a program that says "if the email contains the word 'win money' then mark it as spam," you show the computer thousands of examples of spam and non-spam emails and let it figure out the rules itself.

✦ Core Definition

Machine Learning is a field of computer science that gives computers the ability to learn from data without being explicitly programmed for every situation. The system improves its performance as it is exposed to more data over time.

The classic alternative — hand-coded rules — breaks down fast. Think about spam filters: spammers constantly change their tricks. A rules-based system needs constant manual updates. An ML system can re-train itself and adapt automatically.

A Real Example: Spam Filter

Let's make this concrete. A spam filter built with ML works like this:

  1. Collect training data

    Gather thousands of emails labeled "spam" or "not spam" by humans.

  2. Extract features

    Turn each email into numbers the computer can understand — e.g., does it contain "free money"? How many exclamation marks? Who is the sender?

  3. Train the model

    Run a learning algorithm that finds patterns — combinations of features that predict spam vs not-spam.

  4. Evaluate and test

    Check how well the model works on emails it has never seen before.

  5. Deploy and update

    Release the model. As new spam patterns emerge, re-train on fresh data.
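The five steps above can be sketched in a few lines. This is a minimal illustration, not a production filter: the features (presence of "free", presence of "win", exclamation count) and the tiny hand-labeled sample are made up, and logistic regression stands in for whichever learner you would actually choose.

```python
from sklearn.linear_model import LogisticRegression

# 1-2. Labeled data with features already extracted:
#      [contains "free", contains "win", exclamation_count]
X_train = [[1, 1, 5], [0, 0, 0], [1, 0, 3], [0, 0, 1], [1, 1, 4], [0, 1, 0]]
y_train = [1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = not spam

# 3. Train the model
model = LogisticRegression().fit(X_train, y_train)

# 4. Evaluate (here on the training data, for brevity;
#    real projects evaluate on a held-out test set)
acc = model.score(X_train, y_train)

# 5. "Deploy": score a new incoming email
pred = model.predict([[1, 1, 4]])
```

With real email, step 2 (turning text into numbers) is where most of the engineering effort goes; the training call itself stays this short.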

Where ML Is Used Today

ML is already embedded in almost every digital product you use:

| Domain | ML Application | What it learns |
|---|---|---|
| Search Engines | Google Search ranking | Which pages are most relevant to your query |
| Social Media | TikTok/YouTube recommendations | What content you'll keep watching |
| Finance | Fraud detection, credit scoring | Patterns that indicate fraudulent transactions |
| Healthcare | Medical image diagnosis | What tumors and diseases look like in scans |
| Language | ChatGPT, Claude, Gemini | How language works and how to respond helpfully |
| Self-driving | Tesla Autopilot | How to navigate roads, detect objects |

ML vs Traditional Programming

Traditional Programming

  • You write explicit rules
  • Input + Rules → Output
  • You must anticipate every case
  • Brittle — breaks with new patterns
  • Good for well-defined problems

Machine Learning

  • You provide data + expected answers
  • Input + Output → Rules (learned)
  • Generalizes to unseen cases
  • Adapts as new data arrives
  • Good for pattern-heavy problems
💡 Key Insight from Domingos (2012)

Machine learning is not magic — it cannot get something from nothing. What it does is get more from less. Programming is like building from scratch. Learning is more like farming: you combine seeds (knowledge) with nutrients (data) to grow programs.


How ML Actually Works

Every ML algorithm is a combination of three core components: Representation, Evaluation, and Optimization.

📖 Beginner · ⏱ 20 min read

There are thousands of ML algorithms. Choosing one seems overwhelming. But here's the secret: every single learning algorithm is just a combination of three components. Once you understand these, the whole landscape of ML makes sense.

The Three Components

Component 1

Representation — "What form can the answer take?"

A classifier (or model) must be expressed in some formal language the computer can handle. Your choice of representation defines the hypothesis space — the set of all answers the model could possibly learn. If the answer isn't in this space, the model literally cannot learn it, no matter how much data you give it.

Component 2

Evaluation — "How do we score a candidate answer?"

An evaluation function (also called an objective function, loss function, or scoring function) measures how good a particular model is. For example: accuracy, error rate, log-likelihood, or F1 score. The algorithm uses this to distinguish better models from worse ones.

Component 3

Optimization — "How do we search for the best answer?"

Given the evaluation function, we need a method to search through all possible models and find the highest-scoring one. Common choices: gradient descent, greedy search, genetic algorithms, branch-and-bound. This determines both the quality and speed of learning.
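The three components can be made concrete in a dozen lines. This sketch hand-rolls the simplest possible combination — representation: a line ŷ = wx + b; evaluation: mean squared error; optimization: gradient descent — on synthetic data whose true relationship (slope 3, intercept 1) is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 50)   # true line plus noise

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    y_hat = w * x + b                  # REPRESENTATION: a candidate line
    error = y_hat - y
    loss = np.mean(error ** 2)         # EVALUATION: mean squared error
    w -= lr * np.mean(2 * error * x)   # OPTIMIZATION: step downhill on the loss
    b -= lr * np.mean(2 * error)

print(f"w = {w:.2f}, b = {b:.2f}, final MSE = {loss:.4f}")
```

Swapping any one component — a tree instead of a line, accuracy instead of MSE, a greedy search instead of gradient descent — gives you a different algorithm; that is the whole point of the three-component view.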

The Full Landscape

| Representation | Evaluation | Optimization |
|---|---|---|
| K-Nearest Neighbor | Accuracy / Error rate | Greedy search |
| Linear / Logistic Regression | Squared error / Likelihood | Gradient descent |
| Decision Trees | Information gain, Gini | Greedy recursive split |
| Support Vector Machines | Margin | Quadratic programming |
| Neural Networks | Cross-entropy loss | Backprop + Adam/SGD |
| Naive Bayes | Posterior probability | Closed-form calculation |
| Random Forest | Gini / Entropy | Bagging + greedy splits |

The ML Pipeline — End to End

📦 Raw Data → 🧹 Clean & Preprocess → 🔧 Feature Engineering → 🧠 Choose Model → 🏋️ Train → 📊 Evaluate → 🚀 Deploy

Decision Tree — A Concrete Example

Let's see all three components in action with a decision tree for spam detection:

Python — Decision Tree (Simplified)
# REPRESENTATION: A tree of if/else questions
# EVALUATION: Information Gain (how much does a split reduce uncertainty?)
# OPTIMIZATION: Greedy search — pick the best split at each step

from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Features: [has "free", has "win", exclamation_count, link_count]
X_train = np.array([
    [1, 1, 5, 3],   # spam
    [0, 0, 1, 0],   # not spam
    [1, 0, 3, 2],   # spam
    [0, 1, 0, 1],   # not spam
])
y_train = [1, 0, 1, 0]  # 1=spam, 0=not spam

# Train — the algorithm finds the best splits automatically
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)

# Predict new emails
new_email = [[1, 1, 4, 2]]  # suspicious features
print(model.predict(new_email))  # → [1] (spam)
📌 Important Note

Most ML textbooks are organized around the Representation column only. This makes it easy to miss that Evaluation and Optimization are equally important. Two models with the same representation but different optimization strategies can produce very different results.

Types of Machine Learning

The three main paradigms — and when to use each one.

📖 Beginner · ⏱ 12 min read

1. Supervised Learning

You provide the model with labeled examples — both the inputs (features) and the correct outputs (labels). The model learns to map inputs to outputs. This is by far the most common type of ML.

✦ Analogy

Like a student learning with an answer key — they see the question AND the correct answer, and learn the pattern that connects them.

Two main subtypes:

| Task | Output Type | Example | Algorithms |
|---|---|---|---|
| Classification | Discrete class label | Spam vs Not Spam, Dog vs Cat | Logistic Regression, SVM, Decision Trees, Neural Nets |
| Regression | Continuous number | Predict house price, stock value | Linear Regression, Random Forest, Neural Nets |

2. Unsupervised Learning

You give the model unlabeled data — inputs only, no correct answers. The model must find structure, patterns, or groupings by itself.

✦ Analogy

Like organizing a pile of random photos with no instructions — you'd naturally group them by scene (beaches, cities, people). The model does the same with data.

| Task | What it does | Example | Algorithms |
|---|---|---|---|
| Clustering | Groups similar data points | Customer segmentation | K-Means, DBSCAN, Hierarchical |
| Dimensionality Reduction | Compress data while keeping structure | Visualizing high-dim data | PCA, t-SNE, UMAP |
| Anomaly Detection | Find unusual data points | Fraud detection | Isolation Forest, Autoencoders |
| Generative Models | Learn to generate new data | Image generation | VAE, GAN, Diffusion Models |
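A minimal clustering sketch on synthetic data: K-Means is given two obvious blobs of unlabeled 2-D points (the blob locations are made up for illustration) and recovers their centers with no labels at all.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs of 2-D points — inputs only, no labels
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# K-Means must be told K; it discovers the groups on its own
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # near [0, 0] and [5, 5], in some order
```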

3. Reinforcement Learning

An agent learns by taking actions in an environment and receiving rewards or penalties. There's no labeled dataset — the model learns through trial and error, trying to maximize cumulative reward over time.

✦ Analogy

Like training a dog with treats. You don't show it every possible situation — you let it explore, reward good behavior, and punish bad. Over millions of tries, it figures out the optimal strategy.

Used in: Game playing (AlphaGo, chess engines), robotics, ad bidding systems, and crucially — training LLMs like ChatGPT (RLHF: Reinforcement Learning from Human Feedback).
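The reward-driven loop can be shown with a toy far simpler than RLHF: a multi-armed bandit with an epsilon-greedy agent. The payout probabilities below are invented for illustration; the agent never sees them and must discover the best action by trial and error.

```python
import random

random.seed(0)
true_payout = [0.2, 0.5, 0.8]   # hidden from the agent

estimates = [0.0, 0.0, 0.0]     # agent's current value estimate per action
counts = [0, 0, 0]
epsilon = 0.1                    # explore 10% of the time

for _ in range(5000):
    if random.random() < epsilon:
        action = random.randrange(3)              # explore a random action
    else:
        action = estimates.index(max(estimates))  # exploit best-known action
    reward = 1 if random.random() < true_payout[action] else 0
    counts[action] += 1
    # incremental average: estimate += (reward - estimate) / n
    estimates[action] += (reward - estimates[action]) / counts[action]

print("best action:", estimates.index(max(estimates)))
```

Balancing exploration against exploitation — and learning values from delayed, noisy rewards — is the core of every RL method, from this toy up to RLHF.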

Generalization — The True Goal

Why performing well on training data means nothing, and what actually matters in ML.

📖 Beginner–Intermediate · ⏱ 14 min read
⚡ The Fundamental Goal of ML

The only thing that matters is how well the model performs on data it has never seen before. This is called generalization. Getting 100% accuracy on your training data is easy — and meaningless.

Why Training Accuracy Is a Trap

Imagine you are studying for an exam. If your teacher gives you the exact exam questions in advance and you memorize all the answers, you'll get 100%. But if the real exam has slightly different questions, you'll fail — because you memorized, not understood.

This is exactly what happens when an ML model memorizes its training data. It's called overfitting — and it's the #1 problem in ML.

Python — Train/Test Split (Correct Way)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split data: 80% training, 20% testing
# CRITICAL: Never touch test data until final evaluation!
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)  # Train on training data ONLY

train_acc = model.score(X_train, y_train)
test_acc  = model.score(X_test, y_test)

print(f"Train accuracy: {train_acc:.2%}")  # Might be 99% — don't celebrate
print(f"Test accuracy:  {test_acc:.2%}")   # THIS is what matters

Cross-Validation — The Gold Standard

Holding out 20% for testing means that 20% of your data never contributes to training. Cross-validation solves this by rotating which portion is held out.

  1. Split data into K equal parts (folds)

    Common choice: K=5 or K=10 folds.

  2. Train K times

    Each time, use K-1 folds for training and 1 fold for validation.

  3. Average the results

    Take the average performance across all K runs. This is your estimate of generalization.

Python — 5-Fold Cross Validation
from sklearn.model_selection import cross_val_score

model = DecisionTreeClassifier(max_depth=5)

# 5-fold CV: trains 5 times, tests on each held-out fold
scores = cross_val_score(model, X, y, cv=5)

print(f"CV scores: {scores}")          # [0.92, 0.88, 0.90, 0.91, 0.89]
print(f"Mean: {scores.mean():.3f}")     # 0.900
print(f"Std:  {scores.std():.3f}")      # 0.013 (low = stable model)

Data Alone Is Never Enough

This is a deep and surprising result called the No Free Lunch Theorem (Wolpert, 1996): no algorithm can beat random guessing over all possible problems. Every learner must make assumptions — called inductive biases — about the world to generalize.

💡 The No Free Lunch Theorem

There is no universally best machine learning algorithm. An algorithm that works well on one problem must be making assumptions that fail on another. The best algorithm always depends on the problem — which is why choosing your model based on domain knowledge matters.

Overfitting & Underfitting

The two ways your model can fail — and a toolkit for fixing both.

⚡ Critical Topic · 📖 Intermediate · ⏱ 20 min read

What is Overfitting?

Overfitting happens when a model learns the training data too well — including its noise and random quirks — and fails to generalize to new data.

⚠ Classic Symptom

Training accuracy = 99%. Test accuracy = 62%. Your model has memorized the training set, not learned the underlying pattern. It's useless in the real world.

Think of a student who memorizes every exam from the past 10 years word-for-word, but has no real understanding. They'll ace those exact exams but fail any new questions.

What is Underfitting?

Underfitting is the opposite: the model is too simple to capture the real pattern in the data. Both training and test accuracy are poor.

Example: trying to fit a curved relationship with a straight line. No matter how much data you have, a line can't capture the curve.

The Visual Intuition

| Problem | Train Error | Test Error | Cause | Fix |
|---|---|---|---|---|
| Just Right ✓ | Low | Low | Balanced model complexity | — |
| Overfitting | Very Low | High | Model too complex / too little data | Regularize, get more data, simplify model |
| Underfitting | High | High | Model too simple | More complex model, more features |

Techniques to Fight Overfitting

1. Regularization

Add a penalty to the loss function for complexity. Forces the model to stay simple.

Regularization — Ridge vs Lasso
from sklearn.linear_model import Ridge, Lasso

# Ridge (L2): penalizes large weights
# Loss = MSE + λ * Σ(weights²)
ridge = Ridge(alpha=1.0)  # alpha = λ, controls regularization strength

# Lasso (L1): pushes some weights to exactly zero (feature selection)
# Loss = MSE + λ * Σ|weights|
lasso = Lasso(alpha=0.1)

# Rule of thumb:
# - Ridge: when you think all features matter but need smaller weights
# - Lasso: when you think many features are irrelevant (automatic selection)

2. Dropout (for Neural Networks)

During training, randomly "switch off" a fraction of neurons. This prevents any neuron from becoming too dominant and forces the network to learn redundant representations.

Dropout in PyTorch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # 50% of neurons randomly off during training
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.3),   # 30% dropout on second layer
    nn.Linear(128, 10)
)

3. Early Stopping

Monitor validation loss during training. Stop training when validation loss starts increasing — even if training loss keeps dropping.
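Early stopping can be sketched in a few lines. This example deliberately uses an overcomplex model class (a degree-9 polynomial fit by gradient descent) on made-up noisy data; the patience of 50 epochs and all constants are illustrative, not prescriptive.

```python
import numpy as np

rng = np.random.default_rng(1)

def poly_features(x, degree=9):
    # deliberately overcomplex model class — it will overfit without help
    return np.vander(x, degree + 1)

x_tr = rng.uniform(-1, 1, 20); y_tr = np.sin(3 * x_tr) + rng.normal(0, 0.3, 20)
x_va = rng.uniform(-1, 1, 20); y_va = np.sin(3 * x_va) + rng.normal(0, 0.3, 20)
A_tr, A_va = poly_features(x_tr), poly_features(x_va)

w = np.zeros(A_tr.shape[1])
best_w, best_val = w.copy(), np.inf
patience, bad_epochs = 50, 0

for epoch in range(5000):
    grad = 2 * A_tr.T @ (A_tr @ w - y_tr) / len(y_tr)
    w -= 0.05 * grad                        # plain gradient descent on train MSE
    val_loss = np.mean((A_va @ w - y_va) ** 2)
    if val_loss < best_val:                 # new best — remember these weights
        best_val, best_w, bad_epochs = val_loss, w.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:          # no improvement for 50 epochs: stop
            break

w = best_w   # roll back to the weights with the lowest validation loss
```

Frameworks like Keras (`EarlyStopping`) and PyTorch Lightning ship this logic built in; the idea is exactly the loop above.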

4. Get More Data

Overfitting is fundamentally a problem of having too little data relative to model complexity. More data is almost always the best fix, when available.

5. Data Augmentation

Artificially expand your training set by creating variations of existing examples. For images: rotate, flip, crop, adjust brightness. For text: paraphrase, back-translate.
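The image transformations can be hand-rolled on a fake 8×8 array to show how simple they are — in practice, libraries such as torchvision or albumentations provide these (and many more) off the shelf.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.uniform(0, 1, (8, 8))                # fake grayscale "image"

flipped = np.fliplr(image)                       # horizontal flip
rotated = np.rot90(image)                        # 90-degree rotation
brighter = np.clip(image * 1.2, 0, 1)            # brightness jitter, clipped to [0, 1]
top, left = rng.integers(0, 3, size=2)
cropped = image[top:top + 6, left:left + 6]      # random 6x6 crop

augmented = [flipped, rotated, brighter, cropped]  # 4 extra examples from 1 original
```

The label stays the same for every variant — a flipped cat is still a cat — which is why augmentation effectively multiplies your labeled data for free.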


The Bias-Variance Tradeoff

The deep mathematical reason behind overfitting and underfitting — and why you can never eliminate both simultaneously.

📖 Intermediate · ⏱ 18 min read

Decomposing Error

Any ML model's total error on new data can be split into three parts:

Total Error = Bias² + Variance + Irreducible Noise
The Bias-Variance Decomposition

Understanding each term is key to diagnosing and fixing your model's problems.

High Bias

Model is too simple. It consistently makes the same types of mistakes, no matter what data you give it. This is underfitting.

Example: fitting a straight line to data that follows a curve

High Variance

Model is too complex. It changes drastically with small changes in training data. This is overfitting.

Example: a deep decision tree that learns every noise point

High Both

Very complex model that still misses the real pattern. Common with wrong architecture choices.

Example: polynomial regression with wrong degree on messy data

Low Both ✓

The ideal model. Complex enough to capture the real pattern, but not so complex it memorizes noise.

This is what you're always aiming for
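The decomposition can be estimated empirically: fit the same model class on many freshly sampled training sets and examine the spread of its predictions at a single point. This sketch uses made-up data from a sine curve and contrasts a degree-1 line with a degree-9 polynomial; all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)
x0 = 0.25                        # evaluation point; true value is sin(pi/2) = 1

def fit_and_predict(degree):
    preds = []
    for _ in range(200):         # 200 independently sampled training sets
        x = rng.uniform(0, 1, 30)
        y = true_f(x) + rng.normal(0, 0.3, 30)
        coef = np.polyfit(x, y, degree)     # the model class: a degree-d polynomial
        preds.append(np.polyval(coef, x0))
    preds = np.array(preds)
    bias_sq = (preds.mean() - true_f(x0)) ** 2
    return bias_sq, preds.var()

for degree in (1, 9):
    b, v = fit_and_predict(degree)
    print(f"degree {degree}: bias^2 = {b:.3f}, variance = {v:.3f}")
# The straight line shows high bias; the degree-9 fit shows higher variance.
```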

The Dart Board Analogy (from Domingos)

Imagine your model is a dart thrower aiming at a target (the true answer). You run many trials with different training sets and observe where the darts land:

| Scenario | Darts Clustered? | Darts on Target? | Interpretation |
|---|---|---|---|
| Low Bias, Low Variance | ✓ Yes (tight cluster) | ✓ Yes (near center) | Perfect — this is the goal |
| High Bias, Low Variance | ✓ Yes (tight cluster) | ✗ No (off-center) | Consistently wrong — model too simple |
| Low Bias, High Variance | ✗ No (spread out) | ∼ On target on average | Inconsistent — model too complex |
| High Bias, High Variance | ✗ No (spread out) | ✗ No (off-center) | Worst case |

Surprising Result: Strong Wrong Assumptions Can Beat Weak True Ones

Here is a counterintuitive but crucial insight from the paper by Domingos (2012): naive Bayes, which assumes all features are completely independent (which is almost never true), can outperform a rule learner on problems where the truth is a set of rules — because naive Bayes doesn't overfit as badly.

💡 Key Insight

With limited data, a learner with strong (even wrong) assumptions can outperform one with correct but weak assumptions — because strong assumptions reduce variance at the cost of introducing some bias, and with small datasets that tradeoff is often worth it. This is why naive Bayes works surprisingly well in practice.

How to Control the Tradeoff

| Action | Effect on Bias | Effect on Variance |
|---|---|---|
| Increase model complexity | ↓ Decreases | ↑ Increases |
| Add more training data | ≈ Same | ↓ Decreases |
| Add regularization (L1/L2) | ↑ May increase | ↓ Decreases |
| Feature selection | ↑ May increase | ↓ Decreases |
| Use ensemble methods | ↓ Decreases (boosting) | ↓ Decreases (bagging) |
| Reduce number of features | ↑ May increase | ↓ Decreases |

The Curse of Dimensionality

Why more features can hurt your model — and how to fight back.

📖 Intermediate · ⏱ 15 min read
⚠ Definition

The Curse of Dimensionality (coined by Bellman, 1961) refers to the exponential growth in problems that occur when working with high-dimensional data. As features (dimensions) increase, the data becomes increasingly sparse, distances lose meaning, and models become harder to train.

The Exponential Sparsity Problem

Imagine you have 1,000 training examples in a 1D space (one feature). They fill the space reasonably well. Now add a second dimension: you need roughly 1,000² = 1,000,000 examples to fill the 2D space equally well. By 10 dimensions: you'd need 10^30 examples. For 100 dimensions: you'd need more examples than atoms in the universe.

Examples needed ∝ n^d
Where n = examples per dimension, d = number of dimensions. This grows explosively.

Distances Stop Working

Most ML algorithms (K-NN, SVM, clustering) rely on the idea that similar examples are nearby in feature space. In high dimensions, this completely breaks down:

  • All points become approximately equidistant from each other
  • The "nearest neighbor" is no more similar than the "farthest" point
  • K-NN becomes effectively random in very high dimensions
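A quick experiment (synthetic uniform data, Euclidean distances via NumPy) makes the concentration effect visible: the relative gap between the nearest and farthest point collapses as dimension grows.

```python
import numpy as np

# 500 random points and one query point, at increasing dimension d
rng = np.random.default_rng(0)
contrasts = {}
for d in (2, 100, 10_000):
    points = rng.uniform(0, 1, (500, d))
    query = rng.uniform(0, 1, d)
    dists = np.linalg.norm(points - query, axis=1)
    # relative contrast: how much nearer is the nearest point than the farthest?
    contrasts[d] = (dists.max() - dists.min()) / dists.min()
    print(f"d = {d:6d}   relative contrast = {contrasts[d]:.2f}")
```

At d = 2 the nearest point is many times closer than the farthest; at d = 10,000 the contrast is a small fraction — "nearest" has nearly stopped meaning anything.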

The "Blessing of Non-Uniformity"

Fortunately, real-world data has a saving grace: it doesn't actually fill all of high-dimensional space. Real data tends to lie on a lower-dimensional manifold — a structure with far fewer effective dimensions than the raw feature count.

✦ Example

An image of a handwritten digit has 784 pixels (784 dimensions). But the "space of all valid digit images" is much smaller — a small manifold within that 784D space. This is why K-NN works well on MNIST despite its theoretical problems with high dimensions.

Practical Solutions

| Technique | How it helps | When to use |
|---|---|---|
| PCA (Principal Component Analysis) | Projects data onto fewer dimensions that explain most variance | Preprocessing step for many algorithms |
| t-SNE / UMAP | Non-linear reduction, great for visualization | Visualizing high-dim data in 2D/3D |
| Feature Selection | Remove irrelevant features using statistical tests or model importance | When you have domain knowledge or too many features |
| Regularization | Penalizes models that use many features (Lasso forces weights to zero) | Built into your model training process |
| Deep Learning | Automatically learns compact low-dimensional representations (embeddings) | Large datasets, complex patterns |
Python — PCA Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Always scale first — PCA is sensitive to feature scales
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce from 100 features to 20, keeping 95% of variance
pca = PCA(n_components=0.95)  # Keep 95% explained variance
X_reduced = pca.fit_transform(X_scaled)

print(f"Original: {X.shape[1]} features")
print(f"Reduced:  {X_reduced.shape[1]} features")  # Much smaller!
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.1%}")

Feature Engineering — The Art of ML

The single most impactful thing you can do to improve a model. More important than choosing the right algorithm.

📖 Intermediate · ⏱ 22 min read
⚡ Key Quote (Domingos, 2012)

"The most important factor [in ML project success] is the features used. Learning is easy if you have many independent features that each correlate well with the class." Feature engineering is where intuition, creativity, and domain expertise are as important as technical skill.

What is Feature Engineering?

Raw data is rarely in a form that algorithms can directly use. Feature engineering is the process of transforming raw data into informative, discriminative inputs for your model. It includes:

  • Feature creation — constructing new features from raw data
  • Feature transformation — scaling, encoding, normalizing
  • Feature selection — removing irrelevant or redundant features
  • Feature extraction — learning compact representations (e.g., PCA, embeddings)

Common Transformations with Code

Numerical Features

Python — Numerical Feature Engineering
import pandas as pd
import numpy as np

df = pd.read_csv('houses.csv')

# 1. SCALING — essential for distance-based and gradient methods
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# StandardScaler: mean=0, std=1 — use for most algorithms
scaler = StandardScaler()
df['price_scaled'] = scaler.fit_transform(df[['price']])

# MinMaxScaler: range [0,1] — use for neural networks, image data
mm = MinMaxScaler()
df['area_norm'] = mm.fit_transform(df[['area']])

# 2. LOG TRANSFORM — fix skewed distributions
df['log_income'] = np.log1p(df['income'])  # log1p handles zeros

# 3. BINNING — turn continuous into categorical
df['age_group'] = pd.cut(df['age'],
    bins=[0, 18, 35, 65, 100],
    labels=['minor', 'young', 'middle', 'senior']
)

# 4. INTERACTION FEATURES — combine existing features
df['price_per_sqft'] = df['price'] / df['sqft']
df['rooms_per_floor'] = df['rooms'] / df['floors']

Categorical Features

Python — Categorical Feature Encoding
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# 1. LABEL ENCODING — for ordinal data (order matters)
# e.g., "low" < "medium" < "high"
le = LabelEncoder()
df['quality_enc'] = le.fit_transform(df['quality'])

# 2. ONE-HOT ENCODING — for nominal data (order doesn't matter)
# e.g., "red", "blue", "green" — turns each into its own binary column
df_encoded = pd.get_dummies(df, columns=['color', 'city'])

# 3. TARGET ENCODING — for high-cardinality categories
# Replace each category with the mean of the target for that category
target_means = df.groupby('city')['price'].mean()
df['city_price_mean'] = df['city'].map(target_means)

# WARNING: Target encoding can cause leakage on small datasets
# Use cross-validation target encoding in production

Feature Selection Methods

| Method | How it works | Best for |
|---|---|---|
| Correlation Filter | Remove features with correlation > 0.95 to another feature | Quick preprocessing |
| Chi-Square Test | Statistical test: does this feature have a relationship with the target? | Classification with categorical features |
| Feature Importance (Trees) | Random Forest / XGBoost tell you how much each feature reduces impurity | Any prediction problem |
| Lasso Regression | L1 penalty drives unimportant weights to exactly 0 | Linear models |
| Recursive Feature Elimination | Iteratively remove least important features and re-train | When you need a specific number of features |
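Two of these methods — tree-based importance and Recursive Feature Elimination — can be sketched on synthetic data where only 5 of 20 features are actually informative. The dataset and counts here are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# 20 features, only 5 of which carry signal about the class
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

# Tree-based importance: which features reduce impurity the most?
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top_by_forest = np.argsort(forest.feature_importances_)[::-1][:5]

# Recursive Feature Elimination: iteratively drop the weakest feature
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
top_by_rfe = np.flatnonzero(rfe.support_)

print("forest picks:", sorted(top_by_forest.tolist()))
print("RFE picks:   ", sorted(top_by_rfe.tolist()))
```

The two methods often agree on the strongest features but not perfectly — they measure "importance" in different ways, so comparing their picks is itself informative.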

The Algorithm Guide

Every major ML algorithm — explained clearly, with when to use each one.

📖 All levels · ⏱ 30 min read
✦ Rule of Thumb (Domingos)

Always try the simplest algorithm first. Naive Bayes before logistic regression. K-NN before SVMs. Simpler models are faster, more interpretable, and often surprisingly competitive — especially with limited data.

Linear / Logistic Regression
Classification · Regression · Interpretable
Fits a straight line (or hyperplane) through data. Logistic regression predicts probabilities using the sigmoid function: P(y=1) = 1/(1+e^(-wx)). Despite the name, logistic regression is a classification algorithm. One of the most widely used algorithms in industry due to its interpretability.
Use when: Features have linear relationship with target, you need interpretability, you have few features, baseline model.
K-Nearest Neighbors (K-NN)
Classification · Regression · Non-parametric
To classify a new point, find its K most similar training examples and vote. No training step — all computation happens at prediction time ("lazy learner"). The key choice is K: small K = complex, noisy boundary; large K = smoother, simpler boundary.
Use when: Small datasets, local patterns matter, quick prototype. Avoid with high-dimensional data (curse of dimensionality kills it).
Decision Trees
Classification · Regression · Interpretable
Recursively splits data by asking yes/no questions about features. At each node, picks the feature and threshold that best separates the classes (measured by Information Gain or Gini impurity). Highly interpretable — you can visualize and explain every prediction. But: prone to overfitting without depth limits.
Use when: Interpretability is required, mixed feature types, non-linear boundaries. Add max_depth to prevent overfitting.
Random Forest
Classification · Regression · Ensemble · Robust
An ensemble of hundreds of decision trees, each trained on a random subset of data (bagging) and a random subset of features. Predictions are made by majority vote. Dramatically reduces variance compared to a single tree. One of the most reliable "off-the-shelf" algorithms in existence.
Use when: Tabular data, mixed features, you want a strong baseline, feature importance is useful. Great starting point for most supervised learning problems.
Gradient Boosting (XGBoost / LightGBM)
Classification · Regression · Ensemble · State-of-the-Art
Builds trees sequentially, where each tree corrects the errors of the previous ones. Uses gradient descent in function space. XGBoost and LightGBM are highly optimized implementations that dominate Kaggle competitions for tabular data. Requires tuning but delivers exceptional results.
Use when: Tabular data, you want maximum accuracy, you have time to tune. The go-to for structured/tabular prediction tasks.
Support Vector Machine (SVM)
Classification · Regression · Kernel trick
Finds the hyperplane that maximizes the margin between classes. The "kernel trick" allows SVMs to work in very high-dimensional implicit feature spaces without explicitly computing them. Theoretically elegant and works well with clear margins. Scales poorly to large datasets.
Use when: Small-to-medium datasets with clear class separation, text classification, high-dimensional features (like TF-IDF vectors).
Naive Bayes
Classification · Probabilistic · Fast
Applies Bayes' theorem with the "naive" assumption that all features are conditionally independent given the class. This assumption is almost always false, yet Naive Bayes works remarkably well — especially for text. Very fast to train, works with tiny datasets, handles missing data naturally.
Use when: Text classification (spam, sentiment), small datasets, need fast training, good first baseline. Often surprisingly competitive.
K-Means Clustering
Unsupervised · Clustering
Partitions data into K clusters by iteratively assigning points to the nearest centroid and updating centroids. Must specify K in advance. Sensitive to initialization (use K-Means++ to fix this). Assumes clusters are spherical and similar in size.
Use when: Customer segmentation, topic modeling, data exploration. Avoid when clusters have very different sizes or non-spherical shapes.

Neural Networks & Deep Learning

From the single neuron to the Transformer architecture powering ChatGPT.

📖 Advanced · ⏱ 35 min read

The Neuron — Building Block

A single artificial neuron does three things: takes weighted inputs, sums them up, and applies an activation function to produce an output.

output = activation( w₁x₁ + w₂x₂ + ... + wₙxₙ + bias )
The neuron equation. Weights (w) are learned. Bias shifts the activation threshold.
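The neuron equation, written directly in NumPy with made-up weights and a ReLU activation:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def neuron(x, w, b, activation=relu):
    # weighted sum of inputs plus bias, passed through an activation
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.8, 0.2, -0.5])   # learned weights (made up here)
b = 0.1                          # learned bias

print(neuron(x, w, b))  # relu(0.4 - 0.2 - 1.0 + 0.1) = relu(-0.7) = 0.0
```

A "layer" is just many of these neurons sharing the same inputs, which is why layers are implemented as a single matrix multiplication.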

Activation Functions — Why They Matter

Without activation functions, any neural network — no matter how deep — collapses to a simple linear function. Activations introduce non-linearity, which is what gives deep networks their power.

| Function | Formula | Range | When to Use |
|---|---|---|---|
| ReLU | max(0, x) | [0, ∞) | Hidden layers — default choice, fast, avoids vanishing gradients |
| Sigmoid | 1/(1+e^(-x)) | (0, 1) | Binary classification output layer only |
| Softmax | e^xᵢ / Σe^xⱼ | (0, 1), sums to 1 | Multi-class classification output layer |
| Tanh | (e^x - e^(-x))/(e^x + e^(-x)) | (-1, 1) | Recurrent networks, when zero-centered output matters |
| GELU | x·Φ(x) | ≈(-0.17, ∞) | Transformers (GPT, BERT use this) |

Backpropagation — How Neural Nets Learn

Backpropagation uses the chain rule of calculus to efficiently compute how much each weight contributed to the error, then adjusts them in the direction that reduces the error.

  1. Forward Pass

    Feed input through the network, layer by layer, to get a prediction.

  2. Compute Loss

    Compare prediction to actual answer using a loss function (e.g., cross-entropy for classification).

  3. Backward Pass

    Propagate the error gradient backward through the network using the chain rule. Compute ∂Loss/∂w for every weight.

  4. Update Weights

    Adjust each weight by a small step in the direction that reduces the loss: w ← w - η·(∂Loss/∂w), where η is the learning rate.

  5. Repeat

    Do this for thousands of mini-batches over many epochs until loss converges.
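The five steps above, hand-coded for a tiny one-hidden-layer network learning y = x². The data is synthetic, and the architecture (8 tanh units) and learning rate are illustrative choices, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, (64, 1))
y = x ** 2

W1 = rng.normal(0, 0.5, (1, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)
lr = 0.1

for epoch in range(5000):
    # 1. forward pass
    h = np.tanh(x @ W1 + b1)
    y_hat = h @ W2 + b2
    # 2. compute loss (MSE here; cross-entropy for classification)
    loss = np.mean((y_hat - y) ** 2)
    # 3. backward pass — chain rule, layer by layer
    d_out = 2 * (y_hat - y) / len(x)       # dLoss/dy_hat
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (1 - h ** 2)    # tanh'(z) = 1 - tanh(z)^2
    dW1 = x.T @ d_h
    db1 = d_h.sum(axis=0)
    # 4. update each weight against its gradient: w <- w - lr * dLoss/dw
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= lr * g
    # 5. repeat for many epochs

print(f"final loss: {loss:.4f}")
```

Frameworks like PyTorch automate steps 3–4 (`loss.backward()` plus an optimizer step), but this is what they compute under the hood.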

Architecture Types

| Architecture | Input Type | Key Idea | Use Case |
|---|---|---|---|
| MLP (Fully Connected) | Tabular, fixed-size | Each neuron connects to all neurons in next layer | Tabular data, classification |
| CNN (Convolutional) | Images, grids | Local filters + shared weights = spatial invariance | Image classification, object detection |
| RNN / LSTM | Sequences | Hidden state carries information across time steps | Time series, older NLP tasks |
| Transformer | Sequences | Self-attention: every position attends to every other | NLP (GPT, BERT), vision, multimodal |
| Autoencoder | Any | Compress to bottleneck then reconstruct (encoder + decoder) | Anomaly detection, representation learning |
| GAN | Noise | Generator vs Discriminator adversarial training | Image generation, data augmentation |

The Transformer — How Modern AI Works

The Transformer (Vaswani et al., 2017, "Attention Is All You Need") is the architecture behind GPT, BERT, Claude, Gemini, and nearly all modern AI. Its key innovation is self-attention.

🔑 Self-Attention Explained Simply

For each word in a sentence, self-attention asks: "which other words in this sentence are most relevant to understanding THIS word?" It computes a weighted sum of all other words' representations, where the weights reflect how relevant each word is. This happens in parallel for every position simultaneously — unlike RNNs which process sequentially.

Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V
The self-attention formula. Q=Queries, K=Keys, V=Values. d_k = key dimension (for scaling).
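The formula translates almost directly into NumPy. This sketch skips the learned Q/K/V projection matrices and simply uses the token matrix itself for all three:

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(QKᵀ/√d_k)·V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # relevance of each key to each query
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability shift
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V                               # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))      # 4 tokens, embedding dimension 8
out = self_attention(X, X, X)    # Q = K = V = X: no learned projections in this sketch
print(out.shape)                 # → (4, 8): one new representation per token
```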

Ensembles & Boosting

Why combining many weak models almost always beats one strong model.

📖 Intermediate · ⏱ 16 min read
💡 From Domingos (2012)

In the Netflix Prize, the winning and runner-up submissions were both stacked ensembles of over 100 learners. Combining them improved results even further. The lesson: learn many models, not just one.

Three Main Ensemble Methods

1. Bagging (Bootstrap Aggregating)

Train many copies of the same model on different random samples of the training data (with replacement). Combine by voting (classification) or averaging (regression). Dramatically reduces variance with almost no increase in bias. Random Forest is bagging applied to decision trees.

2. Boosting

Train models sequentially. Each new model focuses on the mistakes of the previous ones. Training examples that were classified incorrectly get higher weights. Final prediction is a weighted vote of all models. Reduces both bias and variance. XGBoost and LightGBM are boosting algorithms.

3. Stacking

Train several different base models (e.g., Random Forest + SVM + Logistic Regression). Then train a "meta-learner" on the predictions of the base models. The meta-learner learns how to best combine their predictions.
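Scikit-learn ships stacking as `StackingClassifier`. A minimal sketch on synthetic data (the base models and hyperparameters are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("svm", SVC(random_state=42)),
    ],
    final_estimator=LogisticRegression(),  # the meta-learner
    cv=5,  # base-model predictions for the meta-learner come from CV folds
)
stack.fit(X_tr, y_tr)
print(f"Stacking test accuracy: {stack.score(X_te, y_te):.3f}")
```

Note the `cv=5`: the meta-learner is trained on out-of-fold predictions, not on predictions the base models made for data they already saw.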

Python — XGBoost (State of the Art for Tabular Data)
import xgboost as xgb
from sklearn.model_selection import cross_val_score

model = xgb.XGBClassifier(
    n_estimators=500,       # number of trees
    max_depth=6,            # tree depth (controls complexity)
    learning_rate=0.05,     # step size (lower = more trees needed)
    subsample=0.8,          # fraction of data per tree (bagging)
    colsample_bytree=0.8,   # fraction of features per tree
    reg_alpha=0.1,          # L1 regularization
    reg_lambda=1.0,         # L2 regularization
    eval_metric='logloss'   # (the old use_label_encoder flag was removed in XGBoost 2.0)
)

# Evaluate with cross-validation
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"Mean CV Accuracy: {scores.mean():.3f}")

Fine-Tuning AI Models

How to take a massive pre-trained model like GPT-2 or LLaMA and adapt it to your specific task — with far less data and compute than training from scratch.

📖 Advanced · ⏱ 40 min read · 🔥 Most Practical

What is Fine-Tuning?

A pre-trained model like GPT-2 was trained on hundreds of billions of tokens of text. It has learned general language understanding — grammar, facts, reasoning patterns. Fine-tuning takes this pre-trained model and continues training it on a much smaller, task-specific dataset to adapt it to your specific use case.

⚡ Why Not Train From Scratch?

Training GPT-3 from scratch cost ~$12 million in compute. Fine-tuning can cost as little as a few dollars on a single GPU. You leverage billions of dollars of existing training and adapt just the last mile for your specific task.

Transfer Learning — The Core Idea

1

Pre-Training (Already Done)

A large model is trained on massive data to learn general representations. For language models, this is next-token prediction on vast internet text. The model learns grammar, facts, common sense, and reasoning. This creates powerful internal representations.

2

Fine-Tuning (You Do This)

You take the pre-trained model and continue training it on your task-specific dataset. The model's weights shift slightly to adapt to your domain. All the general knowledge is preserved — you're just specializing it.

3

Deployment

The fine-tuned model performs much better on your task than the base model — and far better than training from scratch on your small dataset alone.

Full Fine-Tuning vs Parameter-Efficient Methods

| Method | What it updates | VRAM needed | Data needed | Best for |
| --- | --- | --- | --- | --- |
| Full Fine-Tuning | ALL model weights | Very high (40B+ model → 80GB+) | Thousands of examples | When you have a serious compute budget |
| LoRA | Low-rank adapter matrices only | Low (can do 7B on 16GB) | Hundreds of examples | Most practical fine-tuning today |
| QLoRA | LoRA adapters on a quantized model | Very low (7B on 8GB!) | Hundreds of examples | Fine-tuning on consumer hardware |
| Prompt Tuning | Only a small set of "soft prompt" tokens | Minimal | Small | Light task adaptation |
| Adapter Layers | Small inserted adapter modules | Low | Medium | Multi-task models |

LoRA Deep Dive — The Modern Standard

LoRA (Low-Rank Adaptation) is the most important fine-tuning technique to understand. Here's how it works:

In a standard neural network layer, the weight matrix W has dimensions d×k, meaning d×k trainable parameters. During full fine-tuning, ALL of these change. LoRA instead adds two small matrices B (d×r) and A (r×k), where r is small (like 4, 8, or 16), so their product BA has the same d×k shape as W. The modified layer becomes:

W' = W + BA
W is frozen. Only A and B are trained. If r=8, d=1024, k=1024: original = 1M params. LoRA A+B = 8×1024 + 1024×8 = 16,384 params. 98.4% reduction!
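The parameter arithmetic can be checked directly in NumPy. Following the LoRA paper's convention, B (d×r) starts at zero and A (r×k) starts small, so the adapted layer initially equals the pre-trained one:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 1024, 1024, 8

W = rng.normal(size=(d, k))          # pre-trained weights (frozen)
B = np.zeros((d, r))                 # trainable; zero init means BA = 0 at the start
A = rng.normal(size=(r, k)) * 0.01   # trainable

W_prime = W + B @ A                  # the adapted layer: W' = W + BA

full_params = d * k
lora_params = A.size + B.size
print(full_params, lora_params)                           # → 1048576 16384
print(f"reduction: {1 - lora_params / full_params:.1%}")  # → reduction: 98.4%
```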
Python — QLoRA Fine-Tuning with Hugging Face + PEFT
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, TaskType
import torch

# Step 1: Load the base model in 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Step 2: Configure LoRA adapters
lora_config = LoraConfig(
    r=16,                  # rank — higher = more capacity, more params
    lora_alpha=32,         # scaling factor (usually 2x rank)
    target_modules=["q_proj", "v_proj"],  # which layers to apply LoRA to
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Step 3: Wrap model with LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → trainable params: 4,194,304 || all params: 6,742,609,920 || trainable: 0.06%

Preparing Your Fine-Tuning Dataset

Dataset quality matters far more than quantity for fine-tuning. Here's what a good fine-tuning dataset looks like for an instruction-following model:

Fine-Tuning Dataset Format (JSONL)
// Each line is one training example in instruction format
{"instruction": "Explain what gradient descent is in simple terms.",
 "input": "",
 "output": "Gradient descent is like hiking down a mountain in fog..."}

{"instruction": "Convert this SQL query to Python pandas code.",
 "input": "SELECT name, age FROM users WHERE age > 18",
 "output": "df[df['age'] > 18][['name', 'age']]"}

// For chat fine-tuning (preferred for conversational models):
{"messages": [
  {"role": "system", "content": "You are a helpful ML tutor."},
  {"role": "user", "content": "What is overfitting?"},
  {"role": "assistant", "content": "Overfitting is when a model..."}
]}
✦ Dataset Quality Checklist

Diversity — cover all the cases you care about
Quality over quantity — 500 excellent examples beats 5,000 mediocre ones
Format consistency — always use the same template
No duplicates — deduplicate your dataset
Output quality — bad outputs teach bad behavior. Review them manually.
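Two of the checklist items (format consistency and deduplication) can be checked mechanically. A minimal sketch, assuming the instruction-format JSONL shown above; the example path in the final comment is a placeholder:

```python
import json

def validate_jsonl(path, required_keys=("instruction", "input", "output")):
    """Check every line for consistent keys and duplicate (instruction, input) pairs."""
    seen, n = set(), 0
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue                     # skip blank lines
            ex = json.loads(line)            # raises on malformed JSON
            if set(ex) != set(required_keys):
                raise ValueError(f"line {line_no}: unexpected keys {sorted(ex)}")
            key = (ex["instruction"], ex["input"])
            if key in seen:
                raise ValueError(f"line {line_no}: duplicate example")
            seen.add(key)
            n += 1
    return n

# e.g. n = validate_jsonl("train.jsonl")    # path is a placeholder
```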

RLHF — How ChatGPT Was Trained

RLHF (Reinforcement Learning from Human Feedback) is how models like ChatGPT, Claude, and Gemini are aligned to be helpful, harmless, and honest. It happens in 3 phases:

  1. Supervised Fine-Tuning (SFT)

    Fine-tune the base model on high-quality demonstrations of the desired behavior. Human contractors write ideal responses to thousands of prompts.

  2. Reward Model Training

    Collect human preference data: show the same prompt to the model, get two different responses, have humans rank which is better. Train a "reward model" to predict human preference scores.

  3. RL Fine-Tuning with PPO

    Use the reward model as a scoring signal and apply reinforcement learning (PPO algorithm) to train the language model to generate responses with higher reward scores. Add KL-divergence penalty to prevent the model from drifting too far from the original SFT model.
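The reward-model phase hinges on a pairwise preference loss (the Bradley-Terry form). A sketch with stand-in scalar scores: the real reward model is a full neural network, but the shape of the loss is exactly this:

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log σ(r_chosen - r_rejected)."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# When the model scores the human-preferred response higher, loss is small;
# when the preference is violated, loss is large.
print(f"{preference_loss(2.0, -1.0):.3f}")  # → 0.049  (chosen scored above rejected)
print(f"{preference_loss(-1.0, 2.0):.3f}")  # → 3.049  (preference violated)
```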

Practical Hyperparameters for Fine-Tuning

| Parameter | Typical Range | Notes |
| --- | --- | --- |
| Learning Rate | 1e-5 to 5e-4 | Much lower than pre-training. Start with 2e-4 for LoRA. |
| Batch Size | 4–32 (with gradient accumulation) | Effective batch = batch_size × gradient_accum_steps |
| Epochs | 1–5 | Very few epochs are needed; more often causes overfitting. |
| LoRA Rank (r) | 4, 8, 16, 64 | Higher = more capacity. 8–16 is the sweet spot for most tasks. |
| LoRA Alpha | = 2 × r | Controls scaling. Keep at 2× rank as a default. |
| Max Sequence Length | 512–4096 | Longer = more memory. Match to your use case. |

Evaluation & Metrics

How to actually measure whether your model is good — and which metric to use when.

📖 Intermediate · ⏱ 18 min read
⚠ Common Mistake

Accuracy is often the worst metric to use. If 99% of emails are not spam, a model that predicts "not spam" for everything gets 99% accuracy — but catches zero spam. Always choose metrics that match your actual problem.

Classification Metrics

All classification metrics derive from the confusion matrix — a table of True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN).

| Metric | Formula | Use When |
| --- | --- | --- |
| Accuracy | (TP+TN) / All | Balanced classes, equal error costs |
| Precision | TP / (TP+FP) | False positives are costly (e.g., spam filter — don't delete real emails) |
| Recall | TP / (TP+FN) | False negatives are costly (e.g., cancer screening — don't miss cancer) |
| F1 Score | 2 × (P×R)/(P+R) | Need a balance of precision and recall; imbalanced classes |
| ROC-AUC | Area under the ROC curve | Ranking quality, comparing models across thresholds |
| PR-AUC | Area under the Precision-Recall curve | Very imbalanced classes (better than ROC-AUC in this case) |
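The spam warning above can be reproduced in a few lines with scikit-learn's metric functions, which take the true and predicted labels directly:

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# The spam scenario: 1 spam email in 100, and a model
# that predicts "not spam" for everything.
y_true = [0] * 99 + [1]
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))                 # → 0.99  (looks great)
print(recall_score(y_true, y_pred, zero_division=0))  # → 0.0   (catches zero spam)
print(f1_score(y_true, y_pred, zero_division=0))      # → 0.0
```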

Regression Metrics

| Metric | Formula | Notes |
| --- | --- | --- |
| MSE | mean((y - ŷ)²) | Penalizes large errors heavily. In output units². |
| RMSE | √MSE | Same units as the output. Most commonly reported. |
| MAE | mean(\|y - ŷ\|) | More robust to outliers than MSE. Easy to interpret. |
| R² | 1 - SS_res/SS_tot | % of variance explained. 1.0 = perfect. Can be negative (worse than baseline). |
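All four metrics can be computed by hand in a few lines of NumPy (the values are made-up):

```python
import numpy as np

y     = np.array([3.0, 5.0, 7.0])   # true values
y_hat = np.array([2.5, 5.0, 8.0])   # predictions

mse  = np.mean((y - y_hat) ** 2)
rmse = np.sqrt(mse)
mae  = np.mean(np.abs(y - y_hat))
r2   = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R²={r2:.3f}")
# → MSE=0.417  RMSE=0.645  MAE=0.500  R²=0.844
```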

Data Preparation

Real ML is 80% data work. Here's how to do it right.

📖 Practical · ⏱ 20 min read

Handling Missing Values

Python — Missing Value Strategies
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# 1. DROP columns with too many missing values (thresh = min non-missing count, must be int)
df = df.dropna(thresh=int(0.8 * len(df)), axis=1)  # drop cols with >20% missing

# 2. SIMPLE IMPUTATION
imputer = SimpleImputer(strategy='median')   # or 'mean', 'most_frequent'
X_imputed = imputer.fit_transform(X)

# 3. KNN IMPUTATION — fills missing with average of K nearest neighbors
# More accurate but slower
knn_imputer = KNNImputer(n_neighbors=5)
X_knn = knn_imputer.fit_transform(X)

# 4. ADD a "was_missing" indicator column — let the model learn the pattern
df['age_was_missing'] = df['age'].isna().astype(int)
df['age'] = df['age'].fillna(df['age'].median())  # avoids deprecated chained inplace fillna

Handling Class Imbalance

Python — SMOTE & Class Weights
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# OPTION 1: SMOTE — synthesize new minority class examples
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

# OPTION 2: class_weight='balanced' — let the model handle it
model = RandomForestClassifier(class_weight='balanced')

# OPTION 3: Custom weights
weights = {0: 1, 1: 10}  # 10x penalty for misclassifying minority class
model = RandomForestClassifier(class_weight=weights)

ML Master Cheat Sheet

Everything condensed. Keep this open while studying or building.

Key Formulas

MSE: mean((y - ŷ)²)
RMSE: √MSE
Accuracy: (TP+TN)/All
Precision: TP/(TP+FP)
Recall: TP/(TP+FN)
F1: 2×(P×R)/(P+R)
Attention: softmax(QKᵀ/√d)·V
LoRA: W' = W + BA

Overfitting vs Underfitting

Overfitting: high train acc, low test acc
Underfitting: low train acc, low test acc
Fix overfit: regularize, more data, simpler model
Fix underfit: more complex model, more features
Ridge (L2): Loss + λΣw²
Lasso (L1): Loss + λΣ|w|
Dropout: randomly disable neurons during training

Algorithm Quick Picks

Tabular (best): XGBoost / LightGBM
Tabular (fast): Random Forest
Image: CNN / Vision Transformer
Text: Transformer (BERT/GPT)
Baseline first: Logistic Regression / Naive Bayes
Clustering: K-Means / DBSCAN
Dimensionality: PCA / UMAP

Fine-Tuning LLMs

Library: Hugging Face + PEFT
Method: QLoRA (best for most)
LR range: 1e-5 to 5e-4
LoRA rank: 8–16 (sweet spot)
Epochs: 1–3 (more = overfitting)
Data format: instruction JSONL
Min examples: ~200–500 high quality

Bias-Variance

Total error: Bias² + Variance + Noise
High bias: underfitting, too simple
High variance: overfitting, too complex
More data → reduces variance
Regularize → reduces variance
Ensemble → reduces both
Simpler model → reduces variance, raises bias

Preprocessing Checklist

Scale features: StandardScaler for most algorithms
Encode categoricals: OHE nominal, Label ordinal
Missing values: impute median / add flag
Outliers: IQR clip or log-transform
Imbalance: SMOTE or class_weight
Leakage: fit transforms on train ONLY
Validate: 5-fold CV before the test set
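The leakage rule is the one most often violated. A scikit-learn Pipeline enforces "fit transforms on train ONLY" automatically: inside each CV fold the scaler is fit only on that fold's training split. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)

# The scaler is refit inside every CV fold on training data only,
# then applied to that fold's validation split: no leakage.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"5-fold CV accuracy: {scores.mean():.3f}")
```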

Glossary

Every key term in ML defined clearly and concisely.

Backpropagation
The algorithm for computing gradients in neural networks using the chain rule of calculus. For each weight, it computes how much changing that weight would increase or decrease the loss, then adjusts all weights in the direction of lower loss.
Batch Size
The number of training examples processed before updating model weights. Small batches (stochastic) are noisier but escape local minima better. Large batches are more accurate but may converge to sharp minima.
Bias (in ML models)
The error from incorrect assumptions in the learning algorithm. High bias = model is too simple and consistently wrong in the same direction. Leads to underfitting.
Cross-Entropy Loss
The standard loss function for classification. Measures how well the predicted probability distribution matches the true distribution. Lower is better. For binary classification: -[y·log(p) + (1-y)·log(1-p)].
Cross-Validation
A technique to estimate generalization performance by rotating which portion of data is used for validation. K-fold CV splits data into K parts, trains K times, each time holding out a different part for evaluation.
Dropout
A regularization technique for neural networks. During training, randomly sets a fraction of neurons to zero at each forward pass. Prevents co-adaptation and acts like training an ensemble of many smaller networks.
Embedding
A learned dense vector representation of a discrete item (word, category, user, item). Embeddings capture semantic relationships — similar items have similar vectors. The foundation of modern NLP and recommendation systems.
Epoch
One complete pass through the entire training dataset. Training typically requires multiple epochs. Too few = underfitting. Too many = overfitting.
Feature
An individual measurable property used as input to a model. Also called a predictor, attribute, or input variable. Feature quality is the most important factor in ML project success.
Gradient Descent
The optimization algorithm used to train most ML models. Computes the gradient of the loss with respect to all weights, then takes a small step in the opposite direction (downhill). Variants: SGD, Adam, RMSProp.
Hyperparameter
A configuration value set before training that controls the learning process itself (not learned from data). Examples: learning rate, number of layers, max_depth, regularization strength. Must be tuned via CV or grid search.
Inductive Bias
The set of assumptions a learning algorithm makes to generalize beyond training data. Every algorithm has one — without it, no generalization is possible (No Free Lunch Theorem). Choice of algorithm encodes a specific set of assumptions.
Learning Rate
The step size in gradient descent. Too large: overshoots minima, diverges. Too small: training is slow and may get stuck. One of the most important hyperparameters. Common default: 0.001 for Adam, 0.01 for SGD.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method. Instead of updating all the weights in a layer (d×k parameters), it adds two small matrices B (d×r) and A (r×k), where r ≪ d, k, so that W' = W + BA. Only A and B are trained. Reduces trainable parameters by 99%+.
Loss Function
A function measuring how far the model's predictions are from the true values. The algorithm minimizes this during training. Classification: cross-entropy. Regression: MSE. The choice of loss determines what the model optimizes.
No Free Lunch Theorem
A theorem (Wolpert, 1996) proving that no algorithm can outperform random guessing over all possible problems. Every algorithm's advantages on some problems come at the cost of disadvantages on others. Algorithm choice must be matched to problem structure.
Overfitting
When a model learns the training data too well — including its noise — and fails to generalize. Symptom: very high training accuracy, much lower test accuracy. Root cause: model too complex relative to data size.
Regularization
Techniques that constrain model complexity to reduce overfitting. Common forms: L1 (Lasso), L2 (Ridge), dropout, early stopping. Typically implemented by adding a penalty term to the loss function.
RLHF
Reinforcement Learning from Human Feedback. The training paradigm used to align LLMs (ChatGPT, Claude). Involves SFT on demonstrations, training a reward model on human preferences, then optimizing the language model against the reward model using PPO.
Transformer
The neural network architecture introduced in "Attention Is All You Need" (2017). Uses self-attention to model relationships between all positions in a sequence simultaneously. The foundation of GPT, BERT, Claude, Gemini, and virtually all modern AI.
Variance (in ML models)
The sensitivity of a model to fluctuations in training data. High variance = model changes drastically with different training sets. Leads to overfitting. Complex models (deep trees, large neural nets) tend to have high variance.