Learn ML from first principles to fine-tuning AI
A structured, detailed, beginner-to-advanced guide covering every core concept — written simply so anyone can understand it deeply.
What is Machine Learning?
Start here. Understand what ML is, why it matters, and how it works at a high level.
How ML Actually Works
Representation, Evaluation, Optimization — the three building blocks of every algorithm.
Overfitting & Underfitting
The most critical problem in ML. Learn why your model fails and how to fix it.
Bias-Variance Tradeoff
The fundamental tension at the heart of every ML system, explained visually.
Feature Engineering
The most important skill in ML that no textbook teaches properly.
Fine-Tuning AI Models
How to take a pre-trained model like GPT and adapt it to your specific task.
Algorithm Guide
Every major ML algorithm explained — when to use it, how it works, trade-offs.
Neural Networks & Deep Learning
From perceptrons to transformers — the architecture behind modern AI.
Master Cheat Sheet
Every formula, rule, and key fact in one condensed reference sheet.
Learning Roadmap
Follow this path from zero to fine-tuning real AI models. Each phase builds on the last.
What is Machine Learning?
Understanding what ML is, where it came from, and why it's revolutionizing every industry.
The Simple Definition
Machine learning is a way of programming computers using data instead of explicit instructions. Instead of writing a program that says "if the email contains the phrase 'win money' then mark it as spam," you show the computer thousands of examples of spam and non-spam emails and let it figure out the rules itself.
Machine Learning is a field of computer science that gives computers the ability to learn from data without being explicitly programmed for every situation. The system improves its performance as it is exposed to more data over time.
The classic alternative — hand-coded rules — breaks down fast. Think about spam filters: spammers constantly change their tricks. A rules-based system needs constant manual updates. An ML system can re-train itself and adapt automatically.
A Real Example: Spam Filter
Let's make this concrete. A spam filter built with ML works like this:
- Collect training data
Gather thousands of emails labeled "spam" or "not spam" by humans.
- Extract features
Turn each email into numbers the computer can understand — e.g., does it contain "free money"? How many exclamation marks? Who is the sender?
- Train the model
Run a learning algorithm that finds patterns — combinations of features that predict spam vs not-spam.
- Evaluate and test
Check how well the model works on emails it has never seen before.
- Deploy and update
Release the model. As new spam patterns emerge, re-train on fresh data.
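As a sketch of how those five steps fit together, here is a minimal spam filter using scikit-learn's CountVectorizer and MultinomialNB. The six emails and their labels are made up for illustration, not real data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 1: labeled training data (tiny, made-up examples)
emails = [
    "win free money now", "claim your free prize", "free money waiting",
    "meeting at 3pm tomorrow", "project update attached", "lunch on friday",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam

# Step 2: extract features (word counts per email)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Step 3: train the model
model = MultinomialNB().fit(X, labels)

# Step 4: evaluate on an email the model has never seen
new = vectorizer.transform(["free prize money"])
print(model.predict(new))  # → [1] (spam)
```

A real filter would add steps 4 and 5 properly: a held-out test set, and periodic re-training on fresh labeled emails.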
Where ML Is Used Today
ML is already embedded in almost every digital product you use:
| Domain | ML Application | What it learns |
|---|---|---|
| Search Engines | Google Search ranking | Which pages are most relevant to your query |
| Social Media | TikTok/YouTube recommendations | What content you'll keep watching |
| Finance | Fraud detection, credit scoring | Patterns that indicate fraudulent transactions |
| Healthcare | Medical image diagnosis | What tumors and diseases look like in scans |
| Language | ChatGPT, Claude, Gemini | How language works and how to respond helpfully |
| Self-driving | Tesla Autopilot | How to navigate roads, detect objects |
ML vs Traditional Programming
Traditional Programming
- You write explicit rules
- Input + Rules → Output
- You must anticipate every case
- Brittle — breaks with new patterns
- Good for well-defined problems
Machine Learning
- You provide data + expected answers
- Input + Output → Rules (learned)
- Generalizes to unseen cases
- Adapts as new data arrives
- Good for pattern-heavy problems
Machine learning is not magic — it cannot get something from nothing. What it does is get more from less. Programming is like building from scratch. Learning is more like farming: you combine seeds (knowledge) with nutrients (data) to grow programs.
How ML Actually Works
Every ML algorithm is a combination of three core components: Representation, Evaluation, and Optimization.
There are thousands of ML algorithms. Choosing one seems overwhelming. But here's the secret: every single learning algorithm is just a combination of three components. Once you understand these, the whole landscape of ML makes sense.
The Three Components
Representation — "What form can the answer take?"
A classifier (or model) must be expressed in some formal language the computer can handle. Your choice of representation defines the hypothesis space — the set of all answers the model could possibly learn. If the answer isn't in this space, the model literally cannot learn it, no matter how much data you give it.
Evaluation — "How do we score a candidate answer?"
An evaluation function (also called an objective function, loss function, or scoring function) measures how good a particular model is. For example: accuracy, error rate, log-likelihood, or F1 score. The algorithm uses this to distinguish better models from worse ones.
Optimization — "How do we search for the best answer?"
Given the evaluation function, we need a method to search through all possible models and find the highest-scoring one. Common choices: gradient descent, greedy search, genetic algorithms, branch-and-bound. This determines both the quality and speed of learning.
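To see all three components in miniature, here is a toy sketch: the representation is a line y = w·x, the evaluation function is mean squared error, and the optimizer is plain gradient descent. The data and learning rate are made up:

```python
import numpy as np

# Toy data generated from the "true" rule y = 3x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

# REPRESENTATION: a line y = w*x (the hypothesis space is all slopes w)
w = 0.0

# OPTIMIZATION: gradient descent on the EVALUATION function (mean squared error)
lr = 0.05
for _ in range(200):
    grad = 2 * np.mean((w * x - y) * x)  # d(MSE)/dw
    w -= lr * grad

print(round(w, 3))  # → 3.0, the slope that minimizes the evaluation function
```

Swap any one component and you get a different algorithm: change the representation to a tree and the optimizer to greedy splitting, and you have a decision tree learner.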
The Full Landscape
| Representation | Evaluation | Optimization |
|---|---|---|
| K-Nearest Neighbor | Accuracy / Error rate | Greedy search |
| Linear / Logistic Regression | Squared error / Likelihood | Gradient descent |
| Decision Trees | Information gain, Gini | Greedy recursive split |
| Support Vector Machines | Margin | Quadratic programming |
| Neural Networks | Cross-entropy loss | Backprop + Adam/SGD |
| Naive Bayes | Posterior probability | Closed-form calculation |
| Random Forest | Gini / Entropy | Bagging + greedy splits |
The ML Pipeline — End to End
Decision Tree — A Concrete Example
Let's see all three components in action with a decision tree for spam detection:
```python
# REPRESENTATION: A tree of if/else questions
# EVALUATION: Information Gain (how much does a split reduce uncertainty?)
# OPTIMIZATION: Greedy search — pick the best split at each step
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Features: [has "free", has "win", exclamation_count, link_count]
X_train = np.array([
    [1, 1, 5, 3],  # spam
    [0, 0, 1, 0],  # not spam
    [1, 0, 3, 2],  # spam
    [0, 1, 0, 1],  # not spam
])
y_train = [1, 0, 1, 0]  # 1=spam, 0=not spam

# Train — the algorithm finds the best splits automatically
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)

# Predict new emails
new_email = [[1, 1, 4, 2]]  # suspicious features
print(model.predict(new_email))  # → [1] (spam)
```
Most ML textbooks are organized around the Representation column only. This makes it easy to miss that Evaluation and Optimization are equally important. Two models with the same representation but different optimization strategies can produce very different results.
Types of Machine Learning
The three main paradigms — and when to use each one.
1. Supervised Learning
You provide the model with labeled examples — both the inputs (features) and the correct outputs (labels). The model learns to map inputs to outputs. This is by far the most common type of ML.
Like a student learning with an answer key — they see the question AND the correct answer, and learn the pattern that connects them.
Two main subtypes:
| Task | Output Type | Example | Algorithms |
|---|---|---|---|
| Classification | Discrete class label | Spam vs Not Spam, Dog vs Cat | Logistic Regression, SVM, Decision Trees, Neural Nets |
| Regression | Continuous number | Predict house price, stock value | Linear Regression, Random Forest, Neural Nets |
2. Unsupervised Learning
You give the model unlabeled data — inputs only, no correct answers. The model must find structure, patterns, or groupings by itself.
Like organizing a pile of random photos with no instructions — you'd naturally group them by scene (beaches, cities, people). The model does the same with data.
| Task | What it does | Example | Algorithms |
|---|---|---|---|
| Clustering | Groups similar data points | Customer segmentation | K-Means, DBSCAN, Hierarchical |
| Dimensionality Reduction | Compress data while keeping structure | Visualizing high-dim data | PCA, t-SNE, UMAP |
| Anomaly Detection | Find unusual data points | Fraud detection | Isolation Forest, Autoencoders |
| Generative Models | Learn to generate new data | Image generation (GANs) | VAE, GAN, Diffusion Models |
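As a minimal illustration of clustering, here is K-Means finding two groups in a made-up 2D dataset. Note that no labels are provided; the algorithm discovers the structure on its own:

```python
import numpy as np
from sklearn.cluster import KMeans

# Six 2D points forming two obvious groups — but NO labels are given
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # the two groups receive two different cluster ids
```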
3. Reinforcement Learning
An agent learns by taking actions in an environment and receiving rewards or penalties. There's no labeled dataset — the model learns through trial and error, trying to maximize cumulative reward over time.
Like training a dog with treats. You don't show it every possible situation — you let it explore, reward good behavior, and punish bad. Over millions of tries, it figures out the optimal strategy.
Used in: Game playing (AlphaGo, chess engines), robotics, ad bidding systems, and crucially — training LLMs like ChatGPT (RLHF: Reinforcement Learning from Human Feedback).
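A toy sketch of the trial-and-error loop, using a two-armed bandit (a heavily simplified stand-in for full RL) with made-up payout probabilities and an epsilon-greedy strategy:

```python
import random

random.seed(0)

# Two slot machines ("actions") with hidden payout probabilities
true_p = [0.3, 0.8]
q = [0.0, 0.0]      # the agent's estimated value of each action
counts = [0, 0]

for step in range(2000):
    # Epsilon-greedy: explore 10% of the time, otherwise exploit
    if random.random() < 0.1:
        action = random.randrange(2)
    else:
        action = 0 if q[0] >= q[1] else 1
    # Environment returns a reward (1 = win, 0 = loss)
    reward = 1.0 if random.random() < true_p[action] else 0.0
    # Update the running-average estimate for that action
    counts[action] += 1
    q[action] += (reward - q[action]) / counts[action]

print(f"estimated values: {q[0]:.2f}, {q[1]:.2f}")
# The agent discovers that action 1 pays out more — no labels were ever given
```

Real RL adds states, sequences of actions, and delayed rewards, but the core loop of act, observe reward, update estimates is the same.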
Generalization — The True Goal
Why performing well on training data means nothing, and what actually matters in ML.
The only thing that matters is how well the model performs on data it has never seen before. This is called generalization. Getting 100% accuracy on your training data is easy — and meaningless.
Why Training Accuracy Is a Trap
Imagine you are studying for an exam. If your teacher gives you the exact exam questions in advance and you memorize all the answers, you'll get 100%. But if the real exam has slightly different questions, you'll fail — because you memorized, not understood.
This is exactly what happens when an ML model memorizes its training data. It's called overfitting — and it's the #1 problem in ML.
```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split data: 80% training, 20% testing
# CRITICAL: Never touch test data until final evaluation!
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)  # Train on training data ONLY

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.2%}")  # Might be 99% — don't celebrate
print(f"Test accuracy: {test_acc:.2%}")    # THIS is what matters
```
Cross-Validation — The Gold Standard
Holding out 20% for testing means that 20% of your data never contributes to training. Cross-validation solves this by rotating which portion is held out.
- Split data into K equal parts (folds)
Common choice: K=5 or K=10 folds.
- Train K times
Each time, use K-1 folds for training and 1 fold for validation.
- Average the results
Take the average performance across all K runs. This is your estimate of generalization.
```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=5)

# 5-fold CV: trains 5 times, tests on each held-out fold
scores = cross_val_score(model, X, y, cv=5)
print(f"CV scores: {scores}")        # e.g. [0.92, 0.88, 0.90, 0.91, 0.89]
print(f"Mean: {scores.mean():.3f}")  # e.g. 0.900
print(f"Std: {scores.std():.3f}")    # e.g. 0.013 (low = stable model)
```
Data Alone Is Never Enough
This is a deep and surprising result called the No Free Lunch Theorem (Wolpert, 1996): no algorithm can beat random guessing over all possible problems. Every learner must make assumptions — called inductive biases — about the world to generalize.
There is no universally best machine learning algorithm. An algorithm that works well on one problem must be making assumptions that fail on another. The best algorithm always depends on the problem — which is why choosing your model based on domain knowledge matters.
Overfitting & Underfitting
The two ways your model can fail — and a toolkit for fixing both.
What is Overfitting?
Overfitting happens when a model learns the training data too well — including its noise and random quirks — and fails to generalize to new data.
Training accuracy = 99%. Test accuracy = 62%. Your model has memorized the training set, not learned the underlying pattern. It's useless in the real world.
Think of a student who memorizes every exam from the past 10 years word-for-word, but has no real understanding. They'll ace those exact exams but fail any new questions.
What is Underfitting?
Underfitting is the opposite: the model is too simple to capture the real pattern in the data. Both training and test accuracy are poor.
Example: trying to fit a curved relationship with a straight line. No matter how much data you have, a line can't capture the curve.
The Visual Intuition
| Problem | Train Error | Test Error | Cause | Fix |
|---|---|---|---|---|
| Just Right ✓ | Low | Low | Balanced model complexity | — |
| Overfitting | Very Low | High | Model too complex / too little data | Regularize, get more data, simplify model |
| Underfitting | High | High | Model too simple | More complex model, more features |
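The table above can be reproduced numerically. This sketch fits polynomials of degree 1, 2, and 7 to eight noisy points drawn from a quadratic; all the numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-3, 3, 8)
y_train = x_train**2 + rng.normal(0, 1.0, size=8)  # curved pattern + noise
x_test = np.linspace(-2.9, 2.9, 50)
y_test = x_test**2                                  # noise-free ground truth

def errors(degree):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 2, 7):
    train_mse, test_mse = errors(degree)
    print(f"degree {degree}: train={train_mse:.3f}  test={test_mse:.3f}")

# degree 1 underfits: high error on BOTH sets
# degree 2 is about right: both errors stay low
# degree 7 interpolates the noise: train error ~0, test error typically worse
```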
Techniques to Fight Overfitting
1. Regularization
Add a penalty to the loss function for complexity. Forces the model to stay simple.
```python
from sklearn.linear_model import Ridge, Lasso

# Ridge (L2): penalizes large weights
# Loss = MSE + λ * Σ(weights²)
ridge = Ridge(alpha=1.0)  # alpha = λ, controls regularization strength

# Lasso (L1): pushes some weights to exactly zero (feature selection)
# Loss = MSE + λ * Σ|weights|
lasso = Lasso(alpha=0.1)

# Rule of thumb:
# - Ridge: when you think all features matter but need smaller weights
# - Lasso: when you think many features are irrelevant (automatic selection)
```
2. Dropout (for Neural Networks)
During training, randomly "switch off" a fraction of neurons. This prevents any neuron from becoming too dominant and forces the network to learn redundant representations.
```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # 50% of neurons randomly off during training
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.3),   # 30% dropout on second layer
    nn.Linear(128, 10)
)
```
3. Early Stopping
Monitor validation loss during training. Stop training when validation loss starts increasing — even if training loss keeps dropping.
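A minimal sketch of the early-stopping logic, using a made-up sequence of validation losses and a "patience" counter:

```python
best_val, patience, bad_epochs = float("inf"), 3, 0

# Hypothetical validation losses per epoch: improves, then degrades
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        # in real training you would checkpoint the model weights here
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"stopping at epoch {epoch}, best val loss {best_val}")
            break
# → stopping at epoch 6, best val loss 0.5
```

The patience parameter avoids stopping on a single noisy epoch; you restore the checkpointed weights from the best epoch, not the last one.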
4. Get More Data
Overfitting is fundamentally a problem of having too little data relative to model complexity. More data is almost always the best fix, when available.
5. Data Augmentation
Artificially expand your training set by creating variations of existing examples. For images: rotate, flip, crop, adjust brightness. For text: paraphrase, back-translate.
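A toy sketch with NumPy, using a hypothetical 3×3 "image" to stand in for a real photo:

```python
import numpy as np

# A hypothetical 3x3 "image" standing in for a real photo
img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

augmented = [
    img,                      # original
    np.fliplr(img),           # horizontal flip
    np.flipud(img),           # vertical flip
    np.rot90(img),            # 90-degree rotation
    np.clip(img + 1, 0, 9),   # brightness shift
]
print(len(augmented))  # → 5 training examples from 1 original
```

Each variant keeps the same label, so the model learns that the class is invariant to these transformations.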
The Bias-Variance Tradeoff
The deep mathematical reason behind overfitting and underfitting — and why you can never eliminate both simultaneously.
Decomposing Error
Any ML model's total error on new data can be split into three parts:
Understanding each term is key to diagnosing and fixing your model's problems.
High Bias
Model is too simple. It consistently makes the same types of mistakes, no matter what data you give it. This is underfitting.
Example: fitting a straight line to data that follows a curve
High Variance
Model is too complex. It changes drastically with small changes in training data. This is overfitting.
Example: a deep decision tree that learns every noise point
High Both
Very complex model that still misses the real pattern. Common with wrong architecture choices.
Example: polynomial regression with wrong degree on messy data
Low Both ✓
The ideal model. Complex enough to capture the real pattern, but not so complex it memorizes noise.
This is what you're always aiming for
The Dart Board Analogy (from Domingos)
Imagine your model is a dart thrower aiming at a target (the true answer). You run many trials with different training sets and observe where the darts land:
| Scenario | Darts Clustered? | Darts on Target? | Interpretation |
|---|---|---|---|
| Low Bias, Low Variance | ✓ Yes (tight cluster) | ✓ Yes (near center) | Perfect — this is the goal |
| High Bias, Low Variance | ✓ Yes (tight cluster) | ✗ No (off-center) | Consistently wrong — model too simple |
| Low Bias, High Variance | ✗ No (spread out) | ∼ On target only on average | Inconsistent — model too complex |
| High Bias, High Variance | ✗ No (spread out) | ✗ No (off-center) | Worst case |
Surprising Result: Strong Wrong Assumptions Can Beat Weak True Ones
Here is a counterintuitive but crucial insight from the paper by Domingos (2012): naive Bayes, which assumes all features are completely independent (which is almost never true), can outperform a rule learner on problems where the truth is a set of rules — because naive Bayes doesn't overfit as badly.
With limited data, a learner with strong (even wrong) assumptions can outperform one with correct but weak assumptions — because strong assumptions reduce variance at the cost of introducing some bias, and with small datasets that tradeoff is often worth it. This is why naive Bayes works surprisingly well in practice.
How to Control the Tradeoff
| Action | Effect on Bias | Effect on Variance |
|---|---|---|
| Increase model complexity | ↓ Decreases | ↑ Increases |
| Add more training data | ≈ Same | ↓ Decreases |
| Add regularization (L1/L2) | ↑ May increase | ↓ Decreases |
| Feature selection | ↑ May increase | ↓ Decreases |
| Use ensemble methods | ↓ Decreases | ↓ Decreases (bagging) |
| Reduce number of features | ↑ May increase | ↓ Decreases |
The Curse of Dimensionality
Why more features can hurt your model — and how to fight back.
The Curse of Dimensionality (coined by Bellman, 1961) refers to the exponential growth in problems that occur when working with high-dimensional data. As features (dimensions) increase, the data becomes increasingly sparse, distances lose meaning, and models become harder to train.
The Exponential Sparsity Problem
Imagine you have 1,000 training examples in a 1D space (one feature). They fill the space reasonably well. Now add a second dimension: you need roughly 1,000² = 1,000,000 examples to fill the 2D space equally well. By 10 dimensions: you'd need 10^30 examples. For 100 dimensions: you'd need more examples than atoms in the universe.
Distances Stop Working
Most ML algorithms (K-NN, SVM, clustering) rely on the idea that similar examples are nearby in feature space. In high dimensions, this completely breaks down:
- All points become approximately equidistant from each other
- The "nearest neighbor" is no more similar than the "farthest" point
- K-NN becomes effectively random in very high dimensions
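This distance concentration is easy to demonstrate with random points; the dimensions and point counts below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = {}

for dim in (2, 10_000):
    points = rng.random((500, dim))   # 500 random points in the unit cube
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    ratios[dim] = dists.min() / dists.max()
    print(f"dim={dim}: nearest/farthest distance ratio = {ratios[dim]:.2f}")

# In 2D the nearest point is far closer than the farthest (small ratio).
# In 10,000D the ratio approaches 1: "nearest" barely means anything.
```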
The "Blessing of Non-Uniformity"
Fortunately, real-world data has a saving grace: it doesn't actually fill all of high-dimensional space. Real data tends to lie on a lower-dimensional manifold — a structure with far fewer effective dimensions than the raw feature count.
An image of a handwritten digit has 784 pixels (784 dimensions). But the "space of all valid digit images" is much smaller — a small manifold within that 784D space. This is why K-NN works well on MNIST despite its theoretical problems with high dimensions.
Practical Solutions
| Technique | How it helps | When to use |
|---|---|---|
| PCA Principal Component Analysis | Projects data onto fewer dimensions that explain most variance | Preprocessing step for many algorithms |
| t-SNE / UMAP | Non-linear reduction, great for visualization | Visualizing high-dim data in 2D/3D |
| Feature Selection | Remove irrelevant features using statistical tests or model importance | When you have domain knowledge or too many features |
| Regularization | Penalizes models that use many features (Lasso forces weights to zero) | Built into your model training process |
| Deep Learning | Automatically learns compact low-dimensional representations (embeddings) | Large datasets, complex patterns |
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Always scale first — PCA is sensitive to feature scales
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce dimensionality while keeping 95% of variance
pca = PCA(n_components=0.95)  # Keep 95% explained variance
X_reduced = pca.fit_transform(X_scaled)

print(f"Original: {X.shape[1]} features")
print(f"Reduced: {X_reduced.shape[1]} features")  # Much smaller!
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.1%}")
```
Feature Engineering — The Art of ML
The single most impactful thing you can do to improve a model. More important than choosing the right algorithm.
"The most important factor [in ML project success] is the features used. Learning is easy if you have many independent features that each correlate well with the class." — Pedro Domingos. Feature engineering is where intuition, creativity, and domain expertise are as important as technical skill.
What is Feature Engineering?
Raw data is rarely in a form that algorithms can directly use. Feature engineering is the process of transforming raw data into informative, discriminative inputs for your model. It includes:
- Feature creation — constructing new features from raw data
- Feature transformation — scaling, encoding, normalizing
- Feature selection — removing irrelevant or redundant features
- Feature extraction — learning compact representations (e.g., PCA, embeddings)
Common Transformations with Code
Numerical Features
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df = pd.read_csv('houses.csv')

# 1. SCALING — essential for distance-based and gradient methods
# StandardScaler: mean=0, std=1 — use for most algorithms
scaler = StandardScaler()
df['price_scaled'] = scaler.fit_transform(df[['price']])

# MinMaxScaler: range [0,1] — use for neural networks, image data
mm = MinMaxScaler()
df['area_norm'] = mm.fit_transform(df[['area']])

# 2. LOG TRANSFORM — fix skewed distributions
df['log_income'] = np.log1p(df['income'])  # log1p handles zeros

# 3. BINNING — turn continuous into categorical
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 18, 35, 65, 100],
    labels=['minor', 'young', 'middle', 'senior']
)

# 4. INTERACTION FEATURES — combine existing features
df['price_per_sqft'] = df['price'] / df['sqft']
df['rooms_per_floor'] = df['rooms'] / df['floors']
```
Categorical Features
```python
from sklearn.preprocessing import LabelEncoder

# 1. ORDINAL ENCODING — for ordinal data (order matters)
# e.g., "low" < "medium" < "high"
le = LabelEncoder()
df['quality_enc'] = le.fit_transform(df['quality'])
# NOTE: LabelEncoder assigns codes alphabetically; when the true order
# matters, map explicitly instead:
# df['quality_enc'] = df['quality'].map({'low': 0, 'medium': 1, 'high': 2})

# 2. ONE-HOT ENCODING — for nominal data (order doesn't matter)
# e.g., "red", "blue", "green" — turns each into its own binary column
df_encoded = pd.get_dummies(df, columns=['color', 'city'])

# 3. TARGET ENCODING — for high-cardinality categories
# Replace each category with the mean of the target for that category
target_means = df.groupby('city')['price'].mean()
df['city_price_mean'] = df['city'].map(target_means)
# WARNING: Target encoding can cause leakage on small datasets
# Use cross-validation target encoding in production
```
Feature Selection Methods
| Method | How it works | Best for |
|---|---|---|
| Correlation Filter | Remove features with correlation > 0.95 to another feature | Quick preprocessing |
| Chi-Square Test | Statistical test: does this feature have a relationship with the target? | Classification with categorical features |
| Feature Importance (Trees) | Random Forest / XGBoost tell you how much each feature reduces impurity | Any prediction problem |
| Lasso Regression | L1 penalty drives unimportant weights to exactly 0 | Linear models |
| Recursive Feature Elimination | Iteratively remove least important features and re-train | When you need a specific number of features |
The Algorithm Guide
Every major ML algorithm — explained clearly, with when to use each one.
Always try the simplest algorithm first. Naive Bayes before logistic regression. K-NN before SVMs. Simpler models are faster, more interpretable, and often surprisingly competitive — especially with limited data.
Neural Networks & Deep Learning
From the single neuron to the Transformer architecture powering ChatGPT.
The Neuron — Building Block
A single artificial neuron does three things: takes weighted inputs, sums them up, and applies an activation function to produce an output.
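Those three steps in code, with made-up weights and a ReLU activation:

```python
import numpy as np

def relu(z):
    return max(0.0, z)

# A single neuron: weighted sum of inputs + bias, then activation
inputs  = np.array([0.5, -1.0, 2.0])
weights = np.array([0.8,  0.2, 0.1])  # learned during training
bias = 0.1

z = np.dot(weights, inputs) + bias    # 0.4 - 0.2 + 0.2 + 0.1 = 0.5
output = relu(z)
print(output)  # → 0.5
```

A full network is just many of these neurons, stacked in layers, with the weights and biases set by training rather than by hand.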
Activation Functions — Why They Matter
Without activation functions, any neural network — no matter how deep — collapses to a simple linear function. Activations introduce non-linearity, which is what gives deep networks their power.
| Function | Formula | Range | When to Use |
|---|---|---|---|
| ReLU | max(0, x) | [0, ∞) | Hidden layers — default choice, fast, avoids vanishing gradients |
| Sigmoid | 1/(1+e^(-x)) | (0, 1) | Binary classification output layer only |
| Softmax | e^xᵢ / Σe^xⱼ | (0,1), sums to 1 | Multi-class classification output layer |
| Tanh | (e^x - e^-x)/(e^x + e^-x) | (-1, 1) | Recurrent networks, when zero-centered output matters |
| GELU | x·Φ(x) | (-∞, ∞) | Transformers (GPT, BERT use this) |
Backpropagation — How Neural Nets Learn
Backpropagation uses the chain rule of calculus to efficiently compute how much each weight contributed to the error, then adjusts them in the direction that reduces the error.
- Forward Pass
Feed input through the network, layer by layer, to get a prediction.
- Compute Loss
Compare prediction to actual answer using a loss function (e.g., cross-entropy for classification).
- Backward Pass
Propagate the error gradient backward through the network using the chain rule. Compute ∂Loss/∂w for every weight.
- Update Weights
Adjust each weight by a small step in the direction that reduces the loss: w ← w - η·(∂Loss/∂w), where η is the learning rate.
- Repeat
Do this for thousands of mini-batches over many epochs until loss converges.
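The five steps above can be sketched with a tiny two-layer network trained by hand-written backpropagation in NumPy. The data, architecture, and hyperparameters are all toy choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))
y = X[:, :1] * X[:, 1:]        # target: the product of the two inputs

# One hidden layer of 8 tanh units, single numeric output
W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)
lr = 0.05

for step in range(500):
    # 1. Forward pass
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    # 2. Compute loss (mean squared error)
    loss = np.mean((pred - y) ** 2)
    if step == 0:
        first_loss = loss
    # 3. Backward pass: chain rule, layer by layer
    dpred = 2 * (pred - y) / len(X)
    dW2 = h.T @ dpred
    db2 = dpred.sum(axis=0)
    dh = dpred @ W2.T
    dz = dh * (1 - h ** 2)      # derivative of tanh
    dW1 = X.T @ dz
    db1 = dz.sum(axis=0)
    # 4. Update each weight a small step against its gradient
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"loss: {first_loss:.3f} -> {loss:.3f}")  # loss drops as the net learns
```

Frameworks like PyTorch automate the backward pass (autograd), but this is exactly what they compute under the hood.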
Architecture Types
| Architecture | Input Type | Key Idea | Use Case |
|---|---|---|---|
| MLP (Fully Connected) | Tabular, fixed-size | Each neuron connects to all neurons in next layer | Tabular data, classification |
| CNN (Convolutional) | Images, grids | Local filters + shared weights = spatial invariance | Image classification, object detection |
| RNN / LSTM | Sequences | Hidden state carries information across time steps | Time series, older NLP tasks |
| Transformer | Sequences | Self-attention: every position attends to every other | NLP (GPT, BERT), vision, multimodal |
| Autoencoder | Any | Compress to bottleneck then reconstruct (encoder + decoder) | Anomaly detection, representation learning |
| GAN | Noise | Generator vs Discriminator adversarial training | Image generation, data augmentation |
The Transformer — How Modern AI Works
The Transformer (Vaswani et al., 2017, "Attention Is All You Need") is the architecture behind GPT, BERT, Claude, Gemini, and nearly all modern AI. Its key innovation is self-attention.
For each word in a sentence, self-attention asks: "which other words in this sentence are most relevant to understanding THIS word?" It computes a weighted sum of all other words' representations, where the weights reflect how relevant each word is. This happens in parallel for every position simultaneously — unlike RNNs which process sequentially.
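A minimal NumPy sketch of scaled dot-product attention, using random toy embeddings and random projection matrices in place of learned ones:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# 4 tokens, each represented by a d=8 vector (toy embeddings)
rng = np.random.default_rng(0)
seq_len, d = 4, 8
X = rng.normal(size=(seq_len, d))

# Projection matrices (random here; learned in a real model)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Scaled dot-product attention: every position attends to every other
scores = Q @ K.T / np.sqrt(d)   # (4, 4) relevance matrix
weights = softmax(scores)       # each row sums to 1
output = weights @ V            # weighted sum of value vectors

print(weights.shape, output.shape)  # → (4, 4) (4, 8)
```

A real Transformer runs many such attention "heads" in parallel, adds positional information, and stacks dozens of these layers.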
Ensembles & Boosting
Why combining many diverse models usually beats one strong model.
In the Netflix Prize, the winning and runner-up submissions were both stacked ensembles of over 100 learners. Combining them improved results even further. The lesson: learn many models, not just one.
Three Main Ensemble Methods
1. Bagging (Bootstrap Aggregating)
Train many copies of the same model on different random samples of the training data (with replacement). Combine by voting (classification) or averaging (regression). Dramatically reduces variance with almost no increase in bias. Random Forest is bagging applied to decision trees.
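Bagging can be sketched by hand in a few lines, here with scikit-learn decision trees on a synthetic dataset (the sample and tree counts are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
rng = np.random.default_rng(0)

# Train 25 trees, each on its own bootstrap sample (drawn WITH replacement)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Combine by majority vote across all trees
votes = np.array([tree.predict(X) for tree in trees])   # shape (25, 200)
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
print(f"ensemble accuracy on training set: {(ensemble_pred == y).mean():.2f}")
```

Random Forest does exactly this, plus one extra trick: each split also considers only a random subset of features, which decorrelates the trees further.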
2. Boosting
Train models sequentially. Each new model focuses on the mistakes of the previous ones. Training examples that were classified incorrectly get higher weights. Final prediction is a weighted vote of all models. Reduces both bias and variance. XGBoost and LightGBM are boosting algorithms.
3. Stacking
Train several different base models (e.g., Random Forest + SVM + Logistic Regression). Then train a "meta-learner" on the predictions of the base models. The meta-learner learns how to best combine their predictions.
```python
import xgboost as xgb
from sklearn.model_selection import cross_val_score

model = xgb.XGBClassifier(
    n_estimators=500,       # number of trees
    max_depth=6,            # tree depth (controls complexity)
    learning_rate=0.05,     # step size (lower = more trees needed)
    subsample=0.8,          # fraction of data per tree (bagging)
    colsample_bytree=0.8,   # fraction of features per tree
    reg_alpha=0.1,          # L1 regularization
    reg_lambda=1.0,         # L2 regularization
    eval_metric='logloss'
)

# Evaluate with cross-validation
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"Mean CV Accuracy: {scores.mean():.3f}")
```
Fine-Tuning AI Models
How to take a massive pre-trained model like GPT-2 or LLaMA and adapt it to your specific task — with far less data and compute than training from scratch.
What is Fine-Tuning?
A pre-trained model like GPT-2 was trained on hundreds of billions of tokens of text. It has learned general language understanding — grammar, facts, reasoning patterns. Fine-tuning takes this pre-trained model and continues training it on a much smaller, task-specific dataset to adapt it to your specific use case.
Training GPT-3 from scratch cost ~$12 million in compute. Fine-tuning can cost as little as a few dollars on a single GPU. You leverage billions of dollars of existing training and adapt just the last mile for your specific task.
Transfer Learning — The Core Idea
Pre-Training (Already Done)
A large model is trained on massive data to learn general representations. For language models, this is next-token prediction on vast internet text. The model learns grammar, facts, common sense, and reasoning. This creates powerful internal representations.
Fine-Tuning (You Do This)
You take the pre-trained model and continue training it on your task-specific dataset. The model's weights shift slightly to adapt to your domain. All the general knowledge is preserved — you're just specializing it.
Deployment
The fine-tuned model performs much better on your task than the base model — and far better than training from scratch on your small dataset alone.
Full Fine-Tuning vs Parameter-Efficient Methods
| Method | What it updates | VRAM needed | Data needed | Best for |
|---|---|---|---|---|
| Full Fine-Tuning | ALL model weights | Very high (40B+ model → 80GB+) | Thousands of examples | When you have serious compute budget |
| LoRA | Low-rank adapter matrices only | Low (can do 7B on 16GB) | Hundreds of examples | Most practical fine-tuning today |
| QLoRA | LoRA adapters on quantized model | Very low (7B on 8GB!) | Hundreds of examples | Fine-tuning on consumer hardware |
| Prompt Tuning | Only a small set of "soft prompt" tokens | Minimal | Small | Light task adaptation |
| Adapter Layers | Small inserted adapter modules | Low | Medium | Multi-task models |
LoRA Deep Dive — The Modern Standard
LoRA (Low-Rank Adaptation) is the most important fine-tuning technique to understand. Here's how it works:
In a standard neural network layer, the weight matrix W has dimensions d×k, meaning d×k trainable parameters. During full fine-tuning, ALL of these change. LoRA instead adds two small matrices A (d×r) and B (r×k) where r is small (like 4, 8, or 16). The modified layer becomes W' = W + A·B: the pre-trained W stays frozen, and only A and B are trained, which is r·(d+k) parameters instead of d·k. For a 4096×4096 layer with r=8, that is roughly 0.4% of the original parameter count.
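A NumPy sketch of the parameter arithmetic, with an assumed 1024×1024 layer and rank r=16:

```python
import numpy as np

d, k, r = 1024, 1024, 16

full_params = d * k              # full fine-tuning updates every entry of W
lora_params = d * r + r * k      # LoRA trains only A (d x r) and B (r x k)

W = np.zeros((d, k))             # stand-in for frozen pre-trained weights
A = np.random.randn(d, r) * 0.01
B = np.zeros((r, k))             # B starts at zero, so A @ B is initially zero
W_adapted = W + A @ B            # a rank-r update with the same shape as W

print(f"full fine-tuning: {full_params:,} trainable params")  # 1,048,576
print(f"LoRA (r=16):      {lora_params:,} trainable params")  # 32,768 (~3%)
```

Starting B at zero means the adapted model behaves exactly like the base model at step 0, and only drifts as A and B are trained.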
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, TaskType
import torch

# Step 1: Load the base model in 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Step 2: Configure LoRA adapters
lora_config = LoraConfig(
    r=16,                     # rank — higher = more capacity, more params
    lora_alpha=32,            # scaling factor (usually 2x rank)
    target_modules=["q_proj", "v_proj"],  # which layers to apply LoRA to
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Step 3: Wrap model with LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → trainable params: 4,194,304 || all params: 6,742,609,920 || trainable: 0.06%
```
Preparing Your Fine-Tuning Dataset
Dataset quality matters far more than quantity for fine-tuning. Here's what a good fine-tuning dataset looks like for an instruction-following model:
// Each line is one training example in instruction format
{"instruction": "Explain what gradient descent is in simple terms.", "input": "", "output": "Gradient descent is like hiking down a mountain in fog..."}
{"instruction": "Convert this SQL query to Python pandas code.", "input": "SELECT name, age FROM users WHERE age > 18", "output": "df[df['age'] > 18][['name', 'age']]"}

// For chat fine-tuning (preferred for conversational models):
{"messages": [
  {"role": "system", "content": "You are a helpful ML tutor."},
  {"role": "user", "content": "What is overfitting?"},
  {"role": "assistant", "content": "Overfitting is when a model..."}
]}
• Diversity — cover all the cases you care about
• Quality over quantity — 500 excellent examples beat 5,000 mediocre ones
• Format consistency — always use the same template
• No duplicates — deduplicate your dataset
• Output quality — bad outputs teach bad behavior. Review them manually.
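A minimal sketch of enforcing that checklist before training, assuming the instruction-format JSONL shown earlier (the field names come from that example; `validate_dataset` is a hypothetical helper, not a library function):

```python
import json

def validate_dataset(path):
    """Load an instruction-format JSONL file, enforcing format
    consistency, deduplication, and non-empty outputs."""
    seen = set()
    examples = []
    with open(path) as f:
        for line_no, line in enumerate(f, 1):
            ex = json.loads(line)
            # Format consistency: every example uses the same keys
            assert set(ex) == {"instruction", "input", "output"}, f"bad keys on line {line_no}"
            # No duplicates: drop repeated prompts
            key = (ex["instruction"], ex["input"])
            if key in seen:
                continue
            seen.add(key)
            # Output quality: empty outputs teach the model to say nothing
            assert ex["output"].strip(), f"empty output on line {line_no}"
            examples.append(ex)
    return examples
```

Diversity and output quality still need human review; a script can only catch the mechanical problems.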
RLHF — How ChatGPT Was Trained
RLHF (Reinforcement Learning from Human Feedback) is how models like ChatGPT, Claude, and Gemini are aligned to be helpful, harmless, and honest. It happens in 3 phases:
1. Supervised Fine-Tuning (SFT)
Fine-tune the base model on high-quality demonstrations of the desired behavior. Human contractors write ideal responses to thousands of prompts.
2. Reward Model Training
Collect human preference data: show the same prompt to the model, get two different responses, and have humans rank which is better. Train a "reward model" to predict human preference scores.
3. RL Fine-Tuning with PPO
Use the reward model as a scoring signal and apply reinforcement learning (the PPO algorithm) to train the language model to generate responses with higher reward scores. A KL-divergence penalty keeps the model from drifting too far from the original SFT model.
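The reward signal in phase 3 can be illustrated with toy numbers. Everything below is invented (the scores, the beta value); it only shows how the KL-style penalty shrinks the reward as the policy drifts from the SFT model:

```python
def rlhf_reward(reward_model_score, policy_logprob, sft_logprob, beta=0.1):
    """Reward-model score minus a KL-style penalty that grows as the
    policy's log-probabilities drift from the SFT model's."""
    kl_penalty = policy_logprob - sft_logprob  # per-token estimate of log(pi / pi_sft)
    return reward_model_score - beta * kl_penalty

# Same reward-model score, different amounts of drift
close   = rlhf_reward(2.0, policy_logprob=-1.0, sft_logprob=-1.1)
drifted = rlhf_reward(2.0, policy_logprob=-1.0, sft_logprob=-5.0)
print(round(close, 2))    # 1.99 -> small drift, tiny penalty
print(round(drifted, 2))  # 1.6  -> large drift, large penalty
```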
Practical Hyperparameters for Fine-Tuning
| Parameter | Typical Range | Notes |
|---|---|---|
| Learning Rate | 1e-5 to 5e-4 | Much lower than pre-training. Start with 2e-4 for LoRA. |
| Batch Size | 4–32 (with gradient accumulation) | Effective batch = batch_size × gradient_accum_steps |
| Epochs | 1–5 | Very few epochs needed. More often causes overfitting. |
| LoRA Rank (r) | 4, 8, 16, 64 | Higher = more capacity. 8-16 is sweet spot for most tasks. |
| LoRA Alpha | = 2 × r | Controls scaling. Keep at 2× rank as default. |
| Max Sequence Length | 512–4096 | Longer = more memory. Match to your use case. |
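A quick sanity check on the effective-batch-size rule from the table, with illustrative values picked from the ranges above (the config dict is hypothetical, not a real trainer API):

```python
# Illustrative hyperparameters picked from the ranges in the table
config = {
    "learning_rate": 2e-4,             # LoRA starting point
    "per_device_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "num_epochs": 3,
    "lora_r": 16,
    "lora_alpha": 32,                  # 2 x rank, the default recommendation
    "max_seq_len": 2048,
}

# Effective batch = batch_size x gradient accumulation steps
effective_batch = (config["per_device_batch_size"]
                   * config["gradient_accumulation_steps"])
print(effective_batch)  # 32
```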
Evaluation & Metrics
How to actually measure whether your model is good — and which metric to use when.
Accuracy is often the worst metric to use. If 99% of emails are not spam, a model that predicts "not spam" for everything gets 99% accuracy — but catches zero spam. Always choose metrics that match your actual problem.
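That spam example can be checked in a few lines (synthetic labels, 99% negatives):

```python
# 1000 emails: 990 not-spam (0), 10 spam (1)
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000  # a "model" that always predicts not-spam

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print(accuracy)  # 0.99 -> looks great
print(recall)    # 0.0  -> catches zero spam
```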
Classification Metrics
All classification metrics derive from the confusion matrix — a table of True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN).
| Metric | Formula | Use When |
|---|---|---|
| Accuracy | (TP+TN) / All | Balanced classes, equal error costs |
| Precision | TP / (TP+FP) | False positives are costly (e.g., spam filter — don't delete real emails) |
| Recall | TP / (TP+FN) | False negatives are costly (e.g., cancer screening — don't miss cancer) |
| F1 Score | 2 × (P×R)/(P+R) | Need balance of precision & recall, imbalanced classes |
| ROC-AUC | Area under ROC curve | Ranking quality, comparing models across thresholds |
| PR-AUC | Area under Precision-Recall curve | Very imbalanced classes (better than ROC-AUC in this case) |
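The formulas in the table can be verified against a toy confusion matrix (the counts below are invented):

```python
TP, FP, TN, FN = 80, 10, 95, 20  # invented confusion-matrix counts

accuracy  = (TP + TN) / (TP + FP + TN + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * (precision * recall) / (precision + recall)

print(round(accuracy, 4))   # 0.8537
print(round(precision, 4))  # 0.8889
print(round(recall, 4))     # 0.8
print(round(f1, 4))         # 0.8421
```

Note how precision and recall tell different stories than accuracy alone: here the model misses 20 of 100 positives even though accuracy looks solid.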
Regression Metrics
| Metric | Formula | Notes |
|---|---|---|
| MSE | mean((y - ŷ)²) | Penalizes large errors heavily. In output units². |
| RMSE | √MSE | Same units as output. Most commonly reported. |
| MAE | mean(|y - ŷ|) | More robust to outliers than MSE. Easy to interpret. |
| R² | 1 - SS_res/SS_tot | % of variance explained. 1.0 = perfect. Can be negative (worse than baseline). |
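The same kind of check works for the regression metrics, on a tiny made-up set of predictions:

```python
y     = [3.0, 5.0, 2.0, 7.0]   # true values (made up)
y_hat = [2.5, 5.0, 3.0, 6.0]   # predictions (made up)

n = len(y)
mse  = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / n
rmse = mse ** 0.5
mae  = sum(abs(a - b) for a, b in zip(y, y_hat)) / n

mean_y = sum(y) / n
ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
ss_tot = sum((a - mean_y) ** 2 for a in y)
r2 = 1 - ss_res / ss_tot

print(mse)           # 0.5625 (in output units squared)
print(mae)           # 0.625  (same units as the output)
print(round(r2, 4))  # 0.8475 -> ~85% of variance explained
```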
Data Preparation
Real ML is 80% data work. Here's how to do it right.
Handling Missing Values
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# 1. DROP columns with too many missing values
df = df.dropna(thresh=0.8 * len(df), axis=1)  # drop cols with >20% missing

# 2. SIMPLE IMPUTATION
imputer = SimpleImputer(strategy='median')  # or 'mean', 'most_frequent'
X_imputed = imputer.fit_transform(X)

# 3. KNN IMPUTATION — fills missing with average of K nearest neighbors
# More accurate but slower
knn_imputer = KNNImputer(n_neighbors=5)
X_knn = knn_imputer.fit_transform(X)

# 4. ADD "was_missing" indicator column — let model learn the pattern
df['age_was_missing'] = df['age'].isna().astype(int)
df['age'] = df['age'].fillna(df['age'].median())
Handling Class Imbalance
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# OPTION 1: SMOTE — synthesize new minority class examples
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

# OPTION 2: class_weight='balanced' — let the model handle it
model = RandomForestClassifier(class_weight='balanced')

# OPTION 3: Custom weights
weights = {0: 1, 1: 10}  # 10x penalty for misclassifying minority class
model = RandomForestClassifier(class_weight=weights)
ML Master Cheat Sheet
Everything condensed. Keep this open while studying or building.
Key Formulas
Overfitting vs Underfitting
Algorithm Quick Picks
Fine-Tuning LLMs
Bias-Variance
Preprocessing Checklist
Glossary
Every key term in ML defined clearly and concisely.