Baum-Welch Algorithm
for Hidden Markov Models

Imagine you're trying to understand a system, but you can only see indirect clues — not the system itself. That's exactly what a Hidden Markov Model (HMM) is designed for.

🌦️

The Weather Room

You're inside a windowless room. You can only tell the weather by whether people bring umbrellas. The actual weather is hidden — what you observe is people's behaviour.

🎰

The Casino Dice

A casino secretly switches between a fair die and a loaded one. You see the numbers rolled, but never which die is active. The die choice is the hidden state.

🧠

Speech Recognition

When you speak, your mouth produces sound waves (observable). The words and phonemes in your brain are hidden — HMMs help decode them from audio alone.

💡 In Plain English

An HMM says: there are hidden "states" changing over time (sunny / rainy). At each moment, the current state produces an observable "output" (umbrella / no umbrella). We never see the states — only the outputs. Our job is to figure out the states from the outputs.

The Three Things We Learn (λ = A, B, π)

π (pi) — Initial probabilities
Where do we start? — The probability that the system begins in each hidden state. E.g., "70% chance it starts sunny."

A matrix — Transitions
How do states change? — "If sunny today, 80% chance sunny tomorrow, 20% chance rainy." These are state-to-state probabilities over time.

B matrix — Emissions
What does each state produce? — "When rainy, 90% chance of seeing umbrella." These link hidden states to observable outputs.
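The weather model above fits in three small NumPy arrays. The 70% start, the 80% sunny-to-sunny transition, and the 90% rainy-umbrella emission come from the text; the remaining entries are made-up values for illustration:

```python
import numpy as np

# Hidden states: 0 = Sunny, 1 = Rainy.  Observations: 0 = no umbrella, 1 = umbrella.
pi = np.array([0.7, 0.3])    # "70% chance it starts sunny"

A = np.array([
    [0.8, 0.2],              # sunny today -> 80% sunny, 20% rainy tomorrow
    [0.4, 0.6],              # rainy row: assumed values, not from the text
])

B = np.array([
    [0.9, 0.1],              # sunny: mostly no umbrellas (assumed values)
    [0.1, 0.9],              # "when rainy, 90% chance of seeing umbrella"
])
```

Note that every row is a probability distribution, so each row must sum to 1.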
The Central Problem

Given only a sequence of observations (what we can see), can we figure out the best possible values for A, B, and π?

⚡ This is the Baum-Welch problem!

We don't know the hidden states. We don't know A, B, or π. All we have is a sequence of observations — and Baum-Welch will learn the entire model from that alone.

Input: O = [o₁, o₂, ..., oT]  (observations only)
↓   Baum-Welch   ↓
Output: best λ = (A, B, π) that explains O

Baum-Welch is an Expectation-Maximization (EM) algorithm — it repeatedly improves its guesses for A, B, and π. Think of it like tuning a radio: keep adjusting until the signal is clearest.

🔁 The Big Idea (EM Loop)

Start with random guesses for A, B, π → compute how well they explain your observations → adjust the matrices to make the observations more likely → repeat until improvement stops.

Step-by-Step Walkthrough
🎲

Step 0 — Random Initialization

We randomly set starting values for A (transitions), B (emissions), and π (initial). These will be wrong — that's totally fine. The algorithm will correct them through iterations.

λ⁰ = (A⁰, B⁰, π⁰) ← random, but each row sums to 1
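A minimal sketch of this initialization with NumPy; the sizes N = 2, M = 3 and the seed are arbitrary choices for illustration:

```python
import numpy as np

def random_stochastic(rows, cols, rng):
    """Random non-negative matrix whose rows each sum to 1."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

rng = np.random.default_rng(seed=42)   # seed chosen arbitrarily
N, M = 2, 3                            # 2 hidden states, 3 observation symbols

A0  = random_stochastic(N, N, rng)     # initial transition guesses
B0  = random_stochastic(N, M, rng)     # initial emission guesses
pi0 = random_stochastic(1, N, rng)[0]  # initial start-state guesses
```

Dividing each row by its own sum is the simplest way to turn arbitrary positive numbers into valid probability distributions.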
➡️

Step 1 — Forward Pass (compute α, "alpha")

For each time step t, compute αₜ(i) — the probability of seeing observations o₁, o₂, …, oₜ AND being in hidden state i right now. We sweep left to right through the sequence.

α₁(i) = πᵢ · bᵢ(o₁)
αₜ₊₁(j) = [Σᵢ αₜ(i) · aᵢⱼ] · bⱼ(oₜ₊₁)
💡 Plain English

"Summing over every possible hidden-state path that ends in state i right now: how likely is it that I produced these observations up to this point?" We track this for every possible state at each time step.
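The forward recursion can be sketched in a few lines; the model values and observation sequence below are assumed for illustration only:

```python
import numpy as np

# Toy model (assumed values): 2 hidden states, 3 observation symbols.
A  = np.array([[0.8, 0.2],
               [0.3, 0.7]])          # A[i, j] = P(state j at t+1 | state i at t)
B  = np.array([[0.6, 0.3, 0.1],
               [0.1, 0.3, 0.6]])     # B[i, k] = P(symbol k | state i)
pi = np.array([0.7, 0.3])
obs = [0, 1, 2, 0]                   # observed symbol indices

def forward(obs, A, B, pi):
    """alpha[t, i] = P(o_1..o_t, state_t = i)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # α₁(i) = πᵢ · bᵢ(o₁)
    for t in range(T - 1):
        # αₜ₊₁(j) = [Σᵢ αₜ(i) · aᵢⱼ] · bⱼ(oₜ₊₁)
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    return alpha

alpha = forward(obs, A, B, pi)
likelihood = alpha[-1].sum()         # summing the last column gives P(O | λ)
```

Summing αT(i) over all states i at the final step yields the total probability of the observation sequence.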

⬅️

Step 2 — Backward Pass (compute β, "beta")

Symmetrically, compute βₜ(i) — the probability of seeing future observations oₜ₊₁, …, oT given that we're currently in state i at time t. We sweep right to left.

βT(i) = 1  (nothing left to observe)
βₜ(i) = Σⱼ aᵢⱼ · bⱼ(oₜ₊₁) · βₜ₊₁(j)
💡 Plain English

"Given I'm in state i right now, how well do the rest of the observations 'make sense' from here?" Combining forward α and backward β gives us full context from both directions.
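A matching sketch of the backward recursion, using the same toy model values (assumed, as before). As a sanity check, the likelihood computed from the backward side should equal the one from the forward pass:

```python
import numpy as np

# Same toy model as in the forward-pass sketch (values assumed for illustration).
A  = np.array([[0.8, 0.2], [0.3, 0.7]])
B  = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.7, 0.3])
obs = [0, 1, 2, 0]

def backward(obs, A, B):
    """beta[t, i] = P(o_{t+1}..o_T | state_t = i)."""
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))                         # βT(i) = 1
    for t in range(T - 2, -1, -1):
        # βₜ(i) = Σⱼ aᵢⱼ · bⱼ(oₜ₊₁) · βₜ₊₁(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

beta = backward(obs, A, B)
likelihood = (pi * B[:, obs[0]] * beta[0]).sum()   # P(O|λ) from the backward side
```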

🔗

Step 3 — E-Step: Compute γ (gamma) and ξ (xi)

By multiplying forward and backward probabilities, we estimate the probability of being in each state at each time — even though we never directly see those states!

γₜ(i) = αₜ(i) · βₜ(i) / P(O|λ)   ← "soft" state assignment

ξₜ(i,j) = αₜ(i) · aᵢⱼ · bⱼ(oₜ₊₁) · βₜ₊₁(j) / P(O|λ)
💡 Plain English

γₜ(i) answers: "How confident are we that the system was in state i at time t?" — a soft probability between 0 and 1.

ξₜ(i,j) answers: "How likely was the transition from state i → state j between time t and t+1?"
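Once α and β are available, γ and ξ fall out in a few lines. This sketch reuses the toy model values (assumed) and inlines the forward-backward recursions from Steps 1–2:

```python
import numpy as np

# Toy model (assumed values), then forward-backward as in Steps 1-2.
A  = np.array([[0.8, 0.2], [0.3, 0.7]])
B  = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.7, 0.3])
obs = [0, 1, 2, 0]
T, N = len(obs), len(pi)

alpha = np.zeros((T, N))
beta  = np.ones((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(T - 1):
    alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

P = alpha[-1].sum()              # P(O|λ)
gamma = alpha * beta / P         # γₜ(i): soft state assignment, each row sums to 1
xi = np.zeros((T - 1, N, N))     # ξₜ(i,j): soft transition "counts"
for t in range(T - 1):
    # ξₜ(i,j) = αₜ(i) · aᵢⱼ · bⱼ(oₜ₊₁) · βₜ₊₁(j) / P(O|λ)
    xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / P
```

A useful internal consistency check: summing ξₜ(i,j) over j recovers γₜ(i).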

🔄

Step 4 — M-Step: Re-estimate A, B, π

Use γ and ξ to compute better estimates. Each update is guaranteed to increase (or maintain) P(O|λ) — this is what makes EM so powerful.

π̂ᵢ = γ₁(i)   (use first-step state probs as new start probs)

âᵢⱼ = Σₜ ξₜ(i,j) / Σₜ γₜ(i)

b̂ᵢ(k) = Σₜ: oₜ=k γₜ(i) / Σₜ γₜ(i)
💡 Plain English — It's just counting!

For the transition matrix A: "Out of all the times I was in state i, how often did I go to state j next?"

For the emission matrix B: "Out of all the times I was in state i, how often did I produce symbol k?"
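The re-estimation formulas translate almost directly into code. This sketch recomputes the E-step quantities for a toy model (assumed values, same as the earlier sketches) and then applies the three updates:

```python
import numpy as np

A  = np.array([[0.8, 0.2], [0.3, 0.7]])
B  = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.7, 0.3])
obs = [0, 1, 2, 0]
T, N, M = len(obs), 2, 3

# E-step quantities (forward-backward), as in Steps 1-3
alpha = np.zeros((T, N)); beta = np.ones((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(T - 1):
    alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
P = alpha[-1].sum()
gamma = alpha * beta / P
xi = np.array([alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / P
               for t in range(T - 1)])

# M-step: "just counting" with soft counts
pi_new = gamma[0]                                          # π̂ᵢ = γ₁(i)
A_new  = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]  # soft transition counts
B_new  = np.zeros((N, M))
for k in range(M):
    in_state_and_symbol = gamma[np.array(obs) == k].sum(axis=0)
    B_new[:, k] = in_state_and_symbol / gamma.sum(axis=0)  # soft emission counts
```

Because the numerators partition the denominators exactly, every updated row still sums to 1 — no explicit renormalization is needed here.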

Step 5 — Convergence Check

Compute log P(O|λ) — how well the updated model explains the observations. If it barely changed (less than threshold ε), the algorithm has converged. Otherwise, go back to Step 1 with the new λ.

Converged if |log P(O|λ)_new − log P(O|λ)_old| < ε
⚠️ Why log probability?

Probabilities of long sequences get astronomically small (like 10⁻⁵⁰⁰). Taking the log converts these into manageable negative numbers and prevents the computer from rounding everything to zero.
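Putting all five steps together, a bare-bones single-sequence Baum-Welch might look like the sketch below. It uses unscaled probabilities, so it is only suitable for short sequences; a production implementation would work with scaled or log-space values for exactly the underflow reason just described. The observation sequence is made up for illustration:

```python
import numpy as np

def baum_welch(obs, N, M, n_iter=100, tol=1e-6, seed=0):
    """Learn (A, B, pi) from a single observation sequence. Unscaled: short obs only."""
    rng = np.random.default_rng(seed)
    T = len(obs)
    # Step 0: random stochastic initialization
    A  = rng.random((N, N)); A  /= A.sum(axis=1, keepdims=True)
    B  = rng.random((N, M)); B  /= B.sum(axis=1, keepdims=True)
    pi = rng.random(N);      pi /= pi.sum()
    prev_ll = -np.inf
    for _ in range(n_iter):
        # Steps 1-2: forward and backward passes
        alpha = np.zeros((T, N)); beta = np.ones((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(T - 1):
            alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        P = alpha[-1].sum()
        # Step 3: E-step
        gamma = alpha * beta / P
        xi = np.array([alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / P
                       for t in range(T - 1)])
        # Step 4: M-step
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(M):
            B[:, k] = gamma[np.array(obs) == k].sum(axis=0)
        B /= B.sum(axis=1, keepdims=True)
        # Step 5: convergence check on the log-likelihood
        ll = np.log(P)
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return A, B, pi, ll

obs = [0, 0, 1, 2, 2, 1, 0, 0, 2, 2]       # made-up observation sequence
A, B, pi, ll = baum_welch(obs, N=2, M=3)
```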


🌤️
Recommended: Start with the Weather Example

A good first experiment uses 3 observation symbols (0 = Sunny, 1 = Cloudy, 2 = Rainy) with 2 hidden states (think "Summer pattern" vs "Winter pattern").

The inputs you choose are:

Observation sequence — each number is one observed symbol. These are what the model can see. E.g., 0 = sunny, 1 = cloudy, 2 = rainy.
Number of hidden states (N) — how many underlying hidden states to model. Start with 2 — it's the simplest case.
Number of symbols (M) — the number of distinct symbols in your sequence.
Random seed — different seeds → different initializations → possibly different final models. Try a few!
What Each Output Means

log P(O|λ)
How well the learned model explains your observations. Always negative — a value closer to 0 is better. It rises toward 0 across iterations as the algorithm improves.

Iterations
How many EM rounds ran before convergence. Fewer = found a good solution quickly. More = the landscape was harder to navigate.

Transition A
Row i → col j = probability of moving from hidden state i to state j. Each row sums to 1.0. High diagonal values mean states tend to stay stable.

Emission B
Row i → col k = probability of producing symbol k when in hidden state i. Each row sums to 1.0. A large entry means the state "prefers" that symbol.

Initial π
Probability of starting in each hidden state at t=0. All values sum to 1.0.

γ (gamma) over time
For each time step, how confident are we that the system was in each state? Values near 1.0 = high confidence; roughly equal values across states = uncertainty.

Glossary of Terms
HMM
Hidden Markov Model. A probabilistic model where the real underlying states are hidden and we only observe noisy outputs.
Markov Property
The future depends only on the present, not the history. "Where I go next depends only on where I am now — not where I've been."
EM Algorithm
Expectation-Maximization: estimate the missing data (E-step), then improve parameters to fit (M-step). Guaranteed never to get worse — only better or the same.
Stochastic Matrix
A matrix where every row sums to 1.0. The rows of A and B (and the vector π) are stochastic — each is a valid probability distribution.
Log-likelihood
The logarithm of P(O|λ). Always negative; values closer to 0 mean a better fitting model. We maximize this during training.
Convergence
When the per-iteration improvement falls below ε, we've converged — further iterations won't meaningfully change the model.
Common Questions Answered
❓ Why do we use logarithms?

Multiplying many small probabilities (0.3 × 0.2 × 0.5 × … for 100 time steps) gives a number too tiny for computers to store (like 10⁻⁵⁰⁰). Log turns those multiplications into additions, keeping numbers in a manageable range.
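A two-line demonstration of the underflow problem (plain Python; the step count and per-step probability are chosen purely for illustration):

```python
import math

# The raw product of 700 step probabilities of 0.3 underflows to zero...
p = 1.0
for _ in range(700):
    p *= 0.3

# ...while the equivalent sum of logs stays a perfectly ordinary number.
log_p = 700 * math.log(0.3)
```

The true value is about 10⁻³⁶⁶ — far below what a 64-bit float can represent — yet its logarithm (about −843) is trivial to store and to compare between iterations.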

❓ How many hidden states should I choose?

Start with 2. Add more states if the log-likelihood is still very low after convergence. Too many states → the model memorizes the data (overfitting) and loses generalizability.

❓ Why does the result change with the seed?

Baum-Welch can get stuck in different local optima depending on the random starting point. Try multiple seeds and pick the model with the highest (least negative) log-likelihood.

❓ What are the limitations?

Baum-Welch only guarantees a local maximum of P(O|λ), not the global best. It also requires you to specify N (number of hidden states) and M (number of observation symbols) in advance — these aren't learned automatically.