Imagine you're trying to understand a system, but you can only see indirect clues — not the system itself. That's exactly what a Hidden Markov Model (HMM) is designed for.
You're inside a windowless room. You can only tell the weather by whether people bring umbrellas. The actual weather is hidden — what you observe is people's behaviour.
A casino secretly switches between a fair die and a loaded one. You see the numbers rolled, but never which die is active. The die choice is the hidden state.
When you speak, your mouth produces sound waves (observable). The words and phonemes in your brain are hidden — HMMs help decode them from audio alone.
An HMM says: there are hidden "states" changing over time (sunny / rainy). At each moment, the current state produces an observable "output" (umbrella / no umbrella). We never see the states — only the outputs. Our job is to figure out the states from the outputs.
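Concretely, an HMM is just three probability tables. Here is a minimal sketch of the umbrella example in NumPy; all of the numbers are made up for illustration, not learned from data:

```python
import numpy as np

# Toy umbrella HMM -- every probability here is illustrative.
# Hidden states:  0 = sunny, 1 = rainy
# Observations:   0 = no umbrella, 1 = umbrella
A = np.array([[0.8, 0.2],    # transition probabilities out of "sunny"
              [0.4, 0.6]])   # transition probabilities out of "rainy"
B = np.array([[0.9, 0.1],    # emission probabilities in "sunny"
              [0.2, 0.8]])   # emission probabilities in "rainy"
pi = np.array([0.6, 0.4])    # initial state distribution

# Each row is a probability distribution, so each must sum to 1.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```

Together, λ = (A, B, π) is the whole model: how hidden states move, what each state emits, and where the chain starts.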
Given only a sequence of observations (what we can see), can we figure out the best possible values for the model's parameters: the transition matrix A (how hidden states change), the emission matrix B (what each state tends to output), and the initial distribution π (where the chain starts)?
We don't know the hidden states. We don't know A, B, or π. All we have is a sequence of observations — and Baum-Welch will learn the entire model from that alone.
Baum-Welch is an Expectation-Maximization (EM) algorithm — it repeatedly improves its guesses for A, B, and π. Think of it like tuning a radio: keep adjusting until the signal is clearest.
Start with random guesses for A, B, π → compute how well they explain your observations → adjust the matrices to make the observations more likely → repeat until improvement stops.
We randomly set starting values for A (transitions), B (emissions), and π (initial). These will be wrong — that's totally fine. The algorithm will correct them through iterations.
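A minimal initializer might look like this; `random_hmm` is a name I'm inventing for the sketch, and the only real requirement is that every row ends up a valid probability distribution:

```python
import numpy as np

def random_hmm(n_states, n_symbols, rng):
    """Random row-stochastic starting guesses for A, B, and pi."""
    A = rng.random((n_states, n_states))
    A /= A.sum(axis=1, keepdims=True)    # normalize each row to sum to 1
    B = rng.random((n_states, n_symbols))
    B /= B.sum(axis=1, keepdims=True)
    pi = rng.random(n_states)
    pi /= pi.sum()
    return A, B, pi

A, B, pi = random_hmm(2, 2, np.random.default_rng(0))  # seed is arbitrary
```

Seeding the generator just makes a run reproducible; different seeds give different (equally wrong) starting points.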
For each time step t, compute αt(i) — the probability of seeing observations o₁, o₂, …, ot AND being in hidden state i right now. We sweep left to right through the sequence.
"What is the probability of having seen the observations so far and being in hidden state i right now, summed over every path of hidden states that could have gotten me here?" We track this for every possible state at each time step.
Symmetrically, compute βt(i) — the probability of seeing future observations ot+1, …, oT given that we're currently in state i at time t. We sweep right to left.
"Given I'm in state i right now, how well do the rest of the observations 'make sense' from here?" Combining forward α and backward β gives us full context from both directions.
By multiplying forward and backward probabilities, we estimate the probability of being in each state at each time — even though we never directly see those states!
γt(i) answers: "How confident are we that the system was in state i at time t?" — a soft probability between 0 and 1.
ξt(i,j) answers: "How likely was the transition from state i → state j between time t and t+1?"
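Given α and β, both quantities fall out by normalizing with P(O | λ). A sketch (with compact forward/backward helpers repeated so it runs standalone; `posteriors` is my own name for the function):

```python
import numpy as np

def forward(obs, A, B, pi):
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(obs, A, B):
    beta = np.ones((len(obs), A.shape[0]))
    for t in range(len(obs) - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def posteriors(obs, A, B, pi):
    """gamma[t, i] = P(state_t = i | O); xi[t, i, j] = P(state_t = i, state_{t+1} = j | O)."""
    alpha, beta = forward(obs, A, B, pi), backward(obs, A, B)
    prob = alpha[-1].sum()                    # P(O | lambda), the normalizer
    gamma = alpha * beta / prob
    T, N = alpha.shape
    xi = np.empty((T - 1, N, N))
    for t in range(T - 1):
        # xi_t(i, j) = alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j) / P(O)
        xi[t] = alpha[t, :, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / prob
    return gamma, xi
```

Two consistency checks make good unit tests: each γt sums to 1 over states, and summing ξt(i,j) over j recovers γt(i).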
Use γ and ξ to compute better estimates. Each update is guaranteed to increase (or maintain) P(O|λ) — this is what makes EM so powerful.
For the transition matrix A: "Out of all the times I was in state i, how often did I go to state j next?"
For the emission matrix B: "Out of all the times I was in state i, how often did I produce symbol k?"
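Written out, those two counting ratios (plus the initial distribution) are the standard re-estimation formulas, with time indexed 1…T:

```latex
\hat{\pi}_i = \gamma_1(i), \qquad
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad
\hat{b}_j(k) = \frac{\sum_{t \,:\, o_t = k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}
```

Both denominators count the expected number of visits to a state; the numerators count the expected transitions or emissions we care about.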
Compute log P(O|λ) — how well the updated model explains the observations. If it barely changed (less than threshold ε), the algorithm has converged. Otherwise, run another forward-backward pass and update with the new λ.
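Putting all the steps together, one compact sketch of the whole loop for a single observation sequence might look like this (`baum_welch` is my own name; it reuses the forward/backward recursions described above, and a serious implementation would also scale α and β or work in logs to avoid underflow):

```python
import numpy as np

def baum_welch(obs, n_states, n_symbols, n_iter=100, eps=1e-6, seed=0):
    """Learn A, B, pi from one integer-coded observation sequence."""
    rng = np.random.default_rng(seed)
    # Step 0: random row-stochastic starting guesses
    A = rng.random((n_states, n_states)); A /= A.sum(1, keepdims=True)
    B = rng.random((n_states, n_symbols)); B /= B.sum(1, keepdims=True)
    pi = rng.random(n_states); pi /= pi.sum()

    obs = np.asarray(obs)
    T = len(obs)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: forward and backward sweeps
        alpha = np.zeros((T, n_states)); alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta = np.ones((T, n_states))
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        prob = alpha[-1].sum()                       # P(O | lambda)
        gamma = alpha * beta / prob
        xi = np.array([alpha[t, :, None] * A * (B[:, obs[t + 1]] * beta[t + 1]) / prob
                       for t in range(T - 1)])
        # M-step: re-estimate from the expected counts
        pi = gamma[0]
        A = xi.sum(0) / gamma[:-1].sum(0)[:, None]
        for k in range(n_symbols):
            B[:, k] = gamma[obs == k].sum(0)
        B /= gamma.sum(0)[:, None]
        # Convergence check on the log-likelihood
        ll = np.log(prob)
        if ll - prev_ll < eps:
            break
        prev_ll = ll
    return A, B, pi, ll
```

Because each iteration can only raise (or hold) the log-likelihood, watching `ll` climb and then flatten is the clearest sign the loop is working.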
Probabilities of long sequences get astronomically small (like 10⁻⁵⁰⁰). Taking the log converts these into manageable negative numbers and prevents the computer from rounding everything to zero.
Multiplying many small probabilities (0.3 × 0.2 × 0.5 × … for 100 time steps) gives a number too tiny for computers to store (like 10⁻⁵⁰⁰). Log turns those multiplications into additions, keeping numbers in a manageable range.
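You can see the problem directly: multiplying 700 probabilities of 0.3 underflows double precision to exactly zero, while the equivalent sum of logs stays a perfectly ordinary number.

```python
import math

probs = [0.3] * 700

prod = 1.0
for p in probs:
    prod *= p      # 0.3**700 is about 1e-366, below the smallest
                   # representable double (~5e-324), so this hits 0.0

log_sum = sum(math.log(p) for p in probs)   # just a moderate negative number

print(prod)       # 0.0
print(log_sum)    # about -842.8
```

Once everything is in log space, the per-step multiplications in the forward pass become additions, and the final log-likelihood is directly the convergence quantity from the previous step.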
Start with 2. Add more states if the log-likelihood is still very low after convergence. Too many states → the model memorizes the data (overfitting) and loses generalizability.
Baum-Welch can get stuck in different local optima depending on the random starting point. Try multiple seeds and pick the model with the highest (least negative) log-likelihood.
Baum-Welch only guarantees a local maximum of P(O|λ), not the global best. It also requires you to specify N (number of hidden states) and M (number of observation symbols) in advance — these aren't learned automatically.