Probability and Bayesian Networks
ykompis@ethz.ch

Probability Rules
Product rule  P(X, Y) = P(X|Y) P(Y)
Chain rule  P(X_1, X_2, ..., X_n) = P(X_{1:n}) = P(X_1) P(X_2|X_1) P(X_3|X_{1:2}) ··· P(X_n|X_{1:n-1})
Sum rule  P(X_{1:n}) = Σ_y P(X_{1:n}, Y = y) = Σ_y P(X_{1:n}|Y = y) P(Y = y)
Independence  P(X|Y) = P(X), P(X, Y) = P(X) P(Y)
Bayes' rule  P(X|Y) = P(Y|X) P(X) / P(Y)
Expectation  E[x] = ∫_X x p(x) dx
Variance  Var[x] = E[x²] − E[x]² = ∫_X (x − µ_x)² p(x) dx
Gaussian  N(x|µ, σ) = 1/(σ√(2π)) exp(−(x − µ)² / (2σ²))
Multivariate Gaussian  N(x|µ, Σ) = 1/√((2π)^k |Σ|) exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ))
Bernoulli  P(X = 1) = θ, P(X = 0) = 1 − θ
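A quick numeric sanity check of the sum rule and Bayes' rule (the disease/test numbers below are made up for illustration):

    # Hypothetical prior and likelihoods: P(D=1)=0.01, P(T=1|D=1)=0.95, P(T=1|D=0)=0.05.
    p_d = 0.01
    p_t_given_d = 0.95
    p_t_given_not_d = 0.05

    # Sum rule for the evidence P(T=1), then Bayes' rule for the posterior P(D=1|T=1).
    p_t = p_t_given_d * p_d + p_t_given_not_d * (1 - p_d)
    p_d_given_t = p_t_given_d * p_d / p_t
    print(p_d_given_t)  # ~0.161: a positive test is only weak evidence under a rare prior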
Bayesian Networks  A BN is a directed acyclic graph (DAG). It is a compact representation of distributions which often allows efficient exact inference. Each vertex represents a random variable with unspecified distribution, each edge indicates conditional dependence. Any distribution P(X_{1:n}) = P(X_1) P(X_2|X_1) ... can be described by a BN using the chain rule.
d-separation  X is cond. indep. of Y given Z if there does not exist an active trail between X and Y for observations Z.
Active Trails  (Figure omitted: patterns of active vs. inactive trails.)
Algorithm for d-separation  X and Y are d-separated by Z in DAG G if they are not connected in the sub-graph G':
1. recursively remove all leaves that are not in X, Y or Z
2. delete all outgoing edges of vertices in Z
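A minimal sketch of this pruning check in Python; the parent-list encoding and helper name are my own, not from the source:

    # d-separation by graph pruning: X _||_ Y | Z iff X and Y are disconnected
    # in the pruned graph. The DAG is given as {node: [parents]}.
    def d_separated(parents, X, Y, Z):
        keep = set(parents)
        protected = X | Y | Z
        # 1. recursively remove leaves that are not in X, Y or Z
        changed = True
        while changed:
            changed = False
            for v in list(keep):
                has_child = any(v in parents[c] for c in keep if c != v)
                if not has_child and v not in protected:
                    keep.remove(v)
                    changed = True
        # 2. delete outgoing edges of vertices in Z; treat remaining edges as undirected
        edges = {v: set() for v in keep}
        for v in keep:
            for p in parents[v]:
                if p in keep and p not in Z:   # edge p -> v survives only if p is not observed
                    edges[v].add(p)
                    edges[p].add(v)
        # connectivity check (BFS) from X; d-separated iff no node of Y is reached
        frontier, seen = list(X & keep), set(X & keep)
        while frontier:
            v = frontier.pop()
            for u in edges[v] - seen:
                seen.add(u)
                frontier.append(u)
        return not (seen & Y)

    # Example: chain X -> W -> Y; observing W blocks the trail.
    par = {"X": [], "W": ["X"], "Y": ["W"]}
    print(d_separated(par, {"X"}, {"Y"}, {"W"}))  # True
    print(d_separated(par, {"X"}, {"Y"}, set()))  # False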
Inference  Compute conditional probability distributions over variables of interest. Either exact, using e.g. variable elimination, or approximate, using e.g. a Markov chain.
Variable Elimination  Calculate P(X|E = e):
1. choose a variable ordering X_1, ..., X_n
2. for i = 1:n create initial factors f_i = P(X_i|parents(X_i))
3. for i = 1:n, if X_i ∉ {X, E}: multiply all factors f that include X_i, then marginalize out X_i into a new factor g_{X_i} = Σ_{X_i} Π_j f_j
4. normalize the remaining factor P(X, E = e) → P(X|E = e)
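A minimal variable-elimination sketch on a three-variable chain A → B → C (the CPTs are made-up numbers; np.einsum does the multiply-and-sum-out step):

    import numpy as np

    # Chain A -> B -> C with made-up CPTs; all variables binary.
    p_a = np.array([0.6, 0.4])                     # P(A)
    p_b_a = np.array([[0.9, 0.1], [0.3, 0.7]])     # P(B|A), rows indexed by A
    p_c_b = np.array([[0.8, 0.2], [0.25, 0.75]])   # P(C|B), rows indexed by B

    # Eliminate A: multiply factors containing A and sum A out -> g_A(B)
    g_a = np.einsum("a,ab->b", p_a, p_b_a)
    # Eliminate B: multiply factors containing B and sum B out -> g_B(C)
    g_b = np.einsum("b,bc->c", g_a, p_c_b)

    p_c = g_b / g_b.sum()   # normalization is a no-op here (already a distribution)
    print(p_c)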
VE for Most Probable Explanation  Replace steps:
3. use g_{X_i} = max_{X_i} Π_j f_j
4. for i = n:−1:1, if X_i ∉ E: x̂_i = argmax_{x_i} g_i(x_i, x̂_{i+1:n})
Polytrees  A DAG is a polytree iff dropping edge directions results in a tree. Sufficient, but not necessary: each node has only one parent.
Factor Graphs  Reuse computation to perform inference for many variables for fixed observations. A factor graph for a BN is a bipartite graph consisting of variables and factors.
Belief Propagation  Only works for non-loopy graphs. Send messages from leaves to root and from root to leaves. Nodes V, factors u, neighbors N(·). Messages µ from u → v and v → u are defined by:
µ_{v→u}(x_v) = Π_{u'∈N(v)\u} µ_{u'→v}(x_v)
µ_{u→v}(x_v) = Σ_{x_u ∼ x_v} f(x_u) Π_{v'∈N(u)\v} µ_{v'→u}(x_{v'})
Use messages to compute probabilities: P(X_v = x_v) ∝ Π_{u'∈N(v)} µ_{u'→v}(x_v)
Incorporate observations Z by multiplying all factors containing Z with the indicator function [Z = z] = {1 if Z = z, 0 else}.
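A minimal sum-product sketch on the chain-shaped factor graph of the same A → B → C network, computing P(B | C = 1) from the two messages arriving at B (factor and message names are my own):

    import numpy as np

    # Factors f_A(A)=P(A), f_AB(A,B)=P(B|A), f_BC(B,C)=P(C|B); made-up numbers.
    p_a = np.array([0.6, 0.4])
    p_b_a = np.array([[0.9, 0.1], [0.3, 0.7]])
    p_c_b = np.array([[0.8, 0.2], [0.25, 0.75]])

    # Observe C = 1: multiply f_BC by the indicator [C = 1].
    f_bc = p_c_b * np.array([0.0, 1.0])

    # Leaves-to-B messages (sum-product):
    m_fa_to_a = p_a                                        # factor f_A -> variable A
    m_a_to_fab = m_fa_to_a                                 # variable A -> factor f_AB
    m_fab_to_b = np.einsum("a,ab->b", m_a_to_fab, p_b_a)   # factor f_AB -> B
    m_c_to_fbc = np.ones(2)                                # unobserved leaf sends all-ones
    m_fbc_to_b = np.einsum("bc,c->b", f_bc, m_c_to_fbc)    # factor f_BC -> B

    posterior_b = m_fab_to_b * m_fbc_to_b
    posterior_b /= posterior_b.sum()                       # P(B | C = 1)
    print(posterior_b)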
Loopy Belief Propagation  Initialize all messages as uniform distributions, then update messages following some ordering until converged. Loopy BP is often over-confident and does not always converge.
Variational Inference  Reduce the inference problem to optimization. Try to find a simple distribution Q that approximates P well: Q* ∈ argmin_Q KL(Q||P)
KL Divergence  KL(Q||P) = Σ_x Q(x) log(Q(x)/P(x)). Not symmetric, non-negative, close to zero if P and Q agree.
Mean Field VI  Assume an independent product distribution: Q(x) = Π_i Q_i(x_i). This leads to the evidence lower bound (ELBO), which can be optimized.
Sampling Based Inference  Stochastic approximation that uses randomness to estimate probabilities. Guaranteed to converge to the right answer as the number of samples N → ∞.
Rejection Sampling  Draw N samples x^(1), ..., x^(N), x^(i) = [x_1^(i), ..., x_n^(i)], from the BN. Bad if P(X_B = x_b) is rare.
Marginals  P(X_A = x_A) = (1/N) Count(X_A = x_A)
Conditionals  P(X_A = x_A|X_B = x_b) = Count(X_A = x_A, X_B = x_B) / Count(X_B = x_B)
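A minimal sketch of estimating a marginal and a conditional by forward-sampling the same chain BN and counting (sample sizes and numbers are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    p_a = np.array([0.6, 0.4])
    p_b_a = np.array([[0.9, 0.1], [0.3, 0.7]])
    p_c_b = np.array([[0.8, 0.2], [0.25, 0.75]])

    # Forward-sample the chain A -> B -> C.
    def sample(n):
        a = rng.choice(2, size=n, p=p_a)
        b = np.array([rng.choice(2, p=p_b_a[ai]) for ai in a])
        c = np.array([rng.choice(2, p=p_c_b[bi]) for bi in b])
        return a, b, c

    a, b, c = sample(10_000)
    # Marginal estimate: fraction of samples with B = 1.
    print((b == 1).mean())
    # Conditional P(B = 1 | C = 1): keep only samples consistent with the evidence.
    keep = c == 1
    print((b[keep] == 1).sum() / keep.sum())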
Markov Chain  A stationary Markov chain is a sequence of RVs X_1, ..., X_N with prior P(X_1) and transition probability P(X_{t+1}|X_t) independent of t. It is ergodic if ∃ t such that every state is reachable from every state in exactly t steps.
Markovian Assumption  Future and past states are independent given the present: X_{1:t-1} ⊥ X_{t+1:T} | X_t  ∀ t > 1
Stationary Distribution  An ergodic MC has a unique and positive stationary distribution π(X) > 0, such that for all x: lim_{N→∞} P(X_N = x) = π(x). It is independent of P(X_1).
Markov Chain Monte Carlo  Given an unnormalized distribution Q(x) > 0 we want to find π(x) = (1/Z) Q(x). A MC satisfies the detailed balance equation (DBE) Q(x) P(x'|x) = Q(x') P(x|x'), which implies stationarity.
Gibbs Sampling  GS is asymptotically correct, but slow.
Init  x_1, ..., x_m, fix observed variables
Repeat  for j = 1:m, sample x_j ∼ (1/Z_j) Q(X_1 = x_1, ..., X_j, ..., X_m = x_m)
Random order fulfills the DBE and finds the correct distribution.
Deterministic order does not fulfill the DBE, but still gives the correct distribution.
Expectation via GS  Use GS to obtain samples X^(1), ..., X^(T). Approximate the expectation after burn-in time t_0:
E[f(X)|x_B] ≈ 1/(T − t_0) Σ_{τ=t_0+1}^{T} f(X^(τ))
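A minimal Gibbs-sampling sketch for a zero-mean bivariate Gaussian with correlation ρ, whose full conditionals are N(ρ·x_other, 1 − ρ²); the burn-in average at the end is the "Expectation via GS" estimate (ρ and sample counts are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    rho = 0.8                      # correlation of the target bivariate Gaussian
    T, t0 = 20_000, 1_000          # total samples and burn-in time
    x = np.zeros(2)                # init x_1, x_2
    samples = np.empty((T, 2))

    for t in range(T):
        # Resample each coordinate from its full conditional given the other.
        x[0] = rng.normal(rho * x[1], np.sqrt(1 - rho**2))
        x[1] = rng.normal(rho * x[0], np.sqrt(1 - rho**2))
        samples[t] = x

    # Expectation via GS after burn-in, here E[x_1 * x_2], which should be close to rho.
    print(samples[t0:].prod(axis=1).mean())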
Temporal Models  Create a copy of each variable for every time step. State augmentation can be used to reduce the order of temporal dependence.
Hidden Markov Models/Kalman Filters
States (hidden variables)  X_1, ..., X_T
Observations  Y_1, ..., Y_T
Hidden Markov Model  X_i and Y_i categorical
Kalman Filter  X_i and Y_i Gaussian distributions
Useful for filtering P(X_t|y_{1:t}), prediction P(X_{t+1}|y_{1:t}), smoothing P(X_t|y_{1:T}) for 1 ≤ t ≤ T, and most probable explanation argmax_{X_{1:T}} P(X_{1:T}|y_{1:T}).
Bayesian Filtering  Instead of re-computing at each time step, use P(X_t|y_{1:t}) to compute P(X_{t+1}|y_{1:t+1}) recursively.
Conditioning  P(X_t|y_{1:t}) = (1/Z) P(X_t|y_{1:t-1}) P(y_t|x_t)
Prediction  P(X_{t+1}|y_{1:t}) = Σ_x P(X_t = x|y_{1:t}) P(X_{t+1}|X_t = x)
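A minimal forward-filtering sketch for a two-state discrete HMM, alternating the conditioning and prediction steps above (transition and emission tables are made up):

    import numpy as np

    # Two hidden states, two observation symbols (made-up numbers).
    T_mat = np.array([[0.7, 0.3], [0.2, 0.8]])   # P(X_{t+1} | X_t), rows = X_t
    E_mat = np.array([[0.9, 0.1], [0.4, 0.6]])   # P(Y_t | X_t), rows = X_t
    prior = np.array([0.5, 0.5])                 # P(X_1)

    def filter_step(belief, y):
        belief = belief * E_mat[:, y]            # conditioning on y_t
        belief = belief / belief.sum()           # normalize (the 1/Z)
        return belief @ T_mat                    # prediction: P(X_{t+1} | y_{1:t})

    ys = [0, 0, 1, 1]
    belief = prior
    for y in ys:
        belief = filter_step(belief, y)
    print(belief)                                # predictive belief after the last observation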
Multivariate Gaussians  Random vectors X_A ∼ N(µ_A, Σ_AA) and X_B ∼ N(µ_B, Σ_BB).
P(X_A|X_B = x_B) = N(µ_{A|B}, Σ_{A|B}) with
µ_{A|B} = µ_A + Σ_AB Σ_BB⁻¹ (x_B − µ_B) and Σ_{A|B} = Σ_AA − Σ_AB Σ_BB⁻¹ Σ_BA
Kalman Filter  Continuous domain, Gaussian variables. The transition model is assumed to be linear.
Prior at time t_1  P(X_1)
Motion model  P(X_{t+1}|X_t): X_{t+1} = F X_t + ε_t, ε_t ∼ N(0, Σ_x)
Sensor model  P(Y_t|X_t): Y_t = H X_t + η_t, η_t ∼ N(0, Σ_y)
Update  µ_{t+1} = F µ_t + K_{t+1} (y_{t+1} − H F µ_t)
Σ_{t+1} = (I − K_{t+1} H)(F Σ_t Fᵀ + Σ_x)
Gain  K_{t+1} = (F Σ_t Fᵀ + Σ_x) Hᵀ (H (F Σ_t Fᵀ + Σ_x) Hᵀ + Σ_y)⁻¹
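A minimal numpy sketch of these update and gain equations for a made-up 1-D constant-velocity model (F, H and the noise levels are assumptions for illustration):

    import numpy as np

    # State [position, velocity], observe position only.
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # motion model
    H = np.array([[1.0, 0.0]])               # sensor model
    Sx = 0.01 * np.eye(2)                    # process noise Σ_x
    Sy = np.array([[0.5]])                   # observation noise Σ_y

    def kf_step(mu, Sigma, y):
        P = F @ Sigma @ F.T + Sx                          # predicted covariance
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + Sy)     # gain K_{t+1}
        mu_new = F @ mu + K @ (y - H @ F @ mu)            # mean update
        Sigma_new = (np.eye(2) - K @ H) @ P               # covariance update
        return mu_new, Sigma_new

    mu, Sigma = np.zeros(2), np.eye(2)
    for y in [np.array([1.1]), np.array([2.0]), np.array([2.9])]:
        mu, Sigma = kf_step(mu, Sigma, y)
    print(mu)                                             # posterior mean after three updates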
Dynamic Bayesian Networks  If there is more than one variable, we have a DBN with a BN at each time step called a slice S_t. Temporal edges connect S_{t+1} with S_t. Unrolled DBNs have many loops and exact inference is intractable. The complexity of standard techniques increases at each time step.
Particle Filter  Approximate the posterior by sampling and simulating particles, which are reweighted based on the observations. Can handle non-linear models and multi-modal distributions.
Prediction  x'_i ∼ P(X_{t+1}|x_{i,t})
Conditioning  weight w_i = (1/Z) P(y_{t+1}|x'_i)
Resampling  x_{i,t+1} ∼ Σ_{i=1}^N w_i δ_{x'_i}
Distribution  P(X_{t+1}|y_{1:t+1}) ≈ (1/N) Σ_{i=1}^N δ_{x_{i,t+1}}
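A minimal particle-filter step for a made-up 1-D random-walk motion model with Gaussian observation noise, following the prediction, conditioning, and resampling steps above:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 1_000
    particles = rng.normal(0.0, 1.0, size=N)      # samples of P(X_t | y_{1:t})

    def pf_step(particles, y, sigma_x=0.5, sigma_y=0.8):
        # Prediction: simulate each particle through the motion model.
        proposed = particles + rng.normal(0.0, sigma_x, size=particles.size)
        # Conditioning: weight by the observation likelihood P(y | x').
        w = np.exp(-0.5 * ((y - proposed) / sigma_y) ** 2)
        w /= w.sum()
        # Resampling: draw N particles proportionally to their weights.
        return rng.choice(proposed, size=particles.size, p=w)

    for y in [0.4, 0.9, 1.3]:
        particles = pf_step(particles, y)
    print(particles.mean())                        # posterior mean estimate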
Extended Kalman Filter  The EKF propagates the mean through the non-linear dynamics and then propagates the covariance through a linearization of the system.
Assumed Density Filter  Approximate marginal distributions by projecting them onto simple distributions. Infer parameters θ by minimizing the KL divergence or through moment matching. E.g. the UKF.
Unscented Kalman Filter  The UKF propagates weighted particles through the non-linear dynamics, then approximates the mean and variance of the posterior from these particles.

Probabilistic Planning  Control based on a probabilistic model.
Markov Decision Process  An MDP is a controlled Markov chain defined by
Finite set of states  X = {1, ..., n}
Finite set of actions  A = {1, ..., m}
Transition probabilities  P(x'|x, a)
Reward function  r(x, a) or r(x, a, x'), may be random with mean r(x, a)
Discount factor  γ ∈ [0, 1]
The state is fully observed at every time step.
Bellman Equation  An optimal policy satisfies the BE:
V*(x) = max_{a∈A} [ r(x, a) + γ Σ_{x'∈X} P(x'|x, a) V*(x') ]
V*(x) = max_{a∈A} E_{x'}[ r(x, a) + γ V*(x') ]
Value Iteration  Start with any value function V_0:
V_{t+1}(x) = max_{a∈A} [ r(x, a) + γ Σ_{x'∈X} P(x'|x, a) V_t(x') ]
Stop if ||V_{t+1} − V_t||_∞ ≤ ε. Finds an ε-optimal solution in polynomially many steps. Complexity per iteration: O(n²m).
Max. init. error  ||V_0 − V*||_∞ ≤ 2R_max/(1 − γ), ∀x, a: |r(x, a)| ≤ R_max
After N iterations  γ^N · 2R_max/(1 − γ) ≤ ε  →  N = (log 1/γ)⁻¹ log(2R_max/(ε(1 − γ)))
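A minimal value-iteration sketch on a made-up two-state, two-action MDP (P, r, and γ are illustrative):

    import numpy as np

    # P[a, x, x'] = P(x' | x, a); r[x, a] = reward. Made-up 2-state, 2-action MDP.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
    r = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    gamma, eps = 0.9, 1e-6

    V = np.zeros(2)
    while True:
        # Bellman backup: Q[x, a] = r(x, a) + γ Σ_x' P(x'|x, a) V(x')
        Q = r + gamma * np.einsum("axy,y->xa", P, V)
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() <= eps:
            break
        V = V_new

    print(V_new, Q.argmax(axis=1))   # ε-optimal values and greedy policy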
Policy Iteration  Start with any proper policy µ_k.
Evaluation  V_{µ_k}(x) = r(x, µ_k(x)) + γ Σ_{x'} P(x'|x, µ_k) V_{µ_k}(x')
In matrix form: V_{µ_k} = (I − γ P_{µ_k})⁻¹ r_{µ_k}
Improvement  µ_{k+1}(x) = argmax_a [ r(x, a) + γ Σ_{x'} P(x'|x, a) V_{µ_k}(x') ]
Stop if V_{µ_{k+1}} = V_{µ_k}. PI monotonically improves all values, V_{µ_{k+1}}(x) ≥ V_{µ_k}(x) ∀x. Needs at most m^n iterations; typically much faster. Complexity per iteration: O(n³ + n²m).
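A minimal policy-iteration sketch on the same made-up MDP, using the matrix-inverse evaluation above:

    import numpy as np

    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
    r = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    gamma = 0.9
    policy = np.zeros(2, dtype=int)           # start with any proper policy

    while True:
        # Evaluation: V_µ = (I - γ P_µ)^{-1} r_µ
        P_mu = P[policy, np.arange(2)]        # rows P(x'|x, µ(x))
        r_mu = r[np.arange(2), policy]
        V = np.linalg.solve(np.eye(2) - gamma * P_mu, r_mu)
        # Improvement: greedy policy w.r.t. V_µ
        Q = r + gamma * np.einsum("axy,y->xa", P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy

    print(policy, V)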
POMDP  Partially observable Markov decision process. A POMDP is a controlled hidden Markov model. We only obtain noisy observations Y_t of the hidden state X_t. For a finite horizon T, the set of reachable belief states is exponential in T. Since most belief states are never reached, a POMDP can be approximately solved by: discretizing the belief space by sampling, using point based methods (point based VI and point based PI), dimensionality reduction, or policy gradients with a parametric policy π(b; θ).
Belief-state MDP  A POMDP can be interpreted as an MDP with an enlarged state space, where the states correspond to beliefs P(X_t|y_{1:t}) in the original POMDP.
States  B = {b : b ∈ [0, 1]^n, Σ_{x∈X} b(x) = 1}
Actions  A = {1, ..., m}
Transitions  P(b'|b, a) = Σ_y P(b'|y, a, b) P(y|a, b) = Σ_y P(b'|y, a, b) Σ_{x'} P(y|x') Σ_x P(x'|x, a) b(x)
Reward  r(b, a) = Σ_x b(x) r(x, a)

Learning  Learn models from data.
Parameter Learning  Learn the conditional probabilities of a BN G given a data set D of complete observations. Each X_i is described by a Bernoulli distribution with parameter θ. Estimate for each variable X_i:
θ̂_{X_i|parents(X_i)} = Count(X_i, parents(X_i)) / Count(parents(X_i))
Pseudo-counts  Make prior assumptions about the parameters: hallucinate data α_c and α_l (equivalent to a Beta prior):
θ̂_{X_i|parents(X_i)} = (Count(X_i, parents(X_i)) + α_c) / (Count(parents(X_i)) + α_c + α_l)
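A minimal sketch of the count-based estimate with pseudo-counts for one binary variable X with a single binary parent U (the data-generating numbers and α values are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    # Made-up complete observations of (parent U, child X).
    u = rng.integers(0, 2, size=50)
    x = (rng.random(50) < np.where(u == 1, 0.8, 0.3)).astype(int)

    alpha_c, alpha_l = 1, 1      # pseudo-counts (hallucinated data), acting like a Beta prior

    for parent_val in (0, 1):
        mask = u == parent_val
        count_x = (x[mask] == 1).sum()        # Count(X = 1, U = parent_val)
        count_u = mask.sum()                  # Count(U = parent_val)
        theta = (count_x + alpha_c) / (count_u + alpha_c + alpha_l)
        print(f"theta_hat(X=1 | U={parent_val}) = {theta:.3f}")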
Structure Learning  Learn the structure of a BN. The scoring function S(G; D) quantifies for each structure G the fit to the data D. Find the best structure: G* = argmax_G S(G; D)
MLE Score  Use the maximum likelihood of the data to score the BN:
max_θ log P(D|θ, G) = N Σ_{i=1}^n Î(X_i; parents(X_i)) + const.
where the mutual information I(X_i; X_j) quantifies dependence: I(X_i; X_j) = 0 iff X_i ⊥ X_j, and I(X_i; X_j) = I(X_j; X_i).
I(X_i; X_j) = Σ_{x_i, x_j} P(x_i, x_j) log [ P(x_i, x_j) / (P(x_i) P(x_j)) ]
The empirical mutual information Î(X_i; X_j) uses the empirical probabilities P̂(x_i, x_j) = (1/N) Count(x_i, x_j). The optimal solution to this problem is always the fully connected graph (overfitting).
Bayesian Information Criterion  Regularize the BN using the BIC score. BIC will identify the correct structure as N → ∞.
S_BIC(G; D) = Σ_{i=1}^n Î(X_i; parents(X_i)) − (log N)/(2N) |G|
|G| number of parameters of G
n number of variables
N number of training examples
Finding the optimal BN is NP hard. Local search (add or remove edges to increase the score) may get stuck in local optima.
Chow-Liu Algorithm  Solvable special case: find the optimal tree-shaped Bayes net in time O(|E| log |E|).
1. calculate P̂(X_i, X_j) and Î(X_i, X_j) for all pairs X_i, X_j
2. draw the complete graph with edge weights w_e = Î(X_i, X_j)
3. find the spanning tree T that maximizes Σ_e w_e: greedily add the edge with the largest w_e that does not create a loop
4. pick a root node and orient edges away from the root
Learning with Function Approximation  Use parametrized probability density functions to reduce the search space: P(Y|X_{1:k}) → P'(Y; f(X_{1:k}; θ))
Logistic Regression (Classification)  Logistic regression is an MLE which assumes Bernoulli noise: P(y|x, w) = Ber(y; σ(wᵀx)), y ∈ {0, 1}, with link function σ(wᵀx) = (1 + exp(−wᵀx))⁻¹
w* = argmin_w Σ_{i=1}^n log(1 + exp(−y_i wᵀx_i))  (labels y_i ∈ {−1, +1} in the loss)
Logistic loss  l_logistic(w) = log(1 + exp(−y wᵀx))
MAP logistic regression  Add an L2 (Gaussian prior) or L1 (Laplace prior) regularizer to the loss function.
SGD for logistic regression  w ← w + η_t y x (1 + exp(y wᵀx))⁻¹. The size of the SGD step is based on the confidence (probability of misclassification).
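A minimal sketch of the SGD update above on made-up 2-D data with labels y ∈ {−1, +1}:

    import numpy as np

    rng = np.random.default_rng(0)
    # Made-up 2-D data: class +1 around (+2, +2), class -1 around (-2, -2).
    X = np.vstack([rng.normal(+2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
    y = np.hstack([np.ones(50), -np.ones(50)])

    w = np.zeros(2)
    eta = 0.1
    for epoch in range(20):
        for i in rng.permutation(len(y)):
            # w <- w + η y x (1 + exp(y w^T x))^{-1}; the step shrinks when the
            # point is already classified with high confidence.
            w += eta * y[i] * X[i] / (1.0 + np.exp(y[i] * (w @ X[i])))

    accuracy = np.mean(np.sign(X @ w) == y)
    print(w, accuracy)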
Multi-class Logistic Regression  Maintain one weight vector per class and model
P(Y = i|x, w_1, ..., w_c) = exp(w_iᵀx) / Σ_{j=1}^c exp(w_jᵀx)
Corresponds to the cross-entropy loss: l(y; w_1, ..., w_c) = −log P(Y = y|x, w_1, ..., w_c)

Bayesian Learning  Marginalize the parameters θ to obtain predictions. Learning is reduced to probabilistic inference.
P(Y(x')|D) = ∫ P(θ|D) P(Y(x')|θ) dθ
Gaussian Processes  Normal distribution over functions, parameterized by the covariance function K(x, x'). Set of RVs s.t. Y_A = [Y_{x_1}, ..., Y_{x_k}] ∼ N(µ_A, Σ_AA), with Σ_{i,j} = K(x_i, x_j).
GP Predictions  Closed-form prediction: P(f) = GP(f; µ, K) with observations y_i = f(x_i) + w_i. Then P(f(x)|x_{1:k}, y_{1:k}) = N(f(x); µ_{x|A}, σ²_{x|A}).
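A minimal GP-regression sketch with an RBF kernel, obtained by Gaussian conditioning as in the Multivariate Gaussians entry (kernel, noise level, and data are made up):

    import numpy as np

    def rbf(a, b, lengthscale=1.0):
        # K(x, x') = exp(-(x - x')^2 / (2 l^2)) for 1-D inputs
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

    # Made-up training data with observation noise w_i ~ N(0, 0.1^2).
    x_train = np.array([-2.0, -0.5, 1.0, 2.5])
    y_train = np.sin(x_train) + 0.1 * np.random.default_rng(0).normal(size=4)
    x_test = np.linspace(-3, 3, 5)

    noise = 0.1**2
    K = rbf(x_train, x_train) + noise * np.eye(4)       # Σ_AA (plus noise)
    K_s = rbf(x_test, x_train)                          # cross-covariance
    K_ss = rbf(x_test, x_test)                          # test covariance

    # Posterior mean and covariance = Gaussian conditioning on the observations.
    K_inv = np.linalg.inv(K)
    mean = K_s @ K_inv @ y_train
    cov = K_ss - K_s @ K_inv @ K_s.T
    print(mean)
    print(np.sqrt(np.diag(cov)))                        # predictive std. dev.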
Bayesian Neural Networks  Maintain uncertainty about the weights of a NN; start with a prior over the weights.
Uncertainty
Aleatoric  noise in the observations given a perfect model
Epistemic  uncertainty about the parameters due to limited data

Reinforcement Learning  The agent's actions change the world. Planning in an unknown MDP.
On-policy  the agent has full control
Off-policy  only observational data, no control
Model-based RL  Learn the MDP, optimize the policy on the estimated MDP.
Transitions  P(X_{t+1}|X_t, A) ≈ Count(X_{t+1}, X_t, A) / Count(X_t, A)
Rewards  r(x, a) ≈ (1/N_{x,a}) Σ_{t: X_t = x, A_t = a} R_t
Exploration and Exploitation  Pick the best action, but with small probability pick a random action.
R_max Algorithm  Set unknown r(x, a) to R_max. Add a fairy-tale state x* and set P(x*|x, a) = 1. Repeatedly execute the policy while updating r(x, a) and P(x'|x, a), then recompute the policy. Will reach an ε-optimal policy with probability 1 − δ. Computation time polynomial in |X|, |A|, T, 1/ε, and log(1/δ). Memory required: P(x'|x, a) → O(|X|²|A|), r(x, a) → O(|X||A|).
Model-free RL  Directly estimate the value function.
Q-learning  Q-function V*(x) = max_a Q*(x, a)
Init  Q(x, a) = R_max/(1 − γ) · Π_{t=1}^{T_init} (1 − α_t)⁻¹
Update  Q(x, a) ← (1 − α_t) Q(x, a) + α_t (r + γ max_{a'} Q(x', a'))
Computation time: O(|A|). Memory required: Q(x, a) → O(|X||A|).
If the learning rate α_t satisfies Σ_t α_t → ∞ and Σ_t α_t² < ∞ and actions are random, Q-learning converges to Q*.
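A minimal tabular Q-learning sketch with ε-greedy exploration on the same made-up two-state MDP used above:

    import numpy as np

    rng = np.random.default_rng(0)
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
    r = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    gamma, alpha, eps = 0.9, 0.1, 0.1

    Q = np.zeros((2, 2))          # Q(x, a)
    x = 0
    for t in range(50_000):
        # ε-greedy: exploit the best action, explore with small probability.
        a = rng.integers(2) if rng.random() < eps else Q[x].argmax()
        x_next = rng.choice(2, p=P[a, x])
        reward = r[x, a]
        # Q-learning update: Q <- (1 - α) Q + α (r + γ max_a' Q(x', a'))
        Q[x, a] = (1 - alpha) * Q[x, a] + alpha * (reward + gamma * Q[x_next].max())
        x = x_next

    print(Q, Q.argmax(axis=1))    # learned Q-values and greedy policy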
Parametric Q-function Approximation  Use a parametric Q-function Q(x, a; θ) for large state spaces. Fit the parameters to data:
L(θ) = Σ_{x,a,r,x'} (r + γ max_{a'} Q(x', a'; θ_old) − Q(x, a; θ))²
Deep Q Networks  Use a CNN to approximate Q. Clone the network to maintain constant "target" values.
Double DQN  Use the current network to evaluate the argmax.
Upper Confidence Sampling  Focus exploration where the upper confidence bound ≥ the best lower bound. Statistically certify safety where the lower bound > the safety threshold.