When studying Bayesian models, what we usually care about is computing the posterior probability. Unfortunately, in practical models it is rarely possible to obtain a closed-form posterior from Bayes' theorem alone. That does not diminish our love for Bayesian models: if an exact solution is out of reach, an approximate one is perfectly acceptable in practice :-). Approximation methods fall roughly into two families: stochastic methods (the representative being MCMC; the Gibbs-sampling-based training of LDA from the previous post is one example) and deterministic methods. Variational inference, the topic of this post, belongs to the latter; deterministic approximations are generally faster than stochastic ones, and their convergence is easier to assess. Variational Bayesian inference is a general solution framework, similar in spirit to the EM algorithm, widely used in probabilistic modeling, and an indispensable tool to carry around :-). This post introduces the method from a theoretical angle, aiming to convey its overall structure; for concrete applications I will give links so readers can dig in themselves :-).
Calculus of Variations
For an ordinary function $f(x)$, we can think of $f$ as an operator on real numbers: it maps a real number $x$ to a real number $f(x)$. By analogy, suppose there is an operator $F$ acting on functions, which maps a function $f(x)$ to a real number $F(f(x))$. A common example in machine learning is the entropy $H(p(x))$, which maps a probability density function $p(x)$ to a single value. In programming terms, the objects studied in the calculus of variations are higher-order functions: they take a function as an argument and return a value.
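To make this concrete, here is a small sketch (a discrete distribution stands in for the density; the function names are mine, not from any library): the entropy functional takes a whole function, a pmf, as its argument and returns one number.

```python
import numpy as np

def entropy(p):
    """A functional: maps a whole probability mass function p to one real number."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log 0 as 0
    return float(-np.sum(p * np.log(p)))

# The argument is a function (here a discrete pmf), the result a single real.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
print(entropy(uniform))   # log 4 ≈ 1.3863, the maximum over 4 outcomes
print(entropy(peaked))    # much smaller: a concentrated pmf has low entropy
```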
To find the extrema of a function $f(x)$ we use differentiation: suppose $f$ attains a minimum at $x_0$; then $f'(x_0)=0$, and for any $\epsilon$ close to $0$:

$$f(x_0)\le f(x_0+\epsilon)$$

If we define the function $\Phi(\epsilon)=f(x_0+\epsilon)$, then another way of saying that $f$ attains an extremum at $x_0$ is:

$$\Phi'(0) = \frac{d \Phi(\epsilon)}{d \epsilon} \bigg |_{\epsilon=0}=f'(x_0+0)=f'(x_0)=0$$

How, then, can we find the extremum of a higher-order function, i.e. a functional? The following comes from Wikipedia; since the author knows very little about functional analysis, it is translated here for reference :-).
Consider the functional:

$$J(y)=\int_{x_1}^{x_2}L(y(x), y'(x), x)\, dx$$

where $x_1, x_2$ are constants, $y(x)$ is twice continuously differentiable, and $L$ is twice continuously differentiable in $y$, $y'$, and $x$.
Suppose the functional $J(y)$ attains a minimum at $y=f$. Then for any function $\eta$ satisfying $\eta(x_1)=0$ and $\eta(x_2)=0$, and any sufficiently small $\epsilon$, the following inequality holds:

$$J(f) \le J(f+\epsilon \eta)$$

where $\epsilon \eta$ is called a variation of the function $f$, written $\delta f$.
Consider the function

$$\Phi(\epsilon) = J(f + \epsilon \eta)$$

As before, we have:

$$\Phi'(0)=\frac{d \Phi(\epsilon)}{d \epsilon}\bigg |_{\epsilon=0}=\frac{d J(f+\epsilon \eta)}{d\epsilon}\bigg |_{\epsilon=0}=\int _{x_1}^{x_2} \frac{d L}{d \epsilon} \bigg |_{\epsilon=0} dx = 0$$

where

$$\frac{d L}{d \epsilon} =\frac{\partial L}{\partial y}\frac{\partial y}{\partial \epsilon} + \frac{\partial L}{\partial y'}\frac{\partial y'}{\partial \epsilon}$$

Since $y = f+\epsilon \eta$ and $y' = f' + \epsilon \eta'$, we have

$$\frac{\partial y}{\partial \epsilon}=\eta, \qquad \frac{\partial y'}{\partial \epsilon}=\eta'$$

Substituting gives:

$$\frac{d L}{d \epsilon} =\frac{\partial L}{\partial y}\eta + \frac{\partial L}{\partial y'}\eta'$$

Then, by integration by parts:

$$\int _{x_1}^{x_2} \frac{d L}{d \epsilon} \bigg |_{\epsilon=0} dx =\int _{x_1}^{x_2} \bigg \{ \frac{\partial L}{\partial y}\eta + \frac{\partial L}{\partial y'}\eta'\bigg \} \bigg|_{\epsilon=0}dx \\= \int _{x_1}^{x_2} \eta\bigg \{ \frac{\partial L}{\partial f} - \frac{d}{dx} \frac{\partial L}{\partial f'}\bigg \} dx + \frac{\partial L}{\partial f'}\eta\bigg|_{x_1}^{x_2}\\= \int _{x_1}^{x_2} \eta\bigg \{ \frac{\partial L}{\partial f} - \frac{d}{dx} \frac{\partial L}{\partial f'}\bigg \} dx =0$$

Here, when $\epsilon=0$, $y \to f$ and $y' \to f'$; and since $\eta$ vanishes at $x_1$ and $x_2$, the boundary term $\frac{\partial L}{\partial f'}\eta\big|_{x_1}^{x_2}=0$. Finally, by the fundamental lemma of the calculus of variations, we obtain the Euler-Lagrange equation:

$$\frac{\partial L}{\partial f} - \frac{d}{dx} \frac{\partial L}{\partial f'}=0$$

Note that the Euler-Lagrange equation is only a necessary condition for the functional to attain an extremum, not a sufficient one. Still, it gives us a first feel for the calculus of variations, and one concrete method for finding extrema of functionals.
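A classic sanity check, sketched numerically (the grid and curves below are my own illustration): for the arc-length functional with $L=\sqrt{1+y'^2}$, the Euler-Lagrange equation reduces to $f''=0$, i.e. a straight line, and indeed $J(f)\le J(f+\epsilon \eta)$ for perturbations $\eta$ vanishing at the endpoints.

```python
import numpy as np

# Arc-length functional J(y) = ∫ sqrt(1 + y'^2) dx on [0, 1], discretized as a
# polyline; for this L the Euler-Lagrange equation gives f'' = 0, a straight line.
x = np.linspace(0.0, 1.0, 2001)

def J(y):
    dy = np.diff(y) / np.diff(x)              # slope on each small interval
    return float(np.sum(np.sqrt(1.0 + dy**2) * np.diff(x)))

f = 2.0 * x                        # candidate extremum: a straight line
eta = np.sin(np.pi * x)            # perturbation with eta(x1) = eta(x2) = 0

for eps in (0.3, 0.1, 0.01):
    assert J(f) <= J(f + eps * eta)    # J(f) ≤ J(f + ε η) for every ε tried

print(J(f))   # ≈ sqrt(5) ≈ 2.2360, the exact length of the line from (0,0) to (1,2)
```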
Mean Field Theory
Yes, variational inference really is closely tied to this theory, so it is worth a brief look now to avoid confusion later. First, the Wikipedia description:
In physics and probability theory, mean field theory (MFT also known as self-consistent field theory) studies the behavior of large and complex stochastic models by studying a simpler model. Such models consider a large number of small individual components which interact with each other. The effect of all the other individuals on any given individual is approximated by a single averaged effect, thus reducing a many-body problem to a one-body problem.
Regrettably, the author cannot offer an intuitive explanation of mean field theory and welcomes readers' insight. In short, mean field theory is a way of simplifying complex models. For example, given a probability model

$$P(x_1,x_2,x_3,...,x_n)=P(x_1)P(x_2|x_1)P(x_3|x_2,x_1)...P(x_n|x_{n-1},x_{n-2},x_{n-3},...,x_1)$$

mean field theory lets us look for another model

$$Q(x_1,x_2,x_3,...,x_n)=Q(x_1)Q(x_2)Q(x_3)...Q(x_n)$$

and make $Q$ agree with $P$ as closely as possible, so that $Q$ can serve as an approximate stand-in for $P(x_1,x_2,x_3,...,x_n)$.
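A tiny numerical illustration of the idea (the joint distribution below is made up): take a correlated joint $P(x_1,x_2)$ and approximate it by the product of its marginals. The factorized $Q$ is close to $P$ but, by construction, cannot represent the dependence exactly.

```python
import numpy as np

# A correlated joint P(x1, x2) over two binary variables.
P = np.array([[0.40, 0.10],
              [0.10, 0.40]])

# Mean-field style factorized approximation: Q(x1, x2) = Q(x1) Q(x2),
# here simply the product of the marginals of P.
q1 = P.sum(axis=1)               # marginal of x1
q2 = P.sum(axis=0)               # marginal of x2
Q = np.outer(q1, q2)

kl = float(np.sum(P * np.log(P / Q)))   # KL(P || Q): the price of ignoring correlation
print(Q)    # [[0.25 0.25]
            #  [0.25 0.25]]
print(kl)   # > 0: the factorized model cannot match the dependence exactly
```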
Variational Bayesian Inference
In Bayesian models we usually need to compute the posterior $P(Z|X)$, but in many practical models computing $P(Z|X)$ directly is intractable. Using mean field theory, we instead look for another model $Q(Z)=\prod _{i} Q(z_i)$ to approximate $P(Z|X)$; this factorization is the only assumption variational Bayesian inference makes! The problem then becomes how to find such a $Q(Z)$.
Readers familiar with information theory will know that the difference between two probability distributions can be measured with the KL divergence. So the problem becomes: find the $Q(Z)$ that minimizes
$$KL(Q||P) = \int Q(Z) \log\frac{Q(Z)}{P(Z|X)}\, dZ$$

The KL divergence is asymmetric, so why use $KL(Q||P)$ rather than $KL(P||Q)$? Consider a figure from Chapter 10 of PRML: the green contours show the distribution $P$; the red contours in panel (a) are obtained by minimizing $KL(Q||P)$, i.e. variational inference, while the red contours in panel (b) come from minimizing $KL(P||Q)$. Minimizing $KL(P||Q)$ corresponds to a different approximation framework, Expectation Propagation, which is beyond the scope of this post, so we set it aside.
Now that we have our objective, minimizing $KL(Q||P)$, the next step is to find the $Q$ that achieves the minimum. Rearranging the formula:

$$KL(Q||P) = \int Q(Z) \log\frac{Q(Z)}{P(Z|X)} dZ\\=-\int Q(Z) \log\frac{P(Z|X)}{Q(Z)} dZ\\=-\int Q(Z) \log\frac{P(Z,X)}{Q(Z)P(X)} dZ\\=\int Q(Z) [\log Q(Z)+\log P(X)] dZ-\int Q(Z) \log P(Z,X)dZ\\=\log P(X)+\int Q(Z) \log Q(Z) dZ-\int Q(Z) \log P(Z,X)dZ$$
Define

$$L(Q) =\int Q(Z) \log P(Z,X)\,dZ-\int Q(Z) \log Q(Z)\, dZ$$

Then:

$$\log P(X) = KL(Q||P) + L(Q)$$

Our goal is to minimize $KL(Q||P)$. Since the log data likelihood $\log P(X)$ does not depend on $Q$, it can be treated as a constant. So to minimize $KL(Q||P)$ we can instead maximize $L(Q)$, and the objective becomes:

$$\max \ L(Q)$$
Because $KL(Q||P) \ge 0$, we have

$$\log P(X) \ge L(Q)$$

so $L(Q)$ can be viewed as a lower bound on $\log P(X)$, usually called the ELBO (Evidence Lower Bound). In other words, we approach the log likelihood $\log P(X)$ by maximizing its lower bound.
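The decomposition $\log P(X) = KL(Q||P) + L(Q)$ can be checked numerically on a toy discrete model (the numbers below are arbitrary):

```python
import numpy as np

# Toy discrete model: Z takes 3 values; joint P(Z, X = x_obs) given as a vector.
joint = np.array([0.10, 0.25, 0.15])   # P(Z = z, X = x_obs)
evidence = joint.sum()                 # P(X = x_obs)
posterior = joint / evidence           # P(Z | X = x_obs)

Q = np.array([0.5, 0.3, 0.2])          # an arbitrary variational distribution

kl = float(np.sum(Q * np.log(Q / posterior)))                      # KL(Q || P(Z|X))
elbo = float(np.sum(Q * np.log(joint)) - np.sum(Q * np.log(Q)))    # L(Q), the ELBO

print(np.log(evidence))   # log P(X)
print(kl + elbo)          # exactly the same number: log P(X) = KL(Q||P) + L(Q)
```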
Good. The objective function is now:

$$L(Q) =\int Q(Z) \log P(Z,X)\,dZ-\int Q(Z) \log Q(Z)\, dZ$$

The mean field assumption is:

$$Q(Z) = \prod_i Q(z_i)$$

Start with the second term on the right-hand side:

$$\int Q(Z) \log Q(Z) dZ = \int \prod_i Q(z_i)\log\prod_j Q(z_j) dZ \\= \int \prod_i Q(z_i)\sum_j \log Q(z_j) dZ \\= \sum_j \int \prod_i Q(z_i)\log Q(z_j) dZ \\= \sum_j \int Q(z_j)\log Q(z_j)dz_j \int \prod_{i: i\neq j} Q(z_i)\, dz_i \\= \sum_j \int Q(z_j)\log Q(z_j)dz_j$$

Remarkable: with nothing but the mean field assumption, the originally high-dimensional distribution $Q(Z)$ decomposes into single-variable terms. Note the notation $dZ = \prod_i dz_i$ here (it does not denote differentiation of a vector), and since $\int Q(z_i) dz_i = 1$, we have $\int \prod_{i: i\neq j} Q(z_i)\, dz_i=\prod_{i: i\neq j} \int Q(z_i) dz_i=1$.
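This factorization of the entropy-like term can be verified on a small discrete example (sums standing in for the integrals; the numbers are arbitrary):

```python
import numpy as np

# Discrete check of the decomposition: if Q(z1, z2) = Q(z1) Q(z2), then
# E_Q[log Q(Z)] = sum over j of E[log Q(z_j)] — the joint term splits per factor.
q1 = np.array([0.2, 0.8])
q2 = np.array([0.5, 0.3, 0.2])
Q = np.outer(q1, q2)                   # Q(z1, z2) = Q(z1) Q(z2)

lhs = float(np.sum(Q * np.log(Q)))                                    # ∫ Q(Z) log Q(Z) dZ
rhs = float(np.sum(q1 * np.log(q1)) + np.sum(q2 * np.log(q2)))        # per-factor sums
print(lhs, rhs)   # identical values
```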
Now the other part:

$$\int Q(Z) \log P(Z,X)dZ = \int \prod_i Q(z_i) \log P(Z,X)dZ\\=\int Q(z_j) \bigg (\int \prod_{i:i\neq j} Q(z_i) \log P(Z,X)\,dz_i \bigg )dz_j\\=\int Q(z_j)\, E_{i\neq j}[\log P(Z,X)]\, dz_j \\= \int Q(z_j)\log \big \{\exp( E_{i\neq j}[\log P(Z,X)])\big \} dz_j \\=\int Q(z_j)\log\frac{\exp(E_{i\neq j}[\log P(Z,X)])}{\int \exp(E_{i\neq j}[\log P(Z,X)])\,dz_j}\, dz_j - C\\= \int Q(z_j)\log Q^*(z_j)\, dz_j - C$$

where $E_{i\neq j}[\cdot]$ denotes the expectation with respect to all factors $Q(z_i)$ with $i\neq j$, and $C$ is a constant that does not depend on $Q(z_j)$ (it absorbs the log of the normalizing integral).
Combining the two parts gives:

$$L(Q) = \int Q(z_j)\log Q^*(z_j)\, dz_j - \sum_j\int Q(z_j)\log Q(z_j)\,dz_j - C \\=\int Q(z_j)\log\frac{Q^*(z_j)}{Q(z_j)}\, dz_j -\sum_{i:i \neq j} \int Q(z_i)\log Q(z_i)\,dz_i - C \\= -KL(Q(z_j)||Q^*(z_j)) + \sum_{i:i\neq j}H(Q(z_i)) - C$$

where $H(Q(z_i)) = -\int Q(z_i) \log Q(z_i)\,dz_i$ is the entropy. When we optimize a single factor $Q(z_j)$ with all other factors held fixed, every term except $-KL(Q(z_j)||Q^*(z_j))$ is constant with respect to $Q(z_j)$, and since

$$KL(Q(z_j)||Q^*(z_j)) \ge 0$$

maximizing $L(Q)$ over $Q(z_j)$ amounts to making

$$-KL(Q(z_j)||Q^*(z_j))=0$$

that is, setting

$$Q(z_j) = Q^*(z_j) = \frac{\exp(E_{i\neq j}[\log P(Z,X)])}{\text{normalizing constant}}$$

If you prefer, the same optimum can also be derived directly with the calculus of variations, combined with Lagrange multipliers:
$$\frac{\delta}{\delta Q(z_j)} \bigg \{ \int Q(z_j)\log Q^*(z_j)\, dz_j - \int Q(z_j)\log Q(z_j)\,dz_j + \lambda_j\Big( \int Q(z_j)\,dz_j -1\Big)\bigg \}$$

Setting this variation to zero and working through the algebra yields the same result as above.
At this point we have derived, in full generality, how variational Bayesian inference computes its updates. The algorithm is:
- Loop until convergence:
  - For each factor $Q(z_j)$:
    - Set $Q(z_j) = Q^*(z_j)$
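The loop above can be sketched on a concrete model (I use the 2-D Gaussian target from PRML §10.1.2; the specific mean and covariance below are made up). For $P(Z)=\mathcal{N}(m,\Sigma)$ with mean-field $Q(z_1)Q(z_2)$, each optimal factor $Q^*(z_j)$ is itself Gaussian with a closed-form mean update, so "set $Q(z_j)=Q^*(z_j)$" becomes a single line:

```python
import numpy as np

# Mean-field coordinate updates for a 2-D Gaussian target P(z) = N(m, Sigma).
# Each Q*(z_j) ∝ exp(E_{i≠j}[log P(z)]) is Gaussian with mean
#   m_j - Λ_jj^{-1} Λ_jk (E[z_k] - m_k)   and variance 1/Λ_jj.
m = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
Lam = np.linalg.inv(Sigma)               # precision matrix Λ

mu = np.zeros(2)                         # current means of Q(z1), Q(z2)
for _ in range(100):                     # "loop until convergence"
    for j in (0, 1):                     # "for each factor Q(z_j): set it to Q*(z_j)"
        k = 1 - j
        mu[j] = m[j] - Lam[j, k] / Lam[j, j] * (mu[k] - m[k])

print(mu)                  # converges to the true mean [1, -1]
print(1.0 / np.diag(Lam))  # factor variances 1/Λ_jj, smaller than Sigma's diagonal
```

Note that the converged variances understate the marginal variances of $P$: this is exactly the compact, mode-hugging behavior of minimizing $KL(Q||P)$ discussed earlier.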
Although we have derived the variational inference framework in theory, for each concrete model we must still derive $Q^*(z_j)$ by hand. Briefly, deriving a variational Bayesian model takes four steps:
- Choose conjugate prior distributions for the model parameters (if you want a full Bayesian model)
- Write down the model's joint distribution $P(Z,X)$
- Determine the form of the variational distribution $Q(Z)$ from the joint distribution
- For each variational factor $Q(z_j)$, take the expectation of $\log P(Z,X)$ with respect to all variables other than $z_j$, then normalize the result into a probability distribution
This process is not simple: for real models the derivation tends to be long, tedious, and error-prone; readers who want to see it in full can follow the links at the end. This is why researchers later developed a more general, more automated framework based on probabilistic graphical models, VMP (Variational Message Passing). For models in the exponential family, VMP can derive the algorithm automatically :-). Interested readers can consult "Variational Message Passing".
References
"A Tutorial on Variational Bayesian Inference"
"Pattern Recognition and Machine Learning", Chapter 10
Readers interested in seeing variational inference at work on LDA can consult :-)
"Latent Dirichlet Allocation"