
General Descent Methods

Definition

Formally, the descent direction is defined as follows.

**Definition 4 (Descent Direction).** $\boldsymbol{d}$ is a descent direction for $f$ at $\boldsymbol{x}$ if $f(\boldsymbol{x}+t\boldsymbol{d})<f(\boldsymbol{x})$ for all $t>0$ sufficiently small.

Also, the following proposition is easy to obtain:

**Proposition 5 (Descent Direction).** If $f$ is continuously differentiable in a neighborhood of $\boldsymbol{x}$, then any $\boldsymbol{d}$ such that $\boldsymbol{d}^{\top}\nabla f(\boldsymbol{x})<0$ is a descent direction.
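
To see why, a first-order Taylor expansion gives

$$
f(\boldsymbol{x}+t\boldsymbol{d})=f(\boldsymbol{x})+t\,\boldsymbol{d}^{\top}\nabla f(\boldsymbol{x})+o(t),
$$

so if $\boldsymbol{d}^{\top}\nabla f(\boldsymbol{x})<0$, then $f(\boldsymbol{x}+t\boldsymbol{d})<f(\boldsymbol{x})$ for all sufficiently small $t>0$.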

Steepest Descent Direction

Given a norm $\|\cdot\|$, we can define the steepest descent direction with respect to that norm.

$$ \boxed{ \begin{aligned} &\textbf{Definition 5 (Steepest Descent Direction)} \newline &\newline &\Delta_{\|\cdot\|}\boldsymbol{x}\triangleq\underset{\boldsymbol{v}:\|\boldsymbol{v}\|\leq1}{\operatorname{argmin}}\langle\boldsymbol{v},\nabla f(\boldsymbol{x})\rangle \end{aligned} } $$

Note:

  1. Ideally, we would like $\Delta\boldsymbol{x}\triangleq\underset{\boldsymbol{v}:\|\boldsymbol{v}\|\leq1}{\operatorname{argmin}}f(\boldsymbol{x}+\boldsymbol{v})$. Since minimizing $f$ directly is impractical, we resort to the first-order approximation $f(\boldsymbol{x}+\boldsymbol{v})\approx f(\boldsymbol{x})+\langle\boldsymbol{v},\nabla f(\boldsymbol{x})\rangle$, where $f(\boldsymbol{x})$ is a constant.
  2. Clearly, if the 2-norm is used, the constraint set is the high-dimensional unit ball (the minimizer lies on its boundary sphere), and the resulting direction is equivalent to the gradient-descent direction.

Examples: Different Norms

$$
\Delta_{\|\cdot\|_2}\boldsymbol{x}=-\frac{\nabla f(\boldsymbol{x})}{\|\nabla f(\boldsymbol{x})\|_2},\qquad
\Delta_{\|\cdot\|_1}\boldsymbol{x}=-\operatorname{sign}\!\left(\frac{\partial f(\boldsymbol{x})}{\partial x_l}\right)\boldsymbol{e}_l,\ \ l=\underset{i\in\{1,\dots,n\}}{\operatorname{argmax}}\left|\frac{\partial f(\boldsymbol{x})}{\partial x_i}\right|,\qquad
\Delta_{\|\cdot\|_A}\boldsymbol{x}=-\frac{A^{-1}\nabla f(\boldsymbol{x})}{\|\nabla f(\boldsymbol{x})\|_{A^{-1}}}
$$
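
A minimal sketch of why the 1-norm formula holds (the 2-norm and $A$-norm cases follow the same pattern with the corresponding dual norm): for any $\boldsymbol{v}$ with $\|\boldsymbol{v}\|_1\le1$, Hölder's inequality gives

$$
\langle\boldsymbol{v},\nabla f(\boldsymbol{x})\rangle\ \ge\ -\|\boldsymbol{v}\|_1\|\nabla f(\boldsymbol{x})\|_\infty\ \ge\ -\max_{i}\left|\frac{\partial f(\boldsymbol{x})}{\partial x_i}\right|,
$$

and the candidate $\boldsymbol{v}=-\operatorname{sign}\!\left(\frac{\partial f(\boldsymbol{x})}{\partial x_l}\right)\boldsymbol{e}_l$ with $l=\underset{i}{\operatorname{argmax}}\left|\frac{\partial f(\boldsymbol{x})}{\partial x_i}\right|$ attains this lower bound, so it is a minimizer.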

Therefore, for these three norms, the descent direction $\boldsymbol{d}$ is:

$$
d_{\mathrm{gd}}=-\frac{1}{\|\nabla f(\boldsymbol{x})\|_2}\nabla f(\boldsymbol{x}),\qquad
d_{\mathrm{cd}}=-\operatorname{sign}\!\left(\frac{\partial f(\boldsymbol{x})}{\partial x_l}\right)\boldsymbol{e}_l,\ \ l=\underset{i\in\{1,\dots,n\}}{\operatorname{argmax}}\left|\frac{\partial f(\boldsymbol{x})}{\partial x_i}\right|,\qquad
d_{A}=-\frac{A^{-1}\nabla f(\boldsymbol{x})}{\|\nabla f(\boldsymbol{x})\|_{A^{-1}}}
$$
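
These closed forms are straightforward to evaluate once the gradient is available. Below is a minimal NumPy sketch; the quadratic objective, the point `x`, and the positive-definite matrix `A` are illustrative assumptions, not taken from the notes above.

```python
import numpy as np

# Illustrative objective: f(x) = 0.5 * x^T A x - b^T x; the symmetric
# positive definite matrix A also defines the A-norm.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad_f(x):
    return A @ x - b

x = np.array([0.5, 1.5])
g = grad_f(x)

# 2-norm: normalized negative gradient (gradient descent direction).
d_gd = -g / np.linalg.norm(g)

# 1-norm: move along the coordinate with the largest gradient magnitude.
l = np.argmax(np.abs(g))
d_cd = np.zeros_like(g)
d_cd[l] = -np.sign(g[l])

# A-norm: normalized Newton-like step -A^{-1} g / ||g||_{A^{-1}}.
A_inv_g = np.linalg.solve(A, g)
d_A = -A_inv_g / np.sqrt(g @ A_inv_g)

# All three are descent directions: their inner product with g is negative.
print(d_gd @ g, d_cd @ g, d_A @ g)
```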

Example: TRPO

TRPO (Trust Region Policy Optimization) is an optimization scheme used in reinforcement learning.

Let $\pi_\theta$ denote a policy with parameters $\theta$. The theoretical TRPO update is:

$$
\theta_{k+1}=\underset{\theta}{\operatorname{argmax}}\ \mathcal{L}(\theta_k,\theta)\quad\text{s.t.}\quad\bar{D}_{\mathrm{KL}}(\theta\,\|\,\theta_k)\le\delta
$$

where $\mathcal{L}(\theta_k,\theta)$ is the surrogate advantage, a measure of how policy $\pi_\theta$ performs relative to the old policy $\pi_{\theta_k}$ using data from the old policy:

$$
\mathcal{L}(\theta_k,\theta)=\mathbb{E}_{\tau\sim\pi_{\theta_k}}\left[\sum_{t=0}^{T}\gamma^{t}\,\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta_k}(a_t\mid s_t)}\,A^{\pi_{\theta_k}}(s_t,a_t)\right]
$$

and $\bar{D}_{\mathrm{KL}}(\theta\,\|\,\theta_k)$ is an average KL-divergence between the two policies across the states visited by the old policy:

$$
\bar{D}_{\mathrm{KL}}(\theta\,\|\,\theta_k)=\mathbb{E}_{s\sim\pi_{\theta_k}}\Big[D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot\mid s)\,\big\|\,\pi_{\theta_k}(\cdot\mid s)\big)\Big]
$$
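
As a purely illustrative sketch of these two quantities, the snippet below estimates them from sampled data for small categorical policies represented directly by action-probability tables (no neural network). The random advantage values, the dropped discount weighting, and all array sizes are assumptions for illustration, not part of TRPO itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative categorical policies over 4 actions on 256 sampled states.
n_states, n_actions = 256, 4
pi_old = rng.dirichlet(np.ones(n_actions), size=n_states)   # pi_{theta_k}(.|s)
pi_new = rng.dirichlet(np.ones(n_actions), size=n_states)   # pi_theta(.|s)

# Data collected with the old policy: one action and one advantage estimate
# per state (advantages are random placeholders; discounting is dropped).
actions = np.array([rng.choice(n_actions, p=p) for p in pi_old])
adv = rng.normal(size=n_states)

rows = np.arange(n_states)
ratio = pi_new[rows, actions] / pi_old[rows, actions]

# Surrogate advantage: importance-weighted advantages under old-policy data.
L_surrogate = np.mean(ratio * adv)

# Average KL divergence E_s[ D_KL(pi_theta(.|s) || pi_{theta_k}(.|s)) ].
mean_kl = np.mean(np.sum(pi_new * np.log(pi_new / pi_old), axis=1))

print(L_surrogate, mean_kl)
```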

The theoretical TRPO update isn't the easiest to work with, so TRPO makes some approximations to get an answer quickly. We Taylor-expand the objective and constraint to leading order around $\theta_k$:

$$
\mathcal{L}(\theta_k,\theta)\approx g^{\top}(\theta-\theta_k),\qquad
\bar{D}_{\mathrm{KL}}(\theta\,\|\,\theta_k)\approx\frac{1}{2}(\theta-\theta_k)^{\top}H(\theta-\theta_k)
$$
  • $g=\nabla_{\theta}\mathcal{L}(\theta_k,\theta)\big|_{\theta=\theta_k}$ is the gradient of the surrogate advantage at the old parameters
  • By the properties of the KL-divergence, $D_{\mathrm{KL}}(\pi_{\theta_k}\,\|\,\pi_{\theta})\big|_{\theta=\theta_k}=0$ and $\nabla_{\theta}D_{\mathrm{KL}}(\pi_{\theta_k}\,\|\,\pi_{\theta})\big|_{\theta=\theta_k}=0$; thus the second-order Taylor expansion of $\bar{D}_{\mathrm{KL}}$ consists only of the Hessian term, with $H=\nabla^{2}_{\theta}\bar{D}_{\mathrm{KL}}(\theta\,\|\,\theta_k)\big|_{\theta=\theta_k}$

resulting in an approximate optimization problem,

$$
\theta_{k+1}=\underset{\theta}{\operatorname{argmax}}\ g^{\top}(\theta-\theta_k)\quad\text{s.t.}\quad\frac{1}{2}(\theta-\theta_k)^{\top}H(\theta-\theta_k)\le\delta
$$
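
Solving this problem with a Lagrange multiplier gives a closed-form update, which makes the connection to the steepest descent directions above explicit: it is the $A$-norm steepest *ascent* analogue of $\Delta_{\|\cdot\|_A}\boldsymbol{x}$ with $A=H$, scaled to lie on the boundary of the trust region:

$$
\theta_{k+1}=\theta_k+\sqrt{\frac{2\delta}{g^{\top}H^{-1}g}}\,H^{-1}g.
$$

In practice $H^{-1}g$ is not formed explicitly; TRPO implementations typically approximate it with the conjugate gradient method using Hessian-vector products, followed by a backtracking line search to enforce the exact KL constraint.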

Randomized Schemes

Coordinate Descent

For the coordinate descent method, the descent direction of $f$ at $\boldsymbol{x}$ can be chosen randomly, and is given by $$ d_{cd\text{-rand}} \triangleq -[\nabla f(\boldsymbol{x})]_{i_k}\, \boldsymbol{e}_{i_k}, $$ where $i_k$ is chosen uniformly at random from $\{1,2,\dots,n\}$ at each iteration $k$.

Note: since $\big\langle -[\nabla f(\boldsymbol{x})]_{i_k}\boldsymbol{e}_{i_k},\,\nabla f(\boldsymbol{x})\big\rangle=-[\nabla f(\boldsymbol{x})]_{i_k}^{2}\le 0$ (strictly negative whenever the $i_k$-th partial derivative is nonzero), the descent condition is satisfied.
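
A minimal sketch of the resulting randomized coordinate descent loop; the quadratic objective, the fixed step size `alpha`, and the iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative objective: f(x) = 0.5 * ||x||^2 + sum(x), whose gradient is
# x + 1 and whose minimizer is x = -1 in every coordinate.
def grad_f(x):
    return x + 1.0

x = rng.normal(size=5)
alpha = 0.1   # fixed step size (illustrative)

for k in range(500):
    i_k = rng.integers(x.size)      # i_k uniform over {0, ..., n-1}
    d = np.zeros_like(x)
    d[i_k] = -grad_f(x)[i_k]        # d = -[grad f(x)]_{i_k} e_{i_k}
    x = x + alpha * d

print(x)   # close to the minimizer (-1, ..., -1)
```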

Stochastic Gradient

For the stochastic gradient method, the descent direction of $f$ at $\boldsymbol{x}_k$ is given by $$ d_{sgc} \triangleq -g\left(\boldsymbol{x}_k, \xi_k\right), $$ where $\xi_k$ is a random variable such that $\mathbb{E}_{\xi_k}\big[g(\boldsymbol{x}_k,\xi_k)\big]=\nabla f(\boldsymbol{x}_k)$. That is, $g(\boldsymbol{x}_k,\xi_k)$ is an unbiased (but often very noisy) estimate of the true gradient $\nabla f(\boldsymbol{x}_k)$.

Note: the most common instance is (mini-)batch gradient descent: sample a batch uniformly (without replacement), then take a descent step along the batch gradient.
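
A minimal sketch of this for a finite-sum least-squares objective, where the unbiased estimate $g(\boldsymbol{x}_k,\xi_k)$ is the gradient of a uniformly sampled mini-batch; the data, batch size, step size, and iteration count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative finite-sum objective: f(x) = (1/N) * sum_i 0.5 * (a_i^T x - b_i)^2.
N, dim = 1000, 5
A = rng.normal(size=(N, dim))
x_true = rng.normal(size=dim)
b = A @ x_true

def batch_grad(x, idx):
    # Gradient of the mini-batch average; uniform sampling makes it an
    # unbiased estimate of the full gradient: E_xi[ g(x, xi) ] = grad f(x).
    residual = A[idx] @ x - b[idx]
    return A[idx].T @ residual / idx.size

x = np.zeros(dim)
batch_size, alpha = 32, 0.05

for k in range(2000):
    idx = rng.choice(N, size=batch_size, replace=False)   # sample a batch uniformly
    d_sgc = -batch_grad(x, idx)                           # d_sgc = -g(x_k, xi_k)
    x = x + alpha * d_sgc

print(np.linalg.norm(x - x_true))   # should be close to 0
```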