Contents

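1 Prerequisites
2 Decision Transformer

Prerequisites

Transformer

This article assumes familiarity with how the Transformer works, which is built mainly on Dot Product Attention. The following article covers that background and can be read alongside this one.

A network-analysis view of how the Transformer works and how data flows through it

Reinforcement learning basics

The following article collects background that is useful for understanding reinforcement learning with Deep Learning.

・Understanding ChatGPT from its mechanisms (by 統計の森)

For a deeper treatment of the basics of reinforcement learning, 「ゼロから作るDeep Learning④ー強化学習編」 is recommended in Japanese and "Reinforcement Learning, second edition: An Introduction" in English.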

Offline RL

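In many reinforcement learning algorithms, such as Deep Q Network and AlphaZero, trajectories are generated through interaction between the agent and the environment, and decision making is optimized through value function approximation or policy optimization.

Alternatively, decision making can be optimized from trajectories that were collected in advance, using training that resembles supervised learning. This framework is called offline reinforcement learning (Offline RL).

Decision Transformer uses this Offline RL framework, so it is worth knowing that this research area exists. For details on Offline RL, see the references below.

ゼロから作るDeep Learning ❹ ―強化学習編

・Transformer paper: Attention Is All You Need [2017]
・Decision Transformer paper
 ・Offline RL paper
 ・A Survey on Transformers in Reinforcement Learning
・Transformer Decoder paper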

Decision Transformer

Causal Transformer

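The figure above outlines the processing of Decision Transformer. It shows that the main processing is carried out by a Causal Transformer, and Section $3$ (Method) of the Decision Transformer paper confirms that the processing is essentially the same as in GPT.

GPT (Generative Pre-Training) is based on autoregressive processing. The details are covered in the following article on the Transformer Decoder, which the GPT paper builds on.

Classifying Transformer architectures: Encoder-Decoder, Decoder only, and others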

returns-to-go

Instead of using the reward $r_{t}$ of ordinary reinforcement learning directly, Decision Transformer uses the sum of future rewards $\hat{R}_{t}$ defined as follows.
$$
\large
\begin{align}
\hat{R}_{t} = \sum_{t'=t}^{T} r_{t'} \quad (1)
\end{align}
$$

The Decision Transformer paper calls this quantity "returns-to-go". In reinforcement learning a trajectory $\tau$ is often written as $\tau=(r_0,s_0,a_0,r_1,s_1,a_1, \cdots)$, whereas Decision Transformer defines the trajectory $\tau$ with the returns-to-go of equation $(1)$ as follows.
$$
\large
\begin{align}
\tau=(\hat{R}_0,s_0,a_0,\hat{R}_1,s_1,a_1, \cdots , \hat{R}_{t}, s_{t}, a_{t}, \cdots)
\end{align}
$$

Here $s_{t}$ is the state at time step $t$ and $a_{t}$ is the action at time step $t$.
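As a minimal sketch of equation $(1)$, the returns-to-go of one episode can be computed with a single backward pass over the rewards. The function name `returns_to_go` and the reward values are illustrative assumptions, not taken from the paper.

```python
def returns_to_go(rewards):
    """Return [R_hat_0, ..., R_hat_T], where R_hat_t is the sum of rewards[t:]."""
    rtg = []
    running = 0.0
    # Accumulate from the last time step backwards so each sum is built in O(1).
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


rewards = [1.0, 0.0, 2.0, 1.0]  # hypothetical rewards r_0 .. r_3
print(returns_to_go(rewards))   # [4.0, 3.0, 3.0, 1.0]
```

The first entry is the total return of the episode, and each later entry drops the rewards already received, which is the same decrement the evaluation loop of the pseudocode applies step by step.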

Decision Transformer pseudocode

Algorithm $1$ of the Decision Transformer paper gives pseudocode for Decision Transformer.

# R, s, a, t: returns-to-go, states, actions, or timesteps
# transformer: transformer with causal masking (GPT)
# embed_s, embed_a, embed_R: linear embedding layers
# embed_t: learned episode positional embedding
# pred_a: linear action prediction layer

# main model
def DecisionTransformer(R, s, a, t):
    # compute embeddings for tokens
    pos_embedding = embed_t(t)  # per-timestep (note: not per-token)
    s_embedding = embed_s(s) + pos_embedding
    a_embedding = embed_a(a) + pos_embedding
    R_embedding = embed_R(R) + pos_embedding

    # interleave tokens as (R_1, s_1, a_1, ..., R_K, s_K)
    input_embeds = stack(R_embedding, s_embedding, a_embedding)

    # use transformer to get hidden states
    hidden_states = transformer(input_embeds=input_embeds)

    # select hidden states for action prediction tokens
    a_hidden = unstack(hidden_states).actions

    # predict action
    return pred_a(a_hidden)

# training loop
for (R, s, a, t) in dataloader:  # dims: (batch_size, K, dim)
    a_preds = DecisionTransformer(R, s, a, t)
    loss = mean((a_preds - a)**2)  # L2 loss for continuous actions
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# evaluation loop
target_return = 1  # for instance, expert-level return
R, s, a, t, done = [target_return], [env.reset()], [], [1], False
while not done:  # autoregressive generation/sampling
    # sample next action
    action = DecisionTransformer(R, s, a, t)[-1]  # for cts actions
    new_s, r, done, _ = env.step(action)

    # append new tokens to sequence
    R = R + [R[-1] - r]  # decrement returns-to-go with reward
    s, a, t = s + [new_s], a + [action], t + [len(R)]
    R, s, a, t = R[-K:], ...  # only keep context length of K

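The pseudocode is easiest to follow starting from the line a_preds = DecisionTransformer(R, s, a, t) in the training loop, which is where Decision Transformer predicts actions. Decision Transformer predicts actions from the states, actions, and returns-to-go it receives as input, together with the timesteps used for the positional embedding.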

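Training Decision Transformer

Training treats the trajectories in the offline dataset as demonstrations, and the Transformer learns to imitate the recorded behavior. Because each prediction is conditioned on the returns-to-go, the trajectories do not all have to be optimal: a high target return is supplied at evaluation time to ask for good behavior.

The loss compares the predicted action $a_t$ with the action recorded in the training trajectory. Cross Entropy is used for discrete actions and mean-squared error for continuous actions. The pseudocode in the previous section computes the mean-squared error.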
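The two loss choices can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names `mse_loss` and `cross_entropy_loss` are hypothetical, and a real implementation would use a deep learning framework rather than lists.

```python
import math


def mse_loss(pred_actions, true_actions):
    """Mean squared error over all action dimensions (continuous actions)."""
    diffs = [(p - a) ** 2
             for pred, act in zip(pred_actions, true_actions)
             for p, a in zip(pred, act)]
    return sum(diffs) / len(diffs)


def cross_entropy_loss(logits, true_action_ids):
    """Mean negative log-probability of the recorded action (discrete actions)."""
    total = 0.0
    for row, idx in zip(logits, true_action_ids):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[idx]
    return total / len(true_action_ids)


print(mse_loss([[1.0, 2.0]], [[0.0, 0.0]]))       # 2.5
print(cross_entropy_loss([[0.0, 0.0]], [0]))      # log(2), about 0.693
```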

Evaluations

References

ゼロから作るDeep Learning ❹ ―強化学習編

Sutton, Richard S., Barto, Andrew G.

Reinforcement Learning, second edition: An Introduction (Adaptive Computation and Machine Learning s…

・An intuitive understanding of how the Transformer works (by 統計の森)

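MoE (Mixture of Experts) and Switch Transformers

Posted: 2023-11-29 / 2023-11-29, Author: lib-arts

Introducing MoE (Mixture of Experts), which adds branching to the Transformer, makes it possible to increase the number of parameters without a large increase in computational cost. This article summarizes Switch Transformer, a study that trains Transformers along these lines.
It was written with reference to the MoE (Mixture of Experts) paper, the Switch Transformers paper, and related material.

・Glossary and formula explanations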
https://www.hello-statisticians.com/explain-terms

Contents

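1 Prerequisites
- 1.1 Transformer
2 How MoE and Switch Transformers work
3 Distributed processing and the auxiliary loss
- 3.1 Expert Capacity and Capacity Factor
- 3.2 Load Balancing Loss
4 Overall processing and parameters
5 References

Prerequisites

Transformer

A network-analysis view of how the Transformer works and how data flows through it

How MoE and Switch Transformers work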

Overview of Switch Transformers

Switch Transformers roughly performs the processing shown in the figure below.

The left side of the figure shows an ordinary Transformer block, where the orange block is self-attention and the light-blue block is the MLP (FFN). The right side shows Switch Transformer, which treats several FFNs as Experts and adds a Router that assigns each token to an Expert.

With this design a single layer holds several MLPs while each token passes through only one of them, so the number of parameters grows without raising the per-token computational cost. The details are examined below using the equations of the Switch Transformers paper.

Mixture of Expert Routing

In the Switch Transformer paper, with the token representation written as $\mathbf{x} \in \mathbb{R}^{d_{model}}$, the gate-value $p_{i}(\mathbf{x})$ of the $i$-th of the $N$ Experts is defined as follows.
$$
\large
\begin{align}
p_{i}(\mathbf{x}) &= \frac{e^{h_{i}(\mathbf{x})}}{\sum_{j=1}^{N} e^{h_{j}(\mathbf{x})}} = \mathrm{Softmax}(h(\mathbf{x}))_{i} \\
h(\mathbf{x}) &= W_{r} \mathbf{x} \\
W_{r} & \in \mathbb{R}^{N \times d_{model}}
\end{align}
$$

Here $h(\mathbf{x})$ is a vector with $N$ elements, and $h_{i}(\mathbf{x})$ and $h_{j}(\mathbf{x})$ are its $i$-th and $j$-th elements.

In general Mixture of Expert routing, the indices of the top $k$ values among the gate-values $p_{1}(\mathbf{x}), \cdots , p_{N}(\mathbf{x})$ are selected, and the overall output is a linear combination of the outputs of the selected Experts. Writing the set of the top-$k$ indices as $\mathcal{T}$, the overall output $\mathbf{y} \in \mathbb{R}^{d_{model}}$ is defined as follows.
$$
\large
\begin{align}
\mathbf{y} &= \sum_{i \in \mathcal{T}} p_{i}(\mathbf{x}) E_{i}(\mathbf{x}) \\
E_{i}(\mathbf{x}) &= FFN_{i}(\mathbf{x})
\end{align}
$$

Here $E_{i}(\mathbf{x})$ is the output of each Expert, that is, the result of applying the MLP of the $i$-th Expert to $\mathbf{x}$.
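The routing equations above can be sketched in plain Python as follows. This is a toy illustration, not the paper's implementation: the helper names, the router weights `W_r`, and the Experts (simple scalings standing in for FFNs) are all hypothetical.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def moe_forward(x, W_r, experts, k):
    """Top-k routing: y = sum over i in T of p_i(x) * E_i(x)."""
    p = softmax(matvec(W_r, x))  # gate-values p_1 .. p_N
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    y = [0.0] * len(x)
    for i in top:
        out = experts[i](x)
        y = [yj + p[i] * oj for yj, oj in zip(y, out)]
    return y, top, p


# Toy setup (all values hypothetical): N = 3 Experts acting on d_model = 2.
experts = [lambda x, c=c: [c * v for v in x] for c in (1.0, 2.0, 3.0)]
W_r = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [2.0, 1.0]

y, top, p = moe_forward(x, W_r, experts, k=2)
print(top)  # [0, 1]
```

Calling the same function with `k=1` selects a single Expert per token, which is the case Switch Transformer uses.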

Switch Routing

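The MoE (Mixture of Experts) paper held that routing each token to several Experts, as in the "Mixture of Expert Routing" of the previous section, was necessary. Switch Transformer, in contrast, uses only $1$ Expert per token.

The Switch Transformer paper calls a layer built on this $k=1$ routing a "Switch Layer" and the routing itself "Switch Routing".

Distributed processing and the auxiliary loss

Expert Capacity and Capacity Factor

As confirmed in the previous section, the Router in Switch Transformer dispatches tokens to the Experts (several FFNs). The Experts can therefore be processed in a distributed fashion.

If tokens are simply dispatched, however, the load may concentrate on $1$ Expert, which would defeat the purpose of routing. To handle this, Switch Transformer assigns each Expert an "Expert Capacity" by the following formula.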
$$
\large
\begin{align}
\mathrm{expert \,\, capacity} = \left( \frac{\mathrm{tokens \,\, per \,\, batch}}{\mathrm{number \,\, of \,\, experts}} \right) \times \mathrm{capacity \,\, factor}
\end{align}
$$

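Here $\mathrm{tokens \,\, per \,\, batch}$ is the number of tokens in the batch fed to Switch Transformer and $\mathrm{number \,\, of \,\, experts}$ is the number of Experts.

In short, the upper limit of each Expert's capacity is defined on the premise that tokens are dispatched evenly, and $\mathrm{capacity \,\, factor}$ can be read as a buffer against the imbalance that may arise.

When more tokens are routed to an Expert than its Expert Capacity allows, the computation for the overflowing tokens is skipped in that layer, and their representations are passed on through the residual connection. The paper illustrates the points so far with the figure below.

As the figure shows, Switch Transformer dispatches tokens to the Experts and applies the FFN of each Expert.

In the left panel, where the Capacity Factor is $1.0$, the red dotted line shows a token that is not processed because the Expert Capacity was exceeded.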
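A small sketch of the capacity rule follows. Assumptions are stated plainly: the formula above does not say how to round, so truncation with `int` is an assumption here, and the function names and the token assignments are hypothetical.

```python
def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    # Rounding is an assumption; the formula in the text leaves it unspecified.
    return int(tokens_per_batch / num_experts * capacity_factor)


def dispatch(assignments, num_experts, capacity):
    """Split token indices into those processed and those skipped (overflow)."""
    load = [0] * num_experts
    kept, dropped = [], []
    for token, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token)
        else:
            dropped.append(token)
    return kept, dropped


assignments = [0, 0, 0, 1, 2, 2]  # hypothetical Router output for 6 tokens
for factor in (1.0, 1.5):
    cap = expert_capacity(len(assignments), 3, factor)
    print(factor, cap, dispatch(assignments, 3, cap))
# 1.0 2 ([0, 1, 3, 4, 5], [2])
# 1.5 3 ([0, 1, 2, 3, 4, 5], [])
```

With a capacity factor of $1.0$ the third token sent to Expert 0 overflows, while a factor of $1.5$ leaves enough room for it, at the cost of reserving more computation per Expert.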

Load Balancing Loss

To distribute tokens as evenly as possible across the Experts, Switch Transformer introduces a loss called the Load Balancing Loss. The loss is given below.
$$
\large
\begin{align}
\mathrm{loss} &= \alpha N \sum_{i=1}^{N} f_{i} P_{i} \\
f_{i} &= \frac{1}{T} \sum_{x \in \mathcal{B}} \mathbb{1} \{ \mathrm{argmax} \, p(x) = i \} \\
P_{i} &= \frac{1}{T} \sum_{x \in \mathcal{B}} p_{i}(x)
\end{align}
$$

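Here $\mathcal{B}$ is the batch, $T$ is the number of tokens in it, and $\alpha$ is a coefficient that weights the loss. The loss can be read as follows: it is built from $(f_{1}, \cdots , f_{N})$, the batch average of the $1$-hot assignments used in Switch Routing, and $(P_{1}, \cdots , P_{N})$, the batch average of the probabilities (gate-values) used in general Mixture of Expert Routing, and it keeps these two distributions from drifting far apart.

To interpret the factor $N$ more concretely, it helps to confirm that when $f_{1}, \cdots , f_{N}$ and $P_{1}, \cdots , P_{N}$ are both uniform, $\displaystyle N \sum_{i=1}^{N} f_{i} P_{i}$ evaluates as follows.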
$$
\large
\begin{align}
N \sum_{i=1}^{N} f_{i} P_{i} &= N \sum_{i=1}^{N} \frac{1}{N} \cdot \frac{1}{N} \\
&= \cancel{N^{2}} \cdot \frac{1}{\cancel{N^{2}}} \\
&= 1
\end{align}
$$
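The loss can be checked numerically with a short sketch. The function name, the router probabilities, and the value `alpha=0.01` are illustrative assumptions.

```python
def load_balancing_loss(router_probs, alpha):
    """router_probs: one probability vector p(x) over N Experts per token."""
    T = len(router_probs)
    N = len(router_probs[0])
    f = [0.0] * N  # fraction of tokens whose argmax is Expert i
    P = [0.0] * N  # mean router probability assigned to Expert i
    for p in router_probs:
        top = max(range(N), key=lambda i: p[i])
        f[top] += 1.0 / T
        for i in range(N):
            P[i] += p[i] / T
    return alpha * N * sum(fi * Pi for fi, Pi in zip(f, P))


balanced = [[0.6, 0.4], [0.4, 0.6]]   # tokens split evenly across 2 Experts
collapsed = [[0.9, 0.1], [0.9, 0.1]]  # every token goes to Expert 0
print(load_balancing_loss(balanced, alpha=0.01))
print(load_balancing_loss(collapsed, alpha=0.01))
```

In the balanced case $N \sum_{i} f_{i} P_{i}$ equals $1$, matching the calculation above, and the collapsed case gives a larger value, so minimizing the loss pushes the Router toward even dispatch.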

Overall processing and parameters

References

・Transformer paper: Attention Is All You Need [2017]
・Switch Transformer paper
 ・Mixture of Experts paper

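Determinant of a block-diagonal matrix and the characteristic polynomial

Posted: 2023-11-27 / 2023-11-23, Author: lib-arts

The characteristic polynomial is the polynomial that appears in the characteristic equation used to compute eigenvalues. This article covers the determinant of a block-diagonal matrix and the computation of its characteristic polynomial.
It mainly follows Chapter $8$, "Eigenvalue problems and diagonalization of matrices", of 「チャート式シリーズ大学教養線形代数」.

・Math summary
https://www.hello-statisticians.com/math_basic

チャート式シリーズ大学教養線形代数 (チャート式・シリーズ)

Contents

1 Characteristic polynomial of a block-diagonal matrix
2 Worked examples
- 2.1 Basic Example $156$

Characteristic polynomial of a block-diagonal matrix

Determinant of a block matrix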

$$
\large
\begin{align}
X = \left( \begin{array}{cc} A & B \\ O & D \end{array} \right)
\end{align}
$$

For $X$ defined as above, with square blocks $A$ and $D$ and a zero block $O$, the determinant $\det{(X)}$ satisfies the following.
$$
\large
\begin{align}
\det{(X)} = \left| \begin{array}{cc} A & B \\ O & D \end{array} \right| = |A||D| = \det{(A)}\det{(D)} \quad (1)
\end{align}
$$
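Equation $(1)$ can be spot-checked on a small example. The sketch below uses a simple Laplace-expansion determinant, and the blocks `A`, `B`, `D` are hypothetical integer matrices chosen only for illustration.

```python
def det(M):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * M[0][j] * det(minor)
    return total


A = [[2, 1], [2, 3]]
B = [[5, 7], [-1, 4]]
D = [[1, 2], [3, 5]]
# Assemble X = (A B; O D) row by row.
X = [A[0] + B[0], A[1] + B[1], [0, 0] + D[0], [0, 0] + D[1]]

print(det(X), det(A) * det(D))  # -4 -4
```

The off-diagonal block `B` does not affect the result, which is what allows the formula to be applied repeatedly to block-diagonal matrices.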

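Definition of the characteristic polynomial

The characteristic polynomial $F_{A}(t)$ of an $n \times n$ square matrix $A$ is defined as $F_{A}(t)=\det{(tI_{n} \, - \, A)}$. The definition is also covered in the following article.

Definition of the characteristic polynomial and the characteristic polynomial of a triangular matrix

Characteristic polynomial of a block-diagonal matrix

This is covered in Basic Example $156$ in the next section.

Worked examples

Below we work through an example from 「チャート式シリーズ大学教養線形代数」.

Basic Example $156$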

$$
\large
\begin{align}
A = \left( \begin{array}{cccc} A_1 & O & \cdots & O \\ O & A_2 & \cdots & O \\ \vdots & \vdots & \ddots & \vdots \\ O & O & \cdots & A_r \end{array} \right)
\end{align}
$$

For the block-diagonal matrix $A$ above, applying equation $(1)$ repeatedly gives the characteristic polynomial $F_{A}(t)$ as follows, where each $I_{i}$ is the identity matrix of the same size as $A_{i}$.
$$
\large
\begin{align}
F_{A}(t) &= \left| \begin{array}{cccc} t I_1 \, - \, A_1 & O & \cdots & O \\ O & t I_2 \, - \, A_2 & \cdots & O \\ \vdots & \vdots & \ddots & \vdots \\ O & O & \cdots & t I_r \, - \, A_r \end{array} \right| \\
&= \det{(t I_1 \, - \, A_1)} \left| \begin{array}{ccc} t I_2 \, - \, A_2 & \cdots & O \\ \vdots & \ddots & \vdots \\ O & \cdots & t I_r \, - \, A_r \end{array} \right| \\
&= \cdots \\
&= \det{(t I_1 \, - \, A_1)} \cdots \det{(t I_r \, - \, A_r)} \\
&= F_{A_1}(t) \cdots F_{A_r}(t)
\end{align}
$$

Hence the characteristic polynomial of the block-diagonal matrix $A$ satisfies the following.
$$
\large
\begin{align}
F_{A}(t) = F_{A_1}(t) \cdots F_{A_r}(t)
\end{align}
$$

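Definition of the minimal polynomial and worked examples

Posted: 2023-11-22 / 2023-11-21, Author: lib-arts

Among the polynomials that become the zero matrix $O$ when the matrix $A$ is substituted, the one with the smallest degree and leading coefficient $1$ is called the minimal polynomial. This article summarizes the definition of the minimal polynomial and works through examples based on exercises from チャート式 linear algebra.
It mainly follows Chapter $8$, "Eigenvalue problems and diagonalization of matrices", of 「チャート式シリーズ大学教養線形代数」.

・Math summary
https://www.hello-statisticians.com/math_basic

チャート式シリーズ大学教養線形代数 (チャート式・シリーズ)

Contents

1 Definition of the minimal polynomial and how to find it
- 1.1 Definition of the minimal polynomial
- 1.2 How to find the minimal polynomial
2 Examples using the minimal polynomial
- 2.1 Basic Example $173$

Definition of the minimal polynomial and how to find it

Definition of the minimal polynomial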

For an $n \times n$ square matrix $A$, define the set $I_{A}$ as follows.
$$
\large
\begin{align}
I_{A} = \{ f(t)| f(A) = O \}
\end{align}
$$

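That is, $I_{A}$ is the set of all polynomials that become the zero matrix when $A$ is substituted. The nonzero polynomial in $I_{A}$ that has the smallest degree and leading coefficient $1$ is called the minimal polynomial.

How to find the minimal polynomial

First factor the characteristic polynomial $F_{A}(t)$. The minimal polynomial divides $F_{A}(t)$ and has every eigenvalue as a root, so for each factor whose exponent is $2$ or more, try the exponents in increasing order starting from $1$ and check whether substituting $A$ gives the zero matrix. The concrete steps are shown in the exercise in the next section.

Examples using the minimal polynomial

Below we work through an example from 「チャート式シリーズ大学教養線形代数」.

Basic Example $173$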

・$[1]$
$$
\large
\begin{align}
A &= \left( \begin{array}{cc} 2 & 1 \\ 2 & 3 \end{array} \right)
\end{align}
$$

Writing the characteristic polynomial of $A$ as $F_{A}(t)$, it can be computed as follows.
$$
\large
\begin{align}
F_{A}(t) &= \det{(tI_{2} \, - \, A)} = \left| \begin{array}{cc} t-2 & -1 \\ -2 & t-3 \end{array} \right| \\
&= (t-2)(t-3) - 2 \\
&= t^{2} - 5t + 6 - 2 \\
&= t^{2} - 5t + 4 \\
&= (t-1)(t-4)
\end{align}
$$

$F_{A}(t)$ has no repeated factor, so the minimal polynomial of $A$ is $(t-1)(t-4)$.

・$[2]$
$$
\large
\begin{align}
A = \left( \begin{array}{ccc} 3 & 1 & 1 \\ 2 & 4 & 2 \\ 1 & 1 & 3 \end{array} \right)
\end{align}
$$

Writing the characteristic polynomial of $A$ as $F_{A}(t)$, it can be computed as follows.
$$
\large
\begin{align}
F_{A}(t) &= \det{(tI_{3} \, - \, A)} = \left| \begin{array}{ccc} t-3 & -1 & -1 \\ -2 & t-4 & -2 \\ -1 & -1 & t-3 \end{array} \right| \\
&= -\left| \begin{array}{ccc} -1 & -1 & t-3 \\ -2 & t-4 & -2 \\ t-3 & -1 & -1 \end{array} \right| \\
&= \left| \begin{array}{ccc} 1 & 1 & 3-t \\ -2 & t-4 & -2 \\ t-3 & -1 & -1 \end{array} \right| \\
&= \left| \begin{array}{ccc} 1 & 0 & 0 \\ -2 & t-2 & -2(t-2) \\ t-3 & -(t-2) & (t-2)(t-4) \end{array} \right| \\
&= (-1)^{1+1} \left| \begin{array}{cc} t-2 & -2(t-2) \\ -(t-2) & (t-2)(t-4) \end{array} \right| \\
&= (t-2)^{2} \left| \begin{array}{cc} 1 & -2 \\ -1 & t-4 \end{array} \right| \\
&= (t-2)^{2} (t-4-2) = (t-2)^{2} (t-6)
\end{align}
$$

Now let $p(t)=(t-2)(t-6)$. Then $p(A)$ can be computed as follows.
$$
\large
\begin{align}
p(A) &= (A \, - \, 2I_{3})(A \, - \, 6I_{3}) \\
&= \left( \begin{array}{ccc} 1 & 1 & 1 \\ 2 & 2 & 2 \\ 1 & 1 & 1 \end{array} \right) \left( \begin{array}{ccc} -3 & 1 & 1 \\ 2 & -2 & 2 \\ 1 & 1 & -3 \end{array} \right) \\
&= \left( \begin{array}{ccc} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{array} \right) = O
\end{align}
$$

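Since $p(A)=O$ and no polynomial of lower degree has both eigenvalues $2$ and $6$ as roots, the definition of the minimal polynomial gives $(t-2)(t-6)$ as the minimal polynomial of $A$.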
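The matrix product in part $[2]$ can be verified with a few lines of Python. The helper names `matmul` and `shift` are hypothetical and exist only for this check.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def shift(A, c):
    """Return A - c*I."""
    n = len(A)
    return [[A[i][j] - (c if i == j else 0) for j in range(n)] for i in range(n)]


A = [[3, 1, 1], [2, 4, 2], [1, 1, 3]]
zero = [[0] * 3 for _ in range(3)]

p_of_A = matmul(shift(A, 2), shift(A, 6))
print(p_of_A == zero)                            # True
print(shift(A, 2) == zero, shift(A, 6) == zero)  # False False
```

The second line confirms that neither linear factor alone annihilates $A$, so the degree cannot be reduced further.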

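Definition of the characteristic polynomial and the characteristic polynomial of a triangular matrix

Posted: 2023-11-20 / 2023-11-22, Author: lib-arts

The characteristic polynomial is the polynomial that appears in the characteristic equation used to compute eigenvalues. This article covers the definition and use of the characteristic polynomial and its computation for a triangular matrix.
It mainly follows Chapter $8$, "Eigenvalue problems and diagonalization of matrices", of 「チャート式シリーズ大学教養線形代数」.

・Math summary
https://www.hello-statisticians.com/math_basic

チャート式シリーズ大学教養線形代数 (チャート式・シリーズ)

Contents

1 Characteristic polynomial
- 1.1 Definition of the characteristic polynomial
- 1.2 Characteristic polynomial of a triangular matrix
2 Examples using the characteristic polynomial
- 2.1 Basic Example $154$
- 2.2 Basic Example $155$

Characteristic polynomial

Definition of the characteristic polynomial

The characteristic polynomial $F_{A}(t)$ of an $n \times n$ square matrix $A$ in the variable $t$ is defined with the determinant $\det$ and the $n \times n$ identity matrix $I_{n}$ as follows.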
$$
\large
\begin{align}
F_{A}(t) = \det{(tI_{n} \, - \, A)}
\end{align}
$$

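The characteristic equation is then written as $F_{A}(t)=\det{(tI_{n} \, - \, A)}=0$.

Characteristic polynomial of a triangular matrix

Examples using the characteristic polynomial

Below we work through examples from 「チャート式シリーズ大学教養線形代数」.

Basic Example $154$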

$[1]$
$$
\large
\begin{align}
A = \left( \begin{array}{cc} 2 & 1 \\ 2 & 3 \end{array} \right)
\end{align}
$$

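The characteristic polynomial of $A$ is $F_{A}(t)=(t-1)(t-4)$, so the eigenvalues of $A$ are $1$ and $4$.

・Basis of the eigenspace for eigenvalue $1$
$A-I_{2}$ can be row-reduced as follows.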
$$
\large
\begin{align}
A \, - \, I_{2} &= \left( \begin{array}{cc} 1 & 1 \\ 2 & 2 \end{array} \right) \\
& \longrightarrow \left( \begin{array}{cc} 1 & 1 \\ 0 & 0 \end{array} \right)
\end{align}
$$

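Hence the solutions of $(A-I_{2})\mathbf{x}=\mathbf{0}$ are $\displaystyle \mathbf{x} = c \left( \begin{array}{c} 1 \\ -1 \end{array} \right)$, and the vector $\displaystyle \left( \begin{array}{c} 1 \\ -1 \end{array} \right)$ is a basis of the eigenspace for eigenvalue $1$.

・Basis of the eigenspace for eigenvalue $4$
$A-4I_{2}$ can be row-reduced as follows.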
$$
\large
\begin{align}
A \, - \, 4I_{2} &= \left( \begin{array}{cc} -2 & 1 \\ 2 & -1 \end{array} \right) \\
& \longrightarrow \left( \begin{array}{cc} -2 & 1 \\ 0 & 0 \end{array} \right)
\end{align}
$$

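Hence the solutions of $(A \, - \, 4I_{2})\mathbf{x}=\mathbf{0}$ are $\displaystyle \mathbf{x} = c \left( \begin{array}{c} 1 \\ 2 \end{array} \right)$, and the vector $\displaystyle \left( \begin{array}{c} 1 \\ 2 \end{array} \right)$ is a basis of the eigenspace for eigenvalue $4$.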

・$[2]$
$$
\large
\begin{align}
A = \left( \begin{array}{ccc} 3 & 1 & 1 \\ 2 & 4 & 2 \\ 1 & 1 & 3 \end{array} \right)
\end{align}
$$

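The characteristic polynomial of $A$ is $F_{A}(t)=(t-2)^{2}(t-6)$, so the eigenvalues of $A$ are $2$ and $6$, with multiplicities $2$ and $1$. The eigenvectors for each eigenvalue are computed below.

・Basis of the eigenspace for eigenvalue $2$
$A \, - \, 2I_{3}$ can be row-reduced as follows.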
$$
\large
\begin{align}
A \, - \, 2I_{3} &= \left( \begin{array}{ccc} 1 & 1 & 1 \\ 2 & 2 & 2 \\ 1 & 1 & 1 \end{array} \right) \\
& \longrightarrow \left( \begin{array}{ccc} 1 & 1 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{array} \right)
\end{align}
$$

Hence the solutions of $(A \, - \, 2I_{3})\mathbf{x}=\mathbf{0}$ are $\displaystyle \mathbf{x} = c \left( \begin{array}{c} 1 \\ -1 \\ 0 \end{array} \right) + d \left( \begin{array}{c} 1 \\ 0 \\ -1 \end{array} \right)$, so the vectors $\displaystyle \left( \begin{array}{c} 1 \\ -1 \\ 0 \end{array} \right)$ and $\displaystyle \left( \begin{array}{c} 1 \\ 0 \\ -1 \end{array} \right)$ form a basis of the eigenspace for eigenvalue $2$.

・Basis of the eigenspace for eigenvalue $6$
$A \, - \, 6I_{3}$ can be row-reduced as follows.
$$
\large
\begin{align}
A \, - \, 6I_{3} &= \left( \begin{array}{ccc} -3 & 1 & 1 \\ 2 & -2 & 2 \\ 1 & 1 & -3 \end{array} \right) \\
& \longrightarrow \left( \begin{array}{ccc} 1 & 1 & -3 \\ 2 & -2 & 2 \\ -3 & 1 & 1 \end{array} \right) \\
& \longrightarrow \left( \begin{array}{ccc} 1 & 1 & -3 \\ 0 & -4 & 8 \\ 0 & 4 & -8 \end{array} \right) \\
& \longrightarrow \left( \begin{array}{ccc} 1 & 1 & -3 \\ 0 & 1 & -2 \\ 0 & 0 & 0 \end{array} \right) \\
& \longrightarrow \left( \begin{array}{ccc} 1 & 0 & -1 \\ 0 & 1 & -2 \\ 0 & 0 & 0 \end{array} \right)
\end{align}
$$

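Hence the solutions of $(A \, - \, 6I_{3})\mathbf{x}=\mathbf{0}$ are $\displaystyle \mathbf{x} = c \left( \begin{array}{c} 1 \\ 2 \\ 1 \end{array} \right)$, and the vector $\displaystyle \left( \begin{array}{c} 1 \\ 2 \\ 1 \end{array} \right)$ is a basis of the eigenspace for eigenvalue $6$.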
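The basis vectors found in part $[2]$ can be checked directly against $A\mathbf{v}=\lambda\mathbf{v}$. The helper `matvec` is a hypothetical name used only for this check.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


A = [[3, 1, 1], [2, 4, 2], [1, 1, 3]]
# (eigenvalue, basis vector) pairs taken from the calculation above.
pairs = [(2, [1, -1, 0]), (2, [1, 0, -1]), (6, [1, 2, 1])]

for lam, v in pairs:
    print(lam, v, matvec(A, v) == [lam * x for x in v])  # True for every pair
```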

Basic Example $155$

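The characteristic polynomial of a matrix and the Cayley–Hamilton theorem

Posted: 2023-11-18 / 2023-11-18, Author: lib-arts

The Cayley–Hamilton theorem is an identity used, for example, to reduce powers of a matrix. This article first states the general form of the theorem in terms of the characteristic polynomial and then checks how it corresponds to the formula for $2 \times 2$ square matrices.
It mainly follows Chapter $8$, "Eigenvalue problems and diagonalization of matrices", of 「チャート式シリーズ大学教養線形代数」.

・Math summary
https://www.hello-statisticians.com/math_basic

チャート式シリーズ大学教養線形代数 (チャート式・シリーズ)

Contents

1 Prerequisites
- 1.1 Characteristic polynomial of a matrix
- 1.2 The Cayley–Hamilton theorem for $2 \times 2$ square matrices
2 The Cayley–Hamilton theorem
- 2.1 The characteristic polynomial and the Cayley–Hamilton theorem
- 2.2 Deriving the formula for $2 \times 2$ square matrices

Prerequisites

Characteristic polynomial of a matrix

The characteristic polynomial $F_{A}(t)$ of an $n \times n$ square matrix $A$ in the variable $t$ is defined with the determinant $\det$ and the $n \times n$ identity matrix $I_{n}$ as follows.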
$$
\large
\begin{align}
F_{A}(t) = \det{(tI_{n} - A)}
\end{align}
$$

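The Cayley–Hamilton theorem for $2 \times 2$ square matrices

$$
\large
\begin{align}
A = \left( \begin{array}{cc} a & b \\ c & d \end{array} \right)
\end{align}
$$

For the $2 \times 2$ square matrix $A$ defined above, the following holds.
$$
\large
\begin{align}
A^{2} - (a+d)A + (ad - bc) I_{2} &= O \\
A^{2} &= (a+d)A - (ad - bc) I_{2} \quad (1)
\end{align}
$$

Here $O$ denotes the zero matrix.

The Cayley–Hamilton theorem

The characteristic polynomial and the Cayley–Hamilton theorem

When the characteristic polynomial of an $n \times n$ square matrix $A$ is written as $F_{A}(t)$, the following holds.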
$$
\large
\begin{align}
F_{A}(A) = O
\end{align}
$$

This is called the Cayley–Hamilton theorem.

Deriving the formula for $2 \times 2$ square matrices

$$
\large
\begin{align}
A = \left( \begin{array}{cc} a & b \\ c & d \end{array} \right)
\end{align}
$$

The characteristic polynomial $F_{A}(t)$ of the $2 \times 2$ square matrix $A$ defined above can be written as follows.
$$
\large
\begin{align}
F_{A}(t) &= \det{(tI_{2} - A)} \\
&= \left| \begin{array}{cc} t-a & -b \\ -c & t-d \end{array} \right| \\
&= (t-a)(t-d) - bc \\
&= t^{2} -(a+d)t + ad-bc
\end{align}
$$

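Substituting $A$ for $t$, with the constant term multiplied by $I_{2}$, $F_{A}(A)=O$ can be rewritten as follows.
$$
\large
\begin{align}
F_{A}(A) &= O \\
A^{2} -(a+d)A + (ad-bc)I_{2} &= O \\
A^{2} &= (a+d)A - (ad-bc)I_{2} \quad (2)
\end{align}
$$

Equation $(2)$ agrees with equation $(1)$.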
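The $2 \times 2$ identity can be checked numerically. The function name `cayley_hamilton_residual` and the test matrices are illustrative assumptions.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def cayley_hamilton_residual(A):
    """Return A^2 - (a+d)A + (ad-bc)I for a 2x2 matrix A; it should be the zero matrix."""
    (a, b), (c, d) = A
    A2 = matmul(A, A)
    trace, determinant = a + d, a * d - b * c
    return [[A2[i][j] - trace * A[i][j] + (determinant if i == j else 0)
             for j in range(2)]
            for i in range(2)]


print(cayley_hamilton_residual([[2, 1], [2, 3]]))   # [[0, 0], [0, 0]]
print(cayley_hamilton_residual([[1, -2], [3, 4]]))  # [[0, 0], [0, 0]]
```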

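Deriving the diffusion model loss (2): deriving the loss from the KL divergence between normal distributions

Posted: 2023-11-16 / 2023-11-20, Author: lib-arts

The diffusion model (Diffusion Model), built on diffusion and denoising, is a concept adopted in many generative models. This article covers the derivation of the loss in the DDPM paper, based on computing the KL divergence (KL-Divergence) between normal distributions.
It was written with reference to the Diffusion Model paper, the DDPM paper, and Chapter $2$, 「拡散モデル」, of 「拡散モデルーデータ生成技術の数理(岩波書店)」.

・Glossary and formula explanations
https://www.hello-statisticians.com/explain-terms

拡散モデル　データ生成技術の数理 (岡野原大輔, published 2023/02/17)

Contents

1 Prerequisites
- 1.1 Eq. $(5)$ of the DDPM paper
- 1.2 KL divergence between $2$ normal distributions
2 Deriving the diffusion model loss

Prerequisites

Eq. $(5)$ of the DDPM paper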

$$
\begin{align}
L &= \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \left[ D_{KL}(q(\mathbf{x}_{T}|\mathbf{x}_{0})||p_{\theta}(\mathbf{x}_{T})) + \sum_{t>1} D_{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})||p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})) - \log{p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1})} \right] \quad (1.1) \\
&= \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \left[ L_{T} + \sum_{t>1} L_{t-1} + L_{0} \right] \quad (1.1)'
\end{align}
$$

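Equations $(1.1)$ and $(1.1)'$ correspond to $\mathrm{Eq}. \: (5)$ of the DDPM paper. The derivation is covered in detail in the following article.

Deriving the diffusion model loss (1): the variational bound via Jensen's inequality and its expression with the KL divergence

KL divergence between $2$ normal distributions

The KL divergence $D_{KL}(p||q)$ between $2$ probability distributions $p(x)$ and $q(x)$ is written as follows.
$$
\large
\begin{align}
D_{KL}(p||q) = - \int p(x) \log{ \frac{q(x)}{p(x)} } dx \quad (1.2)
\end{align}
$$

From $(1.2)$, the KL divergence $D_{KL}(\mathcal{N}(\mu_{a}, \Sigma_{a})||\mathcal{N}(\mu_{b}, \Sigma_{b}))$ between $2$ $d$-dimensional normal distributions $\mathcal{N}(\mu_{a}, \Sigma_{a})$ and $\mathcal{N}(\mu_{b}, \Sigma_{b})$ can be computed as follows.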
$$
\begin{align}
D_{KL}(\mathcal{N}(\mu_{a}, \Sigma_{a})||\mathcal{N}(\mu_{b}, \Sigma_{b})) &= \frac{1}{2} \left[ \log{ \frac{|\Sigma_{b}|}{|\Sigma_{a}|} } - d + \mathrm{tr} \left( \Sigma_{b}^{-1} \Sigma_{a} \right) + (\mu_{b}-\mu_{a})^{\mathrm{T}} \Sigma_{b}^{-1} (\mu_{b}-\mu_{a}) \right]
\end{align}
$$
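For $d=1$ the closed form above can be compared with a direct numerical evaluation of the integral that defines the KL divergence. The sketch below is only a sanity check: the function names, the integration range, and the parameter values are assumptions chosen for illustration.

```python
import math


def kl_gauss_1d(mu_a, var_a, mu_b, var_b):
    """Closed form for d = 1, with Sigma_a = var_a and Sigma_b = var_b."""
    return 0.5 * (math.log(var_b / var_a) - 1.0 + var_a / var_b
                  + (mu_b - mu_a) ** 2 / var_b)


def normal_logpdf(x, mu, var):
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)


def kl_numeric(mu_a, var_a, mu_b, var_b, lo=-20.0, hi=20.0, n=40000):
    """Midpoint-rule approximation of the integral of p(x) * log(p(x) / q(x))."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        log_p = normal_logpdf(x, mu_a, var_a)
        log_q = normal_logpdf(x, mu_b, var_b)
        total += math.exp(log_p) * (log_p - log_q) * h
    return total


print(kl_gauss_1d(0.0, 1.0, 1.0, 4.0))  # about 0.4431
print(kl_numeric(0.0, 1.0, 1.0, 4.0))   # agrees to several decimal places
```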

Deriving the diffusion model loss

Derivation of $q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})$

The term $q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})$ in $(1.1)$ is written as follows, where $\bar{\beta}_{t} = 1 - \bar{\alpha}_{t}$.
$$
\large
\begin{align}
q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0}) &= \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\mu}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0}), \tilde{\beta}_{t} \mathbf{I}) \quad (2.1) \\
\tilde{\mu}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0}) &= \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_{t}}{\bar{\beta}_{t}} \mathbf{x}_{0} + \frac{\sqrt{\alpha_{t}} \bar{\beta}_{t-1}}{\bar{\beta}_{t}} \mathbf{x}_{t} \quad (2.2) \\
\tilde{\beta}_{t} &= \frac{\bar{\beta}_{t-1}}{\bar{\beta}_{t}} \beta_{t} \quad (2.3)
\end{align}
$$

Derivation of $\displaystyle \small L_{t-1} = \mathbb{E}_{q} \left[ \frac{1}{2 \sigma_{t}^{2}} || \tilde{\mu}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0}) - \mu_{\theta}(\mathbf{x}_{t}, t) ||^{2} \right] + C$

Details of $\tilde{\mu}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0})$

Below, the details of $\tilde{\mu}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0})$ are examined through a sequence of transformations. First, from $(2.2)$, $\tilde{\mu}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0})$ is written as follows, with $\bar{\beta}_{t} = 1 - \bar{\alpha}_{t}$.
$$
\large
\begin{align}
\tilde{\mu}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0}) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_{t}}{\bar{\beta}_{t}} \mathbf{x}_{0} + \frac{\sqrt{\alpha_{t}} \bar{\beta}_{t-1}}{\bar{\beta}_{t}} \mathbf{x}_{t} \quad (2.2)
\end{align}
$$

The conditional distribution $q(\mathbf{x}_{t}|\mathbf{x}_{0})$ of the diffusion process at an arbitrary time step can be written as follows.
$$
\large
\begin{align}
q(\mathbf{x}_{t}|\mathbf{x}_{0}) &= \mathcal{N}(\mathbf{x}_{t}; \sqrt{\bar{\alpha}_{t}} \mathbf{x}_{0}, (1 \, - \, \bar{\alpha}_{t}) \mathbf{I}) \quad (2.4) \\
\alpha_{t} &= 1-\beta_{t} \\
\bar{\alpha}_{t} &= \prod_{s=1}^{t} \alpha_{s}
\end{align}
$$

The derivation is covered in detail in the following article.

Diffusion Model overview and summary of definitions

From $(2.4)$, $\mathbf{x}_{t}$ can be written with $\mathbf{x}_{0}$ and noise $\epsilon$ as follows.
$$
\large
\begin{align}
\mathbf{x}_{t}(\mathbf{x}_{0}, \epsilon) = \sqrt{\bar{\alpha}_{t}} \mathbf{x}_{0} + \sqrt{1 \, - \, \bar{\alpha}_{t}} \epsilon, \quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \quad (2.5)
\end{align}
$$

Solving $(2.5)$ for $\mathbf{x}_{0}$ gives the following.
$$
\large
\begin{align}
\mathbf{x}_{0} = \frac{1}{\sqrt{\bar{\alpha}_{t}}} \left( \mathbf{x}_{t}(\mathbf{x}_{0}, \epsilon) \, - \, \sqrt{1 \, - \, \bar{\alpha}_{t}} \epsilon \right) \quad (2.6)
\end{align}
$$
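Equations $(2.4)$ to $(2.6)$ can be traced with scalar values. In the sketch below the noise schedule `betas`, the sample `x0`, and the random seed are hypothetical, and a scalar stands in for the vector $\mathbf{x}_{0}$.

```python
import math
import random


def alpha_bars(betas):
    """Return [alpha_bar_1, ..., alpha_bar_T] with alpha_t = 1 - beta_t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


betas = [0.1, 0.2, 0.3]  # illustrative schedule beta_1 .. beta_3
abar = alpha_bars(betas)
print(abar)  # approximately [0.9, 0.72, 0.504]

t = 3
rng = random.Random(0)
x0 = 1.5
eps = rng.gauss(0.0, 1.0)  # epsilon ~ N(0, 1)

# (2.5): sample x_t directly from x_0 and the noise.
xt = math.sqrt(abar[t - 1]) * x0 + math.sqrt(1.0 - abar[t - 1]) * eps
# (2.6): recover x_0 from x_t when the noise is known.
x0_rec = (xt - math.sqrt(1.0 - abar[t - 1]) * eps) / math.sqrt(abar[t - 1])
print(abs(x0_rec - x0) < 1e-12)  # True
```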

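Substituting $(2.6)$ into $(2.2)$ and using $\sqrt{\bar{\alpha}_{t-1}}/\sqrt{\bar{\alpha}_{t}} = 1/\sqrt{\alpha_{t}}$ together with $\beta_{t} + \alpha_{t}(1-\bar{\alpha}_{t-1}) = 1-\bar{\alpha}_{t}$ gives the following, where $\bar{\beta}_{t} = 1-\bar{\alpha}_{t}$.
$$
\large
\begin{align}
\tilde{\mu}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0}) &= \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_{t}}{\bar{\beta}_{t}} \mathbf{x}_{0} + \frac{\sqrt{\alpha_{t}} \bar{\beta}_{t-1}}{\bar{\beta}_{t}} \mathbf{x}_{t} \quad (2.2) \\
&= \frac{\beta_{t}}{\sqrt{\alpha_{t}} \bar{\beta}_{t}} \left( \mathbf{x}_{t} - \sqrt{\bar{\beta}_{t}} \epsilon \right) + \frac{\sqrt{\alpha_{t}} \bar{\beta}_{t-1}}{\bar{\beta}_{t}} \mathbf{x}_{t} \\
&= \frac{\beta_{t} + \alpha_{t} \bar{\beta}_{t-1}}{\sqrt{\alpha_{t}} \bar{\beta}_{t}} \mathbf{x}_{t} - \frac{\beta_{t}}{\sqrt{\alpha_{t}} \sqrt{\bar{\beta}_{t}}} \epsilon \\
&= \frac{1}{\sqrt{\alpha_{t}}} \left( \mathbf{x}_{t} - \frac{\beta_{t}}{\sqrt{\bar{\beta}_{t}}} \epsilon \right)
\end{align}
$$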

Details of $\mu_{\theta}(\mathbf{x}_{t}, t)$

Derivation of $\displaystyle \small L_{t-1} = \mathbb{E}_{\mathbf{x}_{0}, \epsilon} \left[ \frac{\beta_{t}^{2}}{2 \sigma_{t}^{2} \alpha_{t} \bar{\beta}_{t}} \left\| \epsilon - \epsilon_{\theta} \left( \sqrt{\bar{\alpha}_{t}} \mathbf{x}_{0} + \sqrt{\bar{\beta}_{t}} \epsilon, t \right) \right\|^{2} \right] + C$

Sampling with DDPM

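Deriving the diffusion model loss (1): the variational bound via Jensen's inequality and its expression with the KL divergence

Posted: 2023-11-14 / 2023-11-16, Author: lib-arts

The diffusion model (Diffusion Model), built on diffusion and denoising, is a concept adopted in many generative models. This article covers the derivation of the variational bound on the negative log-likelihood of a diffusion model, using Jensen's inequality (Jensen's Inequality) and the definition of the KL divergence.
It was written with reference to the Diffusion Model paper, the DDPM paper, and Chapter $2$, 「拡散モデル」, of 「拡散モデルーデータ生成技術の数理(岩波書店)」.

・Glossary and formula explanations
https://www.hello-statisticians.com/explain-terms

Contents

1 Prerequisites
2 Deriving the diffusion model loss
- 2.1 Deriving Eq. $(3)$ of the DDPM paper
- 2.2 Deriving Eq. $(5)$ of the DDPM paper

Prerequisites

Jensen's inequality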

$$
\large
\begin{align}
\lambda_i & \geq 0 \\
\sum_{i=1}^{M} \lambda_{i} &= 1
\end{align}
$$

With $\lambda_1, \cdots , \lambda_M$ defined as above, the following inequality holds for a convex function $f(x)$ and arbitrary points $(x_i, f(x_i))$ on it.
$$
\large
\begin{align}
f \left( \sum_{i=1}^{M} \lambda_{i} x_{i} \right) \leq \sum_{i=1}^{M} \lambda_{i} f \left( x_{i} \right) \quad (1.1)
\end{align}
$$

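This is called Jensen's inequality (Jensen's Inequality). It is also covered in the following article.

Jensen's inequality, expectations of convex functions, and convex sets

The function $f(x)=-\log{x}$ that appears in the derivations of this article is convex, so this section treats Jensen's inequality for convex functions. For concave functions the inequality holds with the direction reversed.

Applying Jensen's inequality to the definition of the expectation

Since $\lambda_{i}$ in $(1.1)$ satisfies $\displaystyle \lambda_i \geq 0, \, \sum_{i=1}^{M} \lambda_i = 1$, it can be identified with a probability function $p(x_i)$. For a convex function $f$ this gives the following.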
$$
\large
\begin{align}
f \left( \sum_{i=1}^{M} p(x_i) x_{i} \right) & \leq \sum_{i=1}^{M} p(x_i) f \left( x_{i} \right) \\
f \left( \mathbb{E} \left[ x_{i} \right] \right) & \leq \mathbb{E} \left[ f \left( x_{i} \right) \right]
\end{align}
$$

The inequality above was derived for a discrete probability distribution, but the same holds for a continuous variable as follows.
$$
\large
\begin{align}
f \left( \int x p(x) dx \right) & \leq \int f(x) p(x) dx
\end{align}
$$
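The discrete form of the inequality can be checked numerically for $f(x)=-\log{x}$. The function name `jensen_gap`, the weights, and the points are illustrative assumptions.

```python
import math


def jensen_gap(weights, xs, f):
    """Return E[f(x)] - f(E[x]) for a discrete distribution; non-negative when f is convex."""
    mean_x = sum(w * x for w, x in zip(weights, xs))
    mean_f = sum(w * f(x) for w, x in zip(weights, xs))
    return mean_f - f(mean_x)


def neg_log(x):
    return -math.log(x)


gap = jensen_gap([0.2, 0.5, 0.3], [1.0, 2.0, 4.0], neg_log)
print(gap > 0.0)  # True
```

The gap vanishes when all the probability mass sits on a single point, which is the equality case of the inequality.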

Definition and interpretation of the KL divergence

The KL divergence $\mathrm{KL}(p||q)$ between continuous probability distributions $p(x)$ and $q(x)$ is defined as follows.
$$
\large
\begin{align}
\mathrm{KL}(p||q) &= -\int \left[ p(x) \log{\frac{q(x)}{p(x)}} \right] dx \\
&= -\int \left[ p(x) \log{q(x)} \, - \, p(x) \log{p(x)} \right] dx \\
&= \int \left[ p(x) \log{\frac{p(x)}{q(x)}} \right] dx \quad (1.2)
\end{align}
$$

Using the expectation $\mathbb{E}_{p}$ with respect to the distribution $p$, the same quantity can also be written as follows.
$$
\large
\begin{align}
\mathrm{KL}(p||q) &= \mathbb{E}_{p} \left[ - \log{\frac{q(x)}{p(x)}} \right] \quad (1.3)
\end{align}
$$

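Equation $(1.3)$ also holds for discrete probability distributions. The KL divergence can be interpreted as a measure of how far apart the distributions $p(x)$ and $q(x)$ are: it is $0$ when they coincide and grows as they differ. The following article looks concretely at how the value of the KL divergence changes, using the softmax function as an example.

KL divergence between the uniform distribution and probabilities produced by the Softmax function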
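For discrete distributions, $(1.3)$ reduces to a finite sum, which makes the interpretation easy to try out. The function name `kl_discrete` and the distributions are illustrative assumptions.

```python
import math


def kl_discrete(p, q):
    """KL(p||q) = sum_i p_i * log(p_i / q_i), skipping terms with p_i = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


uniform = [0.5, 0.5]
skewed = [0.9, 0.1]
print(kl_discrete(uniform, uniform))  # 0.0
print(kl_discrete(uniform, skewed))   # about 0.511
print(kl_discrete(skewed, uniform))   # about 0.368
```

The last two values differ, which shows that the KL divergence is not symmetric in $p$ and $q$.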

Deriving the diffusion model loss

Deriving Eq. $(3)$ of the DDPM paper

Since $\mathbf{x}_{0}$ corresponds to a training sample, the expected negative log-likelihood $l$ used as the loss can be written as follows.
$$
\large
\begin{align}
l = \mathbb{E} \left[ -\log{p_{\theta}(\mathbf{x}_{0})} \right] \quad (2.1)
\end{align}
$$

From equation $(1)$ of "Diffusion Model overview and summary of definitions", $p_{\theta}(\mathbf{x}_{0})$ can be written as follows.
$$
\large
\begin{align}
p_{\theta}(\mathbf{x}_{0}) = \int p_{\theta}(\mathbf{x}_{0:T}) \, d\mathbf{x}_{1:T}
\end{align}
$$

Using basic operations on joint probability distributions, this can be transformed as follows.
$$
\large
\begin{align}
p_{\theta}(\mathbf{x}_{0}) &= \int p_{\theta}(\mathbf{x}_{0:T}) \, d\mathbf{x}_{1:T} \\
&= \int p_{\theta}(\mathbf{x}_{0:T}) \cdot \frac{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})}{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \, d\mathbf{x}_{1:T} \\
&= \int q(\mathbf{x}_{1:T}|\mathbf{x}_{0}) \cdot \frac{p_{\theta}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \, d\mathbf{x}_{1:T} \\
&= \int q(\mathbf{x}_{1:T}|\mathbf{x}_{0}) \cdot p_{\theta}(\mathbf{x}_{T}) \frac{p_{\theta}(\mathbf{x}_{0:(T-1)}|\mathbf{x}_{T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \, d\mathbf{x}_{1:T} \\
&= \int q(\mathbf{x}_{1:T}|\mathbf{x}_{0}) \cdot p_{\theta}(\mathbf{x}_{T}) \prod_{t=1}^{T} \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})} \, d\mathbf{x}_{1:T} \quad (2.2)
\end{align}
$$

Substituting $(2.2)$ into $(2.1)$ gives the following, where the expectation in $(2.1)$ is taken over the data distribution $q(\mathbf{x}_{0})$.
$$
\large
\begin{align}
l &= \mathbb{E} \left[ -\log{p_{\theta}(\mathbf{x}_{0})} \right] \quad (2.1) \\
&= \int q(\mathbf{x}_{0}) \left( -\log{p_{\theta}(\mathbf{x}_{0})} \right) d \mathbf{x}_{0} \\
&= \int q(\mathbf{x}_{0}) \left( -\log{ \left[ \int q(\mathbf{x}_{1:T}|\mathbf{x}_{0}) \cdot p_{\theta}(\mathbf{x}_{T}) \prod_{t=1}^{T} \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})} d\mathbf{x}_{1:T} \right] } \right) d \mathbf{x}_{0} \quad (2.3)
\end{align}
$$

Applying Jensen's inequality with respect to the probability density $q(\mathbf{x}_{1:T}|\mathbf{x}_{0})$ in $(2.3)$ gives the following. In $(2.4)$ and below, the outer expectation over $\mathbf{x}_{0}$ is left implicit in the notation $\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})}$.
$$
\large
\begin{align}
l &= \int q(\mathbf{x}_{0}) \left( -\log{ \left[ \int q(\mathbf{x}_{1:T}|\mathbf{x}_{0}) \cdot p_{\theta}(\mathbf{x}_{T}) \prod_{t=1}^{T} \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})} d\mathbf{x}_{1:T} \right] } \right) d \mathbf{x}_{0} \quad (2.3) \\
& \leq \int q(\mathbf{x}_{0}) \int q(\mathbf{x}_{1:T}|\mathbf{x}_{0}) \cdot \left( -\log{ \left[ p_{\theta}(\mathbf{x}_{T}) \prod_{t=1}^{T} \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})} \right] } \right) d \mathbf{x}_{1:T} \, d \mathbf{x}_{0} \\
&= \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \left[ -\log{ \left( p_{\theta}(\mathbf{x}_{T}) \prod_{t=1}^{T} \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})} \right) } \right] \quad (2.4)
\end{align}
$$

By the product rule for conditional probabilities, which recombines the factors into $p_{\theta}(\mathbf{x}_{0:T})$ and $q(\mathbf{x}_{1:T}|\mathbf{x}_{0})$, the bound $(2.4)$ can be rewritten as follows.
$$
\large
\begin{align}
l & \leq \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \left[ -\log{ \left( p_{\theta}(\mathbf{x}_{T}) \prod_{t=1}^{T} \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})} \right) } \right] \quad (2.4) \\
&= \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \left[ -\log{ \frac{p_{\theta}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} } \right] \quad (2.5)
\end{align}
$$

Expanding the $\log$ of the product in $(2.4)$ also gives the bound as a sum.
$$
\large
\begin{align}
l & \leq \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \left[ -\log{ \left( p_{\theta}(\mathbf{x}_{T}) \prod_{t=1}^{T} \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})} \right) } \right] \quad (2.4) \\
&= \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \left[ -\log{p_{\theta}(\mathbf{x}_{T})} \, - \, \sum_{t \geq 1} \log{ \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})} } \right] \quad (2.6)
\end{align}
$$

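Equations $(2.5)$ and $(2.6)$ confirm $\mathrm{Eq}. \, (3)$ of the DDPM paper.

Deriving Eq. $(5)$ of the DDPM paper

Equations $(2.5)$ and $(2.6)$ are the variational bound on the negative log-likelihood of $\mathbf{x}_{0}$, that is, an upper bound on it. Writing this bound as $L$, it can be transformed as follows.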
$$
\large
\begin{align}
L &= \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \left[ -\log{ \frac{p_{\theta}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} } \right] \quad (2.5) \\
&= \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \left[ -\log{p_{\theta}(\mathbf{x}_{T})} \, - \, \sum_{t \geq 1} \log{ \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})} } \right] \quad (2.6) \\
&= \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \left[ -\log{p_{\theta}(\mathbf{x}_{T})} \, - \, \sum_{t > 1} \log{ \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})} } \, - \, \log{ \frac{p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1})}{q(\mathbf{x}_{1}|\mathbf{x}_{0})} } \right] \quad (2.7)
\end{align}
$$

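For $t>1$, $q(\mathbf{x}_{t}|\mathbf{x}_{t-1})$ can be rewritten with Bayes' theorem as follows. By the Markov property, $q(\mathbf{x}_{t}|\mathbf{x}_{t-1}) = q(\mathbf{x}_{t}|\mathbf{x}_{t-1},\mathbf{x}_{0})$, so every factor can additionally be conditioned on $\mathbf{x}_{0}$, which gives the second line.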
$$
\large
\begin{align}
q(\mathbf{x}_{t}|\mathbf{x}_{t-1}) &= \frac{q(\mathbf{x}_{t-1}|\mathbf{x}_{t})q(\mathbf{x}_{t})}{q(\mathbf{x}_{t-1})} \\
&= \frac{q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})q(\mathbf{x}_{t}|\mathbf{x}_{0})}{q(\mathbf{x}_{t-1}|\mathbf{x}_{0})} \quad (2.8)
\end{align}
$$

Substituting $(2.8)$ into $(2.7)$ gives the following.
$$
\begin{align}
L &= \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \left[ -\log{p_{\theta}(\mathbf{x}_{T})} \, - \, \sum_{t > 1} \log{ \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})} } \, - \, \log{ \frac{p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1})}{q(\mathbf{x}_{1}|\mathbf{x}_{0})} } \right] \quad (2.7) \\
&= \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \left[ -\log{p_{\theta}(\mathbf{x}_{T})} \, - \, \sum_{t > 1} \log{ \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})} \cdot \frac{q(\mathbf{x}_{t-1}|\mathbf{x}_{0})}{q(\mathbf{x}_{t}|\mathbf{x}_{0})} } \, - \, \log{ \frac{p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1})}{q(\mathbf{x}_{1}|\mathbf{x}_{0})} } \right] \quad (2.9)
\end{align}
$$

For the term $\displaystyle \sum_{t > 1} \log{ \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})} \cdot \frac{q(\mathbf{x}_{t-1}|\mathbf{x}_{0})}{q(\mathbf{x}_{t}|\mathbf{x}_{0})} }$ in $(2.9)$, the following holds because the product of the second factors telescopes.
$$
\large
\begin{align}
& \sum_{t > 1} \log{ \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})} \cdot \frac{q(\mathbf{x}_{t-1}|\mathbf{x}_{0})}{q(\mathbf{x}_{t}|\mathbf{x}_{0})} } \\
&= \sum_{t > 1} \log{ \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})} } + \sum_{t > 1} \log{ \frac{q(\mathbf{x}_{t-1}|\mathbf{x}_{0})}{q(\mathbf{x}_{t}|\mathbf{x}_{0})} } \\
&= \sum_{t > 1} \log{ \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})} } + \log{ \prod_{t > 1} \frac{q(\mathbf{x}_{t-1}|\mathbf{x}_{0})}{q(\mathbf{x}_{t}|\mathbf{x}_{0})} } \\
&= \sum_{t > 1} \log{ \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})} } + \log{ \left[ \frac{q(\mathbf{x}_{1}|\mathbf{x}_{0}) \cdot \cancel{q(\mathbf{x}_{2}|\mathbf{x}_{0})} \cdots \cancel{q(\mathbf{x}_{T-1}|\mathbf{x}_{0})}}{\cancel{q(\mathbf{x}_{2}|\mathbf{x}_{0})} \cdots \cancel{q(\mathbf{x}_{T-1}|\mathbf{x}_{0})} \cdot q(\mathbf{x}_{T}|\mathbf{x}_{0})} \right] } \\
&= \sum_{t > 1} \log{ \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})} } + \log{ \frac{q(\mathbf{x}_{1}|\mathbf{x}_{0})}{q(\mathbf{x}_{T}|\mathbf{x}_{0})} } \quad (2.10)
\end{align}
$$
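As a concrete instance of the telescoping step, the case $T=3$ (a value chosen here only for illustration) can be written out in full:
$$
\large
\begin{align}
\sum_{t=2}^{3} \log{ \frac{q(\mathbf{x}_{t-1}|\mathbf{x}_{0})}{q(\mathbf{x}_{t}|\mathbf{x}_{0})} } &= \log{ \frac{q(\mathbf{x}_{1}|\mathbf{x}_{0})}{q(\mathbf{x}_{2}|\mathbf{x}_{0})} } + \log{ \frac{q(\mathbf{x}_{2}|\mathbf{x}_{0})}{q(\mathbf{x}_{3}|\mathbf{x}_{0})} } \\
&= \log{ \frac{q(\mathbf{x}_{1}|\mathbf{x}_{0})}{q(\mathbf{x}_{3}|\mathbf{x}_{0})} }
\end{align}
$$

The intermediate factor $q(\mathbf{x}_{2}|\mathbf{x}_{0})$ cancels, which agrees with the second term of $(2.10)$ for $T=3$.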

Substituting $(2.10)$ into $(2.9)$ gives the following.
$$
\begin{align}
L &= \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \left[ -\log{p_{\theta}(\mathbf{x}_{T})} \, - \, \sum_{t > 1} \log{ \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})} \cdot \frac{q(\mathbf{x}_{t-1}|\mathbf{x}_{0})}{q(\mathbf{x}_{t}|\mathbf{x}_{0})} } \, - \, \log{ \frac{p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1})}{q(\mathbf{x}_{1}|\mathbf{x}_{0})} } \right] \quad (2.9) \\
&= \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \left[ -\log{p_{\theta}(\mathbf{x}_{T})} \, - \, \log{ \frac{\cancel{q(\mathbf{x}_{1}|\mathbf{x}_{0})}}{q(\mathbf{x}_{T}|\mathbf{x}_{0})} } \, - \, \sum_{t > 1} \log{ \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})} } \, - \, \log{ \frac{p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1})}{\cancel{q(\mathbf{x}_{1}|\mathbf{x}_{0})}} } \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \left[ -\log{ \frac{p_{\theta}(\mathbf{x}_{T})}{q(\mathbf{x}_{T}|\mathbf{x}_{0})} } - \sum_{t > 1} \log{ \frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})} } - \log{p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1})} \right] \quad (2.11) \\
&= \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \left[ D_{KL}(q(\mathbf{x}_{T}|\mathbf{x}_{0}) \| p_{\theta}(\mathbf{x}_{T})) + \sum_{t>1} D_{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0}) \| p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})) - \log{p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1})} \right] \quad (2.12) \\
&= \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})} \left[ L_{T} + \sum_{t > 1} L_{t-1} + L_{0} \right] \quad (2.12)'
\end{align}
$$
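Comparing $(2.12)$ with $(2.12)'$ term by term, $L_{T}$ and $L_{t-1}$ are the two KL terms and $L_{0}$ is $-\log{p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1})}$. The step from $(2.11)$ to $(2.12)$ relies on the fact that the expectation under $q$ of $-\log{(p/q)}$ is $D_{KL}(q \| p)$. The sketch below illustrates this for a pair of one-dimensional Gaussians, comparing the closed-form KL divergence with a numerical approximation of the expectation. The Gaussian assumption, the parameter values, and the function names are hypothetical and serve only as an illustration.

```python
import math


def log_normal(x, mean, var):
    """Log-density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def kl_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form D_KL(N(mu_q, var_q) || N(mu_p, var_p)) for 1-D Gaussians."""
    return 0.5 * (math.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)


def kl_numeric(mu_q, var_q, mu_p, var_p, n=20001, width=10.0):
    """Approximate E_q[-log(p/q)] with a Riemann sum on a grid centred at mu_q."""
    half_range = width * math.sqrt(var_q)
    step = 2.0 * half_range / (n - 1)
    total = 0.0
    for i in range(n):
        x = mu_q - half_range + i * step
        log_q = log_normal(x, mu_q, var_q)
        log_p = log_normal(x, mu_p, var_p)
        # Integrand q(x) * (log q(x) - log p(x)), i.e. q(x) * (-log(p/q))
        total += math.exp(log_q) * (log_q - log_p) * step
    return total


# Hypothetical parameters for q and p, used only to compare the two computations.
closed = kl_gauss(0.2, 0.5, 0.0, 1.0)
approx = kl_numeric(0.2, 0.5, 0.0, 1.0)
print(closed, approx)
```

The two values should agree closely, and the closed form is the reason the $L_{t-1}$ terms can be evaluated without sampling when both distributions are Gaussian.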

An author index for reading papers in machine learning and Deep Learning

Posted: 2023-11-12, updated: 2023-12-10. Author: lib-arts

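Papers often refer to prior work in the body text in the form "oo et al., yyyy". The specific paper can be found in the "References" section, but checking it every time is tedious. This article therefore provides an index for looking up the corresponding work from the author name.

The index focuses on frequently cited, well-known papers in each field, listed in alphabetical order. It prioritizes "this is most likely the paper in question" over rigor, so check the "References" of each paper for the exact citation. In particular, when several papers match, they are usually written as "yyyya, yyyyb, $\cdots$".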

Contents

1 DeepLearning
2 Reinforcement learning
- 2.1 Reinforcement learning × Transformer

DeepLearning

CNN

Author	Work
He et al., 2015	ResNet
Krizhevsky et al., 2012	AlexNet
Simonyan et al., 2014	VGGNet

RNN / Transformer / LLM

Author	Work
Brown et al., 2020	GPT-3
Devlin et al., 2018	BERT
Du et al., 2021	GLaM
Hoffmann et al., 2022	Chinchilla
Kitaev et al., 2020	Reformer
Liu et al., 2019	RoBERTa
Mikolov et al., 2013	Word2vec
Radford et al., 2018	GPT
Radford et al., 2019	GPT-2
Rae et al., 2021	Gopher
Raffel et al., 2020	T5
Smith et al., 2022	Megatron-Turing NLG
Sutskever et al., 2014	seq2seq
Thoppilan et al., 2022	LaMDA
Vaswani et al., 2017	Transformer
Yang et al., 2019	XLNet

Generative models

Author	Work
Goodfellow et al., 2014	GAN: Generative Adversarial Networks
Ho et al., 2020	DDPM
Radford et al., 2021	CLIP
Ramesh et al., 2021	DALL-E
Sohl-Dickstein et al., 2015	Diffusion Model

GNN

Author	Work
Battaglia et al., 2018	Graph Network / Inductive Bias
Gilmer et al., 2017	MPNN: Message Passing Neural Network
Wang et al., 2018	NLNN: Non-Local Neural Network

Point clouds and sets

Author	Work
Lee et al., 2019	Set Transformer
Qi et al., 2017	PointNet
Zaheer et al., 2017	Deep Sets

Reinforcement learning

Reinforcement learning × Transformer

Author	Work
Chen et al., 2021	Decision Transformer

Diffusion models: overview and summary of the formal definitions

Posted: 2023-11-10, updated: 2023-11-16. Author: lib-arts

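The diffusion model, which is built on diffusion and denoising, is a concept adopted in many generative models. This article summarizes the overview and formal definition of diffusion models, together with the derivation of the loss function using tools such as Jensen's inequality.
It was written with reference to the DDPM paper and Chapter $2$, "拡散モデル" (Diffusion Models), of "拡散モデル データ生成技術の数理" (Iwanami Shoten), among other sources.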

・Glossary of terms and formulas
https://www.hello-statisticians.com/explain-terms

拡散モデル　データ生成技術の数理

岡野原大輔

Release date: 2023/02/17