ブログ

最大内積探索(MIPS; Maximum Inner Product Search)まとめ

投稿日: 2023-11-062023-11-06 投稿者: lib-arts

Routing TransformerのようなContent-based Sparse Attentionでは最大内積探索(MIPS; Maximum Inner Product Search)と類似した処理が行われます。当記事では最大内積探索の概要やクラスタリングを用いた計算の効率化について取りまとめました。
「Clustering is efficient for approximate maximum inner product search.」などを参考に作成を行いました。

・用語/公式解説
https://www.hello-statisticians.com/explain-terms

Contents

1 最大内積探索の問題設定
2 クラスタリングを用いた最大内積探索の効率化
- 2.1 k-means
- 2.2 階層k-means

最大内積探索の問題設定

最大内積探索の立式

$n$個の$D$次元ベクトルの集合$\mathcal{X}$を$\mathcal{X} = \{ \mathbf{x}_{1}, \cdots , \mathbf{x}_{n} \}, \, \mathbf{x}_{i} \in \mathbb{R}^{D}$のように定義する。このとき$D$次元のクエリ$q \in \mathbb{R}^{D}$と集合$\mathcal{X}$の要素であるベクトル$\mathbf{x}_{i}$に関する最大内積探索問題(MIPS; Maximum Inner Product Search problem)は下記のように立式することができる。
$$
\large
\begin{align}
\mathrm{arg} \max_{i} q^{\mathrm{T}} \mathbf{x}_{i}
\end{align}
$$

同様に、内積の計算結果が上位$K$に含まれるベクトルを探すK-MIPS問題は下記のように立式される。
$$
\large
\begin{align}
\mathrm{argmax}_{i}^{(K)} q^{\mathrm{T}} \mathbf{x}_{i} \quad (1.1)
\end{align}
$$

最近傍探索と一致する場合

最大内積探索(MIPS)問題は最近傍探索(NNS; Nearest Neighbor Search)と関連する。より厳密には「ベクトル$\mathbf{x}_{i} \in \mathbb{R}^{D}$が同じユークリッドノルム(Euclidean norm)を持つ場合」にMIPSとNNSが一致する。K-NNSの式はクエリ$q$とベクトル$\mathbf{x}_{i}$間のノルムに着目することで上記のように表すことができる。
$$
\large
\begin{align}
\mathrm{argmin}_{i}^{(K)} ||q \, – \, \mathbf{x}_{i}||^{2} \quad (1.2)
\end{align}
$$

ここでクエリ$q$とベクトル$\mathbf{x}_{i}$のユークリッドノルムが定数$C_1, C_2$を用いて$||q||^{2}=C_1, \, ||\mathbf{x}_{i}||^{2}=C_2$のように表せるとき、$(1.2)$式は下記のように同値変形できる。
$$
\large
\begin{align}
\mathrm{argmin}_{i}^{(K)} ||q \, – \, \mathbf{x}_{i}||^{2} &= \mathrm{argmin}_{i}^{(K)} (q \, – \, \mathbf{x}_{i})^{\mathrm{T}} (q \, – \, \mathbf{x}_{i}) \\
&= \mathrm{argmin}_{i}^{(K)} \left( ||q||^{2} \, – \, 2 q^{\mathrm{T}} \mathbf{x}_{i} + ||\mathbf{x}_{i}||^{2} \right) \\
&= \mathrm{argmin}_{i}^{(K)} \left( C_1 \, – \, 2 q^{\mathrm{T}} \mathbf{x}_{i} + C_2 \right) \\
&= \mathrm{argmin}_{i}^{(K)} \left( – \, 2 q^{\mathrm{T}} \mathbf{x}_{i} \right) \\
&= \mathrm{argmax}_{i}^{(K)} q^{\mathrm{T}} \mathbf{x}_{i} \quad (1.1)
\end{align}
$$

上記より、$\mathbf{x}_{i}$のノルムが同じ場合は「同一のクエリ$q$についてのK-MIPS問題がK-NNS問題に帰着する」ことが確認できる。

MCSSと一致する場合

最大内積探索(MIPS)問題は最大余弦類似度探索(MCSS; Maximum Cosine Similarity Search)とも関連する。NNSと同様に厳密には「ベクトル$\mathbf{x}_{i} \in \mathbb{R}^{D}$が同じユークリッドノルム(Euclidean norm)を持つ場合」にMIPSとMCSSSが一致する。K-MCSSの式はクエリ$q$とベクトル$\mathbf{x}_{i}$間のノルムに着目することで上記のように表すことができる。
$$
\large
\begin{align}
\mathrm{argmin}_{i}^{(K)} \frac{q^{\mathrm{T}} \mathbf{x}_{i}}{||q|| ||\mathbf{x}_{i}||} \quad (1.3)
\end{align}
$$

ここでクエリ$q$とベクトル$\mathbf{x}_{i}$のユークリッドノルムが定数$C_1, C_2$を用いて$||q||^{2}=C_1, \, ||\mathbf{x}_{i}||^{2}=C_2$のように表せるとき、$(1.3)$式は下記のように同値変形できる。
$$
\large
\begin{align}
\mathrm{argmin}_{i}^{(K)} \frac{q^{\mathrm{T}} \mathbf{x}_{i}}{||q|| ||\mathbf{x}_{i}||} &= \mathrm{argmin}_{i}^{(K)} \frac{q^{\mathrm{T}} \mathbf{x}_{i}}{C_1 C_2} \\
&= \mathrm{argmax}_{i}^{(K)} q^{\mathrm{T}} \mathbf{x}_{i} \quad (1.1)
\end{align}
$$

上記より、$\mathbf{x}_{i}$のノルムが同じ場合は「同一のクエリ$q$についてのK-MIPS問題がK-MCSS問題に帰着する」ことが確認できる。

クラスタリングを用いた最大内積探索の効率化

k-means

実際の問題でK-MIPS・K-NNS・K-MCSS問題をそれぞれクエリ$q$とベクトル集合$\mathcal{X} = \{ \mathbf{x}_{1}, \cdots , \mathbf{x}_{n} \}$の総当たりで解こうとすると計算量が大きい場合が多い。

総当たりで計算を行う場合は計算量が大きい一方で正確な結果が得られるが、「検索」のように概ね高い類似性が得られればよく「厳密な正確さ」が必要ない場合も多くある。このような際に、k-means法を用いたクラスタリングに基づく解の近似も有力な手法となる。

具体的には下記のような手順に基づいて近似解を得ることができる。
$[1] \,$ ベクトル集合$\mathcal{X} = \{ \mathbf{x}_{1}, \cdots , \mathbf{x}_{n} \}$をk-means法に基づいて$k$個のクラスタに分類
$[2] \,$ クエリ$q$とクラスタの平均ベクトルの内積 or 類似度を計算、値の大きなクラスタを選択
$[3] \,$ 類似度の高いクラスタからクエリ$q$との類似度が高いベクトル$\mathbf{x}_{i}$を選択

上記の$[1]$は参考論文では下記のようなアルゴリズムの形式で表される。

Algorithm$. 1 \,$ (Clustering is efficient for approximate maximum inner product search.)

上記のようにクラスタの平均ベクトルとの内積 or 類似度を計算することによって、計算量を減らすことが可能である。たとえば$10,000$ベクトルを$100$個のクラスタに$100$ずつ分類する場合、類似度の計算回数は$100 \times 2 = 200$であり、総当たりの$10,000$から大きく減らすことができる。

一方で、クエリ$q$がクラスタの境界付近に位置する場合など、必ずしも厳密な結果が得られるわけではないことに注意が必要である。この課題に対してはクラスタを複数選ぶなどの対応策が用いられる場合も多い。

階層k-means

前項で確認を行ったk-meansクラスタリングに基づく近似では「クラスタの数が多い：精度が高いが計算量が多い」、「クラスタの数が少ない：計算量が少ないが精度も低い」というトレードオフがある。この問題については階層k-meansが一つの解決策になりうる。

Figure$. 1 \,$ (Clustering is efficient for approximate maximum inner product search.)

上図のようにベクトルをいくつかの階層型のクラスタに分類しておくことで順々に分類を行うことが可能になる。たとえば$10,000$を$1,000$を$10$個、$1,000$を$10$個、$100$を$10$個のように階層的に分類すると、探索の計算量は$10 \times 4$まで削減できる。

この階層型クラスタは、多くのクラスタを先に作り、徐々に階層型クラスタリングを行うことで構築が可能である。

ウォリスの公式の導出とウォリスの公式を用いた円周率の近似計算

投稿日: 2023-11-052023-11-05 投稿者: lib-arts

ウォリスの公式は円周率の近似などにあたって役立つ公式で、ウォリス積分(Wallis integral)の式から導出を行うことができます。当記事ではウォリス積分に基づくウォリスの公式の導出とPythonを用いた円周率の近似計算例について取り扱いました。
作成にあたっては「チャート式シリーズ大学教養微分積分」の第$4$章「積分($1$変数)」を主に参考にしました。

・数学まとめ
https://www.hello-statisticians.com/math_basic

チャート式シリーズ大学教養微分積分 (チャート式・シリーズ)

3,080円(06/26 22:32時点)

Amazon

Contents

1 ウォリスの公式の導出
- 1.1 ウォリス積分の公式
- 1.2 ウォリスの公式の導出
2 円周率の近似

ウォリスの公式の導出

ウォリス積分の公式

ウォリス積分の公式は下記のように表される。
$$
\large
\begin{align}
\int_{0}^{\frac{\pi}{2}} \cos^{2m}{x} dx &= \int_{0}^{\frac{\pi}{2}} \sin^{2m}{x} dx = \frac{(2m-1)!!}{2m!!} \cdot \frac{\pi}{2} \quad (1) \\
\int_{0}^{\frac{\pi}{2}} \cos^{2m+1}{x} dx &= \int_{0}^{\frac{\pi}{2}} \sin^{2m+1}{x} dx = \frac{2m!!}{(2m+1)!!} \quad (2) \\
m & \in \mathbb{N}
\end{align}
$$

ウォリスの公式の導出

$\displaystyle x \in \left[ 0, \frac{\pi}{2} \right]$の$x$について、$0 \leq \sin{x} \leq 1$であるので下記が成立する。
$$
\large
\begin{align}
\sin^{2n+2}{x} \leq \sin^{2n+1}{x} \leq \sin^{2n}{x}
\end{align}
$$

上記は$\displaystyle x \in \left[ 0, \frac{\pi}{2} \right]$の任意の$x$について成立するので、下記が成立する。
$$
\large
\begin{align}
\int_{0}^{\frac{\pi}{2}} \sin^{2n+2}{x} dx \leq \int_{0}^{\frac{\pi}{2}} \sin^{2n+1}{x} dx \leq \int_{0}^{\frac{\pi}{2}} \sin^{2n}{x} dx
\end{align}
$$

上記に対し、ウォリス積分をそれぞれ適用することで下記が得られる。
$$
\large
\begin{align}
\int_{0}^{\frac{\pi}{2}} \sin^{2n+2}{x} dx \leq & \int_{0}^{\frac{\pi}{2}} \sin^{2n+1}{x} dx \leq \int_{0}^{\frac{\pi}{2}} \sin^{2n}{x} dx \\
\frac{(2n+1)!!}{(2n+2)!!} \cdot \frac{\pi}{2} \leq & \frac{(2n)!!}{(2n+1)!!} \leq \frac{(2n-1)!!}{(2n)!!} \cdot \frac{\pi}{2} \\
\frac{2n+1}{2n+2} \cdot \frac{\pi}{2} \leq & \frac{(2n)!!}{(2n+1)!!} \cdot \frac{(2n)!!}{(2n-1)!!} \leq \frac{\pi}{2}
\end{align}
$$

ここで$\displaystyle \lim_{n \to \infty} \frac{2n+1}{2n+2} = 1$なので、下記が成立する。
$$
\large
\begin{align}
\lim_{n \to \infty} \frac{(2n)!!}{(2n+1)!!} \cdot \frac{(2n)!!}{(2n-1)!!} &= \frac{2^{2} \cdot 4^{2} \cdot 6^{2} \cdot \cdots}{(1 \cdot 3)(3 \cdot 5)(5 \cdot 7) \cdots} \\
&= \prod_{n=1}^{\infty} \frac{(2n)^{2}}{(2n-1)(2n+1)} \\
&= \frac{\pi}{2}
\end{align}
$$

円周率の近似

下記のような計算を行うことでウォリスの公式に基づいて円周率の近似を行うことができる。

import numpy as np

pi = 2.
for i in range(1000):
    n = i+1
    pi = pi * (2*n)**2 / ((2*n-1)*(2*n+1))
    if n%200 == 0:
        print("n: {}, estimated_pi: {:.3f}".format(n, pi))

・実行結果

n: 200, estimated_pi: 3.138
n: 400, estimated_pi: 3.140
n: 600, estimated_pi: 3.140
n: 800, estimated_pi: 3.141
n: 1000, estimated_pi: 3.141

【Transformer】Sparse Attentionの分類とそれぞれの研究例

投稿日: 2023-10-312023-11-03 投稿者: lib-arts

Transformerの計算量は入力系列の長さの二乗に比例することから長い系列を取り扱う際に計算コストの課題が生じます。当記事ではこのような課題の解決にあたって用いられるSparse Attentionの分類とそれぞれの研究例について確認を行います。
作成にあたってはTransformerに関するSurvey論文である「A Survey of Transformers」やそれぞれの論文を参考にしました。

・用語/公式解説
https://www.hello-statisticians.com/explain-terms

Contents

1 Sparse Attention
- 1.1 Sparse Attentionの概要
- 1.2 Sparse Attentionの分類
2 Position-based Sparse Attention
3 Content-based Sparse Attention
- 3.1 Reformer
  - 3.1.1 locality-sensitive hashing
  - 3.1.2 LSH Attention
- 3.2 Routing Transformer
  - 3.2.1 Routing Attention
  - 3.2.2 Centroid vectorsの計算
4 まとめ
5 参考

Sparse Attention

Sparse Attentionの概要

Transformerでは$Q K^{\mathrm{T}}$の計算に基づいてAttention Matrixの計算を行うが、このAttention Matrixの行列のサイズは系列の長さの二乗となり、計算コストが大きい。

このような際にAttention Matrixを疎行列で置き換える手法をSparse Attentionという。Aparse Attentionの活用にあたっては「どのような疎行列で表せば効率良く近似できるか」などを重視すると理解しやすい。

Sparse Attentionの分類

Sparse Attentionを実現する方法は「位置に基づく疎行列の作成」と「隠れ層のベクトルの類似度に基づく疎行列の作成」の二つに大別することができる。

当記事ではSurvey論文を元に位置に基づくSparse Attentionを「Position-based Sparse Attention」、隠れ層のベクトルの類似度に基づくSparse Attentionを「Content-based Sparse Attention」と表し、次節と次々節でそれぞれ取り扱う。

Position-based Sparse Attention

基本パターン

下図を元にPosition-based Sparse Attentionを構成する$5$つのパターンを確認することができる。

図の上側が$l$層から$l+1$層を計算するにあたってのAttentionの重みが$0$かどうかを表し、下側がAttention Matrixの数値に対応する($0$の場合白、$0$以外の場合色付けされる)。上側が特殊な形式のグラフ、下側が隣接行列であると解釈することもできる。

以下で取り扱うStar-Transformer、Longformer、ETC、BigBirdなどの研究ではそれぞれこの基本パターンの組み合わせによってSparse Attentionを構成する。

基本パターンの解釈

基本パターンの解釈を以下にまとめた。

global	特定のトークン(ノード)を全てのトークンのAttention計算に用いるかつ、特定のトークンのAttention計算には全てのトークンの隠れ層を用いる。
band	近隣のトークンのみを用いてAttention計算を行う。CNNとRNNはbandの特殊な例であると解釈することもできる。
dilated	bandによる局所処理のreceptive field(特定の位置の出力を構成する入力領域)を広げるにあたって用いられる処理。詳しくはWaveNet論文などを参照すると良い。
random	ランダムなグラフは完全グラフ(通常のTransformer)と似た性質を持つことを活用。
block local	入力をセグメント単位に分けてAttention計算を行う。

Star-Transformer

A Survey of TransformerではStar-Transformerが下記のような図で表される。

上図より、Star-Transformerが「global attention」と「band attention」の組み合わせであることが確認できる。

Longformer

A Survey of TransformerではLongformerが下記のような図で表される。

上図より、Star-Transformerが「global attention」と「band attention」の組み合わせであることが確認できる。

また、上図で表されるLongformer論文のFigure$\, 2$より、Attention Matrixの詳細が確認できる。Surveyでは「global」＋「band」のみが記載されているが、Longformer論文では「global」＋「dilated」のパターンも記載されている。

ETC

A Survey of TransformerではETC(Extended Transformer Construction)が下記のような図で表される。

上図より、ETCがStar-Transformerと同様に「global attention」と「band attention」の組み合わせであることが確認できる。

BigBird

A Survey of TransformerではBig Birdが下記のような図で表される。

上図より、Big Birdが「global attention」＋「band attention」＋「random attention」であることが確認できる。

また、Big Bird論文でもSurveyと同様な図が確認できる。色遣いが近いことから、Surveyの図はBig Birdをある程度参考に作成したことが推察される。

Content-based Sparse Attention

Content-based Sparse Attentionでは隠れ層のベクトルの値に基づいてAttention Matrixを疎行列で表す手法である。以下、LSH(locality-sensitive hashing)を用いるReformerやRouting Transformerなどの具体的な研究例について確認を行う。

Reformer

ReformerはLSH(locality-sensitive hashing)に基づくLSH AttentionというSparse Attentionを行う。以下、LSHとLSH Attentionについて確認を行う。

locality-sensitive hashing

次元$d_{k}$のベクトル$\mathbf{x} \in \mathbb{R}^{d_{k}}$を$b$値に分類するハッシュ関数を$h$とおく。このとき$h(\mathbf{x}) \in \{1, 2, \cdots, b \}$である。

Reformer論文では乱数に基づく行列$R \in \mathbb{R}^{d_{k} \times b/2}$に基づいてハッシュ関数$h$を定める。ハッシュ関数の定義にあたって、$[\mathbf{u}; \mathbf{v}]$をベクトル$\mathbf{u}$と$\mathbf{v}$の連結(concatnation)であると定義する。

このときハッシュ関数$h(\mathbf{x})$を下記のように定義する。
$$
\large
\begin{align}
h(\mathbf{x}) = \mathrm{argmax}([ \mathbf{x}R; -\mathbf{x}R])
\end{align}
$$

上記の$\mathbf{x}R$が要素数$\displaystyle \frac{b}{2}$のベクトルであるので、ハッシュ関数$h$は要素数$b$のベクトル$[\mathbf{x}R; -\mathbf{x}R]$から最大値を持つインデックスを選択する関数であると解釈できる。

locality-sensitive hashingでは行列$R$がランダムにベクトルを回転させる行列であると解釈することもできるので、直感的には下図のような図を元に理解しておくとと良い。

ReformerにおけるLSH Attentionは当項で確認を行なったハッシュ関数$h$に基づくlocality-sensitive hashingを用いたSparse Attentionである。

LSH Attention

LSH AttentionではTransformerのself-Attentionの式を下記のように改変を行う。
$$
\large
\begin{align}
o_{i} &= \sum_{j \in \mathcal{P}_{i}} \exp{\left( q_{i} \cdot k_{j} \; – \, z(i, \mathcal{P}_{i}) \right)} v_{j} \\
\mathcal{P}_{i} &= \{ j: i \geq j \}
\end{align}
$$

上記の理解にあたっては$\mathcal{P}_{i}$が「位置$i$のqueryである$q_{i}$が類似する位置の集合」であることに着目することが重要である。また、$z$は分配関数(partition function)であり、ソフトマックス関数と同様に出力の正規化を行う関数である。

LSH Attentionでは集合$\mathcal{P}_{i}$を前項で定義したハッシュ関数$h$を用いて下記のように定義する。
$$
\large
\begin{align}
\mathcal{P}_{i} = \{ j : h(q_{i}) = h(k_{j}) \}
\end{align}
$$

上記は「位置$i$に対して$h(q_{i}) = h(k_{j})$が成立する位置$j$の集合を$\mathcal{P}_{i}$と定義する」と解釈することができる。

Routing Transformer

Routing Transformerではクラスタリングと同様の枠組みを用いてSparse Attention処理を行う。以下、クラスタリングに基づくSparse Attentionの計算とクラスタリングに用いるCentroid vectorsの計算について確認する。

Routing Attention

Routing Transformer論文では下記のような式でSparse Attentionの計算を行う。
$$
\large
\begin{align}
X_{i}’ &= \sum_{\substack{j: \mu(K_{j}) = \mu(Q_{i}), \\ j<i}} A_{ij} V_j \\
\mu(Q_{i}) & \in \{ 1, 2, \cdots , k \}, \, \mu(K_{j}) \in \{ 1, 2, \cdots , k \} \\
\boldsymbol{\mu} &= (\mu_{1}, \cdots , \mu_{k}), \quad \mu_{l} \in \mathbb{R}^{d}
\end{align}
$$

上記の式はRouting Transformer論文の$(8)$式に改変を行なったものである。$A$はAttention Matrix、$i$は位置$i$のインデックス、$j$はAttention処理後の位置$i$の隠れ層を計算するにあたって用いる位置$j$のインデックスをそれぞれ表す。$j$は$A$の列に対応する一方で、$V$の行に対応することに注意が必要である。また、$X_{i}’$はFFN層の入力に用いるAttentionの処理結果、$\mu(Q_{i})$は$Q_i$に最も近いCentroid vectorのインデックスを得る関数である。

上記のSparse AttentionにおけるAttention Matrixは下図のように図示することができる。

$(a)$と$(b)$は前節で取り扱ったPosition-based Sparse Attentionの計算を自己回帰的(auto regressive)に表したものである。同様に$(c)$がRouting Transformerで用いるRouting Attentionに対応する。

$(c)$の理解にあたっては、まず図では「赤・青・緑」の三色に$3$つのクラスタを対応させたと解釈すると良い。ここで「緑」の行に注目すると、$(5,2)$成分が薄い緑で表されていることが確認できる。この$(5,2)$成分は同じクラスタの位置の隠れ層を参照することに対応する。同様に下からの$2$行は緑であるが、どちらも$2$列目と$5$列目が薄い緑で表されており、$2$番目と$5$番目の隠れ層を参照してそれぞれ計算を行うことも確認できる。

このようにRouting TransformerにおけるRouting Attentionではそれぞれの隠れ層を$k$個のクラスタに分類し、その分類結果に基づいてSparseなAttention Matrixの生成を行う。

Centroid vectorsの計算

前項で確認を行ったRouting Attentionはクラスタリングの結果に基づいてSparse Attentionの計算を行う手法である。当項ではクラスタリングに用いるCentroid vectorsの$\boldsymbol{\mu} = (\mu_{1}, \cdots , \mu_{k})$の計算について確認する。

$\boldsymbol{\mu} = (\mu_{1}, \cdots , \mu_{k})$はTransformerの隠れ層の次元である$d$次元の$k$個のベクトルに対応するので、下記のような行列で表すこともできる。
$$
\large
\begin{align}
\boldsymbol{\mu} \in \mathbb{R}^{k \times d}
\end{align}
$$

上記の$\boldsymbol{\mu}$は下記のような式に基づいて都度計算を行う。
$$
\large
\begin{align}
\mu_{l} \longleftarrow \lambda \mu_{l} + \frac{1-\lambda}{2} \sum_{i: \mu(Q_i) = l} Q_{i} + \frac{1-\lambda}{2} \sum_{j: \mu(Q_j) = l} K_{j}
\end{align}
$$

要確認：上記の計算では$\displaystyle \sum_{i: \mu(Q_i) = l} Q_{i}$のように和を計算する一方で、本来は平均を計算すべきではないか。

まとめ

参考

・Transformer論文：Attention is All you need[2017]
・A Survey of Transformer
・Star-Transformer
・Longformer
・ETC: Extended Transformer Construction
・Big Bird
・Reformer
・Routing Transformer

Transformerの構成の分類：Encoder-Decoder・Decoder onlyなど

投稿日: 2023-10-282024-10-28 投稿者: lib-arts

BERT・GPT-$3$などのTransformerの応用研究を理解するにあたってはEncoder-Decoder、Encoder only、Decoder onlyのようなTransformerの構成の分類を理解しておくと良いです。当記事ではそれぞれの概要や具体的な研究の分類を行いました。

・用語/公式解説
https://www.hello-statisticians.com/explain-terms

Contents

1 前提の確認
- 1.1 Transformerの仕組み
- 1.2 Transformerの事前学習
2 Transformerの構成の分類

前提の確認

Transformerの仕組み

Dot Product Attentionに主に基づくTransformerの仕組みについては既知である前提で当記事はまとめました。下記などに解説コンテンツを作成しましたので、合わせて参照ください。

・直感的に理解するTransformer(統計の森作成)

Transformerの事前学習

Transformerは強力なモジュールである一方で、大量の学習セットが必要です。そこで予めWikipediaのような巨大なコーパスを元に事前学習(Pre-training)を行なったのちに手元のタスクについてfine-tuningを行うという形式で学習が行われることが多いです。

具体的にはBERT・T$5$・GPT-$3$などの研究が有名です。これらはLLM(Large Language Model)のように総称されることも多く、Transformerの事前学習は昨今の主要なトレンドであると解釈することもできます。

Transformerの構成の分類

概要

Transformerの構成はEncoder-Decoder、Encoder only、Decoder onlyに大別されます。

Encoder-Decoder、Encoder only、Decoder onlyはどれもTransformer論文のFigure$\, 1$を元に理解することが可能です。以下、当節ではEncoder-Decoder、Encoder only、Decoder onlyについてそれぞれ詳しく確認を行います。

Encoder-Decoder

「Encoder-Decoder」はオリジナルのTransformerの論文で用いられた形式で、主に系列変換に用いられる構成です。

上図の全体に対応しており、Encoderについて処理したのちにDecoderの処理を行います。このようなEncoder-Decoderの形式はRNNを用いた$2014$年のSequence to sequenceの研究と同様であり、この頃よく取り組まれていたタスクが機械翻訳であることにこの形式は起因すると解釈できます。

Transformerがこの機械翻訳タスクで高いパフォーマンスを示したことで、その後様々な応用研究がなされるようになり、用いられるTransformerの形式も多様化しました。T$5$もEncoder-Decoderの形式であることも合わせて抑えておくと良いと思います。

Encoder only

オリジナルのTransformerがEncoder-Decoderの形式である一方で、Transformerの応用研究で有名なBERTではFine-tuningによって分類タスクを中心に取り扱うことからEncoder onlyの形式が用いられます。

Decoder only

Decoderのみを用いるTransformerはBERTのようなEncoder onlyと反対にDecoderのみを用いて学習を行います。

Transformer Decoder論文では上記のような数式に基づいて誤差関数の定義や学習を行いますが、同様のlossはGPT系列でも用いられることからChatGPTやGPT-$4$も概ね同じような構成であることが推測できます。

上図はGPT論文のFigure$\, 1$で、図の左側がTransformerの論文の図のDecoderに一致することが確認できます。

Decoder onlyを理解するにあたって重要になるのがMasked Attention(Transformerの論文の図ではMasked Multi Head Attention、GPT論文の図ではMasked Multi Self Attention)です。

Masked Attention: Transformer Decoderの論文より

上記はTransformer論文とTransformer Decoder論文のMasked Attentionについての記載です。時系列が逆にならないようにAttentionを計算する際の行列(隣接行列)をマスクすると解釈して良いと思います。

より具体的には対角成分より上の成分が$0$かつそれ以外の成分が$1$である下三角行列を用いてAttentionに用いる行列との要素積を計算すれば自己回帰(Auto Regressive)処理を実現することができます。

具体的な研究の分類

具体的な研究の分類はGLaM(Generalist Language Model)論文のTable$\, 2$が参考になります。

GPT-$3$以外にもGopherやGLaMでDecoder onlyが用いられることから、LLMの研究ではDecoder onlyが用いられることが多いと解釈して良いと思います。

自動微分の理論と応用⑤：TransformerとDot Product Attention

投稿日: 2023-10-212023-10-30 投稿者: lib-arts

自動微分(Automatic Differentiation)は大規模なニューラルネットワークであるDeepLearningの学習における誤差逆伝播などに用いられる手法です。当記事ではDot Product Attentionに基づくTransformerの簡易版の実装を行いました。
作成にあたっては「ゼロから作るDeep Learning②」の第$5$章「リカレントニューラルネットワーク」の内容を主に参照しました。

・用語/公式解説
https://www.hello-statisticians.com/explain-terms

ゼロから作るDeep Learning ❷ ―自然言語処理編

斎藤康毅

3,960円(06/27 00:08時点)

Amazon

・直感的に理解するTransformerの仕組み

Contents

1 Dot Product Attention
2 Transformer
- 2.1 Transformerレイヤーの計算
  - 2.1.1 Affineクラス
  - 2.1.2 Transformer_Layerクラス
- 2.2 $2$層Transformer

Dot Product Attention

ソフトマックス関数

def softmax(a):    # a is 2D-Array
    c = np.max(a, axis=1)
    exp_a = np.exp(a.T-c)
    sum_exp_a = np.sum(exp_a, axis=0)
    y = (exp_a / sum_exp_a).T
    return y

def softmax_3D(a):    # a is 3D-Array
    N, L, H = a.shape
    c = np.max(a, axis=2)
    exp_a = np.exp(a-c.reshape([N, L, 1]).repeat(H, axis=2))
    sum_exp_a = np.sum(exp_a, axis=2)
    y = exp_a / sum_exp_a.reshape([N, L, 1]).repeat(H, axis=2)
    return y

class Softmax:
    def __init__(self):
        self.loss = None
        self.y = None

    def forward(self, x):
        self.y = softmax_3D(x)
        return self.y

    def backward(self, dout):
        dx = dout * self.y * (1-self.y)
        return dx

内積の計算

Transformerの論文ではDot Product Attentionの計算を下記のように定義する。
$$
\large
\begin{align}
\mathrm{Attention}(K, Q, V) = \mathrm{softmax} \left( \frac{Q K^{\mathrm{T}}}{d_k} \right) V
\end{align}
$$

上記の$\displaystyle \mathrm{softmax} \left( \frac{Q K^{\mathrm{T}}}{d_k} \right)$の演算は下記のように実装することができる。

class CalcAttentionWeight:
    def __init__(self):
        self.h = None
        self.graph = None
        self.softmax = Softmax()

    def forward(self, h):
        self.h = h
        N, L, H = self.h.shape
        scaled_dp = np.zeros([N, L, L])
        for i in range(N):
            scaled_dp[i, :, :] = np.dot(h[i, :, :], h[i, :, :].T) / np.sqrt(H)

        self.graph = self.softmax.forward(scaled_dp)

        return self.graph

    def backward(self, dgraph):
        N, L, H = self.h.shape
        ds = self.softmax.backward(dgraph)
        dh = np.zeros_like(self.h)
        for i in range(N):
            dh[i, :, :] = 2*np.dot(dgraph[i, :, :], self.h[i, :, :]) / np.sqrt(H)
        return dh

上記のbackwardメソッドの計算にあたっては、行列$X$について下記のような行列微分が成立することを用いた。
$$
\large
\begin{align}
\frac{\partial}{\partial X} (X X^{\mathrm{T}}) = 2 X = \frac{\partial}{\partial X} (X^{\mathrm{T}} X)
\end{align}
$$

TransformerのDot Product Attentionでは行列の$Q$と$K$にどちらも隠れ層を用いるので、上記の計算が基本的に対応する。

Dot Product Attention

class MessagePassing3D:
    def __init__(self, adj_mat_3D):
        self.params, self.grads = [], []
        self.graph = adj_mat_3D
        self.h = None

    def forward(self, h):
        self.h = h
        N, L, H = h.shape
        m_t = np.zeros_like(h)
        for i in range(self.graph.shape[1]):
            ar = self.graph[:, i,:].reshape(N, L, 1).repeat(H, axis=2)
            t = h * ar
            m_t[:, i, :] = np.sum(t, axis=1)
        return m_t

    def backward(self, dm_t):
        N, L, H = self.h.shape
        dh = np.zeros_like(self.h)
        dar = np.zeros_like(self.graph)
        for i in range(self.graph.shape[1]):
            ar = self.graph[:, i, :].reshape(N, L, 1).repeat(H, axis=2)
            dh[:, i, :] = np.sum(dm_t * ar, axis=1)
        for i in range(self.graph.shape[0]):
            dar[i, :, :] = np.dot(dm_t[i, :, :], self.h[i, :, :].T)
        return dh, dar

基本的にはグラフニューラルネットワークと同様の実装を行なったが、内積による類似度計算を反映できるようにグラフの隣接行列に対応する配列を(L,L)から(N,L,L)に変えた。

自動微分・誤差逆伝播の理論と応用④：Attentionとグラフニューラルネットワーク

Transformer

Transformerレイヤーの計算

`Affine`クラス

class Affine:
    def __init__(self, W, b):
        self.W = W
        self.b = b
        self.x = None
        self.dW = None
        self.db = None

    def forward(self, x):
        self.x = x
        out = np.dot(x, self.W) + self.b
        return out

    def backward(self, dout):
        dx = np.dot(dout, self.W.T)
        self.dW = np.dot(self.x.T, dout)
        self.db = np.sum(dout, axis=0)
        return dx

Affineクラスについては下記で詳しく取り扱った。

自動微分・誤差逆伝播の理論と応用②：Affine変換の自動微分とニューラルネットワークの実装

`Transformer_Layer`クラス

class Transformer_Layer:
    def __init__(self, W, b):
        self.W = W
        self.b = b
        self.dW = np.zeros_like(W)
        self.db = np.zeros_like(b)
        self.h = None
        self.affines = []
        self.graph = None
        self.calc_mp = None
        self.calc_weight = CalcAttentionWeight()

    def forward(self, h):
        self.h = h
        self.graph = self.calc_weight.forward(self.h)
        self.calc_mp = MessagePassing3D(self.graph)
        m_t = self.calc_mp.forward(h)
        h_next = np.zeros_like(h)
        for i in range(self.graph.shape[1]):
            affine = Affine(self.W, self.b)
            h_next[:, i, :] = affine.forward(m_t[:, i, :])
            self.affines.append(affine)
            
        return h_next

    def backward(self, dh_next):
        N, T, H = self.h.shape
        
        dh_affine = np.zeros_like(self.h)
        for i in range(self.graph.shape[1]):
            dh_affine[:, i, :] = self.affines[i].backward(dh_next[:, i, :])
            self.dW += self.affines[i].dW
            self.db += self.affines[i].db
            
        dh_mp, dgraph = self.calc_mp.backward(dh_affine)
        dh = self.calc_weight.backward(dgraph)
        return dh_mp+dh

$2$層Transformer

出力層

class Aggregate:
    def __init__(self):
        self.h = None
        self.y = None

    def forward(self, h):
        self.h = h
        self.y = np.sum(h, axis=1)
        return self.y

    def backward(self, dy):
        N, T, H = self.h.shape
        dh = dy.reshape(N, 1, H).repeat(T, axis=1)
        return dh

def cross_entropy_error(y,t):
    delta = 1e-7
    return -np.sum(t * np.log(y+delta))

class SoftmaxWithLoss:
    def __init__(self):
        self.loss = None
        self.y = None
        self.t = None

    def forward(self, x, t):
        self.t = t
        self.y = softmax(x)
        self.loss = cross_entropy_error(self.y, self.t)
        return self.loss

    def backward(self, dout=1.):
        batch_size = self.t.shape[0]
        dx = (self.y - self.t) / batch_size
        return dx

$2$層Transformer

from collections import OrderedDict

class TwoLayerTransformer:
    def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
        self.params = {}
        self.params["W1"] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params["b1"] = np.zeros(hidden_size)
        self.params["W2"] = weight_init_std * np.random.randn(hidden_size, output_size)
        self.params["b2"] = np.zeros(output_size)

        # generate layers
        self.layers = OrderedDict()
        self.layers["Transformer_layer1"] = Transformer_Layer(self.params["W1"], self.params["b1"])
        self.layers["Softmax1"] = Softmax()
        self.layers["Transformer_layer2"] = Transformer_Layer(self.params["W2"], self.params["b2"])
        self.layers["Aggregate"] = Aggregate()
        self.lastLayer = SoftmaxWithLoss()

    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)
        return x

    def loss(self, x, t):
        y = self.predict(x)
        return self.lastLayer.forward(y, t)

    def calc_gradient(self, x, t):
        # forward
        self.loss(x, t)

        # backward
        dout = self.lastLayer.backward(1.)
        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)

        # output
        grads = {}
        grads["W1"] = self.layers["Transformer_layer1"].dW
        grads["b1"] = self.layers["Transformer_layer1"].db
        grads["W2"] = self.layers["Transformer_layer2"].dW
        grads["b2"] = self.layers["Transformer_layer2"].db

        return grads

実行例

np.random.seed(0)

alpha = 0.1

x = np.ones([2, 5, 2])
x[0, :, :1] = -1.
t = np.array([[1., 0.], [0., 1.]])

network = TwoLayerTransformer(2, 2, 2)

for i in range(50):
    grad = network.calc_gradient(x, t)

    for key in ("W1", "b1", "W2", "b2"):
        network.params[key] -= alpha * grad[key]

    if (i+1)%10==0:
        print(softmax(network.predict(x)))

・実行結果

[[0.46977806 0.53022194]
 [0.46352429 0.53647571]]
[[0.73811463 0.26188537]
 [0.30103311 0.69896689]]
[[0.96535871 0.03464129]
 [0.03273874 0.96726126]]
[[0.99548783 0.00451217]
 [0.00521023 0.99478977]]
[[9.99530595e-01 4.69404946e-04]
 [1.19280865e-03 9.98807191e-01]]

自動微分・誤差逆伝播の理論と応用④：Attentionとグラフニューラルネットワーク

投稿日: 2023-10-202023-10-30 投稿者: lib-arts

自動微分(Automatic Differentiation)は大規模なニューラルネットワークであるDeepLearningの学習における誤差逆伝播などに用いられる手法です。当記事ではAttention処理とグラフニューラルネットワーク(GNN; Graph Neural Network)の実装について取り扱いました。
作成にあたっては「ゼロから作るDeep Learning②」の第$8$章「Attention」の内容を元に大幅に改変を行いました。

・用語/公式解説
https://www.hello-statisticians.com/explain-terms

ゼロから作るDeep Learning ❷ ―自然言語処理編

斎藤康毅

3,960円(06/27 00:08時点)

Amazon

・GNN入門

グラフニューラルネットワーク(GNN; Graph Neural Network)入門

Contents

1 MessagePassing
2 GNNの実装
- 2.1 GNNのレイヤーの実装
  - 2.1.1 クラスの実装
  - 2.1.2 使用例
- 2.2 $2$層GNNの実装
  - 2.2.1 関数・クラスの実装
  - 2.2.2 実行例

MessagePassing

NumPy.repeat

AttentionのようなMessagePassing処理を取り扱う際に係数のベクトルへの掛け算などを行列積で取り扱うのはなかなか難しいので、NumPy.repeatを用いることで要素積の計算を行うというのも一つの手段である。NumPy.repeatはたとえば下記のように使用することができる。

import numpy as np

x1 = np.array([[1., 2., 3.]])

print(x1.shape)
print(x1.repeat(2, axis=0).shape)
print(x1.repeat(2, axis=1).shape)
print(x1.repeat(2, axis=0))
print(x1.repeat(2, axis=1))

・実行結果

(1, 3)
(2, 3)
(1, 6)
[[1. 2. 3.]
 [1. 2. 3.]]
[[1. 1. 2. 2. 3. 3.]]

上記は$2$次元配列の場合であるが、$3$次元の場合も下記のように繰り返し操作を行える。

x2 = np.array([[[1., 2., 3.]]])

print(x2.shape)
print(x2.repeat(2, axis=0).shape)
print(x2.repeat(2, axis=1).shape)
print(x2.repeat(2, axis=2).shape)
print(x2.repeat(2, axis=0))
print(x2.repeat(2, axis=1))
print(x2.repeat(2, axis=2))

・実行結果

(1, 1, 3)
(2, 1, 3)
(1, 2, 3)
(1, 1, 6)
[[[1. 2. 3.]]

 [[1. 2. 3.]]]
[[[1. 2. 3.]
  [1. 2. 3.]]]
[[[1. 1. 2. 2. 3. 3.]]]

Attentionの実装

Attentionの基本処理

「ゼロから作るDeepLearning②」ではAttentionの重み付け和の計算にあたって、下記のようなWeightSumクラスを定義する。

class WeightSum:
    def __init__(self):
        self.params, self.grads = [], []
        self.cache = None

    def forward(self, hs, a):
        N, T, H = hs.shape
        ar = a.reshape(N, T, 1).repeat(H, axis=2)
        t = hs * ar
        c = np.sum(t, axis=1)
        self.cache = (hs, ar)
        return c

    def backward(self, dc):
        hs, ar = self.cache
        N, T, H = hs.shape
        dt = dc.reshape(N, 1, H).repeat(T, axis=1)
        dar = dt * hs
        dhs = dt * ar
        da = np.sum(dar, axis=2)
        return dhs, da

順伝播の実行例

a = np.array([[1., 0.5], [0.2, 0.5]])
hs = np.ones([2, 2, 3])

calc_weight_sum = WeightSum()
c = calc_weight_sum.forward(hs, a)
print(c)

・実行結果

[[1.5 1.5 1.5]
 [0.7 0.7 0.7]]

逆伝播の実行例

dc = np.ones([2, 3])
dhs, da = calc_weight_sum.backward(dc)

print(dhs.shape)
print(dhs)
print("===")
print(da.shape)
print(da)

・実行結果

(2, 2, 3)
[[[1.  1.  1. ]
  [0.5 0.5 0.5]]

 [[0.2 0.2 0.2]
  [0.5 0.5 0.5]]]
===
(2, 2)
[[3. 3.]
 [3. 3.]]

MessagePassingの実装

隣接行列：グラフの定義

$5$ノードのグラフの隣接行列は下記のように定義することができる。

adj_mat = np.array([[1, 0, 1, 1, 0], [0, 1, 1, 0, 1], [1, 1, 1, 0, 1], [1, 0, 0, 1, 1], [0, 1, 1, 1, 1]])
print(adj_mat)
print(adj_mat==adj_mat.T)

・実行結果

[[1 0 1 1 0]
 [0 1 1 0 1]
 [1 1 1 0 1]
 [1 0 0 1 1]
 [0 1 1 1 1]]
[[ True  True  True  True  True]
 [ True  True  True  True  True]
 [ True  True  True  True  True]
 [ True  True  True  True  True]
 [ True  True  True  True  True]]

ここでは無向グラフを取り扱うので、隣接行列が対称行列であることも合わせて確認しておくと良い。

`MessagePassing`クラスの実装

WeightSumクラスを改変することで下記のようにMessagePassingクラスの実装を行なった。

class MessagePassing:
    def __init__(self, adj_mat):
        self.params, self.grads = [], []
        self.graph = adj_mat
        self.cache = None

    def forward(self, h):
        N, T, H = h.shape
        m_t = np.zeros_like(h)
        for i in range(self.graph.shape[0]):
            ar = self.graph[i,:].reshape(1, T, 1).repeat(N, axis=0).repeat(H, axis=2)
            t = h * ar
            m_t[:, i, :] = np.sum(t, axis=1)
        self.cache = h
        return m_t

    def backward(self, dc):
        h = self.cache
        N, T, H = h.shape
        dh = np.zeros_like(h)
        for i in range(self.graph.shape[0]):
            ar = self.graph[i, :].reshape(1, T, 1).repeat(N, axis=0).repeat(H, axis=2)
            dt = dc.reshape(N, 1, H).repeat(T, axis=1)
            dh[:, i, :] = np.sum(dt * ar, axis=1)
        return dh

上記のMessagePassingクラスではWeightSumクラスの重み付け和の計算にあたって、隣接行列を用いるように改変を行なった。

`MessagePassing`クラスの実行例

■ 順伝播

順伝播の計算は下記のように実行することができる。

calc_mp = MessagePassing(adj_mat)

h = np.ones([2, 5, 3])
h[0,0,0] = 5
m_t = calc_mp.forward(h)

print(m_t.shape)
print(m_t)

・実行結果

(2, 5, 3)
[[[7. 3. 3.]
  [3. 3. 3.]
  [8. 4. 4.]
  [7. 3. 3.]
  [4. 4. 4.]]

 [[3. 3. 3.]
  [3. 3. 3.]
  [4. 4. 4.]
  [3. 3. 3.]
  [4. 4. 4.]]]

上記の結果が正しいかどうかは、下記のような計算を行うことで確認できる。

print(np.sum(h[0, :, 0] * adj_mat[0, :]))
print(np.sum(h[0, :, 0] * adj_mat[2, :]))
print(np.sum(h[0, :, 0] * adj_mat[3, :]))

・実行結果

7.0
8.0
7.0

■ 逆伝播

dc = np.ones([2, 3])
dh = calc_mp.backward(dc)
print(dh)

・実行結果

[[[3. 3. 3.]
  [3. 3. 3.]
  [4. 4. 4.]
  [3. 3. 3.]
  [4. 4. 4.]]

 [[3. 3. 3.]
  [3. 3. 3.]
  [4. 4. 4.]
  [3. 3. 3.]
  [4. 4. 4.]]]

GNNの実装

GNNのレイヤーの実装

クラスの実装

■ Affineクラス

class Affine:
    def __init__(self, W, b):
        self.W = W
        self.b = b
        self.x = None
        self.dW = None
        self.db = None

    def forward(self, x):
        self.x = x
        out = np.dot(x, self.W) + self.b
        return out

    def backward(self, dout):
        dx = np.dot(dout, self.W.T)
        self.dW = np.dot(self.x.T, dout)
        self.db = np.sum(dout, axis=0)
        return dx

■ GNN_Layerクラス

class GNN_Layer:
    def __init__(self, W, b, adj_mat):
        self.W = W
        self.b = b
        self.dW = np.zeros_like(W)
        self.db = np.zeros_like(b)
        self.h = None
        self.graph = adj_mat

    def forward(self, h):
        N, T, H = h.shape
        self.h = h

        self.affines = []
        
        self.calc_mp = MessagePassing(self.graph)
        m_t = self.calc_mp.forward(h)
        h_next = np.zeros_like(h)
        for i in range(self.graph.shape[0]):
            affine = Affine(self.W, self.b)
            h_next[:, i, :] = affine.forward(m_t[:, i, :])
            self.affines.append(affine)
            
        return h_next

    def backward(self, dh_next):
        N, T, H = self.h.shape
        
        dh = np.zeros_like(self.h)
        for i in range(self.graph.shape[0]):
            dh[:, i, :] = self.affines[i].backward(dh_next[:, i, :])
            self.dW += self.affines[i].dW
            self.db += self.affines[i].db
            
        dh = self.calc_mp.backward(dh)
        return dh

使用例

■ 順伝播

N, T, H = 2, 5, 3

h = np.ones([2, 5, 3])
W = np.ones([H, H])
b = np.ones(H)

adj_mat = np.array([[1, 0, 1, 1, 0], [0, 1, 1, 0, 1], [1, 1, 1, 0, 1], [1, 0, 0, 1, 1], [0, 1, 1, 1, 1]])

gnn_layer = GNN_Layer(W, b, adj_mat)
h_next = gnn_layer.forward(h)

print(h_next)

・実行例

[[[10. 10. 10.]
  [10. 10. 10.]
  [13. 13. 13.]
  [10. 10. 10.]
  [13. 13. 13.]]

 [[10. 10. 10.]
  [10. 10. 10.]
  [13. 13. 13.]
  [10. 10. 10.]
  [13. 13. 13.]]]

■ 逆伝播

dh_next = np.ones([2, 5, 3])

dh = gnn_layer.backward(dh_next)

print(dh)

・実行結果

[[[ 9.  9.  9.]
  [ 9.  9.  9.]
  [12. 12. 12.]
  [ 9.  9.  9.]
  [12. 12. 12.]]

 [[ 9.  9.  9.]
  [ 9.  9.  9.]
  [12. 12. 12.]
  [ 9.  9.  9.]
  [12. 12. 12.]]]

$2$層GNNの実装

関数・クラスの実装

■ 活性化関数・誤差関数

def softmax(a):    # a is 2D-Array
    c = np.max(a, axis=1)
    exp_a = np.exp(a.T-c)
    sum_exp_a = np.sum(exp_a, axis=0)
    y = (exp_a / sum_exp_a).T
    return y

def softmax_3D(a):    # a is 3D-Array
    N, L, H = a.shape
    c = np.max(a, axis=2)
    exp_a = np.exp(a-c.reshape([N, L, 1]).repeat(H, axis=2))
    sum_exp_a = np.sum(exp_a, axis=2)
    y = exp_a / sum_exp_a.reshape([N, L, 1]).repeat(H, axis=2)
    return y

def cross_entropy_error(y,t):
    delta = 1e-7
    return -np.sum(t * np.log(y+delta))

class Softmax:
    def __init__(self):
        self.loss = None
        self.y = None

    def forward(self, x):
        self.y = softmax_3D(x)
        return self.y

    def backward(self, dout):
        dx = dout * self.y * (1-self.y)
        return dx

class Aggregate:
    def __init__(self):
        self.h = None
        self.y = None

    def forward(self, h):
        self.h = h
        self.y = np.sum(h, axis=1)
        return self.y

    def backward(self, dy):
        N, T, H = self.h.shape
        dh = dy.reshape(N, 1, H).repeat(T, axis=1)
        return dh

class SoftmaxWithLoss:
    def __init__(self):
        self.loss = None
        self.y = None
        self.t = None

    def forward(self, x, t):
        self.t = t
        self.y = softmax(x)
        self.loss = cross_entropy_error(self.y, self.t)
        return self.loss

    def backward(self, dout=1.):
        batch_size = self.t.shape[0]
        dx = (self.y - self.t) / batch_size
        return dx

途中で用いたソフトマックス関数は下記のように定義される。
$$
\large
\begin{align}
y_k &= \mathrm{Softmax}(x_k) = \frac{\exp{(x_k)}}{S} \\
S &= \sum_{j} \exp{(x_j)}
\end{align}
$$

上記の$x_k$に関する偏微分は分数関数の微分の公式を元に下記のように計算できる。
$$
\large
\begin{align}
\frac{\partial y_k}{\partial x_k} &= \frac{\exp{(x_k)}S – \exp{(x_k)} \cdot \exp{(x_k)}}{S^{2}} \\
&= \frac{\exp{(x_k)}(S – \exp{(x_k)})}{S^{2}} \\
&= \frac{\exp{(x_k)}}{S} \cdot \left( 1 – \frac{\exp{(x_k)}}{S} \right) = y_k(1-y_k)
\end{align}
$$

上記に基づいてSoftmaxクラスのbackwardメソッドの実装を行なった。

■ $2$層GNN

from collections import OrderedDict

class TwoLayerGNN:
    def __init__(self, input_size, hidden_size, output_size, adj_mat, weight_init_std=0.01):
        self.params = {}
        self.params["W1"] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params["b1"] = np.zeros(hidden_size)
        self.params["W2"] = weight_init_std * np.random.randn(hidden_size, output_size)
        self.params["b2"] = np.zeros(output_size)

        # generate layers
        self.layers = OrderedDict()
        self.layers["GNN_layer1"] = GNN_Layer(self.params["W1"], self.params["b1"], adj_mat)
        self.layers["Softmax1"] = Softmax()
        self.layers["GNN_layer2"] = GNN_Layer(self.params["W2"], self.params["b2"], adj_mat)
        self.layers["Aggregate"] = Aggregate()
        self.lastLayer = SoftmaxWithLoss()

    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)
        return x

    def loss(self, x, t):
        y = self.predict(x)
        return self.lastLayer.forward(y, t)

    def calc_gradient(self, x, t):
        # forward
        self.loss(x, t)

        # backward
        dout = self.lastLayer.backward(1.)
        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)

        # output
        grads = {}
        grads["W1"] = self.layers["GNN_layer1"].dW
        grads["b1"] = self.layers["GNN_layer1"].db
        grads["W2"] = self.layers["GNN_layer2"].dW
        grads["b2"] = self.layers["GNN_layer2"].db

        return grads

実行例

■ softmax_3D関数

x = np.ones([2,5,3])
x[:, :, 1] = 2
softmax_3D(x)

・実行結果

array([[[0.21194156, 0.57611688, 0.21194156],
        [0.21194156, 0.57611688, 0.21194156],
        [0.21194156, 0.57611688, 0.21194156],
        [0.21194156, 0.57611688, 0.21194156],
        [0.21194156, 0.57611688, 0.21194156]],

       [[0.21194156, 0.57611688, 0.21194156],
        [0.21194156, 0.57611688, 0.21194156],
        [0.21194156, 0.57611688, 0.21194156],
        [0.21194156, 0.57611688, 0.21194156],
        [0.21194156, 0.57611688, 0.21194156]]])

■ 学習の実行

np.random.seed(0)

alpha = 0.1
adj_mat = np.array([[1, 0, 1, 1, 0], [0, 1, 1, 0, 1], [1, 1, 1, 0, 1], [1, 0, 0, 1, 1], [0, 1, 1, 1, 1]])

x = np.ones([2, 5, 2])
x[0, :, :1] = -1.
t = np.array([[1., 0.], [0., 1.]])

network = TwoLayerGNN(2, 2, 2, adj_mat)

for i in range(10):
    grad = network.calc_gradient(x, t)

    for key in ("W1", "b1", "W2", "b2"):
        network.params[key] -= alpha * grad[key]

    if (i+1)%2==0:
        print(softmax(network.predict(x)))

・実行結果

[[9.99882931e-01 1.17068837e-04]
 [9.99882862e-01 1.17138203e-04]]
[[2.09556970e-08 9.99999979e-01]
 [3.65517109e-09 9.99999996e-01]]
[[9.99966650e-01 3.33497863e-05]
 [1.24346800e-12 1.00000000e+00]]
[[1.00000000e+00 1.81375702e-28]
 [1.59424189e-14 1.00000000e+00]]
[[1.00000000e+00 1.22993158e-51]
 [1.62446544e-16 1.00000000e+00]]

■ 推論

x = np.array([[[-1, 1], [-1, 1], [-0.5, 0.3], [-1.2, 0.8], [-0.9, 1.]]])
print(network.predict(x))
print(softmax(network.predict(x)))

・実行結果

[[ 58.68030203 -58.54451771]]
[[1.00000000e+00 1.23000717e-51]]

自動微分・誤差逆伝播の理論と応用③：RNN(Recurrent Neural Network)

投稿日: 2023-10-172023-10-17 投稿者: lib-arts

自動微分(Automatic Differentiation)は大規模なニューラルネットワークであるDeepLearningの学習における誤差逆伝播などに用いられる手法です。当記事ではリカレントニューラルネットワーク(RNN; Recurrent Neural Network)の実装について取り扱いました。
作成にあたっては「ゼロから作るDeep Learning②」の第$5$章「リカレントニューラルネットワーク」の内容を主に参照しました。

・用語/公式解説
https://www.hello-statisticians.com/explain-terms

ゼロから作るDeep Learning ❷ ―自然言語処理編

斎藤康毅

3,960円(06/27 00:08時点)

Amazon

Contents

1 RNNモジュールの実装
2 RNN全体の実装
- 2.1 実装
- 2.2 使用例

RNNモジュールの実装

概要

RNNにおける基本的な再帰処理は下記のような数式で表すことができる。
$$
\large
\begin{align}
\mathbf{h}_{t} &= \tanh{(\mathbf{h}_{t-1} W_{h} + \mathbf{x}_{t} W_{x} + \mathbf{b})} \\
\mathbf{h}_{t-1}^{\mathrm{T}}, \, \mathbf{h}_{t}^{\mathrm{T}} & \in \mathbb{R}^{H}, \, W_{h} \in \mathbb{R}^{H \times H}, \, \mathbf{x}_{t}^{\mathrm{T}} \in \mathbb{R}^{D}, \, W_{x} \in \mathbb{R}^{D \times H}, \, \mathbf{b}^{\mathrm{T}} \in \mathbb{R}^{H}
\end{align}
$$

上記はバッチサイズが$N=1$の場合の表記である。同様にバッチサイズ$N$の場合は下記のような数式で再帰処理を表せる。
$$
\large
\begin{align}
\mathbf{h}_{t} &= \tanh{(\mathbf{h}_{t-1} W_{h} + \mathbf{x}_{t} W_{x} + \mathbf{b})} \\
\mathbf{h}_{t-1}, \, \mathbf{h}_{t} & \in \mathbb{R}^{N \times H}, \, W_{h} \in \mathbb{R}^{H \times H}, \, \mathbf{x}_{t} \in \mathbb{R}^{N \times D}, \, W_{x} \in \mathbb{R}^{D \times H}, \, \mathbf{b} \in \mathbb{R}^{N \times H}
\end{align}
$$

上記の微分はAffine変換の微分と同様に取り扱えば良い。

自動微分・誤差逆伝播の理論と応用②：Affine変換の自動微分とニューラルネットワークの実装

$\tanh{(x)}$の定義と微分

ハイパボリックタンジェント関数$\tanh{(x)}$は下記のように定義される。
$$
\large
\begin{align}
\tanh{(x)} = \frac{e^{x} – e^{-x}}{e^{x} + e^{-x}}
\end{align}
$$

上記の$x$に関する微分は分数関数の微分の公式に基づいて下記のように得られる。
$$
\large
\begin{align}
(\tanh{(x)})’ &= \frac{(e^{x} + e^{-x})^{2} – (e^{x} – e^{-x})^{2}}{(e^{x} + e^{-x})^{2}} \\
&= 1 – \frac{(e^{x} – e^{-x})^{2}}{(e^{x} + e^{-x})^{2}} \\
&= 1 – \tanh{(x)}
\end{align}
$$

実装と使用例

class RNN:
    def __init__(self, Wx, Wh, b):
        self.params = [Wx, Wh, b]
        self.grads = [np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)]
        self.cache = None

    def forward(self, x, h_prev):
        Wx, Wh, b = self.params
        t = np.dot(h_prev, Wh) + np.dot(x, Wx) + b
        h_next = np.tanh(t)
        self.cache = (x, h_prev, h_next)
        return h_next

    def backward(self, dh_next):
        Wx, Wh, b = self.params
        x, h_prev, h_next = self.cache
        
        dt = dh_next * (1. - h_next**2)
        db = np.sum(dt, axis=0)
        dWh = np.dot(h_prev.T, dt)
        dh_prev = np.dot(dt, Wh.T)
        dWx = np.dot(x.T, dt)
        dx = np.dot(dt, Wx.T)

        self.grads[0][...] = dWx
        self.grads[1][...] = dWh
        self.grads[2][...] = db

        return dx, dh_prev

順伝播の計算

下記のようにRNNクラスを用いることで順伝播の計算を行うことができる。

Wx = np.array([[-2., 1., 3.], [2., -3., 1.]])
Wh = np.array([[1., 2., -5.], [2., -5., 7.], [1., 5., 2]])
b = np.array([-2., -2., -5.])

x = np.array([[1., 2.]])
h_prev = np.array([[2., 1., 2.]])
dh_next = np.array([1., 1., 1.])

rnn = RNN(Wx, Wh, b)
h_next = rnn.forward(x, h_prev)

print(h_next)

・実行結果

[[0.99998771 0.96402758 0.76159416]]

上記では順伝播の計算を行ったが、計算の途中経過が正しいことは下記より確認できる。

t = np.dot(h_prev, Wh) + np.dot(x, Wx) + b
h_next = np.tanh(t)

print(np.dot(h_prev, Wh))
print(np.dot(x, Wx))
print(b)
print(t)
print(h_next)

・実行結果

[[6. 9. 1.]]
[[ 2. -5.  5.]]
[-2. -2. -5.]
[[6. 2. 1.]]
[[0.99998771 0.96402758 0.76159416]]

上記の計算は下記の数式の計算と同様である。
・t = np.dot(h_prev, Wh) + np.dot(x, Wx) + bの計算
$$
\begin{align}
& H_{t-1} W_h + X W_x + B \\
&= \left( \begin{array}{ccc} 2 & 1 & 2 \end{array} \right)\left( \begin{array}{ccc} 1 & 2 & -5 \\ 2 & -5 & 7 \\ 1 & 5 & 2 \end{array} \right) + \left( \begin{array}{cc} 1 & 2 \end{array} \right)\left( \begin{array}{ccc} -2 & 1 & 3 \\ 2 & -3 & 1 \end{array} \right) + \left( \begin{array}{ccc} -2 & -2 & 5 \end{array} \right) \\
&= \left( \begin{array}{ccc} 6 & 9 & 1 \end{array} \right) + \left( \begin{array}{ccc} 2 & -5 & 5 \end{array} \right) + \left( \begin{array}{ccc} -2 & -2 & -5 \end{array} \right) \\
&= \left( \begin{array}{ccc} 6 & 2 & 1 \end{array} \right)
\end{align}
$$

・h_next = np.tanh(t)の計算
$$
\large
\begin{align}
\tanh{(6)} &= \frac{e^{6}-e^{-6}}{e^{6}+e^{-6}} = 0.999 \cdots \\
\tanh{(2)} &= \frac{e^{2}-e^{-2}}{e^{2}+e^{-2}} = 0.964 \cdots \\
\tanh{(1)} &= \frac{e^{1}-e^{-1}}{e^{1}+e^{-1}} = 0.761 \cdots
\end{align}
$$

RNN全体の実装

実装

class TimeRNN:
    def __init__(self, Wx, Wh, b):
        self.params = [Wx, Wh, b]
        self.grads = [np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)]
        self.layers = None
        self.h, self.dh = None, None

    def forward(self, xs):
        Wx, Wh, b = self.params
        N, T, D = xs.shape
        D, H = Wx.shape
        
        self.layers = []
        hs = np.empty((N, T, H), dtype="f")
        
        if self.h is None:
            self.h = np.zeros((N, H), dtype="f")

        for t in range(T):
            layer = RNN(*self.params)
            self.h = layer.forward(xs[:, t, :], self.h)
            hs[:, t, :] = self.h
            self.layers.append(layer)

        return hs

使用例

RNN全体の実装は基本的には前節のRNNクラスを複数回用いて構築を行う。以下、バッチ数が$N=2$、時系列の数が$T=2$、各入力ベクトルの次元が$D=3$、隠れ層の次元が$H=2$の場合について確認を行う。まず、入力に対応する配列を下記のように定義する。

xs = np.array([[[1., 2., 1.], [2., 1., 1.]], [[1., -2., 1.], [2., -1., 1.]]])
print(xs.shape)  # (N, T, D)

・実行結果

(2, 2, 3)

また、パラメータの$W_x, W_h, \mathbf{b}$はRNNクラスの各インスタンスで共有するので、下記のように定義できる。

Wx = np.array([[1., 1.], [2., -1], [-1., 1.]])
Wh = np.array([[1., -1.], [-1., 2.]])
b = np.array([[-1., 1.]]).repeat(2, axis=0)

print(Wx.shape)  # (D, H)
print(Wh.shape)  # (H, H)
print(b.shape)   # (N, H)

・実行結果

(3, 2)
(2, 2)
(2, 2)

このとき、TimeRNNクラスを用いて下記のようにRNNの順伝播を計算することができる。

rnns = TimeRNN(Wx, Wh, b)
hs = rnns.forward(xs)

print(xs.shape)
print(hs.shape)
print(hs[:, 1, :])

・実行結果

(2, 2, 3)
(2, 2, 2)
[[ 0.97729546  0.9982775 ]
 [-0.99932903  0.99999976]]

上記の結果は下記の実行結果に一致する。

h1 = np.tanh(np.dot(xs[:, 0, :], Wx) + b)
h2 = np.tanh(np.dot(h1, Wh) + np.dot(xs[:, 1, :], Wx) + b)
print(h2)

・実行結果

[[ 0.97729548  0.99827751]
 [-0.99932906  0.99999977]]

自動微分・誤差逆伝播の理論と応用②：Affine変換の自動微分とニューラルネットワークの実装

投稿日: 2023-10-152023-10-15 投稿者: lib-arts

自動微分(Automatic Differentiation)は大規模なニューラルネットワークであるDeepLearningの学習における誤差逆伝播などに用いられる手法です。当記事ではAffine変換の自動微分とニューラルネットワーク(Neural Network)の実装について取りまとめました。
作成にあたっては「ゼロから作るDeep Learning」の第$5$章「誤差逆伝播法」の内容を主に参照しました。

・用語/公式解説
https://www.hello-statisticians.com/explain-terms

ゼロから作るDeep Learning ―Pythonで学ぶディープラーニングの理論と実装

斎藤康毅

3,740円(06/27 10:20時点)

Amazon

Contents

1 Affine変換の自動微分
2 ニューラルネットワーク
- 2.1 誤差関数と誤差逆伝播
- 2.2 実装と使用例
  - 2.2.1 ソフトマックス関数と交差エントロピー誤差関数
  - 2.2.2 ニューラルネットワーク

Affine変換の自動微分

Affine変換の式定義と転置

MLP(Multi Layer Perceptron)の順伝播の計算はAffine変換の式を元に定義することができる。たとえば入力層を$\mathbf{x}$、計算に用いるパラメータを$W$、出力層を$\mathbf{y}$、バイアス項を$\mathbf{b}$とおくとき、Affine変換の式は下記のように表せる。
$$
\large
\begin{align}
\mathbf{y} = W \mathbf{x} + \mathbf{b} \quad (1.1)
\end{align}
$$

$\mathbf{y} \in \mathbb{R}^{d_x}, \, \mathbf{y} \in \mathbb{R}^{d_y}$であるとき、$(1.1)$式は$W \in \mathbb{R}^{d_y \times d_x}, \, \mathbf{b} \in \mathbb{R}^{d_y}$を用いて下記のように表すこともできる。
$$
\large
\begin{align}
\mathbf{y} &= W \mathbf{x} + \mathbf{b} \quad (1.1) \\
\left( \begin{array}{c} y_1 \\ \vdots \\ y_{d_y} \end{array} \right) &= \left( \begin{array}{ccc} w_{11} & \cdots & w_{1, d_x} \\ \vdots & \ddots & \vdots \\ w_{d_y, 1} & \cdots & w_{d_y, d_x} \end{array} \right) \left( \begin{array}{c} x_1 \\ \vdots \\ x_{d_x} \end{array} \right) + \left( \begin{array}{c} b_1 \\ \vdots \\ b_{d_y} \end{array} \right) \quad (1.2)
\end{align}
$$

上記の式はオーソドックスな行列とベクトルの積の形式である一方で、統計学や機械学習では$\mathbf{x}$に複数サンプルを用いるにあたって、下記のような転置の形式を用いることが多い。
$$
\large
\begin{align}
\left( \begin{array}{c} y_1 \\ \vdots \\ y_{d_y} \end{array} \right)^{\mathrm{T}} &= \left[ \left( \begin{array}{ccc} w_{11} & \cdots & w_{1, d_x} \\ \vdots & \ddots & \vdots \\ w_{d_y, 1} & \cdots & w_{d_y, d_x} \end{array} \right) \left( \begin{array}{c} x_1 \\ \vdots \\ x_{d_x} \end{array} \right) + \left( \begin{array}{c} b_1 \\ \vdots \\ b_{d_y} \end{array} \right) \right]^{\mathrm{T}} \\
\left( \begin{array}{ccc} y_1 & \cdots & y_{d_y} \end{array} \right) &= \left( \begin{array}{ccc} x_1 & \cdots & x_{d_x} \end{array} \right) \left( \begin{array}{ccc} w_{11} & \cdots & w_{1, d_y} \\ \vdots & \ddots & \vdots \\ w_{d_x, 1} & \cdots & w_{d_x, d_y} \end{array} \right) + \left( \begin{array}{ccc} b_1 & \cdots & b_{d_y} \end{array} \right) \quad (1.2)’
\end{align}
$$

$(1.2)’$式を元に$n$個の複数サンプルの入力と出力の対応の式を下記のように表すことができる。
$$
\begin{align}
\left( \begin{array}{ccc} y_{11} & \cdots & y_{1,d_y} \\ \vdots & \ddots & \vdots \\ y_{n,1} & \cdots & y_{n,d_y} \end{array} \right) &= \left( \begin{array}{ccc} x_{11} & \cdots & x_{1,d_x} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,d_x} \end{array} \right) \left( \begin{array}{ccc} w_{11} & \cdots & w_{1, d_y} \\ \vdots & \ddots & \vdots \\ w_{d_x, 1} & \cdots & w_{d_x, d_y} \end{array} \right) + \left( \begin{array}{ccc} b_{1} & \cdots & b_{d_y} \\ \vdots & \ddots & \vdots \\ b_{1} & \cdots & b_{d_y} \end{array} \right) \quad (1.3) \\
Y &= XW + B
\end{align}
$$

上記の$(1.3)$式の入力とパラメータの積の順序はデザイン行列と同様に理解しておくと良い。

入力・パラメータ・バイアスによる行列微分

多変数スカラー関数$\displaystyle L(Y) = \sum_{i=1}^{n} f(y_{i1}, \cdots , y_{i,d_y}) = \sum_{i=1}^{n} f(\mathbf{y}_{i})$を定義する。このとき関数$f$を$(1.3)$の表記に基づいて入力やパラメータについて行列微分を行う際の式について以下、確認を行う。

入力行列による行列微分

下記のように入力による行列微分$\displaystyle \frac{\partial L}{\partial X}$を定義する。
$$
\large
\begin{align}
\frac{\partial L}{\partial X} = \left( \begin{array}{ccc} \displaystyle \frac{\partial L}{\partial x_{11}} & \cdots & \displaystyle \frac{\partial L}{\partial x_{1,d_x}} \\ \vdots & \ddots & \vdots \\ \displaystyle \frac{\partial L}{\partial x_{n1}} & \cdots & \displaystyle \frac{\partial L}{\partial x_{n,d_x}} \end{array} \right)
\end{align}
$$

このとき、$\displaystyle \frac{\partial L}{\partial x_{ij}}$は微分の連鎖律に基づいて下記のように計算できる。
$$
\large
\begin{align}
\frac{\partial L}{\partial x_{ij}} &= \frac{\partial L}{\partial y_{i1}} \cdot \frac{\partial y_{i1}}{\partial x_{ij}} + \cdots + \frac{\partial L}{\partial y_{i,d_y}} \cdot \frac{\partial y_{i,d_y}}{\partial x_{ij}} \\
&= \sum_{k=1}^{d_y} \frac{\partial f(\mathbf{y}_{i})}{\partial y_{ik}} \cdot \frac{\partial y_{ik}}{\partial x_{ij}} = \sum_{k=1}^{d_y} \frac{\partial f(\mathbf{y}_{i})}{\partial y_{ik}} w_{jk} \\
&= \left( \begin{array}{ccc} \displaystyle \frac{\partial f(\mathbf{y}_{i})}{\partial y_{i1}} & \cdots & \displaystyle \frac{\partial f(\mathbf{y}_{i})}{\partial y_{i,d_y}} \end{array} \right) \left( \begin{array}{c} w_{j1} \\ \vdots \\ w_{j,d_y} \end{array} \right)
\end{align}
$$

上記より、$\displaystyle \frac{\partial L}{\partial X}$は下記のように表せる。
$$
\large
\begin{align}
\frac{\partial L}{\partial X} &= \left( \begin{array}{ccc} \displaystyle \frac{\partial L}{\partial x_{11}} & \cdots & \displaystyle \frac{\partial L}{\partial x_{1,d_x}} \\ \vdots & \ddots & \vdots \\ \displaystyle \frac{\partial L}{\partial x_{n1}} & \cdots & \displaystyle \frac{\partial L}{\partial x_{n,d_x}} \end{array} \right) \\
&= \left( \begin{array}{ccc} \displaystyle \frac{\partial f(\mathbf{y}_{1})}{\partial y_{11}} & \cdots & \displaystyle \frac{\partial f(\mathbf{y}_{1})}{\partial y_{1,d_y}} \\ \vdots & \ddots & \vdots \\ \displaystyle \frac{\partial f(\mathbf{y}_{n})}{\partial y_{n1}} & \cdots & \displaystyle \frac{\partial f(\mathbf{y}_{n})}{\partial y_{n,d_y}} \end{array} \right) \left( \begin{array}{ccc} w_{11} & \cdots & w_{d_x,1} \\ \vdots & \ddots & \vdots \\ w_{1,d_y} & \cdots & w_{d_x,d_y} \end{array} \right) \\
&= \left( \begin{array}{ccc} \displaystyle \frac{\partial L}{\partial y_{11}} & \cdots & \displaystyle \frac{\partial L}{\partial y_{1,d_y}} \\ \vdots & \ddots & \vdots \\ \displaystyle \frac{\partial L}{\partial y_{n1}} & \cdots & \displaystyle \frac{\partial L}{\partial y_{n,d_y}} \end{array} \right) \left( \begin{array}{ccc} w_{11} & \cdots & w_{1,d_y} \\ \vdots & \ddots & \vdots \\ w_{d_x,1} & \cdots & w_{d_x, d_y} \end{array} \right)^{\mathrm{T}} = \frac{\partial L}{\partial Y} W^{\mathrm{T}}
\end{align}
$$

パラメータ行列による行列微分

下記のようにパラメータ行列による行列微分$\displaystyle \frac{\partial L}{\partial W}$を定義する。
$$
\large
\begin{align}
\frac{\partial L}{\partial W} = \left( \begin{array}{ccc} \displaystyle \frac{\partial L}{\partial w_{11}} & \cdots & \displaystyle \frac{\partial L}{\partial w_{1,d_y}} \\ \vdots & \ddots & \vdots \\ \displaystyle \frac{\partial L}{\partial w_{d_x,1}} & \cdots & \displaystyle \frac{\partial f}{\partial w_{d_x,d_y}} \end{array} \right)
\end{align}
$$

このとき、$\displaystyle \frac{\partial L}{\partial w_{jk}}$は微分の連鎖律に基づいて下記のように計算できる。
$$
\large
\begin{align}
\frac{\partial L}{\partial w_{jk}} &= \sum_{i=1}^{n} \frac{\partial f(\mathbf{y}_{i})}{\partial w_{jk}} \\
&= \sum_{i=1}^{n} \frac{\partial f(\mathbf{y}_{i})}{\partial y_{ik}} \cdot \frac{\partial y_{ik}}{\partial w_{jk}} = \sum_{i=1}^{n} \frac{\partial f(\mathbf{y}_{i})}{\partial y_{ik}} x_{ij} \\
&= \left( \begin{array}{ccc} x_{1j} & \cdots & x_{nj} \end{array} \right) \left( \begin{array}{c} \displaystyle \frac{\partial f(\mathbf{y}_{1})}{\partial y_{1k}} \\ \vdots \\ \displaystyle \frac{\partial f(\mathbf{y}_{n})}{\partial y_{nk}} \end{array} \right)
\end{align}
$$

上記より、$\displaystyle \frac{\partial L}{\partial W}$は下記のように表せる。
$$
\large
\begin{align}
\frac{\partial L}{\partial W} &= \left( \begin{array}{ccc} \displaystyle \frac{\partial L}{\partial w_{11}} & \cdots & \displaystyle \frac{\partial L}{\partial w_{1,d_y}} \\ \vdots & \ddots & \vdots \\ \displaystyle \frac{\partial L}{\partial w_{d_x,1}} & \cdots & \displaystyle \frac{\partial L}{\partial w_{d_x,d_y}} \end{array} \right) \\
&= \left( \begin{array}{ccc} x_{11} & \cdots & x_{n1} \\ \vdots & \ddots & \vdots \\ x_{1,d_x} & \cdots & x_{n,d_x} \end{array} \right) \left( \begin{array}{ccc} \displaystyle \frac{\partial f(\mathbf{y}_{1})}{\partial y_{11}} & \cdots & \displaystyle \frac{\partial f(\mathbf{y}_{1})}{\partial y_{1,d_y}} \\ \vdots & \ddots & \vdots \\ \displaystyle \frac{\partial f(\mathbf{y}_{n})}{\partial y_{n1}} & \cdots & \displaystyle \frac{\partial f(\mathbf{y}_{n})}{\partial y_{n,d_y}} \end{array} \right) \\
&= \left( \begin{array}{ccc} x_{11} & \cdots & x_{1,d_x} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{n,d_x} \end{array} \right)^{\mathrm{T}} \left( \begin{array}{ccc} \displaystyle \frac{\partial L}{\partial y_{11}} & \cdots & \displaystyle \frac{\partial L}{\partial y_{1,d_y}} \\ \vdots & \ddots & \vdots \\ \displaystyle \frac{\partial L}{\partial y_{n1}} & \cdots & \displaystyle \frac{\partial L}{\partial y_{n,d_y}} \end{array} \right) = X^{\mathrm{T}} \frac{\partial L}{\partial Y}
\end{align}
$$

バイアス行列による行列微分

下記のようにパラメータ行列による行列微分$\displaystyle \frac{\partial L}{\partial B}$を定義する。
$$
\large
\begin{align}
\frac{\partial L}{\partial W} = \left( \begin{array}{ccc} \displaystyle \frac{\partial L}{\partial b_{1}} & \cdots & \displaystyle \frac{\partial L}{\partial b_{d_y}} \\ \vdots & \ddots & \vdots \\ \displaystyle \frac{\partial L}{\partial b_{1}} & \cdots & \displaystyle \frac{\partial L}{\partial b_{d_y}} \end{array} \right)
\end{align}
$$

このとき、$i$行目の$\displaystyle \frac{\partial L}{\partial b_{k}}$は微分の連鎖律に基づいて下記のように計算できる。
$$
\large
\begin{align}
\frac{\partial L}{\partial b_{k}} &= \frac{\partial L}{\partial y_{ik}} \cdot \frac{\partial y_{ik}}{\partial b_{k}} \\
&= \frac{\partial L}{\partial y_{ik}} \cdot 1 \\
&= \frac{\partial L}{\partial y_{ik}}
\end{align}
$$

上記より、$\displaystyle \frac{\partial L}{\partial B}$は下記のように表せる。
$$
\large
\begin{align}
\frac{\partial L}{\partial B} &= \left( \begin{array}{ccc} \displaystyle \frac{\partial L}{\partial b_{1}} & \cdots & \displaystyle \frac{\partial L}{\partial b_{d_y}} \\ \vdots & \ddots & \vdots \\ \displaystyle \frac{\partial L}{\partial b_{1}} & \cdots & \displaystyle \frac{\partial L}{\partial b_{d_y}} \end{array} \right) \\
&= \left( \begin{array}{ccc} \displaystyle \frac{\partial L}{\partial y_{11}} & \cdots & \displaystyle \frac{\partial L}{\partial y_{1,d_y}} \\ \vdots & \ddots & \vdots \\ \displaystyle \frac{\partial L}{\partial y_{n1}} & \cdots & \displaystyle \frac{\partial L}{\partial y_{n,d_y}} \end{array} \right) = \frac{\partial L}{\partial Y}
\end{align}
$$

バイアスパラメータはベクトルで定義されるので、勾配法の計算などの際は行方向に上記の和を計算し、パラメータのUpdateを行う。次項のAffineクラスの17行目のself.db = np.sum(dout, axis=0)がバイアス行列の行方向の和の計算に対応する。

実装と使用例

class Affine:
    def __init__(self, W, b):
        self.W = W
        self.b = b
        self.x = None
        self.dW = None
        self.db = None

    def forward(self, x):
        self.x = x
        out = np.dot(x, self.W) + self.b
        return out

    def backward(self, dout):
        dx = np.dot(dout, self.W.T)
        self.dW = np.dot(self.x.T, dout)
        self.db = np.sum(dout, axis=0)
        return dx

上記のクラスは下記のように用いることができる。

# x = np.array([1., 2.]) # error
x = np.array([[1., 2.]])
# x = np.array([[1., 3.], [3., 5.], [5., 7.]]) # batch
W = np.array([[1., 2., 3.], [1., 2., 3.]])
b = np.array([1., 1., 1.])

mlp_layer = Affine(W, b)

# forward
y = mlp_layer.forward(x)

# backward
dout = np.ones(y.shape)
dx = mlp_layer.backward(dout)

print(y)
print(dx)

・実行結果

[[ 4.  7. 10.]]
[[6. 6.]]

入力に与えるxを1Dではなく2Dで与える必要があることに注意が必要である。また、バイアス項のbはNumPyのブロードキャストという機能に基づいて行われていることに着目しておくと良い。

ニューラルネットワーク

誤差関数と誤差逆伝播

ニューラルネットワークでは出力層のソフトマックスを計算したのちに、交差エントロピー(Cross Entropy)誤差関数を用いて学習を行う。中間層やパラメータなどの誤差関数の勾配を誤差逆伝播法に基づいて取得し、勾配法などを用いてパラメータの推定を行う。

実装と使用例

ソフトマックス関数と交差エントロピー誤差関数

ソフトマックス関数は下記のように実装できる。

def softmax(a):    # a is 2D-Array
    c = np.max(a, axis=1)
    exp_a = np.exp(a.T-c)
    sum_exp_a = np.sum(exp_a, axis=0)
    y = (exp_a / sum_exp_a).T
    return y

softmax関数の実装は「ゼロから作るDeepLearning」の$3$章の実装は1Dの配列用であるので、2D用に改変を行った。softmax関数は下記のように実行することができる。

x1 = np.array([[1., 2., 3.]])
x2 = np.array([[1., 2., 3.], [1., 3., 5]])
print(softmax(x1))
print("---")
print(softmax(x2))

・実行結果

[[0.09003057 0.24472847 0.66524096]]
---
[[0.09003057 0.24472847 0.66524096]
 [0.01587624 0.11731043 0.86681333]]

ソフトマックス関数の演算とクロスエントロピー誤差関数の計算などの処理は下記のように実装することができる。

def cross_entropy_error(y,t):
    delta = 1e-7
    return -np.sum(t * np.log(y+delta))

class SoftmaxWithLoss:
    def __init__(self):
        self.loss = None
        self.y = None
        self.t = None

    def forward(self, x, t):
        self.t = t
        self.y = softmax(x)
        self.loss = cross_entropy_error(self.y, self.t)
        return self.loss

    def backward(self, dout=1.):
        batch_size = self.t.shape[0]
        dx = (self.y - self.t) / batch_size
        return dx

上記のSoftmaxWithLossクラスは下記のように使用できる。

x = np.array([[1., 2., 3.]])  # x is 2D-Array
t = np.array([0., 1., 0.])

calc_loss = SoftmaxWithLoss()

# forward
loss = calc_loss.forward(x,t)

# backward
dx = calc_loss.backward()

print(loss)
print(dx)

・実行結果

1.407605555828337
[[ 0.03001019 -0.25175718  0.22174699]]

ニューラルネットワーク

二層のニューラルネットワークを取り扱うにあたって、下記のようにTwoLayerNetクラスを定義する。

from collections import OrderedDict

class TwoLayerNet:
    def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
        self.params = {}
        self.params["W1"] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params["b1"] = np.zeros(hidden_size)
        self.params["W2"] = weight_init_std * np.random.randn(hidden_size, output_size)
        self.params["b2"] = np.zeros(output_size)

        # generate layers
        self.layers = OrderedDict()
        self.layers["Affine1"] = Affine(self.params["W1"], self.params["b1"])
        self.layers["Relu1"] = Relu()
        self.layers["Affine2"] = Affine(self.params["W2"], self.params["b2"])
        self.lastLayer = SoftmaxWithLoss()

    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)
        return x

    def loss(self, x, t):
        y = self.predict(x)
        return self.lastLayer.forward(y, t)

    def calc_gradient(self, x, t):
        # forward
        self.loss(x, t)

        # backward
        dout = self.lastLayer.backward(1.)
        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)

        # output
        grads = {}
        grads["W1"] = self.layers["Affine1"].dW
        grads["b1"] = self.layers["Affine1"].db
        grads["W2"] = self.layers["Affine2"].dW
        grads["b2"] = self.layers["Affine2"].db

        return grads

上記のTwoLayerNetを元に下記のようにニューラルネットワークの学習を行える。

np.random.seed(10)

alpha = 0.1
x = np.array([[1., 1., 1.], [1., 3., 5.]])
t = np.array([[1., 0.], [0., 1.]])

network = TwoLayerNet(input_size=3, hidden_size=5, output_size=2)

for i in range(100):
    grad = network.calc_gradient(x, t)

    for key in ("W1", "b1", "W2", "b2"):
        network.params[key] -= alpha * grad[key]

    if (i+1)%20==0:
        print(softmax(network.predict(x)))

・実行結果

[[0.48243772 0.51756228]
 [0.38127236 0.61872764]]
[[0.58016734 0.41983266]
 [0.1216153  0.8783847 ]]
[[0.75082177 0.24917823]
 [0.05060992 0.94939008]]
[[0.81603073 0.18396927]
 [0.01974114 0.98025886]]
[[0.85941286 0.14058714]
 [0.01267276 0.98732724]]

自動微分・誤差逆伝播の理論と応用①：自動微分の仕組みと基本的な実装方針

投稿日: 2023-10-132024-02-22 投稿者: lib-arts

自動微分(Automatic Differentiation)は大規模なニューラルネットワークであるDeepLearningの学習における誤差逆伝播などに用いられる手法です。当記事では自動微分の仕組みとPythonを用いた基本的な実装方針について取りまとめました。
作成にあたっては「ゼロから作るDeep Learning」の第$5$章「誤差逆伝播法」の内容を主に参照しました。

・用語/公式解説
https://www.hello-statisticians.com/explain-terms

ゼロから作るDeep Learning ―Pythonで学ぶディープラーニングの理論と実装

斎藤康毅

3,740円(06/27 10:20時点)

Amazon

Contents

1 自動微分の仕組み
2 自動微分の基本的な実装方針

自動微分の仕組み

合成関数の微分

$y=f(u), u = g(x)$のとき合成関数$y=f(g(x))$の$x$に関する微分は下記のように得られる。

$$
\large
\begin{align}
\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}
\end{align}
$$

上記の「合成関数の微分の公式」は「連鎖率(chain rule)」のようにいう場合もある。式の詳しい導出は下記で取り扱った。

微分の公式とその導出まとめ　〜積・商・合成関数・逆関数の導関数、三角関数 etc〜

数式微分・数値微分の活用における課題

計算機で微分を取り扱う際の手法は数式微分(Symbolic differenetiation)、数値微分(Numerical differenetiation)、自動微分(Automatic differenetiation)の$3$つに大別できる。

「数式微分」は数学的に導関数を計算し入力値を代入することで計算を行う手法、「数値微分」は$\displaystyle \frac{f(x+\Delta)-f(x)}{\Delta x}$に基づいて近似的に計算を行う手法である。ニューラルネットワークの誤差関数のような複雑な合成関数の微分を取り扱う場合、数式微分では導関数の導出が大変複雑になり、数値微分はパラメータ数が多い場合計算量が多くなる。このような解決にあたって用いられるのが自動微分である。

たとえば下記の演習ではロジスティック回帰の勾配の計算について取り扱ったが、途中式がなかなか複雑である。ニューラルネットワークはロジスティック回帰を複雑化したものであると見なすこともできるので、数式微分を用いると計算がさらに複雑になる。

最尤法とGLM(Generalized Linear Model)｜問題演習で理解する統計学【7】

自動微分の仕組み

前項の課題の解決にあたってニューラルネットワークの学習にあたって実用的に用いられるのが自動微分(Automatic Differentiation)である。自動微分は数式微分と同様に「合成関数の微分の公式」を用いるが、「中間変数の値を代入してから積を計算すること」によりスムーズに計算を行うことができる。

たとえば$f(x) = e^{x^{2}}$の$f'(1)$の値を計算するにあたって、数式微分と自動微分はそれぞれ下記のように値を計算する。

・数式微分
$u=x^{2}, y=e^{u}$とおき、下記のように$f'(x)$を計算する。
$$
\large
\begin{align}
f'(x) &= \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} \\
&= e^{u} \cdot 2x \\
&= 2x e^{x^{2}}
\end{align}
$$

上記より$f'(1) = 2e^{1} = 2e$の値を計算する。

・自動微分
$u=x^{2}, y=e^{u}$とおき、下記のように$f'(x)$を計算する。
$$
\large
\begin{align}
f'(x) &= \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} \\
&= e^{u} \cdot 2x
\end{align}
$$

$x=1$のとき$u=1$であるので、$f'(1) = e^{1} \cdot 2 = 2e$を計算する。

「数式微分」と「自動微分」は基本的には同じ公式に基づいて同様な計算を行うが、「数式微分」が「すべて$x$の式で表さなければいけない」一方で「自動微分」は「中間変数の$u$が得られれば計算が可能」である。

よって、中間変数の$u$の値を効率よく保持しておけるならば「自動微分」に基づいてシンプルに結果を計算することができる。

自動微分の基本的な実装方針

実装方針

基本的には「演算」をクラス化し、順伝播(forward propagation)と逆伝播(back propagation)を同じクラスで取り扱うことで逆伝播の計算の効率化を実現する。

乗算のクラス化

全容

乗算は下記のようにクラス化することができる。

class MulLayer:
    def __init__(self):
        self.x = None
        self.y = None

    def forward(self, x, y):
        self.x = x
        self.y = y
        out = x*y
        return out

    def backward(self, dout):
        dx = dout * self.y
        dy = dout * self.x
        return dx, dy

上記のクラスは下記のように用いることができる。

apple = 100.
apple_num = 2.
tax = 1.1

mul_apple_layer = MulLayer()
mul_tax_layer = MulLayer()

# forward
apple_price = mul_apple_layer.forward(apple, apple_num)
price = mul_tax_layer.forward(apple_price, tax)

print(price)

# backward
dprice = 1.
dapple_price, dtax = mul_tax_layer.backward(dprice)
dapple, dapple_num = mul_apple_layer.backward(dapple_price)

print(dapple_price, dtax)
print(dapple, dapple_num)

・実行結果

220.00000000000003
1.1 200.0
2.2 110.00000000000001

順伝播による計算

xとyの積の計算は下記のようにforwardメソッドで実装される。

class MulLayer:
    〜省略〜

    def forward(self, x, y):
        self.x = x
        self.y = y
        out = x*y
        return out

    〜省略〜

out = x*yでoutを計算しreturnで出力するのと同時にself.x = xとself.y = yによって入力変数をインスタンスに保持させることを抑えておくと良い。

逆伝播による勾配の取得

xとyの積の計算は下記のようにforwardメソッドで実装される。

class MulLayer:
    〜省略〜

    def backward(self, dout):
        dx = dout * self.y
        dy = dout * self.x
        return dx, dy

xyを$x$について偏微分すると$y$、$y$について偏微分すると$x$、が得られることからdx = dout * self.yとdy = dout * self.xの処理を理解すると良い。

加算のクラス化

全容

class AddLayer:
    def __init__(self):
        pass

    def forward(self, x, y):
        out = x+y
        return out

    def backward(self, dout):
        dx = dout*1
        dy = dout*1
        return dx, dy

上記のクラスは下記のように用いることができる。

apple = 100.
apple_num = 2.
orange = 150.
orange_num = 3.
tax = 1.1

mul_apple_layer = MulLayer()
mul_orange_layer = MulLayer()
add_apple_orange_layer = AddLayer()
mul_tax_layer = MulLayer()

# forward
apple_price = mul_apple_layer.forward(apple, apple_num)
orange_price = mul_orange_layer.forward(orange, orange_num)
all_price = add_apple_orange_layer.forward(apple_price, orange_price)
price = mul_tax_layer.forward(all_price, tax)

print(all_price)
print(price)
print("===")

# backward
dprice = 1.
dall_price, dtax = mul_tax_layer.backward(dprice)
dapple_price, dorange_price = add_apple_orange_layer.backward(dall_price)
dapple, dapple_num = mul_apple_layer.backward(dapple_price)
dorange, dorange_num = mul_orange_layer.backward(dorange_price)

print(dall_price, dtax)
print(dapple, dapple_num)
print(dorange, dorange_num)

・実行結果

650.0
715.0000000000001
===
1.1 650.0
2.2 110.00000000000001
3.3000000000000003 165.0

順伝播による計算

class AddLayer:
    〜省略〜

    def forward(self, x, y):
        out = x+y
        return out

    〜省略〜

$x+y$を$x$について偏微分すると$1$、$y$について偏微分すると$1$が得られることから$x$と$y$の値を保持する必要がないことに着目しておくと良い。このように中間変数の値を保持するかどうかは逆伝播の際に用いるかどうかの観点から確認しておくと良い。

逆伝播による勾配の取得

class AddLayer:
    〜省略〜

    def backward(self, dout):
        dx = dout*1
        dy = dout*1
        return dx, dy

活性化関数のクラス化

ReLU

class Relu:
    def __init__(self):
        self.mask = None

    def forward(self, x):
        self.mask = (x <= 0)
        out = x.copy()
        out[self.mask] = 0
        return out

    def backward(self, dout):
        dout[self.mask] = 0
        dx = dout
        return dx

上記のクラスは下記のように使用することができる。

import numpy as np

calc_Relu = Relu()

# forward
x = np.arange(-1., 1., 0.1)
y = calc_Relu.forward(x)

print(y)

# backward
dout = np.ones(20)
dx = calc_Relu.backward(dout)

print(dx)

・実行結果

[0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7
 0.8 0.9]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

Sigmoid関数

class Sigmoid:
    def __init__(self):
        self.out = None

    def forward(self, x):
        out = 1./(1.+np.exp(-x))
        self.out = out
        return out

    def backward(self, dout):
        dx = dout * (1.-self.out) * self.out
        return dx

上記のクラスは下記のように使用することができる。

import numpy as np

calc_Sigmoid = Sigmoid()

# forward
x = np.arange(-1., 1., 0.1)
y = calc_Sigmoid.forward(x)

print(y)

# backward
dout = np.ones(20)
dx = calc_Sigmoid.backward(dout)

print(dx)

・実行結果

[0.26894142 0.2890505  0.31002552 0.33181223 0.35434369 0.37754067
 0.40131234 0.42555748 0.450166   0.47502081 0.5        0.52497919
 0.549834   0.57444252 0.59868766 0.62245933 0.64565631 0.66818777
 0.68997448 0.7109495 ]
[0.19661193 0.20550031 0.2139097  0.22171287 0.22878424 0.23500371
 0.24026075 0.24445831 0.24751657 0.24937604 0.25       0.24937604
 0.24751657 0.24445831 0.24026075 0.23500371 0.22878424 0.22171287
 0.2139097  0.20550031]

計算例：ロジスティック回帰のパラメータ推定

以下、ロジスティック回帰のパラメータ推定を題材に自動微分の実装について確認を行う。まず、下記に基づいてサンプリングを行う。

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats

sns.set_theme()

np.random.seed(1)

n = 500
t_a = 1.2 
t_b = 0.5

x = np.random.rand(n)*10.-5.
t_p = 1./(1.+np.e**(-(t_a*x+t_b)))
t = stats.binom.rvs(size=x.shape[0],p=t_p,n=1)

x_ = np.arange(-5.0, 5.01, 0.01)

plt.scatter(x,t,color="lightgreen")
plt.plot(x_, 1./(1.+np.e**(-(t_a*x_+t_b))), color="green")
plt.show()

・実行結果

上記に対し、下記を実行することで自動微分と勾配法に基づいてパラメータ推定を行うことができる。

alpha = 0.01
a, b = 0.5, 0.

mul_coef_layer = MulLayer()
add_bias_layer = AddLayer()
calc_Sigmoid = Sigmoid()

for i in range(100):
    # forward
    u1 = mul_coef_layer.forward(x, a)
    u2 = add_bias_layer.forward(u1, b)
    y = calc_Sigmoid.forward(u2)
    #loss = t*np.log(y) + (1-t)*np.log(1-y)

    # backward
    #dloss = 1.
    dy = t/y - (1-t)/(1-y)
    du2 = calc_Sigmoid.backward(dy)
    du1, db = add_bias_layer.backward(du2)
    dx, da = mul_coef_layer.backward(du1)
    
    a += alpha*np.sum(da)
    b += alpha*np.sum(db)
    if (i+1)%10==0:
        print(a,b)

・実行結果

1.110878642205875 0.6047959437112932
1.1110570172855048 0.6055201115906093
1.1110586396377382 0.6055267041766271
1.1110586544122127 0.6055267642136816
1.1110586545467607 0.6055267647604263
1.1110586545479861 0.6055267647654055
1.111058654547997 0.6055267647654506
1.1110586545479972 0.6055267647654511
1.1110586545479972 0.6055267647654511
1.1110586545479972 0.6055267647654511

サンプリングに用いたパラメータの値が$a=1.2, b=0.5$であるので、上記は概ね適切な結果であることが確認できる。

注意：当記事ではクロスエントロピー誤差関数の実装を行なっていないので、dy = t/y - (1-t)/(1-y)は出力層の微分の式を用いた。

微分方程式(differential equation)の基本事項まとめ⑤｜線形微分方程式

投稿日: 2023-10-092023-10-11 投稿者: lib-arts

微分方程式(differential equation)は多くの応用先がありますが、統計学を学ぶにあたってもハザード関数から確率密度関数を導出する際などに用いられます。当記事では線形微分方程式の基本的な解法について概要と計算例を取りまとめました。
作成にあたっては「チャート式シリーズ大学教養微分積分」の第$9$章「微分方程式」を主に参考にしました。

・数学まとめ
https://www.hello-statisticians.com/math_basic

チャート式シリーズ大学教養微分積分 (チャート式・シリーズ)

3,080円(06/26 22:32時点)

Amazon

Contents

1 線形微分方程式
2 同次形の微分方程式の計算例

線形微分方程式

線形微分方程式の定義

$x$の関数$q(x), p_0(x), \cdots , p_{n-1}(x)$を用いて下記のような形式で表される微分方程式を未知関数$y=y(x)$に関する$n$階の線形微分方程式という。
$$
\large
\begin{align}
y^{(n)} + p_{n-1}(x) y^{(n-1)} + \cdots + p_1(x) y’ + p_0(x) y + q(x) = 0 \quad (1.1)
\end{align}
$$

また、$(1.1)$式の線形微分方程式について$q(x)=0$が成立する場合「同次」といい、$q(x) \neq 0$のとき「非同次」であるという。

微分作用素

$$
\large
\begin{align}
D = \frac{d}{dx}
\end{align}
$$

線形微分方程式は上記のような微分作用素$D$を用いると見やすく表すことができる。

定数変化法

同次の微分方程式を解いたのちに解の定数$C$を$C(x)$に置き換えて非同次の微分方程式に代入し、特殊解を求める方法を定数変化法という。

多項式の分解と微分方程式

定数係数同次線形微分方程式の解

同次形の微分方程式の計算例

以下、「チャート式シリーズ大学教養微分積分」の例題の確認を行う。

基本例題$168$

・$[1]$
$$
\large
\begin{align}
y’ + \frac{1}{x}y – x^2 = 0 \quad (2.1.1)
\end{align}
$$

以下、定数変化法を用いて解く。同次形の微分方程式$\displaystyle y’ + \frac{1}{x}y = 0$の解は下記のように得られる。
$$
\large
\begin{align}
y’ + \frac{1}{x}y &= 0 \\
\frac{dy}{dx} &= -\frac{y}{x} \\
\frac{dy}{y} &= -\frac{dx}{x} \\
\log{|y|} &= -\log{|x|} + C \\
\log{|xy|} &= C \\
y &= \frac{c_1}{x} \quad (2.1.2)
\end{align}
$$

上記に対し$c_1=p(x)$とおき、$(2.1.1)$式に代入すると下記が得られる。
$$
\large
\begin{align}
\left( \frac{p(x)}{x} \right)’ + \frac{p(x)}{x^2} – x^2 &= 0 \\
\frac{p'(x)x – \cancel{p(x)}}{x^2} + \cancel{\frac{p(x)}{x^2}} – x^2 &= 0 \\
\frac{p'(x)}{x} – x^2 &= 0 \\
\frac{dp}{dx} \cdot \frac{1}{x} &= x^{2} \\
dp &= x^{3} dx \\
p &= \frac{1}{4}x^{4} + C
\end{align}
$$

上記に基づいて$\displaystyle c_1 = p(x) = \frac{1}{4}x^{4}+C$を$(2.1.2)$式に代入すると下記が得られる。
$$
\large
\begin{align}
y &= \frac{c_1}{x} \\
&= \frac{1}{4}x^{3} + \frac{C}{4}
\end{align}
$$

・$[2]$
$$
\large
\begin{align}
y’ + y \cos{x} – \sin{x} \cos{x} = 0 \quad (2.2.1)
\end{align}
$$

以下、定数変化法を用いて解く。同次形の微分方程式$\displaystyle y’ + y \cos{x} = 0$の解は下記のように得られる。
$$
\large
\begin{align}
y’ + y \cos{x} &= 0 \\
\frac{dy}{dx} &= -y \cos{x} \\
\frac{dy}{y} &= – \cos{x} dx \\
\log{|y|} &= -\sin{x} + C \\
|y| &= e^{-\sin{x}+C} \\
y &= c_1 e^{-\sin{x}} \quad (2.2.2)
\end{align}
$$

上記に対し$c_1=p(x)$とおき、$(2.2.1)$式に代入すると下記が得られる。
$$
\large
\begin{align}
y’ + y \cos{x} – \sin{x} \cos{x} &= 0 \quad (2.2.1) \\
\left( p(x) e^{-\sin{x}} \right)’ + p(x) e^{-\sin{x}} \cos{x} &= \sin{x} \cos{x} \\
p'(x) e^{-\sin{x}} + \cancel{p(x) e^{-\sin{x}} \cdot (-\cos{x})} + \cancel{p(x) e^{-\sin{x}} \cos{x}} &= \sin{x} \cos{x} \\
\frac{dp}{dx} e^{-\sin{x}} &= \sin{x} \cos{x} \\
\int dp &= \int \sin{x} \cos{x} e^{\sin{x}} dx \quad (2.2.3)
\end{align}
$$

上記に対し、$t = \sin{x}$とおくと下記が成立する。
$$
\large
\begin{align}
\frac{dt}{dx} &= \cos{x} \\
\cos{dx} dx &= dt
\end{align}
$$

よって$(2.2.3)$式は下記のように変形できる。
$$
\large
\begin{align}
\int dp &= \int \sin{x} \cos{x} e^{\sin{x}} dx \quad (2.2.3) \\
\int dp &= \int t e^{t} dt \\
p &= t e^{t} – \int (t)’ e^{t} dt \\
p &= t e^{t} – \int e^{t} dt \\
&= t e^{t} – e^{t} + C \\
&= e^{\sin{x}}(\sin{x} – 1) + C
\end{align}
$$

上記に基づいて$\displaystyle c_1 = p(x) = e^{\sin{x}}(\sin{x} – 1) + C$を$(2.2.2)$式に代入することで下記が得られる。
$$
\large
\begin{align}
y &= e^{-\sin{x}} (e^{\sin{x}}(\sin{x} – 1) + C) \\
&= \sin{x} + C e^{-\sin{x}} – 1
\end{align}
$$

・$[3]$
$$
\large
\begin{align}
y’ + 2y – 3 e^{4x} = 0 \quad (2.3.1)
\end{align}
$$

以下、定数変化法を用いて解く。同次形の微分方程式$\displaystyle y’ + 2y = 0$の解は下記のように得られる。
$$
\large
\begin{align}
y’ + 2y &= 0 \\
\frac{dy}{dx} &= -2y \\
\frac{dy}{y} &= -2 dx \\
\log{|y|} &= -2x + C \\
|y| &= e^{-2x+C} \\
y &= \pm e^{C} e^{-2x} \\
y &= c_1 e^{-2x} \quad (2.3.2)
\end{align}
$$

上記に対し$c_1=p(x)$とおき、$(2.3.1)$式に代入すると下記が得られる。
$$
\large
\begin{align}
y’ + 2y – 3 e^{4x} &= 0 \quad (2.3.1) \\
\left( p(x) e^{-2x} \right)’ + 2p(x) e^{-2x} &= 3 e^{4x} \\
p'(x)e^{-2x} – \cancel{2p(x) e^{-2x}} + \cancel{2p(x) e^{-2x}} &= 3 e^{4x} \\
\frac{dp}{dx} &= 3e^{6x} \\
\int dp &= \int 3e^{6x} dx \\
p &= \frac{1}{2} e^{6x} + C
\end{align}
$$

上記に基づいて$\displaystyle c_1 = p(x) = \frac{1}{2} e^{6x} + C$を$(2.3.2)$式に代入すると下記が得られる。
$$
\large
\begin{align}
y &= c_1 e^{-2x} \\
&= \left( \frac{1}{2} e^{6x} + C \right) e^{-2x} \\
&= \frac{1}{2} e^{4x} + C e^{-2x}
\end{align}
$$

基本例題$169$

基本例題$171$

基本例題$172$

基本例題$173$

重要例題$103$

$$
\large
\begin{align}
y’ + y \tan{x} = \frac{1}{\cos{x}}
\end{align}
$$

以下、定数変化法を用いて解く。同次形の微分方程式$y’ + y \tan{x} = 0$の解は下記のように得られる。

$[1] \,$ $y=0$のとき
定数関数$y=0$は$y’ + y \tan{x} = 0$の解である。

$[2] \,$ $y \neq 0$のとき
$y’ + y \tan{x} = 0$は下記のように変形できる。
$$
\large
\begin{align}
y’ + y \tan{x} &= 0 \\
\frac{dy}{dx} &= -y \tan{x} \\
\frac{dy}{y} &= -\tan{x} dx \\
\int \frac{1}{y} dy &= \int \frac{(\cos{x})’}{\cos{x}} dx \\
\log{|y|} &= \log{|\cos{x}|} + C \\
|y| &= e^{C} |\cos{x}| \\
y &= \pm e^{C} \cos{x} \\
&= c_1 \cos{x}
\end{align}
$$

$c_1=0$のとき$y=0$であるから、$[1]$の結果も$y=c_1 \cos{x}$で表すことができる。ここで$c_1=p(x)$とおき、元の式に代入すると下記が得られる。
$$
\large
\begin{align}
y’ + y \tan{x} &= \frac{1}{\cos{x}} \\
\left( p(x) \cos{x} \right)’ + p(x) \tan{x} &= \frac{1}{\cos{x}} \\
p'(x) \cos{x} – p(x) \sin{x} + p(x) \cancel{\cos{x}} \cdot \frac{\sin{x}}{\cancel{\cos{x}}} &= \frac{1}{\cos{x}} \\
p'(x) \cos{x} – \cancel{p(x) \sin{x}} + \cancel{p(x) \sin{x}} &= \frac{1}{\cos{x}} \\
\frac{dp}{dx} \cos{x} &= \frac{1}{\cos{x}} \\
dp &= \frac{1}{\cos^{2}{x}} dx \\
p &= \tan{x} + C
\end{align}
$$

上記に基づいて$\displaystyle c_1 = p(x) = \tan{x} + C$を$y = c_1 \cos{x}$式に代入すると下記が得られる。
$$
\large
\begin{align}
y &= c_1 \cos{x} \\
&= (\tan{x} + C) \cos{x} \\
&= \sin{x} + C \cos{x}
\end{align}
$$

最大内積探索の問題設定

最大内積探索の立式

最近傍探索と一致する場合

MCSSと一致する場合

クラスタリングを用いた最大内積探索の効率化

k-means

階層k-means

ウォリスの公式の導出

ウォリス積分の公式

ウォリスの公式の導出

円周率の近似

Sparse Attention

Sparse Attentionの概要

Sparse Attentionの分類

Position-based Sparse Attention

基本パターン

基本パターンの解釈

Star-Transformer

Longformer

ETC

BigBird

Content-based Sparse Attention

Reformer

locality-sensitive hashing

LSH Attention

Routing Transformer

Routing Attention

Centroid vectorsの計算

まとめ

参考

前提の確認

Transformerの仕組み

Transformerの事前学習

Transformerの構成の分類

概要

Encoder-Decoder

Encoder only

Decoder only

具体的な研究の分類

Dot Product Attention

ソフトマックス関数

内積の計算

Dot Product Attention

Transformer

Transformerレイヤーの計算

Affineクラス

Transformer_Layerクラス

$2$層Transformer

出力層

$2$層Transformer

実行例

MessagePassing

NumPy.repeat

Attentionの実装

Attentionの基本処理

順伝播の実行例

逆伝播の実行例

MessagePassingの実装

隣接行列：グラフの定義

MessagePassingクラスの実装

MessagePassingクラスの実行例

GNNの実装

GNNのレイヤーの実装

クラスの実装

使用例

$2$層GNNの実装

関数・クラスの実装

実行例

RNNモジュールの実装

概要

$\tanh{(x)}$の定義と微分

実装と使用例

順伝播の計算

RNN全体の実装

実装

使用例

Affine変換の自動微分

Affine変換の式定義と転置

入力・パラメータ・バイアスによる行列微分

入力行列による行列微分

`Affine`クラス

`Transformer_Layer`クラス

`MessagePassing`クラスの実装

`MessagePassing`クラスの実行例