[Coursera Stanford Machine Learning (week 5)] Neural Networks - Cost Function and Backpropagation

sunnyshiny 2023. 2. 15. 16:13

These are my notes on Professor Andrew Ng's Machine Learning course on Coursera.

Cost function

Cost function for regularized logistic regression:

$$ J(\theta) = -\frac{1}{m}\sum_{i=1}^m \left[y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2 $$

When a neural network does multiclass classification over K classes, $h_{\Theta}(x) \in \mathbb{R}^K$, i.e. the network has K outputs. The cost therefore sums the logistic loss over all K output units.

$$\begin{gather*} J(\Theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2\end{gather*}$$

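Each layer has one weight for every pair of a unit in the current layer and a unit in the next layer. The regularization term squares and sums every entry of every weight matrix, so the multiclass cost is regularized in the same way as binary logistic regression.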

Note:

the double sum simply adds up the logistic regression costs calculated for each cell in the output layer
the triple sum simply adds up the squares of all the individual Θs in the entire network.
the i in the triple sum does not refer to training example i
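The double sum and the triple sum can be written out in a few lines of NumPy. This is only a minimal sketch: the function name `nn_cost`, the argument layout, and the convention that bias weights sit in column 0 of each matrix (and are left out of the penalty, matching sums that start at 1) are assumptions made for illustration, not something fixed by the notes above.

```python
import numpy as np

def nn_cost(H, Y, thetas, lam):
    """Regularized cross-entropy cost for a multiclass network (illustrative sketch).

    H      : (m, K) network outputs h_Theta(x) for m examples
    Y      : (m, K) one-hot labels
    thetas : list of weight matrices, bias weights assumed to be in column 0
    lam    : regularization strength lambda
    """
    m = Y.shape[0]
    # Double sum: logistic cost over every example i and every output unit k
    data_term = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Triple sum: squares of all non-bias weights in the whole network
    reg_term = lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in thetas)
    return data_term + reg_term

# Tiny made-up example: 2 examples, K = 2 classes, one weight matrix
H = np.array([[0.9, 0.1], [0.2, 0.8]])
Y = np.array([[1, 0], [0, 1]])
thetas = [np.array([[0.5, 1.0, -1.0]])]

print(nn_cost(H, Y, thetas, lam=0.0))
print(nn_cost(H, Y, thetas, lam=1.0))
```

With `lam=0.0` only the double sum contributes; raising `lam` adds the penalty on the two non-bias weights.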

Backpropagation

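Backpropagation is the algorithm a neural network uses to compute the gradients needed to minimize the cost function $J(\theta)$.

The goal is to find the optimal parameters, $\min_{\theta}J(\theta)$. To do that, we need to compute $\frac{\partial}{\partial \theta}J(\theta)$.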

Derivative of the sigmoid

$$ y = \frac{1}{1+e^{-x}} = (1+e^{-x})^{-1} \\ \Rightarrow \\ y' = -(1+e^{-x})^{-2}(-e^{-x})\\=\frac{e^{-x}}{(1+e^{-x})^2} \\=\frac{1+e^{-x}-1}{(1+e^{-x})^2}\\ = \frac{1}{1+e^{-x}}-\frac{1}{(1+e^{-x})^2}\\=y-y^2\\=y(1-y) $$
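A quick numerical sanity check of $y' = y(1-y)$: the sketch below compares the closed form against a central difference. The helper names `sigmoid` and `sigmoid_grad` and the step size are assumptions chosen for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Closed form from the derivation: y' = y (1 - y)
    y = sigmoid(x)
    return y * (1.0 - y)

x = np.linspace(-5.0, 5.0, 11)
h = 1e-5
# Central-difference approximation of the derivative
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

# Largest gap between the two; it should be tiny
print(np.max(np.abs(numeric - sigmoid_grad(x))))
```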

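Derivative of the affine layer

For an affine layer $Y = XW + b$:

$$ \frac{\partial L}{\partial X}=\frac{\partial L}{\partial Y}W^T\\\frac{\partial L}{\partial W}=X^T\frac{\partial L}{\partial Y} $$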
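The two affine formulas translate directly into matrix products. The sketch below assumes a layer of the form Y = XW (bias left out for brevity) and uses the toy loss L = sum(Y), so that the upstream gradient is a matrix of ones; the function names are made up for this example.

```python
import numpy as np

def affine_forward(X, W):
    return X @ W

def affine_backward(dY, X, W):
    # dL/dX = dL/dY W^T,  dL/dW = X^T dL/dY
    dX = dY @ W.T
    dW = X.T @ dY
    return dX, dW

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))   # 3 examples, 4 input features
W = rng.normal(size=(4, 2))   # 4 inputs -> 2 outputs
dY = np.ones((3, 2))          # gradient of L = sum(Y) with respect to Y

dX, dW = affine_backward(dY, X, W)
print(dX.shape, dW.shape)  # (3, 4) (4, 2)
```

Note that each gradient has the same shape as the quantity it differentiates with respect to, which is a handy way to remember where the transpose goes.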

Forward propagation multiplies each layer's input by its weights and then applies the activation function.

$$ a^{(1)} = x\\z^{(2)}=\theta^{(1)}a^{(1)} \\ a^{(2)} = g(z^{(2)})\\z^{(3)}=\theta^{(2)}a^{(2)} \\a^{(3)} = g(z^{(3)})\\z^{(4)}=\theta^{(3)}a^{(3)} \\ a^{(4)} = g(z^{(4)})=h_{\theta}(x) $$

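Backpropagation walks through the forward computation in reverse, applying the derivative of each step via the chain rule. It starts at the loss itself, where $\frac{\partial L}{\partial L} = 1$. The lecture example has 4 layers, so the derivatives are computed in reverse order starting from layer 4.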

$$ \delta^{(4)} = a^{(4)}-y\\\delta^{(3)} =(\theta^{(3)})^T\delta^{(4)}.*g'(z^{(3)})\\\delta^{(2)} =(\theta^{(2)})^T\delta^{(3)}.*g'(z^{(2)}) $$

The first layer is the input layer, so no error term is computed for it. In general:

$$ \delta^{(l)} =(\theta^{(l)})^T\delta^{(l+1)}.*g'(z^{(l)})\\ =(\theta^{(l)})^T\delta^{(l+1)}.*a^{(l)}.*(1-a^{(l)}) $$
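The delta recursion can be sketched for a single training example as follows. To stay close to the equations above, bias units are omitted. Two things here are assumptions not spelled out in these notes: the function names, and the step that turns the deltas into weight gradients, $\frac{\partial J}{\partial \theta^{(l)}} = \delta^{(l+1)} (a^{(l)})^T$ for the unregularized cross-entropy cost.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(thetas, x):
    """Return [a^(1), ..., a^(L)] for one example x (bias units omitted)."""
    activations = [x]
    for theta in thetas:
        activations.append(sigmoid(theta @ activations[-1]))
    return activations

def backprop(thetas, x, y):
    """Return one gradient matrix per theta, for a single example."""
    a = forward(thetas, x)
    delta = a[-1] - y                          # delta^(L) = a^(L) - y
    grads = [None] * len(thetas)
    for l in range(len(thetas) - 1, -1, -1):
        grads[l] = np.outer(delta, a[l])       # delta^(l+1) (a^(l))^T
        if l > 0:                              # no delta for the input layer
            # delta^(l) = (theta^(l))^T delta^(l+1) .* a^(l) .* (1 - a^(l))
            delta = (thetas[l].T @ delta) * a[l] * (1.0 - a[l])
    return grads

# Made-up 4-layer network with layer sizes 3 -> 5 -> 4 -> 2
rng = np.random.default_rng(0)
sizes = [3, 5, 4, 2]
thetas = [rng.uniform(-0.5, 0.5, size=(sizes[i + 1], sizes[i])) for i in range(3)]
x = rng.normal(size=3)
y = np.array([1.0, 0.0])

grads = backprop(thetas, x, y)
print([g.shape for g in grads])  # [(5, 3), (4, 5), (2, 4)]
```

In the code the lists are 0-indexed, so `thetas[0]` plays the role of $\theta^{(1)}$ and `a[0]` the role of $a^{(1)}$.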

Gradient Checking

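Gradient checking compares the gradient obtained from backpropagation with a numerical gradient to confirm that the two are close. During training the numerical gradient is switched off and only backpropagation is used, because the numerical gradient has to re-evaluate the cost twice per parameter on every iteration, which is slow.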

$$\frac{df(x)}{dx}= \lim_{h \to 0}\frac{f(x+h)-f(x-h)}{2h}$$

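# Code excerpted from "Deep Learning from Scratch" (밑바닥부터 시작하는 딥러닝)

import numpy as np

def numerical_gradient(f, x):
    """Central-difference gradient of f at x (x: 1-D float array)."""
    h = 1e-4
    grad = np.zeros_like(x)  # array with the same shape as x

    for idx in range(x.size):
        tmp_val = x[idx]

        # compute f(x+h)
        x[idx] = tmp_val + h
        fxh1 = f(x)

        # compute f(x-h)
        x[idx] = tmp_val - h
        fxh2 = f(x)

        grad[idx] = (fxh1 - fxh2) / (2 * h)
        x[idx] = tmp_val  # restore the original value

    return grad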
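As a usage sketch, the numerical gradient can be compared against a gradient that is known in closed form. The block restates the function so it runs on its own. The toy cost (a sum of softplus terms, whose exact gradient is the sigmoid) and the relative-difference measure are choices made for this example, not part of the course material.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-4):
    """Central-difference gradient, same idea as the function above."""
    grad = np.zeros_like(x)
    for idx in range(x.size):
        tmp_val = x[idx]
        x[idx] = tmp_val + h
        fxh1 = f(x)
        x[idx] = tmp_val - h
        fxh2 = f(x)
        grad[idx] = (fxh1 - fxh2) / (2 * h)
        x[idx] = tmp_val  # restore the original value
    return grad

def f(x):
    # Toy cost: sum of softplus terms log(1 + e^x)
    return np.sum(np.log1p(np.exp(x)))

def analytic_gradient(x):
    # Exact gradient of the toy cost: the sigmoid of each entry
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
num = numerical_gradient(f, x)
ana = analytic_gradient(x)

# Relative difference; a very small value suggests the two gradients agree
rel_diff = np.linalg.norm(num - ana) / (np.linalg.norm(num) + np.linalg.norm(ana))
print(rel_diff)
```

In a real check, `analytic_gradient` would be replaced by the backpropagation gradient with the parameters unrolled into one flat vector.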

Random Initialization

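If every weight is initialized to 0, then $x * 0 = 0$ whatever the input is, so all units in a layer compute the same value, and backpropagation then updates all of their weights by the same amount. The units never become different from one another. To break this symmetry, the initial values are set randomly.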
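One possible way to do this is to draw each weight uniformly from a small interval around zero. In the sketch below, the function name, the interval $[-\epsilon, \epsilon]$, the default $\epsilon = 0.12$, and the extra bias column are all assumptions for illustration; the notes above only say that the values should be random.

```python
import numpy as np

def rand_initialize_weights(l_in, l_out, epsilon_init=0.12, rng=None):
    """Weights for a layer with l_in inputs (+1 bias column) and l_out outputs."""
    rng = np.random.default_rng() if rng is None else rng
    # Uniform values in [-epsilon_init, epsilon_init) break the symmetry between units
    return rng.uniform(-epsilon_init, epsilon_init, size=(l_out, l_in + 1))

theta1 = rand_initialize_weights(3, 5, rng=np.random.default_rng(0))
print(theta1.shape)  # (5, 4)
```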

Training a neural network

Weight initialization [Randomly initialize the weights]
Forward propagation [Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$]
Cost function [Implement the cost function]
Backpropagation [Implement backpropagation to compute partial derivatives]
Gradient checking [Use gradient checking to confirm that your backpropagation works. Then disable gradient checking]
Minimize $J(\theta)$ [Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta]
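The six steps above can be strung together in one short script. This is a rough sketch under several assumptions that are not in the notes: a 3-layer network with bias units, XOR as a toy dataset, plain batch gradient descent, and hand-picked values for the learning rate `alpha`, `lam`, and the iteration count. Gradient checking is run once on `theta2` only, to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(A):
    # Prepend a column of ones (the bias unit) to every example
    return np.hstack([np.ones((A.shape[0], 1)), A])

def cost_and_grads(theta1, theta2, X, Y, lam):
    m = X.shape[0]
    # Step 2: forward propagation
    a1 = add_bias(X)
    a2 = add_bias(sigmoid(a1 @ theta1.T))
    a3 = sigmoid(a2 @ theta2.T)
    # Step 3: regularized cost (bias columns are not penalized)
    J = -np.sum(Y * np.log(a3) + (1 - Y) * np.log(1 - a3)) / m
    J += lam / (2 * m) * (np.sum(theta1[:, 1:] ** 2) + np.sum(theta2[:, 1:] ** 2))
    # Step 4: backpropagation
    d3 = a3 - Y
    d2 = (d3 @ theta2)[:, 1:] * a2[:, 1:] * (1 - a2[:, 1:])
    g1 = d2.T @ a1 / m
    g2 = d3.T @ a2 / m
    g1[:, 1:] += lam / m * theta1[:, 1:]
    g2[:, 1:] += lam / m * theta2[:, 1:]
    return J, g1, g2

# Toy data: XOR, 2 inputs -> 4 hidden units -> 1 output
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
alpha, lam = 1.0, 0.0

# Step 1: random initialization
rng = np.random.default_rng(0)
theta1 = rng.uniform(-0.12, 0.12, size=(4, 3))
theta2 = rng.uniform(-0.12, 0.12, size=(1, 5))

# Step 5: gradient checking (theta2 only), done once before training
eps = 1e-4
num_g2 = np.zeros_like(theta2)
for idx in np.ndindex(theta2.shape):
    old = theta2[idx]
    theta2[idx] = old + eps
    J_plus = cost_and_grads(theta1, theta2, X, Y, lam)[0]
    theta2[idx] = old - eps
    J_minus = cost_and_grads(theta1, theta2, X, Y, lam)[0]
    theta2[idx] = old
    num_g2[idx] = (J_plus - J_minus) / (2 * eps)
max_diff = np.max(np.abs(num_g2 - cost_and_grads(theta1, theta2, X, Y, lam)[2]))
print("gradient check, max difference:", max_diff)

# Step 6: minimize J with batch gradient descent
costs = []
for _ in range(5000):
    J, g1, g2 = cost_and_grads(theta1, theta2, X, Y, lam)
    costs.append(J)
    theta1 -= alpha * g1
    theta2 -= alpha * g2
print("first cost:", costs[0], "last cost:", costs[-1])
```

A built-in optimizer could replace the hand-written loop in step 6; the loop is used here only because it makes each update visible.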

Reference
Machine Learning, Coursera, Andrew Ng
