[Coursera Stanford Machine Learning (week3)] Solving the problem of overfitting

sunnyshiny 2023. 2. 7. 16:26


These are my notes on the lectures from Professor Andrew Ng's Machine Learning course on Coursera.

The problem of Overfitting

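Underfitting, or high bias, is when the form of the hypothesis function h maps poorly to the trend of the data. It is usually caused by a function that is too simple or that uses too few features. At the other extreme, overfitting, or high variance, is when the hypothesis function fits the training data but does not generalize, so it predicts new data poorly. It is usually caused by a complicated function that creates many unnecessary curves and angles unrelated to the data.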

In the leftmost graph, the hypothesis function fails to explain the data well: this is underfitting (high bias). In the rightmost graph, the hypothesis function tries to fit every single data point: this is overfitting (high variance). The middle graph represents the data appropriately.
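The three cases can be reproduced roughly with a small experiment. This is only a sketch: the quadratic data, the noise level, and the polynomial degrees (1, 2, 9) are all made up here for illustration and do not come from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a quadratic trend plus Gaussian noise.
x_train = np.linspace(-1.0, 1.0, 12)
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 + rng.normal(0.0, 0.2, x_train.size)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = 1.0 + 2.0 * x_test - 3.0 * x_test**2 + rng.normal(0.0, 0.2, x_test.size)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial (highest power first) on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 2, 9):  # too simple, about right, too flexible
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree} train_mse={errors[degree][0]:.4f} test_mse={errors[degree][1]:.4f}")
```

The training error can only go down as the degree increases, so a low training error alone does not tell us that the model generalizes; comparing it with the error on held-out points is what exposes overfitting.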

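When overfitting occurs

Reduce the number of features.
1. Manually select which features to keep
2. Model selection algorithm: let an algorithm choose the features
Regularization
1. Keep all the features, but reduce the magnitude of the parameters $\theta$
2. Regularization works well when there are many features, each of which contributes a little to the prediction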

Cost Function

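Suppose a hypothesis function predicts house price from house size, and a high-degree polynomial regression overfits the training set. We can shrink the weights by increasing their cost. For example, if we add the squares of the parameters of the cubic and higher-order terms to the cost, the cost becomes large whenever those weights are large, so the extra terms act as a mechanism that pushes those weights toward zero.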
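As a concrete instance of this idea, assume a quartic hypothesis and a penalty weight of 1000 (both are choices made here only for illustration; any sufficiently large weight has the same effect):

$$ h_{\theta}(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 $$

$$ \min_{\theta} \frac {1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2 $$

To keep this objective small, $\theta_3$ and $\theta_4$ have to be close to 0, so the fitted curve behaves almost like a quadratic.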

Regularization → keep the parameters $\theta$ small.

Simpler hypothesis
Less prone to overfitting

$$ J(\theta) = \frac {1}{2m}\left[\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda \sum_{j=1}^n\theta_j^2\right]\\ \min_{\theta} J(\theta) $$

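$\lambda$ (lambda) is the regularization parameter. It determines how much the cost of the parameters $\theta$ is inflated. If lambda is very large, for example $\lambda = 10^{10}$, every parameter except $\theta_0$ is driven close to 0, so that $h_{\theta}(x) \approx \theta_0$ and the model underfits.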
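The regularized cost can be written in a few lines of NumPy. This is a minimal sketch, assuming a linear hypothesis $h_{\theta}(x) = \theta^T x$ and a design matrix whose first column is all ones; the function name `regularized_cost` and the toy data are my own, not from the course.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) for linear regression with an L2 penalty on theta_1..theta_n.

    X is an (m, n+1) design matrix whose first column is all ones (x_0 = 1).
    """
    m = y.size
    residuals = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)  # theta_0 is not penalized
    return float((residuals @ residuals + penalty) / (2 * m))

# Toy data (made up): y = x, so theta = [0, 1] fits it exactly.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.0, 1.0])

cost_no_reg = regularized_cost(theta, X, y, lam=0.0)  # data term only
cost_reg = regularized_cost(theta, X, y, lam=6.0)     # data term + penalty
```

Even a perfect fit pays a price once $\lambda > 0$, and that price grows with the size of $\theta_1, \dots, \theta_n$, which is what pushes the minimizer toward smaller parameters.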

What happens when lambda is too large? (The correct choice is marked with →.)

Algorithm works fine; setting $\lambda$ to be very large can't hurt it
Algorithm fails to eliminate overfitting.
Algorithm results in underfitting. → Fails to fit even training data
Gradient descent will fail to converge.

Regularized Linear Regression

Regularized linear regression cost function

$$ J(\theta) = \frac {1}{2m}\left[\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda \sum_{j=1}^n\theta_j^2\right]\\ \min_{\theta} J(\theta) $$

Gradient descent

$$ \theta_0 := \theta_0- \alpha\frac {1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \\\theta_j := \theta_j- \alpha [\frac {1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}+\frac {\lambda}{m}\theta_j] \quad (j=1,2,3... n)\\\theta_j := \theta_j(1-\alpha \frac {\lambda}{m})-\alpha\frac {1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} $$

Since $\alpha>0, \lambda>0, m>0 \rightarrow(1-\alpha \frac {\lambda}{m})<1$, each update shrinks $\theta_j$ a little. The second term, $\alpha\frac {1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}$, is exactly the same as in the unregularized update.
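The update rule in its "shrink, then step" form can be sketched as follows. The function name, the default learning rate, the iteration count, and the toy data are assumptions of mine for illustration, not values from the lecture.

```python
import numpy as np

def gradient_descent(X, y, lam, alpha=0.5, num_iters=2000):
    """Batch gradient descent for regularized linear regression.

    X is an (m, n+1) design matrix whose first column is all ones.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    # Multiplicative shrink factor (1 - alpha * lambda / m) for j >= 1.
    shrink = np.full(n_plus_1, 1.0 - alpha * lam / m)
    shrink[0] = 1.0  # theta_0 is not regularized
    for _ in range(num_iters):
        error = X @ theta - y
        grad = (X.T @ error) / m  # unregularized gradient term
        theta = theta * shrink - alpha * grad
    return theta

# Toy data (made up): a noiseless line y = 1 + 2x.
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta_plain = gradient_descent(X, y, lam=0.0)  # recovers roughly [1, 2]
theta_reg = gradient_descent(X, y, lam=5.0)    # slope is pulled toward 0
```

With `lam=0.0` the shrink factor is 1 and the loop is ordinary gradient descent; with `lam > 0` the slope ends up smaller in magnitude, while the intercept is free to compensate.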

Normal equation

The regularized normal equation is the same as the original one, except that an extra term is added inside the parentheses.

$$ \begin{aligned} \theta &= \left( X^TX + \lambda \cdot L \right)^{-1} X^Ty \\ \text{where}\ \ L &= \begin{bmatrix} 0 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & 1 \end{bmatrix} \end{aligned} $$

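L is a matrix with a 0 in the top-left entry, 1s down the rest of the diagonal, and 0s everywhere else. It has dimension (n+1) × (n+1). Intuitively, $\lambda L$ is the identity matrix multiplied by the real number $\lambda$, except that the entry corresponding to $x_0$ is left out.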

If $m < n$, $X^TX$ is non-invertible. However, once the $\lambda$ term is added (with $\lambda > 0$), $X^TX+\lambda L$ is invertible.
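Both points, the closed-form solution and the invertibility when $m < n$, can be checked numerically. This is a sketch under my own assumptions: the function name `normal_equation`, the random data, and the sizes m = 3, n = 5 are chosen only for illustration.

```python
import numpy as np

def normal_equation(X, y, lam):
    """Solve (X^T X + lam * L) theta = X^T y, where L is the identity with L[0, 0] = 0."""
    n_plus_1 = X.shape[1]
    L = np.eye(n_plus_1)
    L[0, 0] = 0.0  # do not regularize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

rng = np.random.default_rng(0)
m, n = 3, 5  # fewer examples than features
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])
y = rng.normal(size=m)

rank_plain = np.linalg.matrix_rank(X.T @ X)  # at most m, so X^T X is singular
theta = normal_equation(X, y, lam=1.0)       # still solvable thanks to lam * L
theta_big = normal_equation(X, y, lam=1e10)  # huge lambda: theta_1..theta_n near 0
```

I use `np.linalg.solve` rather than forming the inverse explicitly, which is the usual numerically safer choice. With `lam=1e10` the solution collapses to roughly `[mean(y), 0, ..., 0]`, which matches the underfitting case $h_{\theta}(x) \approx \theta_0$ described earlier.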

Regularized Logistic Regression

Cost function for Logistic regression

$$ J(\theta) = -\frac {1}{m}\sum_{i=1}^{m} \left[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1-y^{(i)}) \log(1-h_{\theta}(x^{(i)})) \right] $$

Regularized cost function

$$ J(\theta) = -\frac {1}{m}\sum_{i=1}^{m} \left[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1-y^{(i)}) \log(1-h_{\theta}(x^{(i)})) \right] + \frac {\lambda}{2m}\sum_{j=1}^{n} \theta_j^2 $$
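The regularized logistic cost translates into NumPy almost term by term. This sketch assumes the usual sigmoid hypothesis $h_{\theta}(x) = 1/(1+e^{-\theta^T x})$; the function names and the toy data are mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, lam):
    """Cross-entropy cost plus (lam / 2m) * sum of theta_j^2 for j >= 1.

    X is an (m, n+1) design matrix whose first column is all ones; y holds 0/1 labels.
    """
    m = y.size
    h = sigmoid(X @ theta)
    cross_entropy = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)  # theta_0 is excluded
    return float(cross_entropy + penalty)

# Toy data (made up): four examples, one feature.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

cost_at_zero = logistic_cost(np.zeros(2), X, y, lam=3.0)  # h = 0.5 everywhere
theta = np.array([0.0, 2.0])
penalty_only = logistic_cost(theta, X, y, lam=3.0) - logistic_cost(theta, X, y, lam=0.0)
```

At $\theta = 0$ every prediction is 0.5, so the cost is $\log 2$ whatever the value of $\lambda$; the difference between the two costs at `theta = [0, 2]` isolates the penalty term.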

Gradient descent

Apart from the hypothesis function being different, the gradient descent update for regularized logistic regression has the same form as the one for regularized linear regression.
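To make the "same form, different hypothesis" point concrete, here is a rough sketch in which the only change from the linear case is that the prediction goes through a sigmoid. The function names, learning rate, iteration count, and toy data are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, lam, alpha=0.5, num_iters=5000):
    """Batch gradient descent for regularized logistic regression.

    X is an (m, n+1) design matrix whose first column is all ones; y holds 0/1 labels.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        error = sigmoid(X @ theta) - y         # h_theta(x) - y, with h = sigmoid
        grad = (X.T @ error) / m
        grad[1:] += (lam / m) * theta[1:]      # regularize every theta_j except theta_0
        theta -= alpha * grad
    return theta

# Toy data (made up): label is 1 when x > 0.
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = (x > 0).astype(float)

theta_weak = logistic_gradient_descent(X, y, lam=0.01)
theta_strong = logistic_gradient_descent(X, y, lam=10.0)
predictions = (sigmoid(X @ theta_weak) > 0.5).astype(float)
```

A larger $\lambda$ yields a smaller slope, which means a flatter and less confident decision function.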

Reference
Machine Learning, Coursera, Andrew Ng
