Naive Bayes Classifiers

study/Machine Learning

Naive Bayes Classifiers

ys_cs17 2022. 6. 21. 06:26

Naive Bayes Classifiers

이제부터는 discrete-valued feature인 $x\in\{1,\dots,K\}^D$를 어떻게 분류할 것인지 알아보자. 어기서 $K$는 각 feature의 value 개수이고, $D$는 feature의 개수이다. 우리는 generative approach를 사용할 것이고, 이를 위해 class conditional distribution인 $p(x|y=c)$를 지정해야 한다. 가장 간단한 approach는 feature들이 주어진 class label에 대해 conditionally independent라고 가정하는 것이다. 이를 통해 class conditional density를 다음과 같이 1-D density의 곱으로 쓸 수 있다.

$$ p(x|y=c,\theta)=\prod_{j=1}^D p(x_j|y=c,\theta_{jc}) $$

이와 같은 model을 NBC (Naive Bayes Classifier)라고 한다.

이 모델은 feature가 independent 일 것이라고 생각하지 않기 때문에 naive하다고 불린다. 비록 Bayes assumption이 참이 아닐지라도, 괜찮게 동작하는 classifier이다.

여기서 필요한 parameter의 수는 Big-O로 표현하면 $O(C \times D)$가 된다. 수식도 심플하고, overfitting에도 강하다.

Naive Bayes Classifier의 form은 각 feature의 유형에 의존되어 있어 각각 다르다.

$x$: real-value

$x$가 real-value라면 다음과 같이 수식을 정의할 수 있다.

$$ p(x|y=c,\theta)=\prod_{j=1}^D N(x_j|\mu_{jc},\sigma^{2}_{jc}) $$

Binary classifier

$x_{j} \in \{ 0, 1\}$인 경우에는 Bernoulli distribution을 사용한다. 수식은 다음과 같다.

$$ p(x|y=c,\theta)=\prod_{j=1}^D \text{Ber}(x_j|\mu_{jc}) $$

여기서 $\mu_{jc}$는 class $c$에서 $j$ 번째 feature $x$가 발생할 확률 이다.

이를 또한 Multivariate Bernoulli native Bayes라고 부른다.

Categorical features

$x_{j} \in \{1, \dots, K\}$ 인 경우에는 Multinoulli distribution을 사용하고, 수식은 아래와 같다.

$$ p(x|y=c,\theta)=\prod_{j=1}^D \text{Cat}(x_j|\mu_{jc}) $$

$\mu_{jc}$는 위와 마찬가지고 class $c$에서 $x_{j}$에 대해 가능한 $K$개의 확률에 대한 히스토그램이다.

우리는 위와 같이 feature에 종류에 따른 다양한 distribution을 통해 assumption할 수 있다.

Model Fitting

Parameter를 추정하기 위해서는 MLE 또는 MAP를 사용하여 모델을 학습해야 한다.

NBC (Naive Bayes Classifier)에 대한 MLE는 다음과 같다. 조건부 독립을 사용하여 수식을 유도한다.

$$ p(x_{i}, y_{i} \mid \theta) = p(x_{i} \mid y_{i} , \theta)p(y_{i} \mid \theta) \\ =p(y_{i} \mid \pi)p(x_{i} \mid y_{i}, \theta) \\ = \prod_c \pi_{c}^{\mathbb{I}(y_{i}=c)}\prod_j \prod_c p(x_{ij} \mid \theta_{jc})^{\mathbb{I}(y_{i} = c)} $$

위 식을 가지고 log-likelihood form을 만들면 다음과 같다.

$$ \log{p(D \mid \theta)}= \sum_{C=1}^{C}N_{C}\log{\pi_{c}} + \sum_{j=1}^{D}\sum_{C=1}^{C}\sum_{i:y_{i} = c}\log{p(p_{ij} \mid \theta_{jc})} \\ (\hat{\pi} = \frac{N_{c}}{N}, \ \ where \ N_{C} = \sum_{i} \mathbb{I}(y_{i} = c) $$

$N_{C}$는 class $C$에 대한 개수가 된다.

이는 independent training이 가능하다. 이 말의 의미는 Likelihood는 각 feature에 대한 distribution 유형의 달라진 다는 의미이다.

예를 들어 모든 feature가 binary라고 하면 $x_{j} \mid y = c \sim \text{Ber}(\theta_{jc})$를 따를 것이고 이에 대한 MLE는 다음과 같게 된다.

$$ \hat{\theta}{jc} = \frac{N{jc}}{N_{C}} $$

Prediction

Train data $D$와 test data $x$ 가 주어졌을 때, class $c$를 예측하는 것을 다음과 같이 쓸 수 있다.

$$ p(y=c|x,\mathcal D)= \frac{p(x|y=c,\mathcal D)p(y=c|\mathcal D)}{p(x|\mathcal D)}\enspace\text{by Bayes rule}\\ \propto p(y=c|\mathcal D)\prod_{j=1}^Dp(x_j|y=c_i,\mathcal D)\enspace\text{by Assumption} $$

$$ \therefore p(y=c|x,\mathcal D)\propto p(y=c|\mathcal D)\prod_{j=1}^Dp(x_j|y=c,\mathcal D) $$

베이즈 정리를 통해 구한 식에 각 feature들은 독립이라는 가정에 의해 확률에 곱으로 접근할 수 있다.

$\pi : \text{prior}$

$\theta: \text{model paramter}$

이 model에서 학습해야 하는 parameter는 위와 같으며, 이러한 알려지지 않은 parameter를 위 prediction 식에 통합하게 되면 아래와 같은 식을 얻을 수 있다.

$$ p(y=c|x,\mathcal D)\propto\left[\int\text{Cat}(y=c|\pi)p(\pi|\mathcal D)d\pi\right]\prod_{j=1}^D\left[\int\text{Ber}(x_j|y=c,\theta_{jc})p(\theta_{jc}|\mathcal D)\right] $$

위 수식에 대해 설명을 추가하자면, $p(y=c \mid D)$라는 prior term에 $\pi$라는 prior parameter를 넣어 표현하게 되면 $p(y=c \mid \pi)p(\pi \mid D)$와 같이 표현할 수 있다. 이때, 여러 class $c$로 분류해야 하는 상황이기 때문에 Multinoulli distribution을 이용하여 위와 같이 표현할 수 있다.

Likelihood term 역시 $\theta$를 추가하여 표현할 수 있으며, binary feature 이기 때문에, Bernoulli distribution을 사용하여 표현할 수 있다.

따라서 $\pi, \theta$를 고려하여 위와 같이 쓸 수 있다.

만약 posterior가 Dirichlet라면 계산하기 더욱 쉬워진다.

특히 아래의 수식에서 posterior predictive density는 단순하게 posterior mean parameter $\overline{\theta}$를 삽입함으로써 얻을 수 있다는 사실을 알 수 있다.

$$ p(X=j|\mathcal D)=\int p(X=j|\theta)p(\theta)p(\theta|\mathcal D)d\theta\\ =\int p(X=j|\theta_j)\left[ \int p(\theta_{-j},\theta_j|\mathcal D)d\theta_{-j} \right]d\theta_j\\ =\int\theta_jp(\theta_j|\mathcal D)d\theta_j={E}[\theta_j|\mathcal D]=\frac{\alpha_j+N_j}{\sum_k(\alpha_k+N_k)}=\frac{\alpha_j+N_j}{\alpha_0+N} $$

따라서 다음과 같이 쓸 수 있다.

$$ p(y=c|x,\mathcal D)\enspace \propto\enspace\overline\pi_c\prod_{j=1}^D(\overline\theta_{jc})^{\mathbb I(x_j=1)}(1-\overline\theta_{jc})^{I(x_j=0)}\\ \overline\theta_{jk} \enspace=\enspace \frac{N_{jc}+\beta_1}{N_c+\beta_0+\beta_1}\\ \overline\pi_c \enspace=\enspace \frac{N_c+\alpha_c}{N+\alpha_0}\\ \alpha_0 \enspace=\enspace\sum_c\alpha_c $$

Class prior의 mean 값 $\alpha_{0}$은 $\alpha_{c}$의 합으로 가정하며, $\alpha_{c}$ 값은 hyper parameter로써 튜닝이 필요하다.

그리고 $\overline{\theta}, \overline{\pi}$도 위와 같이 가정한다.

이렇게 하면 $\overline{\pi}{c}$ 를 prior로 $\prod{j=1}^D(\overline\theta_{jc})^{\mathbb I(x_j=1)}(1-\overline\theta_{jc})^{\mathbb I(x_j=0)}$를 likelihood로 하는 posterior predictive density인 $p(y=c|x,\mathcal D)$의 값을 얻을 수 있다.

likelihood distribution의 경우 beta distribution을 사용하여 binary feature를 다루는 것이기 때문에, indetify function을 이용하여 표기한다.

만약 이렇게 구한 posterior가 $N$ 개의 데이터에 대해 single point로 $p(\theta|\mathcal D)\approx\delta_{\hat{\theta}}(\theta)$ 와 같이 approximate 될 수 있다고 한다면, $\hat{\theta}$ 는 ML 또는 MAP estimate를 통해 구할 수 있다.

이렇게 해서 $\hat{\theta}$를 구하게 되면, 앞서 구했던 것과 같이 $\hat{\theta}{\text{ML}}= \frac{N{c}}{N}$ 또는 $\hat{\theta}{\text{MLE}} = \frac{N{c} + \alpha}{N+\alpha_{0}}$ 이 된다.

$$ p(y=c|x,\mathcal D) \propto \hat\pi_c\prod_{j=1}^D(\hat\theta_{jc})^{I(x_j=1)}(1-\hat\theta_{jc})^{I(x_j=0)} $$

이전 posterior와 위 수식의 차이는 $\overline{\theta}$를 poseterior mode 또는 $\hat{\theta}_{\text{MLE}}$ 로 변경한 것이지만, 이런 작은 차이가 실제로는 overfitting을 줄이는 결과를 가져온다.

Underflow

The log-sum-exp trick

$x$가 높은 차원을 갖게 되면 under flow 문제가 발생한다. 그 이유는 해당 확률 값이 매우 작을 수 있는데, 이는 확률의 정의에 의해 $\sum_{i}p(x_{i} \mid y=c) = 1$ 이므로 특정한 high-dimensional vector가 나타날 확률이 적기 때문이다. 따라서 이를 해결하기 위해 Bayes rule을 적용할 때 $\log$를 취해 해결한다.

$$ \begin{align*} \log p(y=c|x) &=\log\left[\frac{p(x|y=c)\cdot p(y=c)}{p(x)}\right]\enspace\text{by Bayes rule}\\ &=\log\left[p(x|y=c)\cdot p(y=c)\right]-\log{p(x)} \\ &=b_c-\log\left[\sum_{c'=1}^Ce^{b_{c'}}\right]\\ b_c&\triangleq\log p(x|y=c)+\log p(y=c) \end{align*} $$

여기서 2번째 term은 계산하기 힘들어 아래의 공식을 이용한다.

$$ \log\left[\sum_{c'}e^{b_{c'}}\right]=\log\sum_{c'}p(y=c',x)=\log p(x) $$

여기서 $p(x)$는 multi-dimensional feature 이기 때문에, $p(x)$의 distribution은 feature의 dimension이 커진다면, 확률의 정의 $\sum_{x}p(x) = 1$ 에 의해 값이 작아지는 경우가 생기게 된다. 결과적으로 feature의 차원이 굉장히 크다면, 이 방법 역시도 numerical underflow problem이 발생한다.

Log-Sum-Exp Trick

다음 방법은 $\log\left[\sum_{c'}e^{b_{c'}}\right]$ 에서 가장 큰 term을 추출한 후 나머지 term을 정리하는 방법이 있다.

예를 들어 다음과 같이 표현할 수 있다.

$$ \log(e^{-120}+e^{-121})=\log\left(e^{-120}(e^0+e^{-1})\right)=\log(e^0+e^{-1})-120 $$

위 방식을 이용하여, 우리가 원하는 식을 표현할 수 있다.

$$ \log\sum_{c}e^{b_{c}}=\log\left[\left(\sum_{c}e^{b_{c}-B}\right)e^B\right]=\left[\log\left(\sum_{c}e^{b_{c}-B}\right)\right]+B,\enspace\\ \text{where }B=\max_cb_c $$

이렇게 하는 방식을 log-sum-exp trick이라고 부르며, 널리 사용된다.

이를 통해 numerical underflow를 완화할 수 있다.

Overflow

Feature selection

NBC는 잠재적으로 많은 feature에 대한 joint distribution을 fitting 하기 때문에 overfitting으로 어려움을 겪을 수 있다. 또한 run-time cost가 $O(D)$로 일부 application에서는 너무 높을 수 있다.

이러한 두 가지 문제를 모두 해결하기 위한 일반적인 접근 방식은 classification 과정에서 필요 없는 부분들을 없애기 위해 feature selection을 진행한다.

Feature selection에 대한 가장 간단한 접근 방식은 각 feature의 관련성을 개별적으로 평가한 다음 상위 $K$만 선택하는 것이다. 여기서 $K$는 acc와 complexity 사이에서 일부 절충하여 선택한다.

이때 선택할 feature의 개수 $K$가 너무 크거나 작으면 over/under fitting이 일어날 수 있어 적절하게 설정해야 한다. 이러한 접근 방식을 ranking, filtering, screening이라고 한다.

Feature selection using MI (Mutual information)

관련성을 측정하는 한 가지 방법은 feature $X_{j}$와 class label $Y$ 간의 mutual information을 사용하여, $I(X, Y)$를 구하는 것이다. 식은 다음과 같다.

$$ I(X,Y)=\sum_{x_j}\sum_yp(x_j,y)\log\frac{p(x_j,y)}{p(x_j)p(y)} $$

Mutual information은 feature $j$의 값을 관찰하면 label distribution의 entropy 감소로 생각할 수 있다.

Feature가 binary라면 $x_{j} = \{1,0 \}$이고 MI를 다음과 같이 계산할 수 있다.

$$ I_j =\sum_{x_j}\sum_yp(x_j,y)\log\frac{p(x_j,y)}{p(x_j)p(y)} \\ \sum_{y}(p(x_{j}=0, y)\log \frac{p(x_{j} = 0, y)}{p(x_{j}= 0)p(y)} + p(x_{j}=1, y)\log \frac{p(x_{j} = 1, y)}{p(x_{j}= 1)p(y)}) \\ \sum_{c}\theta_{jc}\pi_{c}\log\frac{\theta_{jc}}{\theta_{j}}+(1-\theta_{jc})\pi_{c}\log \frac{1-\theta_{jc}}{1- \theta_{j}} $$

$$ \pi_c=p(y=c)\\ \enspace\theta_{jc}=p(x_j=1|y=c)\\ \theta_j=p(x_j=1)=\sum_c\pi_c\theta_{jc} $$