[Practical Statistics for Data Scientists] Team B: Boosting


by dudwnqkr 2022. 7. 22. 14:37

Like bagging, boosting is most commonly used together with decision trees. Compared with bagging, however, boosting needs more work in areas such as tuning and regularization. Boosting builds on the concept of residuals, which measure how well a model fits: the models are fit successively (in sequence rather than in parallel), and each new model is trained to reduce the error left by the previous ones. Representative boosting methods include AdaBoost, gradient boosting, and stochastic gradient boosting.

The Boosting Algorithm

The core idea is similar across boosting algorithms, so let's use AdaBoost, the easiest one to follow, to understand how they work.

1. Initialize

Assign every observation the same weight, $w_i = 1/N$.

2. Train

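In each round $m$, fit a model $\hat{f}_m$ on the weighted observations so as to minimize the weighted error $e_m$, the sum of the weights of the misclassified observations.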

3. Ensemble

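a) $\hat{F}_m=\hat{F}_{m-1}+\alpha_m \hat{f}_m$

As a result, the boosted estimate is a linear combination of the individual models.

b) $\alpha_m = \log\frac{1-e_m}{e_m}$

The lower a model's weighted error $e_m$, the larger its coefficient $\alpha_m$ in the ensemble.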

4. Update

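Increase the weights of the misclassified observations so that the next model concentrates on the observations the current ensemble handles poorly, then repeat steps 2 to 4.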
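The four steps can be sketched in base R as follows. This is a minimal illustration, not the book's code: the toy data, the hand-written decision stump (`fit_stump`, `predict_stump`), the number of rounds `M`, and the specific update $w_i \leftarrow w_i e^{\alpha_m}$ for misclassified observations (the usual AdaBoost.M1 form) are all assumptions made for the example. Labels are coded as -1 / +1 so that the sign of the ensemble score gives the predicted class.

```r
# Toy data (made up): two predictors, labels in {-1, +1}
set.seed(1)
n <- 200
x <- matrix(runif(2 * n), ncol = 2)
y <- ifelse(x[, 1] + x[, 2] > 1, 1, -1)

# Hypothetical weak learner: a weighted decision stump that picks the
# (feature, threshold, sign) with the smallest weighted error
fit_stump <- function(x, y, w) {
  best <- list(err = Inf)
  for (j in seq_len(ncol(x))) {
    for (thr in unique(x[, j])) {
      for (s in c(1, -1)) {
        pred <- ifelse(x[, j] > thr, s, -s)
        err <- sum(w[pred != y])
        if (err < best$err) best <- list(j = j, thr = thr, s = s, err = err)
      }
    }
  }
  best
}
predict_stump <- function(stump, x) {
  ifelse(x[, stump$j] > stump$thr, stump$s, -stump$s)
}

M <- 20                                  # number of boosting rounds (assumed)
w <- rep(1 / n, n)                       # 1. Initialize: equal weights 1/N
F_hat <- rep(0, n)                       # running ensemble score
for (m in seq_len(M)) {
  stump <- fit_stump(x, y, w)            # 2. Train on the weighted data
  e_m <- max(stump$err, 1e-10)           # weighted error, guarded against log(1/0)
  alpha_m <- log((1 - e_m) / e_m)        # 3b. model coefficient
  pred <- predict_stump(stump, x)
  F_hat <- F_hat + alpha_m * pred        # 3a. add the new model to the ensemble
  miss <- pred != y
  w[miss] <- w[miss] * exp(alpha_m)      # 4. Update: up-weight misclassified rows
  w <- w / sum(w)                        # renormalize so the weights sum to 1
}
train_error <- mean(sign(F_hat) != y)    # training error of the boosted classifier
train_error
```

Because the weights are renormalized after every round, `stump$err` is always a proper weighted error rate between 0 and 1.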

XGBoost

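XGBoost is an implementation of gradient boosting that supports parallel computation: the construction of each tree is parallelized, while the boosting rounds themselves remain sequential. Let's walk through it using R.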

library(xgboost)

# predictors must be supplied as a numeric matrix
predictors <- data.matrix(loan3000[, c('borrower_score', 'payment_inc_ratio')])
label <- as.numeric(loan3000[, 'outcome']) - 1   # encode the binary outcome as 0 / 1
# fit the XGBoost model
xgb <- xgboost(data=predictors, label=label, objective="binary:logistic",   # objective: binary classification
               params=list(subsample=0.63, eta=0.1), nrounds=100)

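subsample: the fraction of the training data sampled for each weak learner → makes boosting behave like a random forest that samples without replacement
eta: the learning rate → a shrinkage factor applied to each new model's coefficient, which helps prevent overfitting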
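The subsample value of 0.63 is presumably chosen because a bootstrap sample of size n contains on average about 1 - 1/e ≈ 0.632 of the distinct observations, so drawing 63% of the rows without replacement gives each tree roughly as much distinct data as a bagged tree would see. The sketch below checks that fraction with a simulation; the row count `n` is an assumption made for the example.

```r
set.seed(1)
n <- 3000                                        # assumed number of rows

# Bagging-style draw: n rows sampled with replacement
boot_idx <- sample(n, n, replace = TRUE)
frac_distinct <- length(unique(boot_idx)) / n    # share of distinct rows in the bootstrap sample

# subsample-style draw: 63% of the rows, without replacement
sub_idx <- sample(n, round(0.63 * n))

c(bootstrap_distinct = frac_distinct,
  theoretical_limit  = 1 - exp(-1),
  subsample_fraction = length(sub_idx) / n)
```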

library(ggplot2)

pred <- predict(xgb, newdata=predictors)   # predicted probability of default
xgb_df <- cbind(loan3000, pred_default = pred > 0.5, prob_default = pred)
ggplot(data=xgb_df, aes(x=borrower_score, y=payment_inc_ratio,
       color=pred_default, shape=pred_default, size=pred_default)) +
  geom_point(alpha=.8) +
  scale_color_manual(values = c('FALSE'='#b8e186', 'TRUE'='#d95f02')) +
  scale_shape_manual(values = c('FALSE'=0, 'TRUE'=1)) +
  scale_size_manual(values = c('FALSE'=0.5, 'TRUE'=2))

Some borrowers with a high borrower_score are nevertheless predicted to default, so the result is not particularly good.

Regularization

Regularization adds a penalty term to the loss function in order to limit the complexity of the model.

seed <- 400820
set.seed(seed)   # make the train/test split reproducible
predictors <- data.matrix(loan_data[, -which(names(loan_data) %in% 'outcome')])   # every column except outcome is a predictor
label <- as.numeric(loan_data$outcome) - 1   # target encoded as 0 / 1
test_idx <- sample(nrow(loan_data), 10000)   # indices of the held-out test set

xgb_default <- xgboost(data=predictors[-test_idx,], label=label[-test_idx],
                       objective='binary:logistic', nrounds=250, verbose=0)   # fit on the training data
pred_default <- predict(xgb_default, predictors[test_idx,])   # predicted probabilities on the test data
error_default <- abs(label[test_idx] - pred_default) > 0.5   # TRUE where the prediction is on the wrong side of 0.5
xgb_default$evaluation_log[250,]
--------------------------------
iter train_error
1: 250 0.133043

mean(error_default)
--------------------
[1] 0.3529   # training error is about 13.3% but test error is about 35.29%: the model is overfit
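The error indicator abs(label - pred) > 0.5 used above is a compact way of thresholding the predicted probability at 0.5 and comparing the result with the true class. The sketch below checks the equivalence on made-up values; `y_true` and `p_hat` are hypothetical names chosen so the snippet does not overwrite the variables of the walkthrough.

```r
y_true <- c(0, 0, 1, 1, 1)                    # made-up true classes
p_hat  <- c(0.10, 0.80, 0.30, 0.90, 0.55)     # made-up predicted probabilities of class 1

err_abs   <- abs(y_true - p_hat) > 0.5        # compact form used in the post
err_class <- as.numeric(p_hat > 0.5) != y_true  # explicit threshold, then compare

identical(err_abs, err_class)                 # the two indicators agree element by element
mean(err_abs)                                 # misclassification rate
```

The two forms can differ only when a prediction is exactly 0.5, which is also the only case in which `>` and `>=` in the indicator give different results.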

# Regularization: alpha (L1) and lambda (L2)
xgb_penalty <- xgboost(data=predictors[-test_idx,], label=label[-test_idx],
                       params=list(eta=.1, subsample=.63, lambda=1000),   # regularize with lambda
                       objective='binary:logistic', nrounds=250, verbose=0)
pred_penalty <- predict(xgb_penalty, predictors[test_idx,])
error_penalty <- abs(label[test_idx] - pred_penalty) > 0.5

xgb_penalty$evaluation_log[250,]
--------------------------------
iter train_error
1: 250 0.30966

mean(error_penalty)
-------------------
[1] 0.3286   # overfitting is reduced to some extent

alpha: Manhattan distance - L1 regularization
lambda: squared Euclidean distance - L2 regularization
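To make the role of the two parameters concrete, the penalty that XGBoost adds for each tree can be written as below. This follows the form given in the XGBoost documentation as far as I recall it; the symbols $T$ (number of leaves), $w_j$ (leaf weights) and $\gamma$ (per-leaf penalty) are not part of this post, and $\alpha$ here is the L1 parameter, unrelated to the AdaBoost coefficient $\alpha_m$ above.

$$
\Omega(f) = \gamma T + \alpha \sum_{j=1}^{T} |w_j| + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2
$$

With a large value such as lambda=1000, the squared term dominates and shrinks the leaf weights toward zero, so each tree contributes only a small correction.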

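# test error after each round (ntreelimit caps the trees used; newer xgboost releases use iterationrange)
test_error_default <- rep(0, 250)
test_error_penalty <- rep(0, 250)
for (i in 1:250) {
  pred_def <- predict(xgb_default, predictors[test_idx,], ntreelimit=i)
  test_error_default[i] <- mean(abs(label[test_idx] - pred_def) > 0.5)
  pred_pen <- predict(xgb_penalty, predictors[test_idx,], ntreelimit=i)
  test_error_penalty[i] <- mean(abs(label[test_idx] - pred_pen) > 0.5)
}
errors <- rbind(xgb_default$evaluation_log, xgb_penalty$evaluation_log,
                data.frame(iter=1:250, train_error=test_error_default),
                data.frame(iter=1:250, train_error=test_error_penalty))
errors$type <- rep(c('default train', 'penalty train', 'default test', 'penalty test'),
                   rep(250, 4))

ggplot(errors, aes(x=iter, y=train_error, group=type)) +
  geom_line(aes(linetype=type, color=type))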

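With regularization, the training error no longer keeps falling as iter increases (30.97% after 250 rounds, versus 13.3% for the default model), and it stays close to the test error.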

Hyperparameters and Cross-Validation

Cross-validation randomly splits the data into K disjoint groups (folds); for each fold, a model is trained on the data outside the fold and then tested on the fold itself.
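The split itself can be sketched as follows. This is a minimal illustration with assumed values N = 23 and K = 5: shuffling a repeated 1..K vector gives folds whose sizes differ by at most one, whereas drawing fold labels with replacement gives folds of random size. Either way, every observation is held out exactly once.

```r
set.seed(1)
N <- 23   # assumed number of observations
K <- 5    # assumed number of folds

# Balanced assignment: repeat 1..K up to length N, then shuffle
fold_balanced <- sample(rep(1:K, length.out = N))

# Assignment with replacement: fold sizes are random
fold_random <- sample(1:K, N, replace = TRUE)

table(fold_balanced)   # fold sizes differ by at most one
table(fold_random)     # fold sizes vary

# Each observation appears in exactly one held-out fold
held_out <- unlist(lapply(1:K, function(k) which(fold_balanced == k)))
```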

N <- nrow(loan_data)
fold_number <- sample(1:5, N, replace=TRUE)   # randomly assign each row to one of folds 1..5
params <- data.frame(eta = rep(c(.1, .5, .9), 3),
                     max_depth = rep(c(3, 6, 12), rep(3,3)))

error <- matrix(0, nrow=9, ncol=5)   # rows: parameter combinations, columns: folds
for(i in 1:nrow(params)){
  for(k in 1:5){   # 5-fold
    fold_idx <- (1:N)[fold_number == k]   # rows belonging to fold k
    xgb <- xgboost(data=predictors[-fold_idx,], label=label[-fold_idx],   # train without fold k
                   params=list(eta=params[i, 'eta'],
                               max_depth=params[i, 'max_depth']),
                   objective='binary:logistic', nrounds=100, verbose=0)
    pred <- predict(xgb, predictors[fold_idx,])   # predict on fold k
    error[i, k] <- mean(abs(label[fold_idx] - pred) >= 0.5)
  }
}

avg_error <- 100 * round(rowMeans(error), 4)   # average error (%) across the folds
cbind(params, avg_error)
--------------------------------------------
  eta max_depth avg_error
1 0.1         3     32.90
2 0.5         3     33.43
3 0.9         3     34.36
4 0.1         6     33.08
5 0.5         6     35.60
6 0.9         6     37.82
7 0.1        12     34.56
8 0.5        12     36.83
9 0.9        12     38.18
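In this table the error rises with both eta and max_depth, so the smallest, slowest-learning configuration does best. A short sketch of how the best row could be picked out programmatically; the data frame `cv_results` is a hypothetical name and simply re-enters the numbers printed above.

```r
# Cross-validation results copied from the table above
cv_results <- data.frame(
  eta       = rep(c(0.1, 0.5, 0.9), 3),
  max_depth = rep(c(3, 6, 12), each = 3),
  avg_error = c(32.90, 33.43, 34.36, 33.08, 35.60, 37.82, 34.56, 36.83, 38.18)
)

# Parameter combination with the lowest average cross-validated error
best <- cv_results[which.min(cv_results$avg_error), ]
best

# Average error per learning rate, to see the trend in eta
aggregate(avg_error ~ eta, data = cv_results, FUN = mean)
```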
