
XGBoost

XGBoost is widely used by Kaggle competition winners.

Fast, accurate, and manages the bias-variance tradeoff well

Relatively robust to the curse of dimensionality

https://towardsdatascience.com/the-ultimate-guide-to-adaboost-random-forests-and-xgboost-7f9327061c4f?gi=22e5809f407a

bagging (parallel ensemble) and boosting (sequential ensemble): bagging, short for bootstrap aggregating, means training many decision trees independently on bootstrap samples and combining them all at the end, e.g. by majority voting for the output. Boosting means starting with a weak learner and sequentially adding learners, each one improving on the predictions of the previous ones. A small sketch of the bagging/majority-voting side follows below.
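A minimal sketch of the bagging idea, assuming scikit-learn and numpy are available (the toy dataset, tree depth and number of trees are arbitrary choices for illustration, not anything prescribed here):

# Bagging by hand: bootstrap samples, independent trees, majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # bootstrap: sample data points with replacement
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

# aggregation: majority vote over all trees' predictions
votes = np.stack([t.predict(X) for t in trees])       # shape: (n_trees, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of the bagged vote:", (majority == y).mean())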


eXtreme Gradient Boosting

But before that:

Decision Tree - a single tree built from a series of if-else splits on individual features

Random Forest - an ensemble of decision trees. Multiple decision trees are constructed by randomly sampling data points with replacement (bootstrapping) and by using different feature subsets to add more variation between the trees. Finally all results are combined and a majority vote gives the prediction (aggregation).

Gradient Boosting - also combines decision trees, but the combining starts during training instead of only at the end. It is an ensemble of decision trees, but unlike random forest it doesn't build independent random trees; instead it uses a method called boosting: weak learners are combined sequentially so that each new tree corrects the errors of the previous ones. A quick side-by-side of the three is sketched below.
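A quick comparison of the three families, assuming scikit-learn is installed (the toy dataset and hyperparameters are placeholder choices, not recommendations):

# Cross-validate a single tree, a random forest and a gradient-boosted ensemble on the same data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=5),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=200, learning_rate=0.1),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())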

weak learner = a tree with only one decision split - called a decision stump

we use the output of the first tree to calculate the loss, then we fit a second tree that reduces that loss (with squared error the per-sample errors are the residuals - see more explanation below)
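A bare-bones illustration of that residual step, assuming scikit-learn and numpy (the toy data, tree depths and the 0.1 shrinkage factor are arbitrary choices for the sketch):

# Fit one tree, compute its residuals, fit a second tree on those residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residuals = y - tree1.predict(X)                        # what tree1 got wrong

tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
combined = tree1.predict(X) + 0.1 * tree2.predict(X)    # add a shrunken correction

print("MSE tree1 alone  :", np.mean((y - tree1.predict(X)) ** 2))
print("MSE tree1 + tree2:", np.mean((y - combined) ** 2))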




Gradient boosting - we combine multiple rules of thumb to make an accurate and informed decision

So one approach:
first train a simple model and make predictions -- on the train set, or maybe on data sampled separately from the train set? -- then use half correctly classified and half incorrectly classified samples to train the next model -- and the third will take data from both the first and the second...


So to build a strong hypothesis (each individual hypothesis is a weak learner):
We combine multiple hypotheses and take a majority vote to get a result... hypotheses are added one after another...
Another way is to weight the hypotheses - AdaBoost (adaptive boosting)

METHOD ADABOOST:

1. Take all training data (initialize all sample weights equally).
2. Construct the first hypothesis (weak learner 1) by training on these data points.
3. Re-weight the data points based on the prediction error (wrongly predicted samples get a higher weight).
4. Then train the next hypothesis on the re-weighted data points.
5. Also update the weight of the hypothesis itself (based on its error).

The final prediction is the weighted majority vote (or weighted median in case of regression problems)
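A compact, hand-rolled version of those steps for binary classification, assuming scikit-learn decision stumps and labels recoded to {-1, +1} (this is a sketch of classic discrete AdaBoost; the constants are illustrative):

# Hand-rolled AdaBoost with decision stumps (scikit-learn and numpy assumed installed).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=400, n_features=10, random_state=0)
y = np.where(y01 == 1, 1, -1)            # the AdaBoost math is cleaner with labels in {-1, +1}

n_rounds = 50
w = np.full(len(X), 1 / len(X))          # step 1: equal sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1)              # step 2: weak learner (stump)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    err = np.sum(w * (pred != y)) / np.sum(w)                # weighted training error
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))          # step 5: weight of the hypothesis

    w *= np.exp(-alpha * y * pred)                           # step 3: up-weight the mistakes
    w /= w.sum()                                             # step 4: next stump trains on these weights

    stumps.append(stump)
    alphas.append(alpha)

# final prediction: weighted majority vote over all stumps
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(scores) == y))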

METHOD GRADIENT BOOST (better than AdaBoost):

In AdaBoost, learning happens by updating the weights of misclassified data points. All decision trees are usually stumps (only one decision, i.e. depth one with 2 leaf nodes): we build a tree, give more weight to the misclassified samples in the same train set, retrain another stump, and so on. At the end the individual decisions are combined by weighted averaging / weighted majority voting.

In GBoost, learning happens by optimizing a loss function. Here the decision trees can be larger: the number of leaf nodes is usually between 8 and 32, so the trees are deeper than in AdaBoost, which only had one decision per tree (stumps, i.e. 2 leaf nodes).

Gradient boosting does the weighting by calculating gradients of the loss,
but for large datasets the gradient calculation is slow, and too many hypotheses add complexity and can make the model overfit.

Steps:
1. Loss/residual calculation: first take the average of the target variable (y) and use it as the initial prediction (y^). Then compute the residual for each sample, r = y - y^; the total loss is something like RMSE for regression.
2. Then a decision tree model is fit on these residuals.
3. Then that tree predicts the residuals, and we get new residuals.
4. Now we get the new prediction as:
            New Prediction = Prev Prediction + learning rate * predicted residual
5. Using this new prediction, calculate the residuals again, fit another tree, and so on.

So final prediction will look like:

Final Prediction = Base Value + LR * 1st Residual Prediction + LR * 2nd Residual Prediction + ...

All prediction models (decision trees / residual models) are trained on residual values, as in the sketch below.
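The whole loop in a few lines, assuming scikit-learn and numpy (squared-error loss, toy data, 100 trees and a learning rate of 0.1 are assumptions of the sketch):

# Gradient boosting for regression from scratch (scikit-learn and numpy assumed installed).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=500)

learning_rate = 0.1
base_value = y.mean()                       # step 1: the initial prediction is the average of y
prediction = np.full_like(y, base_value)
trees = []

for _ in range(100):
    residuals = y - prediction                                   # residuals r = y - y^
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)  # step 2: fit a tree on the residuals
    prediction += learning_rate * tree.predict(X)                # step 4: shrunken update
    trees.append(tree)

# Final Prediction = Base Value + LR * tree1(x) + LR * tree2(x) + ...
def predict(X_new):
    out = np.full(len(X_new), base_value)
    for tree in trees:
        out += learning_rate * tree.predict(X_new)
    return out

print("training RMSE:", np.sqrt(np.mean((y - predict(X)) ** 2)))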



METHOD XGBoost:

Gradient boosting with some enhancements


Data is divided into blocks (pre-sorted column blocks), which lets XGBoost find splits quickly on sorted columns and parallelize the work.

Steps:
1. Similar to gradient boosting, it first takes the average as the initial prediction, then calculates the residuals (errors).
2. A model (decision tree / residual model) is fit on these residuals/errors - RM1
3. Differences from GBoost are three extra parameters (see the usage sketch below):
            Lambda - regularization parameter - protection from outliers / more generalization
            Gamma - threshold that decides auto-pruning of the tree, i.e. whether to split / create a branch further down the tree
            Eta - how fast you want to move toward the next value, like a learning rate (typically 0.3, or 0.1-0.2)
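How those three knobs map onto parameter names in the xgboost Python package, as a minimal usage sketch (the data and the specific values are placeholders; learning_rate is xgboost's name for eta, and lambda is exposed as reg_lambda):

# Minimal XGBoost usage sketch (assumes the xgboost and scikit-learn packages are installed).
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBRegressor(
    n_estimators=200,
    learning_rate=0.1,   # eta: how fast each new tree moves the prediction
    reg_lambda=1.0,      # lambda: L2 regularization on the leaf outputs
    gamma=0.0,           # gamma: minimum gain required to keep a split (auto-pruning threshold)
    max_depth=6,
)
model.fit(X_train, y_train)
print("test R^2:", model.score(X_test, y_test))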

https://www.youtube.com/watch?v=pzFr7ske6Tc

Similarity Score at a node in the decision tree:
        SS = (sum of residuals at that node)^2 / (number of residuals + lambda)
        e.g. for residuals 2, -4, 1: SS = (2 + (-4) + 1)^2 / (3 + lambda)

        So if the residuals have opposite signs and cancel out, the similarity score is small.

Now, for a candidate split in the decision tree,
        gain = (SS of left child + SS of right child) - SS of the node before the split
        (in a normal decision tree we use information gain instead)

Here comes gamma: gamma decides if we want to keep a split or not; if the gain is less than gamma, the split won't happen. This is called auto-pruning.

Lambda also increases pruning, since it decreases the overall similarity scores (and hence the gains) - and it also decreases the effect of outliers.

                            New Prediction = Prev Prediction + Learning Rate (eta) * output of the tree
                            (for regression, the output of a leaf = sum of residuals in the leaf / (number of residuals + lambda))
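A tiny worked example of the similarity score, gain, gamma pruning and new prediction, with made-up numbers (the residuals, lambda, gamma, eta and previous prediction below are purely illustrative):

# Plain-Python walk-through of the similarity score, gain, pruning check and leaf output.
lam = 1.0      # lambda: regularization
gamma = 0.5    # gamma: minimum gain needed to keep a split
eta = 0.3      # learning rate

def similarity(residuals):
    return sum(residuals) ** 2 / (len(residuals) + lam)

root = [2, -4, 1]             # residuals reaching this node
left, right = [2, 1], [-4]    # residuals in the two leaves of a candidate split

gain = similarity(left) + similarity(right) - similarity(root)
print("gain:", gain)                       # 3.0 + 8.0 - 0.25 = 10.75
print("keep the split?", gain > gamma)     # auto-pruning: drop the split if the gain <= gamma

# leaf output and the updated prediction for a sample landing in the left leaf
output = sum(left) / (len(left) + lam)     # = 3 / 3 = 1.0
prev_prediction = 0.5                      # made-up previous prediction
print("new prediction:", prev_prediction + eta * output)   # 0.5 + 0.3 * 1.0 = 0.8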


Read More: