A deep dive in gradient boosting with LightGBM
2019-09-02 Fredrik Olsson

One of the most powerful techniques for building predictive Machine Learning models is the Gradient Boosting Machine. Gradient boosting is widely used in both industry and by Machine Learning competition winners, and the method can be used for a lot of different problems like regression, binary classification and multi-class classification. In this post, we will turn our focus to gradient boosting and try to get an understanding of what this algorithm does and how it works, as well as applying gradient boosting to the Titanic disaster dataset, using the gradient boosting framework LightGBM.

Titanic disaster dataset

The Titanic disaster dataset contain information about the passengers on board the Titanic and whether they survived or not. More precisely, the dataset contain the following information for every passenger:

  • Survived: Whether the passenger survived or not
  • Sex: The sex of the passenger
  • Age: The age of the passenger
  • Fare: How much the passenger payed for the ticket
  • Ticket: Whether the ticket was a 1st, 2nd or 3rd class ticket
  • Family: How many family members the passenger had on board the ship
  • Title: Mr, Mrs, Master or Miss (indicating if the passenger was married or not)
  • Embarked: Which of the three places the passenger embarked on the ship

  • Age, Fare and Family size are numerical features, while Survived and Sex are binary features (False/True, 0/1). Ticket class, Title and Embarked are all categorical features, we use one-hot-encoding to represent these features. For example, if a passenger bought a 2nd class ticket, this will be represented in the following way:


    1st Class2nd Class3rd Class
    010

    In total, this dataset contain 892 data points (892 passengers), and the data set has been split in a training set of size 792 that will be used to train the model and a test set of size 100 that we will use to test the performance of the model.

    In this case our target variable, that is what we want to be able to predict given all the other features, is the Survival variable. Since this is either true or false, we have binary classification problem - we want to predict if a passenger survived the Titanic disaster or not.

    Baseline Model

    To have something to compare our gradient boosting model’s performance with, we start by performing a standard method for binary classification, namely logistic regression. As an evaluation metric, we will use weighted F1-score. The F1-score is based on precision and recall, and can for each class be computed as:

    F1-score=2PrecisionRecallPrecision + Recall \text{F1-score}=2\cdot\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision + Recall}}

    The weighted F1-score is a weighted sum of the F1-scores for each class:

    Weighted F1-score=w0F1-score0+w1F1-score1 \text{Weighted F1-score}=w_0\cdot\text{F1-score}_0 + w_1\cdot\text{F1-score}_1

    where w0w_0 and w1w_1 are the proportion of data points belonging to class 0 and class 1 respectively, in the test set.

    Training a logistic regression model on the training set, and then evaluating the model on the test set gives us the following result:

    Predicted 0Predicted 1
    Actual 0577
    Actual 1531

    Since there are 64 data points in class 0 and 36 data points in class 1, in the test set, the weighted F1-score for the logistic regression model works out to:

    Weighted F1-score=0.640.90+0.360.84=0.88 \text{Weighted F1-score}=0.64\cdot0.90 + 0.36\cdot0.84=0.88

    Gradient Boosting

    Before we move on to implementing a gradient boosting model on the Titanic disaster dataset, we start by explaining what gradient boosting does and how it works.

    Gradient boosting uses an ensemble of decision tree learners. Now, if you have heard of the Random Forest algorithm, the concept of using an ensemble of decision tree learners may sound familiar, since random forests use this as well. There is however, a big difference compared to gradient boosting in how these tree learners are constructed. In a random forest, each tree is created independently of the others and the models weigh their respective result together equally. In gradient boosting, we let each new tree be based on the prediction of the previous trees, so that they can learn from the mistakes the previous trees made. This is the fundamental idea in boosting: converting weak learners into strong learners.

    By a weak learner we mean a model whose performance is at least slightly better than random chance. In a random forest each tree is created independently, so these trees are all strong learners by themselves, but in gradient boosting we want room for improvement so they can learn from each other. To make sure that a decision tree is indeed a weak learner, we can limit its depth, number of leaves and set a minimal number of data points that need to be in a leaf.

    So, the question is how gradient boosting make the new decision trees learn from predictions made by the previous trees. The answer gives rise to why the method is called gradient boosting. What gradient boosting does, is that each new tree model Δn\Delta_n, train on the negative gradient of the loss function with respect to the previous prediction pn1(x)p_{n-1}(x)

    pn1(x)L(y,pn1(x))=[L(y1,pn1(x1))pn1(x1),,L(ym,pn1(xm))pn1(xm)] \begin{aligned} -\nabla_{p_{n-1}(x)} L(y,p_{n-1}(x))=-\left[\frac{\partial L(y_1,p_{n-1}(x_1))}{\partial p_{n-1}(x_1)}, \quad\dots\quad,\frac{\partial L(y_m,p_{n-1}(x_m))}{\partial p_{n-1}(x_m)} \right] \end{aligned}

    instead of training on the actual labels y=(y1,...,ym)y=(y_1,...,y_m).

    Also, x1,...,xmx_1,...,x_m are the feature vectors for all mm data points and x=(x1,...,xm)x=(x_1,...,x_m). Then we let the new prediction pn(x)p_n(x) be:

    pn(x)=pn1(x)+αΔn(x) \begin{aligned} p_n(x)=p_{n-1}(x)+\alpha\cdot\Delta_{n}(x) \end{aligned}

    for some learning rate α\alpha. Therefore, since Δn\Delta_n is an approximation of pn1(x)L(y,pn1(x))-\nabla_{p_{n-1}(x)} L(y,p_{n-1}(x)), gradient boosting performs gradient descent in function space:

    pn(x)=pn1(x)αpn1(x)L(y,pn1(x)) \begin{aligned} p_n(x)=p_{n-1}(x)-\alpha\cdot\nabla_{p_{n-1}(x)} L(y,p_{n-1}(x)) \end{aligned}

    Now, gradient descent in function space might sound awfully complicated and abstract. Gradient descent is typically used to find the xx-value that minimizes some function f(x)f(x):

    xn=xn1αxn1f(xn1) \begin{aligned} x_n = x_{n-1}-\alpha\cdot\nabla_{x_{n-1}} f(x_{n-1}) \end{aligned}

    Here, gradient descent iterates through different points in the space of possible xx-values to find the value that minimizes f(x)f(x). Gradient descent in function space instead iterates through different functions (in this case predictor models) in the space of possible functions to find the function that minimizes the loss function L(y,p(x))L(y,p(x)). If the function concept is hard to grasp, one can rather think about it as finding the prediction in the space of possible predictions that minimizes the loss function.

    The idea in gradient descent is to, at each iteration, take steps in the direction that the function we want to minimize decreases the most, which is the direction of the negative gradient. To further improve the intuition of why we want to train Δn\Delta_n on pn1(x)L(y,pn1(x))-\nabla_{p_{n-1}(x)} L(y,p_{n-1}(x)), we look at an example of a loss function, namely the squared error that we typically use in regression:

    L(y,p(x))=12i=1m(yip(xi))2 \begin{aligned} L(y,p(x))=\frac{1}{2}\sum_{i=1}^m (y_i-p(x_i))^2 \end{aligned}

    If we calculate the gradient:

    p(x)L(y,p(x))=[L(y1,p(x1))p(x1),,L(ym,p(xm))p(xm)]=[p(x1)y1,,p(xm)ym] \begin{aligned} \nabla_{p(x)} L(y,p(x))=\left[\frac{\partial L(y_1,p(x_1))}{\partial p(x_1)}, \quad\dots\quad,\frac{\partial L(y_m,p(x_m))}{\partial p(x_m)} \right] \\\\ =\left[p(x_1)-y_1, \quad\dots\quad,p(x_m)-y_m \right] \end{aligned}

    we see that we are training the new tree model on the residuals from the previous predictions. Also for the log loss function that is typically used as loss function in binary classification, the gradient is the residual vector. So, thinking about the gradient as the residual vector and that the model wants to minimize the sum of these residuals, the model will focus on the large residuals, that is the predictions that the previous model got the most wrong. In that way, the new model will learn from the mistakes made by previous models to improve its predictive power.

    We summarize the gradient boosting model. First, to initialize the algorithm, we need a starting prediction model p0(x)p_0(x). We just let p0(x)p_0(x) be the constant value vv that minimizes the loss function L(y,v)L(y,v), often this is the mean value of the true labels yy, i.e.v=1mi=1myiv=\frac{1}{m}\sum_{i=1}^m y_i. So, the algorithm can be summarized as (with NN as the number of tree learners):

    Let p0(x) be the constant value v minimizing L(y,v)for n=1 to NTrain decision tree Δn on pn1(x)L(y,pn1(x))Let pn(x)=pn1(x)+αΔn(x)endreturn pN \begin{aligned} &\text{Let } p_0(x) \text{ be the constant value } v \text{ minimizing } L(y,v) \\ &\textbf{for } n=1 \textbf{ to } N \\ &\quad\quad \text{Train decision tree } \Delta_n \text{ on } -\nabla_{p_{n-1}(x)} L(y,p_{n-1}(x)) \\ &\quad\quad \text{Let } p_n(x)=p_{n-1}(x)+\alpha\cdot\Delta_{n}(x) \\ &\textbf{end} \\ &\textbf{return } p_N \end{aligned}

    LightGBM model

    Now it is time to implement a gradient boosting model on the Titanic disaster dataset. There are several frameworks one can use, and we will use LightGBM, which is a gradient boosting method developed by Microsoft, that is implemented with several adjustments to improve things like time and memory efficiency, accuracy and parallel learning.

    For many machine learning algorithms, there are only a few parameters and the default settings often works the best, or at least very well. This makes these algorithms easy to implement. For example, this is the case for our logistic regression model. In gradient boosting however, we have a lot of different parameters that we can specify and the default settings are almost never the best solution. To do the parameter tuning, one often does a cross-validation analysis using e.g. grid search, that is we train the model on a lot of combinations of different parameter values, and see which performs best on a cross-validation set according to some evaluation metric.

    After some parameter tuning, we got the following result by applying our LightGBM model on the Titanic disaster dataset:

    Predicted 0Predicted 1
    Actual 0622
    Actual 1927

    and calculating precision, recall and F1-score for each class in the same way as we did for the logistic regression model, we get the following weighted F1-score:

    Weighted F1-score=0.640.92+0.360.83=0.89 \begin{aligned} \text{Weighted F1-score}=0.64\cdot0.92 + 0.36\cdot0.83=0.89 \end{aligned}

    which is slightly better than the weighted F1-score of 0.88 we got with logistic regression.

    Custom loss function

    So, we were able to improve the results compared to the baseline model. Perhaps we can do even better. Looking at the predictions from the LightGBM model, we see that there are rather many false negatives compared to false positives. Maybe we can improve the results with a loss function that punishes false negatives more than false positives. As you might have noticed, the gradient boosting algorithm is not expressed in a specific loss function, but just as a general loss function LL. This makes it possible for gradient boosting to tackle several different problems like regression, binary classification and multi-class classification, but also enables us to chose our own loss function. Since we need the gradient of the loss function, we just need to make sure that our loss function is differentiable.

    As a default when doing binary classification, LightGBM uses the log loss function as loss function:

    L(y,p(x))=ylnp(x)(1y)ln(1p(x)) \begin{aligned} L(y,p(x))=-y\ln{p(x)}-(1-y)\ln{(1-p(x))} \end{aligned}

    where p(x)=11+exp(x)=\frac{1}{1+e^{-x}} is the Sigmoid function (also known as the logistic function). Note that this is the same loss function used in logistic regression. Now, if we would like a loss function that penalizes false negatives more than false positives, we could adjust the log loss function in the following way:

    L(y,p(x))=βylnp(x)(1y)ln(1p(x)) \begin{aligned} L(y,p(x))=-\beta\cdot y\ln{p(x)}-(1-y)\ln{(1-p(x))} \end{aligned}

    where β>1\beta > 1 is a constant (the higher the value, the more we punish false negatives). To use our own customized loss function in LightGBM, we need to provide a function that takes the true values and predictions as input, and outputs the gradient and Hessian (second derivative) of our loss function. So, we start by finding the first and second derivative for our modified log loss function with respect to the predictions. The chain rule of derivation gives us:

    L(y,p(x))x=L(y,p(x))p(x)p(x)x=(βyp(x)+1y1p(x))ex(1+ex)2=(1y)p(x)βy(1p(x))p(x)(1p(x))exp2(x)=p(x)(1+βyy)βy1p(x)exp(x)=p(x)(1+βyy)βyex1+exexp(x)=p(x)(1+βyy)βyp(x)p(x)=p(x)((β1)y+1)βy \begin{aligned} \frac{\partial L(y,p(x))}{\partial x}=\frac{\partial L(y,p(x))}{\partial p(x)}\cdot\frac{\partial p(x)}{\partial x}=\left(-\frac{\beta y}{p(x)}+\frac{1-y}{1-p(x)}\right)\cdot\frac{e^{-x}}{(1+e^{-x})^2} \\\\ =\frac{(1-y)p(x)-\beta y(1-p(x))}{p(x)(1-p(x))}\cdot e^{-x}p^2(x)=\frac{p(x)(1+\beta y-y)-\beta y}{1-p(x)}\cdot e^{-x}p(x) \\\\ =\frac{p(x)(1+\beta y-y)-\beta y}{\frac{e^{-x}}{1+e^{-x}}}\cdot e^{-x}p(x)=\frac{p(x)(1+\beta y-y)-\beta y}{p(x)}\cdot p(x) \\\\ =p(x)((\beta-1)y+1)-\beta y \end{aligned}

    and

    2L(y,p(x))x2=(L(y,p(x))x)p(x)p(x)x=((β1)y+1)ex(1+ex)2=((β1)y+1)11+exex1+ex=((β1)y+1)p(x)(1p(x)) \begin{aligned} \frac{\partial^2 L(y,p(x))}{\partial x^2}=\frac{\partial(\frac{\partial L(y,p(x))}{\partial x})}{\partial p(x)}\cdot\frac{\partial p(x)}{\partial x}=((\beta-1)y+1)\cdot\frac{e^{-x}}{(1+e^{-x})^2} \\\\ =((\beta-1)y+1)\cdot\frac{1}{1+e^{-x}}\frac{e^{-x}}{1+e^{-x}}=((\beta-1)y+1)p(x)(1-p(x)) \end{aligned}

    Having the expressions for the gradient and Hessian for our custom loss function, we can implement the function in Python, and later pass it to LightGBM:

    Now we can try out our custom loss function in the LightGBM model, for different values of β\beta. With β=2.5\beta=2.5, we get the same number of false negatives and false positives, but the overall performance has decreased (F1-score=0.86), so this is not a better solution. However, with β=1.5\beta=1.5, we are actually able to increase the performance from the previous model (but still having a certain imbalance between false negatives and false positives). We get the following results:

    Predicted 0Predicted 1
    Actual 0613
    Actual 1729

    yielding the weighted F1-score below:

    Weighted F1-score=0.640.92+0.360.86=0.90 \begin{aligned} \text{Weighted F1-score}=0.64\cdot0.92+0.36\cdot0.86=0.90 \end{aligned}

    which is a slight improvement on our first LightGBM model.