Regression
Questions and Overview
Linear Regression
Linear regression is a supervised learning method. It fits a linear equation of the form y = β0 + β1x1 + β2x2 + … + βnxn + ε, where the β values are coefficients learned by training the model on data, the x values are the feature values of a data point, and ε is the error term, which reflects that the model will not make perfect predictions and some deviation will always remain. With a single feature this equation describes a straight line; with several features it describes a plane or hyperplane. The coefficients are found with a method called ordinary least squares, which chooses the β values that minimize the sum of squared residuals, that is, the squared differences between our observations and the model's predictions [1].
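As a minimal sketch of ordinary least squares, the example below fits the single-feature case y = β0 + β1x using the closed-form solution and only the Python standard library. The data points and the function names (fit_ols, sum_squared_residuals) are made up for illustration and are not part of the project code.

```python
def fit_ols(xs, ys):
    """Fit y = b0 + b1 * x by ordinary least squares (single feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: forces the fitted line through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


def sum_squared_residuals(xs, ys, b0, b1):
    """The quantity that ordinary least squares minimizes."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Illustrative data only (not from the project).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_ols(xs, ys)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
print(f"sum of squared residuals = {sum_squared_residuals(xs, ys, b0, b1):.4f}")
```

Any other choice of intercept and slope gives a larger sum of squared residuals on the same data, which is what "least squares" means.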
Logistic Regression
Logistic regression is another supervised learning method. It predicts the probability that an observation belongs to one of (most often) two classes. It does this by using a function called the inverse logit, or sigmoid, to transform a linear score into a probability, according to the following equation: p = e^(β0 + β1x1 + β2x2 + … + βnxn) / (1 + e^(β0 + β1x1 + β2x2 + … + βnxn)). Again, the β values are our coefficients and the x values come from our data. Unlike the linear regression equation, there is no additive error term ε here: the uncertainty is expressed by the probability itself. Because the output is a probability, it lies between 0 and 1, and the model tells us how likely it is that a data point belongs to one class or the other [1].
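The transformation can be sketched in a few lines of standard-library Python. The function names below (sigmoid, logit) are chosen for illustration; the branch on the sign of z is only there to avoid overflow for very large scores and computes the same value as e^z / (1 + e^z).

```python
import math


def sigmoid(z):
    """Inverse logit: map any real-valued score z to a probability."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logit(p):
    """Log-odds: the inverse of the sigmoid, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"z = {z:+.1f} -> p = {sigmoid(z):.4f}")
```

A score of 0 maps to a probability of exactly 0.5, negative scores map below 0.5, and positive scores map above it.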
Similarities and Differences
Both linear and logistic regression are built on a linear combination of the predictive features, and both use statistical methods to "learn" the coefficient values of that combination from training data. They differ in what they predict: linear regression is used to predict continuous values, whereas logistic regression is used to predict a binary outcome (yes or no, 0 or 1, etc.). They also differ in the objective used to fit the coefficients: linear regression minimizes the sum of squared residuals (ordinary least squares), while logistic regression maximizes the log-likelihood of the observed class labels. Finally, the fitted equations themselves (shown above) are different [1].
The Sigmoid Function
Logistic regression uses the sigmoid function to turn the output of a linear equation (the same kind of equation found in linear regression, with coefficients learned from the data) into a probability score. In this way, the sigmoid allows any continuous numerical value to be converted into the likelihood that a data point belongs to, or is "binned" into, one category or another (usually a binary one, such as "yes" or "no", 0 or 1, etc.). For example, if we were to use logistic regression to predict whether a patient carries a disease (using medical information such as platelet count or white blood cell count, perhaps), we would find the coefficients of the linear equation (our β values), compute the linear score for the patient, and then pass that score through the sigmoid function, p = e^(β0 + β1x1 + β2x2 + … + βnxn) / (1 + e^(β0 + β1x1 + β2x2 + … + βnxn)), to generate a probability (between 0 and 1) that the patient has the disease [1].
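To make the patient example concrete, here is a small sketch that scores one patient. Everything numeric in it is hypothetical: the intercept, the two coefficients, the standardized feature values, and the 0.5 decision threshold are invented for illustration and do not come from any fitted model.

```python
import math


def predict_probability(features, intercept, coefficients):
    """Linear score b0 + b1*x1 + ... + bn*xn passed through the sigmoid."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for two standardized features:
# [platelet count, white blood cell count].
intercept = -0.5
coefficients = [-0.8, 1.2]

# Hypothetical patient: low platelet count, elevated white blood cell count.
patient = [-1.0, 1.5]

p = predict_probability(patient, intercept, coefficients)
label = 1 if p >= 0.5 else 0  # 0.5 is a common default threshold, not a rule
print(f"probability of disease = {p:.3f}, predicted class = {label}")
```

The threshold is a separate choice from the model itself; it could be lowered if missing a sick patient is costlier than a false alarm.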
Maximum Likelihood and Logistic Regression
Once the linear score has been passed through the inverse logit function, ordinary least squares is no longer a suitable way to fit the model. Instead, we use a method called maximum likelihood estimation to find the optimal coefficients. It works by choosing the coefficients that maximize the probability of the class labels we actually observed, given the feature vectors of our data points. There is no closed-form solution, so numerical optimization algorithms are used to find the maximum. The key point to remember is that this is how we find the best possible coefficients for logistic regression [1] [2].
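One of the simplest such optimizers is gradient ascent on the log-likelihood. The sketch below fits a single-feature logistic regression this way using only the standard library. The data, learning rate, and step count are illustrative assumptions, and production libraries typically use more sophisticated solvers than this.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(xs, ys, b0, b1):
    """Log-probability of the observed labels under coefficients (b0, b1)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Maximize the log-likelihood by plain gradient ascent."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the log-likelihood: sum of (label - predicted probability),
        # weighted by the feature value for the slope.
        errors = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        grad0 = sum(errors)
        grad1 = sum(e * x for e, x in zip(errors, xs))
        b0 += learning_rate * grad0
        b1 += learning_rate * grad1
    return b0, b1


# Illustrative data: the classes overlap, so the maximum is finite.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"log-likelihood at start = {log_likelihood(xs, ys, 0.0, 0.0):.3f}")
print(f"log-likelihood at fit   = {log_likelihood(xs, ys, b0, b1):.3f}")
```

The log-likelihood at the fitted coefficients is higher than at the starting point, and nudging either coefficient away from the fit lowers it again.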
Data Prep and Code
CODE FOR ALL PROCESSES EXPLAINED BELOW CAN BE FOUND HERE: REGRESSION CODE
To conduct a logistic regression, we will use the same data we already prepared and used for multinomial NB (found here). This data is well suited to logistic regression because its label, top_four, is binary.
Training and testing datasets were prepared with an 80/20 split, as with all other models. Below are snippets of the training and testing datasets and links to the data used:
Training Data. Full Dataset can be found here: Training Dataset
Testing Data. Full Dataset can be found here: Testing Dataset
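An 80/20 split like the one described above can be sketched as follows. This is a standard-library illustration rather than the project's actual code: the rows are synthetic, the board_id field and the seed are hypothetical, and only the top_four label name comes from the dataset.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = list(rows)
    # A fixed seed makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Synthetic stand-in rows; the real data has one column per trait count.
rows = [{"board_id": i, "top_four": i % 2} for i in range(100)]

train, test = train_test_split(rows)
print(f"{len(train)} training rows, {len(test)} testing rows")
```

Shuffling before splitting matters when the source file is ordered, since otherwise the test set could contain only one kind of row.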
Results and Conclusions
The image below shows the accuracy and confusion matrix from running a logistic regression on our trait data:
Note the accuracy score of approximately 71%.
The image below shows other metrics, such as precision and recall, for our model:
Note our precision and recall scores for each class.
With an accuracy score of approximately 71%, this was one of the best-performing models of the project. Precision scores for the two classes are almost identical: when the model says that a board is in the top four, or that it is not, it is correct about 7 times out of 10. Recall scores indicate, however, that the model misses about one-third of the boards that truly reach the top four. It is better at finding boards that will not reach the top four, as indicated by a higher recall score for class 0.
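For reference, accuracy, precision, and recall all come directly from the four cells of a binary confusion matrix. The sketch below shows the arithmetic; the counts are made up for illustration and are not the counts from our model.

```python
def classification_metrics(tn, fp, fn, tp):
    """Per-class precision and recall plus accuracy from a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        # Class 1 (e.g. top four): of the boards predicted 1, how many were 1?
        "precision_1": tp / (tp + fp),
        # Class 1: of the boards that truly were 1, how many did we find?
        "recall_1": tp / (tp + fn),
        # Class 0: the same two questions with the roles of the classes swapped.
        "precision_0": tn / (tn + fn),
        "recall_0": tn / (tn + fp),
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tn=50, fp=10, fn=20, tp=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A gap between recall_0 and recall_1, as in these made-up counts, is the pattern to look for when a model is better at identifying one class than the other.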
When we compare our results with the multinomial NB algorithm on the same dataset (found here in the Results & Conclusions section), we see that accuracy has improved by almost 9 percentage points (from 62% to 71%). The logistic regression reduces both false negatives and false positives. This is likely because multinomial NB assumes that each trait count is conditionally independent given the placement class, and we know that this is not the case: in TFT, certain traits interact with each other more than others. Logistic regression does not make this independence assumption. It learns the weights for all traits jointly, so it can handle this trait data better and predict more accurately. While the NB algorithm might be easier to interpret, in this case logistic regression performs much better and is the more appropriate model. It is particularly surprising that we can predict a top-four placement quite well based solely on which traits a player runs and how many units of each trait are on their board, considering how much other relevant information in the game has not been used in the model (such as gold remaining, unit rarity, and items used).
Sources
[1] Linear vs Logistic Regression - Difference Between Machine Learning Techniques - AWS. (n.d.). Amazon Web Services, Inc. https://aws.amazon.com/compare/the-difference-between-linear-regression-and-logistic-regression/
[2] Addagatla, A. (2022, January 6). Maximum likelihood estimation in logistic regression. Medium. https://arunaddagatla.medium.com/maximum-likelihood-estimation-in-logistic-regression-f86ff1627b67