Support Vector Machines
Overview
Support Vector Machines (SVMs for short) are a type of supervised algorithm that can be used for either classification or regression. They function through the use of a hyperplane, learned from analyzing input data. The goal of an SVM is to optimize this hyperplane such that it separtes different classes of data, and to maximize a space known as a margin in-between the classes. We call the nearest data points to the hyperplane support vectors, and both are used when defining what is known as a decision boundary. New data points are binned into one side or the other of this decision boundary based on their features by the algorithm.
Support Vector Machines are Linear Separators
The decision function for the hyperplane is given by:
f(x)=sign(w^⊤x+b) [1]
Where w is the weight vector, x is an input vector of data points, and b is the bias, or error term. This is clearly a linear equation (reminiscent of y = mx + b). The image below illustrates a hyperplane separating two classes of data:
This is an example of a the margin hyperplane for an SVM model [1].
Even if the data isn’t linearly separable, SVMs transform the data into a space where a flat, straight hyperplane will work to separate the data linearly. This is done using the kernel trick, described below.
The Kernel Trick and the Importance of the Dot Product
A kernel is a function that is used to arrange and transform data and aids the SVM algorithm in its function to bin data. They project the original data onto a space with more dimensions, such that a hyperplane can separate them where it wouldn’t have been able to before [2].
A kernel function is defined as:
k(x, z) = Φ(x) · Φ(z) [3]
where Φ is an embedding, or a kind of representation learned from the data, and x and z are two input data points. This is what allows the transformation of the lower-dimensional data into a higher dimensional feature space. Note that the above equation is defined as a dot product of two vectors. This works as intended because the kernel transformation is done using inner products of training vectors, and not using the vectors themselves. In other words, the dot product of x and z is done first, and then the kernel function is applied to said dot product. This means we never have to compute Φ(x) or Φ(z) directly, which is important because the embedding Φ can lie in extremely high dimensions. When we apply the kernel function to an inner product, which is a scalar value, this makes something that would be wildly expensive (in tems of computational resources) feasible, but still able to deliver the same mathematical results. The hyperplane in this new kernel space will still become a linear separator, even if the data is not linearly separable in the original feature space of the data. The below image shows a visual representation of how a 2D kernel with non-linearly separable data can be transformed into a higher-dimensional (3D) space so that a hyperplane can separate the data:
This image shows how the kernel trick works for SVMs [2].
Example of Mapping 2D Points with a Polynomial Kernel
The example below showcases how mapping 2D points into a six-dimensional space works in the case of a polynomial kernel:
Data Prep and Code (Data Train-Test-Split Images)
CODE FOR ALL PROCESSES EXPLAINED BELOW CAN BE FOUND HERE: SVM CODE
To create a SVM model, we will use the same data we already prepared and used for multinomial NB (found here) This data is suited for the classification task of trying to predict whether players placed in the top four in a given match or not, based on the traits they included in their team composition and how many units of each trait they ran.
Training and testing datasets were prepared with an 80/20 split, as with all other models. Below are snippets of the training and testing datasets and links to the data used:
Training Data. Full Dataset can be found here: Training Dataset
Testing Data. Full Dataset can be found here: Testing Dataset
Results and Conclusions
Data Scaling
Data was scaled using a standard scaler before applying the SVM model. Images of the scaled data are below:
Training Data Scaled using sklearn's StandardScaler. Full Dataset can be found here: Scaled Training Dataset
Testing Data Scaled using sklearn's StandardScaler. Full Dataset can be found here: Scaled Testing Dataset
Kernels Used and Model Results: Accuracies and Confusion Matricies
In total, 9 SVM models were created. Three models were created with a polynomial kernel, three with an RBF (Radial Basis Function) kernel, and three with a linear kernel. For each kernel, C, the regularization parameter, was altered between 0.1, 1, and 10. For the polymomial kernel, degree was kept at 2 (so all polymomial models were quadratic), and the coefficient parameter was kept at 1. For the polymomial and rbf kernels, the ‘gamma’ parameter, another kernel coefficient, was kept as ‘scale’ (details for each model can be found in the link to the coding notebook above).
Below are images showing the accuracy, classification report, and confusion matrix for each of the 9 models:
Polynomial Kernel, C = 0.1
The Accuracy and Classification Report of an SVM with a Polynomial Kernel and C = 0.1.
The Confusion Matrix of an SVM with a Polynomial Kernel and C = 0.1.
Polynomial Kernel, C = 1
The Accuracy and Classification Report of an SVM with a Polynomial Kernel and C = 1.
The Confusion Matrix of an SVM with a Polynomial Kernel and C = 1.
Polynomial Kernel, C = 10
The Accuracy and Classification Report of an SVM with a Polynomial Kernel and C = 10.
The Confusion Matrix of an SVM with a Polynomial Kernel and C = 10.
RBF Kernel, C = 0.1
The Accuracy and Classification Report of an SVM with an RBF Kernel and C = 0.1.
The Confusion Matrix of an SVM with an RBF Kernel and C = 0.1.
RBF Kernel, C = 1
The Accuracy and Classification Report of an SVM with an RBF Kernel and C = 1.
The Confusion Matrix of an SVM with an RBF Kernel and C = 1.
RBF Kernel, C = 10
The Accuracy and Classification Report of an SVM with an RBF Kernel and C = 10.
The Confusion Matrix of an SVM with an RBF Kernel and C = 10.
Linear Kernel, C = 0.1
The Accuracy and Classification Report of an SVM with a Linear Kernel and C = 0.1.
The Confusion Matrix of an SVM with a Linear Kernel and C = 0.1.
Linear Kernel, C = 1
The Accuracy and Classification Report of an SVM with a Linear Kernel and C = 0.1.
The Confusion Matrix of an SVM with a Linear Kernel and C = 1.
Linear Kernel, C = 10
The Accuracy and Classification Report of an SVM with a Linear Kernel and C = 0.1.
The Confusion Matrix of an SVM with a Linear Kernel and C = 10.
Discussion of Results
Overall, all models were fairly similar in performance, with only a variability of around 3.5% in accuracy from the lowest-performing SVM model to the highest.
The highest-performing model in terms of accuracy was the polynomial model with C = 1, with an accuracy of 0.723, and a recall score of class 1 0f 0.69, and the second-best performing model was the polynomial model with C = 10, with an accuracy of 0.722 and a class 1 recall score of 0.68. This indicates, that, largely, incorporating interactions of the features with a quadratic model is the most helpful without causing too much overfitting. Increasing C adds some variance, with accuracy being almost identical but with a slightly lower recall score. The polynomial model with C = 0.1 performed the worst out of the three polynomial models, with an accuracy of 0.707, so having C be too small is clearly an issue, and overall a C of 1 seems to be optimal.
The RBF models performed best-to-worst with C = 1, C = 10, and C = 0.1, with accuracy scores of 0.715, 0.712, and 0.688, respectively. We see that the RBF kernels are able to model the complex structure of the input data, but there are no gains in accuracy over using a quadratic model - the only potential improvement is that class 1 recall is 0.7 for both the RBF kernels with C = 1 and C = 10. This suggests that quadratic features already capture most of the non-linearity of the data. Again, we see that a C too large drops in accuracy very slightly, indicating overfitting, and a C too small falls in accuracy steeply, so again C = 1 seems to be optimal.
The SVMs using the linear kernel peformed the most consistently across all values of C, each being just under 0.7 in accuracy, and all having the same class 1 recall of 0.65. However, these models also performed poorer than the best models using other kernels. This implies that a linear surface cannot accurately model the interaction structure in trait counts as well as the other kernels can.
The best-performing SVM models beat out the best-performing model thus far in this project, that being the logistic regression model trained on the same data. Pairwise combinations of traits in a quadratic model are informative in their predictive power to figure out if a player will finish in the top 4 or not. This makes intuitive sense, since many traits in the game automatically go together and are usually fielded on the same boards.
Visualization
In an effort to try and visualize the SVM decision boundary from the three best-performing models (mentioned above), principal component analysis was peformed on the dataset (see the PCA page for more information on how PCA works) to reduce the data to 2 dimensions. However, considering the original dataset is very wide, with 29 dimensions, it is very difficult to capture all of the necessary information into just 2 dimensions in order to visualize the decision boundary. Each visualization for the top three performing models looked the exact same and performed the same, as seen in the plot below:
After reducing the 29-dimension dataset into just 2, a line was drawn in an effort to try and visualize the decision boundary between the two classes. However, this does a poor job at showing how the SVM works in such a high-dimensional dataset. Note the very low PCA-space accuracy at the top
Visibly, the line does a horrible job at separating these classes. It simply is not possible to visualize the decision boundary in a dataset with 29 dimensions, even though the model itself performs fairly well. Another attempt could be made to show a plane if the data were reduced to 3 dimensions.
Sources
-[1]: Lecture 9: SVM. (n.d.). https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote09.html
-[2]: Jain, A. (2024, November 16). SVM kernels and its type - Abhishek Jain - Medium. Medium. https://medium.com/@abhishekjainindore24/svm-kernels-and-its-type-dfc3d5f2dcd8
-[3]: Kernel Methods. (n.d.-b). https://cseweb.ucsd.edu/~dasgupta/250B/kernel-handout.pdf