This table compares articles and methods that implement parallel SVM algorithms other than those covered in the previous tables; we highlight the articles in the survey with interesting findings. Each entry lists the algorithm, problem type, parallelism type, main idea, pros, cons, and reference.

PIRWLS and PSIRWLS
Problem type: Non-linear, binary, and multi-class classification
Parallelism type: Shared-memory parallelism
Main idea: Implements two parallel iterative re-weighted least squares (IRWLS) SVMs, one solving the full SVM and one a semi-parametric SVM model. Parallelizes a Cholesky decomposition to solve the linear system arising in each iteration (sketched after this entry).
Pros:
  • Reduces computationally expensive kernel evaluations
  • Avoids computing a computationally expensive matrix inversion
  • Outperforms LIBIRWLS proposed by Morales and Vazquez [2017]
  • Higher speedup compared to the standard SVM libraries
  • Similar accuracy to the standard SVM libraries, but in less time
Cons:
  • A large number of SVs in some cases
  • Slower than the standard libraries when using one core
  • Benefits not fully clear for a large number of cores
  • May suffer from limited memory
Reference: Morales and Vazquez [2016]
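
The following is a minimal, single-machine sketch of the IRWLS step that PIRWLS parallelizes: each iteration re-weights the samples from the current hinge-loss residuals and solves a small positive-definite linear system through a Cholesky factorization. It assumes an RBF kernel and a bias-free formulation; the function and parameter names (`irwls_svm`, `eps`, etc.) are illustrative and not taken from the library.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    # Pairwise RBF kernel between the rows of X and Z.
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def irwls_svm(X, y, C=1.0, iters=10, eps=1e-3):
    """Kernel SVM via iterative re-weighted least squares (simplified sketch)."""
    n = len(y)
    K = rbf_kernel(X, X)
    a = np.full(n, C)                      # initial sample weights
    beta = np.zeros(n)
    for _ in range(iters):
        S = np.where(a > 0)[0]             # samples with non-zero weight
        if len(S) == 0:
            break
        A = K[np.ix_(S, S)] + np.diag(1.0 / a[S])
        L = np.linalg.cholesky(A)          # the step PIRWLS parallelizes
        beta = np.zeros(n)
        beta[S] = np.linalg.solve(L.T, np.linalg.solve(L, y[S]))
        u = 1.0 - y * (K @ beta)           # hinge-loss arguments
        a = np.where(u > eps, C / np.maximum(u, eps), 0.0)  # re-weighting
    return beta

# Smoke test on a separable toy problem.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
beta = irwls_svm(X, y)
print("train accuracy:", np.mean(np.sign(rbf_kernel(X, X) @ beta) == y))
```
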
LIBIRWLS
Problem type: Non-linear and binary classification
Parallelism type: Shared-memory parallelism
Main idea: Implements a parallel library that trains SVMs with iterative re-weighted least squares instead of quadratic programming, and approximates the kernel matrix using a sparse greedy approximation (sketched after this entry).
Pros:
  • Efficiently controls the number of SVs
  • Reduces the number of training samples using random sampling
Cons:
  • Does not perform better than PSIRWLS proposed by Morales and Vazquez [2016] in terms of accuracy
  • May suffer from limited memory
  • Benefits not fully clear in terms of scalability
Reference: Morales and Vazquez [2017]
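
One common way to realize a sparse greedy kernel approximation is pivoted incomplete Cholesky: basis points are picked greedily by the largest remaining residual, and only the selected kernel columns are ever formed. The sketch below shows that idea in isolation, assuming an RBF kernel; it is not necessarily LIBIRWLS's exact variant, and the names (`greedy_lowrank_kernel`, `rank`) are illustrative.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def greedy_lowrank_kernel(X, rank, gamma=0.5):
    """Pivoted incomplete Cholesky: K ~= G @ G.T with greedily chosen pivots."""
    n = X.shape[0]
    d = np.ones(n)                      # residual diagonal (RBF diagonal is 1)
    G = np.zeros((n, rank))
    basis = []
    for j in range(rank):
        i = int(np.argmax(d))           # greedy pick: largest residual
        if d[i] < 1e-12:                # kernel matrix is effectively exhausted
            G = G[:, :j]
            break
        basis.append(i)
        k_i = rbf_kernel(X, X[i:i + 1], gamma).ravel()   # only one kernel column
        G[:, j] = (k_i - G[:, :j] @ G[i, :j]) / np.sqrt(d[i])
        d = np.maximum(d - G[:, j] ** 2, 0.0)
    return G, basis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
G, basis = greedy_lowrank_kernel(X, rank=30)
K = rbf_kernel(X, X)                    # built here only to measure the error
print("relative approximation error:",
      np.linalg.norm(K - G @ G.T) / np.linalg.norm(K))
```
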
P-packSVM
Problem type: Non-linear and multi-class classification
Parallelism type: Distributed HPC architectures
Main idea: Approximates the gradient stochastically with stochastic gradient descent, so that a single training sample is used per step instead of the whole training set, and packs many consecutive gradient descent steps into one pack (sketched after this entry).
Pros:
  • Supports any arbitrary kernel function
  • Good scalability in terms of the number of training samples and cores
  • Outperforms similar state-of-the-art algorithms
  • Reduces communication overhead
  • Reduces the number of iterations
  • Outperforms IPM- and SMO-based algorithms
  • Reduces the problem size
  • Trains large-scale problems in much shorter time, with comparable accuracy to similar approaches
Cons:
  • Lacks a convergence criterion
  • Sub-linear speedup in some cases
Reference: Zhu et al. [2009]
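
The sketch below illustrates the packing idea on a single machine, assuming a Pegasos-style stochastic subgradient for the kernel hinge loss: every step touches one random sample, and the coefficient updates are accumulated locally and only merged into the shared model once per pack, which is where the distributed algorithm would place its single communication round. It is a simplified stand-in, not the authors' implementation; `pack_size` and the function names are illustrative.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def packed_kernel_sgd(X, y, lam=0.01, epochs=5, pack_size=16, gamma=0.5, seed=0):
    """Kernelized Pegasos-style SGD with updates grouped into packs."""
    rng = np.random.default_rng(seed)
    n = len(y)
    K = rbf_kernel(X, X, gamma)
    alpha = np.zeros(n)                 # shared model: per-sample update counts
    pending = np.zeros(n)               # updates accumulated in the current pack
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            # Decision value under the current local view of the model.
            f_i = (K[i] @ ((alpha + pending) * y)) / (lam * t)
            if y[i] * f_i < 1.0:        # hinge-loss subgradient is non-zero
                pending[i] += 1.0
            if t % pack_size == 0:      # end of pack: merge the local updates
                alpha += pending        # (one broadcast per pack when distributed)
                pending[:] = 0.0
    alpha += pending
    return alpha / (lam * t)            # final kernel-expansion coefficients

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.5, 1, (40, 2)), rng.normal(1.5, 1, (40, 2))])
y = np.hstack([-np.ones(40), np.ones(40)])
coef = packed_kernel_sgd(X, y)
pred = np.sign(rbf_kernel(X, X) @ (coef * y))
print("train accuracy:", np.mean(pred == y))
```
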
PmSVM
Problem type: Non-linear and multi-class classification
Parallelism type: Hybrid shared and distributed HPC architectures
Main idea: Builds balanced bagging classifiers using an under-sampling strategy, so that each classifier is trained on only a fraction of the training data (sketched after this entry).
Pros:
  • Fast convergence
  • Scales well with the number of training samples
  • Avoids training on the full training set
  • Handles class-imbalance problems using an under-sampling strategy
  • Higher speedup (286 times) compared to the sequential version and (1000 times) compared to Libsvm and LIBLINEAR
Cons:
  • Suffers from limited memory
  • Requires loading the whole training data into memory
  • Does not support heterogeneous computing environments
  • Benefits not fully clear for a large number of cores
Reference: Nghi et al. [2015]
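
The following sketch shows balanced bagging with under-sampling in its simplest form: each bag keeps every minority-class sample plus an equal-sized random subset of the majority class, an independent SVM is trained per bag (these fits are what would run in parallel), and predictions average the bags' decision values. It assumes scikit-learn's LinearSVC as the base learner purely for brevity; the structure, not the base learner, is the point, and names such as `balanced_bagging_svm` and `n_bags` are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def balanced_bagging_svm(X, y, n_bags=10, seed=0):
    """Balanced bagging by under-sampling the majority class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    min_idx = np.where(y == minority)[0]
    maj_idx = np.where(y == majority)[0]
    models = []
    for _ in range(n_bags):
        sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, sub])       # balanced training subset
        models.append(LinearSVC().fit(X[idx], y[idx]))
    return models

def predict(models, X):
    # Average the signed decision values of all bags, then threshold.
    scores = np.mean([m.decision_function(X) for m in models], axis=0)
    return np.where(scores >= 0, 1, -1)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (500, 5)), rng.normal(1, 1, (50, 5))])
y = np.hstack([-np.ones(500), np.ones(50)])        # imbalanced 10:1 problem
models = balanced_bagging_svm(X, y)
print("minority-class recall:", np.mean(predict(models, X[y == 1]) == 1))
```
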
Gradient system
Problem type: Non-linear and multi-class classification
Parallelism type: Distributed HPC architectures
Main idea: Uses a neural-network (gradient-system) formulation for training SVMs and resolves the high-frequency oscillations that appear near the discontinuity surface (the chattering problem) with a boundary-layer technique (illustrated after this entry).
Pros:
  • Good parallelizability
  • Calculates only the lower triangular part of the matrix addressed by the SVM
  • Outperforms SVMLight and Libsvm
  • Solves the chattering problem
Cons:
  • Benefits not fully clear for the scalability of large-scale problems
  • Benefits not fully clear for scalability in terms of the number of processors
  • Complex configuration
Reference: Ferreira et al. [2006]
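
The boundary-layer technique itself is easy to show in isolation. In the toy simulation below, a scalar state driven by a discontinuous sign() term chatters around the switching surface, while replacing sign() with a saturation that is linear inside a thin boundary layer removes the oscillation. This is a generic illustration of the technique named in the entry, not the authors' gradient system for SVMs; `k`, `phi`, and the step size are arbitrary choices.

```python
import numpy as np

def simulate(k=5.0, dt=0.01, steps=400, phi=None):
    """Euler simulation of ds/dt = -k * g(s) for a scalar state s.

    With g = sign, the discrete dynamics chatter around s = 0; with the
    boundary-layer version g(s) = clip(s / phi, -1, 1), the switching term
    is continuous inside |s| < phi and the oscillation disappears.
    """
    s = 1.0
    trace = []
    for _ in range(steps):
        g = np.sign(s) if phi is None else np.clip(s / phi, -1.0, 1.0)
        s = s - dt * k * g
        trace.append(s)
    return np.array(trace)

chatter = simulate()                 # discontinuous sign(): oscillates at 0
smooth = simulate(phi=0.1)           # boundary layer of width 0.1
print("residual oscillation (sign):          ", np.std(chatter[-100:]))
print("residual oscillation (boundary layer):", np.std(smooth[-100:]))
```
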
ASGD
Problem type: Non-linear and multi-class classification
Parallelism type: Distributed big-data architectures using Hadoop MapReduce
Main idea: Performs feature extraction in parallel using a large number of mappers, and runs averaged stochastic gradient descent in parallel for multi-class classification (the averaging step is sketched after this entry).
Pros:
  • Reduces the dimension of the problem
  • Fast extraction of fairly sophisticated features
  • Fast training on redundant data using an averaging scheme
  • Good parallelizability
  • Reduces the traffic for loading data
  • Good scalability in terms of the number of training samples
  • Fast convergence
Cons:
  • Benefits not fully clear for non-redundant data
  • May suffer from communication overheads
  • Concerns regarding the asymptotic convergence property
Reference: Lin et al. [2011]
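
Below is a single-machine sketch of averaged SGD on a linear hinge loss: ordinary single-sample SGD runs on w, but the model that is returned (and would be shared across workers) is the running average of the iterates, which is what makes highly redundant data cheap to train on. The parallel feature-extraction stage is out of scope here, and the names (`averaged_sgd`, `lr`) are illustrative.

```python
import numpy as np

def averaged_sgd(X, y, lam=1e-3, lr=0.05, epochs=5, seed=0):
    """Averaged SGD for a linear SVM-style hinge loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_bar = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            grad = lam * w
            if y[i] * (X[i] @ w) < 1.0:      # hinge-loss subgradient
                grad = grad - y[i] * X[i]
            w = w - lr * grad
            w_bar = w_bar + (w - w_bar) / t  # running average of the iterates
    return w_bar

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (200, 10)), rng.normal(1, 1, (200, 10))])
y = np.hstack([-np.ones(200), np.ones(200)])
w_bar = averaged_sgd(X, y)
print("train accuracy:", np.mean(np.sign(X @ w_bar) == y))
```
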
Map-Reduce SVM
Problem type: Unclear
Parallelism type: Distributed big-data architectures
Main idea: Reformulates the optimization problem so that it parallelizes better across map-reduce workers (a sketch of the general pattern follows this entry).
Pros:
  • Improves on ASGD proposed by Lin et al. [2011] in terms of communication overhead
  • Better parallelizability compared to similar approaches
  • Reduces the communication between processors using an averaging scheme
  • Linear speedup when increasing the number of processors
  • Outperforms the standard cascade SVMs in terms of speedup
  • Runs other machine learning algorithms in parallel on the proposed platform
Cons:
  • Concerns regarding scalability in terms of the number of cores
  • Requires the whole training set in each iteration
  • Concerns regarding scalability in terms of the number of training samples
  • Unclear type of classification problems
Reference: Chu et al. [2007]
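
A map-reduce-friendly reformulation typically rewrites each iteration as a sum over data partitions: every mapper emits only a d-dimensional partial subgradient for its split, and the reduce step aggregates them before one global update, so per-iteration communication stays small while the whole training set is still visited each iteration (matching the con above). The simulation below shows that pattern on one machine with a linear hinge loss; it is a sketch of the general scheme, not the authors' exact formulation, and `n_workers`, `mapreduce_svm`, etc. are illustrative.

```python
import numpy as np

def partial_gradient(w, X_part, y_part):
    # Map step: hinge-loss subgradient contribution of one data partition.
    mask = y_part * (X_part @ w) < 1.0
    return -(y_part[mask, None] * X_part[mask]).sum(axis=0)

def mapreduce_svm(X, y, n_workers=4, lam=1e-3, lr=0.01, iters=200):
    """Batch-gradient linear SVM written in map-reduce (summation) form."""
    n, d = X.shape
    splits = np.array_split(np.arange(n), n_workers)   # static data partitioning
    w = np.zeros(d)
    for _ in range(iters):
        parts = [partial_gradient(w, X[s], y[s]) for s in splits]   # map
        grad = lam * w + np.sum(parts, axis=0) / n                  # reduce
        w = w - lr * grad                                           # one global update
    return w

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 1, (300, 8)), rng.normal(1, 1, (300, 8))])
y = np.hstack([-np.ones(300), np.ones(300)])
w = mapreduce_svm(X, y)
print("train accuracy:", np.mean(np.sign(X @ w) == y))
```
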
BMBPGD
Problem type: Non-linear and binary classification
Parallelism type: Distributed big-data architectures using Apache Spark
Main idea: Implements a budgeted mini-batch gradient descent in parallel on Apache Spark (the budgeted mini-batch update is sketched after this entry).
Pros:
  • Uses only a fraction of the training examples in each iteration
  • Good scalability in terms of the number of training samples
  • Keeps the number of SVs under control
  • Avoids the curse of dimensionality
  • Outperforms SGD-based algorithms in terms of accuracy and time
Cons:
  • Concerns regarding accuracy in some cases
  • A trade-off between accuracy and time
Reference: Tao et al. [2014]
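
The single-machine sketch below captures the two ingredients named in the entry: each update uses only a small random mini-batch, and the number of support vectors is kept under a fixed budget by dropping the SV with the smallest coefficient whenever the budget is exceeded (a simple budget-maintenance rule; merging or projection are common alternatives, and the paper's exact rule may differ). It assumes a Pegasos-style hinge-loss update with an RBF kernel; `budget`, `batch`, and the function name are illustrative, and the Spark-level parallelism is not shown.

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    # RBF kernel between two single points.
    return np.exp(-gamma * ((a - b) ** 2).sum(-1))

def budgeted_minibatch_sgd(X, y, budget=20, batch=8, lam=0.01, epochs=5,
                           gamma=0.5, seed=0):
    """Budgeted mini-batch kernel SGD with a Pegasos-style hinge-loss update."""
    rng = np.random.default_rng(seed)
    sv_x, sv_a = [], []                         # support vectors and coefficients
    t = 0
    for _ in range(epochs):
        for _ in range(len(y) // batch):
            t += 1
            idx = rng.choice(len(y), size=batch, replace=False)
            eta = 1.0 / (lam * t)
            sv_a = [a * (1.0 - eta * lam) for a in sv_a]   # shrink old coefficients
            for i in idx:
                f = sum(a * rbf(x, X[i], gamma) for x, a in zip(sv_x, sv_a))
                if y[i] * f < 1.0:              # margin violation in this mini-batch
                    sv_x.append(X[i])
                    sv_a.append(eta * y[i] / batch)
            while len(sv_x) > budget:           # enforce the SV budget by removal
                j = int(np.argmin(np.abs(sv_a)))
                sv_x.pop(j)
                sv_a.pop(j)
    return np.array(sv_x), np.array(sv_a)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1.5, 1, (100, 2)), rng.normal(1.5, 1, (100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])
sv_x, sv_a = budgeted_minibatch_sgd(X, y)
f = np.array([sum(a * rbf(sx, x) for sx, a in zip(sv_x, sv_a)) for x in X])
print("support vectors:", len(sv_x), "train accuracy:", np.mean(np.sign(f) == y))
```
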