A comparison of articles and methods that implement distributed and parallel algorithms for training SVMs. Here, we highlight the articles in the survey with the most interesting findings. Each entry records the algorithm, the problem type it addresses, the parallelism type, the main idea, its pros and cons, and the reference.

Initial clustering
  Problem type: Unclear
  Parallelism type: Shared memory parallelism
  Main idea: Minimizes the number of training examples by representing the original large-scale problem with k-means cluster representatives (see the sketch below).
  Pros:
  • Reduces the number of training examples using clustering
  • Robust for unbalanced data
  • Easy to implement
  • Speedup (3 times) compared to the sequential version
  Cons:
  • Possible deterioration of accuracy
  • Limited explainability of results
  • Benefits not fully clear for non-linear and multi-class classification
  • Limited memory
  Reference: Shrivastava et al. [2011]

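A minimal Python sketch of the reduce-then-train idea: k-means is run per class and the resulting centroids replace the original samples as a much smaller training set. The class-wise clustering, the scikit-learn estimators, and all parameter values are assumptions for illustration, not the exact procedure of Shrivastava et al. [2011].

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

def reduce_by_clustering(X, y, clusters_per_class=100):
    """Replace each class by the centroids of a k-means clustering of that class."""
    Xr, yr = [], []
    for label in np.unique(y):
        km = KMeans(n_clusters=clusters_per_class, n_init=3, random_state=0)
        km.fit(X[y == label])
        Xr.append(km.cluster_centers_)
        yr.append(np.full(clusters_per_class, label))
    return np.vstack(Xr), np.concatenate(yr)

X_small, y_small = reduce_by_clustering(X, y)
clf = SVC(kernel="rbf").fit(X_small, y_small)   # train only on the reduced set
print("reduced train size:", len(y_small), "accuracy:", clf.score(X, y))
```
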
kSVM
  Problem type: Non-linear and multi-class classification
  Parallelism type: Shared memory parallelism
  Main idea: Distributes the data into clusters, each of which trains an SVM locally in parallel (see the sketch below).
  Pros:
  • Scales well with the size of the training data
  • Maintains classification accuracy
  • Good scalability in terms of the number of training samples
  Cons:
  • A trade-off exists between computation time and accuracy
  • Finds local SVs instead of the global SVs
  Reference: Do et al. [2015]

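A minimal sketch of the cluster-then-train-locally idea, assuming k-means for the partitioning and joblib for the parallel workers; the clustering, kernel, and prediction routing are illustrative choices, not the actual kSVM of Do et al. [2015].

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.dummy import DummyClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

k = 8
km = KMeans(n_clusters=k, n_init=3, random_state=0).fit(X)

def train_local(c):
    """Train an SVM only on the samples assigned to cluster c (local SVs only)."""
    idx = np.where(km.labels_ == c)[0]
    if len(np.unique(y[idx])) == 1:  # degenerate single-class cluster
        return DummyClassifier(strategy="most_frequent").fit(X[idx], y[idx])
    return SVC(kernel="rbf").fit(X[idx], y[idx])

# Each cluster trains its local SVM in parallel.
local_svms = Parallel(n_jobs=-1)(delayed(train_local)(c) for c in range(k))

def predict(X_new):
    """Route every test point to the SVM of its nearest cluster centre."""
    assignment = km.predict(X_new)
    out = np.empty(len(X_new), dtype=y.dtype)
    for c in range(k):
        mask = assignment == c
        if mask.any():
            out[mask] = local_svms[c].predict(X_new[mask])
    return out

print("training accuracy:", np.mean(predict(X) == y))
```
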
Ensemble SVMs
  Problem type: Non-linear and multi-class classification
  Parallelism type: Shared memory parallelism
  Main idea: Trains an ensemble of SVMs on subsets of the data and uses simple aggregation schemes to obtain the global result (see the sketch below).
  Pros:
  • Significant speedup compared to LIBSVM with comparable accuracy
  • Reduces training complexity
  • Avoids duplicate storage
  • High classification accuracy
  • Good parallelizability
  • An intuitive programming framework
  Cons:
  • Needs more complex aggregation schemes for further performance improvement
  • Deterioration of accuracy for low-dimensional problems
  • Limited memory
  Reference: Claesen et al. [2014]

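A minimal sketch of the ensemble idea using scikit-learn's bagging meta-estimator with SVM base learners and simple vote-based aggregation. The subset size, number of members, and the bagging estimator itself are illustrative stand-ins, not the implementation of Claesen et al. [2014].

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=20000, n_features=30, random_state=0)

ensemble = BaggingClassifier(
    estimator=SVC(kernel="rbf"),   # each member is a non-linear SVM
    n_estimators=16,               # number of sub-models trained on subsets
    max_samples=0.1,               # each member sees only 10% of the data
    n_jobs=-1,                     # shared-memory parallel training
    random_state=0,
)
ensemble.fit(X, y)
print("accuracy:", ensemble.score(X, y))
```
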
GaBP SVM
  Problem type: Non-linear and binary classification
  Parallelism type: Distributed HPC architectures
  Main idea: Reformulates the standard SVM problem from the algebraic domain into a probabilistic one using Gaussian distributions.
  Pros:
  • Reduces overhead by exchanging only a small number of messages
  • Good scalability in terms of the number of cores (up to 1,024 cores)
  • Accuracy comparable with SVMLight
  • The largest parallel implementation of Gaussian belief propagation to date
  • Linear speedup using 256 CPUs
  • Robust to both sparse and dense matrix operations
  Cons:
  • Lacks support for asynchronous communication in heterogeneous computing environments
  • Slow convergence for large numbers of input data and large networks
  • Communication overhead
  • Computes the full kernel matrix
  Reference: Bickson et al. [2008]

RBF kernel for SVM
  Problem type: Non-linear and binary classification
  Parallelism type: Distributed HPC architectures
  Main idea: Implements an approximation of the RBF kernel function in which the kernel matrix is reduced to a sparse matrix with smaller memory requirements (see the sketch below).
  Pros:
  • Minimizes communication cost using a kernel approximation
  • Efficient data partitioning and communication pattern in a distributed system
  • Good scalability in terms of the number of nodes
  • High speedup (45 times) using 65 nodes
  • Linear speedup with the approximated kernel for fewer than 32 nodes
  • High classification accuracy
  Cons:
  • Benefits not fully clear for other kernels
  • Limited speedup of dimension-wise partitioning
  • Communication overhead
  Reference: Qiu and Lane [2005]

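One way to turn an RBF kernel matrix into a sparse matrix is to drop near-zero similarities, shown in the sketch below. The thresholding rule, the cut-off value, and the CSR storage are assumptions for illustration only; they are not the specific approximation scheme of Qiu and Lane [2005].

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50))

K = rbf_kernel(X, gamma=0.1)      # dense n x n RBF kernel matrix
K[K < 1e-3] = 0.0                 # drop near-zero similarities (assumed cut-off)
K_sparse = csr_matrix(K)          # much smaller memory footprint if K is mostly zeros

density = K_sparse.nnz / (K.shape[0] * K.shape[1])
print(f"kept {density:.1%} of the kernel entries")
```
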
SVM Mixture
  Problem type: Non-linear and binary classification
  Parallelism type: Distributed HPC architectures
  Main idea: Trains a mixture of SVMs whose outputs are combined by a neural network (see the sketch below).
  Pros:
  • Leads to high prediction accuracy
  • Scales linearly with the number of samples
  Cons:
  • Unclear impact of data with a large number of features on performance
  • Computation time in the neural network is longer than the time spent training the sub-problems
  • Complex configuration
  Reference: Collobert et al. [2002]

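A simplified stand-in for the mixture idea: several SVM experts are trained on disjoint chunks of the data and a small neural network then learns to combine their real-valued outputs. The alternating sample-reassignment and retraining loop of Collobert et al. [2002] is not reproduced here; the chunking, network size, and stacking-style combination are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=12000, n_features=20, random_state=0)
n_experts = 6
chunks = np.array_split(np.random.default_rng(0).permutation(len(y)), n_experts)

# Each expert SVM is trained on its own chunk of the data.
experts = [SVC(kernel="rbf").fit(X[idx], y[idx]) for idx in chunks]

# The combiner network sees every expert's signed decision value for each sample.
Z = np.column_stack([e.decision_function(X) for e in experts])
combiner = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0).fit(Z, y)

Z_test = np.column_stack([e.decision_function(X[:1000]) for e in experts])
print("accuracy:", combiner.score(Z_test, y[:1000]))
```
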
Modular network SVMs
  Problem type: Non-linear and multi-class classification
  Parallelism type: Distributed HPC architectures
  Main idea: Improves the work of Collobert et al. [2002] by replacing the neural network with neural quantizer modules, which require no specialist knowledge of neural networks.
  Pros:
  • Training time increases linearly with the number of samples
  • Efficient network structure
  • A local SVM produces the actual output
  • Faster than training a single SVM
  • No need for extra parameter tuning
  • No need for specialist knowledge of neural networks
  • Simpler configuration and shorter training time than the approach of Collobert et al. [2002]
  Cons:
  • The aggregation model is still not simple
  • Concerns regarding accuracy
  • A trade-off exists between learning speedup and classification accuracy
  Reference: Huang et al. [2005]

D-ADMM
  Problem type: Linear and binary classification
  Parallelism type: Distributed HPC architectures
  Main idea: Uses a coloring scheme that requires fewer communications between nodes than state-of-the-art algorithms.
  Pros:
  • Reduces communication between nodes
  • Asynchronous computations
  • Good parallelizability
  Cons:
  • No clear theoretical explanation of why the algorithm is more efficient than similar algorithms
  • Supports only linear classification
  • Slow convergence
  Reference: Mota et al. [2013]

Linear ADMM
  Problem type: Linear and multi-class classification
  Parallelism type: Distributed HPC architectures
  Main idea: Implements ADMM-based SVMs in which an inexact (approximate) optimization problem is solved with warm starts of the sub-problems (see the sketch below).
  Pros:
  • Outperforms the algorithms of Yu et al. [2012] and Zinkevich et al. [2010] in training time and convergence rate
  • Efficient parameter tuning using cross-validation
  • Reduces the total number of ADMM iterations using over-relaxation while keeping the same accuracy
  • Reduces the total number of iterations using random shuffling
  • Shortens each ADMM iteration by warm-starting the sub-problem solvers
  • Accelerates classification and increases classification accuracy using distributed normalization of the data
  Cons:
  • Supports only linear classification
  • Requires the data to fit in memory
  Reference: Zhang et al. [2012]

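A minimal sketch of consensus ADMM for a distributed linear SVM: each "node" holds a data block, inexactly solves its hinge-loss sub-problem with a few warm-started subgradient steps, and a global variable z is assembled from the local solutions. The penalty parameter, step sizes, and the inner solver are assumptions for illustration, not the exact scheme of Zhang et al. [2012].

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=8000, n_features=30, random_state=0)
y = 2 * y - 1                                    # labels in {-1, +1}
blocks = np.array_split(np.arange(len(y)), 8)    # one data block per "node"

d, n_nodes = X.shape[1], len(blocks)
rho, lam, C = 1.0, 1.0, 0.1                      # penalty, L2 weight, hinge weight (assumed)
W = np.zeros((n_nodes, d))                       # local models w_i
U = np.zeros((n_nodes, d))                       # scaled dual variables u_i
z = np.zeros(d)                                  # global consensus model

def local_update(i, w, steps=10, lr=1e-3):
    """Inexactly minimise C*hinge_i(w) + (rho/2)||w - z + u_i||^2, warm-started at w."""
    Xi, yi, u = X[blocks[i]], y[blocks[i]], U[i]
    for _ in range(steps):
        viol = yi * (Xi @ w) < 1                 # samples violating the margin
        grad = -C * (Xi[viol].T @ yi[viol]) + rho * (w - z + u)
        w = w - lr * grad
    return w

for _ in range(30):                              # outer ADMM iterations
    for i in range(n_nodes):
        W[i] = local_update(i, W[i])             # warm start from the previous w_i
    z = rho * (W + U).sum(axis=0) / (n_nodes * rho + lam)  # z-update with (lam/2)||z||^2
    U += W - z                                   # scaled dual update

print("training accuracy:", np.mean(np.sign(X @ z) == y))
```
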
euroCAT
  Problem type: Linear and binary classification
  Parallelism type: Distributed HPC architectures
  Main idea: Implements multi-centric distributed learning in which data privacy is the top priority, with applications in personalized medicine.
  Pros:
  • Preserves data privacy
  • Higher speedup with good accuracy
  • Model quality similar to centralized models
  • Asynchronous communication
  • Efficient parameter tuning
  Cons:
  • Supports only linear classification
  • Slow convergence
  Reference: Deist et al. [2017]

GADMM
  Problem type: Linear and binary classification
  Parallelism type: Distributed HPC architectures
  Main idea: Implements a group-based ADMM in which the slave nodes are divided into groups, and a weighted-average step gathers the group results to update the global variable (see the sketch below).
  Pros:
  • Fast convergence
  • Higher speedup (30%) compared to non-grouped ADMM
  • Reduces the number of ADMM iterations
  • Improves the global consensus
  • Good scalability in terms of the number of training samples
  Cons:
  • Supports only linear classification
  • Slightly degrades accuracy
  Reference: Wang et al. [2017]

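A minimal sketch of only the group-based gather step: worker solutions are first averaged inside each group, and the global variable is then a weighted average of the group results. The weights used here (group sizes) are an assumption for illustration; the full ADMM loop of Wang et al. [2017] is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_workers, n_groups = 10, 12, 3
W = rng.normal(size=(n_workers, d))              # local models from the slave nodes
groups = np.array_split(np.arange(n_workers), n_groups)

group_means = np.stack([W[g].mean(axis=0) for g in groups])   # per-group aggregation
weights = np.array([len(g) for g in groups], dtype=float)
weights /= weights.sum()

z = weights @ group_means                        # weighted average -> global variable update
print("global variable shape:", z.shape)
```
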
PMKL
  Problem type: Non-linear and binary classification
  Parallelism type: Shared memory parallelism
  Main idea: Implements the first parallel hybrid of ADMM and multiple kernel learning, coordinating communication between processors to obtain the global solution.
  Pros:
  • High classification accuracy
  • Fast computational speed
  • Outperforms the standard cascade (Graf et al. [2004]) and SVM ensembles (Alham et al. [2011])
  • Efficient parameter tuning
  • Globally optimal solution
  Cons:
  • Benefits not fully clear regarding performance and scalability for large-scale problems
  • Not robust to imbalanced samples
  Reference: Chen et al. [2014]

MRESVM
  Problem type: Non-linear and binary classification
  Parallelism type: Distributed big-data architectures using MapReduce
  Main idea: Implements an ensemble SVM that re-samples the training data using parallel balanced bootstrapping, with applications in image classification and annotation (see the sketch below).
  Pros:
  • Improves classification accuracy
  • Reduces training time while keeping high accuracy
  • Scales well with the number of mappers
  • Reduces overhead as the number of mappers increases
  • Reduces classification error as the number of bootstrap samples increases
  • Reduces processing time for a large number of mappers
  • Good scalability in terms of the number of cores
  • Good parallelizability
  Cons:
  • No load balancing for heterogeneous computing environments
  • Communication overhead for a large number of mappers
  Reference: Alham et al. [2013]

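A minimal sketch of balanced bootstrapping plus an SVM ensemble: r copies of all sample indices are pooled, shuffled, and split, so every sample occurs exactly r times across the r bootstrap sets. The MapReduce machinery of MRESVM is replaced here by joblib, and the majority-vote aggregation is an assumption; this is not the Hadoop implementation of Alham et al. [2013].

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=6000, n_features=20, random_state=0)
r = 8                                            # number of bootstrap sets ("mappers")

# Balanced bootstrapping: each index appears exactly r times across all sets.
rng = np.random.default_rng(0)
pooled = rng.permutation(np.tile(np.arange(len(y)), r))
bootstrap_sets = np.array_split(pooled, r)

def fit_one(idx):
    """'Map' step: train one SVM on a single bootstrap set."""
    return SVC(kernel="rbf").fit(X[idx], y[idx])

models = Parallel(n_jobs=-1)(delayed(fit_one)(idx) for idx in bootstrap_sets)

# 'Reduce' step: aggregate member predictions by majority vote.
votes = np.stack([m.predict(X) for m in models])
majority = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", np.mean(majority == y))
```
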
DSVM
  Problem type: Linear and binary classification
  Parallelism type: Distributed big-data architectures using MapReduce
  Main idea: Performs block minimizations and obtains the final result using averaging and line-search strategies (see the sketch below).
  Pros:
  • Scales well in terms of the number of training and test samples
  • Communication complexity is independent of the number of training samples
  • Fast training of large-scale problems (over 80 million samples in less than 11 minutes)
  Cons:
  • Supports only linear classification
  • No theoretical proof of global convergence
  Reference: Pechyony et al. [2011]

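A minimal sketch of the block-and-average part of the strategy for a linear SVM: the data is split into blocks, a linear SVM is fit on each block, and the block solutions are averaged into one model. The line-search refinement and the MapReduce execution of Pechyony et al. [2011] are omitted, and the solver and block count are illustrative choices.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=20000, n_features=30, random_state=0)
blocks = np.array_split(np.arange(len(y)), 10)   # one block per worker

# Block minimization: an independent linear SVM per block.
models = [LinearSVC(dual=True, max_iter=5000).fit(X[b], y[b]) for b in blocks]

# Averaging strategy: combine the block solutions into a single linear model.
w = np.mean([m.coef_[0] for m in models], axis=0)
b = np.mean([m.intercept_[0] for m in models])

pred = (X @ w + b > 0).astype(int)
print("training accuracy:", np.mean(pred == y))
```
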
GPU-tailored kernelized SVM
  Problem type: Non-linear, binary, and multi-class classification
  Parallelism type: GPU-based parallelism
  Main idea: Handles sparse data using sparsity clustering, which groups training samples with similar sparsity patterns on the GPU (see the sketch below).
  Pros:
  • Solves problems with sparse datasets that may not match the GPU's preferred memory-access pattern
  • Robust to both sparse and dense data formats
  • Efficiently handles sparse data
  • Publicly available implementation
  • Overcomes some of the limited-memory issues on GPUs
  • Reduces runtime
  • Efficient memory-access pattern
  • Easy configuration
  • Orders of magnitude faster than existing CPU and GPU implementations
  Cons:
  • Benefits not fully clear for other data structures
  • May still suffer from limited memory
  • The initial clustering of the data can be computationally expensive
  Reference: Cotter et al. [2011]

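A minimal CPU-side sketch of the sparsity-clustering idea: each sample is represented by its binary non-zero pattern, the patterns are clustered, and the rows are reordered so that samples with similar sparsity are stored contiguously (which is what makes the GPU memory accesses efficient in the actual method). The k-means choice and the NumPy/SciPy setting are assumptions; this is not the GPU kernel of Cotter et al. [2011].

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.cluster import KMeans

X = sparse_random(5000, 200, density=0.05, random_state=0, format="csr")

# Binary sparsity pattern of every row, then cluster the patterns.
patterns = (X.toarray() != 0).astype(float)
groups = KMeans(n_clusters=16, n_init=3, random_state=0).fit_predict(patterns)

# Reorder rows so that rows with similar sparsity patterns become contiguous.
order = np.argsort(groups)
X_reordered = X[order]
print("rows per sparsity group:", np.bincount(groups))
```
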
Automatic parallel SVM
  Problem type: Non-linear and binary classification
  Parallelism type: GPU-based parallelism
  Main idea: Develops a directive-based technique that solves the optimization problem by gradient ascent, and performs a novel feature extraction in which pixel values are converted into a high-dimensional input vector (see the sketch below).
  Pros:
  • Higher speedup (3 times) from a simple change of the memory-access pattern
  • Higher speedup (10-fold) from changing the input data structure
  • Higher memory throughput
  • The fastest GPU-based SVM for datasets with a high-dimensional input space
  • Dimensionality reduction of the problem
  • Maintains high accuracy
  • Efficient parameter tuning
  Cons:
  • Benefits not fully clear regarding scalability
  Reference: Codreanu et al. [2014]

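A minimal sketch of solving the kernel SVM dual by projected gradient ascent, the kind of simple, data-parallel update loop that maps well onto directive-based GPU offloading. Plain NumPy is used here instead of GPU directives, the bias/equality constraint is ignored, and the step size and iteration count are assumptions, not the settings of Codreanu et al. [2014].

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
y = 2.0 * y - 1.0                       # labels in {-1, +1}

K = rbf_kernel(X, gamma=0.1)            # full kernel matrix (the data-parallel workload)
C, lr = 1.0, 0.001
alpha = np.zeros(len(y))

for _ in range(200):
    grad = 1.0 - y * (K @ (alpha * y))  # gradient of the dual objective (bias ignored)
    alpha = np.clip(alpha + lr * grad, 0.0, C)   # ascent step + projection onto [0, C]

pred = np.sign(K @ (alpha * y))
print("training accuracy:", np.mean(pred == y))
```
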