A comparison of articles and methods that implement distributed and parallel algorithms for training SVMs. Here, we highlight the articles in the survey with the most interesting findings. Each entry records the algorithm, the problem type it addresses, the parallelism type, the main idea, its pros and cons, and the reference.

Initial clustering
  Problem type: Unclear
  Parallelism type: Shared memory parallelism
  Main idea: Minimizes the number of training examples by representing the original large-scale problem with k-means cluster representatives (see the sketch below).
  Pros:
  • Reduces the number of training examples using clustering
  • Robust for unbalanced data
  • Easy to implement
  • Speedup (3 times) compared to the sequential version
  Cons:
  • Possible deterioration of accuracy
  • Limited explainability of results
  • Benefits not fully clear for non-linear and multi-class classification
  • Limited memory
  Reference: Shrivastava et al. [2011]

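A minimal Python sketch of the reduce-then-train idea: k-means is run per class and the resulting centroids replace the original samples as a much smaller training set. The class-wise clustering, the scikit-learn estimators, and all parameter values are assumptions for illustration, not the exact procedure of Shrivastava et al. [2011].

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

def reduce_by_clustering(X, y, clusters_per_class=100):
    """Replace each class by the centroids of a k-means clustering of that class."""
    Xr, yr = [], []
    for label in np.unique(y):
        km = KMeans(n_clusters=clusters_per_class, n_init=3, random_state=0)
        km.fit(X[y == label])
        Xr.append(km.cluster_centers_)
        yr.append(np.full(clusters_per_class, label))
    return np.vstack(Xr), np.concatenate(yr)

X_small, y_small = reduce_by_clustering(X, y)
clf = SVC(kernel="rbf").fit(X_small, y_small)   # train only on the reduced set
print("reduced train size:", len(y_small), "accuracy:", clf.score(X, y))
```
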
kSVM
  Problem type: Non-linear and multi-class classification
  Parallelism type: Shared memory parallelism
  Main idea: Distributes the data into clusters, each of which trains an SVM locally in parallel (see the sketch below).
  Pros:
  • Scales well with the size of the training data
  • Maintains classification accuracy
  • Good scalability in terms of the number of training samples
  Cons:
  • A trade-off exists between computation time and accuracy
  • Finds local SVs instead of the global SVs
  Reference: Do et al. [2015]

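A minimal sketch of the cluster-then-train-locally idea, assuming k-means for the partitioning and joblib for the parallel workers; the clustering, kernel, and prediction routing are illustrative choices, not the actual kSVM of Do et al. [2015].

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.dummy import DummyClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

k = 8
km = KMeans(n_clusters=k, n_init=3, random_state=0).fit(X)

def train_local(c):
    """Train an SVM only on the samples assigned to cluster c (local SVs only)."""
    idx = np.where(km.labels_ == c)[0]
    if len(np.unique(y[idx])) == 1:  # degenerate single-class cluster
        return DummyClassifier(strategy="most_frequent").fit(X[idx], y[idx])
    return SVC(kernel="rbf").fit(X[idx], y[idx])

# Each cluster trains its local SVM in parallel.
local_svms = Parallel(n_jobs=-1)(delayed(train_local)(c) for c in range(k))

def predict(X_new):
    """Route every test point to the SVM of its nearest cluster centre."""
    assignment = km.predict(X_new)
    out = np.empty(len(X_new), dtype=y.dtype)
    for c in range(k):
        mask = assignment == c
        if mask.any():
            out[mask] = local_svms[c].predict(X_new[mask])
    return out

print("training accuracy:", np.mean(predict(X) == y))
```
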
Ensemble SVMs
  Problem type: Non-linear and multi-class classification
  Parallelism type: Shared memory parallelism
  Main idea: Trains an ensemble of SVMs on subsets of the data and uses simple aggregation schemes to obtain the global result (see the sketch below).
  Pros:
  • Significant speedup compared to LIBSVM with comparable accuracy
  • Reduces training complexity
  • Avoids duplicate storage
  • High classification accuracy
  • Good parallelizability
  • An intuitive programming framework
  Cons:
  • Needs more complex aggregation schemes for further performance improvement
  • Deterioration of accuracy for low-dimensional problems
  • Limited memory
  Reference: Claesen et al. [2014]

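A minimal sketch of the ensemble idea using scikit-learn's bagging meta-estimator with SVM base learners and simple vote-based aggregation. The subset size, number of members, and the bagging estimator itself are illustrative stand-ins, not the implementation of Claesen et al. [2014].

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=20000, n_features=30, random_state=0)

ensemble = BaggingClassifier(
    estimator=SVC(kernel="rbf"),   # each member is a non-linear SVM
    n_estimators=16,               # number of sub-models trained on subsets
    max_samples=0.1,               # each member sees only 10% of the data
    n_jobs=-1,                     # shared-memory parallel training
    random_state=0,
)
ensemble.fit(X, y)
print("accuracy:", ensemble.score(X, y))
```
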
GaBP SVM
  Problem type: Non-linear and binary classification
  Parallelism type: Distributed HPC architectures
  Main idea: Reformulates the standard SVM problem from the algebraic domain into a probabilistic one using Gaussian distributions.
  Pros:
  • Reduces overhead by exchanging only a small number of messages
  • Good scalability in terms of the number of cores (up to 1,024 cores)
  • Accuracy comparable with SVMLight
  • The largest parallel implementation of Gaussian belief propagation to date
  • Linear speedup using 256 CPUs
  • Robust to both sparse and dense matrix operations
  Cons:
  • Lacks support for asynchronous communication in heterogeneous computing environments
  • Slow convergence for large numbers of input data and large networks
  • Communication overhead
  • Computes the full kernel matrix
  Reference: Bickson et al. [2008]

RBF kernel for SVM
  Problem type: Non-linear and binary classification
  Parallelism type: Distributed HPC architectures
  Main idea: Implements an approximation of the RBF kernel function in which the kernel matrix is reduced to a sparse matrix with smaller memory requirements (see the sketch below).
  Pros:
  • Minimizes communication cost using a kernel approximation
  • Efficient data partitioning and communication pattern in a distributed system
  • Good scalability in terms of the number of nodes
  • High speedup (45 times) using 65 nodes
  • Linear speedup with the approximated kernel for fewer than 32 nodes
  • High classification accuracy
  Cons:
  • Benefits not fully clear for other kernels
  • Limited speedup of dimension-wise partitioning
  • Communication overhead
  Reference: Qiu and Lane [2005]

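One way to turn an RBF kernel matrix into a sparse matrix is to drop near-zero similarities, shown in the sketch below. The thresholding rule, the cut-off value, and the CSR storage are assumptions for illustration only; they are not the specific approximation scheme of Qiu and Lane [2005].

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50))

K = rbf_kernel(X, gamma=0.1)      # dense n x n RBF kernel matrix
K[K < 1e-3] = 0.0                 # drop near-zero similarities (assumed cut-off)
K_sparse = csr_matrix(K)          # much smaller memory footprint if K is mostly zeros

density = K_sparse.nnz / (K.shape[0] * K.shape[1])
print(f"kept {density:.1%} of the kernel entries")
```
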
SVM Mixture
  Problem type: Non-linear and binary classification
  Parallelism type: Distributed HPC architectures
  Main idea: Trains a mixture of SVMs whose outputs are combined by a neural network (see the sketch below).
  Pros:
  • Leads to high prediction accuracy
  • Scales linearly with the number of samples
  Cons:
  • Unclear impact of data with a large number of features on performance
  • Computation time in the neural network is longer than the time spent training the sub-problems
  • Complex configuration
  Reference: Collobert et al. [2002]

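A simplified stand-in for the mixture idea: several SVM experts are trained on disjoint chunks of the data and a small neural network then learns to combine their real-valued outputs. The alternating sample-reassignment and retraining loop of Collobert et al. [2002] is not reproduced here; the chunking, network size, and stacking-style combination are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=12000, n_features=20, random_state=0)
n_experts = 6
chunks = np.array_split(np.random.default_rng(0).permutation(len(y)), n_experts)

# Each expert SVM is trained on its own chunk of the data.
experts = [SVC(kernel="rbf").fit(X[idx], y[idx]) for idx in chunks]

# The combiner network sees every expert's signed decision value for each sample.
Z = np.column_stack([e.decision_function(X) for e in experts])
combiner = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0).fit(Z, y)

Z_test = np.column_stack([e.decision_function(X[:1000]) for e in experts])
print("accuracy:", combiner.score(Z_test, y[:1000]))
```
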
Modular network SVMs
  Problem type: Non-linear and multi-class classification
  Parallelism type: Distributed HPC architectures
  Main idea: Improves the work of Collobert et al. [2002] by replacing the neural network with neural quantizer modules, which require no specialist knowledge of neural networks.
  Pros:
  • Training time increases linearly with the number of samples
  • Efficient network structure
  • A local SVM produces the actual output
  • Faster than training a single SVM
  • No need for extra parameter tuning
  • No need for specialist knowledge of neural networks
  • Simpler configuration and shorter training time than the approach of Collobert et al. [2002]
  Cons:
  • The aggregation model is still not simple
  • Concerns regarding accuracy
  • A trade-off exists between learning speedup and classification accuracy
  Reference: Huang et al. [2005]

D-ADMM
  Problem type: Linear and binary classification
  Parallelism type: Distributed HPC architectures
  Main idea: Uses a coloring scheme that requires fewer communications between nodes than state-of-the-art algorithms.
  Pros:
  • Reduces communication between nodes
  • Asynchronous computations
  • Good parallelizability
  Cons:
  • No clear theoretical explanation of why the algorithm is more efficient than similar algorithms
  • Supports only linear classification
  • Slow convergence
  Reference: Mota et al. [2013]

Linear ADMM
  Problem type: Linear and multi-class classification
  Parallelism type: Distributed HPC architectures
  Main idea: Implements ADMM-based SVMs in which an inexact (approximate) optimization problem is solved with warm starts of the sub-problems (see the sketch below).
  Pros:
  • Outperforms the algorithms of Yu et al. [2012] and Zinkevich et al. [2010] in training time and convergence rate
  • Efficient parameter tuning using cross-validation
  • Reduces the total number of ADMM iterations using over-relaxation while keeping the same accuracy
  • Reduces the total number of iterations using random shuffling
  • Shortens each ADMM iteration by warm-starting the sub-problem solvers
  • Accelerates classification and increases classification accuracy using distributed normalization of the data
  Cons:
  • Supports only linear classification
  • Requires the data to fit in memory
  Reference: Zhang et al. [2012]

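A minimal sketch of consensus ADMM for a distributed linear SVM: each "node" holds a data block, inexactly solves its hinge-loss sub-problem with a few warm-started subgradient steps, and a global variable z is assembled from the local solutions. The penalty parameter, step sizes, and the inner solver are assumptions for illustration, not the exact scheme of Zhang et al. [2012].

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=8000, n_features=30, random_state=0)
y = 2 * y - 1                                    # labels in {-1, +1}
blocks = np.array_split(np.arange(len(y)), 8)    # one data block per "node"

d, n_nodes = X.shape[1], len(blocks)
rho, lam, C = 1.0, 1.0, 0.1                      # penalty, L2 weight, hinge weight (assumed)
W = np.zeros((n_nodes, d))                       # local models w_i
U = np.zeros((n_nodes, d))                       # scaled dual variables u_i
z = np.zeros(d)                                  # global consensus model

def local_update(i, w, steps=10, lr=1e-3):
    """Inexactly minimise C*hinge_i(w) + (rho/2)||w - z + u_i||^2, warm-started at w."""
    Xi, yi, u = X[blocks[i]], y[blocks[i]], U[i]
    for _ in range(steps):
        viol = yi * (Xi @ w) < 1                 # samples violating the margin
        grad = -C * (Xi[viol].T @ yi[viol]) + rho * (w - z + u)
        w = w - lr * grad
    return w

for _ in range(30):                              # outer ADMM iterations
    for i in range(n_nodes):
        W[i] = local_update(i, W[i])             # warm start from the previous w_i
    z = rho * (W + U).sum(axis=0) / (n_nodes * rho + lam)  # z-update with (lam/2)||z||^2
    U += W - z                                   # scaled dual update

print("training accuracy:", np.mean(np.sign(X @ z) == y))
```
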
euroCAT
  Problem type: Linear and binary classification
  Parallelism type: Distributed HPC architectures
  Main idea: Implements multi-centric distributed learning in which data privacy is the top priority, with applications in personalized medicine.
  Pros:
  • Preserves data privacy
  • Higher speedup with good accuracy
  • Model quality similar to centralized models
  • Asynchronous communication
  • Efficient parameter tuning
  Cons:
  • Supports only linear classification
  • Slow convergence
  Reference: Deist et al. [2017]

GADMM
  Problem type: Linear and binary classification
  Parallelism type: Distributed HPC architectures
  Main idea: Implements a group-based ADMM in which the slave nodes are divided into groups, and a weighted-average step gathers the group results to update the global variable (see the sketch below).
  Pros:
  • Fast convergence
  • Higher speedup (30%) compared to non-grouped ADMM
  • Reduces the number of ADMM iterations
  • Improves the global consensus
  • Good scalability in terms of the number of training samples
  Cons:
  • Supports only linear classification
  • Slightly degrades accuracy
  Reference: Wang et al. [2017]

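A minimal sketch of only the group-based gather step: worker solutions are first averaged inside each group, and the global variable is then a weighted average of the group results. The weights used here (group sizes) are an assumption for illustration; the full ADMM loop of Wang et al. [2017] is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_workers, n_groups = 10, 12, 3
W = rng.normal(size=(n_workers, d))              # local models from the slave nodes
groups = np.array_split(np.arange(n_workers), n_groups)

group_means = np.stack([W[g].mean(axis=0) for g in groups])   # per-group aggregation
weights = np.array([len(g) for g in groups], dtype=float)
weights /= weights.sum()

z = weights @ group_means                        # weighted average -> global variable update
print("global variable shape:", z.shape)
```
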
PMKL
  Problem type: Non-linear and binary classification
  Parallelism type: Shared memory parallelism
  Main idea: Implements the first parallel hybrid of ADMM and multiple kernel learning, coordinating communication between processors to obtain the global solution.
  Pros:
  • High classification accuracy
  • Fast computational speed
  • Outperforms the standard cascade (Graf et al. [2004]) and SVM ensembles (Alham et al. [2011])
  • Efficient parameter tuning
  • Globally optimal solution
  Cons:
  • Benefits not fully clear regarding performance and scalability for large-scale problems
  • Not robust to imbalanced samples
  Reference: Chen et al. [2014]

MRESVM
  Problem type: Non-linear and binary classification
  Parallelism type: Distributed big-data architectures using MapReduce
  Main idea: Implements an ensemble SVM that re-samples the training data using parallel balanced bootstrapping, with applications in image classification and annotation (see the sketch below).
  Pros:
  • Improves classification accuracy
  • Reduces training time while keeping high accuracy
  • Scales well with the number of mappers
  • Reduces overhead as the number of mappers increases
  • Reduces classification error as the number of bootstrap samples increases
  • Reduces processing time for a large number of mappers
  • Good scalability in terms of the number of cores
  • Good parallelizability
  Cons:
  • No load balancing for heterogeneous computing environments
  • Communication overhead for a large number of mappers
  Reference: Alham et al. [2013]

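A minimal sketch of balanced bootstrapping plus an SVM ensemble: r copies of all sample indices are pooled, shuffled, and split, so every sample occurs exactly r times across the r bootstrap sets. The MapReduce machinery of MRESVM is replaced here by joblib, and the majority-vote aggregation is an assumption; this is not the Hadoop implementation of Alham et al. [2013].

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=6000, n_features=20, random_state=0)
r = 8                                            # number of bootstrap sets ("mappers")

# Balanced bootstrapping: each index appears exactly r times across all sets.
rng = np.random.default_rng(0)
pooled = rng.permutation(np.tile(np.arange(len(y)), r))
bootstrap_sets = np.array_split(pooled, r)

def fit_one(idx):
    """'Map' step: train one SVM on a single bootstrap set."""
    return SVC(kernel="rbf").fit(X[idx], y[idx])

models = Parallel(n_jobs=-1)(delayed(fit_one)(idx) for idx in bootstrap_sets)

# 'Reduce' step: aggregate member predictions by majority vote.
votes = np.stack([m.predict(X) for m in models])
majority = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", np.mean(majority == y))
```
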
DSVM
  Problem type: Linear and binary classification
  Parallelism type: Distributed big-data architectures using MapReduce
  Main idea: Performs block minimizations and obtains the final result using averaging and line-search strategies (see the sketch below).
  Pros:
  • Scales well in terms of the number of training and test samples
  • Communication complexity is independent of the number of training samples
  • Fast training of large-scale problems (over 80 million samples in less than 11 minutes)
  Cons:
  • Supports only linear classification
  • No theoretical proof of global convergence
  Reference: Pechyony et al. [2011]

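A minimal sketch of the block-and-average part of the strategy for a linear SVM: the data is split into blocks, a linear SVM is fit on each block, and the block solutions are averaged into one model. The line-search refinement and the MapReduce execution of Pechyony et al. [2011] are omitted, and the solver and block count are illustrative choices.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=20000, n_features=30, random_state=0)
blocks = np.array_split(np.arange(len(y)), 10)   # one block per worker

# Block minimization: an independent linear SVM per block.
models = [LinearSVC(dual=True, max_iter=5000).fit(X[b], y[b]) for b in blocks]

# Averaging strategy: combine the block solutions into a single linear model.
w = np.mean([m.coef_[0] for m in models], axis=0)
b = np.mean([m.intercept_[0] for m in models])

pred = (X @ w + b > 0).astype(int)
print("training accuracy:", np.mean(pred == y))
```
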
GPU-tailored kernelized SVM
  Problem type: Non-linear, binary, and multi-class classification
  Parallelism type: GPU-based parallelism
  Main idea: Handles sparse data using sparsity clustering, which groups training samples with similar sparsity patterns on the GPU (see the sketch below).
  Pros:
  • Solves problems with sparse datasets that may not match the GPU's preferred memory-access pattern
  • Robust to both sparse and dense data formats
  • Efficiently handles sparse data
  • Publicly available implementation
  • Overcomes some of the limited-memory issues on GPUs
  • Reduces runtime
  • Efficient memory-access pattern
  • Easy configuration
  • Orders of magnitude faster than existing CPU and GPU implementations
  Cons:
  • Benefits not fully clear for other data structures
  • May still suffer from limited memory
  • The initial clustering of the data can be computationally expensive
  Reference: Cotter et al. [2011]

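A minimal CPU-side sketch of the sparsity-clustering idea: each sample is represented by its binary non-zero pattern, the patterns are clustered, and the rows are reordered so that samples with similar sparsity are stored contiguously (which is what makes the GPU memory accesses efficient in the actual method). The k-means choice and the NumPy/SciPy setting are assumptions; this is not the GPU kernel of Cotter et al. [2011].

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.cluster import KMeans

X = sparse_random(5000, 200, density=0.05, random_state=0, format="csr")

# Binary sparsity pattern of every row, then cluster the patterns.
patterns = (X.toarray() != 0).astype(float)
groups = KMeans(n_clusters=16, n_init=3, random_state=0).fit_predict(patterns)

# Reorder rows so that rows with similar sparsity patterns become contiguous.
order = np.argsort(groups)
X_reordered = X[order]
print("rows per sparsity group:", np.bincount(groups))
```
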
Automatic parallel SVM
  Problem type: Non-linear and binary classification
  Parallelism type: GPU-based parallelism
  Main idea: Develops a directive-based technique that solves the optimization problem by gradient ascent, and performs a novel feature extraction in which pixel values are converted into a high-dimensional input vector (see the sketch below).
  Pros:
  • Higher speedup (3 times) from a simple change of the memory-access pattern
  • Higher speedup (10-fold) from changing the input data structure
  • Higher memory throughput
  • The fastest GPU-based SVM for datasets with a high-dimensional input space
  • Dimensionality reduction of the problem
  • Maintains high accuracy
  • Efficient parameter tuning
  Cons:
  • Benefits not fully clear regarding scalability
  Reference: Codreanu et al. [2014]

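A minimal sketch of solving the kernel SVM dual by projected gradient ascent, the kind of simple, data-parallel update loop that maps well onto directive-based GPU offloading. Plain NumPy is used here instead of GPU directives, the bias/equality constraint is ignored, and the step size and iteration count are assumptions, not the settings of Codreanu et al. [2014].

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
y = 2.0 * y - 1.0                       # labels in {-1, +1}

K = rbf_kernel(X, gamma=0.1)            # full kernel matrix (the data-parallel workload)
C, lr = 1.0, 0.001
alpha = np.zeros(len(y))

for _ in range(200):
    grad = 1.0 - y * (K @ (alpha * y))  # gradient of the dual objective (bias ignored)
    alpha = np.clip(alpha + lr * grad, 0.0, C)   # ascent step + projection onto [0, C]

pred = np.sign(K @ (alpha * y))
print("training accuracy:", np.mean(pred == y))
```
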