Initial clustering |
Unclear |
Shared memory parallelism |
Uses k-means clustering to reduce the original large-scale problem to a smaller set of representative training examples (see the sketch after this entry)
- Reduces the number of training examples using clustering
- Robust to unbalanced data
- Easy implementation
- Speedup (3 times) compared to the sequential version
|
- Possible deterioration of accuracy
- Limited explainability of results
- Benefits not fully clear for non-linear and multi-class classifications
- Limited by available memory
|
Shrivastava et al. [2011] |
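A minimal sketch of this clustering-based reduction, using scikit-learn's KMeans and SVC as illustrative stand-ins: the training set is compressed to cluster representatives and a single SVM is trained on them. Labelling each centroid with its cluster's majority class is an assumption of this sketch, not necessarily the scheme of Shrivastava et al. [2011].

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def reduce_and_train(X, y, n_clusters=200, seed=0):
    """Compress (X, y) with k-means and train an SVM on the representatives.

    Illustrative sketch only. Assumes y holds non-negative integer class
    labels and that the representatives cover at least two classes.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    assignments = km.fit_predict(X)

    reps, rep_labels = [], []
    for c in range(n_clusters):
        members = y[assignments == c]
        if members.size == 0:          # skip the (rare) empty cluster
            continue
        reps.append(km.cluster_centers_[c])
        rep_labels.append(np.bincount(members).argmax())  # majority label

    # Train a standard (sequential) SVM on the much smaller reduced set.
    svm = SVC(kernel="rbf", C=1.0, gamma="scale")
    svm.fit(np.asarray(reps), np.asarray(rep_labels))
    return svm
```

Training on a few hundred representatives instead of the full data set is where the speedup comes from, and also why the entry lists a possible deterioration of accuracy.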
kSVM |
Non-linear and multi-class classifications |
Shared memory parallelism |
Distributes the data into clusters and trains a local SVM on each cluster in parallel (see the sketch after this entry)
- Scales well with the size of training data
- Maintains the classification accuracy
- Good scalability in terms of the number of training samples
|
- There is a trade-off between computation time and accuracy
- Finds local support vectors (SVs) instead of the global SVs
|
Do et al. [2015] |
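The cluster-local training of kSVM can be sketched as follows, using scikit-learn and joblib for illustration: the data are partitioned with k-means, one SVM is trained per cluster in parallel, and a query is routed to the SVM of its nearest centroid (the routing rule and parameter values are assumptions of this sketch).

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def _fit_local_svm(X_part, y_part):
    """Fit one local SVM on a single cluster (assumes it contains >= 2 classes)."""
    svm = SVC(kernel="rbf", gamma="scale")
    svm.fit(X_part, y_part)
    return svm

class ClusterLocalSVM:
    """k-means partitioning followed by per-cluster SVMs trained in parallel."""

    def __init__(self, n_clusters=8, n_jobs=-1):
        self.km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        self.n_jobs = n_jobs

    def fit(self, X, y):
        parts = self.km.fit_predict(X)
        # Local problems are independent, so they can be solved in parallel.
        self.models_ = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_local_svm)(X[parts == c], y[parts == c])
            for c in range(self.km.n_clusters)
        )
        return self

    def predict(self, X):
        nearest = self.km.predict(X)            # route each query to a cluster
        preds = np.empty(len(X), dtype=int)     # assumes integer class labels
        for c in np.unique(nearest):
            mask = nearest == c
            preds[mask] = self.models_[c].predict(X[mask])
        return preds
```

Because every model only ever sees its own cluster, the support vectors it finds are local rather than global, which is exactly the trade-off noted in the entry above.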
Ensemble SVMs |
Non-linear and multi-class classifications |
Shared memory parallelism |
Implements ensemble learning for training and uses simple aggregation schemes to obtain the global result (see the sketch after this entry)
- Significant speedup compared to LIBSVM with comparable accuracy
- Reduces training complexity
- Avoids duplicate storage
- High classification accuracy
- Good parallelizability
- An intuitive programming framework
|
- Needs more complex aggregation schemes for further performance improvement
- Deterioration of accuracy for low-dimensional problems
- Limited by available memory
|
Claesen et al. [2014] |
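A sketch of ensemble training with a simple aggregation scheme, using scikit-learn for illustration: base SVMs are fitted on bootstrap subsamples and combined by majority vote. The subsample size, the voting rule, and the sequential training loop are assumptions of this sketch rather than details of the EnsembleSVM framework of Claesen et al. [2014].

```python
import numpy as np
from sklearn.svm import SVC

def train_svm_ensemble(X, y, n_models=10, subsample=0.2, seed=0):
    """Fit base SVMs on bootstrap subsamples (independent, hence parallelizable)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.choice(n, size=int(subsample * n), replace=True)
        m = SVC(kernel="rbf", gamma="scale")
        m.fit(X[idx], y[idx])
        models.append(m)
    return models

def predict_majority(models, X):
    """Simple aggregation scheme: majority vote over the base predictions."""
    votes = np.stack([m.predict(X) for m in models])   # shape: (n_models, n_samples)
    # Assumes non-negative integer labels; ties go to the smallest label.
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```

Each base model trains on a fraction of the data, which is what reduces the per-model training complexity; more elaborate aggregation schemes than a plain vote are the improvement the entry's disadvantages point to.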
GaBP SVM |
Non-linear and binary classifications |
Distributed HPC Architectures |
Reformulates the standard SVM problem from the algebraic domain into a probabilistic one using Gaussian distributions
- Reduces overhead by passing only a small number of messages
- Good scalability in terms of the number of cores (up to 1024 cores)
- Comparable accuracy to SVMlight
- The largest parallel implementation of Gaussian belief propagation reported at the time
- A linear speedup using 256 CPUs
- Robust to sparse and dense matrix operations
|
- Lacks support for asynchronous communication in heterogeneous computing environments
- Slow convergence for large amounts of input data and large networks
- Communication overhead
- Computes the full kernel matrix
|
Bickson et al. [2008] |
RBF kernel for SVM |
Non-linear and binary classifications |
Distributed HPC Architectures |
Implements an approximation of the RBF kernel function in which the kernel matrix is reduced to a sparse matrix with smaller memory requirements (see the sketch after this entry)
- Minimizes communication cost using a kernel approximation
- Efficient data partition and communication pattern in a distributed system
- Good scalability in terms of the number of nodes
- High speedup (45 times) using 65 nodes
- Linear speedup for fewer than 32 nodes using the approximated kernel
- High classification accuracy
|
- Benefits not fully clear for other kernels
- Limited speedup with dimension-wise partitioning
- Communication overhead
|
Qiu and Lane [2005] |
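One simple way to picture a sparse kernel approximation: compute the RBF kernel and drop near-zero entries, storing the rest in a SciPy sparse matrix. The thresholding rule and the cut-off value are assumptions made for illustration; Qiu and Lane [2005] use their own approximation and distributed partitioning scheme.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import cdist

def sparse_rbf_kernel(X, gamma=0.1, cutoff=1e-3):
    """RBF kernel with near-zero entries dropped to reduce memory.

    Illustrative only: a real implementation would avoid materializing the
    dense matrix in the first place (e.g., by pruning distant pairs early).
    """
    sq_dists = cdist(X, X, metric="sqeuclidean")
    K = np.exp(-gamma * sq_dists)       # exact (dense) RBF kernel matrix
    K[K < cutoff] = 0.0                 # distant pairs contribute almost nothing
    return csr_matrix(K)                # sparse storage, smaller memory footprint
```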
SVM Mixture |
Non-linear and binary classifications |
Distributed HPC Architectures |
Trains a mixture of SVMs whose outputs are combined by a neural network
- Leads to high prediction accuracy
- Scales linearly with the number of samples
|
- Unclear impact of high-dimensional data on performance
- The neural network's computation time exceeds the time spent training the sub-problems
- Complex configuration
|
Collobert et al. [2002] |
Modular network SVMs |
Non-linear and multi-class classifications |
Distributed HPC Architectures |
Improves on the work of Collobert et al. [2002] by replacing the neural network with neural quantizer modules, which do not require specific knowledge of neural networks
- Training time increases linearly with the number of samples
- Efficient network structure
- A local SVM produces the actual output
- Improves the performance in terms of time compared to training a single SVM
- No need for extra parameter tuning
- No need for specialist knowledge of neural networks
- Simpler configuration and shorter SVM training time than the approach of Collobert et al. [2002]
|
- The aggregation model is still not simple
- Concerns regarding accuracy
- There is a trade-off between learning speedup and classification accuracy
|
Huang et al. [2005] |
D-ADMM |
Linear and binary classifications |
Distributed HPC Architectures |
Uses a coloring scheme that requires less communication between nodes than state-of-the-art algorithms
- Reduces communication between nodes
- Asynchronous computations
- Good parallelizability
|
- No clear theoretical explanation of why this algorithm is more efficient than similar algorithms
- Supports only linear classifications
- Slow convergence
|
Mota et al. [2013] |
Linear ADMM |
Linear and multi-class classifications |
Distributed HPC Architectures |
Implements an ADMM-based SVM in which an inexact (approximate) optimization problem is solved with warm-started sub-problems (see the sketch after this entry)
- Outperforms the algorithms proposed by Yu et al. [2012] and Zinkevich et al. [2010] in terms of training time and convergence rate
- Efficient parameter tuning using cross-validation
- Reduces the total number of ADMM iterations using over-relaxation while keeping the same accuracy
- Reduces the total number of iterations using random shuffling
- Shortens each ADMM iteration by warm-starting the sub-problems
- Accelerates classification and increases classification accuracy using distributed data normalization
|
- Supports only linear classifications
- Requires the data to fit in memory
|
Zhang et al. [2012] |
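The consensus-ADMM structure behind these distributed linear SVMs can be sketched as below: each partition approximately solves a local hinge-loss subproblem (warm-started from its previous solution), a master averages the local weights into the global variable, and the dual variables are updated. The subgradient inner solver, the parameter values, and the omission of a bias term are illustrative assumptions, not the exact algorithm of Zhang et al. [2012].

```python
import numpy as np

def local_update(X, y, w, z, u, rho, steps=50, lr=0.01):
    """Inexactly solve one partition's subproblem with subgradient steps:
        min_w  sum_j max(0, 1 - y_j * x_j.w) + (rho / 2) * ||w - z + u||^2
    Warm-started from the previous local solution w."""
    for _ in range(steps):
        margins = y * (X @ w)
        active = margins < 1                              # margin violators
        grad = -(y[active][:, None] * X[active]).sum(axis=0) + rho * (w - z + u)
        w = w - lr * grad
    return w

def consensus_admm_svm(parts, dim, lam=1.0, rho=1.0, iters=30):
    """Consensus ADMM for a linear SVM over data partitions.

    parts: list of (X_i, y_i) with labels in {-1, +1}. Illustrative sketch.
    """
    N = len(parts)
    w = [np.zeros(dim) for _ in range(N)]
    u = [np.zeros(dim) for _ in range(N)]
    z = np.zeros(dim)
    for _ in range(iters):
        # Local subproblems are independent and can run in parallel.
        w = [local_update(X, y, w[i], z, u[i], rho)
             for i, (X, y) in enumerate(parts)]
        # z-update: minimizes (lam/2)||z||^2 + (rho/2) * sum_i ||w_i - z + u_i||^2.
        z = rho * sum(w[i] + u[i] for i in range(N)) / (lam + rho * N)
        # Dual updates push every w_i toward consensus with z.
        u = [u[i] + w[i] - z for i in range(N)]
    return z   # global linear SVM weight vector (no bias term in this sketch)
```

Warm-starting the local solves mainly shortens each ADMM iteration, while techniques such as over-relaxation reduce how many iterations are needed, matching the advantages listed in the entry above.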
euroCAT |
Linear and binary classifications |
Distributed HPC Architectures |
Implements multi-centric distributed learning in which data privacy is a top priority, with applications in personalized medicine
- Preserves data privacy
- Higher speedup while maintaining good accuracy
- Model quality similar to that of centralized models
- Asynchronous communications
- Efficient parameter tuning
|
- Supports only linear classifications
- Slow convergence
|
Deist et al. [2017] |
GADMM |
Linear and binary classifications |
Distributed HPC Architectures |
Implements a group-based ADMM in which all slave nodes are divided into groups; a weighted-average method gathers the results from all groups and updates the global variable
- Fast convergence
- Higher speedup (30%) compared to non-grouped ADMM
- Reduces the number of ADMM iterations
- Improves the global consensus
- Good scalability in terms of the number of training samples
|
- Supports only linear classifications
- Slightly degrades the accuracy
|
Wang et al. [2017] |
PMKL |
Non-linear and binary classifications |
Shared memory parallelism |
Implements the first parallel hybrid of ADMM and multiple kernel learning, coordinating communication between processors to obtain the global solution
- High classification accuracy
- Fast computational speed
- Outperforms the standard cascade (Graf et al. [2004]) and SVM ensembles (Alham et al. [2011])
- Efficient parameter tuning
- Global optimal solution
|
- Benefits not fully clear regarding performance and scalability for large-scale problems
- Not robust to imbalanced samples
|
Chen et al. [2014] |
MRESVM |
Non-linear and binary classifications |
Distributed big data architectures using map-reduce |
Implements an ensemble SVM which re-samples training data using parallel balanced bootstrapping for image classification and annotation |
- Improves the classification accuracy
- Reduces the training time while keeping high accuracy
- Scales well with the number of mappers
- Reduces overheads as the number of mappers increases
- Reduces classification error as the number of bootstrap samples increases
- Reduces the processing time for a large number of mappers
- Good scalability in terms of the number of cores
- Good parallelizability
|
- No load balancing for heterogeneous computing environments
- Communication overheads for a large number of mappers
|
Alham et al. [2013] |
DSVM |
Linear and binary classifications |
Distributed big data architectures using map-reduce |
Performs block minimization and obtains the final result using averaging and line-search strategies
- Scales well in terms of the number of training and testing samples
- Communication complexity is independent of the number of training samples
- Fast computation time for training large-scale problems (over 80 million samples in less than 11 minutes)
|
- Supports only linear classifications
- No theoretical proof of global convergence
|
Pechyony et al. [2011] |
GPU-tailored kernelized SVM |
Non-linear, binary, and multi-class classifications |
GPU-based Parallelism |
Handles sparse data using sparsity clustering, which groups training samples with similar sparsity patterns on the GPU
- Solves problems with sparse datasets that may not match the GPU's preferred memory access pattern
- Robust to both sparse and dense data formats
- Efficiently handles sparse data
- Publicly available implementation
- Overcomes some of the limited-memory issues on GPUs
- Reduces runtime
- Efficient memory access pattern
- Easy configuration
- Orders-of-magnitude faster than existing CPU and GPU implementations
|
- Benefits not fully clear for other data structures
- May still suffer from limited memory
- Initial clustering of data can be computationally expensive
|
Cotter et al. [2011] |
Automatic parallel SVM |
Non-linear and binary classifications |
GPU-based Parallelism |
Develops a directive-based technique using gradient ascent for the optimization problem and performs a novel feature extraction in which pixel values are converted into a high-dimensional input vector
- Speedup of 3x from a simple change to the memory access pattern
- Speedup of 10x from changing the input data structure
- Higher memory throughput
- The fastest GPU-based SVM for datasets with a high-dimensional input space
- Reduces the dimensionality of the problem
- Maintains high accuracy
- Efficient parameter tuning
|
- Benefits not fully clear regarding scalability
|
Codreanu et al. [2014] |