This section begins with an overview of the Bison Algorithm, followed by an introduction to Apache Spark; details of the parallel Bison Algorithm implementation are provided at the end of the section.
3.2. Apache Spark
Apache Spark is a popular open-source analytics engine for large-scale data processing [19]. Spark relies on parallelization, a computation technique that divides a task into smaller, independent sub-tasks that can be executed simultaneously across multiple processors or cores. Parallelization increases computational speed by reducing the overall time required to complete a task: by distributing a workload across multiple cores, operations such as classification can be performed in parallel, easing the computational load.
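As a minimal illustration of this idea (a sketch only; the session settings and numbers below are illustrative, not taken from the paper's code), PySpark can split a collection into partitions and apply a function to the elements of each partition in parallel:

```python
from pyspark.sql import SparkSession

# Start a local Spark session that uses all available cores ("local[*]").
spark = SparkSession.builder.master("local[*]").appName("parallel-demo").getOrCreate()
sc = spark.sparkContext

# Distribute a collection across 4 partitions; each partition can be
# processed by a different core at the same time.
rdd = sc.parallelize(range(1_000_000), numSlices=4)

# The map and the sum run in parallel across the partitions.
total = rdd.map(lambda x: x * x).sum()
```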
The main program that runs the user application is called the driver program; it defines the high-level logic of the application, including how tasks are executed. The cluster manager manages the cluster of machines, or nodes, and allocates resources to the Spark application.
Worker nodes are the machines in the cluster that execute the tasks assigned by the driver program; each worker node runs an executor that carries out the individual tasks. Apache Spark leverages distributed computing by breaking large data processing jobs into smaller tasks that are executed in parallel across a cluster of machines (see Figure 1), enabling efficient handling of large volumes of data while improving speed and scalability [20].
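In a cluster deployment, the driver would create its session against a cluster manager rather than a local master. The following sketch is hypothetical: the master URL, executor counts, and memory sizes are placeholders, not the configuration used in this work.

```python
from pyspark.sql import SparkSession

# The driver program: it defines the application logic and asks the
# cluster manager (here a standalone master; the URL is a placeholder)
# for executors on the worker nodes.
spark = (
    SparkSession.builder
    .master("spark://cluster-manager-host:7077")  # cluster manager address
    .appName("bison-classification")
    .config("spark.executor.instances", "4")      # one executor per worker node
    .config("spark.executor.cores", "4")          # parallel tasks per executor
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)
```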
3.3. Bison Algorithm for Classification Implemented with Spark
The Bison Algorithm for classification implemented with Spark can be thought of as an iterative algorithm that uses a swarm of solutions (bisons) to find the best way to classify the data points in a dataset. The algorithm improves over time by learning from the best-performing bisons, adjusting the positions of the others, and testing the resulting model on unseen data, thus achieving a high level of accuracy in clustering and classifying data.
To enable the Bison Algorithm to perform the classification task, a few steps need to be taken. The algorithm combines elements of classification and clustering to classify data points. The process starts with a setup phase in which a configuration file is read to determine key parameters, such as how many “bisons” (representing candidate solutions) will be used, how many iterations the algorithm should run, and the paths to the necessary input data files. These parameters are essential, as they dictate the behavior of the algorithm, including how it processes data and how much computational power it allocates.
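A minimal sketch of such a setup phase is shown below; the file name and key names are hypothetical, since the paper does not specify the exact configuration format.

```python
import json

# Hypothetical configuration file (bison_config.json), e.g.:
# {"num_bisons": 50, "num_iterations": 100, "data_path": "data/dataset.csv"}
with open("bison_config.json") as f:
    cfg = json.load(f)

num_bisons = cfg["num_bisons"]          # size of the swarm
num_iterations = cfg["num_iterations"]  # how long the search runs
data_path = cfg["data_path"]            # input data file
```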
After the setup is complete, the algorithm moves into the data preparation phase. Here, the dataset is loaded; each record consists of feature values and a classification label that identifies the class of the point. To ensure that all features contribute equally during processing, the algorithm scales them so that no single feature can dominate due to its magnitude. The labels are also converted into numerical form. The dataset is then split into two parts: one used for training and the other for testing.
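In PySpark, this preparation can be sketched as follows (continuing the hypothetical session above; the column names and the 80/20 split ratio are assumptions):

```python
from pyspark.ml.feature import VectorAssembler, MinMaxScaler, StringIndexer

# Load the dataset; "label" and the remaining feature columns are illustrative.
df = spark.read.csv(data_path, header=True, inferSchema=True)
feature_cols = [c for c in df.columns if c != "label"]

# Collect the raw feature columns into a single vector column.
df = VectorAssembler(inputCols=feature_cols, outputCol="raw_features").transform(df)

# Scale every feature to [0, 1] so that no feature dominates by magnitude.
df = MinMaxScaler(inputCol="raw_features", outputCol="features").fit(df).transform(df)

# Convert the class label into a numeric index.
df = StringIndexer(inputCol="label", outputCol="label_idx").fit(df).transform(df)

# Split into training and testing sets (80/20 is an assumed ratio).
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
```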
Then, the swarm of bisons is initialized. Each bison represents a random point in the search space, and these points act as potential centroids around which data points can be clustered. Essentially, each bison is a “guess” for where groups of similar data points might gather in the feature space. The goal of the algorithm is to improve these centroid locations over time, converging toward the optimal centroid positions.
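One way to sketch this initialization (the number of classes and the feature dimension below are illustrative) is to draw each bison's centroids uniformly from the scaled feature range:

```python
import numpy as np

# Each bison holds k centroids (one per class) of dim features each,
# drawn uniformly from [0, 1] because the features were scaled to that range.
rng = np.random.default_rng(seed=42)
k, dim = 3, 4                              # illustrative values
bisons = rng.uniform(0.0, 1.0, size=(num_bisons, k, dim))
```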
During the training phase, the algorithm goes through several iterations where it evaluates how well the bisons are classifying the data points. For each data point, the algorithm calculates which bison’s centroids are closest to the point using a distance measure.
If the centroids associated with a bison misclassify a point, that bison receives a penalty. Over time, bisons with fewer penalties are considered better classifiers, since they assign points to clusters more accurately. Once the algorithm has evaluated the bisons in an iteration, it updates their positions. The bisons that performed best in this round are designated as elite bisons; they represent the best solutions at this stage. The remaining bisons adjust their positions based on these elite bisons. Some bisons, called swarmers, move closer to the elite bisons, trying to emulate their success. Others, called runners, move in random directions, exploring new parts of the solution space that may yield better results. This balance between exploration and exploitation prevents the algorithm from getting stuck in local optima and helps it search more broadly for better centroids.
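The sketch below illustrates one plausible form of this evaluation and update step. It assumes one centroid per class, with the centroid index serving as the predicted label; the elite/swarmer/runner proportions and step sizes are assumptions, a simplified stand-in rather than the paper's exact update rule.

```python
import numpy as np

def fitness(bison, X, y):
    """Penalty = number of misclassified points; each point takes the
    label of the bison's nearest centroid (centroid index == class)."""
    d = np.linalg.norm(X[:, None, :] - bison[None, :, :], axis=2)  # (n, k)
    return int((d.argmin(axis=1) != y).sum())

def update_swarm(bisons, scores, n_elite=10, n_runners=5,
                 step=0.5, run_scale=0.1, rng=None):
    """Swarmers move toward the centre of the elite group; runners take
    random steps to explore new regions of the search space."""
    rng = rng or np.random.default_rng()
    order = np.argsort(scores)                 # lower penalty = better
    centre = bisons[order[:n_elite]].mean(axis=0)
    new = bisons.copy()
    for i in order[n_elite:len(order) - n_runners]:   # swarmers
        new[i] += step * rng.random() * (centre - bisons[i])
    for i in order[len(order) - n_runners:]:          # runners
        new[i] += run_scale * rng.standard_normal(bisons[i].shape)
    return np.clip(new, 0.0, 1.0)              # stay in the scaled range
```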
After completing the training phase, the algorithm selects the best-performing bison, i.e., the one with the lowest classification error across the iterations. This best bison represents the most accurate model for classifying the dataset. The next step is the testing phase, in which the best bison is applied to the testing dataset to evaluate how well the model classifies new data points. After the classification is completed, the algorithm evaluates its performance using several metrics (described further in Section 4.3).
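Under the same assumptions as the sketches above (with scores holding the per-bison training penalties, and X_test and y_test holding the test features and labels as NumPy arrays), the testing phase reduces to a nearest-centroid prediction with the best bison:

```python
import numpy as np

# Pick the bison with the lowest training penalty and classify the test set
# by assigning each point the label of that bison's nearest centroid.
best = bisons[np.argmin(scores)]
d = np.linalg.norm(X_test[:, None, :] - best[None, :, :], axis=2)
y_pred = d.argmin(axis=1)

accuracy = float((y_pred == y_test).mean())
```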
The algorithm leverages Spark’s distributed computing capabilities to perform data preprocessing, training, and testing in parallel. It distributes the data across multiple worker nodes, processes it simultaneously, and aggregates the results using accumulators and broadcast variables. This parallelization allows the algorithm to scale efficiently and handle large datasets without sacrificing speed or accuracy. Spark ensures that the algorithm can take advantage of cluster resources, making the training and testing phases much faster than they would be on a single machine.
The algorithm uses Apache Spark to parallelize both the data processing and the training of the Bison model by distributing tasks across multiple nodes in a cluster. When the dataset is loaded, it is stored as a DataFrame, which Spark automatically splits and distributes across different worker nodes. Each node processes a portion of the data in parallel. For example, feature scaling, label encoding, and data splitting are all performed concurrently on each partition of the dataset. After the initial data preparation, the DataFrame is converted into an RDD (Resilient Distributed Dataset), which allows the algorithm to perform custom operations such as mapping and reducing in parallel.
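Continuing the earlier sketch, this DataFrame-to-RDD conversion might look as follows (the column names match the hypothetical preparation step above):

```python
import numpy as np

# Convert the prepared DataFrame into an RDD of (features, label) pairs
# so that custom map/reduce logic can run in parallel on each partition.
data_rdd = train_df.rdd.map(
    lambda row: (row["features"].toArray(), int(row["label_idx"]))
).cache()  # cache: the same partitions are reused in every training iteration
```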
In the training phase, the swarm of bisons (candidate solutions) is broadcast to all worker nodes, ensuring that every node has access to the current bison positions without redundant data transfers. Each worker node evaluates how well these bisons classify the data points in its partition and updates the accumulator, a Spark variable that aggregates the misclassification counts from all nodes. This allows the algorithm to compute fitness (classification accuracy) in parallel, as each node processes its data independently. Once the fitness of each bison is evaluated, the bisons are updated, and the next iteration begins. This parallel approach allows the algorithm to scale efficiently, making it capable of handling large datasets and complex computations faster than if it were run on a single machine. The algorithm description is provided in Algorithm 1.
Algorithm 1: Bison Algorithm with RDD
1: Initialize the Bison group randomly
2: Initialize the Runner group around the best solution
3: Initialize the run direction vector
4: Read the dataset into an RDD
5: Initialize Accumulator Accu
6: loop
7:   Broadcast Bison group
8:   for each element in RDD do
9:     Assign the element to the closest centroids
10:    Update Accu
11:  end for
12:  Update the fitness of the bisons based on the values in Accu
13:  Update the run direction vector
14:  Update Bison Group according to the original algorithm
15:  Reset Accu
16: end loop
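A condensed PySpark sketch of this loop, reusing update_swarm from above, is given below. The vector accumulator, the one-centroid-per-class assumption, and the variable names are illustrative; the paper's exact implementation may differ.

```python
import numpy as np
from pyspark.accumulators import AccumulatorParam

class VectorAccumulatorParam(AccumulatorParam):
    """Accumulator that sums one misclassification counter per bison."""
    def zero(self, value):
        return [0] * len(value)
    def addInPlace(self, a, b):
        return [x + y for x, y in zip(a, b)]

for it in range(num_iterations):
    bisons_bc = sc.broadcast(bisons)                   # step 7
    accu = sc.accumulator([0] * num_bisons,            # steps 5 and 15:
                          VectorAccumulatorParam())    # a fresh Accu each round

    def evaluate(point):                               # steps 8-11
        x, label = point
        penalties = [0] * num_bisons
        for i, b in enumerate(bisons_bc.value):
            d = np.linalg.norm(x - b, axis=1)          # distance to each centroid
            if d.argmin() != label:                    # nearest centroid != class
                penalties[i] = 1
        accu.add(penalties)

    data_rdd.foreach(evaluate)                         # runs on the executors

    scores = np.asarray(accu.value)                    # step 12: fitness per bison
    bisons = update_swarm(bisons, scores)              # steps 13-14
    bisons_bc.unpersist()
```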
Figure 2 shows how the dataset is partitioned and then used by the Spark framework. Each partition is sent to an executor, which performs the closest centroid calculations.