Introduction, Implementation and Comparison of Four Randomized Optimization Algorithms

Nini
Mar 4, 2019

Abstract

This study has two main parts. In Part I, I applied three randomized optimization algorithms (randomized hill climbing, simulated annealing, and genetic algorithm) to optimize the weights of a neural network; all three resulted in better fitness than the weights obtained using backpropagation. In Part II, I applied the same three algorithms plus Mutual-Information-Maximizing Input Clustering (MIMIC) to three classic optimization problems. I will start with an introduction to the four algorithms used in this study, continue with the two parts of the analysis, and conclude with a comparison of the pros, cons, and efficiency of each algorithm.

1. Introduction to the four randomized optimization algorithms

1.1 Randomized hill climbing (RHC)

Hill climbing is a mathematical optimization technique that belongs to the family of local search. It starts with a random solution to the problem and keeps making incremental changes that improve the solution until no further improvement can be found. RHC runs hill climbing repeatedly, each time starting from a new random initial guess and moving in the direction of increasing fitness on each iteration. The algorithm uses very little memory and finds the global optimum if the fitness landscape has only a single optimum (e.g., a convex problem).

Figure 1. Hill climbing algorithm
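
To make the loop concrete, here is a minimal Python sketch of RHC (my own illustration, not the ABAGAIL implementation used in this study). The fitness function, random starting point, and neighbor move are problem-specific plug-ins; the toy usage maximizes the number of 1s in a bit string:

```python
import random

def hill_climb(fitness, random_solution, random_neighbor, max_steps=1000):
    """One climb: repeatedly try a random neighbor, keep it if it improves."""
    current = random_solution()
    f_cur = fitness(current)
    for _ in range(max_steps):
        cand = random_neighbor(current)
        f_cand = fitness(cand)
        if f_cand > f_cur:
            current, f_cur = cand, f_cand
    return current

def rhc(fitness, random_solution, random_neighbor, restarts=20):
    """Randomized hill climbing: restart from random points, keep the best."""
    climbs = [hill_climb(fitness, random_solution, random_neighbor)
              for _ in range(restarts)]
    return max(climbs, key=fitness)

def flip_one_bit(s):
    """Neighbor move for bit strings: flip one randomly chosen bit."""
    i = random.randrange(len(s))
    return s[:i] + [1 - s[i]] + s[i + 1:]

# Toy usage: maximize the number of 1s in a 30-bit string.
best = rhc(fitness=sum,
           random_solution=lambda: [random.randint(0, 1) for _ in range(30)],
           random_neighbor=flip_one_bit)
```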

1.2 Simulated annealing (SA)

SA is a hill climbing algorithm with a non-deterministic search for the global optimum. Annealing is the process of a metal cooling and freezing into a minimum-energy crystalline structure. SA mimics this process by occasionally accepting a move that decreases fitness, so it always has a chance to escape from a local optimum. Its key parameter T is the temperature: the higher the temperature, the higher the probability of accepting a worse solution. Typically, T decreases as the algorithm runs. In theory, with a sufficiently slow cooling schedule, SA always finds the global optimum, but it can run very slowly on some problems, and deciding the rate at which to decrease T is a practical difficulty.

Figure 2. Simulated annealing
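
The heart of SA is the acceptance rule: a worse candidate is accepted with probability exp(ΔF/T), where ΔF < 0 is the change in fitness. Below is a minimal Python sketch using geometric cooling (my assumption for the schedule; the Temp and CE parameters reported in Part II roughly correspond to t0 and cooling here, as I understand ABAGAIL's parameterization):

```python
import math
import random

def simulated_annealing(fitness, random_solution, random_neighbor,
                        t0=1e10, cooling=0.95, iters=3000):
    """SA with geometric cooling: T <- cooling * T each iteration."""
    current = random_solution()
    f_cur = fitness(current)
    best, f_best = current, f_cur
    t = t0
    for _ in range(iters):
        cand = random_neighbor(current)
        f_cand = fitness(cand)
        # Always accept improvements; accept worse moves with prob exp(dF/T).
        if f_cand >= f_cur or random.random() < math.exp((f_cand - f_cur) / t):
            current, f_cur = cand, f_cand
            if f_cur > f_best:
                best, f_best = current, f_cur
        t *= cooling  # high T early -> exploration; low T late -> exploitation
    return best
```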

1.3 Genetic algorithm (GA)

GA is a metaheuristic inspired by the process of natural selection that belongs to the larger class of evolutionary algorithms. GA applies crossover and mutation to parent chromosomes selected from a population and places the new offspring in the next generation, from which it returns the best solution, iterating until it finds a satisfactory solution to the given function. It can randomly explore the search space and find solutions that local search algorithms cannot, and it does not need the derivative of the fitness function. However, this method may not perform well on complex problems.

Figure 3. Genetic algorithm
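
A simplified Python sketch of the generational loop follows (the Pop, Mate, and Mutate parameters reported in Part II play roles analogous to pop_size, n_mate, and n_mutate here; this is an illustration, not ABAGAIL's genetic algorithm):

```python
import random

def genetic_algorithm(fitness, random_solution, pop_size=100,
                      n_mate=50, n_mutate=10, generations=200):
    """Toy generational GA on list-encoded chromosomes."""
    pop = [random_solution() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)  # fittest first
        # Crossover: breed offspring from the fitter half of the population.
        offspring = []
        for _ in range(n_mate):
            p1, p2 = random.sample(pop[:pop_size // 2], 2)
            cut = random.randrange(1, len(p1))
            offspring.append(p1[:cut] + p2[cut:])  # one-point crossover
        # Mutation: randomly flip one position in a few offspring
        # (assumes a bit-string encoding).
        for _ in range(n_mutate):
            ind = random.choice(offspring)
            i = random.randrange(len(ind))
            ind[i] = 1 - ind[i]
        # Replacement: offspring displace the weakest individuals.
        pop = pop[:pop_size - n_mate] + offspring
    return max(pop, key=fitness)
```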

1.4 Mutual-Information-Maximizing Input Clustering (MIMIC)

MIMIC is an estimation-of-distribution algorithm that belongs to the larger class of evolutionary algorithms. MIMIC randomly samples from those regions of the input space most likely to contain the optima of the cost function C(x), and then uses an effective density estimator that can capture a wide variety of structure in the input space, yet is computable from simple second-order statistics of the data [1].

Figure 4. The problem with most optimization algorithms is that much of the work done in the first half of the search does not directly benefit the second half of the search.
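
The sample-select-refit loop is easiest to see in a stripped-down estimation-of-distribution sketch. For brevity, this Python version fits independent per-bit probabilities; real MIMIC instead fits a chain of pairwise conditional distributions ordered by mutual information, which is what lets it capture second-order structure between input variables:

```python
import random

def simple_eda(fitness, n_bits, pop_size=100, keep=50, iters=50):
    """Simplified estimation-of-distribution loop in the spirit of MIMIC."""
    probs = [0.5] * n_bits  # start from the uniform distribution
    best, f_best = None, float("-inf")
    for _ in range(iters):
        # Sample a population from the current estimated distribution.
        pop = [[int(random.random() < p) for p in probs]
               for _ in range(pop_size)]
        # Keep only the highest-fitness samples (the "most likely region").
        pop.sort(key=fitness, reverse=True)
        elite = pop[:keep]
        if fitness(elite[0]) > f_best:
            best, f_best = elite[0], fitness(elite[0])
        # Refit the distribution: per-bit frequency among the elite.
        probs = [sum(ind[i] for ind in elite) / keep for i in range(n_bits)]
    return best
```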

In this study, I used the Java ABAGAIL package from Chad Maron's GitHub [2] for both the neural network weights optimization (Part I) and the three optimization problems (Part II).

2. Part I: Neural network weights optimization

2.1 Dataset selection and preprocessing

LendingClub (LC) emerged in 2005–2006 as a peer-to-peer online lending platform that connects individual borrowers with individual investor-lenders. It is crucial for the firm to predict whether a borrower will default.

The data I selected were the most recent publicly available 12 months of LC data, covering 2017Q4–2018Q3. The raw data included 485,484 rows and 145 columns. For this study, I picked the features I wanted to test on; after the data-preprocessing steps described below, the dataset contained 467,732 rows and 17 features.

Defaults are relatively infrequent, so the dataset is imbalanced, with 2.2% defaulters and 97.8% repayors. I under-sampled by randomly removing records from the repayor group, leaving 9,979 defaulters and 9,979 repayors, i.e., a dataset of 19,958 rows and 17 features.
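
This kind of random under-sampling is a one-liner per class in pandas; here is a sketch with hypothetical file and column names (the actual LC schema encodes loan status differently):

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("lc_2017q4_2018q3.csv")
defaulters = df[df["default_flag"] == 1]
repayors = df[df["default_flag"] == 0]

# Under-sample the majority class to the size of the minority class.
repayors_sampled = repayors.sample(n=len(defaulters), random_state=42)
balanced = (pd.concat([defaulters, repayors_sampled])
              .sample(frac=1, random_state=42))  # shuffle rows
```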

I partitioned the dataset into 80% training and 20% testing sets. In this part, I will apply three randomized optimization algorithms (randomized hill climbing, simulated annealing, and genetic algorithm) to optimize the weights of the neural network and compare the results with those obtained using backpropagation.

2.2 Implementation of randomized optimization algorithms

Based on my earlier analysis, the best backpropagation setup was 2 hidden layers, each with 10 nodes, and an initial learning rate of 0.064. Here I use the same neural network architecture but replace backpropagation with the three aforementioned randomized optimization algorithms to optimize the weights. I used the F1 score to measure model performance in this part. I tested each algorithm with various parameters and plotted the results in Figure 5a, then plotted the best model among the tested parameter combinations for each algorithm in Figure 5b. Each algorithm was run for 3,000 iterations.
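
The key idea is that the network's weights are treated as one flat vector and searched directly, scoring each candidate vector by its F1 score on the training data, with no gradients involved. Here is a Python sketch of that idea using RHC (tanh activations and the absence of bias terms are my simplifications; ABAGAIL's implementation differs in its details):

```python
import numpy as np
from sklearn.metrics import f1_score

def forward(weights, X, sizes=(17, 10, 10, 1)):
    """Feed-forward pass for the 17-10-10-1 network used here; `weights`
    is one flat vector sliced into layer matrices (no bias terms)."""
    a, idx = X, 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = weights[idx: idx + n_in * n_out].reshape(n_in, n_out)
        idx += n_in * n_out
        a = np.tanh(a @ W)
    return (a.ravel() > 0).astype(int)  # threshold tanh output at 0

def rhc_weights(X, y, dim=280, iters=3000, step=0.1, seed=0):
    """Hill-climb the flat weight vector, scoring candidates by training F1.
    dim = 17*10 + 10*10 + 10*1 = 280 weights for this architecture."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)
    f = f1_score(y, forward(w, X))
    for _ in range(iters):
        cand = w + step * rng.normal(size=dim)  # random perturbation
        f_cand = f1_score(y, forward(cand, X))
        if f_cand > f:
            w, f = cand, f_cand
    return w, f
```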

I found three things particularly interesting: (1) in Figure 5a, the training and cross-validation models showed similar convergence paths; (2) in Figure 5b, GA showed larger volatility in its F1 score; and (3) backpropagation, RHC, and SA all converged within 3,000 iterations, with backpropagation converging fastest and RHC second, while GA appears to need more iterations to converge.

Figure 5a. F1 scores of the training model and cross validation model for (a) backpropagation, (b) RHC, (c) various mutate parameters for GA, and (d) various cooling exponent parameters for SA.
Figure 5b. F1 scores of the best cross validation models from Figure 5a for (a) backpropagation, (b) RHC, (c) GA, and (d) SA.

In order to better compare model performance, I further compared the best F1 scores and running times of each algorithm in Figure 5c below. Notably, the F1 scores of GA, RHC, and SA are all better than backpropagation's, with GA scoring best and SA slightly lower but very close. As for running time, GA runs much longer than the other three algorithms (327.8 s). From Figure 5b, GA does not appear to have converged after 3,000 iterations, so it cannot be ruled out that GA might find even better solutions given more iterations. In this analysis, SA reached an F1 score of 69.97% versus GA's 69.99%, but was far less time-consuming (18.55 s vs. GA's 327.8 s).

Figure 5c. Left: F1 scores of the testing set for (a) backpropagation, (b) GA, (c) RHC, and (d) SA. Right: running time after 3,000 iterations for (a) backpropagation, (b) GA, (c) RHC, and (d) SA.

In conclusion, all three algorithms tested achieved a better F1 score than backpropagation. Backpropagation ran and converged fastest among all the algorithms tested. GA was the best algorithm for this neural network weights optimization in terms of F1 score, but it takes much longer to run; SA produced results as good as GA's with far less running time on this problem.

3. Part II: Implementation of four algorithms in three problems

3.1 Travelling Salesman Problem (TSP)

TSP is a famous NP-hard combinatorial optimization problem, important in operations research and theoretical computer science. It asks the following question: "Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city and returns to the origin city?" TSP can be modelled as a graph problem, with the cities as the graph's vertices, paths as its edges, and a path's distance as the edge's weight. It is a minimization problem that starts from a specified vertex and finishes at the same vertex after visiting all other vertices exactly once. TSP has applications in planning, logistics, and DNA sequencing.
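
For reference, here is a Python sketch of the objective and a typical neighbor move. The reciprocal fitness is one natural way to turn the minimization into a maximization (and, as far as I know, what ABAGAIL's TSP evaluation function does); this is an illustration, not the code used in the experiments:

```python
import math
import random

def tour_length(tour, coords):
    """Total length of the closed tour (returns to the starting city)."""
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def tsp_fitness(tour, coords):
    """Shorter tour -> higher fitness."""
    return 1.0 / tour_length(tour, coords)

def swap_two_cities(tour):
    """A natural neighbor move for TSP: swap two cities in the permutation."""
    i, j = random.sample(range(len(tour)), 2)
    t = list(tour)
    t[i], t[j] = t[j], t[i]
    return t
```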

Assuming 100 cities (N=100), I tested the four algorithms with various parameters and plotted the results in Figure 6a. Each run was 3,000 iterations. For SA, the best parameters were Temp=1E10 and CE=0.55. For GA, the best parameters were Pop=100, Mate=50, Mutate=10. For MIMIC, the best parameters were Pop=100, Keep=50, m=0.1. MIMIC's performance depends heavily on parameter selection.

Figure 6a. Fitness scores of TSP by (a) RHC, (b) various cooling exponent parameters of the SA algorithm, (c) various mutate parameters of the GA algorithm, and (d) various m parameters of the MIMIC algorithm.

In order to better compare the model performance, I plotted the best model with the highest fitness score for each algorithm in Figure 6b below. The result shows that GA achieved the best fitness score for this problem while MIMIC was the worst.

Figure 6b. Fitness scores of the best models based on the parameters tested. GA algorithm achieved the best fitness score in the TSP.

In order to compare model efficiency, I plotted the running time for TSP in Figure 6c below. MIMIC was the worst and SA was the best in terms of running time.

Figure 6c. The running time of TSP after 3,000 iterations by four randomized optimization algorithms.

In summary, GA was the best algorithm for TSP, and its running time was quite satisfactory as well. MIMIC was the worst in terms of both fitness and efficiency. Parameter selection is a very important factor for MIMIC on this problem, so an entirely different set of parameters might produce a different result.

3.2 Flip Flop Problem (FFP)

FFP counts the number of bit alternations in a bit string: every position whose bit differs from the next position's bit adds 1 to the fitness. A maximum-fitness bit string is one that consists entirely of alternating bits.
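
Concretely, the fitness function is just a count of adjacent differing bits (a Python sketch, equivalent in spirit to ABAGAIL's flip flop evaluation function); for N=1,000 the maximum possible fitness is therefore 999:

```python
def flip_flop_fitness(bits):
    """+1 for every adjacent pair of differing bits."""
    return sum(1 for a, b in zip(bits, bits[1:]) if a != b)

assert flip_flop_fitness([0, 1, 0, 1]) == 3  # fully alternating: N - 1
assert flip_flop_fitness([1, 1, 1, 1]) == 0  # no alternations
```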

Assuming a bit string of length 1,000 (N=1,000), I tested the four algorithms with various parameters and plotted the results in Figure 7a. Each run was 1,000 iterations. For SA, the best parameters were Temp=1E10 and CE=0.35. For GA, the best parameters were Pop=100, Mate=30, Mutate=50. For MIMIC, the best parameters were Pop=100, Keep=50, m=0.9. The performance of SA and GA depends heavily on parameter selection.

Figure 7a. Fitness scores of FFP by (a) RHC, (b) various cooling exponent parameters of SA algorithm, (c) various mutate parameters of GA algorithm, and (d) various m parameters of MIMIC algorithm.

In order to better compare the model performance, I plotted the best model with the highest fitness score for each algorithm in Figure 7b below. The result shows that MIMIC achieved the best fitness score for this problem while GA was the worst. RHC and SA performed almost equally well in terms of fitness score.

Figure 7b. Fitness scores of the best models based on the parameters tested. The MIMIC algorithm achieved the best fitness score in the FFP.

In order to compare model efficiency, I plotted the running time for FFP in Figure 7c below. MIMIC was the worst and SA was the best in terms of running time.

Figure 7c. The running time of FFP after 1,000 iterations by four randomized optimization algorithms.

In summary, as in TSP, MIMIC iterates much more slowly than the other three algorithms; unlike in TSP, however, MIMIC produced a fitness score significantly better than the others on this problem. Parameter selection is a very important factor for SA and GA here, so an entirely different set of parameters might produce a different result.

3.3 Continuous Peaks Problem (CPP)

CPP is a problem with many local optima. In one dimension, a better solution can be found by moving in the positive or negative direction along the x-axis; in a high-dimensional space, a better solution might lie in any direction along each dimension. The advantage of randomized optimization over plain local search is therefore very evident when there are many local optima.
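
As a sketch, one common formulation of Continuous Peaks on bit strings rewards the longer of the longest run of 0s and the longest run of 1s, with a large bonus when both runs exceed a threshold (this is my understanding of the usual formulation; exact constants and thresholds vary by implementation and may differ from the instance used here):

```python
def longest_run(bits, value):
    """Length of the longest contiguous run of `value` in the bit string."""
    best = cur = 0
    for b in bits:
        cur = cur + 1 if b == value else 0
        best = max(best, cur)
    return best

def continuous_peaks_fitness(bits, t):
    """Reward the longer of the longest 0-run and longest 1-run, plus a
    bonus of N when BOTH runs exceed the threshold t; the bonus is what
    creates the isolated peaks that trap plain local search."""
    r0, r1 = longest_run(bits, 0), longest_run(bits, 1)
    bonus = len(bits) if (r0 > t and r1 > t) else 0
    return max(r0, r1) + bonus
```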

Assuming 100 local optima (N=100), I tested the four algorithms with various parameters and plotted the results in Figure 8a. Each run was 5,000 iterations. For SA, the best parameters were Temp=1E10 and CE=0.55. For GA, the best parameters were Pop=100, Mate=50, Mutate=30. For MIMIC, the best parameters were Pop=100, Keep=50, m=0.7. The performance of MIMIC and GA depends heavily on parameter selection.

Figure 8a. Fitness scores of CPP by (a) RHC, (b) various cooling exponent parameters of SA algorithm, (c) various mutate parameters of GA algorithm, and (d) various m parameters of MIMIC algorithm.

In order to better compare the model performance, I plotted the best model with the highest fitness score for each algorithm in Figure 8b below. The result shows that SA achieved the best fitness score for this problem while MIMIC was the worst. RHC and GA performed equally well in terms of fitness score.

Figure 8b. Fitness scores of the best models based on the parameters tested. SA algorithm achieved the best fitness score in the CPP.

In order to compare model efficiency, I plotted the running time for CPP in Figure 8c below. MIMIC was the worst and SA was the best in terms of running time.

Figure 8c. The running time of CPP after 5,000 iterations by four randomized optimization algorithms.

In summary, as with TSP and FFP, MIMIC iterates much more slowly than the other three algorithms. On this problem, SA both produced the best fitness score and iterated the fastest; its fitness score of 99 is the closest to the global optimum (100). MIMIC was the worst on this problem in terms of both fitness score and running time. Parameter selection is a very important factor for RHC and SA here, so an entirely different set of parameters might produce a different result.

4. Conclusions

Based on the implementations above, I summarize the advantages and disadvantages of each algorithm in Table 1 below for comparison. For a given problem, we need to consider its complexity and features to select the best optimization algorithm.

Table 1. Comparison of RHC, SA, GA and MIMIC

References:

[1] https://www.cc.gatech.edu/~isbell/tutorials/mimic-tutorial.pdf

[2] https://github.com/cmaron
