A Genetic Algorithm is a heuristic search technique that can be used to select the best features from a given dataset or to optimize model weights. It is based on Darwin's theory of survival of the fittest: the algorithm repeatedly modifies a population of individual solutions, keeping the fittest and discarding the rest.
Features in Dataset
A dataset is a pair of features and target values, which can also be called inputs and outputs. Features are the inputs provided to the model in order to get a prediction.
A dataset is rarely clean; it may contain many unhelpful features. These garbage features contribute little to the training process. Reducing the feature set can help machine learning models train faster and perform better.
In this example, we are going to use the Breast Cancer dataset available in sklearn.
Visualizing the Dataset:
X -> Features
Here we can see that we have 30 features in the dataset. Let’s find out the feature names.
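The dataset and its feature names can be inspected like this, using sklearn's load_breast_cancer:

```python
from sklearn.datasets import load_breast_cancer

# Load the Breast Cancer dataset bundled with sklearn.
dataset = load_breast_cancer()
X = dataset.data    # features: 569 samples x 30 columns
Y = dataset.target  # labels: 0 = malignant, 1 = benign

print(X.shape)               # (569, 30)
print(dataset.feature_names) # the 30 feature names
```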
Genetic Algorithm Steps:
First of all, we will generate a random population. Each individual is a list of row_size feature indices drawn from the 30 available features.

    import numpy as np

    # sample_size individuals, each holding row_size feature indices (0-29).
    generate = np.zeros([sample_size, row_size], dtype='int16')
    for i in range(sample_size):
        for j in range(row_size):
            generate[i][j] = np.random.randint(0, 30)
Fitness is calculated using an SVM: the classification accuracy on a held-out test split is our fitness value.

    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    def fitness(samples, size):
        accuracy = []
        for i in range(size):
            # Train on only the feature columns selected by this individual.
            X = dataset.data[:, samples[i]]
            Y = dataset.target
            X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.18)
            clf = SVC(kernel='linear', random_state=0)
            hist = clf.fit(X_train, Y_train)
            y_pred = hist.predict(X_test)
            acc = accuracy_score(Y_test, y_pred)
            accuracy.append(acc)
        return accuracy
In the selection process, we find the four fittest individuals and use them later for crossover.

    samples = generate                 # the population generated above
    fit = fitness(samples, sample_size)

    best_finder = fit.copy()
    best_finder.sort(reverse=True)
    # Deduplicate while preserving the sorted order (set() would scramble it).
    best_4 = list(dict.fromkeys(best_finder))
    best_4_a = best_4[0:4]
    indexes = []
    for i in range(4):
        indexes.append(fit.index(best_4_a[i]))
    best_samples = samples[indexes].tolist()
For crossover, each offspring keeps the first half of its own parent and fills the second half from its paired parent, skipping values it already contains. Parents are paired as (0, 1) and (2, 3).

    def crossover(best_samples):
        row_size = len(best_samples[0])
        off_springs = np.zeros([len(best_samples), row_size])
        for i in range(len(best_samples)):
            for j in range(row_size):
                if j < row_size / 2:
                    # First half comes from the parent itself.
                    off_springs[i][j] = best_samples[i][j]
                else:
                    # Second half comes from the paired parent,
                    # taking the next value not already in the offspring.
                    partner = i + 1 if i % 2 == 0 else i - 1
                    k = 0
                    while k < row_size:
                        if best_samples[partner][k] not in off_springs[i]:
                            off_springs[i][j] = best_samples[partner][k]
                            break
                        k += 1
        return off_springs
For mutation, we select one random position and one random feature index that is not already present in the individual, and write it into the offspring produced by the crossover.

    import random

    def mutation(crossed):
        random.seed()
        for i in range(len(crossed)):
            x = random.randint(0, 29)  # candidate feature index
            y = random.randint(0, 7)   # position to overwrite (row_size is 8 here)
            while x in crossed[i]:
                x = random.randint(0, 29)
            crossed[i][y] = x
        return crossed
Finally, after repeating fitness evaluation, selection, crossover, and mutation over several generations, we are left with the accuracy and the list of features that give the maximum prediction accuracy.
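The generational loop that ties the steps together can be sketched as follows. This is a minimal, self-contained sketch: it uses a toy fitness function (the sum of the selected indices) in place of the SVM accuracy, and hard-codes assumed sizes (8 individuals of 8 genes, 30 features, 10 generations) so it runs quickly.

```python
import numpy as np

np.random.seed(0)
SAMPLE_SIZE, ROW_SIZE, N_FEATURES, GENERATIONS = 8, 8, 30, 10

def toy_fitness(individual):
    # Stand-in for the SVM accuracy: rewards higher feature indices.
    return individual.sum()

# Initial random population of feature-index lists.
population = np.random.randint(0, N_FEATURES, size=(SAMPLE_SIZE, ROW_SIZE))

for gen in range(GENERATIONS):
    scores = [toy_fitness(ind) for ind in population]
    # Selection: keep the 4 fittest individuals as parents.
    best_idx = np.argsort(scores)[::-1][:4]
    parents = population[best_idx]
    # Crossover: swap second halves between paired parents (0,1) and (2,3).
    children = parents.copy()
    for a, b in [(0, 1), (2, 3)]:
        children[a, ROW_SIZE // 2:] = parents[b, ROW_SIZE // 2:]
        children[b, ROW_SIZE // 2:] = parents[a, ROW_SIZE // 2:]
    # Mutation: overwrite one random gene in each child.
    for child in children:
        child[np.random.randint(ROW_SIZE)] = np.random.randint(N_FEATURES)
    # Next generation: parents plus their mutated offspring.
    population = np.vstack([parents, children])

best = population[np.argmax([toy_fitness(ind) for ind in population])]
print(best)
```

Swapping toy_fitness for the SVM-based fitness function above turns this sketch into the full feature-selection loop.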