# Feature Reduction from Breast Cancer Dataset with Genetic Algorithm

A genetic algorithm is a heuristic search technique, more directed than brute force, for finding the best features from a given dataset or for optimizing model weights. It is based on Darwin's theory of survival of the fittest: the algorithm repeatedly modifies a population of candidate solutions, keeping the fitter ones from one generation to the next.

## Features in Dataset

A dataset is a pair of features and target values, which can also be thought of as inputs and outputs. Features are the inputs provided by the user to get a prediction.

### Feature Reduction

A dataset is rarely pure; it may contain many features that contribute little to the training process. Removing these unhelpful features lets machine learning models train faster and often perform better.

### Dataset

In this example, we are going to use the Breast Cancer dataset available in sklearn.

#### Visualizing the Dataset:

X -> Features

Y -> Target

Here we can see that we have 30 features in the dataset. Let's find out the feature names.
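A minimal sketch of loading the dataset with sklearn's built-in loader and inspecting the features and target:

```python
# Load the Breast Cancer dataset and inspect its shape and feature names.
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
X = dataset.data    # feature matrix
Y = dataset.target  # 0 = malignant, 1 = benign

print(X.shape)  # (569, 30) -> 569 samples, 30 features
print(Y.shape)  # (569,)
print(list(dataset.feature_names))
```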

## Genetic Algorithm Steps:

First of all, we will generate a random initial population.

### Generating Random Population

```python
# Each individual is a row of random feature indices in the range 0-29.
generate = np.zeros([sample_size, row_size], dtype='int16')
for i in range(0, sample_size):
    for j in range(0, row_size):
        generate[i][j] = round(np.random.uniform(0, 29))
```
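The snippet above depends on `sample_size` and `row_size` being defined; a self-contained version with assumed sizes (8 individuals, each holding 8 feature indices) looks like this:

```python
import numpy as np

# Assumed population dimensions: 8 individuals x 8 feature indices each.
sample_size, row_size = 8, 8

generate = np.zeros([sample_size, row_size], dtype='int16')
for i in range(0, sample_size):
    for j in range(0, row_size):
        # random feature index drawn from the 30 available features
        generate[i][j] = round(np.random.uniform(0, 29))

print(generate.shape)  # (8, 8)
```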

### Fitness calculation

Fitness is calculated using an SVM classifier; here the test-set accuracy is our fitness value.

```python
def fitness(samples, size):
    accuracy = []
    for i in range(size):
        # select the columns of the dataset encoded by this individual
        X = dataset.data[:, samples[i]]
        Y = dataset.target
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.18)
        clf = SVC(kernel='linear', random_state=0)
        hist = clf.fit(X_train, Y_train)
        y_pred = hist.predict(X_test)
        acc = accuracy_score(Y_test, y_pred)
        accuracy.append(acc)
    return accuracy
```
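To see the fitness idea end to end, here is a self-contained check that scores a single candidate feature subset with a linear SVM (the feature indices and the fixed `random_state` in the split are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

dataset = load_breast_cancer()
candidate = [0, 7, 20, 27]  # example subset of the 30 feature indices

X = dataset.data[:, candidate]
Y = dataset.target
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.18, random_state=0)

# the individual's fitness is simply the classifier's test accuracy
clf = SVC(kernel='linear', random_state=0)
clf.fit(X_train, Y_train)
acc = accuracy_score(Y_test, clf.predict(X_test))
print(acc)
```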

### Selection

In the selection process, we will find the four fittest individuals in the population and use them later for crossover.

```python
fit = fitness(samples, sample_size)  # samples is the current population
best_finder = list(set(fit))         # drop duplicate scores first
best_finder.sort(reverse=True)       # then sort from best to worst
best_4_a = best_finder[0:4]          # top 4 unique fitness values
indexes = []
for i in range(0, 4):
    indexes.append(fit.index(best_4_a[i]))
best_samples = samples[indexes].tolist()
```

### Crossover

Each offspring keeps the first half of its parent's genes; the second half is filled from the paired parent, skipping any feature index the offspring already contains.

```python
def crossover(best_samples):
    off_springs = np.zeros([len(best_samples), len(best_samples[0])], dtype='int16')
    for i in range(0, len(best_samples)):
        for j in range(0, len(best_samples[0])):
            if j < (len(best_samples[0]) / 2):
                # first half comes from the parent itself
                off_springs[i][j] = best_samples[i][j]
            else:
                if i % 2 == 0:
                    # even-indexed parents pair with the next parent
                    k = 0
                    while k < len(best_samples[0]):
                        if best_samples[i + 1][k] not in off_springs[i]:
                            off_springs[i][j] = best_samples[i + 1][k]
                            break
                        k += 1
                else:
                    # odd-indexed parents pair with the previous parent
                    k = 0
                    while k < len(best_samples[0]):
                        if best_samples[i - 1][k] not in off_springs[i]:
                            off_springs[i][j] = best_samples[i - 1][k]
                            break
                        k += 1
    return off_springs
```

### Mutation

For mutation, we select one random position and one random feature index that is not already present in the individual, and write it into the offspring produced by crossover.

```python
def mutation(crossed):
    random.seed()
    for i in range(len(crossed)):
        x = random.randint(0, 29)  # candidate new feature index
        y = random.randint(0, 7)   # position to replace (row_size - 1 = 7)
        # keep drawing until the index is not already in the individual
        while x in crossed[i]:
            x = random.randint(0, 29)
        crossed[i][y] = x
    return crossed
```

## Final Result

At last, we have the accuracy and the list of features that yield the maximum prediction accuracy.
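Putting the steps together, here is a compact end-to-end sketch of the whole loop. The population size, generation count, and the simplified single-point crossover and mutation below are assumptions for illustration, not the article's exact settings:

```python
import random
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

dataset = load_breast_cancer()
rng = np.random.default_rng(0)
random.seed(0)

POP, GENES, GENERATIONS = 8, 8, 5  # assumed sizes

def fitness(samples):
    scores = []
    for ind in samples:
        X = dataset.data[:, ind]
        X_train, X_test, Y_train, Y_test = train_test_split(
            X, dataset.target, test_size=0.18, random_state=0)
        clf = SVC(kernel='linear', random_state=0).fit(X_train, Y_train)
        scores.append(accuracy_score(Y_test, clf.predict(X_test)))
    return scores

# initial random population of feature-index vectors
population = rng.integers(0, 30, size=(POP, GENES))

best_acc, best_ind = 0.0, None
for gen in range(GENERATIONS):
    scores = fitness(population)
    order = np.argsort(scores)[::-1]          # best individuals first
    if scores[order[0]] > best_acc:
        best_acc, best_ind = scores[order[0]], population[order[0]].copy()
    parents = population[order[:4]]           # selection: top 4
    children = []
    for a, b in [(0, 1), (1, 0), (2, 3), (3, 2)]:  # single-point crossover
        children.append(np.concatenate([parents[a][:GENES // 2],
                                        parents[b][GENES // 2:]]))
    children = np.array(children)
    for child in children:                    # mutation: one random gene
        child[random.randint(0, GENES - 1)] = random.randint(0, 29)
    population = np.vstack([parents, children])

print(best_acc)
print(sorted(set(best_ind.tolist())))
```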