
Feature Reduction from Breast Cancer Dataset with Genetic Algorithm


A genetic algorithm is a heuristic search technique, inspired by Darwin's theory of survival of the fittest, that can be used to find the best features in a dataset or to optimize weights. Rather than brute-forcing every combination, it repeatedly modifies a population of candidate solutions, keeping the fittest ones across generations.

Features in Dataset

A dataset is a pair of features and target values, which can also be called inputs and outputs. Features are the inputs provided by the user to obtain a prediction.

Feature Reduction:

A dataset is rarely clean; it may contain many useless features. These garbage features contribute little to the training process. Reducing them can help machine learning models train faster and perform better.

Dataset

In this example, we are going to use the Breast Cancer dataset available in sklearn.

Visualizing the Dataset:

X -> Features

Y -> Target

Shape of X and Y
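
A minimal sketch of loading the data and checking the shapes:

from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
X = dataset.data    # features
Y = dataset.target  # targets (0 = malignant, 1 = benign)
print(X.shape)  # (569, 30)
print(Y.shape)  # (569,)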

Here we can see that we have 30 features in the dataset. Let’s find out the feature names.

feature_names
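
They can be printed directly from the loaded dataset:

print(dataset.feature_names)
# ['mean radius' 'mean texture' 'mean perimeter' ... 'worst fractal dimension']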

Genetic Algorithm Steps:

A genetic algorithm cycles through population generation, fitness evaluation, selection, crossover, and mutation. First, we will generate a random population.

Generating Random Population

import numpy as np

sample_size, row_size = 8, 8  # assumed sizes (8 matches the mutation step's index range)
generate = np.zeros([sample_size, row_size], dtype='int16')
for i in range(sample_size):
  for j in range(row_size):
    generate[i][j] = round(np.random.uniform(0, 29))  # random feature index 0-29
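
Note that np.random.uniform can draw the same feature index twice within one individual; a duplicate-free alternative (my sketch, not part of the original post) would use np.random.choice:

generate = np.array([np.random.choice(30, row_size, replace=False)
                     for _ in range(sample_size)], dtype='int16')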

Fitness Calculation

Fitness is calculated using an SVM: we train a classifier on the selected features, and its test accuracy is the fitness value.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def fitness(samples, size):
  accuracy = []
  for i in range(size):
    X = dataset.data[:, samples[i]]  # keep only this individual's selected features
    Y = dataset.target
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.18)
    clf = SVC(kernel='linear', random_state=0)
    clf.fit(X_train, Y_train)
    y_pred = clf.predict(X_test)
    accuracy.append(accuracy_score(Y_test, y_pred))  # test accuracy = fitness
  return accuracy

Selection

In the selection process, we find the four fittest individuals and use them later for crossover.

fit = fitness(generate, sample_size)         # fitness score of every individual
best_4 = sorted(set(fit), reverse=True)[:4]  # four best distinct scores
indexes = [fit.index(score) for score in best_4]
best_samples = generate[indexes].tolist()    # the four fittest individuals

Crossover

In crossover, the parents are paired off (the first with the second, the third with the fourth). Each offspring keeps the first half of its own parent's genes, and the second half is filled with genes from the partner that the offspring does not already contain.

def crossover(best_samples):
  off_springs = np.zeros([len(best_samples), len(best_samples[0])], dtype='int16')
  for i in range(len(best_samples)):
    # parents are paired: 0 with 1, 2 with 3
    partner = i + 1 if i in (0, 2) else i - 1
    for j in range(len(best_samples[0])):
      if j < len(best_samples[0]) / 2:
        off_springs[i][j] = best_samples[i][j]  # first half: own genes
      else:
        # second half: first gene of the partner not already in this offspring
        for k in range(len(best_samples[partner])):
          if best_samples[partner][k] not in off_springs[i]:
            off_springs[i][j] = best_samples[partner][k]
            break
  return off_springs

Mutation

For mutation, we pick one random gene position and one random feature value that is not already present in the individual, and write that value into the offspring produced by crossover.

import random

def mutation(crossed):
  random.seed()
  for i in range(len(crossed)):
    x = random.randint(0, 29)  # candidate feature index
    y = random.randint(0, 7)   # gene position to overwrite (row_size = 8)
    while x in crossed[i]:     # redraw until the index is not already used
      x = random.randint(0, 29)
    crossed[i][y] = x
  return crossed

Final Result

At last, we have the accuracy and the list of features that give the maximum prediction accuracy.
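
Putting the pieces together, a minimal evolution loop might look like the sketch below. The generation count (n_generations = 10) is my assumption, and the selection here ranks indices with np.argsort instead of the set-based lookup above, so the loop stays robust to tied scores:

n_generations = 10  # assumed; the post does not specify a generation count
samples = generate
for _ in range(n_generations):
  fit = fitness(samples, len(samples))
  indexes = np.argsort(fit)[::-1][:4]   # indices of the four fittest individuals
  best_samples = samples[indexes].tolist()
  samples = mutation(crossover(best_samples))

fit = fitness(samples, len(samples))
best = samples[int(np.argmax(fit))]
print('accuracy:', max(fit))
print('best_features:', dataset.feature_names[best])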

 


Full Code

The full code is available on Google Colab.
