In a given iteration, will every strain access the same or different sampled data? (cross validation) #137

jdmoore7 · 2022-10-17T17:46:47Z

I'm using PyGAD for hyper-parameter tuning with cross-validation. I sample the train/test data inside the fitness function, and I'm unclear whether every strain i in iteration x sees the same sampled data, or whether the sample differs across strains within the same iteration.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils.random import sample_without_replacement
import numpy as np
import pygad

gene_space = [ 
    # n_estimators
    np.linspace(50,200,25, dtype='int'),
    # min_samples_split, 
    np.linspace(2,10,5, dtype='int'),
    # min_samples_leaf,
    np.linspace(1,10,5, dtype='int'),
    # min_impurity_decrease
    np.linspace(0,1,10, dtype='float')
]

def fitness_function_factory(hyperparameters, data, y_name, sample_size):

    def fitness_function(solution, solution_idx):
        model = RandomForestClassifier(
            # PyGAD passes gene values as floats by default; cast the integer hyper-parameters.
            n_estimators=int(solution[0]),
            min_samples_split=int(solution[1]),
            min_samples_leaf=int(solution[2]),
            min_impurity_decrease=solution[3]
        )
        
        X = data.drop(columns=[y_name])
        y = data[y_name]
        X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                            test_size=0.5)        

        train_idx = sample_without_replacement(n_population=len(X_train), 
                                              n_samples=sample_size)         
        
        test_idx = sample_without_replacement(n_population=len(X_test), 
                                              n_samples=sample_size) 
         
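        # train_idx and test_idx are row positions, so index the pandas objects with .iloc
        model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
        fitness = model.score(X_test.iloc[test_idx], y_test.iloc[test_idx])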
        
        return fitness 

    return fitness_function

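cross_validation = pygad.GA(gene_space=gene_space,
                      # hyperparameters, data, y_name and sample_size must be defined beforehand;
                      # the factory has to be called so that the inner function is passed to PyGAD
                      fitness_func=fitness_function_factory(hyperparameters, data, y_name, sample_size),
                      num_generations=100,
                      num_parents_mating=4,
                      sol_per_pop=8,
                      num_genes=len(gene_space),
                      parent_selection_type='sss',
                      keep_parents=2,
                      crossover_type="single_point",
                      mutation_type="random",
                      mutation_percent_genes=10)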

ahmedfgad (Maintainer) · 2023-02-19T01:33:51Z

@jdmoore7,

Sorry for the late reply!

The fitness function is called once for each individual solution (or strain, as you put it). Because your sampling happens inside the fitness function, the sample will differ from one solution to another, even within the same generation.
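
To make this concrete, here is a minimal stand-in for one generation. It assumes only what is stated above, that the fitness function is invoked once per solution. It is not PyGAD's own code: `evaluate_generation`, `seen_samples` and the sizes are made up for illustration, and the standard-library `random.sample` stands in for `sample_without_replacement`.

```python
import random

seen_samples = []  # records the indices drawn on each fitness call

def fitness_function(solution, solution_idx):
    # Sampling inside the fitness function: a fresh draw on every call.
    idx = random.sample(range(1000), k=20)
    seen_samples.append((solution_idx, sorted(idx)))
    return 0.0  # placeholder score; a real function would fit and score a model

def evaluate_generation(population):
    # Hypothetical stand-in for how a GA scores one generation:
    # one fitness call per solution.
    return [fitness_function(sol, i) for i, sol in enumerate(population)]

random.seed(0)
population = [[50, 2, 1, 0.0]] * 8  # eight identical solutions (sol_per_pop=8)
evaluate_generation(population)

distinct = {tuple(idx) for _, idx in seen_samples}
print(len(seen_samples), "calls,", len(distinct), "distinct samples")
```

Even with identical solutions, each call draws its own sample, so fitness values within one generation are not measured on the same data.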

If you want to sample once for all solutions (strains) in the same generation (iteration), then you can:

For the first generation, do the sampling in code outside the fitness function.
For all later generations, do the sampling in the on_generation() callback, and pass that callback to pygad.GA.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils.random import sample_without_replacement
import numpy as np
import pygad

gene_space = [ 
    # n_estimators
    np.linspace(50,200,25, dtype='int'),
    # min_samples_split, 
    np.linspace(2,10,5, dtype='int'),
    # min_samples_leaf,
    np.linspace(1,10,5, dtype='int'),
    # min_impurity_decrease
    np.linspace(0,1,10, dtype='float')
]

X = data.drop(columns=[y_name])
y = data[y_name]

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.5)        

train_idx = sample_without_replacement(n_population=len(X_train), 
                                       n_samples=sample_size)         
        
test_idx = sample_without_replacement(n_population=len(X_test), 
                                      n_samples=sample_size) 

def on_generation(ga_instance):
    global X, y, X_train, X_test, y_train, y_test, train_idx, test_idx
    X = data.drop(columns=[y_name])
    y = data[y_name]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=0.5)        
    
    train_idx = sample_without_replacement(n_population=len(X_train), 
                                           n_samples=sample_size)         
            
    test_idx = sample_without_replacement(n_population=len(X_test), 
                                          n_samples=sample_size) 

def fitness_function_factory(hyperparameters, data, y_name, sample_size):

    def fitness_function(solution, solution_idx):
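        model = RandomForestClassifier(
            # PyGAD passes gene values as floats by default; cast the integer hyper-parameters.
            n_estimators=int(solution[0]),
            min_samples_split=int(solution[1]),
            min_samples_leaf=int(solution[2]),
            min_impurity_decrease=solution[3]
        )
        global X, y, X_train, X_test, y_train, y_test, train_idx, test_idx

        # train_idx and test_idx are row positions, so index the pandas objects with .iloc
        model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
        fitness = model.score(X_test.iloc[test_idx], y_test.iloc[test_idx])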
        
        return fitness 

    return fitness_function

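cross_validation = pygad.GA(gene_space=gene_space,
                      # call the factory so the inner function is passed to PyGAD;
                      # hyperparameters is unused by the factory but must be defined
                      fitness_func=fitness_function_factory(hyperparameters, data, y_name, sample_size),
                      # re-sample after every generation
                      on_generation=on_generation,
                      num_generations=100,
                      num_parents_mating=4,
                      sol_per_pop=8,
                      num_genes=len(gene_space),
                      parent_selection_type='sss',
                      keep_parents=2,
                      crossover_type="single_point",
                      mutation_type="random",
                      mutation_percent_genes=10)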
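
The same idea can be written without module-level globals. The sketch below is one possible alternative, not part of PyGAD: `GenerationSampler` is a hypothetical helper that holds the current train/test indices, `resample()` is meant to be called once per generation from `on_generation`, and every fitness call in between reads the same indices. It uses the standard-library `random` module instead of scikit-learn so that it runs on its own, and the sizes are arbitrary.

```python
import random

class GenerationSampler:
    """Hypothetical helper: holds one train/test index sample at a time."""

    def __init__(self, n_train, n_test, sample_size, seed=None):
        self.n_train = n_train
        self.n_test = n_test
        self.sample_size = sample_size
        self.rng = random.Random(seed)
        self.resample()  # initial sample, used by the first generation

    def resample(self):
        # Draw new row positions without replacement.
        self.train_idx = self.rng.sample(range(self.n_train), self.sample_size)
        self.test_idx = self.rng.sample(range(self.n_test), self.sample_size)

sampler = GenerationSampler(n_train=500, n_test=500, sample_size=50, seed=0)
used = []  # which sample each fitness call saw, for demonstration only

def fitness_function(solution, solution_idx):
    # Reads the shared sample; does not draw a new one.
    # A real version would fit on X_train.iloc[sampler.train_idx]
    # and score on X_test.iloc[sampler.test_idx].
    used.append((tuple(sampler.train_idx), tuple(sampler.test_idx)))
    return 0.0  # placeholder score

def on_generation(ga_instance):
    # Meant to run after each generation: refresh the sample for the next one.
    sampler.resample()

# Simulate two generations of three solutions each.
for i in range(3):
    fitness_function(None, i)
on_generation(None)
for i in range(3):
    fitness_function(None, i)

gen1, gen2 = used[:3], used[3:]
print(len(set(gen1)) == 1, len(set(gen2)) == 1, gen1[0] != gen2[0])
```

With PyGAD, this would be wired in as `fitness_func=fitness_function` and `on_generation=on_generation` in the `pygad.GA(...)` call. Note that more recent PyGAD releases expect the fitness function to take `ga_instance` as its first argument, so the signature may need adjusting for your version.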
