|
Genetic algorithms
The Genetic Algorithms (GA) [1,2], in the BioChemics Consulting's
approaches, are used above all to generate procedures able
to select the most relevant molecular descriptors for database
mining [3]. In fact, amidst the several selection proposed
in the literature, GA are probably the most powerful
methods, as they are able to thoroughly explore the molecular
descriptor hyperspace.
GA methods, inspired by population genetics, consist of a
population of individuals competing on a "survival of the
fittest" basis. Each individual, or chromosome, represents
a trial solution of the problem to solve. In the context
of descriptor selection, the structure of the chromosome
is very simple. Each descriptor is coded by a bit (0 or 1)
and represents a component of the chromosome. 0 defines
the absence of the descriptor, 1 defines its presence.
Figure 1. Simplified representation of a population
of chromosomes.
The algorithm, transforming the chromosome population into
a new population with more adapted individuals, proceeds
in successive steps called generations
(Figure 2). During each generation, the population evolves by means
of a "fitness" function [4] that selects individuals by standard
crossover and mutation operators [5].
Figure 2. Steps defining the descriptor selection procedure.
Evaluating a population by the fitness function
Selecting the best chromosomes requires to define a fitness
function able to evaluate the "quality" of the solutions proposed,
i.e. their ability to discriminate data pertaining to different
classes. Therefore, this selection step has to be "inactive" and
not to replace other classification tools able to find complex edges
for region separation. If the compounds are partitioned into
approximately homogenous groups, only similar compounds are
likely to be located into the same group. If the number of
homogeneous groups is relevant, a reduced descriptor hyperspace
should be suitable to discriminate different
activity classes. The fitness function proposed here is based on this
similarity concept and it is implemented by a Fuzzy Clustering (FC)
procedure [3,6]. More particularly, the analysis of the cluster
structures can be achieved by partitioning the fuzzy sets by
means of interval alpha cuts Ια [7] (Figure 3), whose
importance is modulated by weights that grow less when the
distances from the cluster centers increase.
Figure 3. Representation of a cluster subdivided into
Ια intervals.
= category 1;
= category 2.
The fitness score is firstly computed on an only cluster and
Ια interval.
This score is 1 if only one class of compounds is included
in the interval, then, for a given descriptor subset, a good
discrimination power should be represented by scores close to 1.
After gathering the scores for all clusters and Ια
intervals, the global fitness score of each chromosome derives
from the average of the values computed on a series of training
and test sets previously selected,
in order to prevent over-fitting and poor generalization.
Crossover and mutation operators
A crossover operator takes two chromosomes and produces
two new chromosomes by swapping segments of genetic material
(e.g. bits). The number of chromosomes involved in the crossover
operation is defined according to the probability of crossover.
Different crossover technique can be found in the literature [1].
The one-point crossover consists in randomly partitioning into
two sections the bit string representing a chromosome population
(Figure 3). The parts of the parent chromosomes are swapped
and two children can be created, containing a part of each
parent. The two-point crossover is similar: two points are
determined partitioning each chromosome into three parts.
Each child contains two parts of one parent and one part
of the other. In the proposed procedure,
the one-point crossover is used.
Figure 4. Examples of one-point and two-point crossover
operations.
A mutation operator, when chromosomes are represented by a binary string,
randomly switches some bits (Figure 4).
Only a few chromosomes are affected by the
mutation operator, according to the probability of mutation,
amidst the global population. The mutation serves to create
random diversity in the population and it should prevent the
algorithm convergence towards a non-optimal solution.
Figure 5. Examples of mutation operation.
Local optimization by stepwise method
GA are very effective for exploratory search, applicable to problems
where little knowledge is available, but they are not particularly
suitable for local search. In the latter case, they are combined
with a stepwise approach in order to reach local convergence [8].
Stepwise approaches are quick and adapted to finding a solution in
"promising" areas already identified. They are divided into two steps,
named ascending and descending procedures. In the ascending
procedure (Figure 5), each inactive descriptor of a give
chromosome is separately and successively activated.
Each new chromosome generated is evaluated by fitness function and
only the best one is kept. If the difference between new and old score
exceeds a threshold value, the procedure goes on, otherwise
it is stopped.
Figure 6. Steps defining the ascending procedure in a
stepwise technique.
In the succeeding descending procedure (Figure 6), all descriptors of
the chromosome selected in the ascending procedure are separately and
successively inactivated. The best chromosome is kept and if the gap
between new and old scores is not lower than a threshold value, the
procedure goes on, otherwise it is stopped.
Figure 7. Steps defining the descending procedure in a
stepwise technique.
End conditions
For the procedure end, to be considered as satisfactory, the
difference between the fitness function scores in two
successive steps must be inferior to a fixed parameter
and the chromosome populations have to be comparable.
This situation must be repeated for a given number of iterations.
References
- K.E. Kinnear, Advances in Genetic Programming, MIT Press, Cambridge, MA, 1994.
- D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, New York, 1989.
- F. Ros, M. Pintore, J.R. Chrétien, Molecular descriptor selection combining genetic algorithms and fuzzy logic: application to database mining procedures, Chemometr. Intell. Lab. Syst., 63 (2002) 15-26.
- L. Davis, Handbook of Genetic Algorithms, Van Nostrand Reinhold, New York, NY, USA, 1991.
- K.E. Kinnear, Advances in Genetic Programming, MIT Press, Cambridge, MA, USA, 1994.
- F. Ros, K. Audouze, M. Pintore, J.R. Chretien, Hybrid System for Virtual Screening: Interest of Fuzzy Clustering Applied to Olfaction, SAR QSAR Env. Res., 11 (2000) 281-300.
- W. Pedrycz, Fuzzy Sets in Pattern Recognition: Methodology and Methods, Pattern Recognition, 23 (1990) 121-146.
- R. Leardi, A.L. Gonzales, Genetic algorithms applied to feature selection in PLS regression: how and when to use them, Chemometr. Intell. Lab. Syst., 41 (1998) 195-207.
|