>home     >contact     >sitemap     >login
BIOCHEMICS CONSULTING SAS BCX logo
  About us Our Team Expertise Services
Expertise


Data mining

    Fuzzy logic

    Genetic algorithms

    Neural networks

Molecular modeling

Examples of achievements

Selected publications

Genetic algorithms

The Genetic Algorithms (GA) [1,2], in the BioChemics Consulting's approaches, are used above all to generate procedures able to select the most relevant molecular descriptors for database mining [3]. In fact, amidst the several selection proposed in the literature, GA are probably the most powerful methods, as they are able to thoroughly explore the molecular descriptor hyperspace.

GA methods, inspired by population genetics, consist of a population of individuals competing on a "survival of the fittest" basis. Each individual, or chromosome, represents a trial solution of the problem to solve. In the context of descriptor selection, the structure of the chromosome is very simple. Each descriptor is coded by a bit (0 or 1) and represents a component of the chromosome. 0 defines the absence of the descriptor, 1 defines its presence.

Figure 1. Simplified representation of a population of chromosomes.

The algorithm, transforming the chromosome population into a new population with more adapted individuals, proceeds in successive steps called generations (Figure 2). During each generation, the population evolves by means of a "fitness" function [4] that selects individuals by standard crossover and mutation operators [5].

Figure 2. Steps defining the descriptor selection procedure.

Evaluating a population by the fitness function

Selecting the best chromosomes requires to define a fitness function able to evaluate the "quality" of the solutions proposed, i.e. their ability to discriminate data pertaining to different classes. Therefore, this selection step has to be "inactive" and not to replace other classification tools able to find complex edges for region separation. If the compounds are partitioned into approximately homogenous groups, only similar compounds are likely to be located into the same group. If the number of homogeneous groups is relevant, a reduced descriptor hyperspace should be suitable to discriminate different activity classes. The fitness function proposed here is based on this similarity concept and it is implemented by a Fuzzy Clustering (FC) procedure [3,6]. More particularly, the analysis of the cluster structures can be achieved by partitioning the fuzzy sets by means of interval alpha cuts Ια [7] (Figure 3), whose importance is modulated by weights that grow less when the distances from the cluster centers increase.

Figure 3. Representation of a cluster subdivided into Ια intervals.
 = category 1;  = category 2.

The fitness score is firstly computed on an only cluster and Ια interval. This score is 1 if only one class of compounds is included in the interval, then, for a given descriptor subset, a good discrimination power should be represented by scores close to 1.

After gathering the scores for all clusters and Ια intervals, the global fitness score of each chromosome derives from the average of the values computed on a series of training and test sets previously selected, in order to prevent over-fitting and poor generalization.

Crossover and mutation operators

A crossover operator takes two chromosomes and produces two new chromosomes by swapping segments of genetic material (e.g. bits). The number of chromosomes involved in the crossover operation is defined according to the probability of crossover. Different crossover technique can be found in the literature [1]. The one-point crossover consists in randomly partitioning into two sections the bit string representing a chromosome population (Figure 3). The parts of the parent chromosomes are swapped and two children can be created, containing a part of each parent. The two-point crossover is similar: two points are determined partitioning each chromosome into three parts. Each child contains two parts of one parent and one part of the other. In the proposed procedure, the one-point crossover is used.

Figure 4. Examples of one-point and two-point crossover operations.

A mutation operator, when chromosomes are represented by a binary string, randomly switches some bits (Figure 4). Only a few chromosomes are affected by the mutation operator, according to the probability of mutation, amidst the global population. The mutation serves to create random diversity in the population and it should prevent the algorithm convergence towards a non-optimal solution.

Figure 5. Examples of mutation operation.

Local optimization by stepwise method

GA are very effective for exploratory search, applicable to problems where little knowledge is available, but they are not particularly suitable for local search. In the latter case, they are combined with a stepwise approach in order to reach local convergence [8]. Stepwise approaches are quick and adapted to finding a solution in "promising" areas already identified. They are divided into two steps, named ascending and descending procedures. In the ascending procedure (Figure 5), each inactive descriptor of a give chromosome is separately and successively activated. Each new chromosome generated is evaluated by fitness function and only the best one is kept. If the difference between new and old score exceeds a threshold value, the procedure goes on, otherwise it is stopped.


Figure 6. Steps defining the ascending procedure in a stepwise technique.

In the succeeding descending procedure (Figure 6), all descriptors of the chromosome selected in the ascending procedure are separately and successively inactivated. The best chromosome is kept and if the gap between new and old scores is not lower than a threshold value, the procedure goes on, otherwise it is stopped.



Figure 7. Steps defining the descending procedure in a stepwise technique.

End conditions

For the procedure end, to be considered as satisfactory, the difference between the fitness function scores in two successive steps must be inferior to a fixed parameter and the chromosome populations have to be comparable. This situation must be repeated for a given number of iterations.


References

  1. K.E. Kinnear, Advances in Genetic Programming, MIT Press, Cambridge, MA, 1994.
  2. D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, New York, 1989.
  3. F. Ros, M. Pintore, J.R. Chrétien, Molecular descriptor selection combining genetic algorithms and fuzzy logic: application to database mining procedures, Chemometr. Intell. Lab. Syst., 63 (2002) 15-26.
  4. L. Davis, Handbook of Genetic Algorithms, Van Nostrand Reinhold, New York, NY, USA, 1991.
  5. K.E. Kinnear, Advances in Genetic Programming, MIT Press, Cambridge, MA, USA, 1994.
  6. F. Ros, K. Audouze, M. Pintore, J.R. Chretien, Hybrid System for Virtual Screening: Interest of Fuzzy Clustering Applied to Olfaction, SAR QSAR Env. Res., 11 (2000) 281-300.
  7. W. Pedrycz, Fuzzy Sets in Pattern Recognition: Methodology and Methods, Pattern Recognition, 23 (1990) 121-146.
  8. R. Leardi, A.L. Gonzales, Genetic algorithms applied to feature selection in PLS regression: how and when to use them, Chemometr. Intell. Lab. Syst., 41 (1998) 195-207.
TOP
Biochemics Consulting SAS
Pépinière d’Entreprises - 111 bd Duhamel de Monceau
45 166 Olivet Cedex - France

® BCX 2004, All Rights Reserved.
- Biochemics Consulting SAS - 16 rue Leonard de Vinci - F 45074 ORLEANS Cedex 2 - Capital: 100.000€ - RCS SIRET 441 128 980 000 1 4 -