Evolutionary Algorithm
======================


``stk`` includes an evolutionary algorithm which
can be used to evolve molecules that fulfil user defined design criteria.
The evolutionary algorithm can be run from the command line using::

    $ python -m stk.ea input_file.py

The input file is a simple python script which defines the
calculators the evolutionary algorithm should use, as well as some optional
parameters.

The evolutionary algorithm automatically works with any molecules that ``stk``
uses, both :class:`.BuildingBlock` and :class:`.ConstructedMolecule` objects.

Take for example the following input file which runs the EA on polymers
and selects building blocks which have the most atoms.

.. code-block:: python

    # #####################################################################
    # Imports.
    # #####################################################################

    import stk
    import logging

    # #####################################################################
    # Pick a random seed.
    # #####################################################################

    random_seed = 12

    # #####################################################################
    # Run GA serially.
    # #####################################################################

    num_processes = 1

    # #####################################################################
    # Set logging level.
    # #####################################################################

    logging_level = logging.DEBUG

    # #####################################################################
    # Initial population.
    # #####################################################################

    carbon = 'C'
    building_blocks = [
        stk.BuildingBlock(f'[Br]{carbon*i}[Br]', ['bromine'])
        for i in range(2, 27)
    ]

    topology_graphs = [
        stk.polymer.Linear('A', 3),
        stk.polymer.Linear('A', 6),
        stk.polymer.Linear('A', 12)
    ]

    population = stk.EAPopulation.init_random(
        building_blocks=[building_blocks],
        topology_graphs=topology_graphs,
        size=25,
        use_cache=True,
        random_seed=random_seed
    )

    # #####################################################################
    # Selector for selecting the next generation.
    # #####################################################################

    generation_selector = stk.SelectorSequence(
        stk.Fittest(num_batches=3, duplicates=False),
        stk.Roulette(
            num_batches=22,
            duplicates=False,
            random_seed=random_seed
        )
    )

    # #####################################################################
    # Selector for selecting parents.
    # #####################################################################

    crossover_selector = stk.AboveAverage(num_batches=5, batch_size=2)

    # #####################################################################
    # Selector for selecting molecules for mutation.
    # #####################################################################

    mutation_selector = stk.SelectorFunnel(
        stk.AboveAverage(num_batches=10, duplicates=False),
        stk.Roulette(num_batches=5, random_seed=random_seed)
    )

    # #####################################################################
    # Crosser.
    # #####################################################################

    crosser = stk.Jumble(
        num_offspring_building_blocks=3,
        random_seed=random_seed
    )

    # #####################################################################
    # Mutator.
    # #####################################################################

    mutator = stk.RandomMutation(
        stk.RandomTopologyGraph(topology_graphs, random_seed=random_seed),
        stk.RandomBuildingBlock(
            building_blocks=building_blocks,
            key=lambda mol: True,
            random_seed=random_seed
        ),
        stk.SimilarBuildingBlock(
            building_blocks=building_blocks,
            key=lambda mol: True,
            duplicate_building_blocks=False,
            random_seed=random_seed
        ),
        random_seed=random_seed
    )

    # #####################################################################
    # Optimizer.
    # #####################################################################

    optimizer = stk.NullOptimizer(use_cache=True)

    # #####################################################################
    # Fitness calculator.
    # #####################################################################


    def num_atoms(mol):
        return len(mol.atoms)


    fitness_calculator = stk.PropertyVector(num_atoms)

    # #####################################################################
    # Fitness normalizer.
    # #####################################################################

    # The PropertyVector fitness calculator will set the fitness as
    # [n_atoms] use the Sum() fitness normalizer to convert the fitness to
    # just n_atoms^0.5. The sqrt is because we use the Power normalizer.
    fitness_normalizer = stk.NormalizerSequence(
        stk.Power(0.5),
        stk.Sum()
    )

    # #####################################################################
    # Exit condition.
    # #####################################################################

    terminator = stk.NumGenerations(25)

    # #####################################################################
    # Make plotters.
    # #####################################################################

    plotters = [
        stk.ProgressPlotter(
            filename='fitness_plot',
            property_fn=lambda mol: mol.fitness,
            y_label='Fitness',
        ),
        stk.ProgressPlotter(
            filename='atom_number_plot',
            property_fn=lambda mol: len(mol.atoms),
            y_label='Number of Atoms',
        )
    ]

    stk.SelectionPlotter(
        filename='generational_selection',
        selector=generation_selector,
        molecule_label=lambda mol: f'{mol.id} - {mol.fitness}',
        x_label='Molecule: id - fitness value'
    )
    stk.SelectionPlotter(
        filename='crossover_selection',
        selector=crossover_selector,
        molecule_label=lambda mol: f'{mol.id} - {mol.fitness}',
        x_label='Molecule: id - fitness value'
    )
    stk.SelectionPlotter(
        filename='mutation_selection',
        selector=mutation_selector,
        molecule_label=lambda mol: f'{mol.id} - {mol.fitness}',
        x_label='Molecule: id - fitness value'
    )


Running the evolutionary algorithm with this input file::

    $ python -m stk.ea big_monomers.py

will produce the following directory structure::

    |-- stk_ea_runs
    |   |-- 0
    |   |   |-- scratch
    |   |   |   |-- atom_number_plot.png
    |   |   |   |-- atom_number_plot.csv
    |   |   |   |-- fitness_plot.png
    |   |   |   |-- fitness_plot.csv
    |   |   |   |-- generational_selection_1.png
    |   |   |   |-- crossover_selection_1.png
    |   |   |   |-- mutation_selection_1.png
    |   |   |   |-- progress.log
    |   |   |   |-- ...
    |   |   |
    |   |   |-- final_pop
    |   |   |   |-- 150.mol
    |   |   |   |-- 2160.mol
    |   |   |   |-- 9471.mol
    |   |   |   |-- ...
    |   |   |
    |   |   |-- big_monomers.py
    |   |   |-- database.json
    |   |   |-- progress.json
    |   |   |-- errors.log
    |   |   |-- output.tgz

A glance at the evolutionary progress plot in
``scratch/fitness_plot.png`` will show us how well our EA did.

.. image:: figures/epp.png


Running the evolutionary algorithm again::

    $ python -m stk.ea big_monomers.py

will add a second subfolder with the same structure::

    |-- stk_ea_runs
    |   |-- 0
    |   |   |-- counters
    |   |   |   |-- gen_1_crossover_counter.png
    |   |   |   |-- gen_1_mutation_counter.png
    |   |   |   |-- gen_1_selection_counter.png
    |   |   |   |-- ...
    |   |   |
    |   |   |-- final_pop
    |   |   |   |-- 150.mol
    |   |   |   |-- 2160.mol
    |   |   |   |-- 9471.mol
    |   |   |   |-- ...
    |   |   |
    |   |   |-- big_monomers.py
    |   |   |-- database.json
    |   |   |-- progress.json
    |   |   |-- errors.log
    |   |   |-- progress.log
    |   |   |-- epp.png
    |   |   |-- epp.csv
    |   |   |-- output.tgz
    |
    |   |-- 1
    |   |   |-- scratch
    |   |   |   |-- atom_number_plot.png
    |   |   |   |-- atom_number_plot.csv
    |   |   |   |-- fitness_plot.png
    |   |   |   |-- fitness_plot.csv
    |   |   |   |-- generational_selection_1.png
    |   |   |   |-- crossover_selection_1.png
    |   |   |   |-- mutation_selection_1.png
    |   |   |   |-- progress.log
    |   |   |   |-- ...
    |   |   |
    |   |   |-- final_pop
    |   |   |   |-- 150.mol
    |   |   |   |-- 2160.mol
    |   |   |   |-- 9471.mol
    |   |   |   |-- ...
    |   |   |
    |   |   |-- big_monomers.py
    |   |   |-- database.json
    |   |   |-- progress.json
    |   |   |-- errors.log
    |   |   |-- output.tgz

The evolutionary algorithm can also be run multiple times in a row::

    $ python -m stk.ea -l 5 big_monomers.py

which will run the EA 5 separate times adding 5 more subfolders to the
directory structure::

    |-- stk_ea_runs
    |   |-- 0
    |   |   |-- ...
    |   |
    |   |-- 1
    |   |   |-- ...
    |   |
    |   |-- 2
    |   |   |-- ...
    |   |
    |   |-- 3
    |   |   |-- ...
    |   |
    |   |-- 4
    |   |   |-- ...
    |   |
    |   |-- 5
    |   |   |-- ...
    |   |
    |   |-- 6
            |-- ...

The benefit of using the ``-l`` option is that the molecular cache is
not reset between each run. This means that a molecule which was constructed,
optimized and had its fitness value calculated in the first run will
not need to be re-constructed, re-optimized or have fitness value
re-calculated in any of the subsequent runs. The cached version
of the molecule will be used.

However, the molecular cache be pre-loaded even when the ``-l`` option is
not used, simply load the molecules in the input file.

.. code-block:: python

    # some input_file.py

    # There is no need to save this population into a variable.
    # It is enough to load the molecules to place them into the cache.
    stk.Population.load('dumped_molecules.json',
                        stk.Molecule.from_dict)

The output of a single EA consists of a number of files and
directories. The ``scratch`` directory holds any files created during
the EA run. For example, the ``.png`` files showing
how frequently a member of the population was selected for mutation,
crossover and generational selection. For example, this is a
mutation counter

.. image:: figures/counter_example.png

It shows that molecule ``8`` was selected three times for mutation, while molecules
``40``, ``23``` were selected once. The
remaining molecules were not mutated in that generation.

The ``final_pop`` directory holds the ``.mol`` files holding the
structures of the last generation of molecules.
The file ``big_monomers.py`` is a copy of the input file. The ``database.json``
file is a population dump file which holds every molecule produced by
the EA during the run. ``progress.json`` is also a population dump file.
This population holds
every generation of the EA as a subpopulation. This is quite useful
if you want to analyse the output of the EA generation-wise.

``errors.log`` is a file which contains every exception and its
traceback encountered by the EA during its run.

``progress.log`` is a file which lists which molecules make up each
generation, and their respective fitness values.

``output.tgz`` is a tarred and compressed copy of the output folder for
the run.
This means if you want to share you entire run output you can just
share this file.

Finally, when running the EA the progress will be printed into
stderr. The message should be relatively straightforward, such as

::

    ======================================================================

    17:42:20 - INFO - stk.ea.mutation - Using random_bb.

    ======================================================================

which shows the time, the level of the message which can be, in order of
priority DEBUG, INFO, WARNING, ERROR or CRITICAL, the module where
the message originated and finally the message itself.

Evolutionary algorithm input file variables.
............................................

This section lists the variables that need to be defined in the EA
input file, along with a description of each variable.

* :data:`population` - :class:`.EAPopulation` - **mandatory** -
  The initial population of the EA.
* :data:`optimizer` - :class:`.Optimizer` - **mandatory** - The
  optimizer used to optimize the molecules created by the EA.
* :data:`fitness_calculator` - :class:`.FitnessCalculator` -
  **mandatory** - The fitness calculator used to calculate fitness of
  molecules.
* :data:`crosser` - :class:`.Crosser` - **mandatory** - The crosser
  used to carry out crossover operations.
* :data:`mutator` - :class:`.Mutator` - **mandatory** - The mutator
  used to carry out mutation operations.
* :data:`generation_selector` - :class:`.Selector` - **mandatory** -
  The selector used to select the next generation.
  :attr:`~.Selector.batch_size` must be ``1``.
* :data:`mutation_selector` - :class:`.Selector` - **mandatory** - The
  selector used to select molecules to mutate.
  :attr:`~.Selector.batch_size` must be ``1``.
* :data:`crossover_selector` - :class:`.Selector` - **mandatory** -
  The selector used to select molecules for crossover.
* :data:`terminator` - :class:`.Terminator` - **mandatory** -
  The terminator which determines if the EA has satisfied its exit
  condition.
* :data:`fitness_normalizer` - :class:`.FitnessNormalizer` -
  *optional, default =* :class:`.NullFitnessNormalizer()` - The fitness
  normalizer which normalizes fitness values each generation.
* :data:`num_processes` - :class:`int` -
  *optional, default =* :func:`psutil.cpu_count` - The number of CPU
  cores the EA should use.
* :data:`plotters` - :class:`list` of :class:`.Plotter` -
  *optional, default =* ``[]`` - Plotters which are used to plot graphs
  at the end of the EA.
* :data:`log_file` - :class:`bool` -
  *optional, default =* ``True`` - Toggles whether a log file which
  lists which molecules are present in each generation should be made.
* :data:`database_dump` - :class:`bool` -
  *optional, default =* ``True`` - Toggles whether a
  :class:`.Population` JSON file should be made at the end of the EA
  run. It will hold every molecule made by the EA.
* :data:`progress_dump` - :class:`bool` -
  *optional, default =* ``True`` - Toggles whether a
  :class:`.Population` JSON file should be made at the end of the EA
  run. It will hold every generation of the EA as a separate
  subpopulation.
* :data:`debug_dumps` - :class:`bool` -
  *optional, default =* ``False`` - If ``True`` a database and progress
  dump is made after every generation rather than just the end. This is
  nice for debugging but can seriously slow down the EA.
* :data:`tar_output` - :class:`bool` -
  *optional, default =* ``False`` - If ``True`` then a compressed tar
  archive of the output folder will be made.
* :data:`logging_level` - :class:`int` -
  *optional, default =* ``logging.INFO`` - Sets the logging level in
  the EA.