Chemoinformatics in action:
     some questions for the audience

 Yuriy Sushko, Sergii Novotarskyi
Practical example
Story:
A company that produces, or intends
   to produce, a particular compound
   (drug, makeup, paint, glue, toilet
   freshener, whatever) is obliged to
   test whether the compound is toxic
   to humans and, if so, how toxic.
   What are the options for checking
   this?



                                            Tefluthrin, cyclopropanecarboxylic acid
Practical example

Bioassay: in vivo and in vitro assays with mice, dogs, rats or other species.

Computer modeling: in silico, using QSAR (QSPR) models based on machine
learning to predict properties of interest without direct experiment.
Option 1: Bioassay
The classical and still widely used method
  for measuring toxicity is a bioassay with
  mice, rats, dogs or other species.

What are its advantages and disadvantages?
Option 1: Bioassay
For a bioassay we would typically need:
• Dozens of mice to test several concentrations of the
  compound
• In some assays, to wait for the next generation of animals
• Possibly to test several organisms (rat, mouse) and
  different administration routes (oral, skin, IV
  injection)
• Up to several months per test
• Up to tens of thousands of dollars per test
     What if we need to measure toxicity for 100 000 compounds?
Option 2: Modeling
What steps are required to build a
 predictive model for a physicochemical or
 biological property?

• Prepare a dataset of experimental data
• Choose and calculate molecular
  descriptors
• Apply a machine learning method
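
The three steps above can be sketched end to end in a few lines. This is only a toy illustration, not a real QSAR workflow: the property values are made-up placeholders rather than measurements, the single descriptor (a carbon count) and the helper names `carbon_count`, `fit_line` and `predict` are assumptions chosen for brevity, and a plain least-squares line stands in for the machine learning method.

```python
# Step 1: dataset of (SMILES, property value).
# The values are made-up placeholders, NOT experimental measurements.
dataset = [
    ("CC", 1.0),
    ("CCC", 1.5),
    ("CCCC", 2.0),
    ("CCCCC", 2.5),
]


# Step 2: one toy descriptor, the number of carbon atoms.
# Counting "C" characters is only valid for plain alkane strings like these
# (it would miscount "Cl", for example).
def carbon_count(smiles):
    return smiles.count("C")


# Step 3: fit y = slope * x + intercept by ordinary least squares.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs = [carbon_count(smiles) for smiles, _ in dataset]
ys = [value for _, value in dataset]
slope, intercept = fit_line(xs, ys)


def predict(smiles):
    """Predict the property of an unseen compound from its descriptor."""
    return slope * carbon_count(smiles) + intercept


print(predict("CCCCCC"))
```

In practice each step is far larger: thousands of curated measurements, hundreds of descriptors, and a validated model.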
Molecular descriptors
What is a descriptor? What are the simplest examples?

A descriptor is a numerical property of a chemical
  compound.

•   Simplest constitutional descriptors: MW (molecular weight), NA (number of atoms), nDB (number of double bonds), ...
•   Molecular properties: LogP, hydrophilic factor, ...
•   Randic molecular profiles
•   Various topological and 3D indices and profiles
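
As a rough illustration of the simplest constitutional descriptors, the sketch below computes MW, NA and nDB for bromobenzene from its molecular formula and a Kekulé SMILES string. The helper names `parse_formula` and `descriptors` are hypothetical, the mass table covers only the elements needed here, and counting `=` characters is a shortcut: descriptor packages may treat aromatic bonds differently.

```python
import re

# Standard atomic weights for the few elements used in this example.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "O": 15.999,
               "F": 18.998, "Cl": 35.45, "Br": 79.904}


def parse_formula(formula):
    """Turn a formula such as 'C6H5Br' into {'C': 6, 'H': 5, 'Br': 1}."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def descriptors(formula, kekule_smiles):
    counts = parse_formula(formula)
    return {
        # MW: sum of atomic masses over all atoms
        "MW": sum(ATOMIC_MASS[s] * n for s, n in counts.items()),
        # NA: total number of atoms, hydrogens included
        "NA": sum(counts.values()),
        # nDB: double bonds written explicitly in the SMILES string
        "nDB": kekule_smiles.count("="),
    }


d = descriptors("C6H5Br", "C1=CC=C(C=C1)Br")
print(d)
```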
Molecular descriptors
Each compound is represented by a vector of descriptor values:

Compound 1:   2.54   4.25  -5.71   3.26   0.57  -0.07
Compound 2:   1.45   6.34   8.28   2.78  -5.67  -2.33
Compound 3:   1.45   7.34   8.35   1.64  -5.56  -4.45
Machine learning
What machine learning methods do
  you know?
• Linear regression
• K nearest neighbors (KNN)
• Partial least squares (PLS) regression
• Neural networks
• Support vector machines (SVM)
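
To make one of these methods concrete, here is a minimal sketch of KNN regression on descriptor vectors. It reuses the three six-value vectors from the molecular descriptors slide; the property values attached to them, the query vector and the function name `knn_predict` are made up for illustration only.

```python
import math

# (descriptor vector, property value) pairs.
# The vectors come from the descriptor slide; the property values are
# made-up placeholders, NOT measurements.
train = [
    ([2.54, 4.25, -5.71, 3.26, 0.57, -0.07], 1.0),
    ([1.45, 6.34, 8.28, 2.78, -5.67, -2.33], 3.0),
    ([1.45, 7.34, 8.35, 1.64, -5.56, -4.45], 4.0),
]


def knn_predict(train, query, k):
    """Average the property of the k training compounds closest to query."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return sum(value for _, value in nearest) / k


# Hypothetical new compound, described by the same six descriptors.
query = [1.5, 7.0, 8.3, 2.0, -5.6, -4.0]
print(knn_predict(train, query, k=1))
print(knn_predict(train, query, k=2))
```

The prediction depends entirely on the distance measure, so descriptors are usually scaled to comparable ranges before KNN is applied.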
Some additional facts
Popular formats for representing molecules
  in databases:
• SDF
• SMILES
• InChI
SDF — a plain text file
Parts of a record, top to bottom: header (three lines: name, program
stamp, comment), counts line, atom information (one line per atom),
bond information (one line per bond), M  END, tags, and the $$$$ line
that closes the record.

benzene
ACD/Labs0812062058

  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
> <Unique_ID>
XCA3464366

> <ClogP>
5.825

> <Vendor>
Sigma

> <Molecular Weight>
499.611

$$$$
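
Because an SDF record is plain text, it can be read with a few lines of code. The sketch below is a simplified reader for a single V2000 record and only an illustration: `parse_sdf_record` is a hypothetical helper, it splits on whitespace although the format is defined by fixed-width columns (which matters when large numbers run together), and it assumes the usual three header lines and data tags placed before the `$$$$` terminator.

```python
SDF_RECORD = """benzene
ACD/Labs0812062058

  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
> <Unique_ID>
XCA3464366

> <Vendor>
Sigma

$$$$
"""


def parse_sdf_record(text):
    lines = text.splitlines()
    name = lines[0].strip()              # header line 1: molecule name
    counts = lines[3].split()            # counts line follows the 3 header lines
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []                           # (symbol, x, y, z) per atom line
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []                           # (atom index, atom index, bond order)
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))

    tags, key = {}, None                 # "> <Name>" line, then its value
    for line in lines[4 + n_atoms + n_bonds:]:
        if line.startswith("$$$$"):      # end of record
            break
        if line.startswith(">"):
            key = line[line.index("<") + 1:line.rindex(">")]
            tags[key] = ""
        elif key is not None and line.strip():
            tags[key] = line.strip()
    return name, atoms, bonds, tags


name, atoms, bonds, tags = parse_sdf_record(SDF_RECORD)
print(name, len(atoms), len(bonds), tags)
```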
SMILES — a string representation

                     C1=CC=C(C=C1)Br      (bromobenzene)

                     CC(F)F               (1,1-difluoroethane)

                     COC(C(Cl)Cl)(F)F     (methoxyflurane)
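
Since a SMILES string is just text, simple facts can be read off it directly. The sketch below counts heavy (non-hydrogen) atoms per element for the three strings above. It is deliberately limited and not a SMILES parser: the hypothetical helper `heavy_atom_counts` only recognises uppercase organic-subset symbols written outside brackets, so aromatic lowercase atoms and bracket atoms such as `[Na+]` are ignored.

```python
import re
from collections import Counter

# Two-letter symbols (Cl, Br) are listed first so that "Cl" is not read
# as a carbon followed by a stray "l".
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]")


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms per element in a simple SMILES string."""
    return Counter(ATOM.findall(smiles))


for smiles in ["C1=CC=C(C=C1)Br", "CC(F)F", "COC(C(Cl)Cl)(F)F"]:
    print(smiles, dict(heavy_atom_counts(smiles)))
```

Hydrogens are implicit in these strings, which is why they do not appear in the counts.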
InChI — one more approach
 InChI (International Chemical Identifier) is a standard developed by IUPAC
    for textual identifiers of chemical substances


              InChI: InChI=1S/C6H5Br/c7-6-4-2-1-3-5-6/h1-5H
              InChIKey: QARVLSVVCXYDNA-UHFFFAOYSA


             InChI: InChI=1S/C2H4F2/c1-2(3)4/h2H,1H3
             InChIKey: NPNPZTNLOVBDOC-UHFFFAOYSA


              InChI: InChI=1S/C3H4Cl2F2O/c1-8-3(6,7)2(4)5/h2H,1H3
              InChIKey: RFKMCNOHBTXSMU-UHFFFAOYSA
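
An InChI string is built from layers separated by `/`: the version (here `1S`, version 1, standard), the molecular formula, then layers marked by a one-letter prefix such as `c` (connectivity) and `h` (hydrogens). The sketch below splits the three identifiers above into these layers. The function name `inchi_layers` is hypothetical, and the code assumes a formula layer is present, which holds for these examples but not for every InChI.

```python
def inchi_layers(inchi):
    """Split an InChI string into a dict of its layers."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI":
        raise ValueError("not an InChI string: " + inchi)
    parts = body.split("/")
    # The first two parts carry no prefix letter: version, then formula.
    layers = {"version": parts[0], "formula": parts[1]}
    # Every later layer starts with a one-letter prefix (c, h, ...).
    for part in parts[2:]:
        layers[part[0]] = part[1:]
    return layers


examples = [
    "InChI=1S/C6H5Br/c7-6-4-2-1-3-5-6/h1-5H",
    "InChI=1S/C2H4F2/c1-2(3)4/h2H,1H3",
    "InChI=1S/C3H4Cl2F2O/c1-8-3(6,7)2(4)5/h2H,1H3",
]
parsed = [inchi_layers(text) for text in examples]
for layers in parsed:
    print(layers)
```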
