1
Methods for Receptors with Known Sequence
2
 
3
Genomic Information
  • three most comprehensive data banks
    --- GenBank
         www.ncbi.nih.gov/Genbank/
    --- EMBL-EBI Nucleotide Sequence Database
         www.ebi.ac.uk/embl/
    --- DNA Data Bank of Japan
         www.ddbj.nig.ac.jp
  • daily exchange of data
  • tools for processing the information
4
DNA Sequence Types in Databanks
  • mRNA - may contain non-coding sequences (untranslated regions, or introns if unspliced) and may not cover the complete coding region
  • cDNA - derived from mRNA by reverse transcription
  • genomic DNA - from genome sequencing project;
    may contain introns, repeat regions, and other features;  usually complete
  • GSS (Genome Survey Sequence) - single-pass sequence with many errors
  • EST (Expressed Sequence Tags) - short cDNA sequences prepared from mRNA from cells under particular conditions (e.g. disease); will not cover the complete coding region.
5
Target Sequences for Drug Design
  •  proteome mapping in disease
     --- the sequences that are over-expressed
          can be targeted
  •  protein-protein interactions – micro-arrays
     --- interactions that are occurring in disease
          can be prevented by compounds that
          bind to one of the proteins in the binding
          area
6
From Genome to Proteins
  • the aim for drug design:
    DNA → RNA → protein → 3D structure → binding
  • sequences of nucleotides (A, C, G, T) available, the aim is to translate them to amino acid sequences
  • coding regions for proteins (genes/cistrons) – not straightforward to find; annotation (sequence features) is important
  • amino acids coded by codons (triplets) but the sequence also contains start and stop codons…
7
Translation Table(s)
8
Translation
  •  DNA contains two complementary
     antiparallel strands
       5’-AGCAGTCGATGCCGAATTCC-3’
       3’-TCGTCAGCTACGGCTTAAGG-5’
  •  each strand is read 5’→3’, so the two strands
     give two reading directions
  •  for each direction, there are three possible
     frames – six possible reading frames in total
  •  those that have stop codons early in the
     sequence can be discarded
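
The six reading frames can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and an unambiguous A/C/G/T sequence; the function names and the `min_length` cutoff for an "early" stop are hypothetical choices, not part of the lecture.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the complementary strand, written 5'->3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; trailing bases are ignored."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frames(dna):
    """Translate three offsets on each strand, keyed '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


def has_early_stop(protein, min_length=5):
    """True if a stop codon occurs within the first min_length residues (assumed cutoff)."""
    stop = protein.find("*")
    return 0 <= stop < min_length


if __name__ == "__main__":
    for name, protein in six_frames("AGCAGTCGATGCCGAATTCC").items():
        print(name, protein, "discard" if has_early_stop(protein) else "keep")
```

Real gene finding also relies on start codons, frame length and annotation; the stop-codon filter here is only the first screen mentioned above.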
9
Protein Sequence Databases I
10
Protein Sequence Databases II
11
Protein Sequence Databases III
12
Post-translational Modifications
13
Sequence Alignment I
  • structure is more conserved than sequence
    --- similarity in sequence implies similarity in structure
    --- proteins with different sequences can have similar
         structures
  • similarity in sequence also implies similarity in function
  • comparison of the same protein in different organisms
    --- evolutionary relationships (phylogenetic analysis)
  • when doing alignments, one can look at
    --- identical amino acids
    --- amino acids with similar properties
14
Sequence Alignment II
  • sequence identity
    --- the same amino acids at equivalent
         positions
  • sequence similarity
    --- amino acids with similar properties can be
         interchanged
  • sequence homology
    --- homology and similarity are often used as
         synonyms
    --- in phylogenetic analysis, two sequences
         are homologous if they share a common
         ancestor
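
The difference between identity and similarity can be made concrete on an already aligned pair. The sketch below is illustrative only: the similarity groups are a hypothetical simplification (real tools use substitution matrices such as BLOSUM), and percentages are taken over gap-free columns, which is one of several conventions in use.

```python
# Hypothetical groups of amino acids treated as "similar" for this illustration.
SIMILAR_GROUPS = ("ILVM", "FWY", "KRH", "DE", "NQ", "ST")


def is_similar(x, y):
    """Identical residues, or residues from the same assumed property group."""
    return x == y or any(x in group and y in group for group in SIMILAR_GROUPS)


def percent_identity_and_similarity(aligned_a, aligned_b, gap="-"):
    """Return (% identity, % similarity) over columns without a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if gap not in (x, y)]
    if not pairs:
        return 0.0, 0.0
    identical = sum(x == y for x, y in pairs)
    similar = sum(is_similar(x, y) for x, y in pairs)
    return 100.0 * identical / len(pairs), 100.0 * similar / len(pairs)


if __name__ == "__main__":
    print(percent_identity_and_similarity("ACDE", "ACEE"))
```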
15
 
16
Sequence Alignment: Procedure
  • the compared sequences are
    --- moved along each other
    --- padded with gaps where needed, with penalties
         - for opening a gap
         - for extending a gap (smaller)
  • similarity matrix
    --- rows and columns are individual positions
    --- elements contain scores for comparison
         - highest scores for identity
         - lower scores for similarity
         - negative scores for dissimilar substitutions
  • the best alignment has the highest score
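
The procedure above can be sketched as a dynamic-programming score for a global alignment with separate gap-opening and gap-extension penalties (the affine-gap scheme usually attributed to Gotoh). The scores (+2 identity, +1 similarity, -1 otherwise), the similarity groups and the penalty values are assumptions chosen for illustration; only the best score is returned, not the alignment itself.

```python
NEG_INF = float("-inf")
# Hypothetical similarity groups; real programs use matrices such as BLOSUM62.
SIMILAR_GROUPS = ("ILVM", "FWY", "KRH", "DE", "NQ", "ST")


def substitution_score(x, y):
    """Highest score for identity, lower for similarity, negative otherwise."""
    if x == y:
        return 2
    if any(x in group and y in group for group in SIMILAR_GROUPS):
        return 1
    return -1


def global_alignment_score(a, b, gap_open=3, gap_extend=1):
    """Best global alignment score with affine gap penalties."""
    n, m = len(a), len(b)
    # match[i][j]: a[i-1] aligned to b[j-1]; gap_b: a[i-1] opposite a gap;
    # gap_a: b[j-1] opposite a gap.
    match = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    gap_b = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    gap_a = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    match[0][0] = 0
    for i in range(1, n + 1):
        gap_b[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        gap_a[0][j] = -(gap_open + (j - 1) * gap_extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = substitution_score(a[i - 1], b[j - 1])
            match[i][j] = s + max(match[i - 1][j - 1],
                                  gap_b[i - 1][j - 1],
                                  gap_a[i - 1][j - 1])
            gap_b[i][j] = max(match[i - 1][j] - gap_open,
                              gap_b[i - 1][j] - gap_extend,
                              gap_a[i - 1][j] - gap_open)
            gap_a[i][j] = max(match[i][j - 1] - gap_open,
                              gap_a[i][j - 1] - gap_extend,
                              gap_b[i][j - 1] - gap_open)
    return max(match[n][m], gap_b[n][m], gap_a[n][m])


if __name__ == "__main__":
    print(global_alignment_score("ACDE", "ACE"))
```

Because extending a gap costs less than opening one, a single long gap scores better than several short ones, which matches the observation that insertions and deletions tend to cluster in loops.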
17
Sequence Alignment: Dotplots
  • the plot of the relation between two sequences
    --- the dots mark positions of identical AA
    --- can be optimized by ‘sliding window’
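
A dotplot with a sliding window can be sketched as follows. The parameters `window` and `min_matches` are hypothetical names: a dot is placed at (i, j) when at least `min_matches` of the `window` residues starting at those positions are identical, which suppresses isolated chance matches.

```python
def dotplot(a, b, window=1, min_matches=1):
    """Return (i, j) start positions whose windows share >= min_matches identities."""
    dots = []
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.append((i, j))
    return dots


def render(a, b, dots):
    """Draw the plot as text: rows follow sequence a, columns follow sequence b."""
    marked = set(dots)
    rows = ["  " + b]
    for i, residue in enumerate(a):
        cells = "".join("*" if (i, j) in marked else "." for j in range(len(b)))
        rows.append(f"{residue} {cells}")
    return "\n".join(rows)


if __name__ == "__main__":
    seq_a, seq_b = "MKVLAAGIVG", "MKVLSAGIVG"
    print(render(seq_a, seq_b, dotplot(seq_a, seq_b, window=3, min_matches=3)))
```

Related segments show up as diagonal runs of dots; a shifted diagonal indicates an insertion or deletion.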
18
 
19
Sequence Alignment: Two Sequences
  • lalign
    --- www.ch.embnet.org/software/LALIGN_form.html
  • SIM
    --- ca.expasy.org/tools/sim.html
  • align
    --- www.ebi.ac.uk/emboss/align/
  • the best way to align two sequences is to align the whole family
    --- varying AAs less important
    --- conserved AAs functionally important
    --- gaps frequently in loops
20
Sequence Alignment: Databases
  • Basic Local Alignment Search Tool (BLAST)
    --- www.ncbi.nlm.nih.gov/BLAST/
    --- uses local alignments
    --- able to detect relationships among sequences
         which share only isolated regions of  similarity
    --- various enhancements and variants available
  • FASTA
    --- www.ebi.ac.uk/fasta33/
  • ClustalW
    --- www.ch.embnet.org/software/ClustalW.html
  • T-COFFEE (very good, comparatively slow)
    --- www.ch.embnet.org/software/TCoffee.html
21
Sequence Alignment: Quality
  • % identity
  • Expect (E-) value
    --- takes into account the size of the database
    --- describes the number of hits one can
         "expect" to see just by chance when
         searching a database of a particular size
  • E-value – rule of thumb
    --- E < 0.1: good, unlikely to be a random alignment
    --- E ~ 1: about one such hit is expected by chance
    --- E > 10: no significant relationship
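
The dependence of the E-value on database size can be sketched with the Karlin-Altschul relation E = K·m·n·exp(-λS), where m and n are the query and database lengths and S is the raw score. The default values of `lam` and `k` below are illustrative assumptions (they depend on the scoring matrix and gap penalties); converting E to a probability assumes chance hits follow a Poisson distribution.

```python
import math


def expect_value(raw_score, query_length, database_length, lam=0.267, k=0.041):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_length * database_length * math.exp(-lam * raw_score)


def chance_probability(e_value):
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    return 1.0 - math.exp(-e_value)


if __name__ == "__main__":
    for db_length in (1_000_000, 100_000_000):
        e = expect_value(raw_score=60, query_length=300, database_length=db_length)
        print(db_length, e, chance_probability(e))
```

The same raw score therefore becomes less significant as the database grows, which is why E-values from searches against different databases are not directly comparable.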
22
Motifs
  •  specific consensus patterns discovered in
     multiple alignments
  • discovered through profiles, Hidden Markov Models, position specific score matrices
  •  indicative of protein function
     --- function can be identified even if overall
          similarity is low
  •  organized in motif databases
  •  many motif databases can be searched
     simultaneously by InterPro
     --- www.ebi.ac.uk/interpro/scan.html
23
Alignment Plus Motifs
  •  Position-specific Iterated BLAST (PSI-BLAST)
     --- www.ncbi.nlm.nih.gov/BLAST/
  •  initial alignments generate a profile
     --- frequencies of AAs in individual positions
  •  motifs can be incorporated easily
  •  the procedure is iterated until no new statistically significant sequences are found in the database
  •  very good alignment tool
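
The profile idea can be sketched as a table of per-position amino acid frequencies built from a multiple alignment, which is then used to score candidate sequences. This is a simplified illustration and not the PSI-BLAST implementation: the uniform pseudocount, the uniform background frequency of 1/20 and the function names are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_sequences, pseudocount=1.0):
    """Per-column amino acid frequencies; gaps and unknown symbols are skipped."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for col in range(length):
        counts = Counter(
            seq[col] for seq in aligned_sequences if seq[col] in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def profile_score(profile, sequence, background=1.0 / 20.0):
    """Sum of log-odds scores (bits) of an ungapped sequence against the profile."""
    return sum(
        math.log2(column[aa] / background)
        for column, aa in zip(profile, sequence)
    )


if __name__ == "__main__":
    profile = build_profile(["ACDK", "ACEK", "ACDR"])
    print(profile_score(profile, "ACDK"), profile_score(profile, "WWWW"))
```

In an iterated search, sequences that score well against the profile would be added to the alignment and the profile rebuilt before the next round.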
24
 
25
From Sequence to Structure
26
What Is a Template?
27
Comparative Modeling
  • identify and select related structures (templates)
  • align target sequence with templates
    --- for drug design, similarity >75%
    --- gaps should be between secondary structure
        elements
  • build a structural model for the target using information about template structures
    --- in principle, copy template structure to target
    --- gaps – loop search algorithms for structures
        from PDB-based libraries or optimization
  • evaluate model (repeat if not satisfied)
    --- PROCHECK (bonds, dihedrals, noncovalent interactions…)
    --- molecular modeling (side chain optimization, MD)
28
Comparative Modeling: Software
  • Modeller (Andrej Sali) – unix software
  • MODBASE (http://guitar.rockefeller.edu/modbase) is a relational database of annotated comparative protein structure models for all available protein sequences matched to at least one known protein structure
  • SWISS-MODEL – an easy to use web-server
    --- www.expasy.ch/swissmod/SWISS-MODEL.html
29
Comparative Modeling: Illustration
30
Knowledge-based Modeling
  •  also called ‘fold recognition’, ‘threading’,
     ‘recognition of remote homologies’
  •  sequences with low % identity can adopt the
     same 3D fold
  •  around 1000 folds are expected to occur in
     all proteins
  •  sequence is ‘threaded’ into each of the known
     folds in a database and the resulting fit is
     evaluated by an energy-based score
  •  Methods: e.g. FUGUE, 3D-PSSM
      www.sbg.bio.ic.ac.uk/~3dpssm/
31
Knowledge-based Modeling: Flow Chart
32
Ab Initio Protein Folding
  • Rosetta method – based on local structures observed in PDB; MC search of the possible combinations of likely local structures, minimizing a scoring function that accounts for nonlocal interactions such as compactness, hydrophobic burial, and pair interactions
  • Static approaches – search for kinetically accessible minima in hypersurfaces generated by force-field-type description of electrostatic, H-bonding and dispersion interactions
  • Dynamic approaches – simulations with different protein representations (coarse-grain Langevin simulations, torsional angular simulations, MD simulations)
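
The Monte Carlo search over combinations of local structures can be illustrated with a toy Metropolis loop. This is not the Rosetta algorithm or its scoring function: the per-position candidate lists stand in for a fragment library, and `toy_score` is a hypothetical function that simply penalizes neighbouring positions with different states.

```python
import math
import random


def metropolis_search(initial, candidates, score, steps=2000, temperature=1.0, seed=0):
    """Minimize score() by replacing one position at a time with a library candidate."""
    rng = random.Random(seed)
    current = list(initial)
    current_score = score(current)
    best, best_score = list(current), current_score
    for _ in range(steps):
        position = rng.randrange(len(current))
        trial = list(current)
        trial[position] = rng.choice(candidates[position])
        trial_score = score(trial)
        delta = trial_score - current_score
        # Metropolis criterion: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = trial, trial_score
            if current_score < best_score:
                best, best_score = list(current), current_score
    return best, best_score


def toy_score(states):
    """Hypothetical score: number of adjacent positions whose local states differ."""
    return sum(x != y for x, y in zip(states, states[1:]))


if __name__ == "__main__":
    start = list("HECHECHE")                 # H = helix, E = strand, C = coil
    library = [["H", "E", "C"]] * len(start)  # assumed candidates per position
    print(metropolis_search(start, library, toy_score))
```

Accepting occasional uphill moves is what lets the search escape local minima of the scoring function.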
33
Protein Structure Databases
  •  PDB
     --- www.rcsb.org/pdb
  • Derivative databases:
  • PDBsum – annotated PDB
    --- www.biochem.ucl.ac.uk/bsm/pdbsum/
  • Relibase+ - based on PDB, focusing on ligands
    and binding sites
    --- relibase.rutgers.edu/
  •  MSD – Macromolecular Structure Database
     --- www.ebi.ac.uk/msd/
34
What are the Best Methods?
  •  CASP – Community-wide experiment on the Critical Assessment of techniques for protein Structure Prediction - http://predictioncenter.llnl.gov/
  • a competition organized every 2 years since 1994
  • experimentalists provide unpublished structures
  • the sequences are made available to all researchers
  • researchers can predict the structure by
    --- homology modeling
    --- fold recognition (threading)
    --- ab initio modeling
  • results are then compared against the real experimental structure.
35
Quality of the Predictions I
  •  comparative modeling – CASP3
36
Quality of the Predictions II
  •  threading, ab initio – CASP3