|
1
|
|
|
2
|
|
|
3
|
- three most comprehensive data banks
--- GenBank
www.ncbi.nih.gov/Genbank/
--- EMBL-EBI Nucleotide Sequence Database
www.ebi.ac.uk/embl/
--- DNA Data Bank of Japan
www.ddbj.nig.ac.jp
- daily exchange of data
- tools for processing the information
|
|
4
|
- mRNA - may contain non-coding sequences (untranslated regions, or introns
if unspliced) and may not cover the complete coding region
- cDNA - derived from mRNA by reverse transcription
- genomic DNA - from a genome sequencing project;
may contain introns, repeat regions, and other features; usually complete
- GSS (Genome Survey Sequence) - single-pass genomic sequence, may contain many errors
- EST (Expressed Sequence Tag) - short cDNA sequence prepared from mRNA
from cells under particular conditions (e.g. disease); will not cover
the complete coding region
|
|
5
|
- proteome mapping in disease
--- the sequences that are over-expressed can be targeted
- protein-protein interactions – micro-arrays
--- interactions that are occurring in disease can be prevented by
compounds that bind to one of the proteins in the binding area
|
|
6
|
- the aim for drug design:
DNA → RNA → protein → 3D structure → binding
- sequences of nucleotides (A, C, G, T) are available; the aim is to translate
them into amino acid sequences
- coding regions for proteins (genes/cistrons) – not straightforward to
find; annotation (sequence features) is important
- amino acids are coded by codons (triplets), but the sequence also contains
start and stop codons…
|
|
7
|
|
|
8
|
- DNA contains two complementary antiparallel strands
5’-AGCAGTCGATGCCGAATTCC-3’
3’-TCGTCAGCTACGGCTTAAGG-5’
- each strand is read in its own 5’→3’ direction, giving two reading
directions
- for each direction, there are three possible ways to group the
nucleotides into triplets – six possible reading frames
- those that have stop codons early in the sequence can be discarded
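
The six reading frames can be sketched in a few lines of Python. This is only an illustration, assuming the standard genetic code and an input made up solely of A, C, G and T; the function names (reverse_complement, translate, six_frames) are made up for this example and do not come from any particular package.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the opposite strand, written 5'->3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; trailing bases are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to translated sequences."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frames("AGCAGTCGATGCCGAATTCC").items():
        # A "*" near the start marks a frame that can be discarded.
        print(label, protein)
```

Frames whose translation shows "*" close to the beginning are the ones with early stop codons.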
|
|
9
|
|
|
10
|
|
|
11
|
|
|
12
|
|
|
13
|
- structure is more conserved than sequence
--- similarity in sequence implies similarity in structure
--- proteins with different sequences can have similar
structures
- similarity in sequence also implies similarity in function
- comparison of the same protein in different organisms
--- evolutionary relationships (phylogenetic analysis)
- when doing alignments, one can look at
--- identical amino acids
--- amino acids with similar properties
|
|
14
|
- sequence identity
--- the same amino acids at equivalent positions
- sequence similarity
--- amino acids with similar properties can be interchanged
- sequence homology
--- homology and similarity are often used as synonyms
--- in phylogenetic analysis, two sequences are homologous if they
share a common ancestor
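
The difference between identity and similarity can be made concrete for two sequences that are already aligned. The sketch below is an illustration only: the grouping of amino acids by property in SIMILAR_GROUPS is an assumption chosen for this example (real tools use substitution matrices), and the function name is hypothetical.

```python
# Illustrative property groups (an assumption, not a standard classification).
SIMILAR_GROUPS = ["AVLIM", "FWY", "KRH", "DE", "STNQ"]
GROUP_OF = {aa: i for i, group in enumerate(SIMILAR_GROUPS) for aa in group}


def identity_and_similarity(a, b):
    """Return (% identity, % similarity) over aligned positions without gaps.

    a and b must be aligned sequences of equal length, with "-" marking gaps.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0, 0.0
    identical = sum(x == y for x, y in pairs)
    similar = sum(
        x == y or (x in GROUP_OF and GROUP_OF[x] == GROUP_OF.get(y))
        for x, y in pairs
    )
    return 100 * identical / len(pairs), 100 * similar / len(pairs)


if __name__ == "__main__":
    print(identity_and_similarity("AVKD-G", "ALRDWG"))
```

Identical positions also count as similar, so % similarity is never lower than % identity.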
|
|
15
|
|
|
16
|
- the compared sequences are
--- moved along each other
--- gaps are introduced if needed, with penalties for opening a gap
and (smaller) for extending a gap
- similarity matrix
--- rows and columns are individual positions of the two sequences
--- elements contain scores for comparison: highest scores for identity,
lower scores for similarity, negative scores for dissimilar substitutions
- the best alignment has the highest score
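
A minimal sketch of how the highest-scoring alignment can be found by dynamic programming (the Needleman-Wunsch approach to global alignment). It is deliberately simplified and its parameters are assumptions: one flat score for identity and one for mismatch instead of a full substitution matrix, and a single gap penalty instead of separate opening and extension penalties.

```python
def global_align(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # gap in b
                score[i][j - 1] + gap,    # gap in a
            )
    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(global_align("ACGT", "AGT"))
```

Replacing the match/mismatch test with a lookup in a substitution matrix gives the identity/similarity/dissimilarity scoring described above.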
|
|
17
|
- the plot of the relation between two sequences
--- the dots mark positions of identical AA
--- can be optimized by ‘sliding window’
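
A small sketch of such a plot, assuming a simple filter: a dot is kept at (i, j) only if at least min_matches of the window residues starting there are identical. The names dot_plot and render and both parameters are illustrative choices for this example.

```python
def dot_plot(a, b, window=1, min_matches=1):
    """Return the set of (i, j) start positions whose windows match well enough."""
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(a, b, dots):
    """Draw the plot as text: rows follow sequence a, columns follow sequence b."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(len(b)))
        for i in range(len(a))
    )


if __name__ == "__main__":
    a = b = "ABCA"
    print(render(a, b, dot_plot(a, b)))                           # raw plot
    print()
    print(render(a, b, dot_plot(a, b, window=2, min_matches=2)))  # filtered
```

With window=1 every identical pair is marked, including isolated chance matches; a longer window keeps mainly the diagonal runs that indicate related regions.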
|
|
18
|
|
|
19
|
- lalign
--- www.ch.embnet.org/software/LALIGN_form.html
- SIM
--- ca.expasy.org/tools/sim.html
- align
--- www.ebi.ac.uk/emboss/align/
- the best way to align two sequences is to align the whole family
--- varying AAs less important
--- conserved AAs functionally important
--- gaps frequently in loops
|
|
20
|
- Basic Local Alignment Search Tool (BLAST)
--- www.ncbi.nlm.nih.gov/BLAST/
--- uses local alignments
--- able to detect relationships among sequences which share only
isolated regions of similarity
--- various enhancements and variants available
- FASTA
--- www.ebi.ac.uk/fasta33/
- ClustalW
--- www.ch.embnet.org/software/ClustalW.html
- T-COFFEE (very good, comparatively slow)
--- www.ch.embnet.org/software/TCoffee.html
|
|
21
|
- % identity
- Expect (E-) value
--- takes into account the size of the database
--- describes the number of hits one can "expect" to see just by chance
when searching a database of a particular size
- E-value – rule of thumb
--- E < 0.1: good, a random alignment is unlikely
--- E ~ 1: about one hit of this quality is expected by chance
--- E > 10: no relationship can be inferred
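
The dependence of the E-value on database size can be illustrated with the Karlin-Altschul relation E = K·m·n·exp(-λS), where m is the query length, n the database length and S the raw alignment score. In the sketch below the values of K and lam are placeholders chosen only for illustration; real values depend on the scoring system used by the search program.

```python
import math


def expect_value(score, query_len, db_len, K=0.13, lam=0.32):
    """Expected number of chance hits with at least this raw score.

    K and lam are illustrative placeholders, not values from a real search.
    """
    return K * query_len * db_len * math.exp(-lam * score)


def p_value(e):
    """Probability of seeing at least one such chance hit, given E."""
    return 1.0 - math.exp(-e)


if __name__ == "__main__":
    for db_len in (1_000_000, 100_000_000):
        e = expect_value(score=60, query_len=250, db_len=db_len)
        print(db_len, e, p_value(e))
```

The same raw score gives a larger E-value in a larger database, which is why E-values from searches against different databases are not directly comparable.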
|
|
22
|
- specific consensus patterns discovered in multiple alignments
- discovered through profiles, Hidden Markov Models, position-specific
score matrices
- indicative of protein function
--- function can be identified even if overall similarity is low
- organized in motif databases
- many motif databases can be searched simultaneously by InterPro
--- www.ebi.ac.uk/interpro/scan.html
|
|
23
|
- Position-Specific Iterated BLAST (PSI-BLAST)
--- www.ncbi.nlm.nih.gov/BLAST/
- initial alignments generate a profile
--- frequencies of AAs in individual positions
- motifs can be incorporated easily
- the procedure is iterated until no new statistically significant
sequences are found in the database
- very good alignment tool
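
The idea of a profile as per-position amino acid frequencies can be sketched as follows. This is a simplified illustration, not the PSI-BLAST implementation: it uses raw frequencies, whereas real profile methods typically add sequence weighting, pseudocounts and log-odds scores. The function names are hypothetical.

```python
from collections import Counter


def build_profile(aligned):
    """Return one {amino acid: frequency} dict per alignment column.

    aligned is a list of equal-length aligned sequences; "-" marks a gap
    and is ignored when counting.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned if seq[col] != "-")
        total = sum(counts.values())
        profile.append({aa: c / total for aa, c in counts.items()} if total else {})
    return profile


def score_against_profile(seq, profile):
    """Sum the profile frequencies of the residues in seq (toy score)."""
    return sum(column.get(aa, 0.0) for aa, column in zip(seq, profile))


if __name__ == "__main__":
    profile = build_profile(["ACD", "ACE", "GCD", "A-D"])
    print(profile)
    print(score_against_profile("ACD", profile))
```

In an iterated search, sequences that score well against the profile would be added to the alignment and the profile rebuilt before the next round.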
|
|
24
|
|
|
25
|
|
|
26
|
|
|
27
|
- identify and select related structures (templates)
- align target sequence with templates
--- for drug design, similarity should be >75%
--- gaps should be between secondary structure elements
- build a structural model for the target using information about template
structures
--- in principle, copy template structure to target
--- gaps – loop search algorithms for structures from PDB-based
libraries or optimization
- evaluate model (repeat if not satisfied)
--- PROCHECK (bonds, dihedrals, noncovalent interactions…)
--- molecular modeling (side chain optimization, MD)
|
|
28
|
- Modeller (Andrej Sali) – unix software
- MODBASE (http://guitar.rockefeller.edu/modbase) is a relational database
of annotated comparative protein structure models for all available
protein sequences matched to at least one known protein structure
- SWISS-MODEL – an easy to use web-server
--- www.expasy.ch/swissmod/SWISS-MODEL.html
|
|
29
|
|
|
30
|
- also called ‘fold recognition’, ‘threading’, ‘recognition of remote
homologies’
- sequences with low % identity can adopt the same 3D fold
- around 1000 folds are expected to occur in all proteins
- sequence is ‘threaded’ into each of the known folds in a database and
resulting energy is evaluated by a score
- Methods: e.g. FUGUE, 3D-PSSM
--- www.sbg.bio.ic.ac.uk/~3dpssm/
|
|
31
|
|
|
32
|
- Rosetta method – based on local structures observed in PDB; MC search of
the possible combinations of likely local structures, minimizing a
scoring function that accounts for nonlocal interactions such as
compactness, hydrophobic burial, and pair interactions
- Static approaches – search for kinetically accessible minima in
hypersurfaces generated by force-field-type description of
electrostatic, H-bonding and dispersion interactions
- Dynamic approaches – simulations with different protein representations
(coarse-grain Langevin simulations, torsion-angle simulations, MD
simulations)
|
|
33
|
- PDB
--- www.rcsb.org/pdb
- Derivative databases:
- PDBsum – annotated PDB
--- www.biochem.ucl.ac.uk/bsm/pdbsum/
- Relibase+ – based on PDB, focusing on ligands and binding sites
--- relibase.rutgers.edu/
- MSD – Macromolecular Structure Database
--- www.ebi.ac.uk/msd/
|
|
34
|
- CASP – Community-wide experiment on the Critical Assessment of
techniques for protein Structure Prediction - http://predictioncenter.llnl.gov/
- a competition organized every 2 years since 1994
- experimentalists provide unpublished structures
- the sequences are made available to all researchers
- researchers can predict the structure by
--- homology modeling
--- fold recognition (threading)
--- ab initio modeling
- results are then compared against the real experimental structure.
|
|
35
|
- comparative modeling – CASP3
|
|
36
|
- threading, ab initio – CASP3
|