Genome Transcriptome Proteome Metabolome Interactome Phenome Functome
A crisis in postgenomic nomenclature. Fields and Johnston (2002)
genome size of human and important model systems
Smallest genome for an organism that can live on its own: 1300 genes, in Pelagibacter ubique, which also has the largest biomass
Others have smaller genomes (e.g. Mycoplasma, with about 500 genes) but need other organisms to survive
Genomics – initially, acquisition of sequence •where (and what) are the genes? Post-genomics – functional analysis •got genes, what do they do? •how do they work? •how do you generate an integrated whole? I use GENOMICS to encompass all of this
Differences lie in the number of genes, the nature of the genes, or their spatial organization, regulation, etc. The current thinking is that the answer is regulation (when genes are turned on and off), organization, etc.
Arabidopsis has slightly more genes than we do (about 25 000)
From this we know: how many genes are necessary, how many are required for basic survival, how many are needed for free-living survival, etc. To understand a gene we need (at least):
•Sequence (Regulatory regions, Coding region) •Time/cell type of transcription/processing/translation •Stability of message and gene product •post-translational processing •3D structure •Catalytic/structural function •Cellular location (including movement) •Biochemical pathway(s) and regulation •Physical interactions •Genetic interactions
microarray = track transcriptomes
[Figure: Publications (through Sept. 2002) for the search terms Genomics, Microarray, Whole genome, Transcriptome, Proteomics and Proteome. Source: NCBI PubMed, searching titles, abstracts and keywords.]
Totals through 2012: Genomics 83477, Proteomics 37261, Microarray 52353
Growth of primary sequence database
Currently >126 billion bp traditional + >191 billion bp in whole genome
How is it generated?
Budding yeast chromosome I, Bussey et al., 1995
Currently >155 billion bp traditional + >418 billion bp in whole genome. Whole-genome sequence is expanding faster because it is now easier to do than targeted sequencing.
Sub-cloning fragments again and again, and then putting the genome back together. Budding yeast is the best-understood organism genetically; however, it was later discovered to have twice as many genes as initially thought
Became proof of principle for shot-gun sequencing
Fleischmann et al., Science 269:496-512, 1995
Dideoxy sequencing gel
Shot gun sequencing for individual molecule
Haemophilus influenzae – shotgun sequenced 1995 (1.8 Mbp)
Single-molecule sequencing: Helicos, single-molecule fluorescent; single molecule, single well; proton detection. The $1000 exome
Exomes = protein-coding regions. Now $1000 for the whole genome
Nothing was sub-cloned; instead the DNA was isolated, chopped up, and handed to robots to sequence. It did not matter where a fragment came from, it was simply sequenced. With enough reads and enough coverage the fragments overlap, and the overlaps can be used to reconstruct the whole sequence. This is how it is done now, which is why it is so cheap. The cloning step is no longer needed at all.
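The overlap idea can be sketched in a few lines. This is only a toy illustration, assuming error-free reads from one strand and no repeats; the function names (`overlap`, `greedy_assemble`) and the example reads are made up for this sketch, and real assemblers use far more sophisticated methods.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are wholly contained in another read.
    contigs = [r for r in contigs
               if not any(r != o and r in o for o in contigs)]
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing overlaps any more: a gap in coverage
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical 21-bp "genome" and three overlapping reads taken from it.
genome = "ATGGCGTACGTTAGCCTAGGA"
reads = ["TAGCCTAGGA", "ATGGCGTACG", "GTACGTTAGCC"]
print(greedy_assemble(reads))
```

If a region is not covered by any read, the loop stops with more than one contig, which corresponds to having to go back for the bit that was missed.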
The run only has to last about an hour; computers then put the reads together, and you go back if you missed a bit
Sequencing the first human genome took $ billion? dollars.
First individual sequences (Craig Venter and James Watson) were released a few years ago. Many are available now. Genomes can be sequenced for a thousand dollars; some say for as little as $60. To some extent it depends on the accuracy required. All genomes will be available in the future, and cheaply.
You never get 100% of it, because every newly created gamete carries 40 – 60 unique mistakes (new mutations).
Can we identify all genes by computer analysis of the raw DNA sequence? NO! • We cannot unambiguously identify all genes, some are always missed (same for splice variants, even with lots of ESTs) • Even if sure that it is a gene, we often do not know its function. • Even for small fully sequenced genomes we have large numbers of genes with no substantial associated phenotype/function.
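To see why raw sequence alone is not enough, here is a minimal sketch of the most naive form of computational gene finding: scanning for open reading frames (ATG to an in-frame stop). The function name `find_orfs`, the `min_codons` cutoff and the example sequence are assumptions made for this illustration, not a method from the lecture.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) positions of forward-strand ORFs.

    An ORF here runs from the first ATG in a frame to the next in-frame
    stop codon; `min_codons` counts codons before the stop (ATG included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


# Hypothetical sequence with one short ORF in the third reading frame.
example = "CCATGAAACCCGGGTAACC"
for start, end in find_orfs(example):
    print(start, end, example[start:end])
```

A scan like this ignores the reverse strand, introns and splice variants, and it discards anything shorter than the cutoff, so some genes are always missed. Even for an ORF that is found, the scan says nothing about function.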
Human diversity SNP = single nucleotide polymorphism
Two people differ at about one in every 1500 base pairs. Average 40.6 overlapping areas vs. 7.5 for earlier ones. Total: normally 3.0-3.8 million SNPs. About 2.7-3.8 million were already known. A few hundred thousand novel ones (new mutations) are seen; this number is getting significantly lower over time.
Gene ontology – formal method for annotating genes
by Biological process, Function and Cell component. This allows for easier cross comparisons of species/databases (data from 2008)
Genes with unknown process, unknown function or unknown cell component in S. cerevisiae (~6000 total genes) and S. pombe (~5000 total genes); values as listed: 1793, 2409, 1064, 2084, 2151, 1102
GO – gene ontology consortium http://www.geneontology.org
For genomes in hand
1. Smallest, Carsonella ruddii: 160 kb, 182 genes
Mycoplasma genitalium: 580 kb, 517 genes; about 300 genes are essential
2. Pelagibacter: smallest free-living, 1300 genes
3. One of largest, human (some plants larger): 3.2 x 10^9 bp, 23000 genes
Protein domain architecture: the way complex creatures differ from less complex ones = more functional domains, more regulatory binding sites, more than one catalytic activity
Non-coding DNA was once thought to be just junk
How do we figure out what a gene does? Low throughput: one gene = many PhDs and a lot of money, e.g. PTEN, discovered in 1997, had 7162 papers on it and associated proteins as of 2012
•Intergenic space (remnants of transposons?) •+/- Introns •Gene numbers •Gene families •Alternative splicing •Protein domain architecture •Regulation
It is now thought that almost 100 % of genes are alternatively spliced
General strategies. High throughput: automated testing and whole-genome analysis. High-throughput analysis must always be followed by extensive low-throughput follow-up; high throughput provides focus and leads. Sequence analysis – hints of function: homologs (paralogs and orthologs), conserved domain analysis (catalysis, interactors, regulators)
Many species have the same genes, so the same function can often be inferred across species. Orthologs = the same gene in different species (separated by speciation). Paralogs = related genes within the same species (arising by duplication)
Phenotypic analysis • individual, pedigree or population need a mutant • classical genetics • single or whole genome deletion sets • population analysis, SNP hunting
Find a mutation; mutation defines the gene Restricted to systems with good genetics
Expression analysis • Microarrays – RNA, RNA-seq • Proteomics – expressed and modified proteins
Microarrays = look at the transcripts being produced; put probes on a chip and hybridize the cDNA of the system to it. RNA-seq: gives the sequence, the frequency of occurrence and all the alternative splice forms within the cell; unlike microarrays, it needs no prior knowledge of the sequences
Proteomics: where, when and what and for many of the proteins we currently don’t know
If you can find which proteins associate/bind with your original protein, they may be modifiers, regulators, or part of the same pathway. This is not always true, but in many instances it can help. Build-a-mutant,
Guilt by association
Which proteins are in which complexes
6000 gene deletions (bar coded to detect on microarray), lots of hands
Subject them to any phenotypic test you can devise. The technique is sensitive to relatively small contributions to fitness
Genes and gene products with correlated expression, physical interaction, or colocalization may work in similar pathways •coIP, affinity strategies, often coupled to mass spectrometry •two hybrid •synthetic lethal screening •localization
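The "correlated expression" part of guilt by association can be illustrated with a small sketch. The gene names, expression values and the 0.9 threshold below are invented for illustration; the only technique used is the standard Pearson correlation coefficient.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpressed_partners(profiles: dict[str, list[float]],
                         query: str,
                         threshold: float = 0.9) -> list[str]:
    """Genes whose profile correlates with the query gene at or above threshold."""
    q = profiles[query]
    return sorted(gene for gene, profile in profiles.items()
                  if gene != query and pearson(q, profile) >= threshold)


# Hypothetical expression levels across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],   # rises with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],   # falls as geneA rises
    "geneD": [1.0, 1.0, 2.0, 1.0],   # largely unrelated
}
print(coexpressed_partners(profiles, "geneA"))
```

A high correlation only suggests that two genes may work in similar pathways; as with the other association methods, it is a lead for follow-up rather than proof.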
-http://www.nature.com/nature/journal/v418/n6896/full/nature00935.html -not going to look at specific data here
Synthetic lethal screening: if two mutations that are each viable alone are lethal in combination, the two genes (and their pathways) are probably connected
Often only the mutations that are lethal or extreme are picked up; minor mutations are often overlooked next to these big ones.
Microarrays allow us to look at the expression of all genes at once. Cluster analysis lets us extract information from the data set
Transcriptional profiling: basic tenet, pattern of gene expression at specific moment in time, reflects true physiological state of cell
RNA-seq, which provides information on expression level, splice variants and SNPs, is a strongly competing technology at present
Figure 2. Hierarchical cluster. A portion of a hierarchical cluster, which can be easily navigated, is shown. Red indicates up-regulation in the experimental sample, and green indicates down-regulation in the experimental sample, with respect to the control. The intensity of the color indicates the magnitude of up- or down-regulation. Sherlock et al., 2001
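A minimal sketch of how such a hierarchical cluster can be built is given below. It assumes average-linkage agglomerative clustering with a correlation-based distance, which is a common choice but not necessarily the one used for the figure; the gene names and log-ratio values are invented.

```python
from itertools import combinations
from math import sqrt


def correlation_distance(x: list[float], y: list[float]) -> float:
    """1 - Pearson correlation: 0 for identical patterns, 2 for opposite ones."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sd_x * sd_y)


def average_linkage(profiles: dict[str, list[float]],
                    n_clusters: int) -> list[list[str]]:
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [[name] for name in profiles]

    def dist(c1: list[str], c2: list[str]) -> float:
        # Average pairwise distance between members of the two clusters.
        total = sum(correlation_distance(profiles[a], profiles[b])
                    for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical log ratios (experimental vs. control) over four arrays.
log_ratios = {
    "up1":   [2.0, 1.5, 0.1, -0.2],
    "up2":   [1.8, 1.7, 0.0, -0.1],
    "down1": [-1.9, -1.4, 0.2, 0.1],
    "down2": [-2.1, -1.6, -0.1, 0.3],
}
print(average_linkage(log_ratios, n_clusters=2))
```

In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would normally be used instead; the pure-Python version above is only meant to make the merging step visible.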
Directly tackling the proteome
Tag, isolate complexes of associated proteins and MS id them Argument: if they are stuck together they probably work together
Separation coupled with automated MS for ID. Ho et al., 2002. DNA damage path: Blue – known, Red – previously unknown
There is probably a complete map for yeast cells.
How do we figure out what a gene does?
Take the high throughput data, analyse it extensively (data mining) – systems biology
Try to predict something; testable hypothesis
Then, choose a target One gene = Many PhDs and a lot of money
Tong et al., 2001 synthetic lethality
Asking about paired interactions between different mutations in different pathways. Nothing is known about the physical relationship, simply that whatever pathway this protein is in interacts with these other proteins. The output is fitness (how well the cell works).
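One common way to turn fitness measurements into an interaction call is sketched below. It assumes a multiplicative null model (two independent mutations should give a double-mutant fitness equal to the product of the single-mutant fitnesses); that model, the function names and the tolerance value are assumptions of this sketch and are not taken from the lecture or from Tong et al.

```python
def interaction_score(f_a: float, f_b: float, f_ab: float) -> float:
    """Observed double-mutant fitness minus the multiplicative expectation."""
    return f_ab - f_a * f_b


def classify_pair(f_a: float, f_b: float, f_ab: float,
                  tolerance: float = 0.1) -> str:
    """Label a gene pair from relative fitness values (wild type = 1.0)."""
    score = interaction_score(f_a, f_b, f_ab)
    if f_ab == 0.0 and f_a > 0.0 and f_b > 0.0:
        return "synthetic lethal"      # each single mutant lives, the double dies
    if score < -tolerance:
        return "negative interaction"  # double mutant sicker than expected
    if score > tolerance:
        return "positive interaction"  # double mutant fitter than expected
    return "no interaction"


# Hypothetical relative fitness values for single and double mutants.
print(classify_pair(0.9, 0.8, 0.0))
print(classify_pair(0.9, 0.8, 0.72))
print(classify_pair(0.9, 0.8, 0.4))
```

The call says nothing about whether the two gene products touch each other physically; it only reports that the combined effect on fitness differs from what independent pathways would give.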
•cheap sequence for all (or any we care about) organisms •an absolute phylogeny • linkage disequilibrium for all human (or other) genes making a substantial contribution to any given phenotype • full knowledge of biochemical pathways and protein interactions and modifications • prediction of cell and organism function from sequence (given the environmental variables) • things we cannot yet predict