GNAS2009 Abstract #16 - Goodstadt, Leo

Accurate inferences of orthology among closely related species

Leo Goodstadt1, Andreas Heger1

1MRC Functional Genomics Unit, University of Oxford, United Kingdom

We have developed a pipeline for the accurate and comprehensive inference of orthologous relationships among closely related species (e.g. amniotes) as a basis for the functional annotation of multiple whole genomes.

The key to our orthology inferences is the use of synonymous substitution rates (dS) to refine inferences of evolutionary relationships between protein-coding genes. dS values are largely unaffected by variable natural selection, and the use of cDNA rather than amino acid sequence provides much greater discriminatory power for closely related sequences. Our simulations indicate that estimates of dS are less prone to saturation than commonly assumed. We have extended our initial analyses of orthology predictions between pairs of species (e.g. human and dog) in the PhyOP pipeline to the evolutionary provenance of genes across multiple amniote species in the OPTIC pipeline.
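To illustrate what saturation means for a substitution-rate estimate, the following is a minimal sketch using the standard Jukes-Cantor multiple-hit correction. This correction is an assumption chosen for illustration; the abstract does not state which estimator PhyOP or OPTIC uses, and the function name is hypothetical. As the observed proportion of differing sites approaches 0.75, the corrected distance grows without bound and can no longer be estimated.

```python
import math


def jukes_cantor_distance(p):
    """Correct an observed proportion of differing sites for multiple hits.

    Illustrative only: the Jukes-Cantor model is an assumption made here,
    not the estimator used by the pipelines described in the abstract.
    """
    if not 0.0 <= p < 0.75:
        # At p >= 0.75 the logarithm is undefined: the estimate is saturated.
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # The correction is small for similar sequences and grows rapidly
    # as the observed divergence approaches the saturation limit.
    for p in (0.05, 0.30, 0.60, 0.74):
        print(p, round(jukes_cantor_distance(p), 3))
```

For closely related species the observed synonymous divergence is usually well below this limit, which is the regime in which such estimates are informative.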

In order to make maximal use of alternatively spliced transcripts, phylogenetic relationships are inferred from concatenated translations of all protein-coding exons for each gene. This sidesteps errors resulting from comparisons of transcripts with different exon usage, a particular problem for amniotes whose splicing repertoire is poorly captured in existing cDNA libraries. This approach also allows the resolution of chimeric gene mispredictions that concatenate neighboring paralogues, and extends orthology predictions to each exon. We find that conservation of orthologous exon boundaries and phase remains the strongest signal for the quality of gene predictions, suggesting that a large number of errors in predicted gene models remain in current gene catalogues.
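The idea of pooling exons across alternative transcripts can be sketched as follows. This is a simplified illustration, not the pipeline's implementation: the function names and the half-open (start, end) coordinate convention are assumptions, and reading frame and translation are ignored for brevity.

```python
def merge_exons(transcripts):
    """Return the union of exon intervals across all transcripts of a gene.

    Each transcript is a list of half-open (start, end) intervals.
    Overlapping or touching exons are merged into one interval.
    """
    intervals = sorted(exon for exons in transcripts for exon in exons)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps or abuts the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def concatenate(sequence, exons):
    """Join the subsequences covered by the merged exons, in order."""
    return "".join(sequence[start:end] for start, end in exons)


if __name__ == "__main__":
    # Two hypothetical transcripts of one gene with different exon usage.
    transcripts = [[(0, 3), (6, 9)], [(2, 4), (12, 15)]]
    exons = merge_exons(transcripts)
    print(exons)
    print(concatenate("ABCDEFGHIJKLMNOP", exons))
```

Comparing genes through one merged exon set per gene means that two orthologues are not penalised merely because their annotated transcripts happen to use different exons.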

Most changes to gene sets occur in large multi-copy families, and their rapid sequence change, driven by gene duplication, conversion and often positive selection, represents the greatest challenge to orthology prediction. Accurate inferences may be limited by the information within coding sequence. Our future directions include the automatic curation of orthology inferences by conservation of gene order (synteny), and the use of genome-wide parameters in our evolutionary models.
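One way conservation of gene order might be used to support or question a candidate orthologue pair is sketched below. This is a hypothetical scoring scheme, not a method described in the abstract: the function name, the window size and the input layout are all assumptions made for illustration.

```python
def synteny_support(gene_a, gene_b, order_a, order_b, orthologs, window=2):
    """Count orthologous pairs flanking a candidate pair in both genomes.

    order_a and order_b list gene identifiers in chromosomal order.
    orthologs is a collection of (gene_in_a, gene_in_b) pairs.
    A neighbor is any gene within `window` positions of the candidate.
    """
    ia, ib = order_a.index(gene_a), order_b.index(gene_b)
    near_a = set(order_a[max(0, ia - window): ia + window + 1]) - {gene_a}
    near_b = set(order_b[max(0, ib - window): ib + window + 1]) - {gene_b}
    return sum(1 for a, b in orthologs if a in near_a and b in near_b)


if __name__ == "__main__":
    # Toy gene orders for two hypothetical genomes.
    order_a = ["a1", "a2", "a3", "a4", "a5"]
    order_b = ["b1", "b2", "b3", "b4", "b5"]
    orthologs = {("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b9")}
    print(synteny_support("a3", "b3", order_a, order_b, orthologs))
```

A candidate pair with many flanking orthologous pairs sits in a region of conserved gene order, whereas a score of zero would flag the pair for closer inspection.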

These approaches to infer orthology with the greatest accuracy will allow us to better understand which parts of the large lineage-specific differences in protein coding gene sets underlie the distinct biology of each species.