HGM01 Nomenclature Workshop (HGM01-NW)

This is an archived page. The information displayed has not been updated.

HGM01-NW Gene Nomenclature Workshop

"Looking at genes in the draft human sequence"
Thursday April 19, 2001
09:00-12:30
Ochil Room 3
Edinburgh International Conference Centre, Morrison Street.

This nomenclature workshop will take place in Edinburgh as a satellite meeting before HGM 2001 (http://hgm2001.hgu.mrc.ac.uk/contents.htm), on the morning of 19th April.

Introduction

Following the success of the workshop held before the 50th Annual ASHG Meeting, we are holding another small meeting to discuss the human genomic draft sequence and its annotation and interpretation within the various online databases. Participation will be open to all, but we ask that you notify us in advance of your intent to attend.

Program

09:00    Review of ASHG-NW - Hester Wain (HGNC)
09:10    "Genome Annotation with Ensembl" Tim Hubbard (Sanger)
09:30    "Annotating genes at NCBI" Donna Maglott (NCBI)
09:50   Chaired discussion:

Sources of human data - clarity for identification of same sequence
Integrity of gene symbol mapping between databases
Cytogenetic location - verification
Gene families - correct identification of individual members
International Advisory Committee
Confidentiality of gene names and symbols

12:15     Summary - Sue Povey (HGNC)
12:30     Meeting Ends

Funding

There is currently no funding available for expenses for this workshop.

Contact

All enquiries should be sent to:

Dr Hester Wain
HUGO Gene Nomenclature Committee
University College London
Wolfson House
4 Stephenson Way
London, NW1 2HE
UK
Fax: +44 (1) 171 387 3496

Workshop report

"Looking at genes in the draft human sequence"

Introduction

Registered participants were: Rolf Apweiler, Tom Broad (AgResearch), Elspeth Bruford, Dave Burt (ARKdb), Richard Cammack, Sally Cross (MRC Human Genetics Unit), Matthew Darlison, Ian Dunham (Chromosome 22), Janan Eppig (MGD), Philippe Gautier (MRC Human Genetics Unit), Midori Harris, Tim Hubbard, Ian Jackson, Youla Karavidopoulou, Marie-Paule Lefranc, Ruth Lovering, Michael Lush, Donna Maglott (NCBI), Sue Povey, Connie Talbot (GDB), Hester Wain, Mathew Wright.

Hester Wain presented a summary of the last nomenclature workshop held prior to ASHG 2000. This was followed by Tim Hubbard's presentation of "Genome Annotation with Ensembl". Ensembl (https://www.ensembl.org/) consists of genes and genomic features automatically identified using Genscan and Genewise software. Genscan evidence is generated from protein predictions which are compared against protein databases. The genomic sequence used is the "Golden Path" assembled at Santa Cruz (http://genome.cse.ucsc.edu/). Genscan over-predicts by 30-40% but does hit 95% of genes, as calibrated against 1 Mb of sequence from the well-studied BRCA2 region. Genewise is used to predict exons directly from genomic sequence. EST evidence is not directly used in Ensembl, but may be in the future, as a database of ESTs is being built. Pseudogenes are not currently well represented, as the prediction systems are not accurate enough and "find" multiple exons with 1 bp introns. Tim also reported that 1 in 10 predicted pseudogenes may actually be due to sequencing errors. A new promoter prediction algorithm will appear soon.

The new version of Ensembl generated at the end of April 2001 contains stable identifiers from the last version. There is co-ordination with NCBI, and transcript comparisons are classified into at least four categories: T=total agreement, U=only one gene merged but with some discrepancy in the exon structure, C=cluster of exons but no exact boundaries, S=only predicted by one group. Mouse genome sequence giving approximately 3x coverage is also being mapped onto the human data with "exonerate" (using gapped 4mer like blasts).

In future the DAS (distributed annotation system) will hopefully improve database searching online. This system co-ordinates multi-server: single client searching, such that one browser will be able to access a variety of platforms simultaneously.

Donna Maglott presented "Annotating genes at NCBI" (https://www.ncbi.nlm.nih.gov/). NCBI is not using the Santa Cruz alignments but is building a new genome assembly "pipeline" that uses mRNAs to find genes and place them in the alignment. Alongside this is annotation drawn from UniGene, genome-specific databases (e.g. MGD), HomoloGene, RefSeq etc. New genes are identified via RefSeq using the sequence alignment criteria of >=98% for an identical product and canonical splice junctions. Proteome, Inc. has provided GO annotations, and Blast is used to validate sequence alignments. NCBI sequences are identified as: NM# containing submitted mRNA data, XM# containing mRNA data generated from model transcripts from NCBI's genomic contig, NT# containing NCBI's genomic contig, NG# containing clusters of highly related sequences.
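The four identifier classes listed above can be sketched as a simple lookup. This is only an illustration: the descriptions are those given in the presentation, while the function name, the underscore-separated layout (prefix, underscore, digits) and the example accession are assumptions made for the sketch, not part of the report.

```python
# Descriptions follow the workshop report. The "PREFIX_digits" layout and the
# example accession used with this function are assumptions for illustration.
REFSEQ_PREFIXES = {
    "NM": "submitted mRNA data",
    "XM": "mRNA data generated from model transcripts from NCBI's genomic contig",
    "NT": "NCBI's genomic contig",
    "NG": "cluster of highly related sequences",
}


def describe_accession(accession: str) -> str:
    """Return the report's description for the prefix of an accession."""
    # Take everything before the first underscore as the prefix.
    prefix = accession.strip().upper().split("_", 1)[0]
    return REFSEQ_PREFIXES.get(prefix, "unknown prefix")
```

For example, `describe_accession("NM_123456")` (a made-up accession) would return the description for submitted mRNA data, and any prefix not in the table falls through to "unknown prefix".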

Discussion

Proactive gene naming

HGNC has tried using Interpro-based, Ensembl-identified gene families to identify and assign new gene symbols. Rolf Apweiler suggested that HGNC were best at being reactive rather than proactive. Tim Hubbard stated that one third of the genome was finished and the full shotgun sequence would be available by the end of the year.

Confidentiality of gene names and symbols

The current HGNC confidentiality agreement was presented. Questions were raised concerning the problems that occur when two submitters have the same sequence but both want to maintain confidentiality. Ian Dunham said that when the Sanger Centre is approached by two different groups, it informs both that another group is interested and sends the details to each. Ian said it was important that the approved symbol was a permanent symbol and suggested that we should categorically state at the beginning of any negotiation that any other group with the same sequence will be informed of "others' interests" so that gene symbols can be negotiated with all parties aware. HGNC agreed to consider this, but stated that it was not always possible, as submitters still want their data to remain confidential until publication and it is essential that authors trust us to do this. The differing worldwide patent laws were also put forward by Tom Broad as a potential problem for data release. An alternative suggestion was to assign temporary "C#orf#" symbols to any gene for which the submitter refused to release all data.

Approved Symbol status

Midori Harris asked when a symbol is considered official. It is official when HGNC have approved or reserved it. New gene symbols under negotiation have a "pending" status but these are not made publicly available. Some HGNC approved symbols do not have associated sequence data, and it has been found that the Drosophila genome as contained in FlyBase had a 30% excess of symbols (from predicted genes), which are gradually being identified and merged. HGNC agreed to look into curating those human genes without associated sequence more thoroughly.

Sources of human data - clarity for identification of same sequence

It was shown that NCBI and Sanger/EBI have differing numerical identifiers for the same gene in genomic sequence. Confusion can also be compounded if version numbers are not used to identify sequences. HGNC agreed to add version numbers to those kept in Genew.
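The point about version numbers can be illustrated with a small sketch. It assumes identifiers written as "accession.version" (for example a made-up "AB000001.2"); that layout, and all names below, are assumptions for illustration and do not describe how Genew stores its data.

```python
from typing import NamedTuple, Optional


class VersionedAccession(NamedTuple):
    accession: str
    version: Optional[int]  # None when no version suffix is present


def parse_accession(identifier: str) -> VersionedAccession:
    """Split an assumed 'accession.version' identifier into its two parts."""
    text = identifier.strip()
    base, dot, version = text.rpartition(".")
    if dot and version.isdigit():
        return VersionedAccession(base, int(version))
    return VersionedAccession(text, None)


def same_sequence(a: str, b: str) -> bool:
    """True only when both identifiers carry a version and match exactly."""
    pa, pb = parse_accession(a), parse_accession(b)
    return pa.version is not None and pa == pb
```

Under these assumptions, two unversioned copies of the same accession are deliberately not treated as the same sequence, since either could refer to a different revision.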

Annotation

Marie-Paule Lefranc raised the concern that electronic annotation in databases may become confused with carefully hand-curated information, such as that for the IG and TCR genes in LocusLink. However, Donna Maglott reassured her that, although there is continuous curation of well-annotated regions, some are excluded from electronic annotation.

Cytogenetic location verification

Dave Burt noted that location information could be generated by submitters from FISH, BAC, or other data, which can further compound errors, and asked whether any source was given priority. Donna Maglott said that NCBI put a high value on STS mapping and were trying to identify and verify locations within their databases.

Genew Database

There were a number of requests for HGNC curated sequence accession IDs to be made available. This is not currently possible as these fields in the database have always been private and some may contain confidential data. It will be possible to address this in the future; but it will require detailed curation. Sequence accession IDs that are associated with a gene symbol may not be associated with approved symbols in the public databases. Therefore, we will need to check each gene to confirm that the accession IDs can be released.

HGNC ID numbers were also discussed. The publicly available online database is a search engine that uses some of the text files exported from the private database. The HGNC IDs are used internally and are exported to one of the downloadable text files; however, it is not possible to search these online as the search engine does not support this facility. Marie-Paule Lefranc challenged this and HGNC agreed to investigate the feasibility of changing the online search engine. However, it would be more useful to upgrade the database to give full online searching and editorial capabilities. We are investigating changing to MySQL and are currently awaiting delivery of our new server, which will be able to support this.

Integrity of gene symbol mapping between databases

The introduction of HGNC IDs for each gene has aided database interoperability, but Rolf Apweiler questioned the tracking of these between gene merges and splits; for current details see NomeNews Issue 5. HGNC agreed to re-evaluate their procedure in comparison to other databases.

Other Genomic Features

Tom Broad questioned whether LINEs and SINEs were to be named/identified within the human genome and how to name (and identify) non-translated mRNAs, e.g. H19. He also queried how almost identical loci in multiple locations should best be named; in the human we assign symbols in a consecutive number series, e.g. ABCD1P1 on 2p11, ABCD1P2 on 10p11, ABCD1P3 on 16p11 and ABCD1P4 on 22q11.
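The consecutive number series in the ABCD1P example can be sketched as follows. This is a minimal illustration of the numbering pattern only, assuming symbols of the form parent symbol + "P" + number; the function name is hypothetical and this is not HGNC's actual assignment procedure.

```python
import re
from typing import Iterable


def next_pseudogene_symbol(parent: str, existing: Iterable[str]) -> str:
    """Suggest the next symbol in a '<parent>P<n>' consecutive series."""
    pattern = re.compile(rf"^{re.escape(parent)}P(\d+)$")
    numbers = []
    for symbol in existing:
        match = pattern.match(symbol)
        if match:
            numbers.append(int(match.group(1)))
    # Start at 1 when no member of the series exists yet.
    return f"{parent}P{max(numbers, default=0) + 1}"
```

Given the four symbols quoted above, the sketch would propose ABCD1P5 for a further locus; symbols that do not fit the pattern are ignored.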

Gene families - correct identification of individual members

Dave Burt questioned the criteria behind each gene family and asked that these be made publicly available. Tom Broad suggested that gene family committees could be set up, with the ability to publish their decisions.

Publications and funding

Marie-Paule Lefranc, who has been successful in publishing gene family tables for IMGT in Karger publications, addressed this issue which is frequently bemoaned by database-oriented scientists. Marie-Paule encouraged all to pursue publication.

Sporadically-occurring genes

Matthew Darlison, who is specifically looking at the alpha globins, raised the problem of whole gene duplication and sequence fusion forming a new gene in some individuals. Nomenclature is particularly difficult for the clinician in such cases, for example a "triplicated alpha" caused by uneven crossing-over. This was discussed, and the suggestion of HBA3 was not thought an adequate or correctly descriptive solution. One idea was the addition of a "rarity" field to databases, enabling identification of these and other population-specific haplotypes.

International Advisory Committee

Suggestions were put forward for three new nominees to the IAC: Richard Cammack (IUBMB), Mark Paalman (Managing Editor, Human Mutation) and Steve Scherer (Senior Editor for Human Chromosome 7, GDB and Chair, International Human Genome Organization Mapping Committee).