HGNC FAQs
Giving unique and meaningful names to every human gene
Rather than use a numerical identifier, HGNC approves a short-form abbreviation known as a gene symbol, and also a longer and more descriptive name. Each symbol is unique and the committee ensures that each gene is only given one approved gene symbol. It is necessary to provide a unique symbol for each gene so that we can talk about them, and to facilitate electronic data retrieval from publications. In preference, each symbol maintains parallel construction in different members of a gene family and can also be used in other species, especially the mouse.
The publications of the complete human genome sequence suggest that there are 26,000-40,000 genes (International Human Genome Sequencing Consortium (2001), Venter et al. (2001)). Because gene symbols are sometimes confused in publications, whenever you search for information about your 'pet' gene you may miss highly relevant information, or waste time reading about genes that have nothing in common with your 'pet' gene except the name or symbol. Therefore, we provide a unique identifier for each gene, the approved gene symbol. This will help to reduce the time it takes to access all data pertaining to your specific gene of interest, and also proves useful for data retrieval across species.
You can search the list of approved human gene symbols using the query engine.
First, see if the mouse ortholog of your gene has an approved symbol by searching MGD; if so, you should use that symbol, and if not, suggest something novel in your application. Then, fill in a "gene symbol request form" and send it to the HUGO Gene Nomenclature Committee. Remember that you need a name (description) and symbol (short-form abbreviation) for your gene, e.g. ADK: adenosine kinase. See complete instructions.
Upon submission, you can specify that you want the information to remain confidential until publication. Both human and mouse nomenclature committees maintain a confidential database in which symbols can be reserved prior to publication if required.
The "symbol" is a unique series of Latin (upper case in human) letters and Arabic numbers which should preferably be no longer than six characters in length. The longer descriptive "name" should be concise and convey the character or function of the gene. The first letter of the symbol should be the same as that of the name in order to facilitate alphabetical listing and grouping e.g. the gene with the name "breast cancer, early onset 1" has the symbol "BRCA1".
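The conventions above can be sketched as a small check. This is only an illustration of the rules as stated here, not an official HGNC validator; the function name `check_symbol` and the warning texts are hypothetical.

```python
import re

# Hypothetical helper illustrating the conventions described above.
# Upper-case Latin letters and Arabic numbers only (human symbols).
SYMBOL_PATTERN = re.compile(r"[A-Z0-9]+")
PREFERRED_MAX_LENGTH = 6  # "preferably no longer than six characters"


def check_symbol(symbol, name):
    """Return a list of guideline warnings for a proposed symbol and name."""
    warnings = []
    if not SYMBOL_PATTERN.fullmatch(symbol):
        warnings.append("use only upper-case Latin letters and Arabic numbers")
    if len(symbol) > PREFERRED_MAX_LENGTH:
        warnings.append("symbol is longer than the preferred six characters")
    # The first letter of the symbol should match the first letter of the name.
    if symbol[:1].upper() != name[:1].upper():
        warnings.append("symbol and name should start with the same letter")
    return warnings


print(check_symbol("BRCA1", "breast cancer, early onset 1"))
print(check_symbol("adk-2", "adenosine kinase"))
```

The length limit is reported as a warning rather than an error because the text describes it as a preference.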
A stem (or root) symbol is used as a basis for a series of approved symbols which are defined as members of either a functional or structural gene family. Stem symbols are approved prior to use by a number of scientists in that specific field (# denotes number in series), e.g. CASP#: caspase, apoptosis-related cysteine peptidase; CYP#: cytochrome P450; HOX#: homeo box; DUSP#: dual specificity phosphatase; SCN2A#: sodium channel, voltage-gated, type II, alpha 2 polypeptide; and SH3GL#: SH3-domain GRB2-like.
It is important for those in the field to agree and use a systematic stem symbol so that other researchers, for example those working on positional cloning within a particular chromosome region, can understand that there is a structural or functional relationship between the different genes. Also, these researchers may clone a member of this family and, with a good system in place, will find naming their gene far more straightforward.
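As a rough illustration of how a stem symbol groups a family, the sketch below reads the `#` placeholder as one or more digits. That reading is a simplifying assumption (some families use letters as well as numbers after the stem), and the helper names are hypothetical.

```python
import re


def stem_to_pattern(stem):
    """Turn a stem such as 'CASP#' into a regex; '#' is read as digits (assumption)."""
    if not stem.endswith("#"):
        raise ValueError("stem is expected to end with '#'")
    return re.compile(re.escape(stem[:-1]) + r"(\d+)")


def family_members(stem, symbols):
    """Return the symbols that belong to the stem, ordered by series number."""
    pattern = stem_to_pattern(stem)
    members = []
    for symbol in symbols:
        match = pattern.fullmatch(symbol)
        if match:
            members.append((int(match.group(1)), symbol))
    return [symbol for _, symbol in sorted(members)]


symbols = ["CASP10", "DUSP1", "CASP3", "BRCA1", "CASP8"]
print(family_members("CASP#", symbols))
```

Sorting by the numeric part rather than alphabetically keeps CASP10 after CASP8.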
Whenever possible the symbol used in publications is maintained. However, if this symbol has been used for other genes or if the gene is a member of a gene family then an alternative symbol will be approved.
As of 2001, with over 13,000 gene symbols approved, there are still at least 13,000-27,000 genes to name. If every new gene had an individual combination of letters rather than a familial stem/root symbol with a unique identifying number, we would risk running out of 3, 4 and 5 character symbols which bear any relationship to the name of the gene.
This is the symbol by which a gene has been previously known in the literature or databases. Aliases are usually recorded along with the approved symbols as part of the gene entry to facilitate database searching. Thus the following databases all contain both approved symbols and aliases:
Aliases are taken from any published source, including abstracts and databases such as OMIM, SWISS-PROT, PubMed and GenBank, and occasionally also from personal communications.
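To show why recording aliases alongside approved symbols helps searching, here is a minimal sketch of an alias lookup. The data structure and function names are hypothetical and do not reflect how any of the databases mentioned actually store entries; the ADAMTS5/ADAMTS11 pair is the example given elsewhere in this FAQ.

```python
# Hypothetical in-memory gene entries: approved symbol -> recorded aliases.
GENE_ENTRIES = {
    "ADAMTS5": {"aliases": ["ADAMTS11"]},
    "ADK": {"aliases": []},
}


def build_index(entries):
    """Map every approved symbol and alias (case-insensitively) to an approved symbol."""
    index = {}
    # Aliases first, so that an approved symbol always wins over a clashing alias.
    for approved, entry in entries.items():
        for alias in entry["aliases"]:
            index.setdefault(alias.upper(), approved)
    for approved in entries:
        index[approved.upper()] = approved
    return index


def resolve(term, index):
    """Return the approved symbol for a search term, or None if it is unknown."""
    return index.get(term.upper())


index = build_index(GENE_ENTRIES)
print(resolve("ADAMTS11", index))
print(resolve("adk", index))
```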
A list of HGNC nomenclature publications is available.
Punctuation, including hyphenation, causes considerable difficulty and confusion in searches of electronic databases. Therefore, we have to remove full-stops, slashes and usually hyphens when approving a new symbol.
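A minimal sketch of this clean-up step is shown below. It is an assumption about how one might apply the rule, not HGNC's own procedure: hyphens are always removed here although the text says "usually", and upper-casing follows the convention for human symbols.

```python
import re


def strip_punctuation(proposed):
    """Remove full stops, slashes and hyphens, then upper-case the result."""
    return re.sub(r"[./-]", "", proposed).upper()


print(strip_punctuation("mab-21"))
print(strip_punctuation("a.b/c-1"))
```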
Ideally, protein names and symbols should be identical to those used for the gene. However, as we are only a gene database we do not currently have any guidelines pertaining to proteins. There is a recommendation for the use of italics for gene symbols, and non-italicized letters for the protein; but a number of journals have editorial policies which prevent this convention being used, so it is not by any means universal.
The CD nomenclature is a valid and very useful system for human cell surface differentiation molecules, which is approved by the International Union of Immunological Societies (IUIS) and World Health Organization (WHO). It was initially designed to name unique human leucocyte cell surface structures identified by monoclonal antibodies. This nomenclature system was chosen to be neutral, unambiguous, informative and easily remembered. It was foreseen that including information relating to the gene family, function, and tissue distribution could create ambiguity. Clearly, whereas more than half of the genes encoding CDs belong to the Ig superfamily, many CDs are encoded by members of several gene families, while some CDs represent lipid or sugar components. In addition, the function of a given molecule can vary greatly because of the trans/cis interacting molecules in different cell types. The CD symbol means that the molecule identified by at least one monoclonal antibody is human, is a cell surface molecule, is a differentiation molecule (i.e. expressed by some but not all human cells) and shows little or no polymorphism.
We have very strong ties and interactions with a number of other Nomenclature Committees and databases, particularly the Mouse Nomenclature Committee and OMIM. We also interact on a less regular basis with the Mendel database (for plants) and with Drosophila, chicken, rat, pig, bovine, sheep, horse and yeast databases as these also have quite robust nomenclature guidelines. It is difficult to establish if different species' genes are orthologous, but where it can be shown we try to maintain the same or similar symbols as used in the other genomes. When homologs of genes in non-vertebrate organisms are identified we usually add an "L" for like at the end e.g. MAB21L1 "mab-21 (C.elegans)-like 1", AFG3L1 "AFG3 (ATPase family gene 3, yeast)-like 1".
We do realise that not everyone will consistently use approved symbols, but if they are at least mentioned in a publication, it will ensure that the symbol can be used as a search term. This then gives a reference point to facilitate data retrieval in a number of databases, including PubMed, GenBank, OMIM, Entrez Gene and MGD. We do not categorically insist on the wholesale use of approved symbols because particular researchers form attachments to certain terms and not all journals at present insist on the approved nomenclature. But some journals, like Genomics and Nature Genetics, do insist on the use of approved nomenclature and may, at their discretion, insist on this throughout a publication.
We would like to encourage as many researchers as possible to contribute towards a new nomenclature system as we hope they would then be more likely to use it. In recent years we have found that once a system is established, and is found to be useful, it becomes much more prominent and frequently replaces the original designations in new publications.
Human homologs of genes first identified in other species should not be designated by a symbol beginning with H (or h) for human. When necessary to distinguish the species of origin for homologous genes with the same gene symbol, the letter-based code for different species already established by SWISS-PROT is recommended. The identification codes can be found at URL http://www.expasy.ch/cgi-bin/speclist. The code is for use in publications only and is not incorporated as part of the gene symbol. The species designation is added as a prefix, in parentheses, to the gene symbol. For example, HUMAN signifies Homo sapiens and MOUSE signifies Mus musculus. Examples of this are: (HUMAN)G6PD; (HUMAN)HBB; (HUMAN)ALB; homologous mouse genes: (MOUSE)G6pd; (MOUSE)Hbb; (MOUSE)Alb. Further examples of the species codes can be found in the Guidelines.
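The prefix convention can be sketched as follows. The mapping contains only the two codes named above, and the function name is hypothetical; the prefix is meant for use in publications and is not stored as part of the symbol.

```python
# Only the two species codes mentioned in the text; others are listed by SWISS-PROT.
SPECIES_CODES = {
    "Homo sapiens": "HUMAN",
    "Mus musculus": "MOUSE",
}


def with_species_prefix(symbol, species):
    """Format a gene symbol with its species code as a parenthesised prefix."""
    code = SPECIES_CODES[species]  # raises KeyError for species not listed here
    return f"({code}){symbol}"


print(with_species_prefix("G6PD", "Homo sapiens"))
print(with_species_prefix("G6pd", "Mus musculus"))
```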
Gene symbols are not usually based on functional data because:
1) Letters to specify tissue distribution have been used historically, but experience has shown that tissue specificity may not be as restricted as described initially. (see Guidelines)
2) Homologous genes in different vertebrate species (orthologs) should where possible have the same gene nomenclature. This is not practical when the functions vary between organisms. (see Guidelines)
3) Alternate transcripts from the same gene should not be given different gene symbols. If one gene has different products with differing functions this can create serious problems with gene nomenclature if it is based upon function. (see Guidelines)
Therefore, if at all possible we try to base gene symbols on data generated from nucleic acid sequence rather than from functional information.
There is clearly some concern over the private reservation of gene symbols, but this is a facility that we need to have, especially for genes that are members of a closely related gene family. For each of the symbols we keep in our reserved database we maintain confidential information, including sequences and cytogenetic locations, against which we can check any new gene symbol request. We do not insist that these are cDNA sequences, although they usually are; we also accept ESTs. However, if more than one symbol is mistakenly assigned to the same gene, the lowest number in the series is kept and the other symbol is used as an alias, e.g. ADAMTS5, which also has the alias ADAMTS11.
Approved gene symbols are an important tool in the tracking and retrieval of information from the various on-line databases. Each gene is given a unique approved symbol, and merging two symbols for one gene into one entry has proven in the past to be less confusing than separating two genes that have been given the same symbol.
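The rule for merging symbols mistakenly assigned to the same gene (keep the lowest number in the series, record the rest as aliases) can be sketched like this. The function name is hypothetical and the sketch assumes every symbol ends in its series number.

```python
import re


def resolve_duplicates(symbols):
    """Return (approved_symbol, aliases) for symbols assigned to one gene."""

    def series_number(symbol):
        # Assumes the series number is the run of digits at the end of the symbol.
        match = re.search(r"(\d+)$", symbol)
        if match is None:
            raise ValueError(f"{symbol!r} has no trailing series number")
        return int(match.group(1))

    ordered = sorted(symbols, key=series_number)
    return ordered[0], ordered[1:]


approved, aliases = resolve_duplicates(["ADAMTS11", "ADAMTS5"])
print(approved, aliases)
```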
Gene symbols will not usually be assigned to alternative transcripts or genes predicted solely from in silico data.
"Symbol Withdrawn" refers to a previously approved HGNC symbol for a gene which now has a different approved symbol. "Entry Withdrawn" refers to a previously approved HGNC symbol for a gene that has since been shown not to exist.
Authors are requested to cite: Eyre TA, Ducluzeau F, Sneddon TP, Povey S, Bruford EA and Lush MJ. The HUGO Gene Nomenclature Database, 2006 updates. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D319-21. PMID:16381876, and the database in the following format: HGNC Database, HUGO Gene Nomenclature Committee (HGNC), EMBL Outstation - Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. (http://www.genenames.org). Include the month and year you retrieved the data cited.