Current guidelines for naming human genes
For a discussion of our latest guidelines please go to https://rdcu.be/b53pu (PMID 32747822, doi: 10.1038/s41588-020-0669-3).
In the absence of a universally agreed alternative, the HGNC maintains the definition of a gene as “a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterized by sequence, transcription or homology”.
Each gene is assigned only one symbol; the HGNC does not routinely name isoforms (i.e. alternate transcripts or splice variants). This means no separate symbols for protein-coding or non-coding RNA isoforms of a protein-coding locus or alternative transcripts from a non-coding RNA locus. In exceptional circumstances, and following community demand, separate symbols have been approved for gene segments in complex loci, e.g. the UGT1 locus. Putative bicistronic loci may be assigned separate symbols to represent the distinct gene products.
Every gene that we name is assigned a unique symbol, HGNC ID (in the format HGNC:#) and descriptive name. Symbols contain only uppercase Latin letters and Arabic numerals, and punctuation is avoided, with an exception for hyphens in specific groups. Symbols should not be the same as commonly used abbreviations, to facilitate data retrieval. Nomenclature should not contain references to any species or ‘G’ for gene, nor should it be offensive or pejorative.
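The character rules above can be approximated with a pair of regular expressions. This is a minimal sketch rather than an HGNC tool: the function names are hypothetical, any internal hyphen is accepted (the guidelines limit hyphens to specific groups), and the C#orf# placeholder format described later in these guidelines is treated as the one lowercase exception. Underscore symbols for alternate loci and the check against commonly used abbreviations are not covered.

```python
import re

# Uppercase letters and digits, optionally joined by hyphens. Assumption: the
# group-specific hyphen rule is not modelled, so any internal hyphen passes.
_STANDARD = re.compile(r"[A-Z0-9]+(?:-[A-Z0-9]+)*")
# Placeholder symbols such as C17orf50 keep a lowercase "orf".
_ORF = re.compile(r"C[0-9]+orf[0-9]+")
# HGNC IDs have the shape "HGNC:" followed by digits.
_HGNC_ID = re.compile(r"HGNC:[0-9]+")


def looks_like_symbol(text: str) -> bool:
    """Return True if text satisfies the character rules for a gene symbol."""
    return bool(_STANDARD.fullmatch(text) or _ORF.fullmatch(text))


def looks_like_hgnc_id(text: str) -> bool:
    """Return True if text has the HGNC:# shape (existence is not checked)."""
    return bool(_HGNC_ID.fullmatch(text))
```

With these definitions, `looks_like_symbol("BEND7")` and `looks_like_symbol("C17orf50")` are true, while `looks_like_symbol("bend7")` is false. A match only shows that the string is well formed, not that the symbol has been approved.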
Protein coding genes
We aim to name protein-coding genes based on a key normal function of the gene product.
In the absence of functional data, protein-coding genes may be named in the following ways:
- Based on recognized structural domains and motifs encoded by the gene (e.g. BEND7, “BEN domain containing 7”)
- Based on homologous genes within the human genome (e.g. GPRIN3, “GPRIN family member 3”)
- Based on homologous genes from another species (e.g. FEM1A, “fem-1 homolog A”)
- Based only on the presence of an open reading frame (e.g. C17orf50, “chromosome 17 open reading frame 50”)
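The open reading frame placeholder format in the last item is regular enough to expand mechanically. The sketch below derives the descriptive name from such a symbol; the function name is hypothetical and only numbered chromosomes are handled.

```python
import re

# C<chromosome>orf<serial number>, e.g. C17orf50.
_ORF_SYMBOL = re.compile(r"C([0-9]+)orf([0-9]+)")


def orf_placeholder_name(symbol: str) -> str:
    """Expand a C#orf# symbol into its descriptive gene name."""
    match = _ORF_SYMBOL.fullmatch(symbol)
    if match is None:
        raise ValueError(f"not a C#orf# placeholder symbol: {symbol!r}")
    chromosome, serial = match.groups()
    return f"chromosome {chromosome} open reading frame {serial}"
```

For example, `orf_placeholder_name("C17orf50")` returns `"chromosome 17 open reading frame 50"`.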
Where possible, related genes are named using a common root symbol to enable grouping, typically based on sequence homology, shared function or membership of protein complexes.
For genes involved in specific immune processes, or encoding an enzyme, receptor or ion channel, we consult with specialist nomenclature groups (please see supplementary note at https://www.readcube.com/articles/supplement?doi=10.1038%2Fs41588-020-0669-3&index=0). For other major gene groups we consult a panel of advisors when naming new members and discussing proposed nomenclature updates.
Pseudogenes
We define a pseudogene as a sequence that is incapable of producing a functional protein product but has a high level of homology to a functional gene. In general, we only name pseudogenes that retain homology to a significant proportion of the functional ancestral gene.
Processed pseudogenes are named based on the specific parent gene, with a P and number appended to the parent gene symbol (e.g. NACAP10, “NACA pseudogene 10”). The numbering is usually species-specific.
Pseudogenes that retain most of the coding sequence compared to other family members (and are usually unprocessed) are named as a new family member with a “P” suffix, e.g. DDX12P, “DEAD/H-box helicase 12, pseudogene”. This naming format is also used for genes that are pseudogenized relative to their functional ortholog in another species. Note that, in rare cases, such pseudogenes do not include the “P” if the symbol is well established, e.g. MMP23A, “matrix metallopeptidase 23A (pseudogene)”.
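The two pseudogene formats can be sketched as simple string templates. The function names below are hypothetical, and the code only composes a candidate symbol and name; it does not check for the established-symbol exception or for clashes with existing symbols.

```python
from typing import Tuple


def processed_pseudogene(parent_symbol: str, number: int) -> Tuple[str, str]:
    """Parent symbol + "P" + number, e.g. NACA and 10 give NACAP10."""
    symbol = f"{parent_symbol}P{number}"
    name = f"{parent_symbol} pseudogene {number}"
    return symbol, name


def family_member_pseudogene(member_symbol: str, member_name: str) -> Tuple[str, str]:
    """Family member symbol + "P", with ", pseudogene" appended to the name."""
    return f"{member_symbol}P", f"{member_name}, pseudogene"
```

For example, `processed_pseudogene("NACA", 10)` returns `("NACAP10", "NACA pseudogene 10")`, and `family_member_pseudogene("DDX12", "DEAD/H-box helicase 12")` returns `("DDX12P", "DEAD/H-box helicase 12, pseudogene")`.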
Non-coding RNA genes
We name non-coding RNA (ncRNA) genes according to their RNA type; please see our recent review (https://www.embopress.org/doi/full/10.15252/embj.2019103777) for a full description.
For small RNAs where an expert resource exists, we follow its naming conventions:
- MicroRNAs (miRNAs)
  - miRBase assigns each microRNA stem‐loop sequence a symbol in the format “mir‐#” and each mature miRNA a symbol in the format “miR‐#”, where # is a unique sequential number that reflects the order of submission to the database. The HGNC then approves a gene symbol for human miRNA genes in the format MIR#; for example, MIR17 represents the miRNA gene, mir‐17 represents the stem‐loop, and miR‐17 represents the mature miRNA.
- Transfer RNAs (tRNAs)
  - The genomic tRNA database (GtRNAdb, http://gtrnadb.ucsc.edu/) assigns a unique ID to each tRNA gene in the format tRNA‐[three letter amino acid code]‐[anticodon]‐[GtRNAdb gene identifier], e.g. tRNA‐Ala‐AGC‐1‐1. The HGNC assigns a slightly condensed but equivalent tRNA gene symbol in the format TR[one letter amino acid code]‐[anticodon][GtRNAdb gene identifier], e.g. TRA‐AGC1‐1.
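The two conversions described above can be sketched as follows. The function names are hypothetical, and the sketch makes several assumptions: only the 20 standard amino acid codes are mapped, only plain numeric stem‐loop names such as mir‐17 are handled, and Unicode hyphens (U+2010) are normalized to ASCII hyphens before parsing.

```python
import re

# Three-letter to one-letter codes for the 20 standard amino acids only.
_ONE_LETTER = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def _ascii_hyphens(text: str) -> str:
    """Replace Unicode hyphens (U+2010) with ASCII hyphens."""
    return text.replace("\u2010", "-")


def trna_gene_symbol(gtrnadb_id: str) -> str:
    """Condense e.g. tRNA-Ala-AGC-1-1 into TRA-AGC1-1."""
    parts = _ascii_hyphens(gtrnadb_id).split("-")
    if len(parts) != 5 or parts[0] != "tRNA" or parts[1] not in _ONE_LETTER:
        raise ValueError(f"unsupported GtRNAdb ID: {gtrnadb_id!r}")
    _, amino_acid, anticodon, first_number, second_number = parts
    return f"TR{_ONE_LETTER[amino_acid]}-{anticodon}{first_number}-{second_number}"


def mirna_gene_symbol(stem_loop: str) -> str:
    """Convert a stem-loop symbol such as mir-17 into the gene symbol MIR17."""
    match = re.fullmatch(r"mir-([0-9]+)", _ascii_hyphens(stem_loop))
    if match is None:
        raise ValueError(f"unsupported stem-loop symbol: {stem_loop!r}")
    return f"MIR{match.group(1)}"
```

The match on `mir-` is case sensitive on purpose, so the mature form `miR-17` is rejected rather than silently mapped to a gene symbol.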
Other classes of small ncRNAs are named in collaboration with specialist advisors. Major classes of small ncRNA include:
- Small nuclear RNAs
  - Named with the root symbol “RNU” for “RNA, U# small nuclear”
- Small nucleolar RNAs
  - Named with root symbols SNORD# for “small nucleolar RNA, C/D box” genes; SNORA# for “small nucleolar RNA, H/ACA box” genes; and SCARNA# for “small Cajal body‐specific RNA” genes
- Ribosomal RNAs
  - Named with the root symbols RNA45S, RNA28S, RNA18S, RNA5S, RNA5-8S
Long non-coding RNAs (lncRNAs) are preferentially given unique symbols based on published function, akin to protein-coding genes. LncRNA genes that have been annotated by the RefSeq and GENCODE projects, but that lack suitable published information on which to base a symbol, are named in the following systematic way:
- LncRNAs that are intergenic with respect to protein coding genes are assigned the root symbol LINC followed by a 5‐digit number, e.g. LINC01018
- LncRNAs that are antisense to the genomic span of a protein coding gene are assigned the symbol format [protein coding gene symbol]‐AS# e.g. FAS-AS1
- LncRNAs that are divergent to (share a bidirectional promoter with) a protein coding gene are assigned the symbol format [protein coding gene symbol]‐DT e.g. ABCF1-DT
- LncRNAs that are contained within an intron of a protein coding gene on the same strand are assigned the symbol format [protein coding gene symbol]‐IT# e.g. AOAH-IT1
- LncRNAs that overlap a protein coding gene on the same strand are assigned the symbol format [protein coding gene symbol]‐OT# e.g. C5-OT1
- LncRNAs that contain microRNA or snoRNA genes within introns or exons are named as host genes e.g. MIR17HG, SNHG7
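The systematic formats in this list can be recognized by pattern. The sketch below is a heuristic with hypothetical names and category labels: it reports which format a symbol resembles and, where applicable, the protein coding gene symbol it is built on. It cannot confirm that a symbol really is a lncRNA, and the host gene pattern is an assumption generalized from the two examples given.

```python
import re
from typing import Optional, Tuple

_SUFFIX_CATEGORY = {"AS": "antisense", "IT": "intronic", "OT": "overlapping"}
_INTERGENIC = re.compile(r"LINC[0-9]{5}")
_NUMBERED = re.compile(r"([A-Z0-9-]+)-(AS|IT|OT)([0-9]+)")
_DIVERGENT = re.compile(r"([A-Z0-9-]+)-DT")
# Assumption: host genes look like MIR...HG or SNHG#, as in MIR17HG and SNHG7.
_HOST = re.compile(r"MIR[0-9A-Z]+HG|SNHG[0-9]+")


def classify_lncrna_symbol(symbol: str) -> Tuple[str, Optional[str]]:
    """Return (category, related protein coding gene symbol or None)."""
    symbol = symbol.replace("\u2010", "-")  # accept Unicode hyphens too
    if _INTERGENIC.fullmatch(symbol):
        return "intergenic", None
    match = _NUMBERED.fullmatch(symbol)
    if match:
        return _SUFFIX_CATEGORY[match.group(2)], match.group(1)
    match = _DIVERGENT.fullmatch(symbol)
    if match:
        return "divergent", match.group(1)
    if _HOST.fullmatch(symbol):
        return "host gene", None
    return "unrecognised", None
```

For example, `classify_lncrna_symbol("FAS-AS1")` returns `("antisense", "FAS")` and `classify_lncrna_symbol("LINC01018")` returns `("intergenic", None)`.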
Readthrough transcripts
Readthrough transcripts are normally produced from adjacent loci and include coding and/or non-coding parts of two (or more) genes. The HGNC only names readthrough transcripts that are consistently annotated by both the RefSeq annotators at NCBI and the GENCODE annotators at Ensembl. These transcripts have the locus type “readthrough transcript” and are symbolized using the two (or more) symbols from the parent genes, separated by a hyphen, e.g. ZNF511-PRAP1, and the name “[symbol] readthrough”, e.g. “ZNF511-PRAP1 readthrough”. The name may also include additional information about the potential coding status of the transcript, such as “(NMD candidate)”.
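The readthrough format can be sketched as a join over the parent symbols. The function name is hypothetical, and appending the optional note in parentheses at the end of the name is an assumption about placement.

```python
from typing import Optional, Sequence, Tuple


def readthrough_nomenclature(
    parent_symbols: Sequence[str], note: Optional[str] = None
) -> Tuple[str, str]:
    """Compose the symbol and name for a readthrough transcript."""
    if len(parent_symbols) < 2:
        raise ValueError("a readthrough needs at least two parent genes")
    symbol = "-".join(parent_symbols)
    name = f"{symbol} readthrough"
    if note:
        # Assumption: extra coding-status information goes last, in parentheses.
        name = f"{name} ({note})"
    return symbol, name
```

For example, `readthrough_nomenclature(["ZNF511", "PRAP1"])` returns `("ZNF511-PRAP1", "ZNF511-PRAP1 readthrough")`.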
Genes only found within subsets of the population
Historically, the HGNC has only approved symbols for genes that are on the human reference genome. Rare exceptions have been made when requested by particular communities with dedicated nomenclature committees, such as the HLA community. Future naming of structural variants will be restricted to those on alternate loci that have been incorporated into the human reference genome by the Genome Reference Consortium (GRC). The underscore character is reserved for genes annotated on alternate reference loci, e.g. C4B_2 is a second copy of C4B on a 6p21.3 alternate reference locus.
Note: the HGNC no longer names phenotypes (please contact OMIM) or genomic regions, nor do we name transposable-element insertions in the human genome. For products of gene translocations or fusions, we recommend the format SYMBOL1/SYMBOL2, to avoid confusion with the SYMBOL1-SYMBOL2 format we approve for readthrough transcripts. Sequence variant nomenclature is the remit of the HGVS. For protein nomenclature, please see the International Protein Nomenclature Guidelines, which were written with the involvement of the HGNC. In agreement with these guidelines, we recommend that “protein and gene symbols should use the same abbreviation”, with proteins using non-italicised symbols to differentiate them from genes.
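Because gene symbols never contain a slash, the recommended fusion format can be written and split unambiguously. The sketch below uses hypothetical function names, and BCR and ABL1 appear only as an illustrative pair.

```python
from typing import Tuple


def fusion_label(first_symbol: str, second_symbol: str) -> str:
    """Write a fusion or translocation product as SYMBOL1/SYMBOL2."""
    return f"{first_symbol}/{second_symbol}"


def split_fusion_label(label: str) -> Tuple[str, str]:
    """Recover the two gene symbols from a SYMBOL1/SYMBOL2 label."""
    parts = label.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"not a SYMBOL1/SYMBOL2 label: {label!r}")
    return parts[0], parts[1]
```

For example, `fusion_label("BCR", "ABL1")` returns `"BCR/ABL1"`. A hyphenated label could not be split as safely, since hyphens also occur inside approved symbols such as FAS-AS1.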
Naming orthologs across species
We recommend that orthologous genes across vertebrate (and where appropriate, non-vertebrate) species should have the same gene symbol. To distinguish the species of origin for homologous genes with the same gene symbol, we recommend citing the NCBI taxonomy ID, as well as the species name or the GenBank common name, e.g. Taxonomy ID: 9598 and either Pan troglodytes or chimpanzee.
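One way to keep the species of origin attached to a shared symbol is to carry the taxonomy ID and names alongside it. The sketch below is illustrative only: the class, the function and the citation layout are assumptions, and FEM1A is used purely as a sample symbol.

```python
from typing import NamedTuple


class Species(NamedTuple):
    taxonomy_id: int      # NCBI taxonomy ID
    scientific_name: str  # species name
    common_name: str      # GenBank common name


CHIMPANZEE = Species(9598, "Pan troglodytes", "chimpanzee")


def cite_gene(symbol: str, species: Species, use_common_name: bool = False) -> str:
    """Format a symbol with its taxonomy ID and one species label."""
    label = species.common_name if use_common_name else species.scientific_name
    return f"{symbol} (Taxonomy ID: {species.taxonomy_id}, {label})"
```

For example, `cite_gene("FEM1A", CHIMPANZEE)` returns `"FEM1A (Taxonomy ID: 9598, Pan troglodytes)"`.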
The Vertebrate Gene Nomenclature Committee
The Vertebrate Gene Nomenclature Committee (VGNC, https://vertebrate.genenames.org/) is an extension of the HGNC responsible for assigning standardized nomenclature to genes in vertebrate species that currently lack their own nomenclature committee. The VGNC coordinates with the five existing vertebrate nomenclature committees, MGNC (mouse), RGNC (rat), CGNC (chicken), XNC (Xenopus frog) and ZNC (zebrafish), to ensure vertebrate genes are named in line with their human homologs.
Vertebrate orthologs of human C#orf# genes are assigned the human symbol with the other species' chromosome number as a prefix and an H denoting human. For example, as the ortholog of human C1orf100 is on cow chromosome 16, the cow symbol is C16H1orf100 with the corresponding gene name “chromosome 16 C1orf100 homolog”.
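The C#orf# ortholog rule can be sketched as a small transformation of the human symbol. The function name is hypothetical, and only numbered human chromosomes are parsed; the other species' chromosome is passed as a string so that non-numeric chromosome names could be supplied.

```python
import re
from typing import Tuple

_HUMAN_ORF = re.compile(r"C([0-9]+)orf([0-9]+)")


def ortholog_orf_nomenclature(
    human_symbol: str, species_chromosome: str
) -> Tuple[str, str]:
    """Build the ortholog symbol and name for a human C#orf# gene."""
    match = _HUMAN_ORF.fullmatch(human_symbol)
    if match is None:
        raise ValueError(f"not a human C#orf# symbol: {human_symbol!r}")
    human_chromosome, serial = match.groups()
    # Other species' chromosome, then "H" for human, then the human remainder.
    symbol = f"C{species_chromosome}H{human_chromosome}orf{serial}"
    name = f"chromosome {species_chromosome} {human_symbol} homolog"
    return symbol, name
```

For example, `ortholog_orf_nomenclature("C1orf100", "16")` returns `("C16H1orf100", "chromosome 16 C1orf100 homolog")`, matching the cow example above.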
Gene families with a complex evolutionary history should ideally be named with the help of an expert in the field, as has already been implemented for the olfactory receptor and cytochrome P450 gene families.
Previous HGNC guidelines
Our previous HGNC guidelines can be found at https://www.genenames.org/about/old-guidelines/.