The HGNC Custom Database Download script serves as a front end for a MySQL database and provides a web-based interface which allows users to select columns of data for output, execute limited SQL queries, and save searches for future reference.
- Select columns to display from the checkboxes at the top of the page (see Field Definitions for more information about what the columns represent).
- Select Status to be displayed
- Approved - these genes have HGNC-approved gene symbols
- Entry and symbol withdrawn - these previously approved genes are no longer thought to exist (entry withdrawn) or have been merged into other entries (symbol withdrawn)
- Select Chromosomes to display data from (if no individual chromosomes are selected all chromosomes are displayed, i.e. 'Select all Chromosomes' is the default setting)
- 'reserved' are symbols we have not publicly associated with a chromosomal location.
- The WHERE field enables you to specify an SQL query (see Pattern Matching and also the MySQL Documentation for more information)
- ORDER BY sets which column is used to order the data (this defaults to Approved Symbol)
- The LIMIT field takes an integer and restricts the number of lines returned by the script to the specified integer
- Output format specifies how the data is displayed
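One way to picture how the form fields above relate to each other is as the parts of a single SELECT statement. The sketch below is only an illustration under stated assumptions: the table name `genes` and the column identifiers are hypothetical, the Status and Chromosome filters are left out, and the real script's internals are not described in this document.

```python
def build_query(columns, where=None, order_by="approved_symbol", limit=None):
    """Assemble a SELECT statement from the form fields (illustrative only)."""
    if not columns:
        raise ValueError("select at least one column")
    # 'genes' is a made-up table name used only for this sketch.
    sql = f"SELECT {', '.join(columns)} FROM genes"
    if where:
        sql += f" WHERE {where}"          # contents of the WHERE field
    sql += f" ORDER BY {order_by}"        # ORDER BY defaults to the approved symbol
    if limit is not None:
        sql += f" LIMIT {int(limit)}"     # the LIMIT field takes an integer
    return sql


query = build_query(
    ["approved_symbol", "approved_name"],   # hypothetical column identifiers
    where="locus_type = 'pseudogene'",
    limit=15,
)
print(query)
```

Leaving `where` and `limit` unset in this sketch corresponds to submitting the form with those fields empty.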
On pressing "submit" the form is replaced by the script output. If you bookmark the results page, the search is rerun each time you return to it and the new output is displayed. The "Bookmark Title:" field allows you to name any HTML table the script generates (this name will be picked up by your browser if the page is bookmarked).
If you want to change the column order in the HTML or Text outputs, you can do so by directly editing the URL: the order of the 'col' parameters in the URL defines the column order. If a column is required more than once, simply add an extra 'col' parameter.
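Editing the URL by hand works, but the same reordering can be scripted. The following is a minimal sketch using Python's standard `urllib.parse` module; the base URL, the column identifiers (`symbol`, `name`) and the `format` parameter are placeholders invented for the example, not the script's real values.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def reorder_columns(url, new_order):
    """Return the URL with its 'col' parameters replaced by new_order, in order."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Keep every parameter that is not a column selector untouched.
    others = [(key, value) for key, value in pairs if key != "col"]
    # A column may appear more than once: each entry becomes its own 'col'.
    cols = [("col", column) for column in new_order]
    return urlunsplit(parts._replace(query=urlencode(cols + others)))


# Hypothetical bookmarked results URL.
bookmarked = "https://example.org/download?col=symbol&col=name&format=text"
reordered = reorder_columns(bookmarked, ["name", "symbol", "name"])
print(reordered)
```

Here the name column is requested twice, which mirrors adding an extra 'col' parameter by hand.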
The SQL data type is listed in brackets after the field name.
Text output columns are tab delimited.
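Because the text output is tab delimited, it can be read with any tab-aware parser. The sketch below uses Python's `csv` module on a made-up two-line sample. It assumes that the output starts with a header row and that the "CD" marker on a field means its value is a comma-delimited list; both are assumptions for illustration, and the gene shown is not real data.

```python
import csv
import io

# Made-up sample of text output: a header row, then one record.
sample = (
    "Approved Symbol\tApproved Name\tSynonyms\n"
    "GENE1\texample gene 1\tG1, GEN1\n"
)

# QUOTE_NONE keeps each field verbatim, so quote characters inside a
# field are not interpreted by the tab parser.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

for row in rows:
    # Split the (assumed) comma-delimited field into individual values.
    synonyms = [s.strip() for s in row["Synonyms"].split(",") if s.strip()]
    print(row["Approved Symbol"], synonyms)
```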
Approved Symbol (varchar(255)) - The official gene symbol that has been approved by the HGNC and is publicly available. Symbols are approved based on specific HGNC nomenclature guidelines. In the HTML results page this ID links to the HGNC Symbol Report for that gene.
Approved Name (text) - The official gene name that has been approved by the HGNC and is publicly available. Names are approved based on specific HGNC nomenclature guidelines.
- Approved - these genes have HGNC-approved gene symbols
- Entry withdrawn - these previously approved genes are no longer thought to exist
- Symbol withdrawn - a previously approved record that has since been merged into another record
- gene with protein product - protein-coding genes (the protein may be predicted and of unknown function) (SO:0001217)
- RNA, cluster - region containing a cluster of small non-coding RNA genes
- RNA, long non-coding - non-protein coding genes that encode long non-coding RNAs (lncRNAs) (SO:0001877); these are at least 200 nt in length. Subtypes include intergenic (SO:0001463), intronic (SO:0001903) and antisense (SO:0001904).
- RNA, micro - non-protein coding genes that encode microRNAs (miRNAs) (SO:0001265)
- RNA, ribosomal - non-protein coding genes that encode ribosomal RNAs (rRNAs) (SO:0001637)
- RNA, small nuclear - non-protein coding genes that encode small nuclear RNAs (snRNAs) (SO:0001268)
- RNA, small nucleolar - non-protein coding genes that encode small nucleolar RNAs (snoRNAs) containing C/D or H/ACA box domains (SO:0001267)
- RNA, small cytoplasmic - non-protein coding genes that encode small cytoplasmic RNAs (scRNAs) (SO:0001266)
- RNA, transfer - non-protein coding genes that encode transfer RNAs (tRNAs) (SO:0001272)
- RNA, small misc - non-protein coding genes that encode miscellaneous types of small ncRNAs, such as vault (SO:0000404) and Y (SO:0000405) RNA genes
- phenotype only - mapped phenotypes where the causative gene has not been identified (SO:0001500)
- pseudogene - genomic DNA sequences that are similar to protein-coding genes but do not encode a functional protein (SO:0000336)
- RNA, pseudogene - pseudogene of a non-protein coding RNA
- complex locus constituent - transcriptional unit that is part of a named complex locus
- endogenous retrovirus - integrated retroviral elements that are transmitted through the germline (SO:0000100)
- fragile site - a heritable locus on a chromosome that is prone to DNA breakage
- immunoglobulin gene - gene segments that undergo somatic recombination to form heavy or light chain immunoglobulin genes (SO:0000460). Also includes immunoglobulin gene segments with open reading frames that either cannot undergo somatic recombination, or encode a peptide that is not predicted to fold correctly; these are identified by inclusion of the term “non-functional” in the gene name.
- immunoglobulin pseudogene - immunoglobulin gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
- protocadherin - gene segments that constitute the three clustered protocadherins (alpha, beta and gamma)
- readthrough - a naturally occurring transcript containing coding sequence from two or more genes that can also be transcribed individually
- region - extents of genomic sequence that contain one or more genes, also applied to non-gene areas that do not fall into other types
- T cell receptor gene - gene segments that undergo somatic recombination to form either alpha, beta, gamma or delta chain T cell receptor genes (SO:0000460). Also includes T cell receptor gene segments with open reading frames that either cannot undergo somatic recombination, or encode a peptide that is not predicted to fold correctly; these are identified by inclusion of the term “non-functional” in the gene name.
- T cell receptor pseudogene - T cell receptor gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
- transposable element - a segment of repetitive DNA that can move, or retrotranspose, to new sites within the genome (SO:0000101)
- unknown - entries where the locus type is currently unknown
- virus integration site - target sequence for the integration of viral DNA into the genome
Locus Group (varchar(100)) - Groups locus types together into related sets. Below is a list of groups and the locus types within the group:
- protein-coding gene - contains the "gene with protein product" locus type
- non-coding RNA - contains the following locus types:
- RNA, cluster
- RNA, long non-coding
- RNA, micro
- RNA, ribosomal
- RNA, small cytoplasmic
- RNA, small misc
- RNA, small nuclear
- RNA, small nucleolar
- RNA, transfer
- pseudogene - contains the following types:
- immunoglobulin pseudogene
- RNA, pseudogene
- T cell receptor pseudogene
- phenotype - contains the "phenotype only" locus type
- other - contains the following types:
- endogenous retrovirus
- fragile site
- immunoglobulin gene
- T cell receptor gene
- transposable element
- virus integration site
- withdrawn - contains the "withdrawn" locus type only
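When post-processing downloaded data it can be handy to have the grouping above as a lookup table. The sketch below transcribes only the groups and locus types listed above into a Python dictionary and inverts it; locus types that the list does not assign to a group are deliberately left out rather than guessed.

```python
# Locus groups and their locus types, transcribed from the list above.
LOCUS_GROUPS = {
    "protein-coding gene": ["gene with protein product"],
    "non-coding RNA": [
        "RNA, cluster", "RNA, long non-coding", "RNA, micro",
        "RNA, ribosomal", "RNA, small cytoplasmic", "RNA, small misc",
        "RNA, small nuclear", "RNA, small nucleolar", "RNA, transfer",
    ],
    "pseudogene": [
        "immunoglobulin pseudogene", "RNA, pseudogene",
        "T cell receptor pseudogene",
    ],
    "phenotype": ["phenotype only"],
    "other": [
        "endogenous retrovirus", "fragile site", "immunoglobulin gene",
        "T cell receptor gene", "transposable element",
        "virus integration site",
    ],
    "withdrawn": ["withdrawn"],
}

# Invert the mapping: locus type -> locus group.
GROUP_OF = {
    locus_type: group
    for group, locus_types in LOCUS_GROUPS.items()
    for locus_type in locus_types
}

print(GROUP_OF["RNA, micro"])
print(GROUP_OF.get("readthrough", "not assigned in the list above"))
```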
Previous Symbols (text) CD - Symbols previously approved by the HGNC for this gene
Previous Names (text) QCD - Gene names previously approved by the HGNC for this gene
Synonyms (text) CD - Other symbols used to refer to this gene
Name Synonyms (text) QCD - Other names used to refer to this gene
Date Symbol Changed (date) - If applicable, the date the gene symbol was last changed by the HGNC from a previously approved symbol. Many genes receive approved symbols and names which are viewed as temporary (e.g. C2orf#) or are non-ideal when considered in the light of subsequent information. In the case of individual genes a change to the name (and subsequently the symbol) is only made if the original name is seriously misleading.
Accession Numbers (text) CD - Accession numbers for each entry selected by the HGNC
Entrez Gene ID (int) - Entrez Gene at the NCBI provide curated sequence and descriptive information about genetic loci including official nomenclature, synonyms, sequence accessions, phenotypes, EC numbers, MIM numbers, UniGene clusters, homology, map locations, and related web sites. In the HTML results page this ID links to the Entrez Gene page for that gene. Entrez Gene has replaced LocusLink.
CCDS ID (text) - The Consensus CDS (CCDS) project is a collaborative effort to identify a core set of human and mouse protein coding regions that are consistently annotated and of high quality. The long term goal is to support convergence towards a standard set of gene annotations.
VEGA ID (text) - This contains a curated VEGA gene ID
Mouse Genome Database ID (varchar(50)) - MGI identifier. In the HTML results page this ID links to the MGI Report for that gene.
Specialist Database Links (text) CD - This column contains links to specialist databases with a particular interest in that symbol/gene (also see Specialist Database IDs).
Specialist Database IDs (text) CD - The Specialist Database Links column contains HTML links to the database in question, whereas this column contains the database ID only. It is a comma-delimited list with each position dedicated to a particular database:
- miRBase the microRNA database
- HORDE ID Human Olfactory Receptor Data Exploratorium
- CD Human Cell Differentiation Antigens
- Rfam RNA families database of alignments and CMs
- snoRNABase database of human H/ACA and C/D box snoRNAs
- KZNF Gene Catalog Human KZNF Gene Catalog
- Intermediate Filament DB Human Intermediate Filament Database
- IUPHAR Committee on Receptor Nomenclature and Drug Classification (mapped)
- IMGT/GENE-DB the international ImMunoGeneTics information system for immunoglobulins (mapped)
- MEROPS the peptidase database
- COSMIC Catalogue Of Somatic Mutations In Cancer
- Orphanet portal for rare diseases and orphan drugs
- Pseudogene.org database of identified pseudogenes
- piRNABank database of piwi-interacting RNA clusters
- HomeoDB a database of homeobox gene diversity
- Mamit-tRNAdb a compilation of mammalian mitochondrial tRNA genes
- lncRNAdb a database providing comprehensive annotations of eukaryotic long non-coding RNAs (lncRNAs).
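Since each position in the Specialist Database IDs list is dedicated to one database, a value can be unpacked by pairing positions with database names. The sketch below assumes that the positions follow the order of the list above and that a database with no ID for the gene leaves its position empty; neither detail is confirmed here, and the IDs in the example are placeholders, not real identifiers.

```python
# Database names in the order listed above (assumed to be the position order).
SPECIALIST_DATABASES = [
    "miRBase", "HORDE ID", "CD", "Rfam", "snoRNABase", "KZNF Gene Catalog",
    "Intermediate Filament DB", "IUPHAR", "IMGT/GENE-DB", "MEROPS", "COSMIC",
    "Orphanet", "Pseudogene.org", "piRNABank", "HomeoDB", "Mamit-tRNAdb",
    "lncRNAdb",
]


def parse_specialist_ids(value):
    """Map each non-empty position of the comma-delimited value to its database."""
    ids = [part.strip() for part in value.split(",")]
    if len(ids) != len(SPECIALIST_DATABASES):
        raise ValueError(f"expected {len(SPECIALIST_DATABASES)} positions, got {len(ids)}")
    return {db: db_id for db, db_id in zip(SPECIALIST_DATABASES, ids) if db_id}


# Build a placeholder value with only the first and eleventh positions filled.
fields = [""] * len(SPECIALIST_DATABASES)
fields[0] = "ID-A"    # placeholder, not a real miRBase ID
fields[10] = "ID-B"   # placeholder, not a real COSMIC ID
example = ", ".join(fields)

print(parse_specialist_ids(example))
```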
RefSeq IDs (varchar(50)) CD - The Reference Sequence (RefSeq) identifier for that entry, provided by the NCBI. As we do not aim to curate all variants of a gene only one selected RefSeq is displayed per gene report. RefSeq aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products. RefSeq identifiers are designed to provide a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses. In the HTML results page this ID links to the RefSeq page for that entry.
Gene Family Tag (text) CD - Tag used to designate a gene family or group the gene has been assigned to, according to either sequence similarity or information from publications, specialist advisors for that family or other databases. Families/groups may be either structural or functional, therefore a gene may belong to more than one family/group. These tags are used to generate gene family or grouping specific pages at genenames.org and do not necessarily reflect an official nomenclature. Each gene family has an associated gene family tag and gene family description. If a particular gene is a member of more than one gene family, the tags and the descriptions will be shown in the same order.
Gene Family Description (text) CD - Name given to a particular gene family. The gene family description has an associated gene family tag. Gene families are used to group genes according to either sequence similarity or information from publications, specialist advisors for that family or other databases. Families/groups may be either structural or functional, therefore a gene may belong to more than one family/group.
Mapped Field Definitions
Please note that mapped data are derived from external sources and as such are not subject to our strict checking and curation procedures. They should therefore be treated with some caution.
Entrez Gene ID (mapped data) (int) - Entrez Gene at the NCBI provide curated sequence and descriptive information about genetic loci including official nomenclature, synonyms, sequence accessions, phenotypes, EC numbers, MIM numbers, UniGene clusters, homology, map locations, and related web sites. In the HTML results page this ID links to the Entrez Gene page for that gene. Entrez Gene has replaced LocusLink.
OMIM ID (mapped data) (varchar(50)) - Identifier provided by Online Mendelian Inheritance in Man (OMIM) at the NCBI. This database is described as a catalog of human genes and genetic disorders containing textual information and links to MEDLINE and sequence records in the Entrez system, and links to additional related resources at NCBI and elsewhere. In the HTML results page this ID links to the OMIM page for that entry.
RefSeq (mapped data) (varchar(50)) - The Reference Sequence (RefSeq) identifier for that entry, provided by the NCBI. As we do not aim to curate all variants of a gene only one mapped RefSeq is displayed per gene report. RefSeq aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products. RefSeq identifiers are designed to provide a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses. In the HTML results page this ID links to the RefSeq page for that entry.
UniProt ID (mapped data) (varchar(50)) - The UniProt identifier, provided by the EBI. The UniProt Protein Knowledgebase is described as a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and high level of integration with other databases. In the HTML results page this ID links to the UniProt page for that entry.
SQL syntax can be used within the WHERE box to limit the data returned to a particular set. The main operators are = and LIKE. Negative versions of each of these operators can also be obtained (see below).
The general syntax of an SQL pattern matching command is column_name OPERATOR 'pattern'. This specifies that you wish to select entries within column column_name that contain or match in some way the specified pattern. See Field Definitions for a description of the content of each column.
For more information on pattern matching see the MySQL reference pages.
- = is the simplest of the pattern matching operators available
- = can be used to select data in which the column entry is exactly equal to the specified pattern
- A useful example may be to limit the search to pseudogenes only using:
- NOT = or != can be used to give the reverse result to =, e.g. to exclude pseudogenes from the search (note the limit of 15 records, used to increase the search speed):
- LIKE is case insensitive and useful for slightly more complex queries in which a non-exact match is required
- It can be used to select data in which the column entry matches a pattern containing the wildcards % or _:
- % matches any sequence of zero or more characters of any type
- _ matches any single character
- This is useful, for example, to limit the search to genes with approved symbols that begin with a string like OPN, using:
- NOT LIKE gives the reverse result to LIKE
- LIKE pattern matches always cover the entire string. To match a pattern anywhere within a string, the pattern must therefore start and end with a percent sign. For example, to select host genes use:
Terms can be combined with AND/OR
- The whole opsin family (opsins and rhodopsin):
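The operators described in this section can be tried out locally. The sketch below is a toy stand-in, not the HGNC database: it uses Python's built-in `sqlite3` module rather than MySQL, the table and column names (`genes`, `symbol`, `name`, `locus_type`) are hypothetical, and the handful of rows (including the invented `TOYP1` entry) exist only to give the patterns something to match. SQLite's LIKE is case insensitive for ASCII text by default, which is close enough to the behaviour described here for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (symbol TEXT, name TEXT, locus_type TEXT)")
conn.executemany(
    "INSERT INTO genes VALUES (?, ?, ?)",
    [
        # Illustrative rows only; TOYP1 is invented for the example.
        ("OPN1LW", "opsin 1, long wave sensitive", "gene with protein product"),
        ("OPN4", "opsin 4", "gene with protein product"),
        ("RHO", "rhodopsin", "gene with protein product"),
        ("SNHG1", "small nucleolar RNA host gene 1", "RNA, long non-coding"),
        ("TOYP1", "toy pseudogene 1", "pseudogene"),
    ],
)


def search(where, limit=None):
    """Run a WHERE clause against the toy table and return matching symbols."""
    sql = f"SELECT symbol FROM genes WHERE {where} ORDER BY symbol"
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return [row[0] for row in conn.execute(sql)]


print(search("locus_type = 'pseudogene'"))                 # exact match with =
print(search("locus_type != 'pseudogene'", limit=15))      # reverse of =, limited
print(search("symbol LIKE 'opn%'"))                        # symbols beginning with OPN
print(search("symbol LIKE 'OPN_'"))                        # OPN plus exactly one character
print(search("symbol NOT LIKE 'OPN%'"))                    # reverse of LIKE
print(search("name LIKE '%host gene%'"))                   # pattern anywhere in the string
print(search("symbol LIKE 'OPN%' OR symbol = 'RHO'"))      # terms combined with OR
```

Only the WHERE clause passed to `search` corresponds to what would be typed into the WHERE box; the column names to use there are those of the real database, which this sketch does not attempt to reproduce.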
Getting to the mart
- Open a new mart using the link www.genenames.org/biomart/martview
- Click on CHOOSE DATABASE and select HGNC database
Getting records by symbol
- Click on 'Filters' then click on the + next to Filter by symbol
- Enter the desired symbol into the Approved Symbol textarea. The search is case insensitive. You can also enter a comma-delimited list of symbols to fetch multiple entries.
- Click on Attributes and select the desired fields from the list of check boxes.
- Click on the + next to a field name to get a description of the field contents.
- If the field name is plural (e.g. Synonyms, CCDS IDs etc) it contains a comma delimited list of values, otherwise it contains a single value.
- Attributes in the 'Normalized data' section unwrap the comma-delimited list and return it as a list of symbol/value pairs. If more than one normalized attribute is selected, the result is a Cartesian join of the lists.
- Click on Results to get a preview of your output and, if satisfied, use the 'Export all results to' settings to select the format and location of the output file.
- The pull-down menus can accept multiple values and return values with an implicit OR, i.e. selecting the Locus types "gene with no protein product" and "gene with protein product" returns records with either locus type.
- Once selected, the only way to unselect a menu value is to uncheck the checkbox on the left-hand side of the filter
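The effect of selecting 'Normalized data' attributes, as described above, can be pictured with a small sketch. It shows a comma-delimited field unwrapped into symbol/value pairs, and the Cartesian join that results when two such attributes are selected together. The record and the attribute names are invented for illustration and are not taken from the mart.

```python
from itertools import product

# Made-up record with two comma-delimited (plural) fields.
record = {
    "symbol": "GENE1",
    "synonyms": "G1, GEN1",
    "previous_symbols": "OLD1, OLD2",
}


def unwrap(value):
    """Split a comma-delimited field into a list of individual values."""
    return [item.strip() for item in value.split(",") if item.strip()]


# One normalized attribute: a list of symbol/value pairs.
one_attribute = [(record["symbol"], syn) for syn in unwrap(record["synonyms"])]

# Two normalized attributes: every synonym paired with every previous symbol.
two_attributes = [
    (record["symbol"], syn, prev)
    for syn, prev in product(unwrap(record["synonyms"]),
                             unwrap(record["previous_symbols"]))
]

print(one_attribute)
print(len(two_attributes), "rows from the Cartesian join")
```

With two values in each list, the join yields four rows, so selecting several normalized attributes at once can multiply the size of the output quickly.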