Custom downloads help

The HGNC Custom Database Download script serves as a front end for a MySQL database and provides a web-based interface which allows users to select columns of data for output, execute limited SQL queries, and save searches for future reference.

Overview

  1. Select columns to display from the checkboxes at the top of the page (see Field Definitions for more information about what the columns represent).
  2. Select Status to be displayed
    • Approved - these genes have HGNC-approved gene symbols
    • Entry and symbol withdrawn - these previously approved genes are no longer thought to exist (entry withdrawn) or have been merged into other entries (symbol withdrawn)
  3. Select Chromosomes to display data from (if no individual chromosomes are selected all chromosomes are displayed, i.e. 'Select all Chromosomes' is the default setting)
    • 'reserved' are symbols we have not publicly associated with a chromosomal location.
  4. The WHERE field enables you to specify an SQL query (see Pattern Matching and also the Postgres Documentation for more information)
  5. ORDER BY sets which column is used to order the data (this defaults to Approved Symbol)
  6. The LIMIT field takes an integer and restricts the number of lines returned by the script to the specified integer
  7. Output format specifies how the data is displayed
    • "Text" displays the data as a tab delimited text file (Misc IDs are returned as HTML links to indicate which database they came from) Example
    • "Make perl code" generates a short perl program that uses LWP simple to download a text version of the selected data Example

On pressing "submit" the form is replaced by the script output. If you bookmark the results page, the search is rerun every time you return to it and the new output is displayed. The "Bookmark Title:" field allows you to name any HTML table the script generates (this name will be picked up by your browser if the page is bookmarked).

If you want to change the column order in the HTML or Text outputs, you can do so by directly editing the URL: the order of the 'col' parameters in the URL defines the column order. If a column is required more than once, simply add an extra col parameter.
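
The following Python sketch illustrates how repeated 'col' parameters control column order. The base URL and the column keys (`symbol`, `hgnc_id`) are placeholders invented for the example, not the real endpoint or keys; only the 'col' parameter name comes from the text above.

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical endpoint, for illustration only (not the real download URL).
BASE_URL = "https://example.org/cgi-bin/download"


def build_download_url(columns, base_url=BASE_URL):
    """Return a URL with one 'col' parameter per column, in the given order."""
    # A list of pairs (rather than a dict) keeps the order and allows repeats.
    params = [("col", name) for name in columns]
    return base_url + "?" + urlencode(params)


def column_order(url):
    """Read the column order back from the 'col' parameters of a URL."""
    query = url.split("?", 1)[1] if "?" in url else ""
    return [value for key, value in parse_qsl(query) if key == "col"]


# Hypothetical column keys; 'symbol' is requested twice on purpose.
url = build_download_url(["symbol", "hgnc_id", "symbol"])
print(url)
print(column_order(url))
```

Reordering or repeating entries in the list passed to `build_download_url` has the same effect as editing the 'col' parameters by hand.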

Curated Field Definitions

The SQL data type is listed in brackets after the field name.

CD indicates the field can contain multiple values as a comma delimited list.

PD indicates the field can contain multiple values as a pipe (i.e. |) delimited list.

QCD indicates the field can contain multiple values. Each value is enclosed in double quote marks and placed in a comma delimited list.

Text output columns are tab delimited.
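
A minimal Python sketch of how the CD, PD and QCD conventions could be parsed once a Text download has been saved. The function names and the sample values are invented for the example, and it assumes that empty items in a list can be discarded.

```python
import csv


def split_cd(value):
    """Split a comma delimited (CD) field; empty items are discarded."""
    return [item.strip() for item in value.split(",") if item.strip()]


def split_pd(value):
    """Split a pipe delimited (PD) field; empty items are discarded."""
    return [item.strip() for item in value.split("|") if item.strip()]


def split_qcd(value):
    """Split a quoted comma delimited (QCD) field.

    Commas inside a double-quoted value are kept as part of that value.
    """
    if not value.strip():
        return []
    # csv.reader understands double-quoted values that contain commas.
    row = next(csv.reader([value], skipinitialspace=True))
    return [item.strip() for item in row]


def parse_text_row(line):
    """Split one line of Text output into its tab delimited columns."""
    return line.rstrip("\n").split("\t")


# Invented sample values, used only to show the three conventions.
print(split_cd("AAA1, AAA2,AAA3"))
print(split_pd("1|2"))
print(split_qcd('"first name, with comma", "second name"'))
print(parse_text_row("col one\tcol two\tcol three\n"))
```

The QCD case is the reason for using the `csv` module: a plain split on commas would break a quoted name that itself contains a comma.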

HGNC ID (int) - A unique ID provided by the HGNC. In the HTML results page this ID links to the HGNC Symbol Report for that gene.

Approved Symbol (varchar(255)) - The official gene symbol that has been approved by the HGNC and is publicly available. Symbols are approved based on specific HGNC nomenclature guidelines. In the HTML results page this ID links to the HGNC Symbol Report for that gene.

Approved Name (text) - The official gene name that has been approved by the HGNC and is publicly available. Names are approved based on specific HGNC nomenclature guidelines.

Status (varchar(50)) - Indicates whether the gene is classified as:

  • Approved - these genes have HGNC-approved gene symbols
  • Entry withdrawn - these previously approved genes are no longer thought to exist
  • Symbol withdrawn - a previously approved record that has since been merged into another record

Locus Type (varchar(100)) - Specifies the type of locus described by the given entry:

  • complex locus constituent - transcriptional unit that is part of a named complex locus
  • endogenous retrovirus - integrated retroviral elements that are transmitted through the germline (SO:0000100)
  • fragile site - a heritable locus on a chromosome that is prone to DNA breakage
  • gene with protein product - protein-coding genes (the protein may be predicted and of unknown function) (SO:0001217)
  • immunoglobulin gene - gene segments that undergo somatic recombination to form heavy or light chain immunoglobulin genes (SO:0000460)
  • immunoglobulin pseudogene - immunoglobulin gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
  • phenotype only - mapped phenotypes (SO:0001500)
  • protocadherin - gene segments that constitute the three clustered protocadherins (alpha, beta and gamma)
  • pseudogene - genomic DNA sequences that are similar to protein-coding genes but do not encode a functional protein (SO:0000336)
  • readthrough - a naturally occurring transcript containing coding sequence from two or more genes that can also be transcribed individually
  • region - extents of genomic sequence that contain one or more genes, also applied to non-gene areas that do not fall into other types
  • RNA, cluster - region containing a cluster of small non-coding RNA genes
  • RNA, long non-coding - non-protein coding genes that encode long non-coding RNAs (lncRNAs); these are at least 200 nt and are represented by a processed transcript and/or at least 3 ESTs
  • RNA, micro - non-protein coding genes that encode microRNAs (miRNAs) (SO:0001265)
  • RNA, misc - non-protein coding genes that encode miscellaneous types of small ncRNAs
  • RNA, ribosomal - non-protein coding genes that encode ribosomal RNAs (rRNAs) (SO:0001637)
  • RNA, small nuclear - non-protein coding genes that encode small nuclear RNAs (snRNAs) (SO:0001268)
  • RNA, small nucleolar - non-protein coding genes that encode small nucleolar RNAs (snoRNAs) containing C/D or H/ACA box domains (SO:0001267)
  • RNA, small cytoplasmic - non-protein coding genes that encode small cytoplasmic RNAs (scRNAs) (SO:0001266)
  • RNA, transfer - non-protein coding genes that encode transfer RNAs (tRNAs) (SO:0001272)
  • RNA, vault - non-protein coding genes that encode vault RNAs (SO:0000404)
  • RNA, Y - non-protein coding genes that encode Y RNAs (SO:0000405)
  • T cell receptor gene - gene segments that undergo somatic recombination to form either alpha, beta, gamma or delta chain T cell receptor genes (SO:0000460)
  • T cell receptor pseudogene - T cell receptor gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
  • transposable element - a segment of repetitive DNA that can move, or retrotranspose, to new sites within the genome (SO:0000101)
  • unknown - entries where the locus type is currently unknown
  • virus integration site - target sequence for the integration of viral DNA into the genome

Locus Group (varchar(100)) - Groups locus types together into related sets. Below is a list of groups and the locus types within the group:

  • protein-coding gene - contains the "gene with protein product" locus type
  • non-coding RNA - contains the following locus types:
    • RNA, Y
    • RNA, cluster
    • RNA, long non-coding
    • RNA, micro
    • RNA, misc
    • RNA, ribosomal
    • RNA, small cytoplasmic
    • RNA, small nuclear
    • RNA, small nucleolar
    • RNA, transfer
    • RNA, vault
  • phenotype - contains the "phenotype only" locus type
  • pseudogene - contains the following locus types:
    • T cell receptor pseudogene
    • immunoglobulin pseudogene
    • pseudogene
  • other - contains the following types:
    • T cell receptor gene
    • complex locus constituent
    • endogenous retrovirus
    • fragile site
    • immunoglobulin gene
    • protocadherin
    • readthrough
    • region
    • transposable element
    • unknown
    • virus integration site
  • withdrawn - contains the "withdrawn" locus type only
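
The grouping above can be written down as a lookup table, for instance to derive the Locus Group of a row from its Locus Type. This is a Python sketch; the variable and function names are invented for the example, while the group and type names are copied from the list above.

```python
# Locus groups and the locus types they contain, as listed above.
LOCUS_GROUPS = {
    "protein-coding gene": ["gene with protein product"],
    "non-coding RNA": [
        "RNA, Y", "RNA, cluster", "RNA, long non-coding", "RNA, micro",
        "RNA, misc", "RNA, ribosomal", "RNA, small cytoplasmic",
        "RNA, small nuclear", "RNA, small nucleolar", "RNA, transfer",
        "RNA, vault",
    ],
    "phenotype": ["phenotype only"],
    "pseudogene": [
        "T cell receptor pseudogene", "immunoglobulin pseudogene", "pseudogene",
    ],
    "other": [
        "T cell receptor gene", "complex locus constituent",
        "endogenous retrovirus", "fragile site", "immunoglobulin gene",
        "protocadherin", "readthrough", "region", "transposable element",
        "unknown", "virus integration site",
    ],
    "withdrawn": ["withdrawn"],
}

# Inverted index: locus type -> locus group.
LOCUS_TYPE_TO_GROUP = {
    locus_type: group
    for group, locus_types in LOCUS_GROUPS.items()
    for locus_type in locus_types
}


def locus_group(locus_type):
    """Return the group for a locus type, or None if the type is not listed."""
    return LOCUS_TYPE_TO_GROUP.get(locus_type)


print(locus_group("RNA, micro"))
print(locus_group("readthrough"))
```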

Previous Symbols (text) CD - Symbols previously approved by the HGNC for this gene

Previous Names (text) QCD - Gene names previously approved by the HGNC for this gene

Synonyms (text) CD - Other symbols used to refer to this gene

Name Synonyms (text) QCD - Other names used to refer to this gene

Chromosome (varchar(255)) - Indicates the location of the gene or region on the chromosome

Date Approved (date) - Date the gene symbol and name were approved by the HGNC

Date Modified (date) - If applicable, the date the entry was modified by the HGNC

Date Symbol Changed (date) - If applicable, the date the gene symbol was last changed by the HGNC from a previously approved symbol. Many genes receive approved symbols and names which are viewed as temporary (e.g. C2orf#) or are non-ideal when considered in the light of subsequent information. In the case of individual genes, a change to the name (and subsequently the symbol) is only made if the original name is seriously misleading.

Date Name Changed (date) - If applicable, the date the gene name was last changed by the HGNC from a previously approved name

Accession Numbers (text) CD - Accession numbers for each entry selected by the HGNC

Enzyme ID (text) CD - Enzyme entries have Enzyme Commission (EC) numbers associated with them that indicate the hierarchical functional classes to which they belong

Entrez Gene ID (int) - Entrez Gene at the NCBI provides curated sequence and descriptive information about genetic loci including official nomenclature, synonyms, sequence accessions, phenotypes, EC numbers, MIM numbers, UniGene clusters, homology, map locations, and related web sites. In the HTML results page this ID links to the Entrez Gene page for that gene. Entrez Gene has replaced LocusLink.

CCDS ID (text) - The Consensus CDS (CCDS) project is a collaborative effort to identify a core set of human and mouse protein coding regions that are consistently annotated and of high quality. The long term goal is to support convergence towards a standard set of gene annotations.

VEGA ID (text) - This contains a curated VEGA gene ID

Mouse Genome Database ID (varchar(50)) - MGI identifier. In the HTML results page this ID links to the MGI Report for that gene.

Specialist Database Links (text) CD - This column contains links to specialist databases with a particular interest in that symbol/gene (also see Specialist Database IDs).

Ensembl Gene ID (varchar(50)) - This column contains a manually curated Ensembl Gene ID

Specialist Database IDs (text) CD - The Specialist Database Links column contains HTML links to the database in question. This column contains the database ID only. It is a comma delimited list with each position dedicated to a particular database:-

  1. miRBase the microRNA database
  2. HORDE ID Human Olfactory Receptor Data Exploratorium
  3. CD Human Cell Differentiation Antigens
  4. Rfam RNA families database of alignments and CMs
  5. snoRNABase database of human H/ACA and C/D box snoRNAs
  6. KZNF Gene Catalog Human KZNF Gene Catalog
  7. Intermediate Filament DB Human Intermediate Filament Database
  8. IUPHAR Committee on Receptor Nomenclature and Drug Classification (mapped)
  9. IMGT/GENE-DB the international ImMunoGeneTics information system for immunoglobulins (mapped)
  10. MEROPS the peptidase database
  11. COSMIC Catalogue Of Somatic Mutations In Cancer
  12. Orphanet portal for rare diseases and orphan drugs
  13. Pseudogene.org database of identified pseudogenes
  14. piRNABank database of piwi-interacting RNA clusters
  15. HomeoDB a database of homeobox gene diversity
  16. Mamit-tRNAdb a compilation of mammalian mitochondrial tRNA genes
  17. lncRNAdb a database providing comprehensive annotations of eukaryotic long non-coding RNAs (lncRNAs).
  18. BioParadigms SLC tables provides the latest up-to-date information on the SLC families and their members.
Most of these IDs have undergone manual curation; however, a few are mapped from regularly updated files kindly provided by the specialist database. When we add new databases these will be appended to the end of this list.
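
Because each position in this list is dedicated to one database, a value can be turned into a name-to-ID mapping by position. The Python sketch below assumes that a database with no ID for the gene leaves its position empty between the commas; that detail is an assumption, as are the function name and the placeholder IDs.

```python
# Database names in positional order, copied from the list above.
SPECIALIST_DATABASES = [
    "miRBase", "HORDE ID", "CD", "Rfam", "snoRNABase", "KZNF Gene Catalog",
    "Intermediate Filament DB", "IUPHAR", "IMGT/GENE-DB", "MEROPS", "COSMIC",
    "Orphanet", "Pseudogene.org", "piRNABank", "HomeoDB", "Mamit-tRNAdb",
    "lncRNAdb", "BioParadigms SLC tables",
]


def parse_specialist_ids(value):
    """Map each non-empty position of the list to its database name."""
    # Empty positions are kept while splitting so that indexes stay aligned.
    positions = [item.strip() for item in value.split(",")]
    ids = {}
    for index, item in enumerate(positions):
        if not item:
            continue
        if index < len(SPECIALIST_DATABASES):
            name = SPECIALIST_DATABASES[index]
        else:
            # A database appended after this list was written.
            name = f"position {index + 1}"
        ids[name] = item
    return ids


# Placeholder IDs (not real identifiers) in positions 1 and 4.
print(parse_specialist_ids("id-a, , , id-d"))
```

Keeping unknown trailing positions under a generic key means the parser does not fail when new databases are appended to the end of the list.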

Pubmed IDs (text) CD - Identifier that links to published articles relevant to the entry in the NCBI's PubMed database.

RefSeq IDs (varchar(50)) CD - The Reference Sequence (RefSeq) identifier for that entry, provided by the NCBI. As we do not aim to curate all variants of a gene only one selected RefSeq is displayed per gene report. RefSeq aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products. RefSeq identifiers are designed to provide a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses. In the HTML results page this ID links to the RefSeq page for that entry.

Gene Family ID (int) PD - ID used to designate a gene family or group the gene has been assigned to. Each gene family has an associated family ID and family name. If a particular gene is a member of more than one gene family, the IDs and the names will be shown in the same order delimited by a pipe.

Gene Family Name (text) PD - Name given to a gene family or group the gene has been assigned to. Each gene family has an associated family ID and family name. If a particular gene is a member of more than one gene family, the IDs and the names will be shown in the same order delimited by a pipe.
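
Since the Gene Family ID and Gene Family Name columns list their values in the same order, the two can be paired up position by position. A short Python sketch, with an invented function name and invented sample values:

```python
def pair_gene_families(family_ids, family_names):
    """Pair pipe delimited family IDs with their names, position by position."""
    ids = [part.strip() for part in family_ids.split("|")] if family_ids.strip() else []
    names = [part.strip() for part in family_names.split("|")] if family_names.strip() else []
    if len(ids) != len(names):
        raise ValueError("family ID and family name lists differ in length")
    return list(zip(ids, names))


# Invented sample values for a gene that belongs to two families.
print(pair_gene_families("1|2", "Family one|Family two"))
```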

Mapped Field Definitions

Please note that mapped data are derived from external sources and as such are not subject to our strict checking and curation procedures. They should therefore be treated with some caution.

Mouse Genome Database ID (mapped data) (varchar(50)) - MGI identifier. In the HTML results page this ID links to the MGI Report for that gene.

Rat Genome Database ID (mapped data) (varchar(50)) - RGD identifier. In the HTML results page this ID links to the RGD Report for that gene.

Entrez Gene ID (mapped data) (int) - Entrez Gene at the NCBI provides curated sequence and descriptive information about genetic loci including official nomenclature, synonyms, sequence accessions, phenotypes, EC numbers, MIM numbers, UniGene clusters, homology, map locations, and related web sites. In the HTML results page this ID links to the Entrez Gene page for that gene. Entrez Gene has replaced LocusLink.

OMIM ID (mapped data) (varchar(50)) - Identifier provided by Online Mendelian Inheritance in Man (OMIM). This database is described as a catalog of human genes and genetic disorders containing textual information and links to additional related resources. In the HTML results page this ID links to the OMIM page for that entry.

RefSeq (mapped data) (varchar(50)) - The Reference Sequence (RefSeq) identifier for that entry, provided by the NCBI. As we do not aim to curate all variants of a gene only one mapped RefSeq is displayed per gene report. RefSeq aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products. RefSeq identifiers are designed to provide a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses. In the HTML results page this ID links to the RefSeq page for that entry.

UniProt ID (mapped data) (varchar(50)) - The UniProt identifier, provided by the EBI. The UniProt Protein Knowledgebase is described as a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and high level of integration with other databases. In the HTML results page this ID links to the UniProt page for that entry.

Ensembl Gene ID (mapped data) (varchar(50)) - The Ensembl ID is derived from the current build of the Ensembl database and provided by the Ensembl team.

Vega gene ID (mapped data) (varchar(50)) - The Vega gene ID is derived from the current build of the Vega database and provided by the Vega team.

UCSC (mapped data) (varchar(50)) - The UCSC ID is derived from the current build of the UCSC database

Pattern Matching

SQL syntax can be used within the WHERE box to limit the data returned to a particular set. The main operators are =, LIKE, SIMILAR TO and ~. Negative versions of each of these operators can also be obtained (see below).

The general syntax of an SQL pattern matching command is column_name OPERATOR 'pattern'. This specifies that you wish to select entries within column column_name that contain or match in some way the specified pattern. See Field Definitions for a description of the content of each column.

For more information on pattern matching see the MySQL reference pages.

Equals

  • = is the simplest of the pattern matching operators available
  • = can be used to select data in which the column entry is exactly equal to 'pattern'.
  • A useful example may be to limit the search to pseudogenes only using:
  • NOT = or != can be used to give the reverse result to =, e.g. to exclude pseudogenes from the search (note the limit of 15 records to increase the search speed):
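
The behaviour of = and != can be tried out locally. The Python sketch below uses an in-memory SQLite table as a stand-in for the real database; the table name `gene`, the column names `symbol` and `locus_type`, and the rows are all invented for the example and are not the names used by the download form.

```python
import sqlite3

# In-memory stand-in table with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (symbol TEXT, locus_type TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?)",
    [
        ("DEMO1", "pseudogene"),
        ("DEMO2", "gene with protein product"),
        ("DEMO3", "RNA, micro"),
        ("DEMO4P", "pseudogene"),
    ],
)


def select_symbols(where, limit=None):
    """Run a WHERE clause against the demo table and return the symbols."""
    # The clause is pasted in directly to mimic the WHERE box; do not do
    # this with untrusted input in real code.
    sql = f"SELECT symbol FROM gene WHERE {where} ORDER BY symbol"
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return [row[0] for row in conn.execute(sql)]


# Exact match: pseudogenes only.
print(select_symbols("locus_type = 'pseudogene'"))
# Reverse match, capped at 15 rows.
print(select_symbols("locus_type != 'pseudogene'", limit=15))
```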

LIKE

  • LIKE is case insensitive and useful for slightly more complex queries in which a non-exact match is required
  • It can be used to select data in which the column entry matches a pattern containing the wildcards % or _:
    • % matches zero or more characters of any type
    • _ matches any single character
  • This is useful, for example, to limit the search to genes with approved symbols that begin with a string like OPN, using:
  • NOT LIKE gives the reverse result to LIKE
  • LIKE pattern matches always cover the entire string. To match a pattern anywhere within a string, the pattern must therefore start and end with a percent sign. For example, to select host genes use:
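
The wildcard rules can be checked with a small Python sketch against an in-memory SQLite table. SQLite is only a stand-in here: the table, column and rows are invented, and details such as case sensitivity can differ between database engines. In SQLite, LIKE is case insensitive for ASCII letters by default and % may match an empty run of characters.

```python
import sqlite3

# In-memory stand-in table with invented symbols.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (symbol TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?)",
    [("OPN",), ("OPN1",), ("OPN12",), ("XOPN1",)],
)


def symbols_like(pattern, negate=False):
    """Return symbols that match (or, with negate, do not match) a LIKE pattern."""
    operator = "NOT LIKE" if negate else "LIKE"
    sql = f"SELECT symbol FROM gene WHERE symbol {operator} ? ORDER BY symbol"
    return [row[0] for row in conn.execute(sql, (pattern,))]


print(symbols_like("OPN%"))               # symbols beginning with OPN
print(symbols_like("OPN_"))               # OPN followed by exactly one character
print(symbols_like("%OPN%"))              # OPN anywhere in the symbol
print(symbols_like("OPN%", negate=True))  # everything not beginning with OPN
```

The third call shows why a pattern needs a percent sign at both ends to match anywhere within a string: without the leading %, the symbol XOPN1 is not returned.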

Terms can be combined with and/or

Examples:

Biomart Quickstart

Getting to the mart

Getting records by symbol

  • Click on 'Filters' then click on the + next to Filter by symbol
  • Enter the desired symbol into the Approved Symbol textarea. The search is case insensitive; you can also enter a comma delimited list of symbols to fetch multiple entries.
  • Click on Attributes and select the desired fields from the list of check boxes.
    • Click on the + next to a field name to get a description of the field contents.
    • If the field name is plural (e.g. Synonyms, CCDS IDs etc) it contains a comma delimited list of values, otherwise it contains a single value.
    • Attributes in the 'Normalized data' section unwrap the comma delimited list and return it as a list of symbol/value pairs. If more than one normalized attribute is selected it returns a Cartesian join of the lists.
  • Click on Results to get a preview of your output and, if satisfied, use the 'Export all results to' settings to select the format and location of the output file.
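
To make the 'Normalized data' behaviour concrete, the Python sketch below unwraps comma delimited values into symbol/value rows and forms the Cartesian join when two attributes are given. The function name and sample values are invented, and the handling of empty lists (no rows returned) is an assumption of the sketch rather than documented mart behaviour.

```python
from itertools import product


def normalize(symbol, *fields):
    """Unwrap comma delimited fields into one row per combination of values."""
    lists = [
        [value.strip() for value in field.split(",") if value.strip()]
        for field in fields
    ]
    # product() yields every combination, i.e. the Cartesian join of the lists.
    return [(symbol, *combo) for combo in product(*lists)]


# One normalized attribute: a list of symbol/value pairs.
print(normalize("DEMO1", "a, b"))
# Two normalized attributes: 2 x 3 = 6 rows.
for row in normalize("DEMO1", "a, b", "x, y, z"):
    print(row)
```

The row count grows multiplicatively with each extra normalized attribute, which is worth keeping in mind before selecting several of them at once.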

Other filters

  • The pull down menus can accept multiple values and return values with an implicit OR, i.e. selecting the Locus types "gene with no protein product" and "gene with protein product" returns records with either locus type.
  • Once selected, the only way to unselect a menu value is to uncheck the checkbox on the left-hand side of the filter.