Custom downloads help

The HGNC Custom Downloads application serves as a front end for a MySQL database and provides a web-based interface which allows users to select columns of data for output, execute limited SQL queries, and save searches for future reference.

Overview

  1. Select columns to display from the checkboxes at the top of the page (see Field Definitions for more information about what the columns represent).
  2. Select Status to be displayed:
    • Approved - these genes have HGNC-approved gene symbols
    • Entry and symbol withdrawn - these previously approved genes are no longer thought to exist (entry withdrawn) or have been merged into other entries (symbol withdrawn)
  3. Select Chromosomes to display data from (if no individual chromosomes are selected all chromosomes are displayed, i.e. 'Select all Chromosomes' is the default setting)
    • 'reserved' are symbols we have not publicly associated with a chromosomal location.
  4. The WHERE field enables you to specify an SQL query (see Pattern Matching and also the MySQL reference pages for more information)
  5. ORDER BY sets which column is used to order the data (this defaults to Approved Symbol)
  6. The LIMIT field takes an integer and restricts the number of lines returned by the script to the specified integer
  7. Output format specifies how the data is displayed
    • "Text" displays the data as a tab delimited text file
    • "Make URL" creates a URL to the results page so that you can copy the URL, saving the query for bookmarks or scripts

If you want to change the column order in the Text output this can be done by clearing all the chosen column checkboxes and then selecting them in the order you would like to see them displayed.
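
As a rough illustration of working with the Text output, the Python sketch below reads tab delimited results into dictionaries keyed by the header row. The sample rows are made up, and the exact header spellings are an assumption; check them against a real download before relying on them.

```python
import csv
import io

# Hypothetical sample of the tab delimited Text output.
# Header names and row values are placeholders, not real HGNC data.
SAMPLE = (
    "HGNC ID\tApproved Symbol\tStatus\n"
    "1\tGENE1\tApproved\n"
    "2\tGENE2\tSymbol withdrawn\n"
)


def read_text_output(text):
    """Parse tab delimited text into a list of dicts keyed by the header row."""
    # QUOTE_NONE keeps double quotes verbatim, since some fields
    # (for example Previous Names) contain quoted values of their own.
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


rows = read_text_output(SAMPLE)
approved = [row["Approved Symbol"] for row in rows if row["Status"] == "Approved"]
print(approved)
```

Because the columns appear in the order they were selected, keying each row by header name rather than by position keeps a script working if the column order changes.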

Curated field definitions

The SQL data type is listed in brackets after the field name and the columns are tab delimited.

HGNC ID (int)

A unique ID provided by the HGNC. In the HTML results page this ID links to the HGNC Symbol Report for that gene.

Approved Symbol (varchar(255))

The official gene symbol that has been approved by the HGNC and is publicly available. Symbols are approved based on specific HGNC nomenclature guidelines. In the HTML results page this symbol links to the HGNC Symbol Report for that gene.

Approved Name (text)

The official gene name that has been approved by the HGNC and is publicly available. Names are approved based on specific HGNC nomenclature guidelines.

Status (varchar(50))

Indicates whether the gene is classified as:

  • Approved - these genes have HGNC-approved gene symbols
  • Entry withdrawn - these previously approved genes are no longer thought to exist
  • Symbol withdrawn - a previously approved record that has since been merged into another record

Locus Type (varchar(100))

Specifies the type of locus described by the given entry:

  • complex locus constituent - transcriptional unit that is part of a named complex locus
  • endogenous retrovirus - integrated retroviral elements that are transmitted through the germline
  • fragile site - a heritable locus on a chromosome that is prone to DNA breakage
  • gene with protein product - protein-coding genes (the protein may be predicted and of unknown function)
  • immunoglobulin gene - gene segments that undergo somatic recombination to form heavy or light chain immunoglobulin genes
  • immunoglobulin pseudogene - immunoglobulin gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
  • phenotype only - mapped phenotypes
  • protocadherin - gene segments that constitute the three clustered protocadherins (alpha, beta and gamma)
  • pseudogene - genomic DNA sequences that are similar to protein-coding genes but do not encode a functional protein
  • readthrough - a naturally occurring transcript containing coding sequence from two or more genes that can also be transcribed individually
  • region - extents of genomic sequence that contain one or more genes, also applied to non-gene areas that do not fall into other types
  • RNA, cluster - region containing a cluster of small non-coding RNA genes
  • RNA, long non-coding - non-protein coding genes that encode long non-coding RNAs (lncRNAs); these are at least 200 nt and are represented by a processed transcript and/or at least 3 ESTs
  • RNA, micro - non-protein coding genes that encode microRNAs (miRNAs)
  • RNA, misc - non-protein coding genes that encode miscellaneous types of small ncRNAs
  • RNA, ribosomal - non-protein coding genes that encode ribosomal RNAs (rRNAs)
  • RNA, small nuclear - non-protein coding genes that encode small nuclear RNAs (snRNAs)
  • RNA, small nucleolar - non-protein coding genes that encode small nucleolar RNAs (snoRNAs) containing C/D or H/ACA box domains
  • RNA, transfer - non-protein coding genes that encode transfer RNAs (tRNAs)
  • RNA, vault - non-protein coding genes that encode large ribonucleoprotein particles in the cytoplasm known as vaults
  • RNA, Y - non-protein coding genes that encode components of the Ro60 ribonucleoprotein particle
  • T cell receptor gene - gene segments that undergo somatic recombination to form either alpha, beta, gamma or delta chain T cell receptor genes
  • T cell receptor pseudogene - T cell receptor gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
  • transposable element - a segment of repetitive DNA that can move, or retrotranspose, to new sites within the genome
  • unknown - entries where the locus type is currently unknown
  • virus integration site - target sequence for the integration of viral DNA into the genome

Locus Group (varchar(100))

Groups locus types together into related sets. Below is a list of groups and the locus types within the group:

  • protein-coding gene - contains the "gene with protein product" locus type
  • non-coding RNA - contains the following locus types:
    • RNA, cluster
    • RNA, long non-coding
    • RNA, micro
    • RNA, misc
    • RNA, ribosomal
    • RNA, small nuclear
    • RNA, small nucleolar
    • RNA, transfer
    • RNA, vault
    • RNA, Y
  • pseudogene - contains the following types:
    • immunoglobulin pseudogene
    • pseudogene
    • T cell receptor pseudogene
  • phenotype - contains the "phenotype only" locus type
  • other - contains the following types:
    • endogenous retrovirus
    • fragile site
    • immunoglobulin gene
    • protocadherin
    • readthrough
    • region
    • T cell receptor gene
    • transposable element
    • unknown
    • virus integration site
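
The grouping above can be written down as a lookup table. The following Python sketch is one possible encoding of the list as given here; the names LOCUS_GROUPS and locus_group are hypothetical and not part of any HGNC tool.

```python
# Locus groups and their member locus types, transcribed from the list above.
LOCUS_GROUPS = {
    "protein-coding gene": ["gene with protein product"],
    "non-coding RNA": [
        "RNA, cluster", "RNA, long non-coding", "RNA, micro", "RNA, misc",
        "RNA, ribosomal", "RNA, small nuclear", "RNA, small nucleolar",
        "RNA, transfer", "RNA, vault", "RNA, Y",
    ],
    "pseudogene": [
        "immunoglobulin pseudogene", "pseudogene", "T cell receptor pseudogene",
    ],
    "phenotype": ["phenotype only"],
    "other": [
        "endogenous retrovirus", "fragile site", "immunoglobulin gene",
        "protocadherin", "readthrough", "region", "T cell receptor gene",
        "transposable element", "unknown", "virus integration site",
    ],
}

# Invert the table so a locus type can be looked up directly.
LOCUS_TYPE_TO_GROUP = {
    locus_type: group
    for group, locus_types in LOCUS_GROUPS.items()
    for locus_type in locus_types
}


def locus_group(locus_type):
    """Return the locus group for a locus type, or None if it is not listed."""
    return LOCUS_TYPE_TO_GROUP.get(locus_type)


print(locus_group("RNA, micro"))
print(locus_group("complex locus constituent"))
```

Note that "complex locus constituent" appears under Locus Type but is not assigned to any group in the list above, so this sketch returns None for it rather than guessing.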

Previous Symbols (text)

Symbols previously approved by the HGNC for this gene. This field can contain multiple values as a comma delimited list.

Previous Names (text)

Gene names previously approved by the HGNC for this gene. This field can contain multiple values. Each value is enclosed in double quote marks and placed in a comma delimited list.
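
Because each name is quoted, a name may itself contain a comma, so splitting on commas alone is not safe. A minimal Python sketch using the standard csv module is shown below; the example value is made up, and whether a space follows each comma in the real output is an assumption that the sketch tolerates either way.

```python
import csv


def split_quoted_list(value):
    """Split a comma delimited list whose items are wrapped in double quotes."""
    if not value.strip():
        return []
    # skipinitialspace lets the parser accept an optional space after each comma.
    return next(csv.reader([value], skipinitialspace=True))


# Hypothetical field value; the first name contains a comma inside the quotes.
example = '"hypothetical name, type 1", "hypothetical name 2"'
print(split_quoted_list(example))
```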

Synonyms (text)

Other symbols used to refer to this gene. This field can contain multiple values as a comma delimited list.

Name Synonyms (text)

Other names used to refer to this gene. This field can contain multiple values as a comma delimited list.

Chromosome (varchar(255))

Indicates the location of the gene or region on the chromosome.

Date Approved (date)

Date the gene symbol and name were approved by the HGNC.

Date Modified (date)

If applicable, the date the entry was modified by the HGNC.

Date Symbol Changed (date)

If applicable, the date the gene symbol was last changed by the HGNC from a previously approved symbol. Many genes receive approved symbols and names which are viewed as temporary (e.g. C2orf#) or are non-ideal when considered in the light of subsequent information. In the case of individual genes a change to the name (and subsequently the symbol) is only made if the original name is seriously misleading.

Date Name Changed (date)

If applicable, the date the gene name was last changed by the HGNC from a previously approved name.

Accession Numbers (text)

Accession numbers for each entry selected by the HGNC. This field can contain multiple values as a comma delimited list.

Enzyme ID (text)

Enzyme entries have Enzyme Commission (EC) numbers associated with them that indicate the hierarchical functional classes to which they belong. This field can contain multiple values as a comma delimited list.

NCBI Gene ID (int)

Gene at the NCBI provides curated sequence and descriptive information about genetic loci including official nomenclature, synonyms, sequence accessions, phenotypes, EC numbers, MIM numbers, UniGene clusters, homology, map locations, and related web sites. In the HTML results page this ID links to the NCBI Gene page for that gene.

CCDS ID (text)

The Consensus CDS (CCDS) project is a collaborative effort to identify a core set of human and mouse protein coding regions that are consistently annotated and of high quality. The long term goal is to support convergence towards a standard set of gene annotations. This field can contain multiple values as a comma delimited list.

VEGA ID (text)

This contains a curated VEGA gene ID.

Mouse Genome Database ID (varchar(50))

MGI identifier. In the HTML results page this ID links to the MGI Report for that gene. This field can contain multiple values as a comma delimited list.

Specialist Database Links (text)

This column contains links to specialist databases with a particular interest in that symbol/gene (also see Specialist Database IDs). This field contains multiple values as a comma delimited list.

Specialist Database IDs (text)

The Specialist Database Links column contains HTML links to the database in question. This column contains the database ID only. It is a comma delimited list with each position dedicated to a particular database:

  1. miRBase the microRNA database
  2. HORDE ID Human Olfactory Receptor Data Exploratorium
  3. CD Human Cell Differentiation Antigens
  4. NA - this position is left blank in the comma delimited list.
  5. snoRNABase database of human H/ACA and C/D box snoRNAs
  6. KZNF Gene Catalog Human KZNF Gene Catalog
  7. Intermediate Filament DB Human Intermediate Filament Database
  8. IUPHAR/BPS Guide to pharmacology Committee on Receptor Nomenclature and Drug Classification (mapped)
  9. IMGT/GENE-DB the international ImMunoGeneTics information system for immunoglobulins (mapped)
  10. MEROPS the peptidase database
  11. COSMIC Catalogue Of Somatic Mutations In Cancer
  12. Orphanet portal for rare diseases and orphan drugs
  13. Pseudogene.org database of identified pseudogenes
  14. piRNABank database of piwi-interacting RNA clusters
  15. HomeoDB a database of homeobox gene diversity
  16. Mamit-tRNAdb a compilation of mammalian mitochondrial tRNA genes
  17. lncRNAdb a database providing comprehensive annotations of eukaryotic long non-coding RNAs (lncRNAs).
  18. BioParadigms SLC tables provides the latest up-to-date information on the SLC families and their members.

Most of these IDs have undergone manual curation; however, a few are mapped from regularly updated files kindly provided by the specialist database. When we add new databases these will be appended to the end of this list. This field contains multiple values as a comma delimited list.
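
Since each position in the list is dedicated to one database, a value can be decoded by pairing positions with database names. The Python sketch below assumes that empty positions are kept as empty strings between commas and that individual IDs contain no commas; the function name and the example IDs are hypothetical.

```python
# Database names in positional order, taken from the list above.
# Position 4 is always blank, so it is represented by None.
SPECIALIST_DATABASES = [
    "miRBase", "HORDE ID", "CD", None, "snoRNABase", "KZNF Gene Catalog",
    "Intermediate Filament DB", "IUPHAR/BPS Guide to pharmacology",
    "IMGT/GENE-DB", "MEROPS", "COSMIC", "Orphanet", "Pseudogene.org",
    "piRNABank", "HomeoDB", "Mamit-tRNAdb", "lncRNAdb",
    "BioParadigms SLC tables",
]


def parse_specialist_ids(value):
    """Map database names to IDs for every non-empty position."""
    parts = [part.strip() for part in value.split(",")]
    result = {}
    # zip stops at the shorter sequence, so positions appended for
    # databases added later are ignored instead of raising an error.
    for name, part in zip(SPECIALIST_DATABASES, parts):
        if name is not None and part:
            result[name] = part
    return result


# Build a hypothetical 18-position value with two placeholder IDs.
fields = [""] * 18
fields[9] = "ID-A"   # position 10: MEROPS
fields[10] = "ID-B"  # position 11: COSMIC
example = ",".join(fields)
print(parse_specialist_ids(example))
```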

Ensembl Gene ID (varchar(50))

This column contains a manually curated Ensembl Gene ID.

Pubmed IDs (text)

Identifier that links to published articles relevant to the entry in the NCBI's PubMed database. This field may contain multiple values as a comma delimited list.

RefSeq IDs (varchar(50))

The Reference Sequence (RefSeq) identifier for that entry, provided by the NCBI. As we do not aim to curate all variants of a gene only one selected RefSeq is displayed per gene report. RefSeq aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products. RefSeq identifiers are designed to provide a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses. In the HTML results page this ID links to the RefSeq page for that entry. This field may contain multiple values as a comma delimited list.

Gene Group ID (int)

ID used to designate a gene group the gene has been assigned to. Each gene group has an associated group ID and group name. If a particular gene is a member of more than one gene group, the IDs and the names will be shown in the same order. This field can contain multiple values as a pipe (i.e. |) delimited list.

Gene Group Name (text)

Name given to a gene group the gene has been assigned to. Each gene group has an associated group ID and group name. If a particular gene is a member of more than one gene group, the IDs and the names will be shown in the same order. This field can contain multiple values as a pipe (i.e. |) delimited list.
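
Because the Gene Group ID and Gene Group Name fields list their values in the same order, the two can be paired position by position. The following Python sketch shows one way to do this; the function name and the example values are hypothetical.

```python
def pair_gene_groups(ids_field, names_field):
    """Pair pipe delimited group IDs with group names in matching order."""
    ids = [item.strip() for item in ids_field.split("|") if item.strip()]
    names = [item.strip() for item in names_field.split("|") if item.strip()]
    if len(ids) != len(names):
        # A mismatch means the two fields cannot be aligned reliably.
        raise ValueError("Gene Group ID and Gene Group Name counts differ")
    return list(zip(ids, names))


# Placeholder values, not real gene group IDs or names.
pairs = pair_gene_groups("101|202", "Hypothetical group A|Hypothetical group B")
print(pairs)
```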

Mapped field definitions

Please note that mapped data are derived from external sources and as such are not subject to our strict checking and curation procedures. They should therefore be treated with some caution.

Mouse Genome Database ID (varchar(50))

MGI identifier. In the HTML results page this ID links to the MGI Report for that gene. This field may contain multiple values as a comma delimited list.

Rat Genome Database ID (varchar(50))

RGD identifier. In the HTML results page this ID links to the RGD Report for that gene. This field may contain multiple values as a comma delimited list.

NCBI Gene ID (int)

Gene at the NCBI provides curated sequence and descriptive information about genetic loci including official nomenclature, synonyms, sequence accessions, phenotypes, EC numbers, MIM numbers, UniGene clusters, homology, map locations, and related web sites. In the HTML results page this ID links to the Entrez Gene page for that gene. Entrez Gene has replaced LocusLink.

Identifier provided by Online Mendelian Inheritance in Man (OMIM). This database is described as a catalog of human genes and genetic disorders containing textual information and links to additional related resources. In the HTML results page this ID links to the OMIM page for that entry. This field may contain multiple values as a comma delimited list.

RefSeq (varchar(50))

The Reference Sequence (RefSeq) identifier for that entry, provided by the NCBI. As we do not aim to curate all variants of a gene only one mapped RefSeq is displayed per gene report. RefSeq aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products. RefSeq identifiers are designed to provide a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses. In the HTML results page this ID links to the RefSeq page for that entry. This field may contain multiple values as a comma delimited list.

UniProt ID (varchar(50))

The UniProt identifier, provided by the EBI. The UniProt Protein Knowledgebase is described as a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and high level of integration with other databases. In the HTML results page this ID links to the UniProt page for that entry. This field may contain multiple values as a comma delimited list.

Ensembl Gene ID (varchar(50))

The Ensembl ID is derived from the current build of the Ensembl database and provided by the Ensembl team.

Vega gene ID (varchar(50))

The Vega gene ID is derived from the current build of the Vega database and provided by the Vega team.

UCSC (varchar(50))

The UCSC ID is derived from the current build of the UCSC database.

Pattern matching

SQL syntax can be used within the WHERE box to limit the data returned to a particular set. The main operators are =, LIKE, SIMILAR TO and ~. Negative versions of each of these operators can also be obtained (see below).

The general syntax of an SQL pattern matching command is column_name OPERATOR 'pattern'. This specifies that you wish to select entries within column column_name that contain or match in some way the specified pattern. See Field Definitions for a description of the content of each column.

For more information on pattern matching see the MySQL reference pages.
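
To try out the column_name OPERATOR 'pattern' form without access to the live database, the Python sketch below runs a few conditions against a small in-memory SQLite table. This is only a stand-in: the table, its column names and its rows are hypothetical, SQLite is not the MySQL database behind Custom Downloads, and details such as case sensitivity of LIKE can differ between the two. In LIKE patterns, % matches any sequence of characters and _ matches a single character.

```python
import sqlite3

# In-memory stand-in table with made-up column names and rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (symbol TEXT, name TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [
        ("ABC1", "hypothetical gene 1", "Approved"),
        ("ABC2", "hypothetical gene 2", "Approved"),
        ("XYZ1", "hypothetical gene 3", "Symbol withdrawn"),
    ],
)


def symbols_where(where_clause):
    """Return symbols matching a WHERE condition, ordered by symbol."""
    # String interpolation is used only to mimic typing into a WHERE box;
    # do not build SQL this way from untrusted input.
    sql = f"SELECT symbol FROM gene WHERE {where_clause} ORDER BY symbol"
    return [row[0] for row in conn.execute(sql)]


exact = symbols_where("symbol = 'ABC1'")             # exact match
prefix = symbols_where("symbol LIKE 'ABC%'")         # starts with ABC
negated = symbols_where("symbol NOT LIKE 'ABC%'")    # negative version
combined = symbols_where("symbol LIKE 'ABC%' AND status = 'Approved'")
print(exact, prefix, negated, combined)
```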

Equals

LIKE

Terms can be combined with and/or

Examples:

The whole opsin group (opsins and rhodopsin):