Custom downloads help

The HGNC Custom Downloads application serves as a front end for a MySQL database and provides a web-based interface which allows users to select columns of data for output, execute limited SQL queries, and save searches for future reference.

Overview

  1. Select columns to display from the checkboxes at the top of the page (see Field Definitions for more information about what the columns represent).
  2. Select Status to be displayed:
    • Approved - these genes have HGNC-approved gene symbols
    • Entry and symbol withdrawn - these previously approved genes are no longer thought to exist (entry withdrawn) or have been merged into other entries (symbol withdrawn)
  3. Select Chromosomes to display data from (if no individual chromosomes are selected all chromosomes are displayed, i.e. 'Select all Chromosomes' is the default setting)
    • 'reserved' are symbols we have not publicly associated with a chromosomal location.
  4. The WHERE field enables you to specify an SQL query (see Pattern Matching and also the MySQL reference pages for more information)
  5. ORDER BY sets which column is used to order the data (this defaults to Approved Symbol)
  6. The LIMIT field takes an integer and restricts the number of lines returned by the script to the specified integer
  7. Output format specifies how the data is displayed
    • "Text" displays the data as a tab delimited text file
    • "Make URL" creates a URL to the results page so that you can copy the URL, saving the query for bookmarks or scripts

If you want to change the column order in the Text output this can be done by clearing all the chosen column checkboxes and then selecting them in the order you would like to see them displayed.
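
One way to picture how the form fields in the steps above fit together is as the parts of a single SELECT statement. The sketch below is only an illustration: the table name gene_table and the function build_select are hypothetical, and the application may assemble its query differently. The column names are the DB names listed in the field definitions.

```python
def build_select(columns, where=None, order_by="gd_app_sym", limit=None):
    """Assemble an SQL string from the choices made on the form.

    columns  - DB names in the order the checkboxes were ticked
    where    - optional text from the WHERE field
    order_by - column used for ordering (Approved Symbol by default)
    limit    - optional integer from the LIMIT field
    """
    if not columns:
        raise ValueError("select at least one column")
    # "gene_table" is a made-up table name used only for this sketch.
    sql = "SELECT " + ", ".join(columns) + " FROM gene_table"
    if where:
        sql += " WHERE " + where
    sql += " ORDER BY " + order_by
    if limit is not None:
        sql += " LIMIT " + str(int(limit))
    return sql


query = build_select(
    ["gd_app_sym", "gd_hgnc_id"],
    where="gd_status = 'Approved'",
    limit=10,
)
print(query)
```

Because the columns are joined in the order given, listing gd_app_sym before gd_hgnc_id mirrors ticking the Approved Symbol checkbox before the HGNC ID checkbox.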

Curated field definitions

The SQL data type is listed in brackets after the field name and the columns are tab delimited.
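
Since the columns are tab delimited, the Text output can be read with a standard delimited-file parser. The following is a minimal sketch, assuming the output starts with a header row that uses the column display names; the sample rows are invented for illustration.

```python
import csv
import io

# Hypothetical sample of the "Text" output. The header row and the values
# are assumptions made for this example, not real HGNC data.
SAMPLE_OUTPUT = (
    "HGNC ID\tApproved Symbol\tStatus\n"
    "1\tGENE1\tApproved\n"
    "2\tGENE2\tEntry withdrawn\n"
)


def read_rows(text):
    """Return one dict per data row, keyed by the header names."""
    # QUOTE_NONE keeps any double quote marks inside a field untouched.
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


rows = read_rows(SAMPLE_OUTPUT)
print(rows[0]["Approved Symbol"])
```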

HGNC ID DB name: gd_hgnc_id (int)

A unique ID provided by the HGNC. In the HTML results page this ID links to the HGNC Symbol Report for that gene.

Approved Symbol DB name: gd_app_sym (varchar(255))

The official gene symbol that has been approved by the HGNC and is publicly available. Symbols are approved based on specific HGNC nomenclature guidelines. In the HTML results page this ID links to the HGNC Symbol Report for that gene.

Approved Name DB name: gd_app_name (text)

The official gene name that has been approved by the HGNC and is publicly available. Names are approved based on specific HGNC nomenclature guidelines.

Status DB name: gd_status (varchar(50))

Indicates whether the gene is classified as:

  • Approved - these genes have HGNC-approved gene symbols
  • Entry withdrawn - these previously approved genes are no longer thought to exist
  • Symbol withdrawn - a previously approved record that has since been merged into another record

Locus Type DB name: gd_locus_type (varchar(100))

Specifies the type of locus described by the given entry:

  • complex locus constituent - transcriptional unit that is part of a named complex locus
  • endogenous retrovirus - integrated retroviral elements that are transmitted through the germline
  • fragile site - a heritable locus on a chromosome that is prone to DNA breakage
  • gene with protein product - protein-coding genes (the protein may be predicted and of unknown function)
  • immunoglobulin gene - gene segments that undergo somatic recombination to form heavy or light chain immunoglobulin genes
  • immunoglobulin pseudogene - immunoglobulin gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
  • phenotype only - mapped phenotypes
  • pseudogene - genomic DNA sequences that are similar to protein-coding genes but do not encode a functional protein
  • readthrough - a naturally occurring transcript containing coding sequence from two or more genes that can also be transcribed individually
  • region - extents of genomic sequence that contain one or more genes, also applied to non-gene areas that do not fall into other types
  • RNA, cluster - region containing a cluster of small non-coding RNA genes
  • RNA, long non-coding - non-protein coding genes that encode long non-coding RNAs (lncRNAs); these are at least 200 nt and are represented by a processed transcript and/or at least 3 ESTs
  • RNA, micro - non-protein coding genes that encode microRNAs (miRNAs)
  • RNA, misc - non-protein coding genes that encode miscellaneous types of small ncRNAs
  • RNA, ribosomal - non-protein coding genes that encode ribosomal RNAs (rRNAs)
  • RNA, small nuclear - non-protein coding genes that encode small nuclear RNAs (snRNAs)
  • RNA, small nucleolar - non-protein coding genes that encode small nucleolar RNAs (snoRNAs) containing C/D or H/ACA box domains
  • RNA, transfer - non-protein coding genes that encode transfer RNAs (tRNAs)
  • RNA, vault - non-protein coding genes that encode large ribonucleoprotein particles in the cytoplasm known as vaults
  • RNA, Y - non-protein coding genes that encode components of the Ro60 ribonucleoprotein particle
  • T cell receptor gene - gene segments that undergo somatic recombination to form either alpha, beta, gamma or delta chain T cell receptor genes
  • T cell receptor pseudogene - T cell receptor gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
  • transposable element - a segment of repetitive DNA that can move, or retrotranspose, to new sites within the genome
  • unknown - entries where the locus type is currently unknown
  • virus integration site - target sequence for the integration of viral DNA into the genome

Locus Group DB name: gd_locus_group (varchar(100))

Groups locus types together into related sets. Below is a list of groups and the locus types within the group:

  • protein-coding gene - contains the "gene with protein product" locus type
  • non-coding RNA - contains the following locus types:
    • RNA, cluster
    • RNA, long non-coding
    • RNA, micro
    • RNA, misc
    • RNA, ribosomal
    • RNA, small nuclear
    • RNA, small nucleolar
    • RNA, transfer
    • RNA, vault
    • RNA, Y
  • pseudogene - contains the following types:
    • immunoglobulin pseudogene
    • pseudogene
    • T cell receptor pseudogene
  • phenotype - contains the "phenotype only" locus type
  • other - contains the following types:
    • endogenous retrovirus
    • fragile site
    • immunoglobulin gene
    • readthrough
    • region
    • T cell receptor gene
    • transposable element
    • unknown
    • virus integration site
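
The grouping above can be written down as a lookup table, which is handy when only the Locus Type column has been downloaded. This is a sketch built from the list above; the names LOCUS_GROUPS and locus_group are ours. Note that "complex locus constituent" is not placed in any group in the list above, so the lookup returns None for it.

```python
# Locus groups and their member locus types, copied from the list above.
LOCUS_GROUPS = {
    "protein-coding gene": ["gene with protein product"],
    "non-coding RNA": [
        "RNA, cluster", "RNA, long non-coding", "RNA, micro", "RNA, misc",
        "RNA, ribosomal", "RNA, small nuclear", "RNA, small nucleolar",
        "RNA, transfer", "RNA, vault", "RNA, Y",
    ],
    "pseudogene": [
        "immunoglobulin pseudogene", "pseudogene",
        "T cell receptor pseudogene",
    ],
    "phenotype": ["phenotype only"],
    "other": [
        "endogenous retrovirus", "fragile site", "immunoglobulin gene",
        "readthrough", "region", "T cell receptor gene",
        "transposable element", "unknown", "virus integration site",
    ],
}

# Invert the table so that a locus type can be looked up directly.
LOCUS_TYPE_TO_GROUP = {
    locus_type: group
    for group, locus_types in LOCUS_GROUPS.items()
    for locus_type in locus_types
}


def locus_group(locus_type):
    """Return the locus group for a locus type, or None if it is not listed."""
    return LOCUS_TYPE_TO_GROUP.get(locus_type)


print(locus_group("RNA, micro"))
```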

Previous Symbols DB name: gd_prev_sym (text)

Symbols previously approved by the HGNC for this gene. This field can contain multiple values as a comma delimited list.

Previous Names DB name: gd_prev_name (text)

Gene names previously approved by the HGNC for this gene. This field can contain multiple values. Each value is enclosed in double quote marks and placed in a comma delimited list.
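
Fields such as Previous Symbols hold a plain comma delimited list, whereas Previous Names wraps each value in double quote marks, so the two need slightly different handling. The sketch below assumes that a comma may be followed by a space and that a quoted name may itself contain a comma; the function names and sample values are made up.

```python
import csv


def split_symbols(field):
    """Split a plain comma delimited field, e.g. Previous Symbols."""
    return [item.strip() for item in field.split(",") if item.strip()]


def split_quoted_names(field):
    """Split a field whose values are each enclosed in double quote marks.

    The csv module keeps a comma that occurs inside a quoted name intact.
    """
    if not field.strip():
        return []
    reader = csv.reader([field], skipinitialspace=True)
    return [item.strip() for item in next(reader)]


print(split_symbols("ABC1, ABC2"))
print(split_quoted_names('"first name, with comma", "second name"'))
```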

Alias symbols DB name: gd_aliases (text)

Other symbols used to refer to this gene. This field can contain multiple values as a comma delimited list.

Alias names DB name: gd_name_aliases (text)

Other names used to refer to this gene. This field can contain multiple values as a comma delimited list.

Chromosome DB name: gd_pub_chrom_map (varchar(255))

Indicates the location of the gene or region on the chromosome.

Date Approved DB name: gd_date2app_or_res (date)

Date the gene symbol and name were approved by the HGNC.

Date Modified DB name: gd_date_mod (date)

If applicable, the date the entry was modified by the HGNC.

Date Symbol Changed DB name: gd_date_sym_change (date)

If applicable, the date the gene symbol was last changed by the HGNC from a previously approved symbol. Many genes receive approved symbols and names which are viewed as temporary (e.g. C2orf#) or are non-ideal when considered in the light of subsequent information. In the case of individual genes a change to the name (and subsequently the symbol) is only made if the original name is seriously misleading.

Date Name Changed DB name: gd_date_name_change (date)

If applicable, the date the gene name was last changed by the HGNC from a previously approved name.

Accession Numbers DB name: gd_pub_acc_ids (text)

Accession numbers for each entry selected by the HGNC. This field can contain multiple values as a comma delimited list.

Enzyme ID DB name: gd_enz_ids (text)

Enzyme entries have Enzyme Commission (EC) numbers associated with them that indicate the hierarchical functional classes to which they belong. This field can contain multiple values as a comma delimited list.

NCBI Gene ID DB name: gd_pub_eg_id (int)

Gene at the NCBI provides curated sequence and descriptive information about genetic loci including official nomenclature, aliases, sequence accessions, phenotypes, EC numbers, MIM numbers, UniGene clusters, homology, map locations, and related web sites. In the HTML results page this ID links to the NCBI Gene page for that gene.

CCDS ID DB name: gd_ccds_ids (text)

The Consensus CDS (CCDS) project is a collaborative effort to identify a core set of human and mouse protein coding regions that are consistently annotated and of high quality. The long term goal is to support convergence towards a standard set of gene annotations. This field can contain multiple values as a comma delimited list.

VEGA ID DB name: gd_vega_ids (text)

This contains a curated VEGA gene ID.

Mouse Genome Database ID DB name: gd_mgd_id (varchar(50))

MGI identifier. In the HTML results page this ID links to the MGI Report for that gene. This field can contain multiple values as a comma delimited list.

Specialist Database Links DB name: gd_other_ids (text)

This column contains links to specialist databases with a particular interest in that symbol/gene (also see Specialist Database IDs). This field contains multiple values as a comma delimited list.

Specialist Database IDs DB name: gd_other_ids_list (text)

The Specialist Database Links column contains HTML links to the database in question. This column contains the database ID only. It is a comma delimited list with each position dedicated to a particular database:

  1. miRBase the microRNA database
  2. HORDE ID Human Olfactory Receptor Data Exploratorium
  3. CD Human Cell Differentiation Antigens
  4. NA - this column is left blank in this comma separated field.
  5. snoRNABase database of human H/ACA and C/D box snoRNAs
  6. Intermediate Filament DB Human Intermediate Filament Database
  7. IUPHAR/BPS Guide to pharmacology Committee on Receptor Nomenclature and Drug Classification (mapped)
  8. IMGT/GENE-DB the international ImMunoGeneTics information system for immunoglobulins (mapped)
  9. MEROPS the peptidase database
  10. COSMIC Catalogue Of Somatic Mutations In Cancer
  11. Orphanet portal for rare diseases and orphan drugs
  12. Pseudogene.org database of identified pseudogenes
  13. piRNABank database of piwi-interacting RNA clusters
  14. HomeoDB a database of homeobox gene diversity
  15. Mamit-tRNAdb a compilation of mammalian mitochondrial tRNA genes
  16. lncRNAdb a database providing comprehensive annotations of eukaryotic long non-coding RNAs (lncRNAs).
  17. BioParadigms SLC tables provides the latest up-to-date information on the SLC families and their members.

Most of these IDs have undergone manual curation; however, a few are mapped from regularly updated files kindly provided by the specialist database. When we add new databases these will be appended to the end of this list. This field contains multiple values as a comma delimited list.
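
Because each position in the list is dedicated to one database, the field can be turned into a mapping from database name to ID. The following is a minimal sketch, assuming that empty positions are left blank between the commas and that the IDs themselves contain no commas; the sample IDs are placeholders.

```python
# Database names in list order, taken from the numbered list above.
# Position 4 is the unused column, so it is represented by None.
SPECIALIST_DATABASES = [
    "miRBase", "HORDE ID", "CD", None, "snoRNABase",
    "Intermediate Filament DB", "IUPHAR/BPS Guide to pharmacology",
    "IMGT/GENE-DB", "MEROPS", "COSMIC", "Orphanet", "Pseudogene.org",
    "piRNABank", "HomeoDB", "Mamit-tRNAdb", "lncRNAdb",
    "BioParadigms SLC tables",
]


def parse_specialist_ids(field):
    """Map each database name to its ID, skipping blank positions."""
    values = [value.strip() for value in field.split(",")]
    result = {}
    # zip stops at the shorter sequence, so positions appended for
    # databases that are not yet in SPECIALIST_DATABASES are ignored.
    for name, value in zip(SPECIALIST_DATABASES, values):
        if name is not None and value:
            result[name] = value
    return result


print(parse_specialist_ids("idA,,,,,,,,idB,,,,,,,,"))
```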

Ensembl Gene ID DB name: gd_pub_ensembl_id (varchar(50))

This column contains a manually curated Ensembl Gene ID.

Pubmed IDs DB name: gd_pubmed_ids (text)

Identifier that links to published articles relevant to the entry in the NCBI's PubMed database. This field may contain multiple values as a comma delimited list.

RefSeq IDs DB name: gd_pub_refseq_ids (varchar(50))

The Reference Sequence (RefSeq) identifier for that entry, provided by the NCBI. As we do not aim to curate all variants of a gene only one selected RefSeq is displayed per gene report. RefSeq aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products. RefSeq identifiers are designed to provide a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses. In the HTML results page this ID links to the RefSeq page for that entry. This field may contain multiple values as a comma delimited list.

Gene Group ID DB name: family.id (int)

ID used to designate a gene group the gene has been assigned to. Each gene group has an associated group ID and group name. If a particular gene is a member of more than one gene group, the IDs and the names will be shown in the same order. This field can contain multiple values as a pipe (i.e. |) delimited list.

Gene Group Name DB name: family.name (text)

Name given to a gene group the gene has been assigned to. Each gene group has an associated group ID and group name. If a particular gene is a member of more than one gene group, the IDs and the names will be shown in the same order. This field can contain multiple values as a pipe (i.e. |) delimited list.
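
Since the Gene Group ID and Gene Group Name fields list their values in the same order, the two can be paired up position by position. A minimal sketch follows; the function name and the sample values are invented.

```python
def pair_gene_groups(ids_field, names_field):
    """Pair each gene group ID with the name at the same position."""
    ids = [int(item) for item in ids_field.split("|") if item.strip()]
    names = [item.strip() for item in names_field.split("|") if item.strip()]
    if len(ids) != len(names):
        raise ValueError("ID and name lists have different lengths")
    return list(zip(ids, names))


print(pair_gene_groups("12|34", "Group A|Group B"))
```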

Mapped field definitions

Please note that mapped data are derived from external sources and as such are not subject to our strict checking and curation procedures. They should therefore be treated with some caution.

Mouse Genome Database ID DB name: md_mgd_id (varchar(50))

MGI identifier. In the HTML results page this ID links to the MGI Report for that gene. This field may contain multiple values as a comma delimited list.

Rat Genome Database ID DB name: md_rgd_id (varchar(50))

RGD identifier. In the HTML results page this ID links to the RGD Report for that gene. This field may contain multiple values as a comma delimited list.

NCBI Gene ID DB name: md_eg_id (int)

Gene at the NCBI provides curated sequence and descriptive information about genetic loci including official nomenclature, aliases, sequence accessions, phenotypes, EC numbers, MIM numbers, UniGene clusters, homology, map locations, and related web sites. In the HTML results page this ID links to the Entrez Gene page for that gene. Entrez Gene has replaced LocusLink.

OMIM ID DB name: md_mim_id (int)

Identifier provided by Online Mendelian Inheritance in Man (OMIM). This database is described as a catalog of human genes and genetic disorders containing textual information and links to additional related resources. In the HTML results page this ID links to the OMIM page for that entry. This field may contain multiple values as a comma delimited list.

RefSeq DB name: md_refseq_id (varchar(50))

The Reference Sequence (RefSeq) identifier for that entry, provided by the NCBI. As we do not aim to curate all variants of a gene only one mapped RefSeq is displayed per gene report. RefSeq aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products. RefSeq identifiers are designed to provide a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses. In the HTML results page this ID links to the RefSeq page for that entry. This field may contain multiple values as a comma delimited list.

UniProt ID DB name: md_prot_id (varchar(50))

The UniProt identifier, provided by the EBI. The UniProt Protein Knowledgebase is described as a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and high level of integration with other databases. In the HTML results page this ID links to the UniProt page for that entry. This field may contain multiple values as a comma delimited list.

Ensembl Gene ID DB name: md_ensembl_id (varchar(50))

The Ensembl ID is derived from the current build of the Ensembl database and provided by the Ensembl team.

Vega gene ID DB name: md_vega_id (varchar(50))

The Vega gene ID is derived from the current build of the Vega database and provided by the Vega team.

UCSC DB name: md_ucsc_id (varchar(50))

The UCSC ID is derived from the current build of the UCSC database.

LNCipedia DB name: md_lncipedia (varchar(15))

LNCipedia is a public database for long non-coding RNA (lncRNA) sequence and annotation.

LNCipedia IDs that are returned via the custom downloads tool are derived from the current build of the LNCipedia database.

GtRNAdb DB name: md_gtrnadb (varchar(20))

GtRNAdb contains tRNA gene predictions made by tRNAscan-SE on complete or nearly complete genomes.

GtRNAdb IDs that are returned via the custom downloads tool are derived from the current build of the GtRNAdb database.

AGR HGNC ID DB name: md_agr (varchar(20))

The primary mission of the Alliance of Genome Resources is to develop and maintain sustainable genome information resources that facilitate the use of diverse model organisms in understanding the genetic and genomic basis of human biology, health and disease.

AGR HGNC IDs that are returned via the custom downloads tool are derived from the current build of the Alliance of Genome Resources database and can be used to link to the AGR's Human gene page.

MANE Select Ensembl transcript ID DB name: mane_select.ensembl_nuc_acc (varchar(20))

Matched Annotation from NCBI and EMBL-EBI (MANE) is a collaboration between the National Center for Biotechnology Information (NCBI) and the European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI). The goal of this project is to provide a minimal set of matching RefSeq and Ensembl transcripts of human protein-coding genes, where the transcripts from a matched pair are identical (5’ UTR, coding region and 3’ UTR), but retain their respective identifiers.

A MANE Select transcript is one high-quality representative transcript per protein-coding gene that is well-supported by experimental data and represents the biology of the gene.

MANE Select Ensembl transcript IDs/accessions that are returned via the custom downloads tool are derived from the current build of the MANE Select resource and can be used to link to the MANE Select transcript within EMBL-EBI's Ensembl where you can retrieve more information about the transcript.

MANE Select RefSeq transcript ID DB name: mane_select.refseq_nuc_acc (varchar(20))

Matched Annotation from NCBI and EMBL-EBI (MANE) is a collaboration between the National Center for Biotechnology Information (NCBI) and the European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI). The goal of this project is to provide a minimal set of matching RefSeq and Ensembl transcripts of human protein-coding genes, where the transcripts from a matched pair are identical (5’ UTR, coding region and 3’ UTR), but retain their respective identifiers.

A MANE Select transcript is one high-quality representative transcript per protein-coding gene that is well-supported by experimental data and represents the biology of the gene.

MANE Select RefSeq transcript IDs/accessions that are returned via the custom downloads tool are derived from the current build of the MANE Select resource and can be used to link to the MANE Select transcript within NCBI's RefSeq where you can retrieve more information about the transcript.

Pattern matching

SQL syntax can be used within the WHERE box to limit the data returned to a particular set. The main operators are =, LIKE, SIMILAR TO and ~. Negative versions of each of these operators can also be obtained (see below).

The general syntax of an SQL pattern matching command is column_name OPERATOR 'pattern'. This specifies that you wish to select entries within column column_name that contain or match in some way the specified pattern. See Field Definitions for a description of the content of each column.

For more information on pattern matching see the MySQL reference pages.
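
The general form column_name OPERATOR 'pattern' can be tried out locally before it is typed into the WHERE box. The sketch below uses Python's built-in sqlite3 module as a stand-in for the real database, so it only covers = and LIKE (SQLite does not provide SIMILAR TO or ~), and its matching rules may differ in detail from the live service. The table name genes and the sample rows are invented; the column names are DB names from the field definitions.

```python
import sqlite3

# In-memory stand-in table; the table name and the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (gd_app_sym TEXT, gd_status TEXT)")
conn.executemany(
    "INSERT INTO genes VALUES (?, ?)",
    [
        ("ABC1", "Approved"),
        ("ABC2", "Approved"),
        ("XYZ1", "Entry withdrawn"),
    ],
)


def select_symbols(where_clause):
    """Return the symbols that satisfy the text given for the WHERE box."""
    # String concatenation is acceptable here only because this is a
    # local demonstration run against throwaway data.
    sql = (
        "SELECT gd_app_sym FROM genes WHERE "
        + where_clause
        + " ORDER BY gd_app_sym"
    )
    return [row[0] for row in conn.execute(sql)]


# = requires the whole value to match.
exact = select_symbols("gd_app_sym = 'ABC1'")
# In LIKE patterns, % matches any run of characters.
prefix = select_symbols("gd_app_sym LIKE 'ABC%'")
# NOT gives the negative version of an operator.
negated = select_symbols("gd_app_sym NOT LIKE 'ABC%'")
# Terms can be combined with AND or OR.
combined = select_symbols("gd_app_sym LIKE 'ABC%' AND gd_status = 'Approved'")

print(exact, prefix, negated, combined)
```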

Equals

LIKE

Terms can be combined with and/or

Examples:

The whole opsin group (opsins and rhodopsin):