HGNC complete set archive

Archive directory structure

The HGNC now archive the complete HGNC dataset file (in both tab-separated and JSON formats) each month and each quarter. You can find these files in the archive section of our FTP site, which holds past complete HGNC datasets and withdrawn files.

archive/
├── monthly/
│   ├── json/
│   │   ├── hgnc_complete_set_2020-07-01.json
│   │   ├── hgnc_complete_set_2020-08-01.json
│   │   ├── hgnc_complete_set_2020-09-01.json
│   │   ├── hgnc_complete_set_2020-10-01.json
│   │   ├── hgnc_complete_set_2020-11-01.json
│   │   ├── hgnc_complete_set_2021-01-01.json
│   │   ├── hgnc_complete_set_2021-02-01.json
│   │   ├── hgnc_complete_set_2021-03-01.json
│   │   ├── hgnc_complete_set_2021-04-01.json
│   │   ├── hgnc_complete_set_2021-05-01.json
│   │   ├── hgnc_complete_set_2021-06-01.json
│   │   ├── withdrawn_2020-09-01.json
│   │   ├── withdrawn_2020-10-01.json
│   │   ├── withdrawn_2020-11-01.json
│   │   ├── withdrawn_2021-01-01.json
│   │   ├── withdrawn_2021-02-01.json
│   │   ├── withdrawn_2021-03-01.json
│   │   ├── withdrawn_2021-04-01.json
│   │   ├── withdrawn_2021-05-01.json
│   │   └── withdrawn_2021-06-01.json
│   └── tsv/
│       ├── hgnc_complete_set_2020-07-01.txt
│       ├── hgnc_complete_set_2020-08-01.txt
│       ├── hgnc_complete_set_2020-09-01.txt
│       ├── hgnc_complete_set_2020-10-01.txt
│       ├── hgnc_complete_set_2020-11-01.txt
│       ├── hgnc_complete_set_2021-01-01.txt
│       ├── hgnc_complete_set_2021-02-01.txt
│       ├── hgnc_complete_set_2021-03-01.txt
│       ├── hgnc_complete_set_2021-04-01.txt
│       ├── hgnc_complete_set_2021-05-01.txt
│       ├── hgnc_complete_set_2021-06-01.txt
│       ├── withdrawn_2020-09-01.txt
│       ├── withdrawn_2020-10-01.txt
│       ├── withdrawn_2020-11-01.txt
│       ├── withdrawn_2021-01-01.txt
│       ├── withdrawn_2021-02-01.txt
│       ├── withdrawn_2021-03-01.txt
│       ├── withdrawn_2021-04-01.txt
│       ├── withdrawn_2021-05-01.txt
│       └── withdrawn_2021-06-01.txt
└── quarterly/
    ├── json/
    │   ├── hgnc_complete_set_2020-07-01.json
    │   ├── hgnc_complete_set_2020-10-01.json
    │   ├── hgnc_complete_set_2021-01-01.json
    │   ├── hgnc_complete_set_2021-04-01.json
    │   ├── withdrawn_2020-10-01.json
    │   ├── withdrawn_2021-01-01.json
    │   └── withdrawn_2021-04-01.json
    └── tsv/
        ├── hgnc_complete_set_2020-07-01.txt
        ├── hgnc_complete_set_2020-10-01.txt
        ├── hgnc_complete_set_2021-01-01.txt
        ├── hgnc_complete_set_2021-04-01.txt
        ├── withdrawn_2020-10-01.txt
        ├── withdrawn_2021-01-01.txt
        └── withdrawn_2021-04-01.txt
  
Figure 1: A snapshot of the directory structure from 2021-06-07.

The root directory of the archive contains two directories, monthly and quarterly. Each of these contains two further directories named after the file format, json and tsv, where json refers to the JavaScript Object Notation format and tsv refers to tab-separated text files (Tab Separated Values). The archived files in your chosen format are found under these file format directories (see fig. 1).
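The layout described above can be sketched as a small path builder. This is an illustrative sketch only: the function name archive_path is hypothetical, and the path is relative to the archive root because the FTP host is not given here. The file naming (dataset, then an ISO date, with .txt used for the tsv directory) is taken from fig. 1.

```python
from datetime import date

# File extension used inside each format directory, as seen in fig. 1.
EXTENSIONS = {"json": "json", "tsv": "txt"}
DATASETS = ("hgnc_complete_set", "withdrawn")
FREQUENCIES = ("monthly", "quarterly")


def archive_path(frequency: str, file_format: str, dataset: str, snapshot: date) -> str:
    """Return the path of an archived file relative to the archive root.

    Hypothetical helper: it only mirrors the directory layout in fig. 1
    and does not check that the file actually exists.
    """
    if frequency not in FREQUENCIES:
        raise ValueError(f"unknown frequency: {frequency!r}")
    if file_format not in EXTENSIONS:
        raise ValueError(f"unknown file format: {file_format!r}")
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    filename = f"{dataset}_{snapshot.isoformat()}.{EXTENSIONS[file_format]}"
    return f"archive/{frequency}/{file_format}/{filename}"


print(archive_path("monthly", "tsv", "withdrawn", date(2021, 6, 1)))
# archive/monthly/tsv/withdrawn_2021-06-01.txt
```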

Archive files

The monthly files are produced on the 1st of every month, while the quarterly files are produced on the 1st of Jan, Apr, Jul and Oct. Monthly files that are over 365 days old are deleted to save disk space. The quarterly files are not currently deleted, so for snapshots older than 365 days please use the quarterly files.
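As a rough illustration of the retention rule, the sketch below picks which directory to look in for a given snapshot date. The function names are hypothetical, and falling back to the most recent quarterly date on or before the requested month is an assumption about what a user would want, not a documented behaviour. It also does not know when the archive started, so very old dates may not exist.

```python
from datetime import date, timedelta
from typing import Tuple

MAX_MONTHLY_AGE = timedelta(days=365)  # monthly files older than this are deleted


def nearest_quarterly(snapshot: date) -> date:
    """Most recent quarterly date (1st of Jan, Apr, Jul or Oct) on or before snapshot."""
    quarter_month = ((snapshot.month - 1) // 3) * 3 + 1
    return date(snapshot.year, quarter_month, 1)


def choose_snapshot(snapshot: date, today: date) -> Tuple[str, date]:
    """Return (directory, date) to use for a requested snapshot.

    Assumption: a monthly file is available while it is at most 365 days
    old; otherwise fall back to the nearest earlier quarterly file.
    """
    if snapshot.day != 1:
        raise ValueError("archive files are produced on the 1st of the month")
    if snapshot > today:
        raise ValueError("snapshot date is in the future")
    if today - snapshot <= MAX_MONTHLY_AGE:
        return ("monthly", snapshot)
    return ("quarterly", nearest_quarterly(snapshot))


print(choose_snapshot(date(2021, 5, 1), today=date(2021, 6, 7)))
print(choose_snapshot(date(2020, 11, 1), today=date(2022, 3, 1)))
```

In the second call the requested month is more than 365 days old, so the sketch falls back to the quarterly file dated 2020-10-01.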

There are two types of data file (leaving the file format aside): hgnc_complete_set and withdrawn. The hgnc_complete_set file contains all approved gene symbol reports found on the GRCh38 reference and the alternative reference loci (see fig. 2 for a list of columns/headings). The withdrawn file contains all gene symbol reports that are no longer approved, because the symbol has either been withdrawn or merged/split into another report (see fig. 3 for a list of columns/headings).

hgnc_id                  = HGNC ID. A unique ID created by the HGNC for every
                           approved symbol. 

symbol                   = The HGNC approved gene symbol. Equates to the
                           "APPROVED SYMBOL" field within the gene symbol
                           report.

name                     = HGNC approved name for the gene. Equates to the
                           "APPROVED NAME" field within the gene symbol report.

locus_group              = A group name for a set of related locus types as
                           defined by the HGNC (e.g. non-coding RNA).

locus_type               = The locus type as defined by the HGNC (e.g. RNA,
                           transfer).

status                   = Status of the symbol report, which can be either
                           "Approved" or "Entry Withdrawn".

location                 = Cytogenetic location of the gene (e.g. 2q34).

location_sortable        = Same as "location" but single digit chromosomes are
                           prefixed with a 0 enabling them to be sorted in
                           correct numerical order (e.g. 02q34).

alias_symbol             = Other symbols used to refer to this gene as seen in
                           the "SYNONYMS" field in the symbol report. 

alias_name               = Other names used to refer to this gene as seen in
                           the "SYNONYMS" field in the gene symbol report.

prev_symbol              = Symbols previously approved by the HGNC for this
                           gene. Equates to the "PREVIOUS SYMBOLS & NAMES" field
                           within the gene symbol report.

prev_name                = Gene names previously approved by the HGNC for this
                           gene. Equates to the "PREVIOUS SYMBOLS & NAMES" field
                           within the gene symbol report.

gene_family              = Name given to a gene family or group the gene has been
                           assigned to. Equates to the "GENE FAMILY" field within
                           the gene symbol report.

gene_family_id           = ID used to designate a gene family or group the gene
                           has been assigned to.

date_approved_reserved   = The date the entry was first approved.

date_symbol_changed      = The date the gene symbol was last changed.

date_name_changed        = The date the gene name was last changed.

date_modified            = Date the entry was last modified.

entrez_id                = Entrez gene ID. Found within the "GENE RESOURCES"
                           section of the gene symbol report.

ensembl_gene_id          = Ensembl gene ID. Found within the "GENE RESOURCES"
                           section of the gene symbol report.

vega_id                  = Vega gene ID. Found within the "GENE RESOURCES"
                           section of the gene symbol report.

ucsc_id                  = UCSC gene ID. Found within the "GENE RESOURCES"
                           section of the gene symbol report.

ena                      = International Nucleotide Sequence Database
                           Collaboration (GenBank, ENA and DDBJ) accession
                           number(s). Found within the "NUCLEOTIDE SEQUENCES"
                           section of the gene symbol report.

refseq_accession         = RefSeq nucleotide accession(s). Found within the
                           "NUCLEOTIDE SEQUENCES" section of the gene symbol
                           report.

ccds_id                  = Consensus CDS ID. Found within the
                           "NUCLEOTIDE SEQUENCES" section of the gene symbol
                           report.

uniprot_ids              = UniProt protein accession. Found within the
                           "PROTEIN RESOURCES" section of the gene symbol
                           report.

pubmed_id                = PubMed and Europe PubMed Central PMID(s).

mgd_id                   = Mouse genome informatics database ID. Found within
                           the "HOMOLOGS" section of the gene symbol report.

rgd_id                   = Rat genome database gene ID. Found within the
                           "HOMOLOGS" section of the gene symbol report.

lsdb                     = The name of the Locus Specific Mutation Database and
                           URL for the gene separated by a | character

cosmic                   = Symbol used within the Catalogue of somatic
                           mutations in cancer for the gene.

omim_id                  = Online Mendelian Inheritance in Man (OMIM) ID

mirbase                  = miRBase ID

homeodb                  = Homeobox Database ID

snornabase               = snoRNABase ID

bioparadigms_slc         = Symbol used to link to the SLC tables database at
                           bioparadigms.org for the gene

orphanet                 = Orphanet ID

pseudogene.org           = Pseudogene.org ID

horde_id                 = Symbol used within HORDE for the gene

merops                   = ID used to link to the MEROPS peptidase database

imgt                     = Symbol used within international ImMunoGeneTics
                           information system

iuphar                   = The objectId used to link to the IUPHAR/BPS Guide to
                           PHARMACOLOGY database. To link to IUPHAR/BPS Guide
                           to PHARMACOLOGY database only use the number
                           (only use 1 from the result objectId:1)

kznf_gene_catalog        = ID used to link to the Human KZNF Gene Catalog

mamit-trnadb             = ID to link to the Mamit-tRNA database

cd                       = Symbol used within the Human Cell Differentiation
                           Molecule database for the gene

lncrnadb                 = lncRNA Database ID

enzyme_id                = ENZYME EC accession number

intermediate_filament_db = ID used to link to the Human Intermediate Filament
                           Database

agr                      = The HGNC ID that the Alliance of Genome Resources
                           (AGR) have linked to their record of the gene. Use
                           the HGNC ID to link to the AGR.

mane_select              = NCBI and Ensembl transcript IDs/accessions
                           including the version number for one high-quality
                           representative transcript per protein-coding gene
                           that is well-supported by experimental data and
                           represents the biology of the gene. The IDs are
                           delimited by |.

Figure 2: Columns/headings within the hgnc_complete_set files.

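To show how these columns might be consumed, here is a minimal sketch that reads a tab-separated hgnc_complete_set file with Python's csv module. The sample row is invented for illustration (the IDs, symbol and transcript names are placeholders, not real records) and carries only a few of the columns. Only the | delimiter for mane_select and the objectId:1 form for iuphar come from the descriptions above; the helper names are hypothetical.

```python
import csv
import io

# Invented sample with a subset of the columns; real files carry every
# column listed in fig. 2.
SAMPLE = (
    "hgnc_id\tsymbol\tlocation\tlocation_sortable\tiuphar\tmane_select\n"
    "HGNC:0\tEXAMPLE1\t2q34\t02q34\tobjectId:1\tTRANSCRIPT_A.1|TRANSCRIPT_B.2\n"
)


def iuphar_number(value: str) -> str:
    """Return only the number from an iuphar value such as 'objectId:1'."""
    return value.partition(":")[2]


def read_complete_set(handle):
    """Yield one dict per row, with mane_select split on '|' into a list."""
    for row in csv.DictReader(handle, delimiter="\t"):
        mane = row.get("mane_select") or ""
        row["mane_select"] = mane.split("|") if mane else []
        yield row


for record in read_complete_set(io.StringIO(SAMPLE)):
    print(record["symbol"], record["location_sortable"], record["mane_select"])
    print(iuphar_number(record["iuphar"]))
```

To read a downloaded file instead of the sample, pass an open file handle (opened with newline="") in place of the io.StringIO object.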
HGNC_ID               = The HGNC ID of the withdrawn record.
STATUS                = Can either be "Entry Withdrawn" or "Merged/Split"
WITHDRAWN_SYMBOL      = The symbol of the withdrawn record.
MERGED_INTO_REPORT(S) = Shows what record(s) replaced the withdrawn record if the
                        status is "Merged/Split". Each replacement has the format
                        HGNC_ID|SYMBOL|STATUS. If the withdrawn record is split,
                        there will be more than one replacement and they will be
                        comma separated.

Figure 3: Columns/headings within the withdrawn files.
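The MERGED_INTO_REPORT(S) format can be unpacked as in the sketch below. The names Replacement and parse_merged_into are hypothetical, and the input string uses invented placeholder records. Whether a space follows each comma in the real files is not stated here, so the sketch tolerates both.

```python
from typing import List, NamedTuple


class Replacement(NamedTuple):
    """One replacement record in the HGNC_ID|SYMBOL|STATUS format."""
    hgnc_id: str
    symbol: str
    status: str


def parse_merged_into(field: str) -> List[Replacement]:
    """Split a MERGED_INTO_REPORT(S) value into its replacement records."""
    replacements = []
    for item in field.split(","):
        item = item.strip()  # tolerate optional whitespace after commas
        if not item:
            continue  # an empty field means there are no replacements
        hgnc_id, symbol, status = item.split("|")
        replacements.append(Replacement(hgnc_id, symbol, status))
    return replacements


# Invented example of a split record with two replacements.
for rep in parse_merged_into("HGNC:1|EXAMPLE1|Approved, HGNC:2|EXAMPLE2|Approved"):
    print(rep.hgnc_id, rep.symbol, rep.status)
```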

Quick links to HGNC full set files.