GENCODE

Releases

GENCODE supports genomics projects that are still attached to GRCh37/hg19 by providing updated human gene annotation on this genome assembly version.

The following GENCODE releases were built on GRCh38, but GRCh37-mapped versions are also available from the links below.


Freeze date * | GENCODE release | Ensembl release | Release date | Genome assembly version | UCSC version | Notes
10.2016 | 26 | 88 | 03.2017 | mapped to GRCh37 | - | re-merge with new Havana annotation, updated Ensembl gene set
03.2016 | 25 | 85, 86, 87 | 07.2016 | mapped to GRCh37 | - | re-merge with new Havana annotation, updated Ensembl gene set
08.2015 | 24 | 83, 84 | 12.2015 | mapped to GRCh37 | 24lift37 | re-merge with new Havana annotation, updated Ensembl gene set
03.2015 | 23 | 81, 82 | 07.2015 | mapped to GRCh37 | - | re-merge with new Havana annotation, updated Ensembl gene set

Mapping algorithm

The gene annotation originally created on the GRCh38 reference chromosomes was mapped to GRCh37 using gencode-backmap, following the instructions provided on its website.

The program takes the current ("source") GENCODE GFF3 or GTF, cross-assembly (UCSC hg38-to-hg19 liftover) genomic alignments, and the GENCODE 19 ("target") annotation files. The mapping algorithm described in the documentation is as follows.

Mapping is done on a per-gene basis using the following steps:

  • Project transcripts of the gene through the alignments, keeping exons chained.
    • If there are multiple mappings, first look for ones that overlap the previous version of the transcript, if it exists. Otherwise, if there is a previous version of the gene, select mappings overlapping the gene. Otherwise, to filter out paralog mappings, pick the mapping whose span is most similar to that of the source.
    • Project features of the transcript, such as CDS and start codons, to the transcript alignment between the genomes. This ensures that features stay in the same location within the transcript.
  • Check all transcripts of the gene for consistency. Reject source gene mappings with transcripts on different chromosomes or strands, or where the genomic length of the gene has changed by more than 50%.
  • If a version of the gene exists in the target and the mapped gene doesn't overlap the target gene, it is also rejected.
    • If a gene did not map or was rejected and a version of the gene with the same biotype exists in the target annotations, use the existing gene.
  • Small, automatic-only genes, or all automatic genes, are optionally not mapped, with the target annotation being passed through instead. This avoids complex mappings of small RNAs imported from other databases (e.g. miRNAs).
  • Target genes with no corresponding mappings that overlap patched regions or regions with GRC incident reports in the target genome may optionally be passed through. This addresses a fair number of problem cases, which were particularly common on GRCh37 chrX.
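
As a rough illustration of the mapping-selection and consistency steps above, the following Python sketch shows one way those rules could be expressed. It is not taken from gencode-backmap: the `Span` class and both function names are hypothetical, the 50% change is assumed to be measured relative to the source gene length, and falling through to the next rule when no candidate overlaps is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    """A genomic interval (0-based, half-open). Hypothetical helper type."""
    chrom: str
    strand: str
    start: int
    end: int

    def length(self) -> int:
        return self.end - self.start

    def overlaps(self, other: "Span") -> bool:
        return (self.chrom == other.chrom and self.strand == other.strand
                and self.start < other.end and other.start < self.end)


def pick_mapping(source: Span, candidates: List[Span],
                 prev_transcript: Optional[Span] = None,
                 prev_gene: Optional[Span] = None) -> Optional[Span]:
    """Choose one mapping among several candidates.

    Preference order: candidates overlapping the previous transcript, then
    candidates overlapping the previous gene, then all candidates. Within the
    chosen pool, take the span whose length is closest to the source
    (the within-pool tie-break is an assumption of this sketch).
    """
    if not candidates:
        return None
    pool = candidates
    for prev in (prev_transcript, prev_gene):
        if prev is not None:
            hits = [c for c in candidates if c.overlaps(prev)]
            if hits:
                pool = hits
                break
    return min(pool, key=lambda c: abs(c.length() - source.length()))


def gene_is_consistent(source_gene_length: int, mapped: List[Span],
                       max_change: float = 0.5) -> bool:
    """Reject genes whose transcripts disagree on chromosome or strand,
    or whose genomic length changed by more than max_change."""
    if not mapped:
        return False
    if len({(t.chrom, t.strand) for t in mapped}) != 1:
        return False
    mapped_length = max(t.end for t in mapped) - min(t.start for t in mapped)
    return abs(mapped_length - source_gene_length) <= max_change * source_gene_length
```

A gene that fails `gene_is_consistent` would correspond to the gene_conflict or gene_size_change cases described under the mapping categories.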

Pairing of source and target genes is somewhat complex because some gene identifiers are not stable between assemblies. If a matching base gene ID (the ID without its version suffix) is not found, an attempt is made to match the genes using the symbolic name.
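
The pairing rule can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used by gencode-backmap: the function names are hypothetical, target genes are assumed to be available as simple (gene ID, gene name) pairs, and the identifiers in the comments are made up.

```python
from typing import Dict, Iterable, Optional, Tuple


def base_gene_id(gene_id: str) -> str:
    """Drop the version suffix, e.g. 'ENSG00000000001.5' -> 'ENSG00000000001'."""
    return gene_id.split(".", 1)[0]


def build_indexes(targets: Iterable[Tuple[str, str]]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Index target genes by base ID and by symbolic name.

    `targets` is an iterable of (gene_id, gene_name) pairs (assumed layout).
    """
    by_id: Dict[str, str] = {}
    by_name: Dict[str, str] = {}
    for gene_id, gene_name in targets:
        by_id[base_gene_id(gene_id)] = gene_id
        by_name[gene_name] = gene_id
    return by_id, by_name


def pair_target_gene(source_id: str, source_name: str,
                     by_id: Dict[str, str],
                     by_name: Dict[str, str]) -> Optional[str]:
    """Return the paired target gene ID, or None if no pairing is found.

    The base gene ID is tried first; the symbolic name is only a fallback.
    """
    hit = by_id.get(base_gene_id(source_id))
    if hit is not None:
        return hit
    return by_name.get(source_name)
```

Trying the identifier before the name keeps the pairing stable for genes whose symbol was renamed between releases, while the name fallback covers genes whose identifier changed.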



Mapping categories

Information on each gene mapping is stored as attributes in the GFF3/GTF files. The attributes and their values are:


attribute name | attribute value
remap_status | Attribute that indicates the status of the mapping. Possible values are:
  • full_contig: Gene or transcript completely mapped to the target genome with all features intact.
  • full_fragment: Gene or transcript completely mapped to the target genome with insertions in some features. These are usually small insertions.
  • partial: Gene or transcript partially mapped to the target genome.
  • deleted: Gene or transcript did not map to the target genome.
  • no_seq_map: The source sequence is not in the assembly alignments. This will occur with alt loci genes if the alignments only contain the primary assembly.
  • gene_conflict: Transcripts in the gene mapped to multiple locations.
  • gene_size_change: Transcripts caused gene length to change by more than 50%. This is to detect mapping to processed pseudogenes and mapping across tandem gene duplications.
  • automatic_small_ncrna_gene: Gene is from a small, automatic (ENSEMBL source) non-coding RNA. Taken from the target annotation.
  • automatic_gene: Gene is from an automatic process (ENSEMBL source). Taken from the target annotation.
  • pseudogene: Pseudogene annotations (excluding polymorphic).
remap_original_id | Original ID attribute of the feature. If a feature is split when mapped, new IDs are created; otherwise the original ID is used.
remap_original_location | Location of the feature in the source genome.
remap_num_mappings | Number of mappings of the feature; only one of them was used.
remap_target_status | Attribute that compares the mapping to the existing target annotations. Possible values are:
  • new: Gene or transcript was not in target annotations.
  • lost: Gene or transcript exists in source and target genome, but the source was not mapped.
  • overlap: Gene or transcript overlaps previous version of annotation on target genome.
  • nonOverlap: Gene or transcript exists in target, but the source mapping is to a different location. These are often mappings to gene family members or pseudogenes.
remap_substituted_missing_target | Target annotation from which this gene annotation was taken, if the source gene couldn't be mapped or the mapping was ignored (e.g. ENSEMBL source). The usual value is "V19" (GENCODE 19).
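
To see how these attributes might be used in practice, the Python sketch below tallies remap_status values from GTF lines. It assumes the standard nine-column, tab-separated GTF layout with `key "value";` pairs in the last column; the function names and the example records (including their identifiers) are made up for illustration and do not come from a GENCODE release.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

# Matches `key "quoted value";` or `key bare_value;` in GTF column 9.
_ATTR_RE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^;\s]+))\s*;')


def parse_gtf_attributes(column9: str) -> Dict[str, List[str]]:
    """Parse the GTF attribute column into {key: [values]}.

    Values are kept in lists because some GTF keys (e.g. tag) can repeat.
    """
    attrs: Dict[str, List[str]] = {}
    for match in _ATTR_RE.finditer(column9):
        key = match.group(1)
        value = match.group(2) if match.group(2) is not None else match.group(3)
        attrs.setdefault(key, []).append(value)
    return attrs


def tally_remap_status(gtf_lines: Iterable[str], feature: str = "gene") -> Counter:
    """Count remap_status values for records of the given feature type."""
    counts: Counter = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != feature:
            continue
        attrs = parse_gtf_attributes(fields[8])
        for status in attrs.get("remap_status", ["(missing)"]):
            counts[status] += 1
    return counts


# Made-up records, for illustration only.
EXAMPLE_LINES = [
    "##description: made-up example\n",
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "ENSG00000000001.5_2"; level 2; remap_status "full_contig"; remap_target_status "overlap";\n',
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "ENSG00000000001.5_2"; remap_status "full_contig"; remap_num_mappings 1;\n',
    'chr2\tHAVANA\tgene\t50\t400\t.\t-\t.\tgene_id "ENSG00000000002.3_1"; remap_status "partial"; remap_target_status "new";\n',
    'chr3\tENSEMBL\tgene\t10\t90\t.\t+\t.\tgene_id "ENSG00000000003.1_1"; remap_status "full_contig";\n',
]

if __name__ == "__main__":
    for status, count in sorted(tally_remap_status(EXAMPLE_LINES).items()):
        print(status, count)
```

To work on a real file, the same function can be given an open file handle instead of `EXAMPLE_LINES`, since it only iterates over lines.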


 
This site is hosted by the Wellcome Trust Sanger Institute.