GENCODE


Frequently asked questions


  1. What is the difference between GENCODE and Ensembl annotation?

    The GENCODE annotation is made by merging the Havana manual gene annotation and the Ensembl automated gene annotation. The GENCODE annotation is the default gene annotation displayed in the Ensembl browser. The GENCODE releases coincide with the Ensembl releases, although we can skip an Ensembl release if there is no update to the annotation with respect to the previous release. In practical terms, the GENCODE annotation is identical to the Ensembl annotation.

  2. What is the difference between GENCODE GTF and Ensembl GTF?

    The gene annotation is the same in both files. The only exception is that the genes which are common to the human chromosome X and Y PAR regions can be found twice in the GENCODE GTF, while they are shown only for chromosome X in the Ensembl file.

    In addition, the GENCODE GTF contains a number of attributes not present in the Ensembl GTF, including annotation remarks, APPRIS tags and other tags highlighting transcripts experimentally validated by the GENCODE project or 3-way-consensus pseudogenes (predicted by Havana, Yale and UCSC). The complete list of tags can be found here.

    Please note that the Ensembl GTF covers the annotation in all sequence regions, whereas GENCODE produces both a similar file and a GTF file with the annotation on the reference chromosomes only.

  3. Which are the reference chromosomes?

    The reference chromosomes are those in the primary genome assemblies, i.e. chromosomes 1 to 22, X and Y in human; chromosomes 1 to 19, X and Y in mouse. The mitochondrial chromosome is also considered part of the reference chromosomes. Some GENCODE files contain annotation on reference chromosomes only, thus excluding other sequence regions such as unlocalized and unplaced scaffolds, assembly patches and alternate loci (haplotypes).
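
    As an illustration, the reference-chromosome rule can be sketched in Python. This is only a sketch: it assumes the sequence names in the file use the "chr" prefix (chr1, chrX, chrM and so on), and the function names are hypothetical helpers, not part of any GENCODE tool.

```python
# Assumption: sequence names carry a "chr" prefix (chr1 ... chrX, chrY, chrM).
# Adjust the naming if your file uses a different convention.

def reference_chromosomes(species):
    """Return the set of reference chromosome names for 'human' or 'mouse'."""
    autosome_count = {"human": 22, "mouse": 19}[species]
    names = {f"chr{i}" for i in range(1, autosome_count + 1)}
    # Sex chromosomes plus the mitochondrial chromosome
    names.update({"chrX", "chrY", "chrM"})
    return names


def is_reference_line(gtf_line, species="human"):
    """True if a GTF/GFF3 feature line lies on a reference chromosome."""
    if gtf_line.startswith("#"):
        return False  # header and comment lines describe no feature
    seqname = gtf_line.split("\t", 1)[0]
    return seqname in reference_chromosomes(species)
```

    Filtering a file then amounts to keeping the lines for which is_reference_line returns True; scaffolds, patches and alternate loci are dropped because their names are not in the set.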

  4. What is the "basic" annotation in the GTF/GFF3?

    The transcripts tagged as "basic" form part of a subset of representative transcripts for each gene. This subset prioritises full-length protein coding transcripts over partial or non-protein coding transcripts within the same gene, and intends to highlight those transcripts that will be useful to the majority of users.
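
    For readers who want to select this subset themselves, a minimal Python sketch follows. It assumes the usual GTF layout of nine tab-separated fields, with the ninth holding attributes written as key "value"; pairs and the tag key repeated once per tag. The helper names are hypothetical.

```python
import re

# One attribute: a key, then a quoted or bare value, then ';' (or end of field)
_ATTRIBUTE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*(?:;|$)')


def parse_attributes(attr_field):
    """Parse the GTF attribute column into a dict mapping key -> list of values."""
    attrs = {}
    for match in _ATTRIBUTE.finditer(attr_field):
        key = match.group(1)
        value = match.group(2) if match.group(2) is not None else match.group(3)
        # Keys such as 'tag' can occur several times, so keep every value
        attrs.setdefault(key, []).append(value)
    return attrs


def has_basic_tag(gtf_line):
    """True if a GTF feature line carries the tag 'basic'."""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return False  # comment line or malformed record
    return "basic" in parse_attributes(fields[8]).get("tag", [])
```

    Values are kept as lists because a single feature may carry several tags; a plain key-to-string dict would silently keep only the last one.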

  5. What do "HAVANA" and "ENSEMBL" mean in the GTF/GFF3?

    The second field in the GTF/GFF3 files shows the annotation source for each feature. "HAVANA" indicates that the feature was manually annotated, although it may also be the product of the merge between Havana manual annotation and Ensembl automated annotation. "ENSEMBL" refers exclusively to annotation provided by the Ensembl gene build.
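
    A quick way to see how a file splits between the two sources is to tally the second field. The Python sketch below assumes tab-separated feature lines and "#"-prefixed header lines; count_sources is a hypothetical helper name.

```python
from collections import Counter


def count_sources(gtf_lines):
    """Count features per annotation source (second field of GTF/GFF3)."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header, comment and empty lines
        fields = line.split("\t")
        if len(fields) > 1:
            counts[fields[1]] += 1
    return counts
```

    The function accepts any iterable of lines, so an open file handle can be passed directly.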

  6. What is the gene/transcript biotype in the GTF/GFF3?

    The biotype is an indicator of biological significance of a gene or transcript. There is a large number of possible biotypes in our annotation files but these can be classified into four broad categories: protein-coding, long non-coding RNAs, pseudogenes and small RNAs. A definition of every biotype can be found here.

  7. What is the gene/transcript status in the GTF/GFF3?

    The status indicates the type of evidence that supports the annotation.

    • KNOWN: Identical to known cDNAs or proteins from the same species and has an entry in species specific model databases: EntrezGene for human, MGI for mouse.
    • NOVEL: Identical or homologous to cDNAs from the same species, or proteins from all species.
    • PUTATIVE: Identical or homologous to spliced ESTs from the same species.
    • KNOWN_BY_PROJECTION: Based on a known orthologue gene in another species.

  8. Why do some gene and transcript ids start with ENSGR or ENSTR in the GTF/GFF3?

    The Ensembl ids, by convention, are made of a species index ("ENS" for human and "ENSMUS" for mouse) followed by a feature type indicator ("G" for gene, "T" for transcript, "E" for exon, "P" for translation) and an 11-digit number.

    The GENCODE GTF/GFF3 files make an exception to this rule in the case of the so-called "pseudoautosomal regions" (PAR) of chromosome Y. The gene annotation in these regions is identical between chromosomes X and Y. Ensembl does not provide separate feature ids for the two chromosomes. The Ensembl GTF file only includes this annotation once, for chromosome X. However, we decided that the GENCODE GTF/GFF3 files would include the annotation in the PAR regions of both chromosomes.

    Since the GTF convention dictates that feature ids have to be unique for different genome regions, we slightly modify the Ensembl feature id by replacing the first zero with an "R". Thus, "ENSG00000182378.10" in chromosome X becomes "ENSGR0000182378.10" in chromosome Y.

    Please note that this applies until release 24. From release 25, the PAR genes and transcripts have "_PAR_Y" appended to their ids.

    This annotation is also labeled using the tag "PAR".
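
    When matching GENCODE ids against Ensembl ids, both PAR conventions can be mapped back to the chromosome X id. The Python sketch below is one way to do it; normalise_par_id is a hypothetical helper, and the version number in the "_PAR_Y" usage example is illustrative only.

```python
import re

# Releases up to 24: the first zero is replaced by "R", e.g. ENSGR0000182378.10
_OLD_PAR_ID = re.compile(r"^(ENS[GT])R(\d{10})(\.\d+)?$")
# Release 25 onwards: "_PAR_Y" is appended to the id
_PAR_SUFFIX = "_PAR_Y"


def normalise_par_id(feature_id):
    """Return (ensembl_id, is_par_y) for a GENCODE gene or transcript id."""
    if feature_id.endswith(_PAR_SUFFIX):
        return feature_id[: -len(_PAR_SUFFIX)], True
    match = _OLD_PAR_ID.match(feature_id)
    if match:
        prefix, digits, version = match.groups()
        # Put back the zero that was replaced by "R"
        return f"{prefix}0{digits}{version or ''}", True
    return feature_id, False
```

    For example, normalise_par_id("ENSGR0000182378.10") gives ("ENSG00000182378.10", True), while an ordinary id is returned unchanged with False.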

  9. What is the difference between the Biodalliance and the Ensembl genome browsers?

    In the Biodalliance browser embedded in our website you can access the annotation of all GENCODE releases, unlike in the Ensembl browser, which solely shows the current one. Thus, it is a very convenient way of comparing the evolution of the annotation for a given genomic region.

    This browser also displays additional tracks such as the PhyloCSF coding potential prediction.

  10. What does level 1, 2 or 3 mean in the GTF/GFF3?

    We supply genome-wide features on three different confidence levels.

    • Level 1 - validated: pseudogene loci that were jointly predicted by the Yale Pseudopipe and UCSC Retrofinder pipelines as well as by Havana manual annotation; other transcripts that were verified experimentally by RT-PCR and sequencing through the GENCODE experimental pipeline.
    • Level 2 - manual annotation: Havana manual annotation (and Ensembl annotation where it is identical to Havana).
    • Level 3 - automated annotation: Ensembl loci where they are different from the Havana annotation or where no Havana annotation can be found.
    Please note that not all transcripts have been tested by the GENCODE experimental pipeline and that level 2/3 transcripts may have been experimentally validated elsewhere.

  11. What does the transcript support level mean in the GTF/GFF3?

    The transcript support level indicates how well supported a transcript model is, based on mRNA and EST alignments supplied by UCSC and Ensembl. More information can be found here. Please note that this transcript support level classification is completely independent of the three-confidence-level classification described above.
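
    Because the two classifications are independent, they have to be read from separate attributes. The Python sketch below assumes they appear in the GTF attribute column as level N; and transcript_support_level "N"; respectively; the function names are hypothetical helpers.

```python
import re

# Labels for the three confidence levels, as described in question 10
LEVEL_MEANING = {1: "validated", 2: "manual annotation", 3: "automated annotation"}

# Anchoring on start-of-field or ';' stops 'level' from matching inside
# the longer key 'transcript_support_level'
_LEVEL = re.compile(r'(?:^|;)\s*level\s+"?([123])"?\s*;')
_TSL = re.compile(r'(?:^|;)\s*transcript_support_level\s+"([^"]*)"\s*;')


def confidence_level(attr_field):
    """Return the confidence level (1, 2 or 3), or None if absent."""
    match = _LEVEL.search(attr_field)
    return int(match.group(1)) if match else None


def transcript_support_level(attr_field):
    """Return the transcript support level as a string, or None if absent."""
    match = _TSL.search(attr_field)
    return match.group(1) if match else None
```

    The support level is returned as a string rather than an integer, so that any non-numeric value in the file is passed through unchanged.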

  12. What are the OTT gene/transcript ids in the GTF/GFF3?

    The 'havana_gene' and 'havana_transcript' attributes indicate the internal gene and transcript stable ids used by Havana and are also the main identifiers in the Vega genome browser. They start with 'OTTHUM' and 'OTTMUS' for human and mouse respectively.

  13. What is the gene name in the GTF/GFF3?

    Gene names are usually HGNC or MGI-approved gene symbols mapped to the GENCODE genes by the Ensembl xref pipeline. Sometimes, when there is no official gene symbol, the Havana clone-based name is used.

 
This site is hosted by the Wellcome Trust Sanger Institute.