Tag description in GENCODE

The following tags can be found in the GENCODE GTF/GFF3 files. Read more about the GTF file format
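
As an illustration of how these tags appear in practice, the sketch below pulls them out of an attribute column. It assumes the usual GENCODE layout, in which GTF repeats the key (tag "basic"; tag "CCDS";) and GFF3 uses a single comma-separated key (tag=basic,CCDS). The helper names and the identifiers in the example strings are placeholders, not real records.

```python
import re

def gtf_tags(attributes: str) -> list[str]:
    """Return every value of the repeated `tag` key in a GTF attribute string."""
    # \b keeps keys that merely end in "tag" (e.g. a hypothetical `foo_tag`) from matching.
    return re.findall(r'\btag "([^"]+)"', attributes)

def gff3_tags(attributes: str) -> list[str]:
    """Return the comma-separated values of the `tag` key in a GFF3 attribute string."""
    for field in attributes.split(";"):
        key, _, value = field.strip().partition("=")
        if key == "tag":
            return value.split(",")
    return []

# Placeholder attribute strings (identifiers are invented for illustration).
gtf_example = ('gene_id "ENSG00000000001.1"; transcript_id "ENST00000000001.1"; '
               'tag "basic"; tag "Ensembl_canonical"; tag "CCDS";')
gff3_example = ("ID=ENST00000000001.1;gene_id=ENSG00000000001.1;"
                "tag=basic,Ensembl_canonical,CCDS")

print(gtf_tags(gtf_example))
print(gff3_tags(gff3_example))
```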

3_nested_supported_extension
3' end extended based on RNA-seq data.
3_standard_supported_extension
3' end extended based on RNA-seq data.
454_RNA_Seq_supported
annotated based on RNA-seq data.
5_nested_supported_extension
5' end extended based on RNA-seq data.
5_standard_supported_extension
5' end extended based on RNA-seq data.
alternative_3_UTR
shares an identical CDS but has an alternative 3' UTR with respect to a reference variant.
alternative_5_UTR
shares an identical CDS but has an alternative 5' UTR with respect to a reference variant.
appris_principal_1
(This flag corresponds to the older flag "appris_principal") Where the transcript is expected to code for the main functional isoform based solely on the core modules in the APPRIS database. The APPRIS core modules map protein structural and functional information and cross-species conservation to the annotated variants.
appris_principal_2
(This flag corresponds to the older flag "appris_candidate_ccds") Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant. If one (but no more than one) of these candidates has a distinct CCDS identifier it is selected as the principal variant for that gene. A CCDS identifier shows that there is consensus between RefSeq and GENCODE/Ensembl for that variant, guaranteeing that the variant has cDNA support.
appris_principal_3
Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants has a distinct CCDS identifier, APPRIS selects the variant with the lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated. Consensus CDSs annotated earlier are likely to have more cDNA evidence. Consecutive CCDS identifiers are not included in this flag, since they will have been annotated in the same release of CCDS. These are distinguished with the next flag.
appris_principal_4
(This flag corresponds to the Ensembl 78 flag "appris_candidate_longest_ccds") Where the APPRIS core modules are unable to choose a clear principal CDS and more than one variant has a distinct (but consecutive) CCDS identifier, APPRIS selects the longest CCDS isoform as the principal variant.
appris_principal_5
(This flag corresponds to the Ensembl 78 flag "appris_candidate_longest_seq") Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant.
appris_alternative_1
Candidate transcript model(s) that are conserved in at least three tested non-primate species.
appris_alternative_2
Candidate transcript model(s) that appear to be conserved in fewer than three tested non-primate species.
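
The numbered APPRIS flags above can be used to pick one transcript per gene. The following is a minimal sketch, not an official procedure: it assumes that a lower number means a more confident call and that every principal flag outranks every alternative flag. The function names and the transcript identifiers are hypothetical.

```python
# Assumed priority order: lower index = preferred (an assumption of this sketch).
APPRIS_ORDER = [
    "appris_principal_1",
    "appris_principal_2",
    "appris_principal_3",
    "appris_principal_4",
    "appris_principal_5",
    "appris_alternative_1",
    "appris_alternative_2",
]
APPRIS_RANK = {tag: i for i, tag in enumerate(APPRIS_ORDER)}

def appris_rank(tags: list[str]) -> int:
    """Rank of the best APPRIS flag in `tags`; untagged transcripts rank last."""
    ranks = [APPRIS_RANK[t] for t in tags if t in APPRIS_RANK]
    return min(ranks) if ranks else len(APPRIS_ORDER)

def best_by_appris(transcripts: dict[str, list[str]]) -> str:
    """Return the transcript ID with the best rank (first one wins on ties)."""
    return min(transcripts, key=lambda tid: appris_rank(transcripts[tid]))

# Hypothetical transcripts of one gene, keyed by ID, with their tag lists.
gene = {
    "T1": ["basic", "appris_alternative_1"],
    "T2": ["basic", "appris_principal_3", "CCDS"],
    "T3": ["basic"],
}
print(best_by_appris(gene))
```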
appris_principal
transcript expected to code for the main functional isoform based on a range of protein features (APPRIS pipeline).
appris_candidate
where there is no single 'appris_principal' variant, the main functional isoform will be translated from one of the 'appris_candidate' transcripts.
appris_candidate_ccds
the "appris_candidate" transcript that has a unique CCDS.
appris_candidate_highest_score
where there is no 'appris_principal' variant, the candidate with highest APPRIS score is selected as the primary variant.
appris_candidate_longest
where there is no 'appris_principal' variant, the longest of the 'appris_candidate' variants is selected as the primary variant.
appris_candidate_longest_ccds
where several "appris_candidate" transcripts have a CCDS, APPRIS labels the one with the longest CCDS.
appris_candidate_longest_seq
where there is no "appris_candidate_ccds" or "appris_candidate_longest_ccds" variant, the longest protein of the "appris_candidate" variants is selected as the primary variant.
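
Several of the numbered APPRIS flags are described above as replacements for older names. When files from different releases are mixed, it can help to normalise the old names first. The sketch below encodes only the correspondences stated in this list. Older flags with no stated equivalent are left untouched, and the function name is hypothetical.

```python
# Correspondences stated in the flag descriptions; nothing else is assumed.
OLD_TO_NEW = {
    "appris_principal": "appris_principal_1",
    "appris_candidate_ccds": "appris_principal_2",
    "appris_candidate_longest_ccds": "appris_principal_4",
    "appris_candidate_longest_seq": "appris_principal_5",
}

def modernise(tags: list[str]) -> list[str]:
    """Replace older APPRIS flag names with their numbered equivalents."""
    return [OLD_TO_NEW.get(tag, tag) for tag in tags]

print(modernise(["basic", "appris_principal", "appris_candidate"]))
```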
artifactual_duplication
annotated on an artifactual duplicate region of the genome assembly.
basic
identifies a subset of representative transcripts for each gene; prioritises full-length protein coding transcripts over partial or non-protein coding transcripts within the same gene, and intends to highlight those transcripts that will be useful to the majority of users.
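
A common use of this tag is to reduce an annotation file to the representative subset. The sketch below filters GTF lines on it, assuming the tag is written as tag "basic"; in the ninth column. Rows that do not carry the tag (which may include gene-level rows) are dropped, so treat it as a starting point only. The function name and sample rows are hypothetical.

```python
from typing import Iterable, Iterator

def keep_basic(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines and any GTF row whose attributes include tag "basic"."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header/comment lines unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9 and 'tag "basic";' in fields[8]:
            yield line

# Invented rows for illustration only.
sample = [
    "##description: example\n",
    'chr1\tHAVANA\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; tag "basic";\n',
    'chr1\tHAVANA\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n',
]
for row in keep_basic(sample):
    print(row, end="")
```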
bicistronic
transcript contains two confidently annotated CDSs. Support may come from e.g. proteomic data, cross-species conservation or published experimental work.
CAGE_supported_TSS
transcript 5' end overlaps an ENCODE or FANTOM CAGE cluster.
CCDS
member of the consensus CDS gene set, confirming coding regions between ENSEMBL, UCSC, NCBI and HAVANA.
cds_end_NF
the coding region end could not be confirmed.
cds_start_NF
the coding region start could not be confirmed.
dotter_confirmed
transcript QC checked using a dotplot to identify features, e.g. splice junctions and the end of homology.
downstream_ATG
an upstream ATG is used where a downstream ATG seems more evolutionarily conserved.
Ensembl_canonical
most representative transcript of the gene. This will be the MANE_Select transcript if there is one, or a transcript chosen by an Ensembl algorithm otherwise.
exp_conf
transcript was tested and confirmed experimentally.
fragmented_locus
locus consists of non-overlapping transcript fragments either because of genome assembly issues (i.e., gaps or mis-assemblies), or because supporting transcripts (e.g., from another species) cannot be completely mapped, or because the supporting transcripts are non-overlapping end pairs (i.e., 5' and 3' ESTs from a single cDNA).
inferred_exon_combination
transcript model contains all possible in-frame exons supported by homology, experimental evidence or conservation, but the exon combination is not directly supported by a single piece of evidence and may not be biological. Used for large genes with repetitive exons (e.g. titin (TTN)) to represent all the exons individual transcript variants can pool from.
inferred_transcript_model
transcript model is not supported by a single piece of transcript evidence. May be supported by multiple fragments of transcript evidence or by combining different evidence sources e.g. protein homology, RNA-seq data, published experimental data.
low_sequence_quality
transcript supported by transcript evidence that, while mapping best-in-genome, shows regions of poor sequence quality.
mRNA_end_NF
the mRNA end could not be confirmed.
mRNA_start_NF
the mRNA start could not be confirmed.
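
Together with cds_start_NF and cds_end_NF, the two mRNA flags mark transcripts with an unconfirmed end. A minimal sketch for reporting which ends are unconfirmed is shown below. The function name is hypothetical and the input is assumed to be a transcript's list of tags.

```python
# The four "not found" flags described in this list.
INCOMPLETE_FLAGS = {"cds_start_NF", "cds_end_NF", "mRNA_start_NF", "mRNA_end_NF"}

def unconfirmed_ends(tags: list[str]) -> list[str]:
    """Return the *_NF flags present in `tags`, sorted for stable output."""
    return sorted(INCOMPLETE_FLAGS & set(tags))

def is_complete(tags: list[str]) -> bool:
    """True when none of the *_NF flags is present."""
    return not unconfirmed_ends(tags)

print(unconfirmed_ends(["basic", "mRNA_end_NF", "cds_end_NF"]))
print(is_complete(["basic", "CCDS"]))
```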
MANE_Select
the transcript belongs to the MANE Select data set. The Matched Annotation from NCBI and EMBL-EBI project (MANE) is a collaboration between Ensembl-GENCODE and RefSeq to select a default transcript per human protein coding locus that is representative of biology, well-supported, expressed and conserved. This transcript set matches GRCh38 and is 100% identical between RefSeq and Ensembl-GENCODE for 5' UTR, CDS, splicing and 3' UTR.
MANE_Plus_Clinical
the transcript belongs to the MANE Plus Clinical data set. Within the MANE project, these are additional transcripts per locus necessary to support clinical variant reporting, for example transcripts containing known pathogenic or likely pathogenic clinical variants not reportable using the MANE Select data set. This transcript set matches GRCh38 and is 100% identical between RefSeq and Ensembl-GENCODE for 5' UTR, CDS, splicing and 3' UTR.
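
Since Ensembl_canonical is described as the MANE_Select transcript where one exists, a simple way to choose one transcript per gene is to look for MANE_Select first and fall back to Ensembl_canonical. The sketch below does this. The function name and transcript identifiers are hypothetical, and MANE_Plus_Clinical transcripts are deliberately not treated as the default.

```python
from typing import Optional

def representative(transcripts: dict[str, list[str]]) -> Optional[str]:
    """Return the MANE_Select transcript ID if present, else the Ensembl_canonical one."""
    for preferred in ("MANE_Select", "Ensembl_canonical"):
        for transcript_id, tags in transcripts.items():
            if preferred in tags:
                return transcript_id
    return None  # neither tag found for this gene

# Hypothetical genes: one with a MANE Select transcript, one without.
coding_gene = {
    "T1": ["basic"],
    "T2": ["basic", "Ensembl_canonical", "MANE_Select"],
    "T3": ["basic", "MANE_Plus_Clinical"],
}
noncoding_gene = {"T4": [], "T5": ["basic", "Ensembl_canonical"]}

print(representative(coding_gene))
print(representative(noncoding_gene))
```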
NAGNAG_splice_site
in-frame type of variation where, at the acceptor site, some variants splice after the first AG and others after the second AG.
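
To make the motif concrete: a NAGNAG acceptor has two tandem AG dinucleotides three bases apart, so splicing after either one keeps the reading frame and differs by one codon. The sketch below only scans a sequence for the NAGNAG pattern. It does not check that the match sits at an annotated acceptor site, and the function name and example sequence are invented.

```python
import re

# Lookahead so that overlapping motifs are all reported.
NAGNAG = re.compile(r"(?=([ACGT]AG[ACGT]AG))")

def nagnag_positions(sequence: str) -> list[int]:
    """Return 0-based start positions of every NAGNAG motif in `sequence`."""
    return [match.start() for match in NAGNAG.finditer(sequence.upper())]

# Invented intron/exon boundary fragment containing CAGCAG.
print(nagnag_positions("TTTCAGCAGGT"))
```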
ncRNA_host
the locus is a host for small non-coding RNAs.
nested_454_RNA_Seq_supported
annotated based on RNA-seq data.
NMD_exception
the transcript looks like it is subject to NMD but publications, experiments or conservation support the translation of the CDS.
NMD_likely_if_extended
the transcript would likely be subject to NMD if it were longer, but it cannot currently be annotated as NMD because it does not fulfil all criteria, most commonly for lack of an intron downstream of the stop codon.
non_ATG_start
the CDS has a non-ATG start and its validity is supported by publication or conservation.
non_canonical_conserved
the transcript has a non-canonical splice site conserved in other species.
non_canonical_genome_sequence_error
the transcript has a non-canonical splice site explained by a genomic sequencing error.
non_canonical_other
the transcript has a non-canonical splice site explained by other reasons.
non_canonical_polymorphism
the transcript has a non-canonical splice site explained by a SNP.
non_canonical_TEC
the transcript has a non-canonical splice site that needs experimental confirmation.
non_canonical_U12
the transcript has a non-canonical splice site explained by a U12 intron (i.e. AT-AC splice site).
non_submitted_evidence
a splice variant for which supporting evidence has not been submitted to databases, i.e. the model is based on literature or collaborator evidence.
not_best_in_genome_evidence
transcript is supported by evidence from paralogous loci in the same species.
not_organism_supported
evidence from other species was used to build model.
orphan
protein-coding locus with no paralogues or orthologues.
overlapping_locus
exon(s) of the locus overlap exon(s) of a readthrough transcript or a transcript belonging to another locus.
overlapping_uORF
a low-confidence upstream ATG present in another coding variant would lead to NMD in this transcript, which uses the high-confidence canonical downstream ATG.
PAR
annotation in the pseudo-autosomal region, which is duplicated between chromosomes X and Y.
pseudo_consens
member of the pseudogene set predicted by YALE, UCSC and HAVANA.
readthrough_gene
protein-coding gene that has a readthrough transcript.
readthrough_transcript
a transcript that overlaps two or more independent loci but is considered to belong to a third, separate locus.
reference_genome_error
locus overlaps a sequence error or an assembly error in the reference genome that affects its annotation (e.g., 1 or 2bp insertion/deletion, substitution causing premature stop codon). The main effect is that affected transcripts that would have had a CDS are currently annotated without one.
retained_intron_CDS
internal intron of CDS portion of transcript is retained.
retained_intron_final
final intron of CDS portion of transcript is retained.
retained_intron_first
first intron of CDS portion of transcript is retained.
retrogene
protein-coding locus created via retrotransposition.
RNA_Seq_supported_only
transcript supported by RNA-seq data and not supported by mRNA or EST evidence.
RNA_Seq_supported_partial
transcript annotated based on mixture of RNA-seq data and EST/mRNA/protein evidence.
RP_supported_TIS
transcript that contains a CDS whose translation initiation site is supported by ribosome profiling data.
seleno
contains a selenocysteine.
semi_processed
a processed pseudogene with one or more introns still present. These are likely formed through the retrotransposition of a retained intron transcript.
sequence_error
transcript contains at least 1 non-canonical splice junction that is associated with a known or novel genome sequence error.
stop_codon_readthrough
transcript whose coding sequence contains an internal stop codon that does not cause translation termination.
TAGENE
Transcript created or extended using assembled RNA-seq long reads.
upstream_ATG
an upstream ATG exists when a downstream ATG is better supported.
upstream_uORF
a low-confidence upstream ATG present in another coding variant would lead to NMD in this transcript, which uses the high-confidence canonical downstream ATG.