GENCODE promoter windows

The concept of the ‘extended gene’ includes not just the transcripts themselves, but also core elements that contribute to gene function, including promoters and enhancers. Thus, the integration of traditional gene and transcript annotations with regulatory elements can offer improved support for genome interpretation. In collaboration with the Ensembl project, we present an initial annotation file of human GENCODE promoter windows. This set of annotations, while basic, is primarily intended at this stage to offer a standardised alternative to ad hoc user-defined promoter annotations based on GENCODE gene annotation. It also offers a step towards deeper integration with Ensembl regulatory annotations. Initially, promoter windows are provided for each human protein-coding gene in Ensembl release 114 / GENCODE 48. Other gene biotypes, such as long non-coding RNAs (lncRNAs), are not yet covered, and readthrough genes are excluded unless the relevant transcript is a MANE Select.

Ensembl provides standardised cross-species annotation which is regularly upgraded to reflect the latest developments in experimental and computational methods. Ensembl annotates promoters using gene annotation, but also requires sufficient experimental evidence from open chromatin assays, such as ATAC-seq or DNase-seq. In contrast, GENCODE provides stable and reliable gene annotation based on transcriptomics evidence with manual oversight, and with an in-built expectation of high stability to match user needs. GENCODE promoter windows are therefore conceived as an extension of gene annotation that allows for a stable concept of promoter annotation. The approach rests on the logic that a ‘true’ transcript start site (TSS) should always colocalise with a promoter element, and so certain transcriptomics datasets can be used to predict promoter locations tied to specific transcript models.

The promoter window is defined as the 1000 bp immediately upstream of the MANE Select TSS (i.e. the promoter window ends on the base before the first base of the MANE Select model). MANE Select transcripts are representative transcript models that are identically annotated by NCBI RefSeq and Ensembl/GENCODE following exhaustive comparisons and data analysis. The usage of MANE Select transcripts is crucial for two reasons. Firstly, the vast majority of models have a TSS accurately established from transcriptomics data, which means the window can also be appropriately localised with confidence. Secondly, these models are highly stable, so promoter windows anchored to their TSS can also be stable. Within the file, promoter windows of this type (97% of the total set) are tagged as ‘MANE_Select_with_TSS’. We anticipate that the sequence coordinates of these windows will remain constant in future releases.
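The coordinate arithmetic implied by this definition can be sketched as follows. This is a minimal illustration, not the pipeline used to build the file. It assumes 1-based inclusive coordinates (as used in GFF3) and takes the TSS to be the lowest coordinate of the transcript on the forward strand and the highest on the reverse strand. The function name `promoter_window` and the clamping at position 1 are our own assumptions.

```python
WINDOW_SIZE = 1000  # bp upstream of the TSS, as stated in the text


def promoter_window(tss, strand, size=WINDOW_SIZE):
    """Return (start, end) of the window immediately upstream of a TSS.

    Coordinates are 1-based and inclusive. The window ends on the base
    before the TSS, so the TSS base itself is never included.
    """
    if tss < 1:
        raise ValueError("TSS must be a positive 1-based coordinate")
    if strand == "+":
        # Upstream means lower coordinates on the forward strand.
        start, end = tss - size, tss - 1
    elif strand == "-":
        # Upstream means higher coordinates on the reverse strand.
        start, end = tss + 1, tss + size
    else:
        raise ValueError("strand must be '+' or '-'")
    # Assumption: clamp at the start of the sequence. Clamping at the far
    # end would need the chromosome length, which is not handled here.
    return max(start, 1), end
```

For a forward-strand TSS at position 5000 this gives the window 4000 to 4999, and for a reverse-strand TSS at the same position it gives 5001 to 6000. Both are 1000 bp long and exclude the TSS base.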

However, it has also been necessary to define two additional window types. Firstly, 353 MANE Select models do not have a TSS, simply because it is considered that the TSS cannot yet be annotated with confidence. Windows for such models are instead tagged as ‘MANE_Select_without_TSS’. Secondly, 240 GENCODE protein-coding genes do not yet have MANE Select models. Windows for these genes are tagged as ‘Non_MANE_Select_model’, with the window being anchored to the Ensembl Canonical model instead. We anticipate that further gene annotation work will resolve and refine these cases, and so their promoter window annotations cannot be considered stable at present.
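The three window types and their expected stability can be summarised in a small decision function. This is a sketch of the logic as described in the text only. The function name `classify_window` and its two boolean inputs are hypothetical, and the file itself may encode this information differently.

```python
def classify_window(has_mane_select, has_confident_tss):
    """Return (tag, expected_stable) for a protein-coding gene.

    has_mane_select:   the gene has a MANE Select transcript model.
    has_confident_tss: that model has a confidently annotated TSS.
    """
    if not has_mane_select:
        # Window is anchored to the Ensembl Canonical model instead.
        return "Non_MANE_Select_model", False
    if has_confident_tss:
        # Only this type is expected to keep its coordinates across releases.
        return "MANE_Select_with_TSS", True
    return "MANE_Select_without_TSS", False
```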

We emphasise that the GENCODE promoter windows presented in this set are transcript-anchored regions expected to contain promoter-like features and experimental data relevant to that gene; they are not annotations of the gene promoters themselves. In reality, promoters vary in size, and defining exactly where an element should begin and end in terms of sequence coordinates is difficult, even with experimental data. Thus, the 1000 bp window was chosen because any promoter annotations for a given gene provided by Ensembl Regulation and ENCODE are highly likely to overlap with this sequence. Indeed, around 80% of Ensembl promoters linked to protein-coding genes overlap GENCODE promoter windows, and the Ensembl promoter IDs are provided in the file for these cases. We also note that genes often display substantial variability in the location of their TSS, even within a given first exon, and that this variation is often captured in GENCODE transcript models. As such, a given promoter window may overlap with transcript sequence from an alternative GENCODE transcript model. Finally, we recognise that protein-coding genes can have multiple promoters due to the usage of alternative first exons, and so the current model of a single promoter window per protein-coding gene does not capture this full complexity.
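An overlap statistic of the kind quoted above could be computed along the following lines. This is a hedged sketch rather than the method actually used: it assumes 1-based inclusive intervals, assumes each promoter has already been linked to a gene on the same chromosome as that gene's window, and uses the hypothetical function names `intervals_overlap` and `fraction_overlapping`.

```python
def intervals_overlap(a, b):
    """True if two 1-based inclusive (start, end) intervals share a base."""
    return a[0] <= b[1] and b[0] <= a[1]


def fraction_overlapping(windows, promoters):
    """Fraction of gene-linked promoters that overlap their gene's window.

    windows:   dict mapping gene ID -> (start, end) of the promoter window.
    promoters: list of (gene ID, start, end) for regulatory promoters.
    """
    if not promoters:
        return 0.0
    hits = sum(
        1
        for gene_id, start, end in promoters
        if gene_id in windows
        and intervals_overlap(windows[gene_id], (start, end))
    )
    return hits / len(promoters)
```

Counting any shared base as an overlap is the most permissive choice; a stricter criterion, such as a minimum number of shared bases, would give a lower figure.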

The promoter windows for release 48 can be downloaded as a gff3 file.
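Once downloaded, the file can be summarised with a few lines of standard-library Python. The sketch below relies only on the general GFF3 layout (nine tab-separated columns, with `key=value` pairs separated by semicolons in the last column). Because the attribute key that holds the window type is not stated here, the code searches all attribute values for the three tags named above instead of assuming a key. The function names are hypothetical.

```python
from collections import Counter

WINDOW_TYPES = (
    "MANE_Select_with_TSS",
    "MANE_Select_without_TSS",
    "Non_MANE_Select_model",
)


def parse_attributes(field):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def count_window_types(lines):
    """Count feature lines per window type in an iterable of GFF3 lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a well-formed GFF3 feature line
        attrs = parse_attributes(columns[8])
        # GFF3 allows comma-separated multiple values per attribute.
        values = {item for v in attrs.values() for item in v.split(",")}
        for window_type in WINDOW_TYPES:
            if window_type in values:
                counts[window_type] += 1
                break  # count each feature line at most once
    return counts


# Usage, with a placeholder path for the downloaded file:
# with open("promoter_windows.gff3") as handle:
#     print(count_window_types(handle))
```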

A track hub to view the promoter windows in a genome browser is available here.