Improving lncRNA annotation with the GENCODE Capture Long-read Sequencing project
Over the last five years GENCODE has been developing and deploying our Capture Long-read Sequencing (CLS) pipeline to improve our human and mouse gene annotations. We recently finalised the third phase of this project (CLS3), which specifically focuses on lncRNA annotations in both species. Following this work, human v47 and mouse vM36 contain the largest increases in transcript models ever seen between consecutive releases for both species.
The CLS project began with the design of capture arrays, targeting over 300,000 regions of the human and mouse genomes considered to be potentially interesting with respect to lncRNA transcription and annotation. PacBio and Oxford Nanopore (ONT) long-read sequencing was performed by GENCODE partners at the Centre for Genomic Regulation (CRG) in Barcelona, combining the CLS methodology [1] with CapTrap cDNA library preparation [2] to produce over 1.5 billion raw reads. These were processed using the CRG LyRic pipeline, generating a collection of full-length transcript models that could be used as a data source for annotation. Further information is available here.
These models were integrated into the GENCODE annotations for both species via a new manually supervised computational workflow that we call TAGENE, which has been developed over several years through iterative testing. Following the deployment of TAGENE, over 132,000 novel lncRNA transcripts were added to human v47 and over 129,000 to mouse vM36, corresponding to approximately 3-fold and 6-fold increases in the respective lncRNA transcript counts. We have also nearly doubled the number of lncRNA genes annotated by GENCODE.
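Readers who wish to check lncRNA counts in a given release for themselves could tally them directly from the annotation GTF file. The following is a minimal sketch rather than a GENCODE tool: it assumes that transcript rows carry `gene_id`, `transcript_id` and `transcript_type "lncRNA"` in the attribute column, and the function name `count_lncrna` and the toy `SAMPLE` lines (with made-up identifiers) are purely illustrative.

```python
import re
from typing import Iterable, Tuple

# Matches key "value" pairs in the GTF attribute column (column 9).
_ATTR = re.compile(r'(\S+) "([^"]*)"')


def count_lncrna(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of lncRNA genes, number of lncRNA transcripts).

    Assumption: lncRNA transcripts are rows whose feature type is
    "transcript" and whose attributes include transcript_type "lncRNA".
    """
    genes = set()
    transcripts = set()
    for line in lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(_ATTR.findall(fields[8]))
        if attrs.get("transcript_type") == "lncRNA":
            transcripts.add(attrs.get("transcript_id"))
            genes.add(attrs.get("gene_id"))
    return len(genes), len(transcripts)


# Toy input with hypothetical identifiers, not real annotation.
SAMPLE = [
    "##description: toy example",
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_type "lncRNA";',
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T1"; transcript_type "lncRNA";',
    'chr1\tHAVANA\ttranscript\t150\t900\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T2"; transcript_type "lncRNA";',
    'chr2\tHAVANA\ttranscript\t10\t500\t.\t-\t.\t'
    'gene_id "G2"; transcript_id "T3"; transcript_type "protein_coding";',
]
```

Passing an open file handle for two consecutive releases and dividing the resulting transcript counts would give a fold change comparable to the figures quoted above, provided the biotype labels are consistent between the two files.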
This work is described in a pre-print submitted to bioRxiv: 
https://www.biorxiv.org/content/10.1101/2024.10.29.620654v1
References
[1] Lagarde J, Uszczynska-Ratajczak B, Carbonell S, et al. High-throughput annotation of full-length long noncoding RNAs with capture long-read sequencing. Nature Genetics. 2017 Dec;49(12):1731-1740. DOI: 10.1038/ng.3988. PMID: 29106417; PMCID: PMC5709232.
[2] Carbonell-Sala S, Perteghella T, Lagarde J, et al. CapTrap-seq: a platform-agnostic and quantitative approach for high-fidelity full-length RNA sequencing. Nature Communications. 2024 Jun;15(1):5278. DOI: 10.1038/s41467-024-49523-3. PMID: 38937428.
