Prediction of transcriptional pause sites

Disclaimer: Pauser is currently in prerelease. It has not been peer-reviewed or published.

Transcription elongation is carried out by RNA polymerase (RNAP). RNAP translocates along the template DNA in the $3^\prime \rightarrow 5^\prime$ direction, while the RNA grows sequentially in the $5^\prime \rightarrow 3^\prime$ direction. In prokaryotes and eukaryotes, this process occurs at a mean rate of 20-100 bp/s. However, individual RNAP molecules proceed erratically along their template. Approximately once every 100 bp there is a pause site, which RNAP can take seconds or minutes to transcribe. Transcriptional pausing plays a range of important biological roles in gene expression.

Pauser is a web-based program for predicting the locations of pause sites in gene sequences, framed as a binary classification problem. Its two predictors, SimPol and a Naive Bayes classifier, were trained on the E. coli RNAP using published data.





Getting started
Information icons

Help icons are placed throughout the Pauser interface. Pressing one of these buttons opens the relevant section of this page.
Uploading nucleotide sequences

To begin, upload one or more nucleotide sequences in .fasta format. Ensure that the sequences are positive sense and are written $5^\prime \rightarrow 3^\prime$. Optionally, the known locations of pause sites within each sequence may be included, where the position of a pause site is defined as the length of the mRNA when the pause occurs. To do this, append the pause site indices to the accession (header) line of the sequence, like so:
>seq1|pausesites=(2,6,18-20)
AGCGTTAGGCGATTCGGGAAATGCGATTG
>seq2
ATTCGGGATCATCGAT
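The header format above can also be read programmatically. The following is a minimal Python sketch, not Pauser's own parser: the function names `parse_header` and `parse_fasta` are hypothetical, and it assumes that a range such as 18-20 is inclusive (sites 18, 19 and 20).

```python
import re

def parse_header(header):
    """Split a header like '>seq1|pausesites=(2,6,18-20)' into (name, sorted sites)."""
    name, _, rest = header.lstrip(">").strip().partition("|")
    sites = set()
    match = re.search(r"pausesites=\(([^)]*)\)", rest)
    if match:
        for token in match.group(1).split(","):
            token = token.strip()
            if not token:
                continue
            if "-" in token:
                # Assumption: ranges are inclusive at both ends.
                start, end = (int(t) for t in token.split("-"))
                sites.update(range(start, end + 1))
            else:
                sites.add(int(token))
    return name, sorted(sites)

def parse_fasta(text):
    """Return a list of (name, pause_sites, sequence) tuples."""
    records = []
    name, sites, chunks = None, [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, sites, "".join(chunks)))
            name, sites = parse_header(line)
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, sites, "".join(chunks)))
    return records

example = """>seq1|pausesites=(2,6,18-20)
AGCGTTAGGCGATTCGGGAAATGCGATTG
>seq2
ATTCGGGATCATCGAT"""

for record in parse_fasta(example):
    print(record)
```

Sequences without a pausesites annotation, such as seq2, simply yield an empty list of known pause sites.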
Uploading a SimPol session

Optionally, a session file may be generated using SimPol and uploaded to Pauser, in .xml format. This will determine the kinetic model to use for predicting pause sites. The session file specifies the parameters and model settings, including their prior distributions, and the number of simulations to perform per sequence. The default XML file can be viewed below and session files can be generated at www.polymerase.nz/simpol.

Default session: view XML




Prediction of transcriptional pause sites
Let $X = (x_1, x_2, \dotso , x_L)$ be a sequence with $L$ nucleotides. Each site $l$ within $X$ is classified as being a pause site, $C_l = \mathcal{P}$, or as not being a pause site, $C_l = \mathcal{N}$. Site $l$ is defined as a pause site if the average time taken to transcribe position $l+1$ exceeds an adjustable threshold.

When Pauser is started, each uploaded sequence is analysed by two classifiers: SimPol and a Naive Bayes classifier (NBC). Each classifier calculates the evidence that each site is a pause site. The interpretation of this evidence differs between the classifiers, so the evidence threshold of each classifier can be adjusted independently. Lower thresholds result in more sites being classified as pause sites, and higher thresholds result in fewer pause classifications.
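The effect of the threshold can be illustrated with a short Python sketch. The function name `classify_sites`, the evidence values, and the use of `>=` for the comparison are assumptions made for illustration; Pauser's own comparison rule may differ.

```python
def classify_sites(evidence, threshold):
    """Return the 1-based indices of sites whose evidence meets the threshold."""
    return [l for l, e in enumerate(evidence, start=1) if e >= threshold]

# Made-up evidence values for a five-site sequence.
evidence = [0.2, 1.5, 0.1, 3.0, 0.8]

print(classify_sites(evidence, 1.0))  # [2, 4]
print(classify_sites(evidence, 0.5))  # [2, 4, 5]
```

Lowering the threshold from 1.0 to 0.5 adds site 5 to the predicted pause sites, which is the behaviour described above.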
Classifiers

SimPol

SimPol is a framework for developing kinetic models of transcription elongation. These models are physically informed by the thermodynamics of nucleic acids. While these models lack predictive power when it comes to transcriptional pausing, they are explanatory models and can guide the development of testable hypotheses. To change the default model, see Uploading a SimPol session.

Evidence: the evidence that site $l$ is a pause site is equal to the median time that the mRNA has length $l$, averaged across all simulations of the sequence.
Naive Bayes

A Naive Bayes classifier (NBC) is a simple statistical technique; here it is a purely sequence-based model. All nucleotides within a window of $w_1 + w_2 + 1$ sites around site $l$ (from $l - w_1$ to $l + w_2$) are examined by the model, and this is used to classify $l$ into one of the two classes. The NBC assumes that the sites within the window are independent of one another. While the NBC has significantly stronger predictive power than SimPol, it lacks explanatory power.

Evidence: the evidence that site $l$ is a pause site is equal to the log posterior ratio: $$ \log \frac{P(C_l = \mathcal{P}|X)}{P(C_l = \mathcal{N}|X)} = \log \frac {P(\mathcal{P}) \prod\limits_{j=l-w_1}^{l+w_2} P(x_j| C_l = \mathcal{P}) } {P(\mathcal{N}) \prod\limits_{j=l-w_1}^{l+w_2} P(x_j| C_l = \mathcal{N}) } $$ where $P(\mathcal{P})$ and $P(\mathcal{N})$ are the prior probabilities of a site being a pause site and not being a pause site (equal to 0.007 and 0.993, respectively).
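Under the independence assumption of a Naive Bayes classifier, the likelihood of the window factorises into per-site terms, so the log posterior ratio is the log prior ratio plus a sum of per-site log likelihood ratios. The Python sketch below illustrates this calculation only. The function name, the table layout (one dictionary of nucleotide probabilities per window position), and the choice to raise an error when the window runs off the sequence are all assumptions, not a description of Pauser's trained parameters or edge handling.

```python
import math

PRIOR_PAUSE = 0.007      # P(P), from the text above
PRIOR_NOT_PAUSE = 0.993  # P(N), from the text above

def log_posterior_ratio(seq, l, pause_probs, not_pause_probs, w1, w2):
    """Log posterior ratio for 1-based site l.

    pause_probs[k][nt] and not_pause_probs[k][nt] are hypothetical tables
    giving P(x_j = nt | class) for window position k = 0 .. w1 + w2.
    """
    if l - w1 < 1 or l + w2 > len(seq):
        raise ValueError("window extends beyond the sequence")
    score = math.log(PRIOR_PAUSE) - math.log(PRIOR_NOT_PAUSE)
    for k, j in enumerate(range(l - w1, l + w2 + 1)):
        nt = seq[j - 1]  # convert the 1-based site to a 0-based string index
        score += math.log(pause_probs[k][nt]) - math.log(not_pause_probs[k][nt])
    return score

# Illustrative tables for w1 = w2 = 1 (three window positions).
uniform = [{nt: 0.25 for nt in "ACGT"} for _ in range(3)]
g_rich = [uniform[0], {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}, uniform[2]]

print(log_posterior_ratio("ACGTA", 3, uniform, uniform, 1, 1))
print(log_posterior_ratio("ACGTA", 3, g_rich, uniform, 1, 1))
```

With identical tables for both classes the evidence reduces to the log prior ratio, so any difference between the two printed values comes from the sequence alone.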
Evaluating classifier adequacy

If the locations of pause sites are already known (see Uploading nucleotide sequences), then the adequacy of the classifiers can be quantified. The recall, precision, and accuracy are reported for each sequence, along with their averages across all sequences. Let $T_l = \mathcal{P}$ if site $l$ is a known pause site, and let $T_l = \mathcal{N}$ if $l$ is known to not be a pause site.

Recall is the probability of a true pause being classified as a pause: $P(C_l = \mathcal{P}|T_l = \mathcal{P}) = \frac{P(C_l = \mathcal{P} \; \cap \; T_l = \mathcal{P})}{P(T_l = \mathcal{P})}$

Precision is the probability that a site classified as a pause is actually a pause: $P(T_l = \mathcal{P}|C_l = \mathcal{P}) = \frac{P(C_l = \mathcal{P} \; \cap \; T_l = \mathcal{P})}{P(C_l = \mathcal{P})}$

Accuracy is the overall proportion of correct classifications: $P(T_l = \mathcal{P} \; \cap \; C_l = \mathcal{P}) + P(T_l = \mathcal{N} \; \cap \; C_l = \mathcal{N})$
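The three quantities above can be estimated from counts of true and false classifications. The following Python sketch assumes that predicted and known pause sites are given as collections of 1-based site indices; the function name `evaluate` and the example values are hypothetical.

```python
def evaluate(predicted, known, length):
    """Return (recall, precision, accuracy) for one sequence of the given length."""
    predicted, known = set(predicted), set(known)
    tp = len(predicted & known)   # classified as pause, truly a pause
    fp = len(predicted - known)   # classified as pause, not a pause
    fn = len(known - predicted)   # true pause that was missed
    tn = length - tp - fp - fn    # correctly classified as not a pause
    recall = tp / (tp + fn) if tp + fn else float("nan")
    precision = tp / (tp + fp) if tp + fp else float("nan")
    accuracy = (tp + tn) / length
    return recall, precision, accuracy

# Known sites from the seq1 example (length 29) and made-up predictions.
recall, precision, accuracy = evaluate([2, 6, 10], [2, 6, 18, 19, 20], 29)
print(recall, precision, accuracy)
```

Because pause sites are rare, accuracy can be high even when recall is poor, which is why all three measures are reported together.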
ROC curve

If the locations of pause sites are already known, then a ROC curve will be constructed. A ROC analysis accounts for all possible evidence thresholds of a classifier and explores true and false positive rates simultaneously. The overall adequacy of a classifier can be summarised by the area under the ROC curve (AUC). The AUC lies in the range [0,1]. A perfect classifier will have an AUC of 1, while a classifier with an AUC of 0.5 or less is no better than a coin toss. See the review by Fawcett for an introduction to ROC analyses.
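As an illustration of how a ROC curve and its AUC can be built from per-site evidence, the Python sketch below sweeps the threshold from high to low and integrates with the trapezoidal rule. It is a generic construction, not Pauser's implementation; the function name `roc_auc` and the example values are hypothetical.

```python
def roc_auc(evidence, is_pause):
    """Return (ROC points, AUC) given per-site evidence and known labels."""
    pairs = sorted(zip(evidence, is_pause), key=lambda p: -p[0])
    n_pos = sum(1 for _, label in pairs if label)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one pause site and one non-pause site")
    points = [(0.0, 0.0)]  # (false positive rate, true positive rate)
    tp = fp = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Sites with tied evidence cross the threshold together.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    # Trapezoidal integration of the true positive rate over the false positive rate.
    auc = sum((x1 - x0) * (y0 + y1) / 2
              for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return points, auc

_, auc_perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [True, True, False, False])
_, auc_tied = roc_auc([0.5, 0.5, 0.5, 0.5], [True, False, True, False])
print(auc_perfect, auc_tied)  # 1.0 0.5
```

The first example separates the classes perfectly, while the second gives every site the same evidence and so performs like a coin toss.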
Example session

Consider the following E. coli genes with known pause sites, as experimentally determined by Larson et al. 2014.
>flgF|pausesites=(109,252,531,567,627) GCCCAGCGTAACTATCAGTCTAACGCCCAGACCATCAAAACCCAGGACCAGATCCTCAACACGCTGGTTAACTTACGCTAATCGCTGACGGGATAGCTCAATGGATCACGCAATTTATACCGCGATGGGAGCAGCCAGCCAGACACTGAATCAACAGGCGGTAACCGCCAGTAATCTGGCCAATGCCTCAACGCCCGGTTTTCGCGCGCAGTTGAATGCTTTACGCGCGGTGCCAGTGGAAGGGCTTTCTCTGCCCACGCGCACGTTGGTCACGGCGTCAACGCCGGGCGCAGATATGACGCCCGGCAAAATGGATTACACCTCGCGCCCGCTGGACGTCGCGTTGCAGCAGGATGGCTGGCTGGCCGTGCAGACCGCTGACGGCAGCGAAGGGTATACGCGTAATGGCAGCATTCAGGTTGATCCCACCGGGCAACTGACAATTCAGGGGCATCCGGTGATAGGCGAGGCTGGGCCAATTGCTGTGCCGGAAGGGGCGGAAATCACTATTGCTGCCGATGGCACAATCTCGGCGCTCAATCCGGGCGATCCGGCAAATACGGTTGCGCCAGTAGGGCGTCTTAAACTGGTGAAAGCCACGGGCAGCGAAGTGCAGCGCGGTGACGACGGCATTTTTCGTTTAAGCGCAGAAACCCAGGCCACGCGTGGGCCGGTACTGCAGGCAGATCCAACCTTGCGTGTGATGTCGGGGGTTCTGGAAGGCAGTAACGTCAATGCCGTTGCGGCAATGAGCGACATGATTGCCAGCGCGCGGCGTTTTGAAATGCAGATGAAGGTGATCAGCAGCGTCGATGATAACGCAGGCCGTGCCAACCAACTGCTGTCGATGAGTTAATTGAAAGGATACATGACAAGTATAAGTTGCCCGATGCGCAAGTTTATCGGGTCTATGGGGGCAATCGCAATTTATCGATTTTGCGAGCACTTGTAGGCCG

>purF|pausesites=(112,216,434,444,483,610,1004,1075,1138,1142,1442,1525,1528) TTTTATCATCAGATGTTTTTTTGATTATCTGCAAAGCTCGTCAAGTTTCTTGCCCAGAGCGTAAGTGCTCTGAGATGTGGCTTAACGAGGAAAAAGACGTATGTGCGGTATTGTCGGTATCGCCGGTGTTATGCCGGTTAACCAGTCGATTTATGATGCCTTAACGGTGCTTCAGCATCGCGGTCAGGATGCCGCCGGCATCATCACCATAGATGCCAATAACTGCTTCCGTTTGCGTAAAGCGAACGGGCTGGTGAGCGATGTATTTGAAGCTCGCCATATGCAGCGTTTGCAGGGCAATATGGGCATTGGTCATGTGCGTTACCCCACGGCTGGCAGCTCCAGCGCCTCTGAAGCGCAGCCGTTTTACGTTAACTCCCCGTATGGCATTACGCTTGCCCACAACGGCAATCTGACCAACGCTCACGAGTTGCGTAAAAAACTGTTTGAAGAAAAACGCCGCCACATCAACACCACTTCCGACTCGGAAATTCTGCTTAATATCTTCGCCAGCGAGCTGGACAACTTCCGCCACTACCCGCTGGAAGCCGACAATATTTTCGCTGCCATTGCTGCCACAAACCGCTTAATCCGCGGCGCGTATGCCTGTGTGGCGATGATTATCGGCCACGGTATGGTTGCTTTCCGCGATCCAAACGGGATTCGTCCGCTGGTACTGGGAAAACGTGATATTGACGAGAACCGTACAGAATATATGGTCGCTTCCGAAAGCGTAGCGCTCGATACGCTGGGCTTTGATTTCCTGCGTGACGTCGCGCCGGGCGAAGCGATTTACATCACTGAAGAAGGGCAGTTGTTTACCCGTCAATGTGCTGACAATCCGGTCAGCAATCCGTGCCTGTTTGAGTATGTATACTTTGCCCGCCCGGACTCGTTTATCGACAAAATTTCCGTTTACAGCGCGCGTGTGAATATGGGCACGAAACTGGGCGAGAAAATTGCCCGCGAATGGGAAGATCTGGATATCGACGTGGTGATCCCGATCCCAGAAACCTCGTGTGATATCGCGCTGGAAATTGCTCGTATTCTGGGCAAACCGTACCGCCAGGGCTTCGTTAAAAACCGCTATGTTGGCCGCACCTTTATCATGCCGGGCCAGCAGCTGCGTCGTAAGTCCGTGCGCCGTAAACTGAATGCCAACCGCGCCGAGTTCCGCGATAAAAACGTCCTGCTGGTCGACGACTCCATCGTCCGTGGCACCACTTCTGAGCAGATTATCGAGATGGCACGCGAAGCCGGAGCGAAGAAAGTGTACCTCGCTTCTGCGGCACCGGAAATTCGCTTCCCGAACGTTTATGGTATTGATATGCCGAGCGCCACGGAACTGATCGCTCACGGTCGCGAAGTTGATGAAATTCGCCAGATCATCGGTGCTGACGGGTTGATTTTCCAGGATCTGAACGATCTGATCGACGCCGTTCGCGCTGAAAATCCGGATATCCAGCAGTTTGAATGCTCGGTGTTCAACGGCGTCTACGTCACCAAAGATGTTGATCAGGGCTACCTCGATTTCCTCGATACGTTACGTAATGATGACGCCAAAGCAGTGCAACGTCAGAACGAAGTGGAAAATCTCGAAATGCATAACGAAGGATGATGCCTTCGCTGAGGGTGCCGGTCTGGCACCCTGACTTGCAACTCCCGCCGAAATCCTGCAAAGTCTGCCTGCAAGTCTGACAGGGCAACTATTTATGAAA

>cvpA|pausesites=(102,118,129,140,166) TTTATTGATGCGCGGGAAGGAAATCCCTACGCAAACGTTTTCTTTTTCTGTTAGAATGCGCCCCGAACAGGATGACAGGGCGTAAAATCGTGGGACACATATGGTCTGGATTGATTACGCCATAATCGCGGTGATTGCTTTTTCCTCTCTGGTTAGCCTGATCCGCGGCTTTGTTCGTGAAGCGTTATCGCTGGTGACATGGGGTTGTGCTTTCTTTGTTGCCAGTCATTACTACACTTACCTGTCAGTCTGGTTTACGGGCTTTGAAGACGAACTGGTTCGAAATGGGATTGCCATCGCGGTACTGTTTATCGCTACCCTGATCGTTGGTGCTATCGTGAACTTCGTGATAGGCCAGTTGGTGGAGAAAACGGGGTTGTCAGGCACCGATCGGGTGCTGGGCGTCTGTTTCGGTGCGTTGCGCGGTGTGTTGATTGTTGCTGCCATTCTCTTCTTTCTCGACTCCTTTACCGGGGTGTCGAAAAGCGAAGACTGGAGCAAATCACAGCTGATCCCGCAATTCAGTTTTATCATCAGATGTTTTTTTGATTATCTGCAAAGCTCGTCAAGTTTCTTGCCCAGAGCGTAAGTGCTCTGAGATGTGGCTTAACGAGGAAAAAGACGTATGTGCGGTATTGTCGGTATCGCCGGTGTTATGCCGGTTAACCAGTCGATTTATGATGCCTTAA

>rplE|pausesites=(230,249,293,294,404,410,495,512) CCGGCAAGGCTGACCGTGTAGGCTTTAGATTCGAAGACGGTAAAAAAGTCCGTTTCTTCAAGTCTAACAGCGAAACTATCAAGTAATTTGGAGTAGTACGATGGCGAAACTGCATGATTACTACAAAGACGAAGTAGTTAAAAAACTCATGACTGAGTTTAACTACAATTCTGTCATGCAAGTCCCTCGGGTCGAGAAGATCACCCTGAACATGGGTGTTGGTGAAGCGATCGCTGACAAAAAACTGCTGGATAACGCAGCAGCAGACCTGGCAGCAATCTCCGGTCAAAAACCGCTGATCACCAAAGCACGCAAATCTGTTGCAGGCTTCAAAATCCGTCAGGGCTATCCGATCGGCTGTAAAGTAACTCTGCGTGGCGAACGCATGTGGGAGTTCTTTGAGCGCCTGATCACTATTGCTGTACCTCGTATCCGTGACTTCCGTGGCCTGTCCGCTAAGTCTTTCGACGGTCGTGGTAACTACAGCATGGGTGTCCGTGAGCAGATCATCTTCCCAGAAATCGACTACGATAAAGTCGACCGCGTTCGTGGTTTGGATATTACCATTACCACTACTGCGAAATCTGACGAAGAAGGCCGCGCTCTGCTGGCTGCCTTTGACTTCCCGTTCCGCAAGTAAGGTAGGGTTACTAAATGGCTAAGCAATCAATGAAAGCACGCGAAGTAAAACGCGTAGCTTTAGCTGATAAATACTTCGCGAAACGCGCTGAACTGAAAGC

>ybjD|pausesites=(398,456,572,583,955,991,1436,1438) GACATTTGCCATGCTTAAATGTGATGTCATCACGTATTAGCAAGGCCTTTCCCGTTATACTGCCAGCGTAAAGGATAAGTCACATATTTCTGGAGGGGATATGATTCTTGAGCGCGTTGAAATTGTGGGTTTTCGCGGTATCAACCGTTTGTCGTTGATGCTGGAACAAAACAACGTCCTGATTGGGGAGAACGCGTGGGGTAAATCCAGCTTGCTGGACGCCTTAACTCTGCTGCTATCGCCAGAATCAGATCTCTACCATTTTGAGCGCGACGATTTCTGGTTCCCGCCGGGAGATATCAACGGGCGAGAACATCATCTGCATATTATTTTGACCTTCCGCGAATCGCTGCCAGGCCGACATCGGGTTCGCCGTTATCGGCCGCTGGAAGCGTGCTGGACGCCATGCACCGATGGCTATCACCGTATTTTTTATCGTCTGGAAGGGGAGAGTGCGGAAGACGGCAGCGTGATGACACTGCGCAGTTTTCTCGATAAAGACGGACATCCGATTGATGTCGAGGATATTAACGATCAGGCACGCCATCTGGTGCGTTTAATGCCGGTGCTGCGCTTGCGTGATGCCCGTTTTATGCGCCGTATTCGTAACGGCACGGTGCCAAATGTCCCTAATGTGGAAGTCACCGCGCGCCAGCTCGATTTCCTCGCCCGTGAGTTATCCTCACATCCGCAAAATCTCTCTGATGGGCAGATTCGTCAGGGACTTTCCGCAATGGTACAGCTGCTTGAGCATTATTTCTCTGAGCAGGGGGCCGGACAGGCGCGATATCGTTTAATGCGGCGGCGAGCCAGCAATGAGCAACGAAGCTGGCGCTATCTGGATATCATCAACCGGATGATTGACCGACCTGGTGGGCGCTCGTATCGGGTTATTTTGCTCGGCCTATTTGCTACTTTGTTGCAGGCAAAAGGCACATTGCGACTGGATAAAGACGCCCGTCCATTGTTGCTGATCGAAGATCCAGAAACCCGTTTACACCCCATTATGCTTTCAGTTGCCTGGCATCTGTTGAATCTTCTGCCATTGCAGCGCATTGCCACCACCAACTCGGGTGAGTTGCTTTCGTTAACGCCGGTAGAGCATGTTTGCCGACTGGTACGTGAGTCCTCGCGCGTTGCCGCCTGGCGTCTGGGGCCGAGTGGCTTGAGTACCGAAGATAGCCGACGCATATCCTTTCACATTCGTTTTAACCGTCCGTCATCGCTGTTTGCACGCTGCTGGTTGCTGGTGGAAGGGGAAACGGAAACCTGGGTTATCAATGAACTGGCGCGTCAGTGCGGACATCATTTTGATGCCGAAGGGATCAAGGTCATTGAGTTTGCCCAGTCCGGGCTAAAGCCACTGGTTAAATTTGCCCGCCGAATGGGGATTGAATGGCATGTACTGGTCGATGGCGATGAAGCAGGGAAGAAATATGCCGCTACGGTACGCAGCCTGTTGAATAACGATCGGGAAGCCGAACGAGAACATTTAACGGCGTTACCGGCGCTGGATATGGAACATTTTATGTATCGCCAGGGATTTTCCGATGTGTTCCACCGCATGGCGCAAATCCCGGAAAATGTACCGATGAATCTACGCAAAATTATCTCGAAAGCGATCCATCGCTCTTCCAAACCCGATCTTGCCATTGAAGTGGCAATGGAGGCAGGACGTCGTGGTGTGGACTCCGTACCGACGCTGCTGAAAAAAATGTTCTCACGCGTGCTGTGGCTGGCGCGCGGTCGCGCGGATTAACCGCGAAACATCGTGGCCATTTGTGGCTGAATAGCGTCGAGCATCTCATAGCGCCGACGGTATTCAGCCCGTTTTTTACTGGCGATTTCGGCAATCTCTT
By following the link below, the .fasta content above will be loaded into Pauser. The SimPol and NBC pause site predictions can be compared with the known locations specified above. The algorithms should take several seconds to complete, depending on your machine.

Pauser example session: Open





Software information
Running Pauser from the command line

Running Pauser from the command line is the recommended option for large batches of long sequences. The following instructions apply to Linux and macOS.
1. Download Pauser from here and unzip it. Open a terminal and navigate to the unzipped folder.

2. Compile the ViennaRNA suite module. RNAfold is used when enabled in the SimPol .xml file.

cd src/asm/ViennaRNA
bash build_vrna.sh

3. Compile Pauser. ViennaRNA must be successfully compiled first.

cd ../../../pauser/src/asm/
bash build_vrna.sh

4. Run Pauser.

bash pauser.sh -fasta path/to/sequences.fasta -o path/to/output.csv

where sequences.fasta is the file containing sequences, and output.csv the file to save the results to. Optional arguments:

-h: View the command line argument specifications.

-xml session.xml: Load the specified SimPol session in .xml format. If unspecified, the default session at pauser/pauser.xml is loaded.

-nbc params.txt: Load the specified naive Bayes classifier parameters. If unspecified, the default parameters at pauser/NBC_pauser.txt are loaded.
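For large batches, the command above can be assembled and run from a script. The Python sketch below uses only the arguments listed in this section (-fasta, -o, -xml, -nbc). The helper names `pauser_command` and `run_batch` are hypothetical, and the sketch assumes it is run from the folder containing pauser.sh.

```python
import subprocess
from pathlib import Path

def pauser_command(fasta, output, xml=None, nbc=None):
    """Build the argument list for one Pauser run."""
    cmd = ["bash", "pauser.sh", "-fasta", str(fasta), "-o", str(output)]
    if xml is not None:
        cmd += ["-xml", str(xml)]  # optional SimPol session
    if nbc is not None:
        cmd += ["-nbc", str(nbc)]  # optional NBC parameter file
    return cmd

def run_batch(fasta_paths, output_dir, **options):
    """Run Pauser once per .fasta file, writing one .csv per input."""
    for fasta in fasta_paths:
        out = Path(output_dir) / (Path(fasta).stem + ".csv")
        # check=True raises an error if a run exits with a non-zero status.
        subprocess.run(pauser_command(fasta, out, **options), check=True)

print(pauser_command("path/to/sequences.fasta", "path/to/output.csv"))
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.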





Pauser in the Literature


Approximate Bayesian computation of transcriptional pausing mechanisms

In this project (manuscript in preparation) we inferred mechanisms of transcriptional pausing by treating pause site prediction as a binary classification problem. The known locations of pause sites were experimentally measured by Larson et al. 2014. The SimPol session file (.xml) used to train the model can be downloaded below, as can the posterior distribution (.log, which can be viewed with SimPol or with Tracer). The training set (50 genes) and test set (2403 genes) can be downloaded, or opened in Pauser to visualise and test the two classifiers.
NCBI: NC_000913



Enzyme View .xml Download .log Training set Test set


Jordan Douglas, Richard Kingston, and Alexei Drummond. "Approximate Bayesian computation of transcriptional pausing mechanisms." bioRxiv (2019): 748210. DOI: view




References

Pauser | Centre for Computational Evolution & School of Biological Sciences, University of Auckland | 2019