Methods and Interpretations

It was previously demonstrated that splicing elements are positional dependent. We exploited this relationship between location and function by comparing positional distributions between all possible 4,096 hexamers. The distance measure used in this study found that point mutations that produced higher distances disrupted splicing, whereas point mutations with smaller distances generally had no effect on splicing. Reasoning the idea that splicing elements have signature positional distributions around constitutively spliced exons, we introduced Spliceman - an online tool that predicts how likely distant mutations around annotated splice sites were to disrupt splicing.

The computational methods and algorithms are explained below:

1. Constructing exon databases

Each exon/intron database was built from RefSeq annotations of the following assemblies stored at the UCSC Table Browser.

  • Human (hg19) - 197,082 exons
Duplicated entries were screened and removed, and each sequence entry was divided into two distinct regions:

  • two 200-nucleotide intronic flanks and
  • two 100-nucleotide exonic flanks
on each side of the splice sites (ss) (see figure below; intronic regions are represented in line and exonic region in box). In the case that intronic or exonic length was less than 400 or 200 nucleotides, respectively, the sequence was divided by half and each half was assigned to its nearest splice site.

2. Generating profiles:

2.1 Why hexamers?
RNA binding proteins typically contain one to four RNA recognition motif domains so that motifs recovered are expected to be of heterogeneous length. It is therefore unlikely that there is a single word size that is appropriate for all motifs presented in the data. Previous implementations of dictionary methods illustrated how a smaller word size choice was generally self-correcting. Our analysis of prior SELEX studies indicated RNA binding proteins recognized motifs between the length of 6 and 10 nucleotides. For these reasons, as well as computation efficiency, we selected hexamers for the analysis presented in this tool.

2.2 Counting hexamers
The algorithm traversed each position of the two following regions: upstream 3'ss and downstream 5'ss as illustrated in the figure above. For each hexamer, the counting algorithm generated two vectors of 300 nucleotides, and each vector contained several pieces of information:

  • 300 positions relative to splice sites,
  • frequencies on each position,
  • raw counts on each position, and
  • the depth of the exon database on each position (mostly due to short exons, we keep track of the depth of the database on each position to generate positional frequencies).
Combining the two vectors - we called it the feature vector - would quantify the signature of a hexamer on positions relative to splice sites. For instance, plotting the positional frequencies of a feature vector for hexamer GCTGGG would produce a frequency plot as shown below:

Repeating this procedure for 4,096 times generated a feature vector for each hexamer. Because overlapping occurrences of internally repetitive words can occur more frequently than complex words, overlapping occurrences of any words were counted as a single occurrence in a window of 11. For example, a run of 11 A's (i.e. AAAAAAAAAAA) was counted as a single occurrence at the position where it was first observed.

3. Computing distance matrix

This tool uses the L1 distance metric to qualify the "closeness" between two feature vectors (i.e. two hexamers). An obvious choice for distance metric is the Euclidance distance; however, the sharp peaks created by the splice site hexamers themselves dominated the comparison and prevented the detection of more subtle signals. This was remedied by using the Manhattan distance, also referred to as the city block distance or simply L1 distance.

The L1 distance metric, as illustrated in the equation below, was calculated as the sum of the absolute value of the differences in normalized counts between the two feature vectors at each of the 600 positions.

where p and q represent the normalized counts of two feature vectors at position i from -200 to 399.

4. Calculating percentile ranks

To allow standardized comparisons among L1 distances, we converted these two variables into percentile ranks. This was archived by binning all L1 distances into 100 intervals (from 1 to 100) and assigning each L1 distance to its corresponding bin (i.e. a comparison between two hexamers that resulted in low L1 distance would be assigned with a low percentile rank). The higher the percentile rank, the more likely the point mutation is to disrupt splicing.