It was previously demonstrated that splicing elements are positional dependent. We exploited this relationship between location and function by comparing positional distributions between all possible 4,096 hexamers. The distance measure used in this study found that point mutations that produced higher distances disrupted splicing, whereas point mutations with smaller distances generally had no effect on splicing. Reasoning the idea that splicing elements have signature positional distributions around constitutively spliced exons, we introduced Spliceman - an online tool that predicts how likely distant mutations around annotated splice sites were to disrupt splicing.

The computational methods and algorithms are explained below:

Each exon/intron database was built from RefSeq annotations of the following assemblies stored
at the UCSC Table Browser.

- Human (hg19) - 197,082 exons

- two 200-nucleotide intronic flanks and
- two 100-nucleotide exonic flanks

RNA binding proteins typically contain one to four RNA recognition motif domains so that motifs recovered are expected to be of heterogeneous length. It is therefore unlikely that there is a single word size that is appropriate for all motifs presented in the data. Previous implementations of dictionary methods illustrated how a smaller word size choice was generally self-correcting. Our analysis of prior SELEX studies indicated RNA binding proteins recognized motifs between the length of 6 and 10 nucleotides. For these reasons, as well as computation efficiency, we selected hexamers for the analysis presented in this tool.

The algorithm traversed each position of the two following regions: upstream 3'ss and downstream 5'ss as illustrated in the figure above. For each hexamer, the counting algorithm generated two vectors of 300 nucleotides, and each vector contained several pieces of information:

- 300 positions relative to splice sites,
- frequencies on each position,
- raw counts on each position, and
- the depth of the exon database on each position (mostly due to short exons, we keep track of the depth of the database on each position to generate positional frequencies).

This tool uses the L1 distance metric to qualify the "closeness" between two feature vectors (i.e. two hexamers). An obvious choice for distance metric
is the Euclidance distance; however, the sharp peaks created by the splice site hexamers themselves dominated the comparison
and prevented the detection of more subtle signals. This was remedied by using the Manhattan distance, also referred to as the city block distance or
simply L1 distance.

The L1 distance metric, as illustrated in the equation below, was calculated as the sum of the absolute value of the differences in normalized counts between the two feature vectors at each of the 600 positions.

where *p* and *q* represent the normalized counts of two feature vectors at position *i* from -200 to 399.

The L1 distance metric, as illustrated in the equation below, was calculated as the sum of the absolute value of the differences in normalized counts between the two feature vectors at each of the 600 positions.

To allow standardized comparisons among L1 distances,
we converted these two variables into percentile ranks. This was archived by binning all L1 distances into 100
intervals (from 1 to 100) and assigning each L1 distance to its corresponding bin (i.e. a comparison between two hexamers that resulted
in low L1 distance would be assigned with a low percentile rank). The higher the percentile rank, the more likely the
point mutation is to disrupt splicing.