**Methods and Interpretations**

It was previously demonstrated that splicing elements are position dependent. We exploited this relationship between location and function by comparing positional distributions across all 4,096 possible hexamers. With the distance measure used in this study, point mutations that produced larger distances disrupted splicing, whereas point mutations that produced smaller distances generally had no effect on splicing. Reasoning that splicing elements have signature positional distributions around constitutively spliced exons, we introduced Spliceman, an online tool that predicts how likely distant mutations around annotated splice sites are to disrupt splicing.

All files used by this tool can be found in the Reference Lists.

The computational methods and algorithms are explained below:

**1. Constructing exon databases**

The exon/intron database for each species was built from RefSeq annotations of the following assemblies, stored at the UCSC Table Browser:

- Human (hg18) - 196,851 exons
- Human (hg19) - 197,082 exons
- Chimp (panTro3) - 5,551 exons
- Rhesus (rheMac2) - 12,249 exons
- Mouse (mm9) - 192,865 exons
- Rat (rn4) - 139,742 exons
- Guinea Pig (cavPro3) - 2,524 exons
- Cat (felCat4) - 1,933 exons
- Dog (canFam2) - 10,858 exons
- Chicken (galGal3) - 45,932 exons
- X. tropicalis (xenTro2) - 72,767 exons
- Zebrafish (danRer7) - 121,369 exons

For each exon, the regions considered were:

- two 200-nucleotide intronic flanks and
- two 100-nucleotide exonic flanks

**2. Generating profiles**

*2.1 Why hexamers?* RNA-binding proteins typically contain one to four RNA recognition motif domains, so the motifs recovered are expected to be of heterogeneous length. It is therefore unlikely that a single word size is appropriate for all motifs present in the data. Previous implementations of dictionary methods illustrated that choosing a smaller word size is generally self-correcting. Our analysis of prior SELEX studies indicated that RNA-binding proteins recognize motifs between 6 and 10 nucleotides in length. For these reasons, as well as for computational efficiency, we selected hexamers for the analysis presented in this tool.

*2.2 Counting hexamers* The algorithm traversed each position of the following two regions: the region around the upstream 3'ss and the region around the downstream 5'ss, as illustrated in the figure above. For each hexamer, the counting algorithm generated two vectors of 300 positions each, and each vector recorded the following information:

- the 300 positions relative to the splice site,
- the frequency at each position,
- the raw count at each position, and
- the depth of the exon database at each position (because short exons do not cover every position, we track the depth at each position in order to compute positional frequencies).
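The counting step described above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the function names `count_hexamers` and `positional_frequencies` are hypothetical, windows are assumed to be plain strings in which unavailable bases (for example, past the end of a short exon) are marked with `N`, and hexamers are indexed by their start position, so a window of length W yields W - 5 positions.

```python
from collections import defaultdict

def count_hexamers(windows, window_len=300, k=6):
    """Count k-mers by start position across aligned windows.

    Assumption: every window is aligned to the same splice site, and
    bases that are unavailable are written as 'N'.
    Returns (counts, depth), where counts[word][i] is the raw count of
    `word` starting at position i and depth[i] is the number of windows
    that contributed a valid k-mer at position i.
    """
    n_pos = window_len - k + 1
    counts = defaultdict(lambda: [0] * n_pos)
    depth = [0] * n_pos
    valid = set("ACGT")
    for seq in windows:
        seq = seq.upper()
        usable = min(len(seq), window_len)
        for i in range(usable - k + 1):
            word = seq[i:i + k]
            if set(word) <= valid:  # skip k-mers touching padding
                counts[word][i] += 1
                depth[i] += 1
    return counts, depth

def positional_frequencies(raw_counts, depth):
    """Normalize raw counts by the per-position depth."""
    return [c / d if d else 0.0 for c, d in zip(raw_counts, depth)]

# Toy usage with two short windows (real windows would be 300 nt long).
counts, depth = count_hexamers(["ACGTACGT", "ACGTACNN"], window_len=8)
freqs = positional_frequencies(counts["ACGTAC"], depth)
```

Dividing by the per-position depth rather than by the total number of exons keeps positions that few exons reach from being systematically underestimated.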

**3. Computing the distance matrix**

This tool uses the L1 distance metric to quantify the "closeness" between two feature vectors (i.e. two hexamers). An obvious choice of distance metric is the Euclidean distance; however, the sharp peaks created by the splice site hexamers themselves dominated the comparison and prevented the detection of more subtle signals. This was remedied by using the Manhattan distance, also referred to as the city block distance or simply the L1 distance.

The L1 distance metric, as given in the equation below, was calculated as the sum of the absolute values of the differences in normalized counts between the two feature vectors at each of the 600 positions:

$$
L_1(p, q) = \sum_{i=-200}^{399} \left| p_i - q_i \right|
$$

where *p* and *q* represent the normalized counts of the two feature vectors at position *i*, from -200 to 399.
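To make the contrast between the two metrics concrete, here is a small sketch of the L1 distance alongside the Euclidean distance. The function names and the toy vectors are illustrative assumptions, not part of the tool.

```python
import math

def l1_distance(p, q):
    """Sum of absolute differences between two equal-length vectors."""
    if len(p) != len(q):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean_distance(p, q):
    """Square root of the sum of squared differences."""
    if len(p) != len(q):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Toy 600-position vectors (hypothetical values, for illustration only).
baseline = [0.0] * 600
single_peak = [0.5] + [0.0] * 599      # one sharp difference
diffuse = [0.01] * 100 + [0.0] * 500   # many small differences

peak_l1 = l1_distance(baseline, single_peak)
diffuse_l1 = l1_distance(baseline, diffuse)
peak_l2 = euclidean_distance(baseline, single_peak)
diffuse_l2 = euclidean_distance(baseline, diffuse)
```

In this toy case the single peak gives 0.5 under both metrics, while the diffuse difference gives about 1.0 under L1 but only about 0.1 under the Euclidean distance. Squaring the differences lets one sharp peak outweigh many small ones, which is the behavior the L1 metric avoids.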

**5. Calculating percentile ranks**

To allow standardized comparisons among L1 distances, we converted them into percentile ranks. This was achieved by binning all L1 distances into 100 intervals (from 1 to 100) and assigning each L1 distance to its corresponding bin (i.e. a comparison between two hexamers that resulted in a low L1 distance would be assigned a low percentile rank). The higher the percentile rank, the more likely the point mutation is to disrupt splicing.
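One way to sketch the conversion to percentile ranks is shown below. The text does not specify whether the 100 bins are equal-width or rank-based, so this example assumes rank-based bins (each bin holds roughly 1% of all observed distances); the helper name `make_percentile_ranker` is hypothetical.

```python
import bisect
import math

def make_percentile_ranker(all_distances):
    """Build a function mapping an L1 distance to a rank from 1 to 100.

    Assumption: the rank is the percentage of observed distances that
    are less than or equal to the query, rounded up and floored at 1.
    """
    ordered = sorted(all_distances)
    n = len(ordered)
    if n == 0:
        raise ValueError("at least one distance is required")

    def rank(distance):
        at_or_below = bisect.bisect_right(ordered, distance)
        return max(1, math.ceil(100 * at_or_below / n))

    return rank

# Toy usage: 200 evenly spaced distances stand in for the full set of
# pairwise hexamer comparisons.
rank = make_percentile_ranker(range(1, 201))
low, mid, high = rank(1), rank(100), rank(200)
```

Sorting once and using binary search keeps each lookup logarithmic in the number of precomputed distances.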