General Information

What it does

The Imperfect SSR Finder is an online tool to help geneticists find Simple Sequence Repeats (SSR), aka microsatellites or Short Tandem Repeats (STR), in uploaded FASTA sequences.

Reason for Existence

A Perl script at ftp://ftp.gramene.org/pub/gramene/software/scripts/ssr.pl can find perfect SSRs, but I haven't found any readily available tool for finding imperfect SSRs, so I wrote this one. It also gives more detailed and more easily readable output.

Browser Requirements

Any browser should work. If you have at least a 1024x768 screen, the three main input sections "fit" using Firefox, but not Internet Explorer.

File Format

This program requires a FASTA-formatted sequence file. Briefly, it should contain one or more FASTA sequences, and look something like this:

>Description Line
cagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagag
acgtcgatactcgagagagcagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagagcagtg
tcgagagagcagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagaggatactcgagagaga
gatactcgagagag

>Another Description Line
acgtcgatactcgagagagcagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagagacgtc
gatactcgagagagcagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagagcagtgtcgag
agagcagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagaggatactcgagagagaga

UPPERCASE/lowercase doesn't matter. Line lengths don't matter. Comments (defined as a semi-colon ';' and all characters following it on that line) are stripped out.
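
As an illustration of the rules above (this is not the program's own code), the following Python sketch reads FASTA text, strips comments, ignores line lengths, and normalizes case. The function name parse_fasta is hypothetical, and applying comment stripping to description lines as well as data lines is an assumption.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    description, chunks = None, []
    for raw in text.splitlines():
        # Assumption: a ';' starts a comment on any line, including '>' lines.
        line = raw.split(";", 1)[0].strip()
        if not line:
            continue  # blank lines and comment-only lines carry no data
        if line.startswith(">"):
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is not None:
            # Case and line length do not matter, so normalize and concatenate.
            chunks.append(line.lower())
    if description is not None:
        records.append((description, "".join(chunks)))
    return records
```

Each sequence comes back as one lowercase string regardless of how the input was wrapped.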

Attribution

This program was inspired by S. Cartinhour's May 2000 perl script available at ftp://ftp.gramene.org/pub/gramene/software/scripts/ssr.pl, but it has been almost entirely rewritten. The only parts left from the original program are the regular expressions and the fact that the program loops through each motif length individually (i.e., all trinucleotides are found in one pass, then all dinucleotides are found in a later pass).

Imad A. Eujayl, Ph.D., USDA-ARS-NWISRL in Kimberly, Idaho, provided the program requirements and made sure that the program's output was useful for scientists.

To cite this work, you may use
Stieneke, D. L., Eujayl, I. A. (2007) Imperfect SSR Finder. http://ssr.nwisrl.ars.usda.gov/. Version 1.0. Kimberly, ID: USDA-ARS-NWISRL.

Requests / Comments / Suggestions

If you have any suggestions to improve the program or its documentation, or you find an error in the program, please send me an email at dan.R.stienekeDon'tCutAndPaste-TypeItInAsYouSeeIt@ars.usda.gov.com.

Step 1 - Define what you're looking for

General Options

Output linewrap width
The number of columns for output. Sequence data will be line-wrapped at this width. A typical value is 70-80. Applies only to the data lines, not the description line.
Output UPPERCASE/lowercase
The alpha case for output, either UPPERCASE (ACGT) or lowercase (acgt). Applies only to the data lines, not the description line.
Invert SSR case
If yes, invert the alpha case of found SSRs, for example actgaCGCGCGCGCGagtca or ACTGAcgcgcgcgcgAGTCA. If no, found SSRs will appear in the same case as the rest of the sequence, as determined by Output UPPERCASE/lowercase.
Description processing
Change the description line in a specified way. Currently available:
  Remove fields > 20 chars: Remove fields with more than 20 characters from the description line (to make them less unwieldy).
  Leave it alone: Do nothing.
If you have a suggestion for description processing that will be widely used, please email me at Dan.Stieneke@ars.usda.gov.
Ignore initial pair count
Do not look for SSRs in the first 'x' base pairs, as they may be less reliable than the rest of the sequence.
Ignore trailing pair count
Do not look for SSRs in the last 'x' base pairs, as they may be less reliable than the rest of the sequence.
Aggressive join
When both sides of an imperfect SSR "want" to join, they always join. When neither side "wants" to join, they never join. This setting determines what to do if only one side "wants" to join. Yes: join. No: don't join. See "Wanting to Join" under Imperfect SSR Minimums for more information about SSRs "wanting" to join.
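
The output options above can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the function names, the (start, end) span representation for found SSRs, and the way the ignore counts are clamped are all hypothetical, not taken from the program.

```python
def searchable_window(length, ignore_initial, ignore_trailing):
    """Return (start, end) of the region searched after ignoring both ends."""
    start = min(ignore_initial, length)
    end = max(start, length - ignore_trailing)
    return start, end


def format_sequence(seq, ssr_spans, width=70, uppercase=False, invert=True):
    """Apply output case, optional SSR case inversion, then line wrapping."""
    chars = list(seq.upper() if uppercase else seq.lower())
    if invert:
        # Flip the case of every base inside a found SSR (end is exclusive).
        for start, end in ssr_spans:
            for i in range(start, end):
                chars[i] = chars[i].swapcase()
    text = "".join(chars)
    return [text[i:i + width] for i in range(0, len(text), width)]
```

With this model, a sequence shorter than the combined ignore counts has an empty search window, which is why short test sequences need both counts set to 0.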

Perfect SSR Minimums

For each type of SSR (dinucleotides, trinucleotides, etc.), you must enter the minimum number of repeats necessary for that SSR to be relevant. For instance, if for a trinucleotide you enter a repeat of 5, trinucleotides repeated only 4 times will be ignored, but those repeated 5 or more times will be reported.
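
A minimal Python sketch of the minimum-repeat rule, using a backreference regular expression, is shown here. It is not the program's actual expression; the function name and the returned tuple layout are assumptions made for illustration.

```python
import re


def find_perfect_ssrs(seq, motif_len, min_repeats):
    """Return (start, motif, repeats) for runs with at least min_repeats copies."""
    # One motif of motif_len bases, followed by min_repeats - 1 or more copies.
    pattern = re.compile(r"([acgt]{%d})\1{%d,}" % (motif_len, min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.lower()):
        repeats = (match.end() - match.start()) // motif_len
        hits.append((match.start(), match.group(1), repeats))
    return hits
```

Calling it with motif_len=3 and min_repeats=5 ignores a trinucleotide repeated 4 times and reports one repeated 5 or more times. Note that this simple pattern also accepts motifs such as "aaa", so a fuller version would need to filter runs that are really repeats of a shorter motif.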

Imperfect SSR Minimums

An imperfect SSR is similar to an SSR, except that it is interrupted by one or more NRRs. To determine if an SSR is relevant, the program must know both the minimum repeat length and the maximum length allowed between any two SSRs.

The table of fields

For each type of imperfect SSR, there is a place to enter 3 information pairs. Each pair is optional. The first column in each pair, Rptx, is the minimum number of repeats necessary for that SSR to be considered relevant (much like the perfect SSR minimums). The second column in each pair, NRRx, is the maximum distance between the SSR and an adjacent SSR. If the distance between an SSR and its neighboring SSR is less than or equal to the number entered in NRRx, the SSR will "want" to join with its neighbor. Otherwise, it will not "want" to join with its neighbor (see "Wanting" to join). For example, if for your purposes a trinucleotide of 4 repeats is relevant provided there are no more than 6 NRR base pairs between it and neighboring strands, but a trinucleotide of only 3 repeats must be within 2 NRR of another strand to be considered relevant, you would enter 4,6 ; 3,2 ; blank,blank in the trinucleotide row. Or, if even a repeat of 2 trinucleotides is relevant, provided it's within one base pair of another SSR, you would enter 4,6 ; 3,2 ; 2,1.
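
One way to picture a row of the table is as a list of (Rptx, NRRx) pairs. The Python sketch below parses the shorthand used in this documentation, such as "4,6 ; 3,2 ; blank,blank", and looks up the largest NRR distance a given repeat count qualifies for. The function names are hypothetical, and the web form itself uses separate input fields rather than a single string.

```python
def parse_pairs(row):
    """Turn '4,6 ; 3,2 ; blank,blank' into [(4, 6), (3, 2)]."""
    pairs = []
    for cell in row.split(";"):
        rpt, nrr = (part.strip() for part in cell.split(","))
        if rpt in ("", "blank") or nrr in ("", "blank"):
            continue  # each pair is optional
        pairs.append((int(rpt), int(nrr)))
    return pairs


def allowed_gap(repeats, pairs):
    """Largest NRRx among pairs whose Rptx the repeat count meets, else None."""
    gaps = [nrr for rpt, nrr in pairs if repeats >= rpt]
    return max(gaps) if gaps else None
```

With the pairs 4,6 ; 3,2 a run of 4 repeats may sit up to 6 base pairs from its neighbor, a run of 3 repeats only up to 2, and a run of 2 repeats does not qualify at all.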

"Wanting" to join

To be considered relevant, each section of the imperfect SSR must pass two tests:
1 - is the SSR long enough?
2 - is the SSR near another SSR?
If the SSR is at least "Rptx" long, and is no more than "NRRx" base pairs away from an adjacent SSR (in either direction), it will be treated as an SSR "wanting" to join. If it is long enough, but too far away from its neighbors, it will be treated as an SSR not "wanting" to join. (If it is not long enough, it is ignored.)
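
The two tests can be sketched in Python as a three-way classification. This is an interpretation rather than the program's code: the function name and the string labels are hypothetical, and it assumes that satisfying any one (Rptx, NRRx) pair is enough to "want" to join.

```python
def classify(repeats, gap, pairs):
    """Return 'ignored', 'wants', or 'does not want' for one repeat run.

    repeats: number of motif copies in the run
    gap:     NRR base pairs between this run and the adjacent SSR
    pairs:   list of (Rptx, NRRx) tuples from the imperfect SSR table
    """
    # Test 1: is the SSR long enough for at least one pair?
    if not any(repeats >= rpt for rpt, _ in pairs):
        return "ignored"
    # Test 2: is it near enough to its neighbor under a pair it is long enough for?
    if any(repeats >= rpt and gap <= nrr for rpt, nrr in pairs):
        return "wants"
    return "does not want"
```

For example, with the pairs (4, 6) and (3, 2), a run of 3 repeats that is 6 base pairs from its neighbor is long enough but too far away, so it does not "want" to join.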

Using aggressive join, adjacent SSRs will be joined even if only one side "wants" to join. Without aggressive join, SSRs will be joined only if both SSRs "want" to join. This is important because SSRs are only reported if they are long enough to be perfect by themselves, or if they are joined with other perfect or imperfect SSRs. Incidentally, perfect SSRs "want" to join because their repeat count is (or at least should always be) greater than the corresponding imperfect SSR repeat count.

For each of the following example sequences, assume settings for trinucleotides of 5 repeats for perfect SSRs, and 4,6 ; 3,2 ; blank,blank for imperfect SSRs. (You can copy these sequences and paste them into a file and then upload the file to test this yourself, but you must set both "Ignore initial pair count" and "Ignore trailing pair count" to 0 (zero), or no sequences will be found.)

>ONE PERFECT ONE IMPERFECT, different 'wants': Example with (gat)5, which is perfect, then 6 pairs of NRR, then (tag)3
atcgctaGATGATGATGATGATccccccTAGTAGTAGtacgcta
The left repeat, (gat)5, is perfect. It also "wants" to join with other SSRs within 6 pairs, because it exceeds the 4 repeats specified in the imperfect SSR setting of ( 4,6 ). The right repeat, (tag)3, is imperfect and does not want to join, because it is more than 2 pairs away from the nearest SSR. With aggressive join on, this will be reported as (gat)5-(tag)3. With aggressive join off, this will be reported as (gat)5, as it meets the requirements for a perfect SSR.
>TWO IMPERFECTS, different 'wants': (gat)4, then 6 pairs of NRR, then (tag)3
atcgctaGATGATGATGATccccccTAGTAGTAGtacgcta
The left repeat, (gat)4, will "want" to join because it is not more than 6 pairs away from the nearest SSR. The right repeat, (tag)3, will not "want" to join, because it is more than 2 pairs away from the nearest SSR. So, with aggressive join on, this will be reported as (gat)4-(tag)3. With aggressive join off, this will not be reported at all.
>TWO IMPERFECTS, both 'want': (gat)4, 2 NRR, (tag)3
atcgctaGATGATGATGATccTAGTAGTAGtacgcta
Since the intervening NRR is at or below the threshold for both sides, both sides want to join, and this will be reported as (gat)4-(tag)3 regardless of the aggressive join setting.
>TWO IMPERFECTS, neither 'want': (gat)4, 7 NRR, (tag)3
atcgctaGATGATGATGATcccccccTAGTAGTAGtacgcta
Since the intervening NRR exceeds the threshold for both sides, neither side "wants" to join, and this will not be reported regardless of the aggressive join setting.
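
The four examples above can be reproduced with a small Python sketch of the join logic. It is a simplified model, not the program's implementation: it handles trinucleotides only, hard-codes the example settings, and every name in it (PERFECT_MIN, PAIRS, find_runs, wants, report) is hypothetical.

```python
import re

PERFECT_MIN = 5            # perfect SSR minimum for trinucleotides
PAIRS = [(4, 6), (3, 2)]   # imperfect SSR (Rptx, NRRx) pairs


def find_runs(seq, motif_len=3):
    """Return (start, end, motif, repeats) for every run that is long enough."""
    shortest = min(rpt for rpt, _ in PAIRS)
    pattern = re.compile(r"([acgt]{%d})\1{%d,}" % (motif_len, shortest - 1))
    return [(m.start(), m.end(), m.group(1), (m.end() - m.start()) // motif_len)
            for m in pattern.finditer(seq.lower())]


def wants(repeats, gap):
    """True if the run satisfies at least one (Rptx, NRRx) pair."""
    return any(repeats >= rpt and gap <= nrr for rpt, nrr in PAIRS)


def report(seq, aggressive):
    """Return reported SSRs such as '(gat)4-(tag)3' for one sequence."""
    reported, group = [], []

    def flush():
        # Report joined groups, or a single run that is perfect by itself.
        if group and (len(group) > 1 or group[0][3] >= PERFECT_MIN):
            reported.append("-".join("(%s)%d" % (r[2], r[3]) for r in group))
        group.clear()

    for run in find_runs(seq):
        if group:
            gap = run[0] - group[-1][1]
            left, right = wants(group[-1][3], gap), wants(run[3], gap)
            if not ((left and right) or (aggressive and (left or right))):
                flush()
        group.append(run)
    flush()
    return reported
```

Passing the first example sequence with aggressive=True yields the joined SSR, while aggressive=False yields only the perfect (gat)5, matching the descriptions above.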

Note that the quote marks used in the descriptions are single quotes ('), not double quotes ("). Double quote marks will confuse Excel when you upload the summary .CSV file.

Once you are finished entering parameters, click the "Go to Step 2" button.

Step 2 - Upload a sequence

The page for step two presents a confirmation of all of your settings and two upload fields. You can bookmark the step 2 page to save your parameters. After confirming that the parameters are correct, upload a FASTA-formatted sequence by either entering it into the text box (typically by cut-and-paste) or by clicking the "Browse..." button to choose a file, then click the "Process!" button. Please note that you cannot both enter data into the text box and upload a file simultaneously.

Please be aware that processing time grows much faster than linearly with the length of each individual sequence, not just with the total file size. A 10MB file consisting of 10,000 1K sequences completes in about a minute. A 10MB file consisting of a single 10MB sequence takes about an hour and a half. If you have large sequences, you may want to consider splitting them into smaller pieces.
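
If you do split a large sequence yourself, a Python sketch along these lines may help. The overlap between pieces is an assumption added here so that an SSR straddling a cut point still appears whole in at least one piece; the tool does not require it, and the function name is hypothetical.

```python
def split_sequence(seq, chunk_size, overlap):
    """Return (offset, piece) tuples covering seq with overlapping pieces."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    pieces = []
    for start in range(0, len(seq), step):
        pieces.append((start, seq[start:start + chunk_size]))
        if start + chunk_size >= len(seq):
            break  # this piece already reaches the end of the sequence
    return pieces
```

The offset of each piece can be written into its description line so that reported SSR positions can be mapped back to the original sequence. SSRs found inside an overlap will be reported twice and need to be de-duplicated by hand.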

Step 3 - View results

For each sequence in which an SSR is reported, a summary showing the sequence count (the first sequence in the file is #1, the 2nd is #2, and so on), description line, and found SSRs is shown on the web page as each sequence completes. Once the whole file has completed processing, three files are made available for download:

Miscellaneous

Trouble Shooting