The Imperfect SSR Finder is an online tool to help geneticists find Simple Sequence Repeats (SSR), aka microsatellites or Short Tandem Repeats (STR), in uploaded FASTA sequences.
A perl script at ftp://ftp.gramene.org/pub/gramene/software/scripts/ssr.pl is able to find perfect SSRs, but I haven't found any readily available tool for finding imperfect SSRs, so I wrote this tool to do so. It also gives more detailed and more easily readable output.
Any browser should work. If you have at least a 1024x768 screen, the three main input sections "fit" using Firefox, but not Internet Explorer.
This program requires a FASTA-formatted sequence file. Briefly, it should contain one or more FASTA sequences, and look something like this:
>Description Line
cagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagag
acgtcgatactcgagagagcagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagagcagtg
tcgagagagcagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagaggatactcgagagaga
gatactcgagagag
>Another Description Line
acgtcgatactcgagagagcagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagagacgtc
gatactcgagagagcagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagagcagtgtcgag
agagcagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagaggatactcgagagagaga
UPPERCASE/lowercase doesn't matter. Line lengths don't matter. Comments (defined as a semi-colon ';' and all characters following it on that line) are stripped out.
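The input rules above can be illustrated with a minimal Python sketch. This is not the tool's own code: the function name `parse_fasta` is hypothetical, and folding everything to lowercase and stripping comments from description lines as well as sequence lines are assumptions made for the illustration.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    description = None
    chunks = []
    for raw_line in text.splitlines():
        # A comment is a semicolon and everything after it on that line.
        line = raw_line.split(";", 1)[0].strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            # Line lengths don't matter: pieces are simply concatenated.
            # Case doesn't matter: fold to lowercase (an assumption).
            chunks.append(line.lower())
    if description is not None:
        records.append((description, "".join(chunks)))
    return records
```

For example, `parse_fasta(">Seq one\nACGT ; a comment\nacgt\n")` yields a single record whose sequence is the two lines joined together, with the comment removed.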
This program was inspired by S. Cartinhour's May 2000 perl script available at ftp://ftp.gramene.org/pub/gramene/software/scripts/ssr.pl, but it has been almost entirely rewritten. The only parts left from the original program are the regular expressions and the fact that the program loops through each motif length individually (i.e., all trinucleotides are found in one pass, then all dinucleotides are found in a later pass).
Imad A. Eujayl, Ph.D., USDA-ARS-NWISRL in Kimberly, Idaho, provided the program requirements and made sure that the program's output was useful for scientists.
To cite this work, you may use:
Stieneke, D. L., Eujayl, I. A. (2007) Imperfect SSR Finder. http://ssr.nwisrl.ars.usda.gov/. Version 1.0. Kimberly, ID: USDA-ARS-NWISRL.
ACTGAcgcgcgcgcgAGTCA. If no, found SSRs will appear in the same case as the rest of the sequence, as determined by Output UPPERCASE/lowercase.
|Remove fields > 20 chars||Remove fields with more than 20 characters from the description line (to make them less unwieldy).|
|Leave it alone||Do nothing.|
For each type of SSR (dinucleotides, trinucleotides, etc.), you must enter the minimum number of repeats necessary for that SSR to be relevant. For instance, if for a trinucleotide you enter a repeat of 5, trinucleotides repeated only 4 times will be ignored, but those repeated 5 or more times will be reported.
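As a rough illustration of the minimum-repeat rule, the following Python sketch finds perfect repeats of one motif length using a backreference regular expression. The function name `find_perfect_ssrs` and the exact pattern are assumptions for this example, not the expressions used by the tool, and the sketch makes no attempt to filter out single-base runs such as "aaaaaa".

```python
import re

def find_perfect_ssrs(sequence, motif_length, min_repeats):
    """Return (motif, repeat_count, start, end) for each perfect repeat run."""
    # Group 1 captures one motif; the backreference requires the same motif
    # to follow immediately at least (min_repeats - 1) more times.
    pattern = re.compile(
        r"([acgt]{%d})\1{%d,}" % (motif_length, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.lower()):
        repeat_count = len(match.group(0)) // motif_length
        hits.append((match.group(1), repeat_count, match.start(), match.end()))
    return hits
```

With a minimum of 5 for trinucleotides, a run of five GAT motifs is reported while a run of only four is ignored, matching the rule described above.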
An imperfect SSR is similar to an SSR, except that it is interrupted by one or more NRRs. To determine if an SSR is relevant, the program must know both the minimum repeat length and the maximum length allowed between any two SSRs.
4,6 ; 3,2 ; blank,blank in the trinucleotide row. Or, if even a repeat of 2 trinucleotides is relevant, provided it's within one base pair of another SSR, you would enter
4,6 ; 3,2 ; 2,1.
To be considered relevant, each section of the imperfect SSR must pass two tests:
1 - is the SSR long enough?
2 - is the SSR near another SSR?
If the SSR is at least "Rptx" long, and is no more than "NRRx" base pairs away from an adjacent SSR (in either direction), it will be treated as an SSR "wanting" to join. If it is long enough, but too far away from its neighbors, it will be treated as an SSR not "wanting" to join. (If it is not long enough, it is ignored.)
Using aggressive join, adjacent SSRs will be joined even if only one side "wants" to join. Without aggressive join, SSRs will be joined only if both SSRs "want" to join. This is important because SSRs are only reported if they are long enough to be perfect by themselves, or if they are joined with other perfect or imperfect SSRs. Incidentally, perfect SSRs "want" to join because their repeat count is (or at least should always be) greater than the corresponding imperfect SSR's repeat count.
For each of the following example sequences, assume settings for trinucleotides of 5 repeats for perfect SSRs, and
4,6 ; 3,2 ; blank,blank for imperfect SSRs. (You can copy these sequences, paste them into a file, and upload the file to test this yourself, but you must set both "Ignore initial pair count" and "Ignore trailing pair count" to 0 (zero), or no sequences will be found.)
>ONE PERFECT ONE IMPERFECT, different 'wants': Example with gat(5), which is perfect, then 6 pairs of NRR, then (tag)3
atcgctaGATGATGATGATGATccccccTAGTAGTAGtacgcta
The left repeat, (gat)5, is perfect. It also "wants" to join with other SSRs within 6 pairs, because it exceeds the 4 repeats specified in the imperfect SSR setting of ( 4,6 ). The right repeat, (tag)3, is imperfect and does not want to join, because it is more than 2 pairs away from the nearest SSR. With aggressive join on, this will be reported as (gat)5-(tag)3. With aggressive join off, this will be reported as (gat)5, as it meets the requirements for a perfect SSR.
>TWO IMPERFECTS, different 'wants': (gat)4, then 6 pairs of NRR, then (tag)3
atcgctaGATGATGATGATccccccTAGTAGTAGtacgcta
The left repeat, (gat)4, will "want" to join because it is not more than 6 pairs away from the nearest SSR. The right repeat, (tag)3, will not "want" to join, because it is more than 2 pairs away from the nearest SSR. So, with aggressive join on, this will be reported as (gat)4-(tag)3. With aggressive join off, this will not be reported at all.
>TWO IMPERFECTS, both 'want': (gat)4, 2 NRR, (tag)3
atcgctaGATGATGATGATccTAGTAGTAGtacgcta
Since the intervening NRR is at or below the threshold for both sides, both sides want to join, and this will be reported as (gat)4-(tag)3 regardless of the aggressive join setting.
>TWO IMPERFECTS, neither 'want': (gat)4, 7 NRR, tag(3)
atcgctaGATGATGATGATcccccccTAGTAGTAGtacgcta
Since the intervening NRR exceeds the threshold for both sides, neither side "wants" to join, and this will not be reported regardless of the aggressive join setting.
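The four examples above can be reproduced with a small Python sketch of the "want to join" logic. It is one possible reading of the rules, not the tool's actual implementation: the function names are hypothetical, the run finder is a simple regular expression, and "wanting" is evaluated separately for each gap between two neighbouring repeats (an assumption). The `tiers` argument holds the non-blank Rpt,NRR pairs, so `4,6 ; 3,2` becomes `[(4, 6), (3, 2)]`.

```python
import re

def find_runs(sequence, motif_length, min_repeats):
    """Return (motif, repeats, start, end) for each tandem run, left to right."""
    pattern = re.compile(r"([acgt]{%d})\1{%d,}" % (motif_length, min_repeats - 1))
    return [(m.group(1), len(m.group(0)) // motif_length, m.start(), m.end())
            for m in pattern.finditer(sequence.lower())]

def allowed_gap(repeats, tiers):
    """Largest NRR among the (Rpt, NRR) tiers this run is long enough for."""
    gaps = [nrr for rpt, nrr in tiers if repeats >= rpt]
    return max(gaps) if gaps else None

def report_ssrs(sequence, perfect_min, tiers, aggressive, motif_length=3):
    """Return reported SSRs as strings such as '(gat)4-(tag)3'."""
    shortest = min(rpt for rpt, _ in tiers)
    runs = find_runs(sequence, motif_length, min(shortest, perfect_min))
    groups = []
    for run in runs:
        if groups:
            prev = groups[-1][-1]
            gap = run[2] - prev[3]  # length of the NRR between the two runs
            left_gap = allowed_gap(prev[1], tiers)
            right_gap = allowed_gap(run[1], tiers)
            left_wants = left_gap is not None and gap <= left_gap
            right_wants = right_gap is not None and gap <= right_gap
            if aggressive:
                joined = left_wants or right_wants
            else:
                joined = left_wants and right_wants
            if joined:
                groups[-1].append(run)
                continue
        groups.append([run])
    reported = []
    for group in groups:
        # Report joined groups, or single runs long enough to be perfect.
        if len(group) > 1 or group[0][1] >= perfect_min:
            reported.append("-".join("(%s)%d" % (m, n) for m, n, _, _ in group))
    return reported
```

Calling `report_ssrs(seq, 5, [(4, 6), (3, 2)], aggressive=True)` on the first example sequence gives the joined result, while `aggressive=False` gives only the perfect (gat)5.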
Note that the quote marks used in the descriptions are single quotes ('), not double quotes ("). Double quote marks will confuse Excel when you open the summary .CSV file.
Once you are finished entering parameters, click the "Go to Step 2" button.
The page for step two presents a confirmation of all of your settings and two upload fields. You can bookmark the step 2 page to save your parameters. After confirming that the parameters are correct, upload a FASTA-formatted sequence by either entering it into the text box (typically by cut-and-paste) or by clicking the "Browse..." button to choose a file, then click the "Process!" button. Please note that you cannot both enter data into the text box and upload a file at the same time.
Please be aware that processing time grows much faster than linearly with the length of each individual sequence, not just with the total file size. A 10MB file consisting of 10,000 1K sequences completes in about a minute. A 10MB file consisting of a single 10MB sequence takes about an hour and a half. If you have large sequences, you may want to consider splitting them into smaller pieces.
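One way to split a long sequence before uploading is sketched below in Python. The function name `split_sequence`, the default piece size, and the use of overlapping pieces are all assumptions made for illustration. The overlap is there so that an SSR falling on a cut point still appears whole in one piece; the trade-off is that an SSR inside the overlap may be reported twice.

```python
def split_sequence(description, sequence, chunk_size=1000, overlap=100):
    """Yield (description, piece) pairs covering the sequence in overlapping windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not sequence:
        return
    step = chunk_size - overlap
    start = 0
    while True:
        piece = sequence[start:start + chunk_size]
        # Record the 1-based start position so hits can be mapped back.
        yield ("%s part starting at %d" % (description, start + 1), piece)
        if start + chunk_size >= len(sequence):
            break
        start += step
```

Each yielded pair can then be written out as a description line beginning with `>` followed by the piece, producing a multi-sequence FASTA file.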
For each sequence in which an SSR is reported, a summary showing the sequence count (the first sequence in the file is #1, the 2nd is #2, and so on), description line, and found SSRs is shown on the web page as each sequence completes. Once the whole file has completed processing, three files are made available for download: