USDA-ARS-NWISRL - Imperfect SSR Finder : Detailed instructions

General Information

What it does

The Imperfect SSR Finder is an online tool to help geneticists find Simple Sequence Repeats (SSR), aka microsatellites or Short Tandem Repeats (STR), in uploaded FASTA sequences.

Reason for Existence

A Perl script at ftp://ftp.gramene.org/pub/gramene/software/scripts/ssr.pl is able to find perfect SSRs, but I haven't found any readily available tool for finding imperfect SSRs, so I wrote this tool to do so. It also gives more detailed and more readable output.

Browser Requirements

Internet Explorer doesn't render the front page correctly, but it still functions.

File Format

This program requires a FASTA-formatted sequence file. Briefly, it should contain one or more FASTA sequences, and look something like this:

>Description Line
cagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagag
acgtcgatactcgagagagcagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagagcagtg
tcgagagagcagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagaggatactcgagagaga
gatactcgagagag

>Another Description Line
acgtcgatactcgagagagcagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagagacgtc
gatactcgagagagcagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagagcagtgtcgag
agagcagtgacgtcgatactcgagagagcagtgacgtcgatactcgagagaggatactcgagagagaga

UPPERCASE/lowercase doesn't matter. Line lengths don't matter. Comments (defined as a semi-colon ';' and all characters following it on that line) are stripped out.
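
The rules above can be sketched in Python as follows. This is only an illustration, not the tool's actual code: the function name parse_fasta is made up for the example, and it assumes that comment stripping also applies to description lines and that any data appearing before the first '>' is ignored.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    description = None
    chunks = []
    for raw_line in text.splitlines():
        # A comment is a semicolon and everything after it on that line.
        line = raw_line.split(";", 1)[0].strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            # Case and line length do not matter: normalise and concatenate.
            chunks.append(line.lower())
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


sample = ">Seq one ; a comment\nACGT\nacgt ; trailing\n\n>Seq two\nGGCC\n"
print(parse_fasta(sample))
# [('Seq one', 'acgtacgt'), ('Seq two', 'ggcc')]
```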

Attribution

This program was inspired by S. Cartinhour's May 2000 perl script available at ftp://ftp.gramene.org/pub/gramene/software/scripts/ssr.pl, but it has been almost entirely rewritten. The only parts left from the original program are the regular expressions and the fact that the program loops through each motif length individually (i.e., all trinucleotides are found in one pass, then all dinucleotides are found in a later pass).

Imad A. Eujayl, PhD., USDA-ARS-NWISRL in Kimberly, Idaho, provided the program requirements and made sure that the program's output was useful for scientists.

To cite this work, you may use
Stieneke, D. L., Eujayl, I. A. (2007) Imperfect SSR Finder. http://ssr.nwisrl.ars.usda.gov/. Version 1.0. Kimberly, ID: USDA-ARS-NWISRL.

Requests / Comments / Suggestions

If you have any suggestions to improve the program or its documentation, or you find an error in the program, please send me an email at dan.stieneke@usda.gov.

Step 1 - Define what you're looking for

General Options

Output linewrap width

The number of columns for output. Data lines are wrapped at this width. Typical values are 70-80. Applies only to the data lines, not the description line.

Output UPPERCASE/lowercase

The alpha case for output, either UPPERCASE (ACGT) or lowercase (acgt). Applies only to the data lines, not the description line.

Invert SSR case

If yes, invert the alpha case of found SSRs, for example actgaCGCGCGCGCGagtca or ACTGAcgcgcgcgcgAGTCA. If no, found SSRs will appear in the same case as the rest of the sequence, as determined by Output UPPERCASE/lowercase.
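
The three output options above (linewrap width, UPPERCASE/lowercase, and Invert SSR case) could combine roughly as in the Python sketch below. It is an illustration rather than the tool's own code: format_sequence and its parameter names are hypothetical, and SSR positions are assumed to be 1-indexed and inclusive.

```python
def format_sequence(seq, ssr_spans, width=70, uppercase=False, invert_ssr_case=True):
    """Return the data lines for one sequence.

    ssr_spans is a list of (start, end) positions, 1-indexed and inclusive.
    """
    base = seq.upper() if uppercase else seq.lower()
    chars = list(base)
    if invert_ssr_case:
        # Collect positions in a set so overlapping SSRs are inverted only once.
        inside = {i for start, end in ssr_spans for i in range(start - 1, end)}
        for i in inside:
            chars[i] = chars[i].swapcase()
    joined = "".join(chars)
    # Wrap the data lines at the requested width.
    return [joined[i:i + width] for i in range(0, len(joined), width)]


print(format_sequence("actgacgcgcgcgcgagtca", [(6, 15)], width=80))
# ['actgaCGCGCGCGCGagtca']
print(format_sequence("actgacgcgcgcgcgagtca", [(6, 15)], width=80, uppercase=True))
# ['ACTGAcgcgcgcgcgAGTCA']
```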

Description processing

Change the description line in a specified way. Currently available:

Remove fields > 20 chars : Remove fields with more than 20 characters from the description line (to make them less unwieldy).
Leave it alone : Do nothing.

If you have a suggestion for description processing that will be widely used, please email me at dan.stieneke@usda.gov.

Ignore initial pair count

Do not look for SSRs in the first 'x' base pairs, as they may be less reliable than the rest of the sequence.

Ignore trailing pair count

Do not look for SSRs in the last 'x' base pairs, as they may be less reliable than the rest of the sequence.

Aggressive join

When both sides of an imperfect SSR "want" to join, they always join. When neither side "wants" to join, they never join. This setting determines what to do if only one side "wants" to join. Yes: join. No: don't join. See "Wanting to Join" under Imperfect SSR Minimums for more information about SSRs "wanting" to join.

Perfect SSR Minimums

For each type of SSR (dinucleotides, trinucleotides, etc.), you must enter the minimum number of repeats necessary for that SSR to be relevant. For instance, if for a trinucleotide you enter a repeat of 5, trinucleotides repeated only 4 times will be ignored, but those repeated 5 or more times will be reported.
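
As a rough illustration of how such minimums can be applied, the Python sketch below searches one motif length at a time with a back-referencing regular expression. The regular expression is my own guess at the general shape, not the one used by the tool, and the function name, the min_repeats dictionary, and the way the "Ignore initial/trailing pair count" options are applied (by simply trimming the search region) are all assumptions.

```python
import re


def find_perfect_ssrs(seq, min_repeats, ignore_initial=0, ignore_trailing=0):
    """Return (motif, repeats, start, end) tuples; positions are 1-indexed.

    min_repeats maps motif length to minimum repeat count, e.g. {3: 5}.
    """
    seq = seq.lower()
    region_end = len(seq) - ignore_trailing
    region = seq[ignore_initial:region_end] if region_end > ignore_initial else ""
    hits = []
    # One pass per motif length, longest motifs first.
    for motif_len in sorted(min_repeats, reverse=True):
        minimum = min_repeats[motif_len]
        # A motif followed by at least (minimum - 1) further copies of itself.
        pattern = re.compile(r"([acgt]{%d})\1{%d,}" % (motif_len, minimum - 1))
        for m in pattern.finditer(region):
            repeats = len(m.group(0)) // motif_len
            start = m.start() + ignore_initial + 1
            end = m.end() + ignore_initial
            hits.append((m.group(1), repeats, start, end))
    return hits


print(find_perfect_ssrs("atcgctaGATGATGATGATGATccccccTAGTAGTAGtacgcta", {3: 5}))
# [('gat', 5, 8, 22)]
```

With a minimum of 5, the (tag)3 run in the same sequence is ignored. The sketch does not filter out motifs that are themselves repeats of a shorter unit (such as "aaa").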

Imperfect SSR Minimums

An imperfect SSR is similar to a perfect SSR, except that it is interrupted by one or more NRRs (the stretches of non-repeating bases between adjacent repeats). To determine whether an imperfect SSR is relevant, the program must know both the minimum repeat count of each section and the maximum NRR length allowed between adjacent sections.

The table of fields

For each type of imperfect SSR, there is a place to enter 3 information pairs. Each pair is optional. The first column in each pair, Rpt_x, is the minimum number of repeats necessary for that SSR to be considered relevant (much like the perfect SSR minimums). The second column in each pair, NRR_x, is the maximum distance between the SSR and an adjacent SSR. If the distance between an SSR and its neighboring SSR is less than or equal to the number entered in NRR_x, the SSR will "want" to join with its neighbor. Otherwise, it will not "want" to join with its neighbor (see "Wanting" to join).

For example, if for your purposes a trinucleotide of 4 repeats is relevant provided there are no more than 6 NRR base pairs between it and a neighboring SSR, but a trinucleotide of only 3 repeats must be within 2 NRR base pairs of another SSR to be considered relevant, you would enter 4,6 ; 3,2 ; blank,blank in the trinucleotide row. Or, if even a repeat of 2 trinucleotides is relevant, provided it's within one base pair of another SSR, you would enter 4,6 ; 3,2 ; 2,1.

"Wanting" to join

To be considered relevant, each section of the imperfect SSR must pass two tests:
1 - is the SSR long enough?
2 - is the SSR near another SSR?
If the SSR is at least "Rpt_x" long, and is no more than "NRR_x" base pairs away from an adjacent SSR (in either direction), it will be treated as an SSR "wanting" to join. If it is long enough, but too far away from its neighbors, it will be treated as an SSR not "wanting" to join. (If it is not long enough, it is ignored.)

Using aggressive join, adjacent SSRs will be joined even if only one side "wants" to join. Without aggressive join, SSRs will be joined only if both SSRs "want" to join. This is important because SSRs are only reported if they are long enough to be perfect by themselves, or if they are joined with other perfect or imperfect SSRs. Incidentally, perfect SSRs "want" to join because their repeat count is (or at least should always be) greater than that of the corresponding imperfect SSR's repeat count.

For each of the following example sequences, assume settings for trinucleotides of 5 repeats for perfect SSRs, and 4,6 ; 3,2 ; blank,blank for imperfect SSRs. (You can copy these sequences and paste them into a file and then upload the file to test this yourself, but you must set both "Ignore initial pair count" and "Ignore trailing pair count" to 0 (zero), or no SSRs will be found.)

>ONE PERFECT ONE IMPERFECT, different 'wants': Example with (gat)5, which is perfect, then 6 pairs of NRR, then (tag)3
atcgctaGATGATGATGATGATccccccTAGTAGTAGtacgcta

The left repeat, (gat)5, is perfect. It also "wants" to join with other SSRs within 6 pairs, because it exceeds the 4 repeats specified in the imperfect SSR setting of ( 4,6 ). The right repeat, (tag)3, is imperfect and does not "want" to join, because it is more than 2 pairs away from the nearest SSR. With aggressive join on, this will be reported as (gat)5-(tag)3. With aggressive join off, this will be reported as (gat)5, as it meets the requirements for a perfect SSR.

>TWO IMPERFECTS, different 'wants': (gat)4, then 6 pairs of NRR, then (tag)3
atcgctaGATGATGATGATccccccTAGTAGTAGtacgcta

The left repeat, (gat)4, will "want" to join because it is not more than 6 pairs away from the nearest SSR. The right repeat, (tag)3, will not "want" to join, because it is more than 2 pairs away from the nearest SSR. So, with aggressive join on, this will be reported as (gat)4-(tag)3. With aggressive join off, this will not be reported at all.

>TWO IMPERFECTS, both 'want': (gat)4, 2 NRR, (tag)3
atcgctaGATGATGATGATccTAGTAGTAGtacgcta

Since the intervening NRR is at or below the threshold for both sides, both sides want to join, and this will be reported as (gat)4-(tag)3 regardless of the aggressive join setting.

>TWO IMPERFECTS, neither 'want': (gat)4, 7 NRR, (tag)3
atcgctaGATGATGATGATcccccccTAGTAGTAGtacgcta

Since the intervening NRR exceeds the threshold for both sides, neither side "wants" to join, and this will not be reported regardless of the aggressive join setting.
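
The four examples above can be reproduced with a short Python sketch. It is a simplified model for trinucleotides only, using the example settings (5 repeats for perfect SSRs and 4,6 ; 3,2 for imperfect SSRs). The function names and the regular expression are assumptions made for illustration and are not taken from the tool itself.

```python
import re

PERFECT_MIN = 5                     # perfect SSR minimum for trinucleotides
IMPERFECT_PAIRS = [(4, 6), (3, 2)]  # (Rpt_x, NRR_x) pairs; third pair left blank


def wants_to_join(repeats, gap):
    """True if some (Rpt_x, NRR_x) pair accepts this repeat count and gap."""
    return any(repeats >= rpt and gap <= nrr for rpt, nrr in IMPERFECT_PAIRS)


def find_sections(seq):
    """Find trinucleotide runs with at least the smallest Rpt_x repeats."""
    smallest = min(rpt for rpt, _ in IMPERFECT_PAIRS)
    pattern = re.compile(r"([acgt]{3})\1{%d,}" % (smallest - 1))
    return [(m.group(1), len(m.group(0)) // 3, m.start(), m.end())
            for m in pattern.finditer(seq.lower())]


def report(seq, aggressive):
    """Return the compound motifs that would be reported for seq."""
    groups = []
    for section in find_sections(seq):
        if groups:
            prev = groups[-1][-1]
            gap = section[2] - prev[3]          # NRR length between the two runs
            left = wants_to_join(prev[1], gap)
            right = wants_to_join(section[1], gap)
            if (left and right) or (aggressive and (left or right)):
                groups[-1].append(section)
                continue
        groups.append([section])
    # Report joined groups, and single runs long enough to be perfect.
    return ["-".join("(%s)%d" % (motif, n) for motif, n, _, _ in group)
            for group in groups
            if len(group) > 1 or group[0][1] >= PERFECT_MIN]


example = "atcgctaGATGATGATGATGATccccccTAGTAGTAGtacgcta"
print(report(example, aggressive=True))   # ['(gat)5-(tag)3']
print(report(example, aggressive=False))  # ['(gat)5']
```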

Note that the quote marks used in the description lines above are single quotes ('), not double quotes ("). Double quote marks will confuse spreadsheets when you import the summary .CSV file.
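
If your own description lines already contain double quotes, one option is to replace them before uploading. A minimal Python sketch, in which the function name is hypothetical and only lines starting with '>' are changed:

```python
def sanitize_descriptions(fasta_text):
    """Replace double quotes with single quotes in FASTA description lines."""
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            line = line.replace('"', "'")
        cleaned.append(line)
    return "\n".join(cleaned)


print(sanitize_descriptions('>clone "A7" root tissue\nACGTACGT'))
# >clone 'A7' root tissue
# ACGTACGT
```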

Once you are finished entering parameters, click the "Go to Step 2" button.

Step 2 - Upload a sequence

The page for step two presents a confirmation of all of your settings and two upload fields. You can bookmark the step 2 page to save your parameters. After confirming that the parameters are correct, upload a FASTA-formatted sequence by either entering it into the text box (typically by cut-and-paste) or by clicking the "Browse..." button to choose a file, then click the "Process!" button. Please note that you cannot both enter data into the text box and upload a file simultaneously.

Please be aware that processing time depends far more on the length of each individual sequence than on the total file size, and grows faster than linearly with sequence length. A 10MB file consisting of 10,000 1K sequences completes in about a minute. A 10MB file consisting of a single 10MB sequence takes about an hour and a half. If you have large sequences, you may want to consider splitting them into smaller pieces.
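
One way to split a long sequence is sketched below in Python. The chunk size, the overlap, and the "offset=" tag appended to the description are arbitrary choices made for this example, not features of the tool. The overlap is there so that an SSR lying across a cut point still appears whole in one piece. Note that an SSR inside an overlap may then be reported twice, and reported positions will be relative to the piece rather than to the original sequence.

```python
def split_sequence(description, seq, chunk_size=100_000, overlap=1_000):
    """Split seq into overlapping pieces; return (description, piece) tuples."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    pieces = []
    start = 0
    while True:
        piece = seq[start:start + chunk_size]
        # Record the 0-based offset so positions can be mapped back later.
        pieces.append(("%s offset=%d" % (description, start), piece))
        if start + chunk_size >= len(seq):
            break
        start += step
    return pieces


for desc, piece in split_sequence("chr1", "abcdefghij", chunk_size=4, overlap=1):
    print(">" + desc)
    print(piece)
```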

Step 3 - View results

For each sequence in which an SSR is reported, a summary showing the sequence count (the first sequence in the file is #1, the 2nd is #2, and so on), description line, and found SSRs is shown on the web page as each sequence completes. Once the whole file has completed processing, four files are made available for download:

sequence.all.* : Available as both a text file and an HTML file. It contains all sequences uploaded, in the order in which they were uploaded. It has been changed from the original in the following ways:
1. Output width has been set to "Output linewrap width", as specified in Step 1.
2. Alpha case has been set to UPPER or lower, as specified in Step 1.
3. Found SSRs are optionally case-inverted, as specified in Step 1.
4. The description line was optionally altered per "Description processing", as specified in Step 1.
5. All comments (defined as a semi-colon ';' and all characters following it on that line) are stripped out.
6. In the HTML file, found SSRs are shown in yellow background.
sequence.hits.* : Available as both a text and an HTML file. It contains all sequences in which SSRs were found. It has been modified in the same way as sequence.all.*.
sequence.miss.* : Available as both a text and an HTML file. It contains all sequences in which SSRs were not found. It has been modified in the same way as sequence.all.*.
sequence.csv : A "comma-separated value" file (suitable for spreadsheet import) containing a summary of all found SSRs. The fields are as follows:
1. Title : The description line, optionally altered per "Description processing", as specified in Step 1.
2. Type : Either "p" (perfect); "i,s" (imperfect, symmetrical: all motifs are identical); or "i,a" (imperfect, asymmetrical).
3. Motif : For perfect SSRs, the found motif. For imperfect SSRs, this is left blank.
4. Motif Length : For perfect SSRs, the length of the found motif. For imperfect SSRs, this is left blank.
5. # Repeats : For perfect SSRs, the number of repeats of the single motif. For imperfect SSRs, the total number of repeats of all motifs combined.
6. Compound Motif : The SSR in (motif)count-(motif)count-(motif)count... format.
7. SSR Number : Indicates that the listed SSR is the n^th SSR found in that sequence.
8. Start : Numeric position of the beginning of the found SSR in its containing sequence (1-indexed).
9. End : Numeric position of the end of the found SSR in its containing sequence (1-indexed).
10. Sequence Length : Total length of the containing sequence, in bases.
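
To make fields 2 through 6 concrete, the Python sketch below derives them from a list of (motif, repeat count) sections. The function name and the dictionary layout are hypothetical, and the sketch assumes that a single section always means a perfect SSR.

```python
def summarise(sections):
    """Build the Type, Motif, Motif Length, # Repeats and Compound Motif fields.

    sections is a list of (motif, repeat_count) tuples in sequence order.
    """
    compound = "-".join("(%s)%d" % (motif, n) for motif, n in sections)
    total = sum(n for _, n in sections)
    if len(sections) == 1:
        motif = sections[0][0]
        return {"Type": "p", "Motif": motif, "Motif Length": len(motif),
                "# Repeats": total, "Compound Motif": compound}
    # Imperfect: symmetrical if every section uses the same motif.
    distinct = {motif for motif, _ in sections}
    ssr_type = "i,s" if len(distinct) == 1 else "i,a"
    return {"Type": ssr_type, "Motif": "", "Motif Length": "",
            "# Repeats": total, "Compound Motif": compound}


print(summarise([("gat", 4), ("tag", 3)]))
# {'Type': 'i,a', 'Motif': '', 'Motif Length': '', '# Repeats': 7, 'Compound Motif': '(gat)4-(tag)3'}
```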

Miscellaneous

You can bookmark the "Step 2" page to save your favorite parameters.
You can remove all parameters from "Imperfect SSR Minimums" to look for only perfect SSRs.
You can remove all parameters from "Perfect SSR Minimums" to look for only imperfect SSRs.
Comments (the semicolon ';' and everything after it on that line) are stripped.
Processing time grows faster than linearly with the length of individual sequences.

Troubleshooting

If SSRs from many sequences are listed as having been found in one sequence, check the file format. Usually this is caused by a lack of "beginning-of-record" markers ('>').
Double-quote marks (") in the description line can cause unpredictable results when trying to import the .CSV file into a spreadsheet. If you run into this, you don't need to re-process your sequences - you can simply edit the .CSV file in a text editor such as Notepad and replace the offending double-quotes (") with single-quotes (').
Be aware that the program reports some nucleotides as belonging to multiple SSRs. For a simple example, consider the strand TATATATATAATAATA: it is reported as (ta)5-(ata)3. The (ta)5 occupies positions 1-10 and the (ata)3 occupies positions 8-16, so the three bases at positions 8-10 (ATA) are counted in both. Searching for long motifs increases the likelihood of this occurring. It is up to the researcher to determine which, if any, of the SSRs are important.
KNOWN BUGS (this is, after all, an imperfect SSR finder):
- As of 2008-07-28, all known bugs have been fixed. If you find a bug, please send me an email at dan.R.stienekeDon'tSpamMe@usda.gov.com

U.S. DEPARTMENT OF AGRICULTURE

Northwest Irrigation and Soils Research: Kimberly, ID