What is a repeat
A serial repeat (tandem repeat, periodicity) is a string repeated contiguously. For example, ataata is a repeat (ata repeated twice), gggg is another (g repeated four times), and acaca is a repeat too (ac repeated two and a half times). As the last example shows, the number of repeated copies need not be an integer; it can be fractional.
Period and exponent
The length of the repeated string is called the period of the repeat, and the number of repeated copies (equivalently, the ratio of the whole length to the period) is called the exponent. Thus, acaaca has period 3 and exponent 2, while acaca has period 2 and exponent 2.5.
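The two definitions can be checked with a few lines of Python. This is only an illustrative sketch, not part of mreps: the function names are hypothetical, and it assumes that "the period" means the smallest period of the string.

```python
def smallest_period(s):
    """Return the smallest p such that s[i] == s[i + p] for every valid i."""
    n = len(s)
    for p in range(1, n + 1):
        if all(s[i] == s[i + p] for i in range(n - p)):
            return p
    return n  # only reached for the empty string


def exponent(s):
    """Ratio of the whole length to the (smallest) period."""
    return len(s) / smallest_period(s)


for word in ("ataata", "gggg", "acaca", "acaaca"):
    print(word, smallest_period(word), exponent(word))
```

For the four example strings above, this prints periods 3, 1, 2 and 3, with exponents 2.0, 4.0, 2.5 and 2.0.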
What is a maximal repeat
Consider the string ctattatatg. One of the repeats it contains is tata. However, it is natural to include the nucleotide t that occurs immediately to its right, and to speak of the repeat tatat instead. This is the idea of maximal repeats: repeats occurring in a given string that are extended to the left and to the right as far as possible without "breaking the periodicity".
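The extension step can be sketched as follows. This is a minimal illustration rather than the mreps implementation; the function name and the 0-based, end-exclusive coordinates are assumptions made for the example, and the starting fragment is assumed to be at least one period long.

```python
def extend_repeat(s, start, end, period):
    """Extend s[start:end] left and right while the period is preserved.

    Coordinates are 0-based and end-exclusive (an assumption of this sketch).
    """
    # Extend to the left: the new letter must equal the letter one period later.
    while start > 0 and s[start - 1] == s[start - 1 + period]:
        start -= 1
    # Extend to the right: the new letter must equal the letter one period earlier.
    while end < len(s) and s[end] == s[end - period]:
        end += 1
    return start, end


s = "ctattatatg"
start, end = extend_repeat(s, 4, 8, 2)  # s[4:8] is "tata", period 2
print(s[start:end])
```

Starting from tata, the sketch stops at tatat: the t on the right fits the period, while the letters on either side of tatat do not.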
mreps and maximal repeats
If the resolution is not specified, mreps finds all exact maximal repeats in the subject string (according to the algorithm described in [2]). For example, running
mreps -s gcctatttatttatttggt
yields the following maximal repeat contained in gcctatttatttatttggt:
 from   ->       to  :   size   <per.>  [exp.]   err-rate   sequence
 ---------------------------------------------------------------------------------------------
    4   ->       16  :     13   <4>     [3.25]   0.000      tatt tatt tatt t
 ---------------------------------------------------------------------------------------------
RESULTS: There is 1 repeat in the processed sequence
Formally, there are other maximal repeats in this sequence (ttt, for example), but mreps does not output them because they are too short and are considered to "occur by chance". However, mreps has a special -allowsmall option that prevents it from filtering out short, statistically non-significant repeats.
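To see which maximal repeats exist before any filtering, a brute-force enumeration is enough for short strings. The sketch below is a naive scan written for illustration; it is not the algorithm of [2] and not how mreps works internally. The function names are hypothetical, and positions are reported 1-based and inclusive to match the from/to columns above.

```python
def has_smaller_period(w, p):
    """True if the string w has some period q < p."""
    return any(
        all(w[i] == w[i + q] for i in range(len(w) - q))
        for q in range(1, p)
    )


def maximal_repeats(s):
    """Naively list exact maximal repeats as (from, to, period), 1-based inclusive."""
    n = len(s)
    found = []
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if s[i] != s[i + p]:
                i += 1
                continue
            start = i
            # Follow the maximal stretch where letters at distance p agree.
            while i + p < n and s[i] == s[i + p]:
                i += 1
            end = i + p  # end-exclusive, 0-based
            # Keep it only if it spans at least two periods and p is minimal.
            if end - start >= 2 * p and not has_smaller_period(s[start:end], p):
                found.append((start + 1, end, p))
    return sorted(found)


for repeat in maximal_repeats("gcctatttatttatttggt"):
    print(repeat)
```

On the example sequence, the list contains the repeat at positions 4 to 16 with period 4, together with short ones such as ttt at positions 6 to 8, which are the kind of repeats the default length filter removes.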
Approximate repeats
mreps is able to find not only exact repeats, but also approximate repeats, a necessary feature for DNA analysis. To make mreps compute approximate repeats, the user has to specify the resolution parameter.
To compute approximate repeats, mreps computes so-called runs of k-mismatch tandem repeats, according to the algorithm described in [3]. Those are repeats that allow at most k mismatches between two adjacent periods. These structures are then used as "raw material" and are further processed in order to obtain a more biologically relevant representation of the repeats, in particular to eliminate redundancy and to filter out repeats that are not statistically significant.
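One way to read "at most k mismatches between two adjacent periods" is that every window of two consecutive periods has halves differing in at most k positions. The check below is a sketch under that assumption; the function names are hypothetical, it only tests a given string rather than searching for repeats as the algorithm of [3] does, and it assumes nothing about how k relates to the resolution value.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_k_mismatch_run(s, period, k):
    """True if every length-2*period window of s has halves within k mismatches."""
    if len(s) < 2 * period:
        return False
    return all(
        hamming(s[i:i + period], s[i + period:i + 2 * period]) <= k
        for i in range(len(s) - 2 * period + 1)
    )


print(is_k_mismatch_run("taagtttgagtttaag", 6, 1))  # approximate repeat, one mismatch allowed
print(is_k_mismatch_run("taagtttgagtttaag", 6, 0))  # the same string is not an exact repeat
```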
Here is an example of mreps output:
> mreps -res 1 -s gggctaagtttgagtttaagaa
 from   ->       to  :   size   <per.>  [exp.]   err-rate   sequence
 ---------------------------------------------------------------------------------------------
    5   ->       20  :     16   <6>     [2.67]   0.100      taagtt tgagtt taag
 ---------------------------------------------------------------------------------------------
RESULTS: There is 1 repeat in the processed sequence
Resolution
The resolution parameter (-res in the example above) specifies the "degree of fuzziness" of repeats that mreps is able to find. The larger the resolution is, the more degenerate repeats can be computed. When the resolution is increased, new degenerate repeats can be computed and at the same time, repeats that are computed for smaller resolution values are still found, but may be extended to longer ones. Actually, the program loops over all resolution values up to the one specified by the user, and collects all repeats found at each iteration. Of course, the "price to pay" is the execution time that gets longer with the increase of the resolution.
In practice, the "good" resolution value depends on the periods the user is interested in. For small periods (up to 10-15), resolution 5 is usually sufficient to find all meaningful repeats; for bigger periods the resolution can be increased and can go up to 50 or so.
Error rate
The "measure of quality" of repeats used by mreps is the error rate: the ratio of the number of mismatches to the overall length of the repeat minus the period. A mismatch is counted when two letters at distance p (the period) are distinct, with the exception that if the letter p positions to the right is the same as the one p positions to the left, these two mismatches are counted as one. For example, the above-mentioned repeat taagtt tgagtt taag is considered to have only one error, although there are two mismatched pairs: g at position 8 mismatches with both a at position 2 and a at position 14. This is why its error rate is 1/(16-6)=0.1.
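The counting rule can be written out as a short sketch. It follows the description above literally and is not taken from the mreps source; the function name is hypothetical, and the behaviour on longer chains of mismatches is an assumption of this example.

```python
def error_rate(repeat, period):
    """Mismatches divided by (length - period), merging 'sandwiched' mismatches."""
    n = len(repeat)
    if n <= period:
        raise ValueError("repeat must be longer than its period")
    # mismatch[i] is True when the letters at positions i and i + period differ.
    mismatch = [repeat[i] != repeat[i + period] for i in range(n - period)]
    errors = sum(mismatch)
    # A letter that differs from both neighbours at distance period, while those
    # two neighbours agree with each other, is counted as one error, not two.
    for i in range(period, n - period):
        if mismatch[i - period] and mismatch[i] and repeat[i - period] == repeat[i + period]:
            errors -= 1
    return errors / (n - period)


print(error_rate("taagtttgagtttaag", 6))
```

For taagtt tgagtt taag with period 6, the two mismatched pairs share the letter g and are merged into one error, which gives 1/10.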
Statistical significance
By default, mreps does not output repeats that are considered to be "non-significant", i.e. those which have a high probability of appearing in a "random DNA sequence". There are two reasons for a repeat to be "non-significant": one is its small length, and the other is its "bad quality", i.e. a high error rate. This is why mreps has two different significance filters: a length filter suppressing repeats of small length, and a quality filter suppressing repeats with a high error rate. However, in some situations the user might want to have short repeats in the output too. For this purpose, there is the -allowsmall option, which switches off the length filter (the quality filter is still applied, however).
Explanation of mreps options
- Format: '-fasta' option allows DNA sequences in FASTA format
- Resolution: the "degree of fuzziness" of repeats to report
- Allowsmall: outputs repeats of all lengths, including small ones
- Minimal size: minimal size of repeats to report
- Maximal size: maximal size of repeats to report
- Minimal period: minimal period of repeats to report
- Maximal period: maximal period of repeats to report
- Exponent: minimal exponent of repeats to report
- From: start position of the fragment to process
- To: end position of the fragment to process
- Window: size of sliding window
TRY IT!
If you have more questions, learning by example is the best solution. Try it!