What is a repeat

A serial repeat (tandem repeat, periodicity) is a string repeated contiguously. For example, ataata is a repeat (ata repeated twice). gggg is another repeat (g repeated four times). acaca is a repeat too (ac repeated two and a half times). As the last example shows, the number of repeated copies need not be an integer; it can be fractional.

Period and exponent

The length of the repeated string is called the period of the repeat, and the number of repeated copies (equivalently, the ratio of the whole length to the period) is called the exponent. Thus, acaaca has period 3 and exponent 2, while acaca has period 2 and exponent 2.5.
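As an illustration of these two definitions, here is a minimal Python sketch. The function names minimal_period and exponent are made up for this example and are not part of mreps; the sketch assumes that "the period" means the smallest period of the string, which agrees with the examples above.

```python
from fractions import Fraction

def minimal_period(s: str) -> int:
    """Smallest p >= 1 such that s[i] == s[i + p] for every valid i."""
    if not s:
        raise ValueError("the string must be non-empty")
    n = len(s)
    for p in range(1, n):
        if all(s[i] == s[i + p] for i in range(n - p)):
            return p
    return n  # no shorter period: the string is its own period

def exponent(s: str) -> Fraction:
    """Ratio of the whole length to the (smallest) period."""
    return Fraction(len(s), minimal_period(s))

for word in ("ataata", "gggg", "acaca", "acaaca"):
    print(word, minimal_period(word), float(exponent(word)))
```

For the four strings used in the text this reports periods 3, 1, 2 and 3, and exponents 2, 4, 2.5 and 2.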

What is a maximal repeat

Consider the string ctattatatg. One of the repeats it contains is tata. However, it is natural to include the nucleotide t that occurs immediately to its right, and to speak of the repeat tatat instead. This is the idea of maximal repeats: these are repeats occurring in a given string that are extended to the right and to the left as far as possible without "breaking the periodicity".
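The extension step can be sketched in a few lines of Python. This is only an illustration of the definition, not the way mreps works internally; extend_repeat is a hypothetical helper, and positions are 0-based with the end excluded.

```python
from typing import Tuple

def extend_repeat(s: str, start: int, end: int, period: int) -> Tuple[int, int]:
    """Extend s[start:end] (0-based, end exclusive) while keeping the period."""
    # Grow to the right while the next letter equals the one `period` before it.
    while end < len(s) and s[end] == s[end - period]:
        end += 1
    # Grow to the left while the previous letter equals the one `period` after it.
    while start > 0 and s[start - 1] == s[start - 1 + period]:
        start -= 1
    return start, end

s = "ctattatatg"
start, end = extend_repeat(s, 4, 8, 2)  # s[4:8] is "tata", period 2
print(s[start:end])  # tatat
```

The repeat grows by one letter to the right and cannot grow to the left, because the t before it does not match the a two positions later.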

mreps and maximal repeats

If the resolution is not specified, mreps finds all exact maximal repeats in the subject string (according to the algorithm described in [2]). For example, running

mreps -s gcctatttatttatttggt

yields the following maximal repeat contained in gcctatttatttatttggt:

   from   ->       to  :         size    <per.>  [exp.]          err-rate       sequence
       4  ->        16 :         13      <4>     [3.25]          0.000          tatt tatt tatt t
RESULTS: There is 1 repeat in the processed sequence

Formally, there are other maximal repeats in this sequence (ttt, for example), but mreps does not output them because they are too short and are considered to "occur by chance". The -allowsmall option prevents mreps from filtering out such short, statistically non-significant repeats.
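To see which maximal repeats exist "formally" in a short sequence, a brute-force enumeration is enough. The sketch below is not the algorithm of [2]; it is a quadratic illustration under two assumptions: a repeat must have exponent at least 2, and its period is the smallest one. The names maximal_repeats and smallest_period are hypothetical.

```python
def smallest_period(s):
    """Smallest p >= 1 such that s[i] == s[i + p] for every valid i."""
    n = len(s)
    for p in range(1, n):
        if all(s[i] == s[i + p] for i in range(n - p)):
            return p
    return n

def maximal_repeats(s):
    """Return (first, last, period) triples, 1-based inclusive, exponent >= 2."""
    n = len(s)
    found = []
    for p in range(1, n // 2 + 1):
        i = 0
        while i < n - p:
            if s[i] != s[i + p]:
                i += 1
                continue
            # Follow the maximal stretch of positions j with s[j] == s[j + p].
            j = i
            while j < n - p and s[j] == s[j + p]:
                j += 1
            # s[i:j + p] has period p and cannot be extended in either direction.
            if j - i >= p and smallest_period(s[i:j + p]) == p:
                found.append((i + 1, j + p, p))
            i = j
    return found

for first, last, period in maximal_repeats("gcctatttatttatttggt"):
    print(first, last, period)
```

On the example sequence the list contains the repeat from 4 to 16 with period 4 that mreps reports, together with short period-1 repeats such as ttt at positions 6 to 8, which the length filter removes.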

Approximate repeats

mreps is able to find not only exact repeats, but also approximate repeats, a necessary feature for DNA analysis. To make mreps compute approximate repeats, the user has to specify the resolution parameter.

To compute approximate repeats, mreps computes so-called runs of k-mismatch tandem repeats, according to the algorithm described in [3]. These are repeats that allow at most k mismatches between two adjacent periods. These structures are then used as "raw material" and are further processed to obtain a more biologically relevant representation of the repeats, in particular to eliminate redundancy and to filter out repeats that are not statistically significant.
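The condition "at most k mismatches between two adjacent periods" can be checked directly for a given string and period. The following Python sketch is one possible reading of that condition (every two adjacent windows of one period length differ in at most k positions); it is a checker for illustration only, not the search algorithm of [3], and is_k_mismatch_run is a hypothetical name.

```python
def is_k_mismatch_run(s: str, period: int, k: int) -> bool:
    """True if every pair of adjacent length-`period` windows differs in <= k letters."""
    n = len(s)
    if period < 1 or n < 2 * period:
        return False  # fewer than two full copies
    # mismatch[j] is 1 when the letters at j and j + period differ.
    mismatch = [int(s[j] != s[j + period]) for j in range(n - period)]
    # Slide a window of `period` positions over the mismatch indicators.
    window = sum(mismatch[:period])
    if window > k:
        return False
    for j in range(period, len(mismatch)):
        window += mismatch[j] - mismatch[j - period]
        if window > k:
            return False
    return True

print(is_k_mismatch_run("taagtttgagtttaag", 6, 1))  # True
print(is_k_mismatch_run("taagtttgagtttaag", 6, 0))  # False
```

The string taagtttgagtttaag passes with k = 1 but not with k = 0, since the g in the second copy disagrees with the a in the neighbouring copies.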

Here is an example of mreps output:

> mreps -res 1 -s gggctaagtttgagtttaagaa

   from   ->       to  :         size    <per.>  [exp.]          err-rate       sequence
       5  ->        20 :         16      <6>     [2.67]          0.100          taagtt tgagtt taag
RESULTS: There is 1 repeat in the processed sequence


The resolution parameter (-res in the example above) specifies the "degree of fuzziness" of the repeats that mreps is able to find. The larger the resolution, the more degenerate the repeats that can be found. When the resolution is increased, new degenerate repeats can be found; repeats found at smaller resolution values are still found, but may be extended to longer ones. In fact, the program loops over all resolution values up to the one specified by the user, and collects all repeats found at each iteration. Of course, the "price to pay" is an execution time that grows with the resolution.

In practice, the "good" resolution value depends on the periods the user is interested in. For small periods (up to 10-15), resolution 5 is usually sufficient to find all meaningful repeats; for bigger periods the resolution can be increased and can go up to 50 or so.

Error rate

The "measure of quality" of repeats used by mreps is the error rate, which is the ratio of the number of mismatches to the overall length of the repeat minus the period. A mismatch is counted when two letters located at distance p (the period) from each other are distinct, with the exception that if the letter p positions to the right is the same as the one p positions to the left, then these two mismatches are counted as one. For example, the above-mentioned repeat taagtt tgagtt taag is considered to have only one error, although there are two mismatched pairs: g at position 8 mismatches with both a at position 2 and a at position 14. This is why its error rate is 1/(16-6)=0.1.
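The counting rule can be written down as a short Python sketch. It follows the description in this section rather than the mreps source code, so corner cases may be handled differently by the program itself; error_rate is a hypothetical helper, and positions are 0-based.

```python
def error_rate(s: str, period: int) -> float:
    """Number of errors divided by (length - period), per the rule described above."""
    n = len(s)
    if not 0 < period < n:
        raise ValueError("period must be between 1 and len(s) - 1")
    errors = 0
    merged = set()  # left ends of mismatched pairs already folded into an earlier error
    for i in range(n - period):
        if i in merged or s[i] == s[i + period]:
            continue
        errors += 1
        k = i + 2 * period
        # The letter at i + period differs from both neighbours at distance
        # `period`, and those neighbours agree: count the two mismatches once.
        if k < n and s[i] == s[k]:
            merged.add(i + period)
    return errors / (n - period)

print(error_rate("taagtttgagtttaag", 6))  # 0.1
print(error_rate("tatttatttattt", 4))     # 0.0
```

For the repeat taagtt tgagtt taag the two mismatched pairs around the g are merged into one error, which gives 1/(16-6) = 0.1, the value shown in the err-rate column.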

Statistical significance

By default, mreps does not output repeats that are considered to be "non-significant", i.e. those which have a high probability of appearing in a "random DNA sequence". There are two reasons for a repeat to be "non-significant". One is its small length, and the other is its "bad quality", i.e. a high error rate. This is why mreps has two different significance filters: a length filter suppressing repeats of small length, and a quality filter suppressing repeats with a high error rate. However, in some situations the user might want to have short repeats in the output too. For this purpose, there is the -allowsmall option, which switches off the length filter (the quality filter is still applied, however).

Explanation of mreps options


If you have more questions, learning by example is the best approach. Try it!