Retries

Rose Advanced Tutorial: Retries

Introduction

This part of the Rose user guide walks you through using cylc retries.

Retries allow a task to be resubmitted automatically after it fails, after a configurable delay, and optionally with different behaviour on each attempt.

Purpose

Retries are useful for tasks that occasionally fail because of external events and that usually succeed when run again. An example is a task that depends on a system prone to temporary outages.

If a task fails, the cylc retry mechanism can resubmit it after a pre-determined delay. The environment variable $CYLC_TASK_TRY_NUMBER is incremented on each attempt and passed to the task, so you can write your task script to change its behaviour accordingly.
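As a rough local analogue of this mechanism (a sketch only, not how cylc itself is implemented), the loop below runs a command up to a given number of times, waits between attempts, and exposes the attempt number through CYLC_TASK_TRY_NUMBER. The function name retry_with_delay and its arguments are hypothetical, introduced purely for illustration.

```shell
# Hypothetical helper, not part of cylc or Rose.
# Usage: retry_with_delay MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
retry_with_delay() {
    max_tries=$1
    delay=$2
    shift 2
    try=1
    while [ "$try" -le "$max_tries" ]; do
        # Pass the attempt number to the command, as cylc does for a task.
        if CYLC_TASK_TRY_NUMBER=$try "$@"; then
            return 0
        fi
        try=$((try + 1))
        # Only wait if another attempt is still to come.
        if [ "$try" -le "$max_tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# A command that fails on its first try and succeeds on its second.
retry_with_delay 3 0 sh -c 'echo "try $CYLC_TASK_TRY_NUMBER"; [ "$CYLC_TASK_TRY_NUMBER" -ge 2 ]'
# prints "try 1" then "try 2"
```

Here MAX_TRIES counts the first attempt as well, so five retries would correspond to a value of 6.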

Example

Our example suite will simulate trying to roll doubles using two dice.

Create a new suite (or just a new directory somewhere, e.g. in your homespace) containing a blank rose-suite.conf file and a suite.rc file with the following contents:

Example suite.rc

[cylc]
    UTC mode = True # Ignore DST
[scheduling]
    [[dependencies]]
        graph = start => roll_doubles => win

Runtime

Next, declare the start and win tasks in the [runtime] section. They need no settings of their own:

[runtime]
    [[start]]
    [[win]]

Task Runtime

Now add the roll_doubles task by appending these lines to the end of your suite.rc file:

    [[roll_doubles]]
        script = """
sleep 10
RANDOM=$$  # Seed $RANDOM
DIE_1=$((RANDOM%6 + 1))
DIE_2=$((RANDOM%6 + 1))
echo "Rolled $DIE_1 and $DIE_2..."
if (($DIE_1 == $DIE_2)); then
    echo "doubles!"
else
    exit 1
fi
        """

Running It without Retries

Let's see what happens when we run the suite as it is.

Make sure you are in the root directory of your suite.

Run the suite using:

rose suite-run

Results

Unless you're lucky, the suite should fail at the roll_doubles task.

To make cylc retry the task a few times, replace the line [[roll_doubles]] in the suite.rc file with:

    [[roll_doubles]]
        [[[job]]]
            execution retry delays = 5*PT6S

This means that if the roll_doubles task fails, cylc will retry it up to 5 times, waiting 6 seconds before each retry, and only then treat the task as failed.

Explanation

The delays in execution retry delays are ISO 8601 durations and need not all be equal. For example, execution retry delays = PT15S, PT10M, PT1H, PT3H performs the first retry after 15 seconds, the second after 10 minutes, the third after an hour, and the fourth after three hours.

We've chosen 6 seconds because it's relatively easy to observe for this example.
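The 5*PT6S setting is shorthand for listing the same delay five times. The sketch below spells that expansion out; expand_delays is a hypothetical helper written for this illustration and is not cylc's own parser.

```shell
# Hypothetical helper, not part of cylc: expand "N*DELAY" into N copies
# of DELAY, separated by commas. A plain delay is returned unchanged.
expand_delays() {
    spec=$1
    case $spec in
        *\**)
            count=${spec%%\**}   # text before the "*"
            delay=${spec#*\*}    # text after the "*"
            ;;
        *)
            count=1
            delay=$spec
            ;;
    esac
    out=""
    i=0
    while [ "$i" -lt "$count" ]; do
        out="${out:+$out, }$delay"
        i=$((i + 1))
    done
    echo "$out"
}

expand_delays "5*PT6S"
# prints "PT6S, PT6S, PT6S, PT6S, PT6S"
```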

Running It with Retries

Stop the running suite and re-run the suite using:

rose suite-run

Results

You should see cylc retrying the roll_doubles task. Hopefully it will succeed and the suite will continue: there is only about a 1 in 3 chance that all six attempts (the first try plus five retries) fail.
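The "1 in 3" figure can be checked with a little arithmetic, assuming fair and independent dice. Each attempt misses doubles with probability 5/6, and the task gets six attempts in total (one initial try plus five retries). The function name all_fail_probability is hypothetical.

```shell
# Probability that every one of the given number of attempts misses doubles.
all_fail_probability() {
    awk -v attempts="$1" 'BEGIN { printf "%.3f\n", (5 / 6) ^ attempts }'
}

all_fail_probability 6
# prints "0.335"
```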

If you go to the suite output (run rose suite-log in your root suite directory), you can see the separate retry instances of the task.

Altering Behaviour

We can alter the behaviour of the task depending on which attempt is running, using $CYLC_TASK_TRY_NUMBER. Replace the script in the roll_doubles task with:

        script = """
sleep 10
RANDOM=$$  # Seed $RANDOM
DIE_1=$((RANDOM%6 + 1))
DIE_2=$((RANDOM%6 + 1))
echo "Rolled $DIE_1 and $DIE_2..."
if (($DIE_1 == $DIE_2)); then
    echo "doubles!"
elif (($CYLC_TASK_TRY_NUMBER >= 2)); then
    echo "look over there! ..."
    echo "doubles!"  # Cheat!
else
    exit 1
fi
        """

Results

If your suite is still running, stop it. Run it again using:

rose suite-run

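This time, the task is guaranteed to succeed by its second attempt (the first retry) at the latest, because $CYLC_TASK_TRY_NUMBER starts at 1 and the cheat applies once it reaches 2.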
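To see why, the decision logic of the task can be exercised outside cylc with fixed dice. The sketch below is a hypothetical stand-in (check_roll is not part of the suite): it takes the two dice and the try number as arguments instead of reading $RANDOM and $CYLC_TASK_TRY_NUMBER, and it uses plain test syntax so it runs in any POSIX shell.

```shell
# Hypothetical stand-in for the task's if/elif/else logic.
# Usage: check_roll DIE_1 DIE_2 TRY_NUMBER
check_roll() {
    die_1=$1
    die_2=$2
    try_number=$3
    if [ "$die_1" -eq "$die_2" ]; then
        echo "doubles!"
    elif [ "$try_number" -ge 2 ]; then
        echo "doubles!"  # Cheat, as in the task script
    else
        return 1
    fi
}

check_roll 3 5 1 || echo "try 1 failed"
check_roll 3 5 2
# prints "try 1 failed" then "doubles!"
```

With the same losing roll, the first try fails and the second succeeds, which matches the behaviour you should observe in the suite.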

Further Reading

For more information, see the cylc User Guide.