rose_arch
This built-in application provides a generic solution to configure
site-specific archiving of suite files for use with
rose task-run.
The application is normally configured in a rose-app.conf
. Global
settings may be specified in an rose_arch[arch]
section. Each archiving target will have its own [arch:TARGET]
section for specific settings, where TARGET
would be a URI to
the archiving location on your site-specific archiving system. Settings
in a [arch:TARGET]
section would override those in the global
rose_arch[arch]
section for the given TARGET
.
A target is considered compulsory, i.e. it must have at least one
source, unless it is specified with the syntax [arch:(TARGET)]
.
In which case, TARGET
is considered optional. The application will
skip an optional target that has no actual source.
The application provides some useful functionalities:
- Incremental mode: store the archive target settings, checksums of
source files and the return code of archive command. In a retry, it
would only redo targets that did not succeed in the previous attempts.
- Rename source files.
- Tar-Gzip or Gzip source files before sending them to the archive.
Invocation
In automatic selection mode, this built-in application will be invoked
automatically if a task has a name that starts with rose_arch*
.
This means that you can use Rose Arch with something like the example below
in your flow.cylc
:
[scheduling]
# ...
[[graph]]
P1 = """
all => the => tasks => rose_arch_archive
"""
[runtime]
# ...
[[rose_arch_archive]]
Examples
The following examples all form part of a single rose-app.conf
file:
General Settings
These settings are placed here to be inherited by other archive tasks in the
file: In this case we’ve set command format
which sets how we are going
to copy the files to the archive location.
We’ve also set prefixes for the source and target locations, so that we
don’t have repeatedly specify common locations.
# General settings
[arch]
command-format=foo put %(sources)s %(target)s
source-prefix=$ROSE_DATAC/
target-prefix=foo://hello/
Archive a file to a file
In this simplest use case rose arch is just moving a single file to another
location.
# Archive a file to a file
[arch:world.out]
source=hello/world.out
Archiving directories
You can archive files matched by one or more glob expressions to a directory:
# A single glob
[arch:worlds/]
source=hello/worlds/*
# Three globs
[arch:worlds/]
source=hello/worlds/* greeting/worlds/* hi/worlds/*
Missing files and directories
It’s also possibly to deal with a situation where one or more of the source
expressions might not return anything by putting brackets - ()
- around it:
# If there isn't anything in greeting/worlds/ Rose Arch continues
[arch:worlds/]
source=hello/worlds/* (greeting/worlds/*) hi/worlds/*
You can even tell Rose Arch that there may be nothing to archive, but to carry
on:
[arch:(black-box/)]
source=cats.txt dogs.txt
Zipping files
There are multiple ways of specifying that you want your archive to be
compressed:
You can infer compression from the target extension:
[arch:planet.gz]
source=hello/planet.out
or manually specify a compression program. (In this case the out.gz
is
not recognized by rose arch as an extension to be compressed.)
[arch:planet.out.gz]
compress=gz
source=hello/planet.out
For more details see rose_arch[arch]compress
Zipping directories
You can tar and zip entire directories - as with single files Rose Arch will
attempt to infer archive and compression from [arch:TARGET.extension]
if it
can:
[arch:galaxies.tar.gz]
source-prefix=hello/
source=galaxies/*
# File with multiple galaxies may be large, don't do its checksum
update-check=mtime+size
You might prefer to explicitly gzip each file in the source directory separately:
# Force gzip each source file
[arch:stars/]
source=stars/*
compress=gzip
Renaming files simply
You may wish to change the name of the archived files. By default the contents
of your app’a rose_arch[arch]source
and
$CYLC_TASK_CYCLE_TIME
are available to you as python formatting strings
%(name)s
and %(cycle)s
.
[arch:moons.tar.gz]
source=moons/*
rename-format=%(cycle)s-%(name)s
Warning
As %(name)s
can be a path is may not always make sense to
prepend %(cycle)s
to it - consider 01_/absolute/path/to/datafile
Renaming using a rename-parser
See rose_arch[arch]rename-parser
.
This allows you to parse the the name you give in rose_arch[arch]source
using
regular expressions for use in rename-format
.
This is handy if you set a path to rose_arch[arch]source
but want the target
to just be a name - imagine a case where you wanted to collect a group of files
with names in the form data_001.txt
:
[arch:Target]
source=/some/path/data*.txt
rename-parser=^//some//path//data_(?P<serial_number>[0-9]{3})(?P<name_tail>.*)$
rename-format=hello/%(cycle)s-%(name_head)s%(name_tail)s
Output
On completion, rose_arch
writes a status summary for each
target to the standard output, which looks like this:
0 foo:///fred/my-su173/output0.tar.gz [compress=tar.gz]
+ foo:///fred/my-su173/output1.tar.gz [compress=tar.gz, t(init)=2012-12-02T20:02:20Z, dt(tran)=5s, dt(arch)=10s, ret-code=0]
+ output1/earth.txt (output1/human.txt)
+ output1/venus.txt (output1/woman.txt)
+ output1/mars.txt (output1/man.txt)
= foo:///fred/my-su173/output2.tar.gz [compress=tar.gz]
! foo:///fred/my-su173/output3.tar.gz [compress=tar.gz]
The first column is a status symbol, where:
- 0
- An optional target has no real source, and is skipped.
- +
- A target is added or updated.
- =
- A target is not updated, as it was previously successfully updated with
the same sources.
- !
- Error updating this target.
If the first column and the second column are separated by a space character,
the second column is a target. If the first column and the second column are
separated by a tab character, the second column is a source in the target
above.
For a target line, the third column contains the compress scheme, the
initial time, the duration taken to transform the sources, the duration
taken to run the archive command and the return code of the archive
command. For a source line, the third column contains the original name of
the source.
Configuration
-
Rose App
rose_arch
-
Config
[arch]
(alternate: arch:TARGET)
-
Config
command-format
= FORMAT
-
A Pythonic printf
-style format string to construct the archive
command. It must contain the placeholders %(sources)s
and %(target)s
for substitution of the sources and the target
respectively.
-
Config
compress
= pax|tar|pax.gz|tar.gz|tgz|gz
If specified, compress source files scheme before sending them to the
archive. If not set Rose Arch will attempt to set a compression scheme
if the file extension of the target implies compression: For
example, setting target as [arch:example.tar]
is the same as
setting compress=tar
.
Each compression scheme works slightly differently:
Compression Scheme |
Behaviour |
pax or tar |
Sources will be placed in a TAR archive before
being sent to the target. |
pax.gz ,
tar.gz or
tgz |
Sources will be placed in a TAR-GZIP file
before being sent to the target. |
gz |
Each source file will be compressed by GZIP
before being sent to the target. |
-
Config
rename-format
If specified, the source files will be renamed according to the
specified format. The format string should be a Pythonic
printf
-style format string.
By default the following variables are available:
You may also use rename-parser
to generate further fields
from the input name.
Warning
As %(name)s
can be a path, so that
if rename-format="%(cycle)s_%(name)s"
you can have destination
paths such 02_path/to/some.file
, which are unlikely to work. If
you want to manipulate your source name in such cases
should use rename-parser
.
-
Config
rename-parser
Ignored if rename-format
is not specified.
Specify a regular expression to parse the name provided by source
,
using the Python regex syntax (?P<label>what you want to capture)
For example, a regular expression in the form:
^\/home\/data\/(?P<filename>myfile)(?P<serialnumber>[0-9]{3}).someExtension$
Will label the captured section using with the contents of <>
.
In this example you would then have %(filename)s
and
%(serialnumber)
to use in your rename-format
string.
-
Config
source
= NAME
-
Specify a list of source file names and/or globs
for matching source file names. List items are separated by spaces.
- File names with space or quote characters can be escaped using quotes
or backslashes, like in a shell.)
- Paths, if not absolute (beginning with a
/
), are
assumed to be relative to ROSE_SUITE_DIR
or to
$ROSE_SUITE_DIR/PREFIX
if source-prefix
is specified.
- If a name or glob is given in a pair of brackets,
e.g.``(hello-world.*)``, the source is considered optional and will
not cause a failure if it does not match any source file names.
Warning
If a target does not have ()
around it then is it compulsory
and if no matching source is found then the archiving of that file
will be considered a failure.
-
Config
source-edit-format
= FORMAT
Construct a command to edit or modify the content of source files
before archiving them. It uses a Pythonic printf
-style format
string to describe inputs and outputs.
It must contain the placeholders %(in)s
and %(out)s
for
substitution of the path to the source file and the path to the
modified source file (which will be created in a temporary working
directory).
For example you might wish to replace the word “Hello” with “Greet”
using sed:
source-edit-format=sed 's/Hello/Greet/g' %(in)s >%(out)s
-
Config
source-prefix
= PREFIX
Add a prefix to each value in a source declaration. A trailing
slash should be added for a directory. Paths are assumed to be
relative to ROSE_SUITE_DIR
. This setting serves two
purposes:
- It provides a way to avoid typing the name of the source directory
repeatedly.
- If you are using
rename-format
or if the target is
a compressed file your target’s %(name)s
will be the entirety
of what you set in source
, so you may wish to avoid
this being a full path.
-
Config
target-prefix
= PREFIX
Add a prefix to each target declaration. This setting provides
a way to avoid typing the same thing repeatedly. A trailing
slash (or whatever is relevant for the archiving system) should
be added for a directory.
-
Config
update-check
= mtime+size|md5|sha1|...
Specify the method for checking whether a source has changed
since the previous run. If the value is mtime+size, the
application will use the modified time and size of the source,
which is useful for large files, but is less correct. Otherwise,
the value, if specified, should be the name of a hash object in
Python’s hashlib, such as md5
(default), sha1
, etc.
In this mode, the application will use the checksum (based on
the specified hashing method) of the content of each source file
to determine if it has changed or not.