The DataMover is installed in /admin/RunDatabase/python. It runs on any store node and is managed by corosync/pacemaker.
Unfortunately, parts of the DataMover are strongly coupled to DIRAC, so setting up the environment is not trivial. There is a script called dmpython.sh which lets you launch the DataMover from the command line. You will probably want to set the -f option to run it in foreground mode and see all the debug output. Example:
./dmpython.sh DataMover.py --actions=verify --iterSleep=5 --runStartTime="2011.01.01 00:00:00.00"
This launches the verification sub-process of the DataMover, which recounts all events and calculates the necessary checksums.
For the delete action and most of the BKK options you have to run as root.
Output of
DataMover --help
This script triggers various actions on event files using the Run Database.
It can be safely run in multiple instances.
Usage: DataMover.py
[--actions=<comma separated list of actions>]
[--delay <float>]
[--count <integer>]
[--no <integer>]
[--runType <runType>]
[--partitionName <partName>]
[--partitionID <partID>]
[--runStartTime <YYYY.MM.DD hh:mm:ss.fff>]
[--runEndTime <YYYY.MM.DD hh:mm:ss.fff>]
[--stream <stream name>]
[--dryrun]
[--iterSleep <float>]
Arguments:
actions: What threads should be run:
verify = verify file size, checksum, number of events
move = requests sent to the transfer agent
bkk = bookkeeping requests
check = reverse transitions based on timestamps
write = write random string files and register them into the Run DB
delete = deletes files copied to CASTOR
autodelete = delete files which have been copied to CASTOR and are older than two days
test = test thread. Not interesting
autoclose = try to close open files of ended runs
autoend = automatically clean up open runs of partitions that might not have been closed by the ECS
iterSleep: the delay between each iteration (in seconds)
count: number of iterations
no: the number of files processed in one iteration. Default 1
runType: the type of runs targeted (e.g. CCRC, PHYSICS, etc)
partitionID: the id of the run partition
partitionName: name of the run partition (e.g. LHCb, RICH, etc)
runStartTime: see above for the format. Fields can be omitted from right to left.
Only runs started AFTER 'runStartTime' will be considered.
runEndTime: see above for the format. Fields can be omitted from right to left.
Only runs ended BEFORE 'runEndTime' will be considered.
stream: the name of the event stream
dryrun: only prints the affected runs/files. Nothing is actually changed
Not all options make sense for all actions; options that do not apply are ignored.
Other default values are taken from the RunDatabase_Defines file.
Output options:
-f -> stdout/stderr (foreground mode)
-l <logfile> -> prints to <logfile>
-s -> syslog using the UUCP facility
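For example, to preview which runs of the LHCb partition a verify pass would pick up in a given time window, without changing anything (an illustrative invocation; adjust the filters to your case):
./dmpython.sh DataMover.py --actions=verify --partitionName=LHCb --runStartTime="2011.01.01 00:00:00.00" --dryrun -f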
At the moment the DataMover logs via syslog into /var/log/datamover.error, /var/log/datamover.debug and /var/log/datamover.info.
The DataMover also logs into the run database (table rundbdatamover). For each action (e.g. move, bkk_run, etc.) there exists one entry. You can view these entries with /admin/RunDatabase/python/RunDatabase_LogViewer.py. There is also a shorter command available on the plus nodes: rdblog (for run database log).
Example of usage:
[lhcbprod@store02:~/p] $ ./RunDatabase_LogViewer.py 45099
ID Trys Type Time if last action Message
--------------------------------------------------------------------------------------------------------------------------------------------
045099_0000000003.raw 0 move 2009-03-05 17:17:27 Transfer to CASTOR was successful.
045099_0000000001.raw 0 move 2009-03-05 17:18:37 Transfer to CASTOR was successful.
045099_0000000002.raw 0 move 2009-03-05 17:19:37 Transfer to CASTOR was successful.
045099_0000000004.raw 0 move 2009-03-05 17:36:10 Request to transfer the file to CASTOR was successfully send.
[dsonnick@plus09:~] $ rdblog 45099
ID Trys Type Time if last action Message
--------------------------------------------------------------------------------------------------------------------------------------------
045099_0000000003.raw 0 move 2009-03-05 17:17:27 Transfer to CASTOR was successful.
045099_0000000001.raw 0 move 2009-03-05 17:18:37 Transfer to CASTOR was successful.
045099_0000000002.raw 0 move 2009-03-05 17:19:37 Transfer to CASTOR was successful.
045099_0000000004.raw 0 move 2009-03-05 17:36:10 Request to transfer the file to CASTOR was successfully send.
045099_0000000005.raw 0 move 2009-03-05 17:40:24 Request to transfer the file to CASTOR was successfully send.
045099_0000000006.raw 0 move 2009-03-05 17:43:36 Request to transfer the file to CASTOR was successfully send.
045099_0000000007.raw 0 move 2009-03-05 17:48:56 Request to transfer the file to CASTOR was successfully send.
045099_0000000008.raw 0 move 2009-03-05 17:51:23 Request to transfer the file to CASTOR was successfully send.
045099_0000000009.raw 0 move 2009-03-05 17:58:07 Request to transfer the file to CASTOR was successfully send.
You can provide either a run number or a filename. If you provide the run number, all related files will be shown.
If you do not get any output, the DataMover has not done anything related to this run or file. Possible reasons: the DataMover is not running, the DataMover has an error, the run is not targeted as ONLINE, the run has an error that prevents the DataMover from touching it, or the DataMover is simply busy and you have to wait.
Notice for admins: after a change you have to do svn commit in /admin/RunDatabase/python and svn update in /group/online/rundb/RunDatabase/python. rdbt and rdblog are only links to the scripts in this directory.
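In other words:
cd /admin/RunDatabase/python && svn commit
cd /group/online/rundb/RunDatabase/python && svn update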
/admin/RunDatabase/python/RunDatabase_XmlViewer.py <runid or filename>
Send the output to Zoltan Mathe to check the validity of the XML request.
The following is the list of states that a successfully processed file passes through:
- FILE_OPENED - 1
- FILE_CLOSED - 2
- FILE_VERIFIED - 4
- FILE_DIRAC_SEND - 50
- FILE_DIRAC_REPLIED - 70
- FILE_IN_CASTOR - 60
- FILE_IN_BKK - 30
- FILE_MIGRATED - 11
- FILE_DELETED - 9
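As a quick sketch, here is the same success path in Python (the numeric codes come from the list above; the names and the helper function are illustrative, not the DataMover's actual code):

# Success path of a file, as numeric state codes in the order listed above.
SUCCESS_PATH = [1, 2, 4, 50, 70, 60, 30, 11, 9]

def expected_next_state(current):
    # Return the state expected after 'current' on the success path,
    # or None at the end. Raises ValueError if 'current' is not on it.
    i = SUCCESS_PATH.index(current)
    return SUCCESS_PATH[i + 1] if i + 1 < len(SUCCESS_PATH) else None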
Files are moved to CASTOR as soon as they are verified. If a run is closed and all files in that run have been verified, the run is entered into the BKK. In the new version this happens before all files have been transferred to CASTOR.
There is no automatic retry on transfer requests. See the Tools section below for a set of tools to use in case of problems, if file transfers need to be re-requested.
Data Flow explained
Data is first written by the writer daemons running on the store nodes. When the writer has finished writing (which is a science by itself), it marks the file it has written as closed. As soon as the file is closed, the verify process kicks in.
Verify recounts the number of events inside the file and calculates an MD5 and an Adler-32 checksum for the file before requesting the copy to CASTOR. Since SNFS does not really have a local FS cache, each of these steps would normally read the file from disk again. To speed things up, there is a dedicated ramdisk file system on every store node in /verify, which has space for 3 files. The verify process copies the file there and then runs all the checks on the copy. After it has finished, the file is deleted from the ramdisk again and a request is sent to the DIRAC agent to copy the file to CASTOR. Problems can arise if the DataMover is shut down while a verify job is still running: files that have not been deleted take up space in /verify and new files can no longer be copied in. In that case you have to check the state of these files and then remove them from the temp file system.
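For illustration, here is a minimal Python sketch of the checksum part of the verify step (the helper name is made up; the real code also recounts the events, which requires knowledge of the raw-data format and is omitted here):

import hashlib
import zlib

def checksum_file(path, chunk_size=1024 * 1024):
    # Compute MD5 and Adler-32 in a single pass, reading the file in
    # chunks so that large raw files do not have to fit in memory.
    md5 = hashlib.md5()
    adler = 1  # standard Adler-32 seed value
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            md5.update(chunk)
            adler = zlib.adler32(chunk, adler)
    return md5.hexdigest(), adler & 0xFFFFFFFF  # mask to unsigned 32 bit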
After the file has been verified, a request for the transfer to CASTOR is sent to the DIRAC agent running on lbdirac (store06). Depending on the mood of DIRAC, the file will either be transferred or not.
Typical problems here are:
- The files are in CASTOR, but they have not been migrated to tape yet.
- The files are actually in CASTOR and on tape but DIRAC does not send the confirmation.
- DIRAC refused to copy the file to CASTOR (the file will be stuck in the VERIFIED state; check the DM log file for the DIRAC response)
- DIRAC agents are down and not reachable
In all of the above cases you should contact the Offline team and notify them of the problem.
After the file has been successfully migrated to tape, DIRAC sends the notification back to the database via the XMLRPC server. We then send a request to DIRAC to enter the file into the BKK.
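Schematically, such a notification is an ordinary XML-RPC call. The sketch below is purely illustrative; host, port and method name are assumptions, not the actual interface:

import xmlrpc.client

# Purely illustrative: report a file state change to the Run DB XML-RPC
# server. Host, port and method name are assumptions.
proxy = xmlrpc.client.ServerProxy("http://store02:8080/RPC2")
proxy.setFileState("045099_0000000004.raw", 70)  # 70 = FILE_DIRAC_REPLIED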
Typical problems here are:
- The interface for the BKK has changed.
- DIRAC agents are down.
File Deletion
If there is less than 10 TB of free space left on /daqarea, an automated delete job starts deleting old files until about 40% of the disk space is free again. Unfortunately the criteria for when a file is released for deletion are a bit complicated. TODO: Explain the rules.
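A minimal sketch of the trigger logic just described (the thresholds are taken from the text above; the names are illustrative, not the DataMover's actual code):

import os

DAQAREA = "/daqarea"
LOW_WATER = 10 * 1024**4  # start deleting below 10 TB free
TARGET_FREE = 0.40        # stop once about 40% of the disk is free again

def disk_usage(path):
    # Total and free bytes of the file system containing 'path'.
    st = os.statvfs(path)
    return st.f_blocks * st.f_frsize, st.f_bavail * st.f_frsize

def should_start_deleting():
    _total, free = disk_usage(DAQAREA)
    return free < LOW_WATER

def should_stop_deleting():
    total, free = disk_usage(DAQAREA)
    return free / total >= TARGET_FREE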
Typical problems:
- People have copied data onto the daqarea, which the RUNDB does not know about.
- Files that are marked as deleted have not been deleted.
The first case is relatively easily solved by checking for directories outside the DAQ tree.
The second case is trickier, and unfortunately there is no tool yet to find these kinds of files. For the time being you have to manually check in the DB which files the RUNDB thinks are still on disk and then check with 'find' which files are actually there.
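A sketch of that manual cross-check (get_rundb_disk_files() is a hypothetical stand-in for the actual DB query):

import os

def files_on_disk(root="/daqarea"):
    # Collect every file currently present under the DAQ area.
    found = set()
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            found.add(os.path.join(dirpath, name))
    return found

def report_inconsistencies(rundb_files):
    # rundb_files: the set of paths the RUNDB believes are on disk,
    # e.g. obtained via a hypothetical get_rundb_disk_files() query.
    disk = files_on_disk()
    for path in sorted(rundb_files - disk):
        print("in DB but missing on disk:", path)
    for path in sorted(disk - rundb_files):
        print("on disk but unknown to DB:", path)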
In any case, if we are running out of disk space and you cannot solve the problem immediately, you can manually delete either files that have already been transferred to CASTOR (at least state migrated) or files that were written by the sub-detectors. This should give you some breathing space to investigate further what is going on.
Run states:
| State | Number Representation | Meaning | Set By |
| RUN_ACTIVE | 1 | Run is active, data is being written | Controls via DIMRPC |
| RUN_ENDED | 2 | Run ended | Controls via DIMRPC |
| RUN_CREATED | 5 | Run was created | Controls via DIMRPC |
| RUN_IN_BKK | 6 | Run including its files is in Bookkeeping | DataMover --action=bkkrun |

File states:
| State | Number Representation | Meaning | Set By |
| FILE_OPENED | 1 | File exists and data is being written | writerd via XMLRPC |
| FILE_CLOSED | 2 | File is closed but Run may still be active | writerd via XMLRPC |
| FILE_VERIFIED | 4 | File has been verified | DataMover --action=verify |
| FILE_DELETED | 9 | File was deleted | DataMover --action=autodelete |
| FILE_ERROR | 10 | File is in error, interaction required | DataMover OR Dirac via XMLRPC |
| FILE_MIGRATED | 11 | Replica request for this file was sent | Dirac via XMLRPC |
| FILE_BKK_START | 12 | Bookkeeping request for this file was sent | |
| FILE_IN_BKK | 30 | File was entered into the bookkeeping | |
| FILE_REPLICA_SEND | 40 | Request to insert the file into the bookkeeping was sent | |
| FILE_DIRAC_SEND | 50 | File transfer request was sent | DataMover --action=move |
| FILE_IN_CASTOR | 60 | File was copied to CASTOR | |
| FILE_DIRAC_REPLIED | 70 | DIRAC claims that the file was copied to CASTOR | Dirac via XMLRPC |
| FILE_DIRAC_UNSUCCESSFUL | 100 | Transfer to CASTOR was not successful (definitively) | |
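When reading raw state numbers out of the database it helps to have a lookup table at hand. The mapping below just restates the tables above in Python; the authoritative constants live in the RunDatabase_Defines file:

# Illustrative lookup tables, restating the tables above.
RUN_STATE_NAMES = {
    1: "RUN_ACTIVE",
    2: "RUN_ENDED",
    5: "RUN_CREATED",
    6: "RUN_IN_BKK",
}

FILE_STATE_NAMES = {
    1: "FILE_OPENED",
    2: "FILE_CLOSED",
    4: "FILE_VERIFIED",
    9: "FILE_DELETED",
    10: "FILE_ERROR",
    11: "FILE_MIGRATED",
    12: "FILE_BKK_START",
    30: "FILE_IN_BKK",
    40: "FILE_REPLICA_SEND",
    50: "FILE_DIRAC_SEND",
    60: "FILE_IN_CASTOR",
    70: "FILE_DIRAC_REPLIED",
    100: "FILE_DIRAC_UNSUCCESSFUL",
}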
Over time several tools were developed to fix inconsistencies that arise in the database.
Close File tool
Careful!!! Files are usually not closed when a run ends. This is to accommodate the fact that during a run change, events are still being processed in the farm and might arrive much later. Do not use this for files going to CASTOR/OFFLINE that are not at least a day old. If a file is closed by hand and the writer decides to add more events afterwards, DIRAC will not accept the file anymore.
If the writer is shut down by the run control, however, files are not closed, and there are runs in state 'Ended' but with open files. To close these files manually, do:
./dmpython.sh ./RunDatabase_CloseRunFiles.py -r <runid>
Close Run tool
From time to time the Run Control makes a little mistake and forgets to close a run. This tool will update the state of the run from Running to Closed:
./dmpython.sh ./RunDatabase_CloseRun.py -r <runid>
The following tools should not be used without prior agreement with the Offline/Dirac people
Manual Run Move tool
This tool takes the files of a particular run and (if they are verified) sends the transfer request to DIRAC. This is useful if there was a problem during the transfer and the original transfer request has been lost.
./dmpython.sh DataMover_MoveParticular.py --runID <runid>
Manual Run BKK tool
The following is broken as of 3/11/16
This tool manually sends a bkk request for a run, if all the files of the run have been verified.
./dmpython.sh DataMover_BKKParticular.py --runID <runid>
Logrotation is configured for the DataMover on store02 and store03:
/var/log/datamover.info {
weekly
nocompress
rotate 3
create 0664 lhcbprod z5
}
/var/log/datamover.debug {
weekly
nocompress
rotate 3
create 0664 lhcbprod z5
}
/var/log/datamover.error {
weekly
nocompress
rotate 3
create 0664 lhcbprod z5
}
/var/log/xmlrpc {
weekly
nocompress
rotate 3
create 0664 lhcbprod z5
}
/var/log/dimrpc {
weekly
nocompress
rotate 3
create 0664 lhcbprod z5
}
The corresponding syslog configuration routes the logging facilities to these files:
uucp.* /var/log/storage
local0.info /var/log/datamover.info
local0.debug /var/log/datamover.debug
local0.err /var/log/datamover.err
local2.* /var/log/dimrpc
local1.* /var/log/xmlrpc
- File is not transferred to CASTOR: check if the file state == 2, check if the file exists (the DataMover will try to close open files 4 hours after a run has finished)
- Run is not registered: check if all files have been transferred to CASTOR, check if the run state == 2, check if all parameters of the runs and files exist
-- NikoNeufeld - 22 Oct 2009