You are here: TWiki> Online Web>AdminGuide>DataMover (2016-11-03, neufeld)

DataMover

Installed in /admin/RunDatabase/python. Runs on any store node and is managed by corosync/pacemaker.

Unfortunately, parts of the DataMover are tightly coupled to DIRAC, so setting up the environment is not trivial. There is a script called dmpython.sh which sets up this environment and allows you to launch the DataMover from the command line. You will probably want to set the -f option to run it in foreground mode and see all the debug output. Example:

./dmpython.sh DataMover.py --actions=verify --iterSleep=5 --runStartTime="2011.01.01 00:00:00.00"

This will launch the verification sub-process of the datamover, which recounts all events and calculates all necessary checksums.

For the delete and most of the BKK options you will have to run as root.

Options and Arguments

Output of DataMover --help

This script triggers various actions on event files using the Run Database.
It can be safely run in multiple instances.
Usage: DataMover.py
          [--actions=<comma separated list of actions>]
          [--delay <float>]
          [--count <integer>]
          [--no <integer>]
          [--runType <runType>]
          [--partitionName <partName>]
          [--partitionID <partID>]
          [--runStartTime <YYYY.MM.DD hh:mm:ss.fff>]
          [--runEndTime <YYYY.MM.DD hh:mm:ss.fff>]
          [--stream <stream name>]
          [--dryrun ]
          [--iterSleep ] 
Arguments:
    actions: What threads should be run:
             verify     = verify file size, checksum, No of events
             move       = requests sent to the transfer agent
             bkk        = bookkeeping requests
             check      = reverse transitions based on timestamps
             write      = write random string files and register them into the Run Db
             delete     = deletes files copied to CASTOR
             autodelete = delete files which have been copied to CASTOR and are older than two days
             test       = test thread. Not interesting
             autoclose  = try to close open files of ended runs
             autoend    = automatically clean up open runs of partitions that might have not been closed by the ecs
    iterSleep: the delay between each iteration (in seconds)
    count: number of iterations
    no: the number of files processed in one iteration. Default 1
    runType: the type of runs targeted (e.g. CCRC, PHYSICS, etc)
    partitionID: the id of the run partition
    partitionName: name of the run partition (e.g. LHCb, RICH, etc)
    runStartTime: See above for format. Fields can be omitted starting from right to left.
                  Runs with started AFTER 'runStartTime' will be considered.
    runEndTime: See above for format. Fields can be omitted starting from right to left.
                Only runs ended BEFORE 'runEndTime' will be considered.
    stream: the name of the event stream
    dryrun: only prints the affected runs/files. Nothing is actually changed
Not all options make sense for all actions; options that do not apply are ignored.
Other default values are taken from the RunDatabase_Defines file.

Output options:
        -f           -> stdout/stderr (foreground mode)
        -l <logfile> -> prints to <logfile>
        -s           -> syslog using the UUCP facility

 

Checking logs:

At the moment the DataMover logs via syslog into /var/log/datamover.error, /var/log/datamover.debug and /var/log/datamover.info.

The DataMover also logs into the run database (table rundbdatamover). For each action (e.g. move, bkk_run, etc.) there is one entry. You can view these entries with /admin/RunDatabase/python/RunDatabase_LogViewer.py. There is also a shorter command available on the plus nodes: rdblog (for run database log).

Example of usage:

[lhcbprod@store02:~/p] $ ./RunDatabase_LogViewer.py 45099
ID                      Trys    Type            Time if last action     Message
--------------------------------------------------------------------------------------------------------------------------------------------
045099_0000000003.raw   0       move            2009-03-05 17:17:27     Transfer to CASTOR was successful.
045099_0000000001.raw   0       move            2009-03-05 17:18:37     Transfer to CASTOR was successful.
045099_0000000002.raw   0       move            2009-03-05 17:19:37     Transfer to CASTOR was successful.
045099_0000000004.raw   0       move            2009-03-05 17:36:10     Request to transfer the file to CASTOR was successfully send.



[dsonnick@plus09:~] $ rdblog 45099
ID                      Trys    Type            Time if last action     Message
--------------------------------------------------------------------------------------------------------------------------------------------
045099_0000000003.raw   0       move            2009-03-05 17:17:27     Transfer to CASTOR was successful.
045099_0000000001.raw   0       move            2009-03-05 17:18:37     Transfer to CASTOR was successful.
045099_0000000002.raw   0       move            2009-03-05 17:19:37     Transfer to CASTOR was successful.
045099_0000000004.raw   0       move            2009-03-05 17:36:10     Request to transfer the file to CASTOR was successfully send.
045099_0000000005.raw   0       move            2009-03-05 17:40:24     Request to transfer the file to CASTOR was successfully send.
045099_0000000006.raw   0       move            2009-03-05 17:43:36     Request to transfer the file to CASTOR was successfully send.
045099_0000000007.raw   0       move            2009-03-05 17:48:56     Request to transfer the file to CASTOR was successfully send.
045099_0000000008.raw   0       move            2009-03-05 17:51:23     Request to transfer the file to CASTOR was successfully send.
045099_0000000009.raw   0       move            2009-03-05 17:58:07     Request to transfer the file to CASTOR was successfully send.

You can provide either a run number or a filename. If you provide the run number, all related files will be shown.

If you do not get any output, the DataMover has not done anything related to this run or file. Possible reasons: the DataMover is not running, the DataMover has an error, the run is not targeted as ONLINE, the run has an error which prevents the DataMover from touching it, or the DataMover is busy and you have to wait.

Notice for admins: you have to do svn commit in /admin/RunDatabase/python and svn update in /group/online/rundb/RunDatabase/python after a change. rdbt and rdblog are only links to the scripts in this directory.

Checking the XML requests which have been sent

/admin/RunDatabase/python/RunDatabase_XmlViewer.py <runid or filename>

Send the output to Zoltan Mathe to check the validity of the XML Request.

 

State flow:

The following is the sequence of states for a file that is successfully processed:

 

  1. FILE_OPENED - 1
  2. FILE_CLOSED - 2
  3. FILE_VERIFIED - 4
  4. FILE_DIRAC_SEND - 50
  5. FILE_DIRAC_REPLIED - 70
  6. FILE_IN_CASTOR - 60
  7. FILE_IN_BKK - 30
  8. FILE_MIGRATED - 11
  9. FILE_DELETED - 9
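The happy-path sequence above can be captured in a small lookup, which is handy when poking at file states by hand. This is only a sketch: the real constants live in the Run Database code (RunDatabase_Defines), and only the names and numbers mirror the list above.

```python
# Sketch of the happy-path file states listed above. Note that the
# numeric codes are NOT monotonic (FILE_IN_CASTOR = 60 comes after
# FILE_DIRAC_REPLIED = 70), so order by name, never by number.
FILE_STATES = {
    "FILE_OPENED": 1,
    "FILE_CLOSED": 2,
    "FILE_VERIFIED": 4,
    "FILE_DIRAC_SEND": 50,
    "FILE_DIRAC_REPLIED": 70,
    "FILE_IN_CASTOR": 60,
    "FILE_IN_BKK": 30,
    "FILE_MIGRATED": 11,
    "FILE_DELETED": 9,
}

# Happy-path order of a successfully processed file.
HAPPY_PATH = list(FILE_STATES)

def next_state(current):
    """Return the state that follows `current`, or None at the end."""
    i = HAPPY_PATH.index(current)
    return HAPPY_PATH[i + 1] if i + 1 < len(HAPPY_PATH) else None
```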
Files are moved to CASTOR as soon as they are verified. If a run is closed and all files in that run have been verified, the run is entered into the BKK. In the new version this is done before all files have been transferred to CASTOR.

There is no automatic retry on transfer requests. See the Tools section for a set of tools to use in case of problems, if file transfers need to be re-requested.

Data Flow explained

Data is first written by the writer daemons running on the store nodes. When the writer has finished writing (which is a science by itself), it marks the file it has written as closed. As soon as the file is closed, the verify process kicks in.

Verify recounts the number of events inside the file and calculates an MD5 and an Adler-32 checksum for the file before requesting the copy to CASTOR. Since SNFS does not really have a local FS cache, each of these processes would usually read the file from disk again. To speed things up, there is a dedicated ramdisk file system in /verify on every store node, with space for 3 files. The verify process copies the file there and then runs all the checks on it. Afterwards, the file is deleted from the ramdisk again and a request is sent to the DIRAC agent to copy the file to CASTOR. Problems can arise if the DataMover is shut down while a verify job is still running: files that have not been deleted take up space and new files cannot be copied in anymore. In this case you have to check the state of these files and then remove them from the temporary file system.
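The checksumming part of the verify step is a single pass over the file computing both digests at once. The sketch below shows that idea with the standard library; the function name and chunk size are illustrative, not the actual DataMover code (which also recounts events).

```python
# One-pass size + MD5 + Adler-32 over a file, as the verify thread
# would do after copying the file to the /verify ramdisk.
import hashlib
import zlib

def verify_checksums(path, chunk_size=1 << 20):
    md5 = hashlib.md5()
    adler = 1  # zlib's Adler-32 starting value
    size = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            md5.update(chunk)
            adler = zlib.adler32(chunk, adler)
            size += len(chunk)
    return size, md5.hexdigest(), "%08x" % (adler & 0xFFFFFFFF)
```

Reading in chunks keeps memory flat regardless of file size, which matters for multi-GB raw files.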

After the file has been verified, a request for the transfer to CASTOR is sent to the DIRAC agent running on lbdirac (store06). Depending on the mood of DIRAC, the file will either be transferred or not.

Typical problems here are:

  • The files are in CASTOR, but they have not been migrated to tape yet.
  • The files are actually in CASTOR and on tape but DIRAC does not send the confirmation.
  • DIRAC refused to copy the file to CASTOR (it will be stuck in the VERIFIED state; check the DM log file for the DIRAC response).
  • DIRAC agents are down and not reachable
In all of the above cases you should contact the Offline team and notify them of the problem.

After the file has been successfully migrated to tape, DIRAC sends the notification back to the database via the XMLRPC server. We then send a request to DIRAC to enter the file into the BKK.

Typical problems here are:

  • The interface for the BKK has changed.
  • DIRAC agents are down.
File Deletion

If there is less than 10 TB of space left on /daqarea, an automated delete job starts deleting old files until about 40% of the disk space is free again. Unfortunately the criteria for when a file is released for deletion are a bit complicated. TODO: Explain the rules.
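The trigger logic amounts to a pair of watermark checks. A minimal sketch assuming the 10 TB / 40% figures from the text; the real selection of which files are eligible for deletion is more involved, and the names here are illustrative:

```python
# Hypothetical watermark checks for the autodelete job:
# start deleting below LOW_WATER bytes free, stop once the
# free fraction is back above HIGH_WATER_FRACTION.
import os

LOW_WATER = 10 * 1024**4        # 10 TB free -> start deleting
HIGH_WATER_FRACTION = 0.40      # stop at ~40% free

def free_bytes(path="/daqarea"):
    """Return (free, total) bytes for the filesystem at `path`."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize, st.f_blocks * st.f_frsize

def should_delete(free):
    return free < LOW_WATER

def may_stop(free, total):
    return free / total >= HIGH_WATER_FRACTION
```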

Typical problems:

  • People have copied data onto the daqarea, which the RUNDB does not know about.
  • Files that are marked as deleted have not been deleted.
The first case is relatively easily solved by checking for directories outside the daq tree.

The second case is trickier, and unfortunately there is no tool yet to find these kinds of files. For the time being you will have to check manually in the DB which files the RUNDB thinks are still on disk, and then check with 'find' which files are actually there.
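Part of that manual comparison can be scripted. Below is a hypothetical helper: it takes the paths a hand-written DB query says should still be on disk and reports which are actually missing. It does not talk to the Run Database itself; assembling the expected list is still up to you.

```python
# Given paths the RUNDB believes are still on disk, split them into
# those actually present and those missing from the filesystem.
import os

def check_on_disk(expected_paths):
    missing, present = [], []
    for p in expected_paths:
        if os.path.isfile(p):
            present.append(p)
        else:
            missing.append(p)
    return present, missing
```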

In any case, if we are running out of disk space and you cannot solve the problem immediately, you can manually delete either files that have already been transferred to CASTOR (at least state MIGRATED) or files that have been written by the sub-detectors. This should give you some breathing space to investigate further what is going on.

 

Use Cases

use_cases_xmlrpc_server3.png

Run state translation

 

State        Number  Meaning                                    Set By
RUN_ACTIVE   1       Run is active, data is being written       Controls via DIMRPC
RUN_ENDED    2       Run ended                                  Controls via DIMRPC
RUN_CREATED  5       Run was created                            Controls via DIMRPC
RUN_IN_BKK   6       Run including its files is in Bookkeeping  DataMover --action=bkkrun

 

File state translation

State                    Number  Meaning                                                   Set By
FILE_OPENED              1       File exists and data is being written                     writerd via XMLRPC
FILE_CLOSED              2       File is closed but run may still be active                writerd via XMLRPC
FILE_VERIFIED            4       File has been verified                                    DataMover --action=verify
FILE_DELETED             9       File was deleted                                          DataMover --action=autodelete
FILE_ERROR               10      File is in error, interaction required                    DataMover OR Dirac via XMLRPC
FILE_MIGRATED            11      Replica request for this file was sent                    Dirac via XMLRPC
FILE_BKK_START           12      Bookkeeping request for this file was sent
FILE_IN_BKK              30      File was entered into the bookkeeping
FILE_REPLICA_SEND        40      Request to insert the file into the bookkeeping was sent
FILE_DIRAC_SEND          50      File transfer request was sent                            DataMover --action=move
FILE_IN_CASTOR           60      File was copied to CASTOR                                 
FILE_DIRAC_REPLIED       70      DIRAC claims that the file was copied to CASTOR           Dirac via XMLRPC
FILE_DIRAC_UNSUCCESSFUL  100     Transfer to CASTOR was not successful (definitively)

 

Useful Tools

Over time several tools were developed to fix inconsistencies that arise in the database.

Close File tool

Careful!!! Files are usually not closed when a run ends. This is to accommodate the fact that during a run change, events are still being processed in the farm and might arrive much later. Do not use this for files going to CASTOR/OFFLINE that are not at least a day old. If a file is closed by hand and the writer decides to add some more events afterwards, DIRAC will not accept the file anymore.

If the writer is shut down by the run control, though, files are not closed and there are runs in state 'Ended' but with open files. To close these files manually, do:

./dmpython.sh ./RunDatabase_CloseRunFiles.py -r <runid>

 

Close Run tool

From time to time the Run Control makes a little mistake and forgets to close a run. This tool will update the state of the run from Running to Closed:

./dmpython.sh ./RunDatabase_CloseRun.py -r <runid>

The following tools should not be used without prior agreement with the Offline/Dirac people

Manual Run Move tool

This tool takes the files of a particular run and (if they are verified) sends the transfer request to DIRAC. This is useful if there was a problem during the transfer and the original transfer request has been lost.

./dmpython.sh DataMover_MoveParticular.py --runID <runid>

 

Manual Run BKK tool

The following is broken as of 2016-11-03.

This tool manually sends a bkk request for a run, if all the files of the run have been verified.

./dmpython.sh DataMover_BKKParticular.py --runID <runid>

 

Logrotation

Logrotation is configured for the DataMover on store02 and store03:

/var/log/datamover.info {
    weekly
    nocompress
    rotate 3
    create 0664 lhcbprod z5
}

/var/log/datamover.debug {
    weekly
    nocompress
    rotate 3
    create 0664 lhcbprod z5
}

/var/log/datamover.error {
    weekly
    nocompress
    rotate 3
    create 0664 lhcbprod z5
}

/var/log/xmlrpc {
    weekly
    nocompress
    rotate 3
    create 0664 lhcbprod z5

}
/var/log/dimrpc {
    weekly
    nocompress
    rotate 3
    create 0664 lhcbprod z5
}

 

/etc/syslog.conf

uucp.*                          /var/log/storage
local0.info                     /var/log/datamover.info
local0.debug                    /var/log/datamover.debug
local0.err                      /var/log/datamover.err
local2.*                        /var/log/dimrpc
local1.*                        /var/log/xmlrpc
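For reference, any process logging to the local0 facility has its messages routed to the datamover.* files by the configuration above. A Python sketch of that mapping follows; the actual DataMover logging setup may differ.

```python
# Log to syslog facility local0 so that /etc/syslog.conf routes the
# messages into the datamover.* files by severity.
import logging
import logging.handlers

try:
    handler = logging.handlers.SysLogHandler(
        address="/dev/log",
        facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
    )
except OSError:
    # No local syslog socket (e.g. on a development box); discard.
    handler = logging.NullHandler()

log = logging.getLogger("datamover")
log.addHandler(handler)
log.setLevel(logging.DEBUG)

log.info("routed to /var/log/datamover.info")
log.debug("routed to /var/log/datamover.debug")
log.error("routed to /var/log/datamover.err")
```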

 

FAQ

  • File is not transferred to CASTOR: check if the file state == 2, and check if the file exists (the DataMover will try to close open files 4 hours after a run has finished)
  • Run is not registered: check if all files have been transferred to CASTOR, check that the run state == 2, check if all parameters of runs and files exist
-- NikoNeufeld - 22 Oct 2009
Topic attachments:
  state_chart.png (PNG, 9.1 K, 2009-04-23) - State chart of runs and files in the LHCb online system
  use_cases_xmlrpc_server3.png (PNG, 12.1 K, 2009-04-23) - Use cases for the Run Database
Topic revision: r18 - 2016-11-03 - neufeld
 
