ATLAS Distributed Data Management Monitoring

These pages are now unsupported. Please use the ARDA dashboard for monitoring information.

Most callbacks have been turned off for this monitoring; only errors and file done events are displayed. The throughput plots are no longer updated.

Explanation of the transfer monitoring pages

These pages show essentially all the information available about DQ2 data movement. The site services, which run on the VO boxes at Tier 1 sites, regularly report events on a file-by-file basis, such as a file transfer completing or an error looking up a source file. These services operate as a state machine, with an agent for each file state that processes the file and moves it to another state. The possible states are shown below:

State                        Explanation
UNKNOWN_SOURCE_SURLS         The files have been picked up by the site services but no source files have been found yet
KNOWN_SOURCE_SURLS           Source filenames have been resolved
ASSIGNED                     The file has been assigned a tool to use to copy it
PENDING                      The file is currently being copied, or a request has been made to the tool to copy it
ATTEMPT_DONE                 The tool has reported success, but the copy has not been validated yet or registered in the destination file catalog
VALIDATED                    The copy has been validated
FILE_DONE                    The file has been registered successfully in the destination file catalog. This is the final state of a successful transfer. The site services are finished with this file and will delete it from the local VO box database
HOLD_NO_REPLICAS             No source replicas were found for the file among the given sources
HOLD_NO_MORE_REPLICAS        One or more source replicas were found and transfer attempted but all failed and there are no more replicas left to try
HOLD_NO_TOOL_AVAILABLE       There is no file transfer tool available to copy between the source and destination sites and/or protocols
HOLD_FAILED_SUBMIT           The file failed to be submitted to the file transfer tool
HOLD_CANCELLED               The file transfer was cancelled from outside the request itself
HOLD_FAILED_POLLING          When the status of a transfer was checked, an error occurred
HOLD_FAILED_VALIDATION       An error occurred when trying to validate the copied file
HOLD_MAX_ATTEMPTS_REACHED    The maximum number of retries for a file transfer has been reached
HOLD_FAILED_RETRY_ANALYSIS   An error occurred when trying to decide whether or not to retry a file transfer
HOLD_FAILED_CLEAN_ATTEMPT    An error occurred when trying to clean up after a failed attempt
HOLD_FAILED_REGISTRATION     Registration to the destination file catalog failed
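
The explanations above suggest an order in which a file passes through the non-HOLD states when nothing goes wrong. The following is a minimal sketch of that progression. The ordering is an assumption inferred from the table (this page does not list the transitions explicitly), and `HAPPY_PATH` and `next_state` are hypothetical names, not part of DQ2.

```python
# Assumed ordering of the non-HOLD states for a transfer with no problems.
# Inferred from the state explanations; not taken from the DQ2 code.
HAPPY_PATH = [
    "UNKNOWN_SOURCE_SURLS",
    "KNOWN_SOURCE_SURLS",
    "ASSIGNED",
    "PENDING",
    "ATTEMPT_DONE",
    "VALIDATED",
    "FILE_DONE",
]


def next_state(state):
    """Return the state that follows `state`, or None after FILE_DONE."""
    position = HAPPY_PATH.index(state)
    if position + 1 < len(HAPPY_PATH):
        return HAPPY_PATH[position + 1]
    return None


# Walk one hypothetical file from pick-up to completion.
trace = []
state = HAPPY_PATH[0]
while state is not None:
    trace.append(state)
    state = next_state(state)

print(" -> ".join(trace))
```

In the real services each step is carried out by a separate agent, and any step may instead end in one of the HOLD states listed above.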

Clicking on 'status' shows these states along with the number of files in each state. Clicking on a state name then lists all the files in that state, and you can click on each file to see all the information we have about it.
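
The 'status' view amounts to grouping files by their current state. A minimal sketch of that grouping, assuming a hypothetical list of (file name, state) pairs; the file names and the `files_by_state` function are illustrative and not part of DQ2:

```python
from collections import defaultdict


def files_by_state(records):
    """Group file names by state; `records` is an iterable of (name, state)."""
    grouped = defaultdict(list)
    for name, state in records:
        grouped[state].append(name)
    return dict(grouped)


# Hypothetical sample data, not real monitoring output.
records = [
    ("file_a.root", "PENDING"),
    ("file_b.root", "FILE_DONE"),
    ("file_c.root", "PENDING"),
    ("file_d.root", "HOLD_NO_REPLICAS"),
]

summary = files_by_state(records)
for state, names in sorted(summary.items()):
    print(f"{state}: {len(names)} file(s)")
```

The per-state count corresponds to the number shown on the 'status' page, and the list under each state corresponds to what you see after clicking on the state name.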

HOLD states and VALIDATED states are final states. For HOLD states, unless there is a built-in retry policy (as with HOLD_FAILED_REGISTRATION, where we retry forever until the registration succeeds), some manual intervention is required to retry or cancel the attempt to fulfil the subscription. The other states are transient, and an agent will eventually pick up the files and move them to another state.
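
When scanning a list of states, the practical question is which files need someone to act. A minimal sketch of that check, assuming that HOLD_FAILED_REGISTRATION is the only HOLD state with a built-in retry (it is the only one named here); `needs_manual_intervention` is a hypothetical helper, not part of DQ2:

```python
# HOLD states that are retried automatically. Only this one is named in the
# text; treating it as the complete list is an assumption.
AUTO_RETRIED_HOLD_STATES = frozenset({"HOLD_FAILED_REGISTRATION"})


def needs_manual_intervention(state):
    """True for HOLD states that have no built-in retry policy."""
    return state.startswith("HOLD_") and state not in AUTO_RETRIED_HOLD_STATES


# Hypothetical states reported for a handful of files.
observed = [
    "PENDING",
    "HOLD_NO_REPLICAS",
    "HOLD_FAILED_REGISTRATION",
    "HOLD_MAX_ATTEMPTS_REACHED",
    "FILE_DONE",
]

stuck = [state for state in observed if needs_manual_intervention(state)]
print(stuck)
```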

Errors are also logged within the monitoring framework. These are not states, and they may or may not lead to a HOLD state. For example, if there is an error reading one remote LRC, the file may be found in another one and copied successfully; however, we still report the error message. The possible errors are explained below:

Error                                Explanation
REMOTE_FILE_LOOKUP_ERROR             An error occurred when searching for a source replica on a remote catalog
DATASET_REMOTE_LRC_LOOKUP_ERROR      An error occurred when searching for source replicas on a remote catalog (error is reported per dataset)
LOCAL_FILE_LOOKUP_ERROR              An error occurred when checking if a file exists on the local file catalog
DATASET_LOCAL_LRC_LOOKUP_ERROR       An error occurred when checking if a file exists on the local file catalog (error is reported per dataset)
FILE_TRANSFER_ERROR                  There was an error with a file transfer
QUERY_FILES_IN_DATASET_ERROR         There was an error when querying DQ2 central catalogs for the files in a dataset
ADD_DATASET_REPLICA_COMPLETE_ERROR   An error occurred when registering the dataset as complete at the destination in the DQ2 central catalogs
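
Because errors are reported alongside state changes rather than replacing them, a file can log an error and still finish successfully. A minimal sketch of separating the two, assuming a hypothetical chronological list of event names for one file; the `summarise` function and the sample events are illustrative only:

```python
# The error names listed in the table above.
ERROR_NAMES = frozenset({
    "REMOTE_FILE_LOOKUP_ERROR",
    "DATASET_REMOTE_LRC_LOOKUP_ERROR",
    "LOCAL_FILE_LOOKUP_ERROR",
    "DATASET_LOCAL_LRC_LOOKUP_ERROR",
    "FILE_TRANSFER_ERROR",
    "QUERY_FILES_IN_DATASET_ERROR",
    "ADD_DATASET_REPLICA_COMPLETE_ERROR",
})


def summarise(events):
    """Return (last reported state, list of errors) for one file's events."""
    last_state = None
    errors = []
    for name in events:
        if name in ERROR_NAMES:
            errors.append(name)  # an error does not change the state
        else:
            last_state = name
    return last_state, errors


# Hypothetical history: one remote catalog lookup fails, another succeeds.
events = [
    "UNKNOWN_SOURCE_SURLS",
    "REMOTE_FILE_LOOKUP_ERROR",
    "KNOWN_SOURCE_SURLS",
    "ASSIGNED",
    "PENDING",
    "ATTEMPT_DONE",
    "VALIDATED",
    "FILE_DONE",
]

state, errors = summarise(events)
print(state, errors)
```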

To find out whether anything at all is happening at a site, click on 'last 100 events'. Note that all times are in UTC. If you expect to see some data movement and nothing recent is listed, there may be a problem with the site services or the monitoring. You can click on 'datasets' to see whether your subscription has been picked up by the site services yet; it may take a few minutes after you enter the subscription for it to appear here. If your dataset is listed, you can click on it to see what state the files are in.

To get a picture of the overall success rate, click on 'success/failure by file'. This shows the number of successes (FILE_DONE) and errors reported in the last hour and the last day. Clicking on 'throughput' gives, as text, what is shown in the plots on the main page. The errors from the last two hours and the 100 most recent errors can be seen by clicking on 'last 2h errors' and 'last 100 errors' respectively.
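
The 'success/failure by file' figures can be thought of as counts over a time window. A minimal sketch of such a count, assuming a hypothetical list of (UTC timestamp, event name) pairs; the `success_failure` function, the fixed reference time, and the sample events are illustrative and not taken from the monitoring code:

```python
from datetime import datetime, timedelta, timezone


def success_failure(events, now, window):
    """Count FILE_DONE events and error events reported within `window` of `now`."""
    cutoff = now - window
    recent = [name for timestamp, name in events if timestamp >= cutoff]
    successes = sum(1 for name in recent if name == "FILE_DONE")
    # Every error name listed on this page ends in "_ERROR".
    errors = sum(1 for name in recent if name.endswith("_ERROR"))
    return successes, errors


# Fixed, hypothetical reference time in UTC so the example is reproducible.
now = datetime(2007, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [
    (now - timedelta(minutes=10), "FILE_DONE"),
    (now - timedelta(minutes=30), "FILE_TRANSFER_ERROR"),
    (now - timedelta(hours=5), "FILE_DONE"),
    (now - timedelta(hours=30), "FILE_DONE"),
]

last_hour = success_failure(events, now, timedelta(hours=1))
last_day = success_failure(events, now, timedelta(days=1))
print("last hour:", last_hour)
print("last day:", last_day)
```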


Questions/suggestions? Email atlas-dq2-support@cern.ch