Retrying Failed Operation

About this task

If an offload, recall, or abort operation for a volume fails, the volume tierjobstatus command shows one of the following statuses:

FailureFatal	Indicates that the failure is fatal and CLDB will not retry the operation.
FailureRetriable	Indicates that the operation failed but can be retried; CLDB retries it automatically unless the job is manually restarted or aborted first.

CLDB retries the operation (up to 5 times by default), waiting a set interval between attempts (30 minutes by default), for the following errors:

EAGAIN
ETIMEDOUT
ENETUNREACH
ENETDOWN
ECONNRESET
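The retry rule described above can be modeled in a few lines of Python. This is only an illustrative sketch of the documented policy, not CLDB's implementation: the function name will_retry and the constant names are hypothetical, and the error names are taken from Python's standard errno module.

```python
import errno

# Errors that the text above lists as retried by CLDB.
RETRIABLE_ERRNOS = frozenset({
    errno.EAGAIN,
    errno.ETIMEDOUT,
    errno.ENETUNREACH,
    errno.ENETDOWN,
    errno.ECONNRESET,
})

DEFAULT_RETRY_COUNT = 5         # documented default number of retries
DEFAULT_WAIT_SECONDS = 30 * 60  # documented default wait, 1800 seconds


def will_retry(error_code, retries_so_far, retry_count=DEFAULT_RETRY_COUNT):
    """Return True if this model would schedule another attempt.

    Hypothetical helper: a retry happens only for a listed error and
    only while the retry budget has not been used up.
    """
    return error_code in RETRIABLE_ERRNOS and retries_so_far < retry_count
```

Under this model, a timeout on the first attempt is retried, while an error outside the list (for example EIO) or an exhausted retry budget is not.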

The RetryCount field value in the volume tierjobstatus command output shows the number of times CLDB has retried so far. For example:

# maprcli volume tierjobstatus -name testvol -json
{
 "timestamp":1503308792266,
 "timeofday":"2017-08-21 09:46:32.266 GMT+0000",
 "status":"OK",
 "total":1,
 "data":[
  {
   "offload":{
    "state":"FailureRetriable, RetryCount: 5",
    "startTime":"2017-08-21 09:07:17.506 GMT+0000",
    "endTime":"2017-08-21 09:08:49.799 GMT+0000",
    "gateway":"10.10.102.68:8660"
   }
  }
 ]
}
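If you collect this output from a script, the status and retry count can be pulled out of the state string. The following Python sketch assumes the output has the same shape as the sample above and that a state without a retry count is just the bare status name; the helper names parse_state and job_states are hypothetical.

```python
import json
import re

# Sample output copied from the example above.
SAMPLE = """
{
 "timestamp":1503308792266,
 "timeofday":"2017-08-21 09:46:32.266 GMT+0000",
 "status":"OK",
 "total":1,
 "data":[
  {
   "offload":{
    "state":"FailureRetriable, RetryCount: 5",
    "startTime":"2017-08-21 09:07:17.506 GMT+0000",
    "endTime":"2017-08-21 09:08:49.799 GMT+0000",
    "gateway":"10.10.102.68:8660"
   }
  }
 ]
}
"""


def parse_state(state):
    """Split a state string into (status, retry_count or None)."""
    match = re.fullmatch(r"\s*(\w+)\s*(?:,\s*RetryCount:\s*(\d+))?\s*", state)
    if match is None:
        raise ValueError(f"unrecognized state string: {state!r}")
    status, count = match.groups()
    return status, int(count) if count is not None else None


def job_states(output_text):
    """Map each operation name in the output to its parsed state."""
    result = {}
    for entry in json.loads(output_text).get("data", []):
        for operation, details in entry.items():
            result[operation] = parse_state(details["state"])
    return result


print(job_states(SAMPLE))  # {'offload': ('FailureRetriable', 5)}
```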

If the offload or recall operation for an individual file fails, the file tierstatus or hadoop mfs command returns one of the following:

Code	Message	Description
0	HAS_LOCAL_DATA	Indicates that the file is not yet fully offloaded.
1	NO_LOCAL_DATA	Indicates that the file was completely offloaded.
2	OP_FAIL	Indicates that the operation to retrieve the status failed.
3	INVALID_FILE	Indicates that the file does not exist.
4	FILE_NOT_TIERED	Indicates that the file is not in a tiered volume.
5	FILE_EMPTY	Indicates that the file specified for offload is an empty file.
6	NO_GATEWAY	Indicates that no MAST Gateways are available for offload operation.
7	OP_TIMEOUT	Indicates that there was no response from the MAST Gateway (possibly because of an error) during the offload or recall operation.
8	FTOS_SUCCESS	Indicates that the file was successfully offloaded or recalled.
9	FTOS_ABORTED	Indicates that the file offload or recall operation was aborted.
10	FTOS_ABORT_IN_PROGRESS	Indicates that the file offload or recall job is being aborted.
11	FTOS_TRANSFER_IN_PROGRESS	Indicates that the file offload is in progress.
12	FTOS_REQ_QUEUED	Indicates that the file offload is scheduled, but has not yet started.
13	FTOS_JOB_NOT_AVAILABLE	Indicates that the job ID specified with the tierjobstatus command is not available.
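For scripts that inspect these codes, the table above can be mirrored as an enumeration. This is a minimal sketch: the class name FileTierStatus and the helper describe are hypothetical, and only the numeric codes and message names come from the table.

```python
from enum import IntEnum


class FileTierStatus(IntEnum):
    """Status codes and messages copied from the table above."""
    HAS_LOCAL_DATA = 0
    NO_LOCAL_DATA = 1
    OP_FAIL = 2
    INVALID_FILE = 3
    FILE_NOT_TIERED = 4
    FILE_EMPTY = 5
    NO_GATEWAY = 6
    OP_TIMEOUT = 7
    FTOS_SUCCESS = 8
    FTOS_ABORTED = 9
    FTOS_ABORT_IN_PROGRESS = 10
    FTOS_TRANSFER_IN_PROGRESS = 11
    FTOS_REQ_QUEUED = 12
    FTOS_JOB_NOT_AVAILABLE = 13


def describe(code):
    """Return the message name for a code, or a placeholder if unknown."""
    try:
        return FileTierStatus(code).name
    except ValueError:
        return f"UNKNOWN({code})"


print(describe(7))  # OP_TIMEOUT
```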

When a file-level offload or recall operation fails, CLDB does not retry the operation. Instead, you can retry it manually:

For a failed offload operation, run the offload command again. For more information, see Offloading a File to a Tier Using the CLI and REST API. Alternatively, if the volume that the file is associated with has a data offload schedule, the file data is automatically offloaded based on the rules associated with the volume.
For a failed recall or abort operation, run the command again if the error returned is not EIO.

You can configure the number of times CLDB retries and the interval between retries using the CLI.

Configuring the Number of Retries

Procedure

Set the cldb.gateway.retry.count parameter (default: 5) to configure the maximum number of times that CLDB retries an operation. For example, to configure CLDB to retry an offload, recall, or abort operation up to 10 times, run the following command:

# maprcli config save -values {"cldb.gateway.retry.count":"10"}

Configuring the Interval Between Retries

Procedure

Set the cldb.gateway.retry.waittime.seconds parameter (default: 1800 seconds, or 30 minutes) to configure the amount of time that CLDB waits between retries. For example, to configure CLDB to wait 4 hours (14400 seconds) between retries, run the following command:

# maprcli config save -values {"cldb.gateway.retry.waittime.seconds":"14400"}
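If you apply these settings from automation, the argument list can be built programmatically. The following Python sketch only constructs the arguments and does not run maprcli; the helper name config_save_args is hypothetical, and it assumes that maprcli accepts the JSON value when it is passed as a single argument.

```python
import json


def config_save_args(name, value):
    """Build the argument list for one 'maprcli config save' call.

    The value is sent as a string, matching the commands shown above.
    """
    payload = json.dumps({name: str(value)}, separators=(",", ":"))
    return ["maprcli", "config", "save", "-values", payload]


# Retry count of 10, and a wait of 4 hours expressed in seconds.
count_args = config_save_args("cldb.gateway.retry.count", 10)
wait_args = config_save_args("cldb.gateway.retry.waittime.seconds", 4 * 60 * 60)

print(count_args[-1])  # {"cldb.gateway.retry.count":"10"}
print(wait_args[-1])   # {"cldb.gateway.retry.waittime.seconds":"14400"}
```

Each list could then be handed to a process runner such as subprocess.run on a node where maprcli is installed.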