Recovering from Disk Failure

Lists disk errors and their resolutions.

Most software failures can be remedied by running the fsck utility, which scans the storage pool to which the disk belongs and reports errors. For hardware failures, remove the failed disk and replace it according to the procedure in Removing and Replacing Disks.

The following are the types of failures and the recommended courses of action:
I/OTimeOut Error
Failure Reason: The default value for the mfs.io.disk.timeout parameter is 60 seconds. An I/O is declared stuck when it has not completed within 3 times the value of this parameter (3 x mfs.io.disk.timeout, which is 180 seconds by default). The disk is taken offline if even a single I/O is stuck.
Action:
  1. Check if the disks are good and still reliable.
  2. If disks are good, increase the value of the mfs.io.disk.timeout parameter in the /opt/mapr/conf/mfs.conf file. Otherwise, replace the disks.
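The threshold arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not part of the product: it assumes that mfs.conf holds simple key=value lines, and the helper names read_disk_timeout and stuck_threshold_s are hypothetical.

```python
DEFAULT_TIMEOUT_S = 60  # default value stated in the text above


def read_disk_timeout(conf_text, key="mfs.io.disk.timeout"):
    """Return the timeout in seconds from mfs.conf-style text, or the default.

    Assumption: the file uses plain key=value lines and '#' for comments.
    """
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return int(value.strip())
    return DEFAULT_TIMEOUT_S


def stuck_threshold_s(timeout_s):
    """An I/O is declared stuck after 3 x the configured timeout."""
    return 3 * timeout_s


sample = "# assumed key=value layout\nmfs.io.disk.timeout=120\n"
print(stuck_threshold_s(read_disk_timeout(sample)))  # 360
print(stuck_threshold_s(read_disk_timeout("")))      # 180 (default of 60 seconds)
```

Doubling the parameter from 60 to 120 therefore moves the stuck threshold from 180 to 360 seconds, which is worth keeping in mind before raising the value on disks that are merely slow.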
No Such Device
Failure Reason: The $INSTALL_DIR/conf/disktab file contains "/MissingDisk" or references a disk path that is not found in the /proc/partitions file.
Action: Run mrdisk <device path> to determine whether a disk is formatted for the file system. Also, check the device paths in the $INSTALL_DIR/conf/disktab file. The disktab file contains the disk paths and disk GUIDs that are used to load the disks in the file system. If the disk paths have been renamed, fix them or run the disksetup -X command to regenerate the disktab file from /proc/partitions. Alternatively, restart the file system to resolve disk name changes.

If the problem persists, contact HPE Ezmeral Data Fabric support.
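The path check described above can be approximated with a short Python sketch. It assumes that each disktab line has the form "<device path> <GUID>" with '#' marking comments, and that device paths are /dev/<name>, where <name> is the fourth column of /proc/partitions. The function names and the sample GUIDs are made up for illustration.

```python
def partition_names(proc_partitions_text):
    """Collect device names from /proc/partitions text (4th column)."""
    names = set()
    for line in proc_partitions_text.splitlines():
        fields = line.split()
        # Data rows look like: major minor #blocks name; the header row is skipped
        if len(fields) == 4 and fields[0].isdigit():
            names.add(fields[3])
    return names


def missing_disktab_paths(disktab_text, proc_partitions_text):
    """Return disktab paths that are placeholders or unknown to the kernel."""
    known = {"/dev/" + name for name in partition_names(proc_partitions_text)}
    missing = []
    for raw in disktab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        path = line.split()[0]
        if path.startswith("/MissingDisk") or path not in known:
            missing.append(path)
    return missing


proc = """major minor  #blocks  name

   8        0  488386584 sda
   8       16  488386584 sdb
"""
disktab = """# assumed layout: <device path> <GUID>
/dev/sdb 47E4CCDA-3536-F767-CD18-0CB7E4D34E00
/dev/sdc 0F8C3A11-AAAA-BBBB-CCCC-0123456789AB
/MissingDisk0 1B2C3D4E-AAAA-BBBB-CCCC-0123456789AB
"""
print(missing_disktab_paths(disktab, proc))  # ['/dev/sdc', '/MissingDisk0']
```

On a real node, the two strings would come from reading $INSTALL_DIR/conf/disktab and /proc/partitions; any path the function reports is a candidate for the fixes listed in the action above.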

ENODEV: MissingDisk# Error: disktab file contains a /MissingDisk# entry
Failure Reason: The disk corresponding to a GUID is missing, and the disk path recorded for that GUID in the disktab file now belongs to another disk. When an attempt is made to automatically fix the disktab file, this entry is replaced with a /MissingDisk# path.
Action: If the disk corresponding to a GUID is permanently lost, remove the line corresponding to it from the disktab file. Alternatively, run the maprcli disk remove _MissingDisk# command, where # corresponds to the disk number, and restart the file system.
EIO Error
Failure Reason: I/O error. This could be due to a bad block or a bad disk. The system takes the storage pool (SP) offline after one final attempt to complete the I/O.
Action: Check /var/log/messages for errors from the disk drivers.
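As a rough aid for that log check, the following Python sketch counts kernel I/O error lines per device. The exact message text varies by kernel version and driver, so the pattern "I/O error, dev <name>, sector <n>" is an assumption, and the sample log lines are invented for illustration.

```python
import re
from collections import Counter

# Assumed message shape; adjust the pattern to what your kernel actually logs
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def io_errors_by_device(log_text):
    """Count lines that report a block-layer I/O error, keyed by device name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample_log = """\
Mar  3 10:15:01 node1 kernel: blk_update_request: I/O error, dev sdb, sector 123456
Mar  3 10:15:02 node1 kernel: blk_update_request: I/O error, dev sdb, sector 123464
Mar  3 10:15:09 node1 kernel: EXT4-fs (sda1): mounted filesystem
"""
print(dict(io_errors_by_device(sample_log)))  # {'sdb': 2}
```

A device that accumulates many such lines is a likely candidate for replacement; a single isolated error may instead point to one bad block.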
CRC Error
Failure Reason: This could be due to a bad block or bit flip on the disk. The SP will be taken offline immediately.
Action: Run fsck -n <sp> -d to perform a CRC (Cyclic Redundancy Check) on the data blocks in the storage pool, then bring it back online.

To load all the SPs into the list of SPs, run either of the following:
  mrconfig disk load
  mrconfig sp load
To bring all SPs back online, run:
  mrconfig sp refresh
To bring specific SPs back online, run:
  mrconfig sp online <sp path>
SlowDisk Error
Failure Reason: The default value for the mfs.io.disk.timeout parameter is 60 seconds. An I/O is declared slow when it takes at least as long as the value of this parameter (1 x mfs.io.disk.timeout). Thirty or more slow I/O completions on the same disk within a short span of time (5 seconds) are recorded as one slow event. The SP is taken offline if 3 such events are recorded within an hour.
NOTE: After an hour, the HPE Ezmeral Data Fabric file system resets the slow-event count to 0.
Action:
  1. Check if the disks are good and still reliable.
  2. If disks are good, increase the value of the mfs.io.disk.timeout parameter in the /opt/mapr/conf/mfs.conf file. Otherwise, replace the disks.
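The slow-disk rule can be modeled with a small Python simulation. This is one possible reading of the rule, not the file system's actual implementation: the SlowDiskTracker class is hypothetical, and it assumes that the one-hour tracking period starts at the first slow event and that the slow-I/O burst counter is cleared each time an event is recorded.

```python
from collections import deque


class SlowDiskTracker:
    """Toy model: 30+ slow I/Os within 5 s = 1 event; 3 events within 1 h = offline."""

    def __init__(self, timeout_s=60, slow_ios_per_event=30, burst_window_s=5,
                 events_to_offline=3, tracking_window_s=3600):
        self.timeout_s = timeout_s
        self.slow_ios_per_event = slow_ios_per_event
        self.burst_window_s = burst_window_s
        self.events_to_offline = events_to_offline
        self.tracking_window_s = tracking_window_s
        self.slow_ios = deque()     # completion times of recent slow I/Os
        self.events = 0             # slow events in the current tracking period
        self.tracking_start = None  # time of the first event in this period

    def record_io(self, completed_at, duration_s):
        """Record one I/O completion; return True if the SP should go offline."""
        # Assumed reset rule: the event count returns to 0 after an hour
        if (self.tracking_start is not None
                and completed_at - self.tracking_start >= self.tracking_window_s):
            self.events = 0
            self.tracking_start = None
        if duration_s >= self.timeout_s:  # slow I/O: took 1 x timeout or longer
            self.slow_ios.append(completed_at)
            # Keep only slow completions from the last burst_window_s seconds
            while completed_at - self.slow_ios[0] > self.burst_window_s:
                self.slow_ios.popleft()
            if len(self.slow_ios) >= self.slow_ios_per_event:
                self.slow_ios.clear()
                self.events += 1
                if self.tracking_start is None:
                    self.tracking_start = completed_at
        return self.events >= self.events_to_offline


def run_bursts(burst_starts):
    """Feed one burst of 30 slow I/Os (61 s each) per start time."""
    tracker = SlowDiskTracker()
    offline = False
    for start in burst_starts:
        for i in range(30):
            offline = tracker.record_io(start + i * 0.1, duration_s=61)
    return offline


print(run_bursts((0, 600, 1200)))   # True: three events inside one hour
print(run_bursts((0, 2000, 4000)))  # False: the count resets before the third event
```

Raising the timeout parameter changes only which I/Os count as slow; the burst size, the 5-second span, and the 3-events-per-hour limit are taken directly from the description above.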
GUID of disk mismatches with the one in $INSTALL_DIR/conf/disktab
Failure Reason: The disk names might have changed.
Action: After a node restart, the operating system can reassign the drive labels (for example, /sda), so that the drive labels no longer match the entries in the disktab file. The disktab file contains the disk paths and disk GUIDs that are used to load the disks in the file system. Run $INSTALL_DIR/server/disksetup -X to update the disktab file; the command looks up the disks in /proc/partitions and makes the disk paths match the GUIDs.
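Conceptually, matching paths to GUIDs amounts to a lookup by GUID. The Python sketch below illustrates that idea only; it is not how disksetup is implemented. It assumes that the GUID currently found at each device path is already known (how the tool reads GUIDs from the disks is not covered here), and both the rebuild_disktab name and the numbering of /MissingDisk# placeholders from 0 are hypothetical.

```python
def rebuild_disktab(recorded_entries, current_guid_by_path):
    """Re-point each recorded GUID at the path where it is found now.

    recorded_entries: list of (path, guid) pairs as stored in disktab.
    current_guid_by_path: {path: guid} as observed on the node (assumed input).
    GUIDs with no matching disk get a /MissingDisk# placeholder path.
    """
    path_by_guid = {guid: path for path, guid in current_guid_by_path.items()}
    rebuilt = []
    missing = 0
    for _old_path, guid in recorded_entries:
        if guid in path_by_guid:
            rebuilt.append((path_by_guid[guid], guid))
        else:
            rebuilt.append((f"/MissingDisk{missing}", guid))
            missing += 1  # placeholder numbering is an assumption
    return rebuilt


recorded = [("/dev/sdb", "GUID-A"), ("/dev/sdc", "GUID-B"), ("/dev/sdd", "GUID-C")]
observed = {"/dev/sdb": "GUID-B", "/dev/sdc": "GUID-A"}  # labels swapped after restart
print(rebuild_disktab(recorded, observed))
# [('/dev/sdc', 'GUID-A'), ('/dev/sdb', 'GUID-B'), ('/MissingDisk0', 'GUID-C')]
```

The example shows both outcomes described in this section: swapped drive labels are corrected by GUID, and a GUID with no disk present ends up under a /MissingDisk# path.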
Unknown Error
Failure Reason: Any cause not listed above.
Action: Contact HPE Ezmeral Data Fabric support.