Monitoring Nodes

Explains how to monitor nodes using either the Control System or the CLI.

You can check the health of the nodes in the cluster using the Control System, where nodes are organized by service or by topology, or by using the CLI.

Note: The metrics collection infrastructure must be installed during installation to visualize the graphs and charts. If it is not installed, perform an Incremental Install to add it.
Note: The Nodes page is not available on the Kubernetes version of the Control System.

Monitoring Node Health Using the Control System

To monitor the health of nodes:
  1. Log in to the Control System and click one of the following to view the health of the nodes in the Node Health pane:
    • Overview
    • Nodes
  2. Select one of the following from the drop-down menu in the Node Health pane:
    • By Service to organize the display of nodes by services.

      This is the default view on the Overview page. This view lists each service and, for each, the nodes on which the service is running and the nodes on which it is down.

      Note: The color of the node reflects the status of the service, including when a service is stopped (not running) on the node.
    • By Topology to organize the display of nodes by topology.

      This is the default view on the Nodes page. This view contains the list of topologies and the health of the nodes (as shown in the following table) in each topology.

      Healthy: Indicates the node is healthy.
      Degraded: Indicates the node is degraded and may need attention. A node is considered to be in a degraded state if:
      • There is no heartbeat from the data-fabric filesystem/NFS node for over 60 seconds.
      • One or more services are down on the node.
      • One or more alarms are raised on the node.
      Maintenance: Indicates the node is in maintenance mode.
      Critical: Indicates one or more critical issues on the node. A node is considered to be in a critical state if:
      • There is no heartbeat from the node for more than 5 minutes.
      • All data-fabric filesystem disks on the node are dead or offline.
      • All containers on the node are being re-replicated because the node was removed, the node was unregistered, or there was no heartbeat from the node for more than 1 hour.
      • The file server is dead or inactive because there has been no heartbeat for an extended period.
      • The NFS server on the node is dead.
      • The Data Fabric install directory is full.
      • The node reported high data-fabric filesystem memory usage.
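The health states above are also exposed through the CLI: `maprcli node list -json` returns per-node records that include health information. The sketch below groups nodes by their reported health description; the exact field names (`health`, `healthDesc`) and the sample payload are assumptions for illustration, not real cluster output, so verify them against the node list reference for your release:

```python
import json

# Illustrative stand-in for `maprcli node list -json` output (hypothetical
# data; in practice this JSON comes from running the command on the cluster).
sample = json.loads("""
{
  "status": "OK",
  "total": 3,
  "data": [
    {"hostname": "node-a", "health": 0, "healthDesc": "Healthy"},
    {"hostname": "node-b", "health": 2, "healthDesc": "Degraded"},
    {"hostname": "node-c", "health": 4, "healthDesc": "Critical"}
  ]
}
""")

# Group node hostnames by their reported health description.
by_health = {}
for node in sample["data"]:
    by_health.setdefault(node["healthDesc"], []).append(node["hostname"])

for desc, hosts in sorted(by_health.items()):
    print(f"{desc}: {', '.join(hosts)}")
```

A grouping like this is a quick way to spot which nodes need attention without scanning the full per-node listing.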

Monitoring Node Resource Utilization from the Control System

Log in to the Control System and click Nodes to view the nodes that consumed the most CPU and memory (as a percentage) in the Current Resource Utilization pane. The shade of each bubble indicates resource utilization, with darker shades marking nodes that are nearing disk capacity.
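A similar ranking can be produced from CLI output: `maprcli node list` can emit per-node CPU and memory figures. The sketch below sorts nodes by CPU utilization; the field names (`utilization` as CPU percent, `mused`/`mtotal` as memory in MB) and the sample data are assumptions for illustration, so check the node list reference for the columns available in your release:

```python
import json

# Hypothetical stand-in for `maprcli node list -json` output restricted to
# resource columns; field names and values are assumptions for illustration.
sample = json.loads("""
{
  "status": "OK",
  "total": 3,
  "data": [
    {"hostname": "node-a", "utilization": 85, "mused": 48000, "mtotal": 64000},
    {"hostname": "node-b", "utilization": 12, "mused": 8000,  "mtotal": 64000},
    {"hostname": "node-c", "utilization": 57, "mused": 30000, "mtotal": 64000}
  ]
}
""")

# Rank nodes by CPU utilization (percent), highest first, and report memory use.
top = sorted(sample["data"], key=lambda n: n["utilization"], reverse=True)
for n in top:
    mem_pct = 100 * n["mused"] / n["mtotal"]
    print(f"{n['hostname']}: cpu {n['utilization']}%, mem {mem_pct:.0f}%")
```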

Monitoring Node Health Using the CLI or REST API

You can check the general health of the nodes with the following command:

maprcli node heatmap -cluster <cluster>

This command displays a heatmap for the nodes on the specified cluster; a subset of the output can also be visualized on the Control System. For complete reference information, see node heatmap.
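As a toy illustration of what the heatmap conveys, the sketch below maps per-node health codes to readable labels. The code values (0=healthy, 2=degraded, 3=maintenance, 4=critical) and node names are hypothetical placeholders, not a documented contract; consult the node heatmap reference for the actual output schema:

```python
# Map numeric health codes to readable labels. These code values are
# assumptions for illustration only.
LABEL = {0: "healthy", 2: "degraded", 3: "maintenance", 4: "critical"}

nodes = {"node-a": 0, "node-b": 2, "node-c": 4}  # hypothetical health codes

heatmap = {host: LABEL.get(code, "unknown") for host, code in nodes.items()}
for host in sorted(heatmap):
    print(f"{host}: {heatmap[host]}")
```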