Cluster monitoring

Last updated: 2022-01-11 13:43:09

The Kingsoft Cloud Elasticsearch Service (KES) console provides real-time and historical monitoring data so that you can monitor cluster and node resources such as storage, CPU, and memory. Based on these metrics, you can track the real-time running status of your KES clusters and address risks promptly to keep the clusters running stably.

  1. Log in to the KES console.

  2. In the cluster list, find the cluster that you want to monitor and click Monitoring in the Operation column to go to the Cluster Monitoring page. Alternatively, click the name of the target cluster to go to the Cluster Details page, and then click Cluster Monitoring in the left navigation pane.

Cluster status

Metric description

Service status
Description: The status of the KES cluster, which can be Green, Yellow, or Red.
Green: The cluster is normal.
Yellow: The cluster reports alarms, and some replica shards are unavailable.
Red: The cluster is abnormal, and some primary shards are unavailable.
Details: If the cluster status is Yellow, search results are still complete, but the high availability of the cluster is affected and the risk of data loss is high. Investigate, locate, and fix the issue promptly to prevent data loss. If the cluster status is Red, some data has been lost and search operations return only part of the data. A write request allocated to a lost shard returns an exception. Locate and repair the abnormal shards promptly.

Cluster query QPS
Description: The total number of queries per second across the cluster.
Details: QPS is determined by the number of primary shards of the queried index. For example, if the queried index has five primary shards, one query request counts as 5 QPS. A sharp increase in QPS may drive up CPU usage, heap memory usage, or load_1m, degrading the processing capability of cluster nodes.

Document write QPS
Description: The total number of documents written per second.
Details: A sharp increase in QPS may drive up CPU usage, heap memory usage, or load_1m, degrading the processing capability of cluster nodes.
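
Besides viewing the cluster status in the console, you can also read it programmatically through the standard Elasticsearch _cluster/health API. The following is a minimal sketch only; the endpoint address and credentials are placeholders that you would replace with your own cluster's access information.

```python
# Minimal sketch: read the cluster status (Green/Yellow/Red) from the
# Elasticsearch _cluster/health API. Endpoint and credentials below are
# placeholders, not values provided by KES.
import requests

ES_ENDPOINT = "http://your-kes-cluster-endpoint:9200"  # placeholder
AUTH = ("elastic", "your-password")                     # placeholder

def check_cluster_health():
    resp = requests.get(f"{ES_ENDPOINT}/_cluster/health", auth=AUTH, timeout=10)
    resp.raise_for_status()
    health = resp.json()

    status = health["status"]  # "green", "yellow", or "red"
    if status == "green":
        print("Cluster is normal.")
    elif status == "yellow":
        print("Some replica shards are unavailable; high availability is affected.",
              health["unassigned_shards"], "unassigned shards")
    else:  # "red"
        print("Some primary shards are unavailable; data may be lost.",
              health["unassigned_shards"], "unassigned shards")
    return health

if __name__ == "__main__":
    check_cluster_health()
```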

Node status

Metric description

CPU usage (%)
Description: The percentage of CPU in use on each node. Data for this metric is collected every 60 seconds.
Details: High CPU usage degrades the processing capability of cluster nodes. If this metric stays high, upgrade the node specifications to improve the load capacity of the nodes.

Disk usage (%)
Description: The percentage of disk space in use on each node. Data for this metric is collected every 60 seconds.
Details: Keep the disk usage of each node below 85% to avoid affecting services, and delete unused indexes promptly. To scale out the cluster, increase the disk capacity of each node or add more nodes.

Heap memory usage (%)
Description: The percentage of heap memory in use on each node. Data for this metric is collected every 60 seconds.
Details: High heap memory usage affects the Elasticsearch service and automatically triggers garbage collection (GC). Excessively high heap memory usage can cause out-of-memory (OOM) errors.

load_1m
Description: The load on each cluster node over the last 60 seconds.
Details: The value of this metric should be smaller than the number of CPU cores on the node. For a single-core node:
< 1: No process is waiting.
= 1: The node is fully utilized and cannot provide extra resources for additional processes.
> 1: Processes are queued and waiting for resources. If the value is too high, reduce the cluster load or increase the node specifications.

Total GC running duration
Description: The accumulated GC duration within 60 seconds.
Details: An excessively long GC duration indicates that the node is short of memory. Increase the node memory to balance the load on the node, or add nodes to balance the load across the cluster.

Rejected requests
Description: The number of write and query rejections within 60 seconds.
Details: When CPU, memory, or disk usage is too high, write and query rejections may increase. This typically occurs when the current cluster configuration cannot keep up with read and write demand. If the value is too high, upgrade the node configuration to improve the processing capability of cluster nodes.
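
The node-level metrics above can likewise be sampled outside the console through the standard Elasticsearch _nodes/stats API. The sketch below is illustrative only: the endpoint and credentials are placeholders, and the GC time and rejection counts returned by the API are cumulative, so a real monitor would compare successive samples to derive 60-second windows like those shown in the console.

```python
# Minimal sketch: poll per-node metrics similar to the Node status panel via
# the Elasticsearch _nodes/stats API. Endpoint and credentials are placeholders
# for your own KES cluster; the 85% disk threshold follows the guidance above.
import requests

ES_ENDPOINT = "http://your-kes-cluster-endpoint:9200"  # placeholder
AUTH = ("elastic", "your-password")                     # placeholder

def check_node_stats():
    url = f"{ES_ENDPOINT}/_nodes/stats/os,jvm,fs,thread_pool"
    resp = requests.get(url, auth=AUTH, timeout=10)
    resp.raise_for_status()
    nodes = resp.json()["nodes"]

    for node in nodes.values():
        name = node["name"]
        cpu_pct = node["os"]["cpu"]["percent"]
        load_1m = node["os"]["cpu"]["load_average"]["1m"]
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]

        fs = node["fs"]["total"]
        disk_pct = 100 * (1 - fs["available_in_bytes"] / fs["total_in_bytes"])

        # Cumulative GC time in milliseconds; compare successive samples to
        # obtain the GC duration within a fixed window.
        gc = node["jvm"]["gc"]["collectors"]
        gc_ms = sum(c["collection_time_in_millis"] for c in gc.values())

        # Cumulative write/search rejections; also best read as a delta.
        pools = node["thread_pool"]
        rejected = pools["write"]["rejected"] + pools["search"]["rejected"]

        print(f"{name}: cpu={cpu_pct}% heap={heap_pct}% disk={disk_pct:.1f}% "
              f"load_1m={load_1m} gc_total_ms={gc_ms} rejected={rejected}")

        if disk_pct > 85:
            print(f"  WARNING: disk usage on {name} exceeds 85%; "
                  f"clean up unused indexes or expand storage.")

if __name__ == "__main__":
    check_node_stats()
```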