Integrated Monitoring

1. What types of resources can be monitored?

Cocktail Cloud uses more than 200 metrics covering resources and their states in a multi-cluster environment, and presents them in more than 100 monitoring panels.

The panels are organized into views for clusters, ingresses, ETCD, nodes, GPUs, and namespaces. A separate alarm/event page lists alarms and events chronologically, so the overall status of the platform can be reviewed at a glance.

2. Where can monitoring information be accessed?

Monitoring information in Cocktail Cloud can be accessed in the left-hand [Monitoring] menu. Sub-menus include clusters, ingresses, ETCD, nodes, GPUs, namespaces, and alarms/events.

[Screen] Integrated Monitoring Dashboard Main

3. How can the cluster status be checked?

The platform provides up-to-date status information at the cluster level. Key status information provided in the cluster view includes:

  • Number of API server calls per second

  • CPU usage

  • Disk usage

  • Disk I/O speed

  • Memory usage

  • Restarted Pod tracking

  • Average request time over the last 10 minutes

  • Pod executions by status

  • Top 5 Pods with high CPU usage

  • Top 5 Pods with high memory usage
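
As an illustration of how a "Top 5 Pods" panel can be derived from per-Pod usage samples, the following is a minimal sketch. The function name, the Pod names, and the sample values are hypothetical and are not taken from Cocktail Cloud; the product's actual query logic may differ.

```python
from heapq import nlargest


def top_pods(usage_by_pod, n=5):
    """Return the n (pod, usage) pairs with the highest usage, largest first."""
    return nlargest(n, usage_by_pod.items(), key=lambda item: item[1])


# Hypothetical sample: Pod name -> CPU usage in cores.
sample_cpu = {
    "api-7f9c": 0.82,
    "db-0": 1.90,
    "cache-1": 0.15,
    "worker-a": 1.10,
    "worker-b": 0.95,
    "ingress-x": 0.40,
    "batch-3": 0.05,
}

top5 = top_pods(sample_cpu)
for pod, cores in top5:
    print(f"{pod}: {cores:.2f} cores")
```

The same helper applies to memory usage by passing a mapping of Pod name to bytes instead of cores.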

[Screen] Cluster Monitoring

4. Checking Ingress Status

Ingress exposes HTTP and HTTPS routes from outside the cluster to services inside it. It can be configured to provide externally reachable URLs, load balancing, SSL/TLS termination, and name-based virtual hosting. Because Ingress sits on the network path to the services it exposes, it needs to be monitored from several angles.

The status information provided in the integrated dashboard's Ingress view includes:

  • Ingress controller requests

  • Ingress controller connections

  • Ingress controller request success rate

  • Recent Ingress configuration reload success and failure

  • Ingress controller request trends

  • Ingress controller success rate trends

  • Network I/O trends

  • Average memory usage trends

  • Average CPU usage trends
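
To make the "request success rate" item concrete, here is a minimal sketch of one way such a rate could be computed from request counts grouped by HTTP status code. It assumes that status codes below 400 count as successful; that threshold, the function name, and the sample counts are assumptions for illustration, not the definition Cocktail Cloud uses.

```python
def success_rate(requests_by_status):
    """Return the fraction of requests with a status code below 400.

    Returns None when there were no requests, so callers can tell
    "no traffic" apart from "0% success".
    """
    total = sum(requests_by_status.values())
    if total == 0:
        return None
    # Assumption: 1xx, 2xx, and 3xx responses are treated as successful.
    ok = sum(count for status, count in requests_by_status.items() if status < 400)
    return ok / total


# Hypothetical counts for one sampling interval: status code -> requests.
sample_counts = {200: 900, 301: 50, 404: 30, 500: 20}
rate = success_rate(sample_counts)
print(f"success rate: {rate:.1%}")
```

Computing the same value for each interval in sequence yields the kind of series a success rate trend panel would plot.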

[Screen] Ingress Monitoring

5. Checking ETCD Status

The status information provided in the integrated dashboard's ETCD view includes:

  • Presence of ETCD leader

  • Number of recent leader changes

  • Number of recent leader change proposal failures

  • RPC ratio

  • Database usage

  • Node disk processing speed

  • Overall disk processing speed

  • Client traffic In/Out

  • ETCD server-specific processing status

  • Network usage

  • Snapshot processing speed

[Screen] ETCD Monitoring

6. Checking Node Status

The status information provided in the integrated dashboard's Node view includes:

  • Cluster CPU usage frequency

  • Cluster memory usage

  • Cluster disk usage

  • Cluster network usage

  • Recent changes and current values of the file system's free space ratio

  • List of file systems and their usage
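
For readers unfamiliar with the free space ratio, the following minimal sketch shows the arithmetic: free bytes divided by total bytes. The function names and sample sizes are hypothetical. `shutil.disk_usage` is part of the Python standard library and is used here only to show how the inputs could be obtained on a local machine; it is not how Cocktail Cloud gathers node metrics.

```python
import shutil


def free_space_ratio(total_bytes, free_bytes):
    """Return free space as a fraction of total capacity (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def local_free_space_ratio(path):
    """Compute the ratio for the file system that contains `path`."""
    usage = shutil.disk_usage(path)
    return free_space_ratio(usage.total, usage.free)


# Hypothetical file system: 500 GiB total, 125 GiB free.
GIB = 1024 ** 3
sample_ratio = free_space_ratio(total_bytes=500 * GIB, free_bytes=125 * GIB)
print(f"free space: {sample_ratio:.0%}")
```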

[Screen] Node Monitoring

7. Checking GPU Status

The status information provided in the integrated dashboard's GPU view includes:

  • Average GPU utilization

  • GPU usage trends

  • Average GPU memory utilization

  • GPU memory usage trends

  • GPU temperature and power

  • GPUs/MIGs

  • Timeslicing

[Screen] GPU Monitoring

8. Checking Namespace Status

The status information provided in the integrated dashboard's Namespace view includes:

  • Number of containers

  • Namespace creation time

  • Total number of Pods in the namespace

  • Namespace PVC status

  • Namespace CPU allocation

  • Namespace memory allocation

  • Number of Pods running in the namespace

[Screen] Namespace Monitoring

9. Checking Notification/Event History

The monitoring metrics displayed in the integrated dashboard are delivered through the dashboard, SMS, and E-mail, depending on user configuration. Users can filter and view metrics by cluster, namespace, and major resource group.

The dashboard shows events from the past hour, together with per-minute accumulated event counts and detailed event descriptions, so the cause of a problem can often be identified from the event content alone.

Each event is assigned one of five levels of importance, and real-time notifications are sent by SMS, E-mail, or both, according to user preferences. Users can filter and view recent events and notifications, and data can be retained for up to one year depending on user settings.
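
The behavior described above (five importance levels, a per-user choice of SMS, E-mail, or both, and filtering by cluster and namespace) can be sketched as follows. This is only an illustrative model: the class names, the numeric 1 to 5 level scale, and the `min_level` threshold are assumptions and do not reflect Cocktail Cloud's actual data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    cluster: str
    namespace: str
    level: int  # Assumed scale: 1 (lowest) to 5 (highest importance).
    message: str

    def __post_init__(self):
        if not 1 <= self.level <= 5:
            raise ValueError("level must be between 1 and 5")


@dataclass(frozen=True)
class Preference:
    min_level: int   # Hypothetical threshold: notify at this level or above.
    channels: tuple  # Any of "sms" and "email".


def channels_for(event, pref):
    """Return the channels an event should be sent to for this preference."""
    return list(pref.channels) if event.level >= pref.min_level else []


def filter_events(events, cluster=None, namespace=None):
    """Keep events matching the given cluster and/or namespace."""
    return [
        e for e in events
        if (cluster is None or e.cluster == cluster)
        and (namespace is None or e.namespace == namespace)
    ]


# Hypothetical events and user preference.
events = [
    Event("prod", "payments", 5, "Pod restart loop detected"),
    Event("prod", "web", 2, "CPU usage above 70%"),
    Event("dev", "web", 4, "Node disk almost full"),
]
pref = Preference(min_level=4, channels=("sms", "email"))

for e in filter_events(events, cluster="prod"):
    print(e.message, "->", channels_for(e, pref) or "dashboard only")
```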

[Screen] Alerts
