Integrated Monitoring

Last updated 1 year ago

1. What types of resources can be monitored?

Cocktail Cloud utilizes over 200 metrics for resources and states in a multi-cluster environment, providing more than 100 monitoring panels.

The panels are organized into views for clusters, ingresses, ETCD, nodes, GPUs, and namespaces. An alarm/event page also lists alarms and events chronologically, giving a clear picture of the platform's status.

2. Where can monitoring information be accessed?

Monitoring information in Cocktail Cloud can be accessed in the left-hand [Monitoring] menu. Sub-menus include clusters, ingresses, ETCD, nodes, GPUs, namespaces, and alarms/events.

3. Checking Cluster Status

The platform offers up-to-date status information at the cluster level. Key status information provided in the cluster view includes:

Number of API server calls per second
CPU usage
Disk usage
Disk I/O speed
Memory usage
Restarted Pod tracking
Average request time over the last 10 minutes
Pod executions by status
Top 5 Pods with high CPU usage
Top 5 Pods with high memory usage
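
As an illustration of the "Top 5" panels above, the ranking can be sketched in a few lines of Python. This is a minimal sketch, not Cocktail Cloud's implementation: the function name `top_pods`, the pod names, and the usage values are all hypothetical sample data.

```python
from heapq import nlargest


def top_pods(usage, n=5):
    """Return the n (pod, value) pairs with the highest usage, largest first."""
    return nlargest(n, usage.items(), key=lambda item: item[1])


# Hypothetical CPU usage samples in cores, keyed by pod name.
cpu_cores = {
    "api-7d9f": 0.82,
    "worker-1": 1.40,
    "worker-2": 1.10,
    "db-0": 0.95,
    "cache-0": 0.30,
    "cron-x": 0.05,
    "web-5c": 0.60,
}

for pod, cores in top_pods(cpu_cores):
    print(f"{pod}: {cores:.2f} cores")
```

The same function works for memory if the dictionary holds memory values instead of CPU cores.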

4. Checking Ingress Status

Ingress exposes HTTP and HTTPS paths from outside the cluster to internal services. It provides configuration options for externally accessible URLs, load balancing traffic, SSL/TLS termination, and name-based virtual hosting. Ingress plays a crucial role in the network area of services, making multidimensional monitoring essential.

The status information provided in the integrated dashboard's Ingress view includes:

Ingress controller requests
Ingress controller connections
Ingress controller request success rate
Recent Ingress configuration reload success and failure
Ingress controller request trends
Ingress controller success rate trends
Network I/O trends
Average memory usage trends
Average CPU usage trends

5. Checking ETCD Status

The status information provided in the integrated dashboard's ETCD view includes:

Presence of ETCD leader
Number of recent leader changes
Number of recent leader change proposal failures
RPC ratio
Database usage
Node disk processing speed
Overall disk processing speed
Client traffic In/Out
ETCD server-specific processing status
Network usage
Snapshot processing speed

6. Checking Node Status

The status information provided in the integrated dashboard's Node view includes:

Cluster CPU usage frequency
Cluster memory usage
Cluster disk usage
Cluster network usage
Recent changes and current values of the file system's free space ratio
List of file systems and their usage

7. Checking GPU Status

The status information provided in the integrated dashboard's GPU view includes:

Average GPU utilization
GPU usage trends
Average GPU memory utilization
GPU memory usage trends
GPU temperature and power
GPUs/MIGs
Timeslicing

8. Checking Namespace Status

The status information provided in the integrated dashboard's Namespace view includes:

Number of containers
Namespace creation time
Total number of Pods in the namespace
Namespace PVC status
Namespace CPU allocation
Namespace memory allocation
Number of Pods running in the namespace

9. Checking Notification/Event History

Notifications for the monitoring metrics displayed in the integrated dashboard are delivered through the dashboard, SMS, and E-mail, according to user configuration. Users can filter and view metrics by cluster, namespace, and major resource group.

The dashboard shows events from the past hour, accumulated per minute. Each event includes a detailed description, so the cause can often be identified quickly from the event content alone.
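
The per-minute accumulation over the past hour can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name `events_per_minute`, the timestamps, and the way events are represented are hypothetical and are not taken from Cocktail Cloud.

```python
from collections import Counter
from datetime import datetime, timedelta


def events_per_minute(timestamps, now, window=timedelta(hours=1)):
    """Count events per minute for timestamps within `window` before `now`."""
    cutoff = now - window
    buckets = Counter()
    for ts in timestamps:
        if cutoff <= ts <= now:
            # Truncate to the start of the minute to form the bucket key.
            buckets[ts.replace(second=0, microsecond=0)] += 1
    return dict(sorted(buckets.items()))


# Hypothetical event timestamps.
now = datetime(2024, 1, 1, 12, 0, 0)
sample = [
    datetime(2024, 1, 1, 11, 58, 5),
    datetime(2024, 1, 1, 11, 58, 40),
    datetime(2024, 1, 1, 11, 59, 10),
    datetime(2024, 1, 1, 10, 30, 0),  # older than one hour, so ignored
]

counts = events_per_minute(sample, now)
for minute, count in counts.items():
    print(minute.strftime("%H:%M"), count)
```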

Each event is categorized into five levels of importance, and real-time notifications are sent through SMS or E-mail (or both) according to user preferences. Users can filter and view recently occurring events and notifications, with the option to retain data for up to one year based on user settings.
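
One way to picture level-based notification routing is the sketch below. It rests on assumptions: the source states only that there are five importance levels, so the level names, the `channels_for` function, and the idea of a per-channel minimum level are hypothetical and may differ from how Cocktail Cloud stores user preferences.

```python
from enum import IntEnum


class Level(IntEnum):
    # Hypothetical names; the source only says there are five levels.
    INFO = 1
    LOW = 2
    MEDIUM = 3
    HIGH = 4
    CRITICAL = 5


def channels_for(level, prefs):
    """Return the channels to notify for an event of the given level.

    `prefs` maps a channel name ("sms", "email") to the minimum level
    at which that channel should be used (an assumed preference model).
    """
    return sorted(ch for ch, minimum in prefs.items() if level >= minimum)


# Hypothetical user preferences: SMS for HIGH and above, E-mail for MEDIUM and above.
prefs = {"sms": Level.HIGH, "email": Level.MEDIUM}

print(channels_for(Level.CRITICAL, prefs))  # both channels
print(channels_for(Level.MEDIUM, prefs))    # E-mail only
print(channels_for(Level.INFO, prefs))      # no channel
```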
