Integrated Monitoring
ⓒ2023. Acornsoft Corp. All rights reserved.
Cocktail Cloud collects more than 200 metrics on resources and their states across a multi-cluster environment and presents them in more than 100 monitoring panels.
The panels are grouped into views for clusters, ingresses, ETCD, nodes, GPUs, and namespaces. An alarm/event page also lists alarms and events in chronological order, giving users a clear picture of their platform's status.
Monitoring information in Cocktail Cloud is available from the [Monitoring] menu on the left. Its sub-menus are clusters, ingresses, ETCD, nodes, GPUs, namespaces, and alarms/events.
The cluster view provides up-to-date status information at the cluster level. Key items include:
Number of API server calls per second
CPU usage
Disk usage
Disk I/O speed
Memory usage
Restarted Pod tracking
Average request time over the last 10 minutes
Pods by execution status
Top 5 Pods with high CPU usage
Top 5 Pods with high memory usage
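The two "Top 5" panels rank Pods by a single usage value and keep the highest five. A minimal Python sketch of that selection follows. It assumes per-Pod usage values have already been collected; `PodUsage` and `top_pods` are hypothetical names, and this is not how Cocktail Cloud itself computes the panels.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class PodUsage:
    # Hypothetical record of one Pod's current usage (not a Cocktail Cloud type).
    namespace: str
    name: str
    cpu_cores: float     # CPU usage in cores
    memory_bytes: int    # memory usage in bytes


def top_pods(pods, key, n=5):
    """Return the n Pods with the largest `key` value, highest first."""
    return heapq.nlargest(n, pods, key=key)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    samples = [
        PodUsage("default", f"web-{i}", cpu_cores=0.1 * i, memory_bytes=(7 - i) * 2**20)
        for i in range(8)
    ]
    print([p.name for p in top_pods(samples, key=lambda p: p.cpu_cores)])
    print([p.name for p in top_pods(samples, key=lambda p: p.memory_bytes)])
```

`heapq.nlargest` keeps only n candidates in a heap instead of sorting every Pod, and the same function serves both panels by changing `key`.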
An Ingress exposes HTTP and HTTPS routes from outside the cluster to services inside it. It can be configured to give services externally reachable URLs, load-balance traffic, terminate SSL/TLS, and provide name-based virtual hosting. Because Ingress sits on the network path of services, it needs to be monitored from several angles.
The Ingress view of the integrated dashboard provides the following status information:
Ingress controller requests
Ingress controller connections
Ingress controller request success rate
Success and failure of recent Ingress configuration reloads
Ingress controller request trends
Ingress controller success rate trends
Network I/O trends
Average memory usage trends
Average CPU usage trends
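The request success rate can be read as the share of requests that did not fail within a time window. The Python sketch below makes that concrete. It assumes, as a common convention rather than a documented Cocktail Cloud rule, that responses with status 400 or above count as failures; `success_rate` and its input (a mapping from HTTP status code to request count) are hypothetical.

```python
from typing import Mapping, Optional


def success_rate(requests_by_status: Mapping[int, int], failure_from: int = 400) -> Optional[float]:
    """Fraction of requests whose status code is below `failure_from`.

    Assumption: 4xx and 5xx responses are failures by default; pass
    failure_from=500 to count only server errors. Returns None when the
    window contains no requests, since the rate is undefined then.
    """
    total = sum(requests_by_status.values())
    if total == 0:
        return None
    ok = sum(count for status, count in requests_by_status.items() if status < failure_from)
    return ok / total


if __name__ == "__main__":
    window = {200: 950, 304: 20, 404: 20, 503: 10}  # made-up counts for one window
    print(success_rate(window))                    # 970 of 1000 requests succeeded
    print(success_rate(window, failure_from=500))  # 990 of 1000 if only 5xx fail
```

Computing this value for consecutive windows produces the kind of series a success rate trend panel plots.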
The ETCD view of the integrated dashboard provides the following status information:
Presence of an ETCD leader
Number of recent leader changes
Number of recent leader change proposal failures
RPC ratio
Database usage
Node disk processing speed
Overall disk processing speed
Client traffic In/Out
Processing status of each ETCD server
Network usage
Snapshot processing speed
The Node view of the integrated dashboard provides the following status information:
Cluster CPU usage frequency
Cluster memory usage
Cluster disk usage
Cluster network usage
Current value and recent changes of each file system's free-space ratio
List of file systems and their usage
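The free-space ratio is the fraction of a file system's capacity that is still free. As a local illustration only, the Python sketch below computes it with the standard library's `shutil.disk_usage`; the function names are hypothetical, and this is not how Cocktail Cloud collects data from cluster nodes.

```python
import shutil


def free_space_ratio(free_bytes: int, total_bytes: int) -> float:
    """Return the fraction of a file system that is still free (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def local_free_space_ratio(path: str = ".") -> float:
    """Free-space ratio of the file system containing `path` on this machine."""
    usage = shutil.disk_usage(path)  # named tuple of total, used, free (in bytes)
    return free_space_ratio(usage.free, usage.total)


if __name__ == "__main__":
    print(f"{local_free_space_ratio():.1%} free")
```

Sampling this ratio periodically gives both the current value and its recent changes.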
The GPU view of the integrated dashboard provides the following status information:
Average GPU utilization
GPU usage trends
Average GPU memory utilization
GPU memory usage trends
GPU temperature and power
GPUs/MIGs
Time-slicing
The Namespace view of the integrated dashboard provides the following status information:
Number of containers
Namespace creation time
Total number of Pods in the namespace
Namespace PVC status
Namespace CPU allocation
Namespace memory allocation
Number of Pods running in the namespace
Monitoring metrics from the integrated dashboard are delivered through the dashboard, SMS, and e-mail channels according to user settings. Users can filter metrics by cluster, namespace, and major resource group.
The dashboard shows events from the past hour, accumulated per minute, together with detailed event descriptions, so users can quickly identify the cause of an issue from the event content alone.
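Grouping the past hour's events by minute can be sketched in Python as follows. The `Event` type, the window `(now - 1 hour, now]`, and keying each bucket by the start of its minute are assumptions for illustration. Each bucket keeps the event descriptions, so the per-minute count (the bucket length) and the details stay together.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Event:
    # Hypothetical event record: when it happened and its description.
    timestamp: datetime
    message: str


def events_per_minute(events, now, window=timedelta(hours=1)):
    """Group events in (now - window, now] by minute, oldest minute first.

    Each value lists the event messages of that minute in time order.
    """
    start = now - window
    buckets = defaultdict(list)
    for event in sorted(events, key=lambda e: e.timestamp):
        if start < event.timestamp <= now:
            minute = event.timestamp.replace(second=0, microsecond=0)
            buckets[minute].append(event.message)
    return dict(sorted(buckets.items()))


if __name__ == "__main__":
    now = datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc)  # made-up example time
    events = [
        Event(now - timedelta(minutes=5, seconds=10), "Pod restarted"),
        Event(now - timedelta(minutes=1), "Node NotReady"),
        Event(now - timedelta(hours=2), "outside the window"),
    ]
    for minute, messages in events_per_minute(events, now).items():
        print(minute.isoformat(), len(messages), messages)
```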
Each event is assigned one of five importance levels, and real-time notifications are sent by SMS, e-mail, or both, according to user preferences. Users can filter recently occurring events and notifications, and data can be retained for up to one year depending on user settings.
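The routing described above can be sketched in Python as follows. Only three facts come from the text: there are five importance levels, the user chooses SMS and/or e-mail, and data is kept for at most one year. The level names, the `NotificationPrefs` type, the rule "notify when the level is at or above a user threshold", and the use of 365 days for one year are hypothetical choices for illustration, not Cocktail Cloud's API.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta
from enum import IntEnum


class Severity(IntEnum):
    # Placeholder names for the five importance levels, lowest to highest.
    INFO = 1
    LOW = 2
    MEDIUM = 3
    HIGH = 4
    CRITICAL = 5


MAX_RETENTION = timedelta(days=365)  # "up to one year", approximated as 365 days


@dataclass(frozen=True)
class NotificationPrefs:
    # Hypothetical per-user settings; field names are illustrative.
    min_severity: Severity   # notify only for events at or above this level
    retention: timedelta     # how long event/notification history is kept
    sms: bool = False
    email: bool = False

    def __post_init__(self):
        # Enforce the one-year ceiling on retained data.
        if not timedelta(0) < self.retention <= MAX_RETENTION:
            raise ValueError("retention must be positive and at most one year")


def channels_for(severity: Severity, prefs: NotificationPrefs) -> list[str]:
    """Return the channels that receive a real-time notification for this event."""
    if severity < prefs.min_severity:
        return []
    channels = []
    if prefs.sms:
        channels.append("sms")
    if prefs.email:
        channels.append("email")
    return channels
```

For example, with `NotificationPrefs(Severity.HIGH, timedelta(days=90), sms=True, email=True)`, a `CRITICAL` event is sent to both SMS and e-mail, while a `MEDIUM` event triggers no real-time notification.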