As hybrid cloud adoption grows, production environments urgently need a unified, loosely coupled monitoring and alerting scheme that responds rapidly.
Telemetry cannot be extended to provide unified monitoring of the underlying infrastructure (e.g. physical resources on hosts) and of Kubernetes resources. As cluster size increases, responding rapidly to alarms also becomes a key issue.
We propose an architecture that, by optimizing prometheus-operator, uniformly collects and manages all kinds of resources in multi-cloud scenarios and solves the problem of persisting monitoring data. It can alert on faults within seconds and provides automated deployment schemes for monitoring.
At present, we have applied it to a production environment with 500 hosts.
- Prometheus architecture
- Prometheus Operator
- Kubernetes
- Telemetry
- Docker
- Ansible