Evaluating monitoring solutions: Prometheus, Thanos, Mimir, VictoriaMetrics
--
Hierarchical federation
One (federated) Prometheus scrapes a (child) Prometheus for metrics. The child Prometheus has low retention, whereas the federated Prometheus stores long-term data on disk.
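A minimal sketch of what the scrape job on the federated Prometheus could look like; the job name, match[] selector, and child address are illustrative assumptions rather than the exact PoC config:

```yaml
# Hypothetical scrape config on the federated (global) Prometheus.
# It pulls series from the child's /federate endpoint.
scrape_configs:
  - job_name: "federate-child"
    honor_labels: true            # keep the child's original job/instance labels
    metrics_path: /federate
    params:
      "match[]":
        - '{job=~".+"}'           # illustrative: federate every job from the child
    static_configs:
      - targets:
          - "child-prometheus.monitoring.svc:9090"   # assumed child address
```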
Pros
- Simple setup
- Easy to maintain
- Quick to identify and remediate issues due to the small architecture
Cons
- Federated prometheus becomes a bottleneck
- Complicated to implement high availability
- Vertical scaling is the only way to alleviate resource issues
- No option to store long term metrics in object storage for retrieval
Resource consumption
The following dashboards showcase resource consumption with 1 million time series ingested into the child Prometheus at a rate of 5.5k samples per second.
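The series were generated with the avalanche load generator (the avalanche job and label_key_kkkkk_* labels appear in the error logs further down). A rough sketch of the kind of Deployment that could produce roughly 1 million series; the image tag and flag values are assumptions, not the exact PoC manifest:

```yaml
# Hypothetical avalanche Deployment: ~1,000 metrics x 1,000 series each ≈ 1M series.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: avalanche
  namespace: applications
spec:
  replicas: 1
  selector:
    matchLabels:
      app: avalanche
  template:
    metadata:
      labels:
        app: avalanche
    spec:
      containers:
        - name: avalanche
          image: quay.io/prometheuscommunity/avalanche:latest  # assumed image
          args:
            - --metric-count=1000
            - --series-count=1000
            - --label-count=10        # matches label_key_kkkkk_0..9 in the logs
            - --port=9001             # matches instance 10.244.1.4:9001 in the logs
          ports:
            - containerPort: 9001
              name: web
```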
Federated prometheus
Global prometheus that scrapes child prometheus for metric aggregation. The child prometheus has 1 million time series
Overview
Child prometheus
The child Prometheus exposes its metrics via a /federate endpoint. It has 1 million time series stored.
Grafana Mimir
Grafana Mimir is a project forked from Cortex. There are various methods for ingesting data, but for this PoC I chose Prometheus with remote write.
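A minimal sketch of the remote_write block used to ship samples into Mimir. The gateway URL matches the one that appears in the error logs below; the tenant header is an assumption (without it Mimir logs user=anonymous, as seen in the microservices-mode errors):

```yaml
# Prometheus remote_write pointing at the Mimir nginx gateway.
remote_write:
  - url: http://mimir-nginx.monitoring.svc:80/api/v1/push
    headers:
      X-Scope-OrgID: demo   # assumed tenant ID; matches user=demo in the monolithic logs
```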
Pros
- Easy to set up
- Baked in High Availability as all components are stateless
- Has caching component enabled by default
- Can deploy in two separate modes: monolithic and microservices
- Exposes an HTTP cardinality API
- Set up in one command with minimal configuration
- Runbooks available: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/
Cons
- As it stands, only Helm and Jsonnet deployment methods are available
- Complicated setup with many components
- Fairly new project — limited resources online for guidance
- Lack of documentation
- Not widely adopted by community
Microservice mode
Resource consumption
These metrics showcase 1 million time series stored in a Prometheus instance with remote write enabled to Grafana Mimir.
Prometheus with remote write
Mimir write components
The Writes resources dashboard shows CPU, memory, disk, and other resource utilization metrics. The dashboard isolates each service on the write path into its own section and displays the order in which a write request flows.
Overview
Ingester
Distributor
Mimir Read components
The Reads resources dashboard shows CPU, memory, disk, and other resource utilization metrics. The dashboard isolates each service on the read path into its own section and displays the order in which a read request flows.
Overview
Query frontend
Query scheduler
Querier
Store Gateway
Errors
Grafana not able to query ingesters:
["expanding series: too many unhealthy instances in the ring"](internal: rpc error: code = Code(500) desc = {"status":"error","errorType":"internal","error":"expanding series: too many unhealthy instances in the ring"})
ts=2023-08-09T15:51:51.83051639Z caller=grpc_logging.go:43 level=warn method=/cortex.Ingester/Push duration=7.904781ms err="rpc error: code = Code(400) desc = user=anonymous: the sample has been rejected because another sample with a more recent timestamp has already been ingested and out-of-order samples are not allowed (err-mimir-sample-out-of-order). The affected sample has timestamp 2023-08-09T15:49:29.549Z and is from series {**name**=\\"avalanche_metric_mmmmm_74_0\\", cluster=\\"monitoring-london\\", container=\\"avalanche\\", cycle_id=\\"149\\", endpoint=\\"web\\", instance=\\"10.244.1.4:9001\\", job=\\"avalanche\\", label_key_kkkkk_0=\\"label_val_vvvvv_0\\", label_key_kkkkk_1=\\"label_val_vvvvv_1\\", label_key_kkkkk_2=\\"label_val_vvvvv_2\\", label_key_kkkkk_3=\\"label_val_vvvvv_3\\", label_key_kkkkk_4=\\"label_val_vvvvv_4\\", label_key_kkkkk_5=\\"label_val_vvvvv_5\\", label_key_kkkkk_6=\\"label_val_vvvvv_6\\", label_key_kkkkk_7=\\"label_val_vvvvv_7\\", label_key_kkkkk_8=\\"label_val_vvvvv_8\\", label_key_kkkkk_9=\\"label_val_vvvvv_9\\", namespace=\\"applications\\", pod=\\"avalanche-86f998f95-bhzpc\\", prometheus=\\"monitoring/prometheus\\", prometheus_replica=\\"prometheus-prometheus-0\\", series_id=\\"28\\", service=\\"avalanche\\"}" msg=gRPC************************Distributor:************************
r="rpc error: code = Code(400) desc = failed pushing to ingester: user=anonymous: per-user series limit of 150000 exceeded (err-mimir-max-series-per-user). To adjust the related per-tenant limit, configure -ingester.max-global-series-per-user, or contact your service administrator.”Prometheus failing to push to ingester due to limits:
ts=2023-08-09T14:40:46.978Z caller=dedupe.go:112 component=remote level=error remote_name=b63d33 url=http://mimir-nginx.monitoring.svc:80/api/v1/push msg="non-recoverable error" count=1377 exemplarCount=0 err="server returned HTTP status 400 Bad Request: failed pushing to ingester: user=anonymous: per-user series limit of 150000 exceeded (err-mimir-max-series-per-user). To adjust the related per-tenant limit, configure -ingester.max-global-series-per-user, or contact your service administrator.”
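Both errors point at per-tenant limits. A hedged sketch of how they could be raised through the mimir-distributed Helm chart's structuredConfig; the values are illustrative, not what was tested here:

```yaml
# Illustrative Helm values for the mimir-distributed chart.
mimir:
  structuredConfig:
    limits:
      # default of 150000 caused err-mimir-max-series-per-user at 1M series
      max_global_series_per_user: 1500000
      # accept slightly old samples instead of err-mimir-sample-out-of-order
      out_of_order_time_window: 5m
```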
Monolithic mode
Runs all the components in a single binary, which can be horizontally scaled for redundancy. This setup has a Prometheus instance remote writing to the Mimir instances; the Prometheus instance has 1 million time series ingested.
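A hypothetical docker-compose sketch of this layout; the image tag is an assumption, and the memberlist and object-storage configuration that a real HA pair would need are omitted for brevity:

```yaml
# Two Mimir instances, each running every component in one process (-target=all),
# receiving remote_write from a single Prometheus.
version: "3"
services:
  mimir-1:
    image: grafana/mimir:2.9.0          # assumed version
    command: ["-config.file=/etc/mimir.yaml", "-target=all"]
    volumes:
      - ./mimir.yaml:/etc/mimir.yaml
  mimir-2:
    image: grafana/mimir:2.9.0
    command: ["-config.file=/etc/mimir.yaml", "-target=all"]
    volumes:
      - ./mimir.yaml:/etc/mimir.yaml
```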
Mimir instances
Memory
Prometheus instance
Errors
**mimir instance docker logs**
ts=2023-08-10T23:03:40.829794946Z caller=grpc_logging.go:43 level=warn method=/cortex.Ingester/Push duration=2.861639ms err="rpc error: code = Code(400) desc = user=demo: per-user series limit of 150000 exceeded (err-mimir-max-series-per-user). To adjust the related per-tenant limit, configure -ingester.max-global-series-per-user, or contact your service administrator." msg=gRPC
ts=2023-08-10T23:03:40.830739896Z caller=push.go:130 level=error user=demo msg="push error" err="rpc error: code = Code(400) desc = failed pushing to ingester: user=demo: per-user series limit of 150000 exceeded (err-mimir-max-series-per-user). To adjust the related per-tenant limit, configure -ingester.max-global-series-per-user, or contact your service administrator."
Thanos
Pros
- Widely adopted by community
- Stable and run in a significant number of production environments
- High availability global view
- Long term retention of metrics in object storage (see the objstore sketch below these lists)
Cons
- Complicated setup with many microservices
- Have to maintain both Prometheus and Thanos components
Concerns:
- Would be the only component at Permutive written in Jsonnet
- Queries through Thanos can be slow due to latency. Consider using the query-frontend component
- Consider adding distributed tracing to thanos to identify bottlenecks
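Long-term retention in object storage is handed to the sidecar and store gateway through an objstore config file (passed via --objstore.config-file). A minimal sketch; GCS is chosen only as an example provider and the bucket name is a placeholder:

```yaml
# objstore.yml consumed by thanos sidecar / store gateway.
type: GCS
config:
  bucket: thanos-metrics-longterm   # placeholder bucket name
```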
Stress testing Thanos
./thanosbench stress --workers 50 <thanos-store-endpoint>
Concerns
- The stress test produced an increased error spike, but the output does not detail which gRPC errors are being emitted
Managed service
Google offers a managed prometheus service (https://cloud.google.com/stackdriver/docs/managed-prometheus) which offloads the maintenance to them. Unfortunately, given the amount of metrics we’re ingesting, the costs seem too high to consider.
Our Ingestion rate p/s
ingestion rate (from 10/08/22)
- prometheus-ssd (a) => 120k p/s
- prometheus-sdk (b) => 600k p/s
- prometheus-api (c) => 250k p/s
- prometheus-infra (d) => 70k p/s
- ingestion rate (igr) => (a+b+c+d) => 1.04m p/s

time series
- prometheus-ssd (i) => 4.3m
- prometheus-sdk (j) => 12.5m
- prometheus-api (k) => 7m
- prometheus-infra (l) => 1.9m
- total time series (ts) => (i+j+k+l) => 25.7m

m = million, p/s = per second

Estimated cost: £42,562.26 per month
source: https://cloud.google.com/products/calculator#id=25cc6ca2-7dec-4cf0-96aa-f5ac1a4fd5fb
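Managed Service for Prometheus is billed primarily on samples ingested, which is why the quote is so high at our rate; roughly:

$$
1.04\times 10^{6}\ \tfrac{\text{samples}}{\text{s}} \times 86{,}400\ \tfrac{\text{s}}{\text{day}} \times 30\ \text{days} \approx 2.7\times 10^{12}\ \text{samples per month}
$$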
VictoriaMetrics
Pros
- Less CPU/Memory used for components
- Stellar documentation
- Support is great
Cons
- No object storage support for long-term metric retention
- Paid service, although an open-source version is available with limited features
- Scored low correctness for promQL vendor tests: https://promlabs.com/blog/2020/08/06/comparing-promql-correctness-across-vendors/
FAQ: https://github.com/VictoriaMetrics/VictoriaMetrics/wiki/FAQ
Conclusion
Hierarchical federation is the simplest approach to implement, but you will inevitably hit scalability issues once you start ingesting millions of time series into the global view.
Grafana Mimir provides high scalability with the option of object storage. It also provides HTTP APIs exposing cardinality tooling. Though it can scale well, there is a significant lack of documentation and community adoption, so it may be difficult to troubleshoot errors in the future.
Thanos provides high availability with long term storage and is trusted by the open source community. It has been a mature project for many years.
VictoriaMetrics claims to boast significant resource efficiency and cost compared to Mimir (https://victoriametrics.com/blog/mimir-benchmark/).
Google Managed Service satisfies all technical requirements and is maintained by Google's SRE team. Unfortunately, due to the significant amount of data we currently ingest, the price calculator quotes roughly £42k a month.
For now, Thanos is the stable option, but if funding is approved, the recommendation is to move to the Google managed service.