Evaluating monitoring solutions: Prometheus, Thanos, Mimir, VictoriaMetrics

Senna Semakula-Buuza
7 min read · Aug 29


Quick summary comparing monitoring options

Hierarchical federation

One (federated) Prometheus scrapes one or more (child) Prometheus instances for metrics. The child Prometheus instances have low retention, whereas the federated Prometheus stores long-term data on disk.
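For reference, a minimal sketch of the scrape job a federated Prometheus would run against a child's /federate endpoint; the job name, match[] selector and target address below are placeholders rather than the exact config used in this PoC:

```yaml
# prometheus.yml on the federated (global) Prometheus -- illustrative only
scrape_configs:
  - job_name: "federate"
    honor_labels: true            # keep the child's labels instead of overwriting them
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job=~".+"}'           # pulls everything; narrow this selector in practice
    static_configs:
      - targets:
          - "child-prometheus.monitoring.svc:9090"   # placeholder child address
```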

Pros

  • Simple setup
  • Easy to maintain
  • Quick to identify and remediate issues thanks to the small architecture

Cons

  • Federated prometheus becomes a bottleneck
  • Complicated to implement high availability
  • Limited to vertical scaling to alleviate resource issues
  • No option to store long term metrics in object storage for retrieval

Resource consumption

The following dashboards showcase resource consumption with 1 million time series ingested into the child Prometheus at a rate of 5.5k samples per second.
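For context, the synthetic load in this PoC is generated by avalanche (its job and container labels appear in the error logs later on). A rough sketch of the kind of Deployment that produces such a load; the image tag and flag values are assumptions and would need tuning to reach roughly 1 million series at ~5.5k samples per second:

```yaml
# Illustrative only -- image and flag values are assumptions, not the exact manifest used
apiVersion: apps/v1
kind: Deployment
metadata:
  name: avalanche
  namespace: applications
spec:
  replicas: 1
  selector:
    matchLabels:
      app: avalanche
  template:
    metadata:
      labels:
        app: avalanche
    spec:
      containers:
        - name: avalanche
          image: quay.io/prometheuscommunity/avalanche:main   # image/tag is an assumption
          args:
            - --metric-count=1000   # assumed: total series ≈ metric-count x series-count
            - --series-count=1000
            - --label-count=10
            - --port=9001           # matches the instance port seen in the logs
          ports:
            - name: web
              containerPort: 9001
```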

Federated prometheus

A global Prometheus that scrapes the child Prometheus for metric aggregation. The child Prometheus holds 1 million time series.

Overview

Child Prometheus (yellow) and federated Prometheus memory allocations (MB)
Federated Prometheus memory and CPU consumption

Child prometheus

The child Prometheus exposes a /federate endpoint serving its metrics. It has 1 million time series stored.

Child Prometheus memory and CPU consumption

Grafana Mimir

Grafana Mimir is a project forked from Cortex. There are various methods you can use to ingest data, but for this PoC I chose Prometheus with remote write.
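A minimal sketch of that remote write from the Prometheus side, pointing at the Mimir nginx gateway address that shows up in the error logs below; the X-Scope-OrgID tenant header and queue settings are assumptions:

```yaml
# prometheus.yml -- remote write to Mimir via its nginx gateway (illustrative)
remote_write:
  - url: http://mimir-nginx.monitoring.svc:80/api/v1/push
    headers:
      X-Scope-OrgID: demo          # placeholder tenant; omit if multi-tenancy is disabled
    queue_config:
      max_samples_per_send: 2000   # assumed tuning values, not production settings
      max_shards: 200
```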

Pros

  • Highly scalable, with long-term metric storage in object storage
  • HTTP APIs exposing cardinality tooling

Cons

  • As it stands, only Helm and Jsonnet deployment methods are available
  • Complicated setup with many components
  • Fairly new project, so limited resources online for guidance
  • Lack of documentation
  • Not yet widely adopted by the community

Microservice mode

Resource consumption

These metrics showcase 1 million time series stored in a Prometheus instance that has remote write enabled to Grafana Mimir.

Prometheus with remote write

Mimir write components

The Writes resources dashboard shows CPU, memory, disk, and other resource utilization metrics. The dashboard isolates each service on the write path into its own section and displays the order in which a write request flows.

Overview

Ingester

Distributor

Mimir Read components

The Reads resources dashboard shows CPU, memory, disk, and other resource utilization metrics. The dashboard isolates each service on the read path into its own section and displays the order in which a read request flows.

Overview

Query frontend

Query scheduler

Querier

Store Gateway

Errors

Grafana not able to query ingesters:
["expanding series: too many unhealthy instances in the ring"](internal: rpc error: code = Code(500) desc = {"status":"error","errorType":"internal","error":"expanding series: too many unhealthy instances in the ring"})
ts=2023-08-09T15:51:51.83051639Z caller=grpc_logging.go:43 level=warn method=/cortex.Ingester/Push duration=7.904781ms err="rpc error: code = Code(400) desc = user=anonymous: the sample has been rejected because another sample with a more recent timestamp has already been ingested and out-of-order samples are not allowed (err-mimir-sample-out-of-order). The affected sample has timestamp 2023-08-09T15:49:29.549Z and is from series {**name**=\\"avalanche_metric_mmmmm_74_0\\", cluster=\\"monitoring-london\\", container=\\"avalanche\\", cycle_id=\\"149\\", endpoint=\\"web\\", instance=\\"10.244.1.4:9001\\", job=\\"avalanche\\", label_key_kkkkk_0=\\"label_val_vvvvv_0\\", label_key_kkkkk_1=\\"label_val_vvvvv_1\\", label_key_kkkkk_2=\\"label_val_vvvvv_2\\", label_key_kkkkk_3=\\"label_val_vvvvv_3\\", label_key_kkkkk_4=\\"label_val_vvvvv_4\\", label_key_kkkkk_5=\\"label_val_vvvvv_5\\", label_key_kkkkk_6=\\"label_val_vvvvv_6\\", label_key_kkkkk_7=\\"label_val_vvvvv_7\\", label_key_kkkkk_8=\\"label_val_vvvvv_8\\", label_key_kkkkk_9=\\"label_val_vvvvv_9\\", namespace=\\"applications\\", pod=\\"avalanche-86f998f95-bhzpc\\", prometheus=\\"monitoring/prometheus\\", prometheus_replica=\\"prometheus-prometheus-0\\", series_id=\\"28\\", service=\\"avalanche\\"}" msg=gRPC************************Distributor:************************
r="rpc error: code = Code(400) desc = failed pushing to ingester: user=anonymous: per-user series limit of 150000 exceeded (err-mimir-max-series-per-user). To adjust the related per-tenant limit, configure -ingester.max-global-series-per-user, or contact your service administrator.”
Prometheus failing to push to ingester due to limits:
ts=2023-08-09T14:40:46.978Z caller=dedupe.go:112 component=remote level=error remote_name=b63d33 url=http://mimir-nginx.monitoring.svc:80/api/v1/push msg="non-recoverable error" count=1377 exemplarCount=0 err="server returned HTTP status 400 Bad Request: failed pushing to ingester: user=anonymous: per-user series limit of 150000 exceeded (err-mimir-max-series-per-user). To adjust the related per-tenant limit, configure -ingester.max-global-series-per-user, or contact your service administrator."
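Both failures are the default per-tenant series limit (150,000) being hit. A sketch of how it could be raised, assuming the YAML equivalents of the flag named in the error; the values are picked for this PoC scale, not recommendations:

```yaml
# Main Mimir config -- raise the default per-tenant limit (values are assumptions)
limits:
  max_global_series_per_user: 1500000   # default is 150000, which this test load exceeds

# Or per tenant, in the runtime overrides file:
overrides:
  demo:
    max_global_series_per_user: 1500000
```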

Monolithic mode

Monolithic mode runs all the components in a single binary, which can be horizontally scaled for redundancy. This setup has a Prometheus instance remote writing to Mimir instances; the Prometheus instance has 1 million time series ingested.
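For reference, a sketch of how a pair of monolithic Mimir instances can be brought up with Docker Compose; the image tag, ports and config path are assumptions:

```yaml
# docker-compose.yml -- two Mimir replicas in monolithic mode (illustrative)
version: "3.8"
services:
  mimir-1:
    image: grafana/mimir:latest            # pin a real version in practice
    command:
      - -target=all                        # monolithic mode: every component in one process
      - -config.file=/etc/mimir/mimir.yaml
    volumes:
      - ./mimir.yaml:/etc/mimir/mimir.yaml # both replicas also need a shared ring (memberlist) in this config for HA
    ports:
      - "8080:8080"                        # Mimir's default HTTP port
  mimir-2:
    image: grafana/mimir:latest
    command:
      - -target=all
      - -config.file=/etc/mimir/mimir.yaml
    volumes:
      - ./mimir.yaml:/etc/mimir/mimir.yaml
    ports:
      - "8081:8080"
```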

Mimir instances

Memory

Memory usage (MB)
memory allocations per second (MB)

Prometheus instance

Memory allocations
Memory allocations per second
CPU allocations per second

Errors

Mimir instance Docker logs:
ts=2023-08-10T23:03:40.829794946Z caller=grpc_logging.go:43 level=warn method=/cortex.Ingester/Push duration=2.861639ms err="rpc error: code = Code(400) desc = user=demo: per-user series limit of 150000 exceeded (err-mimir-max-series-per-user). To adjust the related per-tenant limit, configure -ingester.max-global-series-per-user, or contact your service administrator." msg=gRPC
ts=2023-08-10T23:03:40.830739896Z caller=push.go:130 level=error user=demo msg="push error" err="rpc error: code = Code(400) desc = failed pushing to ingester: user=demo: per-user series limit of 150000 exceeded (err-mimir-max-series-per-user). To adjust the related per-tenant limit, configure -ingester.max-global-series-per-user, or contact your service administrator."

Thanos

Pros

  • Widely adopted by the community
  • Stable and running in a significant number of production environments
  • Global view with high availability
  • Long-term retention of metrics in object storage (see the bucket config sketch after these lists)

Cons

  • Complicated setup with many microservices
  • Have to maintain both Prometheus and Thanos components
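The long-term retention listed under the pros comes from the object storage integration: the Thanos components that touch the bucket (sidecar, store gateway, compactor) all take an objstore configuration. A minimal sketch assuming a GCS bucket; the bucket name and credential handling are placeholders:

```yaml
# objstore.yml, passed to sidecar/store/compactor via --objstore.config-file
# (GCS is an assumption here; other backends such as S3 work the same way)
type: GCS
config:
  bucket: thanos-metrics               # placeholder bucket name
  # credentials can be inlined under `service_account`, or supplied via
  # GOOGLE_APPLICATION_CREDENTIALS on the pod
```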

Concerns:

  • It would be the only component at Permutive written in Jsonnet
  • Queries can be slow using Thanos due to latency; consider using query-frontend (a sketch follows this list)
  • Consider adding distributed tracing to Thanos to identify bottlenecks
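On the query latency concern, a sketch of what running query-frontend in front of the querier could look like; the flags exist in Thanos, but the image version, downstream address and values are assumptions:

```yaml
# Kubernetes container snippet for thanos query-frontend (illustrative values)
containers:
  - name: thanos-query-frontend
    image: quay.io/thanos/thanos:v0.32.2    # pin whichever release you actually run
    args:
      - query-frontend
      - --http-address=0.0.0.0:9090
      - --query-frontend.downstream-url=http://thanos-query.monitoring.svc:9090   # assumed querier address
      - --query-range.split-interval=24h    # split long range queries into day-sized chunks
      - --query-range.max-retries-per-request=5
      - --query-frontend.log-queries-longer-than=10s
```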

Stress testing Thanos

./thanosbench stress --workers 50 <thanos-store-endpoint>
Increase in error spike when load testing Thanos Store with 50 goroutines

Concerns

  • Increased error spike, but the dashboard does not detail which gRPC errors are being emitted
  • Running 500 goroutines OOM-killed the Thanos Store

Managed service

Google offers a managed Prometheus service (https://cloud.google.com/stackdriver/docs/managed-prometheus), which offloads the maintenance burden to them. Unfortunately, given the volume of metrics we're ingesting, the costs seem too high to consider.

Our ingestion rate (samples per second)

Ingestion rate (from 10/08/22)
prometheus-ssd   (a) => 120k p/s
prometheus-sdk   (b) => 600k p/s
prometheus-api   (c) => 250k p/s
prometheus-infra (d) => 70k p/s
ingestion rate (igr) => (a+b+c+d) => 1.04m p/s

Time series
prometheus-ssd   (i) => 4.3m
prometheus-sdk   (j) => 12.5m
prometheus-api   (k) => 7m
prometheus-infra (l) => 1.9m
total time series (ts) => (i+j+k+l) => 25.7m

m = million, p/s = samples per second
Estimated cost: £42,562.26 per month
source: https://cloud.google.com/products/calculator#id=25cc6ca2-7dec-4cf0-96aa-f5ac1a4fd5fb

VictoriaMetrics

Pros

  • Less CPU/memory used by components
  • Stellar documentation
  • Great support

Cons

  • See the FAQ for known trade-offs: https://github.com/VictoriaMetrics/VictoriaMetrics/wiki/FAQ

Conclusion

Hierarchical federation is the simplest method to achieve, but you will inevitably hit scalability issues once you start ingesting millions of time series into the global view.

Grafana Mimir provides high scalability with the option of remote storage. It also provides HTTP APIs exposing cardinality tooling. Though it scales well, there is a significant lack of documentation and adoption, so it may be difficult to troubleshoot errors in the future.

Thanos provides high availability with long term storage and is trusted by the open source community. It has been a mature project for many years.

VictoriaMetrics claims significantly better resource efficiency and cost compared to Mimir (https://victoriametrics.com/blog/mimir-benchmark/).

Google Managed Service satisfies all technical requirements and is maintained by Google's SRE team. Unfortunately, because we are currently ingesting a significant amount of data, the price calculator quotes roughly £42k a month.

For now, Thanos is a stable option, but if we get approved funding, the recommendation is to move to Google's managed service.
