A Tour Of Infrastructure Dashboards

Observability, monitoring, and alerting are essential for knowing when something is wrong, fixing it right away, and preventing it from happening again. For monitoring my infrastructure I use the Grafana stack, built from free, open source tools: Grafana, Prometheus, Loki, Promtail, and a set of exporters (Node Exporter, NGINX Prometheus Exporter, and the Varnish and Postgres metrics exporters).

Here is a look at some of the most useful dashboards I’m using today.

But first, here is a diagram of the architecture components and how they connect.

[Architecture diagram: NGINX (NGINX logs, NGINX Prometheus Exporter), Varnish Cache (JSON logs with ptid, url, status, ttlb; Varnish Metrics Exporter), Flask Origin (flask logs), Postgresql Database (postgresql logs with user, query, duration; Postgres Metrics Exporter), Node Exporter, Logrotate, a logs-to-database loader, Promtail, Loki, Prometheus, and the Grafana Server with Alerting, used by the Analyst. Because logs can be pumped into Postgres, it is possible to show log information within the web UI, for example helping a client understand why they are being rate limited, or how soon until they are rate limited.]

1) Computer System Metrics

I have 9 computers in my home network and 1 in the cloud. I am monitoring CPU, disk utilization and I/O, RAM utilization, network utilization, and temperature. This data is collected by Prometheus via Node Exporter.
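For illustration, here is a minimal sketch of pulling one of these data points out of Prometheus programmatically via its HTTP API. The Prometheus address and the exact PromQL expression are assumptions about my setup; Grafana issues equivalent queries behind each panel.

```python
# Minimal sketch: query Prometheus's HTTP API for per-host CPU utilization,
# assuming Prometheus listens on localhost:9090 (adjust for your setup).
import requests

PROMETHEUS = "http://localhost:9090"  # assumption: local Prometheus instance

# CPU busy % per instance, derived from node_exporter's node_cpu_seconds_total
query = '100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    instance = series["metric"].get("instance", "unknown")
    timestamp, value = series["value"]
    print(f"{instance}: {float(value):.1f}% CPU busy")
```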

2) Page Views

I collect Varnish logs via varnishncsa in JSON format. I collect literally every field you can: client IP, handling (hit vs. miss), TTFB, TTLB, whether or not the client sends a cookie, and so on. Because I want to do more analysis than you can do quickly with inefficient log queries, I load the logs into Postgres. What you are looking at is total site traffic categorized by the likelihood that the traffic is coming from a bot, plus a zoom in on requests coming from Googlebot, including cache hit vs. miss and server-side TTLB.
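For illustration, here is a rough sketch of the kind of loader that pumps the JSON log lines into Postgres with psycopg2. The field names, log path, and table schema shown here are illustrative assumptions rather than the exact production layout.

```python
# Minimal sketch of loading varnishncsa JSON log lines into Postgres.
# Field names, log path, and table schema are illustrative assumptions.
import json
import psycopg2

conn = psycopg2.connect("dbname=logs user=loader")  # assumed connection string
cur = conn.cursor()

rows = []
with open("/var/log/varnish/access.json.log") as fh:  # assumed log path
    for line in fh:
        rec = json.loads(line)
        rows.append((
            rec.get("client_ip"),
            rec.get("url"),
            rec.get("status"),
            rec.get("handling"),   # hit / miss / pass / synth
            rec.get("ttfb"),
            rec.get("ttlb"),
        ))

cur.executemany(
    """INSERT INTO varnish_requests
       (client_ip, url, status, handling, ttfb, ttlb)
       VALUES (%s, %s, %s, %s, %s, %s)""",
    rows,
)
conn.commit()
cur.close()
conn.close()
```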

3) Varnish Metrics

Here we can see a variety of metrics related to requests per second, cache hit ratio, cache utilization, synthetic responses, etc. This is really useful for gauging backend health, spotting spikes in traffic, and seeing how well the caching strategy matches the usage pattern.
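For anyone curious where the hit ratio number comes from, here is a minimal sketch that computes it straight from `varnishstat -j` output; the metrics exporter ships the same counters to Prometheus.

```python
# Minimal sketch: compute the cache hit ratio from `varnishstat -j`.
# Assumes varnishstat is on PATH and the host is running Varnish.
import json
import subprocess

raw = subprocess.run(["varnishstat", "-j"], capture_output=True, text=True, check=True)
stats = json.loads(raw.stdout)

# Newer Varnish versions nest counters under a "counters" key; older ones are flat.
counters = stats.get("counters", stats)

hits = counters["MAIN.cache_hit"]["value"]
misses = counters["MAIN.cache_miss"]["value"]
ratio = hits / (hits + misses) if (hits + misses) else 0.0
print(f"cache hit ratio: {ratio:.2%} ({hits} hits / {misses} misses)")
```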

4) NGINX Metrics

Behind the first caching layer, Varnish, sits a layer of NGINX. This is really useful because, unlike Varnish, NGINX has a persistent cache that survives a restart of the computer. Additionally, NGINX supports DNS-based backends and backend TLS, and it integrates nicely with Varnish for cache invalidation. Here we can monitor total requests, connections, and up status.
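Here is a small sketch of reading NGINX's stub_status endpoint directly, which is the same data the NGINX Prometheus Exporter scrapes. The URL is an assumption; stub_status has to be enabled in the NGINX config.

```python
# Minimal sketch: read NGINX's stub_status page and pull out the headline numbers.
import re
import requests

resp = requests.get("http://localhost/stub_status", timeout=5)  # assumed location
resp.raise_for_status()
text = resp.text

active = int(re.search(r"Active connections:\s+(\d+)", text).group(1))
accepts, handled, total_requests = map(int, re.search(
    r"\n\s*(\d+)\s+(\d+)\s+(\d+)", text).groups())

print(f"active connections: {active}")
print(f"accepted: {accepts}, handled: {handled}, total requests: {total_requests}")
```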

5) Scraper

Here is where I track the progress of jobs that collect data on 1 million URLs daily for My Company List. You can see how long different jobs are taking, whether anything is stuck, and whether overall progress is on track to finish before the day ends. I have reduced the jobs to a single task each, which has helped identify exactly where bottlenecks are. For example, there is separate tracking for downloading, parsing, and upserting data.
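As a sketch of what per-stage tracking can look like, here is one way to expose those counters to Prometheus with the prometheus_client library. The metric names and stages are illustrative assumptions; the real pipeline may report progress differently.

```python
# Sketch: expose per-stage scraper progress as Prometheus metrics.
# Metric names, stages, and the metrics port are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

URLS_PROCESSED = Counter("scraper_urls_processed_total",
                         "URLs processed, by pipeline stage", ["stage"])
STAGE_DURATION = Gauge("scraper_last_batch_seconds",
                       "Seconds spent on the last batch, by stage", ["stage"])

def run_stage(stage: str, batch: list) -> None:
    """Process one batch for a single-purpose job stage and record metrics."""
    start = time.monotonic()
    for _url in batch:
        time.sleep(random.uniform(0.001, 0.005))  # stand-in for real work
        URLS_PROCESSED.labels(stage=stage).inc()
    STAGE_DURATION.labels(stage=stage).set(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(9101)  # assumed metrics port, scraped by Prometheus
    while True:
        batch = [f"https://example.com/page/{i}" for i in range(100)]
        for stage in ("download", "parse", "upsert"):
            run_stage(stage, batch)
```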

6) Crawler

The crawler is a cache prewarming robot that traverses a domain, refreshing the cache. It also records non-200 responses, the linking structure, internal and external broken links, and client-side TTLB.
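Here is a minimal sketch of that idea: a breadth-first walk over a single domain that records status codes and client-side response time per URL. The start URL and page cap are assumptions, and the real crawler tracks more than this.

```python
# Minimal sketch of a cache-prewarming crawler: breadth-first walk of one domain,
# recording status code and client-side response time for each URL.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import time

import requests

START = "https://example.com/"  # assumed start URL
DOMAIN = urlparse(START).netloc

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

seen, queue, results = {START}, deque([START]), []
while queue and len(seen) < 500:           # assumed page cap
    url = queue.popleft()
    start = time.monotonic()
    resp = requests.get(url, timeout=10)
    ttlb = time.monotonic() - start        # client-side time to last byte
    results.append((url, resp.status_code, ttlb))
    if resp.status_code != 200:
        print(f"non-200: {resp.status_code} {url}")
        continue
    parser = LinkExtractor()
    parser.feed(resp.text)
    for href in parser.links:
        absolute = urljoin(url, href)
        if urlparse(absolute).netloc == DOMAIN and absolute not in seen:
            seen.add(absolute)
            queue.append(absolute)
```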

7) IP Cycler

The backend of the project runs on my home network, on a router and internet connection that are not the world's best. I am using Keep Connect to power cycle the router when the internet goes out, which restores the connection. However, the port forwarding invariably stops working after the router reboots, so I need to open the router UI and release and then renew the public IP. Because TP-Link has put in safeguards to prevent bots from changing router settings, I had to use Selenium and Python to do this via browser automation.
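Here is a heavily hedged sketch of what that automation looks like with Selenium. Every URL and element ID below is hypothetical; the real TP-Link admin UI has its own login flow and element IDs that the actual script navigates.

```python
# Hedged sketch of the release/renew automation with Selenium.
# All selectors and the router address are hypothetical placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("http://192.168.0.1")  # assumed router address
    wait = WebDriverWait(driver, 30)

    # Log in (hypothetical field IDs).
    wait.until(EC.presence_of_element_located((By.ID, "password"))).send_keys("ROUTER_PASSWORD")
    driver.find_element(By.ID, "login-btn").click()

    # Navigate to the WAN/Internet page, then click Release followed by Renew
    # (hypothetical IDs standing in for the real menu structure).
    wait.until(EC.element_to_be_clickable((By.ID, "internet-menu"))).click()
    wait.until(EC.element_to_be_clickable((By.ID, "release-btn"))).click()
    wait.until(EC.element_to_be_clickable((By.ID, "renew-btn"))).click()
finally:
    driver.quit()
```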

8) Postgres Active Connections

Here we can keep an eye on Postgres: the number of connections, the duration of different queries, and whether any queries are blocked.
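For reference, here is a rough sketch of the kind of SQL behind a dashboard like this, querying pg_stat_activity for connection counts, query durations, and blocked queries. The connection details are assumptions; Grafana runs similar queries through its Postgres data source.

```python
# Sketch: connection counts, long-running queries, and blocked queries
# straight from pg_stat_activity. Connection string is an assumption.
import psycopg2

conn = psycopg2.connect("dbname=postgres user=grafana")  # assumed credentials
cur = conn.cursor()

cur.execute("SELECT count(*) FROM pg_stat_activity WHERE state = 'active'")
print("active connections:", cur.fetchone()[0])

cur.execute("""
    SELECT pid, usename, now() - query_start AS duration,
           cardinality(pg_blocking_pids(pid)) > 0 AS blocked,
           left(query, 80) AS query
      FROM pg_stat_activity
     WHERE state = 'active'
  ORDER BY duration DESC NULLS LAST
""")
for pid, user, duration, blocked, query in cur.fetchall():
    flag = " [BLOCKED]" if blocked else ""
    print(f"{pid} {user} {duration}{flag}: {query}")

cur.close()
conn.close()
```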

9) Postgres Metrics

This is similar to the above, but the data source is Prometheus instead of Postgres directly. By using the Postgres Exporter, we can see a time series of data points that are only available as a point-in-time snapshot when querying the pg_stat tables directly.

10) Alerts

As issues come up, I create alerts that let me know when something is wrong, but that aren't so sensitive that they fire too often for issues that aren't severe enough to justify it.

