

Server monitoring without expensive tools is doable: for one or two boxes you do not need a $200-a-month observability platform billing you per host and per ingested gigabyte. The fix is three free, self-hosted pieces that catch the incidents that actually wake you up. Netdata gives you real-time per-server metrics from a single install. Uptime Kuma watches your endpoints from outside and pushes an alert to Telegram or Slack the moment one goes down. A small healthcheck endpoint in your app returns 200 only when the database and cache are reachable, so Uptime Kuma knows the difference between 'the box is up' and 'the box is actually serving requests'. I have run exactly this stack on production droplets for years, and it has caught a full disk at 3am long before a customer noticed.

Why not just buy a SaaS observability platform?

Because the pricing model is built for fleets, not for the two-server side project or the single VPS running a client's app. The hosted platforms charge per host, per active series, and per gigabyte of logs ingested, and the bill scales with traffic in exactly the moments you least want a surprise. I have watched a modest Laravel app's log volume triple during a campaign and turn a predictable invoice into a four-figure one. Self-host on a small box instead and the data never leaves the machine you already pay for, the agent costs a few percent of CPU, and there is no per-metric tax. The trade-off is that you own the upgrades and the backups of your monitoring stack, which is a fair price for one or two servers. If the cost of your wider AWS footprint is what is actually hurting, the monitoring bill is rarely the biggest line item, and I dug into the real culprits in my notes on reducing your AWS bill.

Which signals actually matter on a small box?

Most dashboards drown you in 300 charts so you stare at none of them. On a single server, four signals catch nearly every incident I have ever been paged for. Watch these and ignore the rest until you have a reason not to:

  • CPU steal (st in top): time the hypervisor stole from your VM to serve a noisy neighbour. Sustained steal above ~10% on a shared instance means your CPU numbers lie and your latency is someone else's fault, not your code's.
  • Memory and swap: low free RAM on its own is fine, because Linux fills spare memory with page cache; active swapping is the killer. Once a box is paging to disk, every request slows and the OOM killer is one bad query away from terminating your app or your database.
  • Disk free: the most common 3am page I get. Logs, an unrotated Nginx access log, or Docker image layers fill the root volume, writes start failing, and the database refuses connections. Alert at 80% used, not at 99%.
  • 5xx rate: the only signal here that reflects what users see. A spike in 500s usually precedes everything else by minutes and is your earliest honest warning that something broke.

CPU steal is the one people miss. On a $5 shared instance, a neighbour running a crypto miner can quietly halve your effective CPU, and without the steal metric you will spend an afternoon profiling code that was never slow.
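To make two of those signals concrete, here is a minimal shell sketch of how steal and disk usage can be read by hand. It assumes a Linux box with /proc/stat and POSIX df and awk. The function names are made up for illustration, and Netdata already charts both for you, so treat this as a way to see where the numbers come from rather than a replacement.

```shell
# Sketch only: hand-rolled checks for CPU steal and disk usage on Linux.
# Function names are hypothetical; thresholds follow the article (10% steal, 80% disk).

# steal_percent "<cpu line 1>" "<cpu line 2>"
# Takes two samples of the first line of /proc/stat and prints the integer
# percentage of CPU time stolen between them.
# Fields after "cpu": user nice system idle iowait irq softirq steal guest guest_nice
steal_percent() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, " "); split(b, y, " ")
    # Index 1 is the literal "cpu"; 2..9 are user..steal (guest time is already in user).
    for (i = 2; i <= 9; i++) { ta += x[i]; tb += y[i] }
    total = tb - ta
    if (total <= 0) { print 0; exit }
    printf "%d\n", ((y[9] - x[9]) * 100) / total
  }'
}

# disk_used_percent <mount point> -> prints e.g. 42
disk_used_percent() {
  # df -P prints a header plus one line; column 5 is capacity, e.g. "42%".
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# over_threshold <value> <limit> -> succeeds when value >= limit
over_threshold() {
  [ "$1" -ge "$2" ]
}

# Example use (not run here):
#   s1=$(head -n 1 /proc/stat); sleep 5; s2=$(head -n 1 /proc/stat)
#   over_threshold "$(steal_percent "$s1" "$s2")" 10 && echo "CPU steal is high"
#   over_threshold "$(disk_used_percent /)" 80 && echo "root volume above 80%"
```

Steal has to be measured as a difference between two samples because /proc/stat holds counters since boot, which is why a single reading tells you very little.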

How do I get real-time metrics in one install?

Netdata. One command and you have a live dashboard at port 19999 showing per-second CPU, memory, swap, disk, network, and — out of the box — CPU steal, all without writing a single config file. It auto-detects Nginx, MySQL, PostgreSQL, Redis, and Docker if they are running and starts charting them. Use the official kickstart installer rather than the distro package, because the apt version on Ubuntu lags badly behind and misses collectors:

install-netdata.sh
# Ubuntu 22.04 / 24.04 LTS — official kickstart installer
wget -O /tmp/netdata-kickstart.sh https://get.netdata.cloud/kickstart.sh
sh /tmp/netdata-kickstart.sh --stable-channel --disable-telemetry

# Dashboard is now live on port 19999
systemctl status netdata

# Do NOT expose 19999 to the public internet.
# Bind it to localhost and reach it over an SSH tunnel:
#   ssh -L 19999:localhost:19999 you@your-server
# then open http://localhost:19999 in your browser

That last comment is not optional advice. An open Netdata port leaks your entire infrastructure topology, software versions, and traffic patterns to anyone who scans it. Bind it to 127.0.0.1 in /etc/netdata/netdata.conf and tunnel in over SSH, or put it behind an authenticated reverse proxy. Treat the monitoring endpoint with the same suspicion as any other service, the same way I lay out in my server hardening basics.
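After changing the bind address (in netdata.conf this is the `bind to` setting under `[web]`), it is worth confirming what the port is really listening on. The check below is a small sketch that assumes the `ss` tool from iproute2 is available; `netdata_exposed` is a hypothetical helper name, and it only inspects listening sockets, not firewall rules.

```shell
# Sketch only: flag a Netdata listener that is reachable from beyond loopback.
# Reads `ss -ltn` style output on stdin; column 4 is "Local Address:Port".
# Succeeds (exit 0) when port 19999 listens on anything other than
# 127.0.0.1 or [::1]; fails (exit 1) when it is loopback-only or absent.
netdata_exposed() {
  awk '
    $4 ~ /:19999$/ {
      if ($4 !~ /^(127\.0\.0\.1|\[::1\]):19999$/) exposed = 1
    }
    END { exit (exposed ? 0 : 1) }
  '
}

# Example use (not run here):
#   ss -ltn | netdata_exposed && echo "WARNING: port 19999 is not bound to localhost"
```

A listener on 0.0.0.0, `*`, or `[::]` counts as exposed here even if a firewall happens to block it, which errs on the side of caution.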

When Netdata is not enough

Netdata's history is bounded by the memory and disk you give its built-in database. Older releases kept only about the last hour in RAM by default, while current ones write tiered data to disk: full per-second resolution for a limited window and coarser resolution further back. The moment you want long-term history beyond that, or you are watching more than two or three hosts and want one pane of glass, switch to Prometheus scraping each box plus Grafana for dashboards and retention. It is more moving parts to run and back up, so I only reach for it when the single-host real-time view genuinely stops answering my questions. For one or two servers, Netdata alone is the right amount of machinery.

Netdata's default view: per-second CPU, memory, disk and network from a single install, no config file required.

How do I get alerted when the site goes down?

Netdata tells you a metric crossed a threshold, but it runs on the box, so if the whole server dies it cannot page you. You need a watcher running somewhere else. Uptime Kuma is a self-hosted endpoint monitor that hits your URLs on an interval and pushes to Telegram, Slack, Discord, email, or a webhook the instant a check fails or recovers. Run it on a different host or a cheap separate instance, never on the server it is watching. The simplest way is a single Docker command:

run-uptime-kuma.sh
# Run on a SEPARATE host from the one you are monitoring.
docker run -d \
  --name uptime-kuma \
  --restart=unless-stopped \
  -p 3001:3001 \
  -v uptime-kuma-data:/app/data \
  louislam/uptime-kuma:1

# Open http://your-host:3001, create the admin account, then add a
# monitor of type HTTP(s) pointing at https://yourapp.com/healthz
# Set the expected status code to 200 and the interval to 60 seconds.

Point Uptime Kuma at a healthcheck route, not your homepage. A homepage can return 200 while the database is on fire because it serves a cached page. A dedicated /healthz endpoint should return 200 only when the things the app depends on are actually reachable — database and cache — and a 503 otherwise, so a failed check means something real.

routes/web.php
use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Route;

Route::get('/healthz', function () {
    $checks = [];

    // Can we actually reach the database?
    try {
        DB::connection()->getPdo()->query('SELECT 1');
        $checks['database'] = 'ok';
    } catch (\Throwable $e) {
        $checks['database'] = 'fail';
    }

    // Can we reach the cache (Redis)?
    try {
        Cache::store()->put('healthz', '1', 5);
        $checks['cache'] = Cache::store()->get('healthz') === '1' ? 'ok' : 'fail';
    } catch (\Throwable $e) {
        $checks['cache'] = 'fail';
    }

    $healthy = ! in_array('fail', $checks, true);

    // 200 only when every dependency is reachable; 503 otherwise.
    return response()->json($checks, $healthy ? 200 : 503);
});

Keep the dependency list short and the timeouts tight. A healthcheck that itself hangs on a slow database connection turns a degraded service into a hard outage in the eyes of your monitor, which is the opposite of what you want. Check only the dependencies whose absence means the app genuinely cannot serve a request.
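You can see how a monitor experiences a slow healthcheck by probing the endpoint by hand with a hard timeout. This is a rough sketch using curl, with hypothetical helper names and the 5-second limit chosen only as an example; Uptime Kuma has its own timeout setting and does not use this script.

```shell
# Sketch only: probe a healthcheck URL the way an external monitor would.

# probe_status <url>
# Prints the HTTP status code. curl reports 000 when the request fails
# or exceeds --max-time, so a hung endpoint shows up as 000.
probe_status() {
  curl --silent --output /dev/null --max-time 5 --write-out '%{http_code}' "$1" || true
}

# is_healthy <status> -> succeeds only for an exact 200
is_healthy() {
  [ "$1" = "200" ]
}

# Example use (not run here):
#   code=$(probe_status https://yourapp.com/healthz)
#   is_healthy "$code" || echo "healthz returned $code"
```

From the outside, a 503 and a timeout look equally down, which is why the endpoint itself needs to fail fast instead of waiting on a slow dependency.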

What about logs and error spikes?

Metrics tell you a box is unhealthy; logs tell you why. For a small setup you do not need a full ELK or Loki cluster on day one. At minimum, ship your application and Nginx logs somewhere durable off the box so a disk-full incident does not also destroy the evidence, and alert on the rate of errors rather than on individual lines. A single 500 is noise. Fifty 500s in a minute is an incident, and that rate spike is what should trigger a notification. This is where disciplined error handling pays off directly — if your app catches, classifies, and logs exceptions consistently, the 5xx rate becomes a clean signal instead of a wall of stack traces. My PHP error handling patterns post covers the structure I use so that errors are countable rather than just loud.
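Counting errors per minute needs very little tooling. The sketch below assumes Nginx's default combined log format, where the status code is the ninth whitespace-separated field for an ordinary three-part request line. The function name and the log path in the usage comment are illustrative, and the threshold of fifty is the figure from the paragraph above.

```shell
# Sketch only: count 5xx responses logged during one given minute.

# count_5xx_in_minute <dd/Mon/yyyy:HH:MM>
# Reads combined-format access-log lines on stdin and prints how many
# have a 5xx status and a timestamp starting with the given minute.
count_5xx_in_minute() {
  # Field 4 looks like "[10/Oct/2024:13:55:01", so match on "[" plus the minute.
  awk -v m="[$1" '
    index($4, m) == 1 && $9 ~ /^5[0-9][0-9]$/ { n++ }
    END { print n + 0 }
  '
}

# Example use (not run here; `date -d` is GNU date, and LC_ALL=C keeps
# the month name in English to match the log):
#   minute=$(LC_ALL=C date -d '1 minute ago' '+%d/%b/%Y:%H:%M')
#   n=$(count_5xx_in_minute "$minute" < /var/log/nginx/access.log)
#   [ "$n" -ge 50 ] && echo "5xx spike: $n errors during $minute"
```

Run from cron once a minute, this gives you a rate rather than a stream of individual errors, which is the distinction that keeps alerts meaningful.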

"Monitoring you do not look at is not monitoring, it is theatre. Four signals you check beat four hundred you ignore."
Md Raihan Hasan

Putting the stack together

The whole thing fits on the back of an envelope: Netdata on each server for live metrics and CPU steal, Uptime Kuma on a separate host hitting a real /healthz endpoint and pushing to Telegram, application and Nginx logs shipped off the box with an alert on error-rate spikes. That is metrics, uptime, and alerting for the cost of a small instance and an afternoon of setup. None of it phones home, none of it bills per gigabyte, and all of it is yours to back up and upgrade. For a startup, a side project, or a handful of client servers, that combination has caught every real incident I have had — disk full, a database that stopped accepting connections, a deploy that started returning 500s — minutes before anyone outside would have noticed. Start with the four signals that matter, alert on rates instead of single events, and only graduate to Prometheus and Grafana the day a single host genuinely stops answering your questions.