When metrics aren't the whole story
In the previous post, we set up metrics collection with Prometheus. We can now see what is happening on our devices — a sensor is failing, the WiFi signal is degrading, or the main loop is running slow. But metrics alone don't tell us why. When a device reboots, the uptime metric resets — but what caused the reboot? When an entity fails, the _failed metric goes to 1 — but what error occurred?
This is where logs come in. While metrics are great at telling you something is wrong, logs are what you reach for when you need to understand the context around it. ESPHome devices already produce logs — you've probably seen them scrolling by in the ESPHome dashboard or through esphome logs. The problem is that they're ephemeral: once the device reboots, they're gone. And if you weren't looking at the right moment, you missed them.
Loki is a log aggregation system built by the Grafana team. It's designed to be efficient and easy to operate — it indexes logs by labels (much like Prometheus does with metrics) rather than by their full content. This makes it a natural companion to a Prometheus-based monitoring stack.
Configuring logging
ESPHome's logger component controls what gets logged and at which verbosity. By default, it logs at DEBUG level, which is quite verbose. The available levels, from most to least verbose, are: VERY_VERBOSE, VERBOSE, DEBUG, INFO, WARN, ERROR, NONE.
For production use, DEBUG is usually too noisy — you'd be shipping a lot of data that rarely helps. INFO is a good baseline:
```yaml
logger:
  level: INFO
```

Where this gets more interesting is per-component log levels. You might want most of the system at INFO, but keep a specific component at DEBUG if you're investigating something, or silence a particularly chatty one:
```yaml
logger:
  level: INFO
  logs:
    component: ERROR
    mqtt: INFO
    sensor: WARN
```

Accessing logs
ESPHome devices produce logs, but to get them into something like Loki we first need a way to get them off the device. There are a few options.
Web server event stream
If you already have the web_server component enabled (which you do if you followed the previous post), your devices expose a /events endpoint that streams log messages as Server-Sent Events. This is the same mechanism the ESPHome dashboard uses to display live logs.
You can try it out with a simple curl:
```bash
curl http://rack-fan.example.com/events
```

This will stream log lines to your terminal as they're produced. It's handy for quick debugging, but not practical for log aggregation: it requires something to actively connect to each device and consume the stream. It's pull-based, which means you need to maintain a list of devices to scrape, similar to the Prometheus service discovery challenge from the previous post.
Syslog
A better approach for log aggregation is syslog. ESPHome has a syslog component that sends log messages over UDP to a syslog receiver. This is push-based — the device sends logs as they're produced, without anything needing to poll it. It relies on the udp component for transport:
```yaml
udp:
  - id: syslog_client
    addresses:
      - your-syslog-receiver.example.com
    port: 1514

syslog:
  udp_id: syslog_client
  time_id: homeassistant_time
```

Aggregating in Loki
Now that our devices are shipping syslog, we need something on the receiving end to collect, process, and forward those logs to Loki. Grafana Alloy is a good fit — it uses a pipeline model where components receive data, process it, and forward it to the next stage. Here's an example pipeline:
```alloy
// 1. Receive syslog messages from ESPHome devices
loki.source.syslog "esphome" {
  listener {
    address       = "0.0.0.0:1514"
    protocol      = "udp"
    syslog_format = "rfc3164"
    labels        = {
      job = "esphome",
    }
  }
  relabel_rules = loki.relabel.esphome.rules
  forward_to    = [
    loki.process.esphome.receiver,
  ]
}

// 2. Extract the device name from the syslog hostname
loki.relabel "esphome" {
  rule {
    source_labels = ["__syslog_message_hostname"]
    target_label  = "node"
  }
  forward_to = []
}

// 3. Extract the log level from ESPHome's format ([I], [W], [E]...)
loki.process "esphome" {
  stage.regex {
    expression = "^\\[(?P<level_letter>[CDIWEV])\\]"
  }
  stage.template {
    source   = "level"
    template = `{{ if eq .level_letter "D" }}debug{{ else if eq .level_letter "I" }}info{{ else if eq .level_letter "W" }}warning{{ else if eq .level_letter "E" }}error{{ else if eq .level_letter "V" }}trace{{ else if eq .level_letter "C" }}info{{ end }}`
  }
  stage.structured_metadata {
    values = {
      level = "",
    }
  }
  forward_to = [
    loki.write.default.receiver,
  ]
}

// 4. Ship to Loki
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}
```

Let's walk through the pipeline:
- `loki.source.syslog` listens for UDP syslog messages in RFC3164 format and attaches a `job="esphome"` label to everything it receives.
- `loki.relabel` maps the syslog hostname (which ESPHome sets to the device's node name) to a `node` label, keeping it consistent with what we set up for Prometheus metrics in the previous post. Note that `relabel_rules` is referenced by the source, which applies the rules before forwarding, even though the block is defined separately. The `forward_to = []` on the relabel block itself is required by Alloy but unused, since the source references it via `relabel_rules` rather than chaining through `forward_to`.
- `loki.process` extracts the log level from ESPHome's single-character format (`[D]`, `[I]`, `[W]`, `[E]`, `[V]`, `[C]`) and maps it to a standard level name. This is stored as structured metadata rather than a label: Loki and Grafana can filter on it, but it doesn't increase label cardinality.
- `loki.write` ships the processed logs to Loki.
Labels
A note on labels: Loki indexes by labels, not by log content — so the labels determine how efficiently you can query. The pipeline above uses two labels:
- `node`: the device name (e.g., `rack-fan`, `bedroom-light`)
- `job`: `esphome` (keeps it consistent with your Prometheus job name)
Keep the label set small. In Loki, high cardinality creates too many small streams, which hurts both write and query performance. Things like log levels or component names belong in the log line itself (or as structured metadata, like we did above), where you can filter them at query time with LogQL.
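To see why this matters, a quick back-of-the-envelope calculation (the fleet size and component count are made-up numbers):

```python
# Loki creates one stream per unique label combination, so labels multiply.
devices = 20      # hypothetical fleet size
levels = 5        # debug, info, warning, error, trace
components = 30   # sensor, wifi, mqtt, ...

streams_with_node_only = devices
streams_if_all_were_labels = devices * levels * components

print(streams_with_node_only)       # 20 streams
print(streams_if_all_were_labels)   # 3000 streams
```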
Alternatives to Alloy
Alloy isn't the only option. Since ESPHome speaks standard syslog, anything that can receive syslog can work as a first hop:
- Fluentd or Fluent Bit — can receive syslog and forward to a variety of backends, including Loki, Elasticsearch, or cloud-hosted solutions.
- Vector — similar to Alloy in its pipeline model, with syslog as a source and many supported sinks.
- Graylog — a full log management platform with built-in syslog support, if you'd rather have an all-in-one solution.
The ESPHome and syslog configuration from this post remains the same regardless of what you use on the receiving end.
Querying logs
With logs flowing into Loki, you can query them from Grafana using LogQL. If you went with one of the alternatives mentioned above, the query language will likely differ, but the concepts should apply just the same. The basics of LogQL are straightforward — you select a stream by labels, then optionally filter the log lines.
All logs from a specific device:
```logql
{job="esphome", node="rack-fan"}
```

Only errors across all devices, using the structured metadata level we extracted in the Alloy pipeline:
```logql
{job="esphome"} | level="error"
```

Logs from a specific component:
```logql
{job="esphome", node="rack-fan"} |= "[sensor]"
```

LogQL also supports more advanced filtering. For example, to find all WiFi disconnection events:
```logql
{job="esphome"} |~ "WiFi .*(disconnected|connection lost)"
```

The big picture
With both metrics and logs in place, the observability story for our ESPHome devices is looking much more complete. Metrics tell us what is happening — a device went down, a sensor is failing, memory is running low. Logs tell us why — the I2C bus had a timeout, WiFi authentication failed, or OTA triggered a reboot.
The two also reinforce each other. An alert from Prometheus gives you the when and what; Loki gives you the why. And because both use the same node label, cross-referencing between them is straightforward.
In the next post, we'll close the loop by setting up continuous delivery — automating firmware builds and deployments so that our devices stay up to date without manual intervention.