Setting up system monitoring using Grafana, InfluxDB and Telegraf
Introduction
There are many options available for log aggregation and system monitoring. In this section we are going to look at a tech stack made up of Telegraf, InfluxDB and Grafana.
Telegraf is a log / metric collection and aggregation agent. It is the component that sits on the target machine (the host where your mission-critical applications run) and continuously sends metrics data to the other component, InfluxDB.
InfluxDB is a time-series database that Telegraf writes metrics to, and it is typically where all your data is stored. Like any other database, you can connect to it with a client (influx) and write queries to pull out the information relevant to you.
Grafana is a front-end web application, written using React, that displays all your metric information. All you need to do is select your database, write a query and pick your widget.
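To make the hand-off between Telegraf and InfluxDB a little more concrete, here is a minimal Python sketch of the text format (InfluxDB line protocol) in which a single CPU metric point is written. The helper names and the host name "web01" are made up for illustration, and the sketch assumes float-only fields and a measurement name without special characters.

```python
from typing import Dict


def escape_key(text: str) -> str:
    """Escape commas, equals signs and spaces, which act as separators."""
    for ch in (",", "=", " "):
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(measurement: str, tags: Dict[str, str],
                     fields: Dict[str, float], timestamp_ns: int) -> str:
    """Build one line: measurement,tag=value field=value timestamp."""
    # Tags are sorted by key, which InfluxDB recommends for write performance.
    tag_part = "".join(
        f",{escape_key(k)}={escape_key(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{escape_key(k)}={float(v)}" for k, v in fields.items()
    )
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical sample point; the values are invented for illustration.
line = to_line_protocol(
    "cpu",
    {"host": "web01", "cpu": "cpu-total"},
    {"usage_idle": 92.5, "usage_system": 3.0},
    1580000000000000000,
)
print(line)
```

In practice Telegraf produces these lines for you; the sketch only shows where the measurement ("cpu"), the tags ("host", "cpu") and the fields ("usage_idle", "usage_system") used later in the dashboard query come from.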
Setting up on a Windows environment.
Now that we are acquainted with the tools, let's set up our application monitoring on a Windows environment.
Install InfluxDB
Setting up InfluxDB is pretty straightforward. After you install InfluxDB, ensure the instance is running and note down its IP address.
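One way to confirm that the instance is running is to call its /ping endpoint, which InfluxDB 1.x answers with HTTP 204. The following is a small sketch using only the Python standard library; it assumes the default port 8086 and plain HTTP, and the function names are illustrative.

```python
import urllib.error
import urllib.request


def ping_url(host: str, port: int = 8086) -> str:
    """Build the health-check URL (8086 is the InfluxDB default port)."""
    return f"http://{host}:{port}/ping"


def influxdb_is_up(host: str, port: int = 8086, timeout: float = 3.0) -> bool:
    """Return True if the server answers /ping with 204 No Content."""
    try:
        with urllib.request.urlopen(ping_url(host, port), timeout=timeout) as resp:
            return resp.status == 204
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar errors.
        return False
```

Calling influxdb_is_up() with the IP address you noted down should return True once the service has started.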
Setting up Telegraf
Telegraf is relatively easy to install. Just follow the installation instructions below:
wget https://dl.influxdata.com/telegraf/releases/telegraf-1.13.2_windows_amd64.zip
unzip telegraf-1.13.2_windows_amd64.zip
Configuring Telegraf - specifying what we want to send
The critical part of Telegraf is its configuration. This standalone executable can collect general metrics like memory, CPU usage and network traffic, and it supports plugins too. Plugins include Docker monitoring, for example (useful when you would like to monitor Docker-based applications running on the host).
The key configurations that we normally set here are:
a) The InfluxDB database server address (DNS name).
b) Metrics of interest - we specify these by editing a file called telegraf.conf. We can even define how often we would like to send the data across.
c) Agent logging behaviour.
You can specify a different database to store all your metrics data, and Telegraf will create it automatically for you. Normally it is better to do that, for performance reasons.
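The three settings above map onto three sections of telegraf.conf. As a rough sketch, the Python helper below renders a minimal configuration; the function name and default values are assumptions for illustration, and a real telegraf.conf generated by Telegraf contains many more options.

```python
def render_telegraf_conf(influx_url: str, database: str,
                         interval: str = "10s", logfile: str = "") -> str:
    """Render a minimal telegraf.conf as text."""
    lines = [
        "[agent]",
        # (b) how often metrics are collected
        f'  interval = "{interval}"',
        # (c) logging: an empty logfile means Telegraf logs to stderr.
        # For a Windows path, use forward slashes to avoid TOML escapes.
        f'  logfile = "{logfile}"',
        "",
        "[[outputs.influxdb]]",
        # (a) where the InfluxDB server lives
        f'  urls = ["{influx_url}"]',
        # Target database; Telegraf creates it if it does not exist.
        f'  database = "{database}"',
        "",
        # (b) metrics of interest: CPU and memory input plugins
        "[[inputs.cpu]]",
        "  percpu = true",
        "  totalcpu = true",
        "",
        "[[inputs.mem]]",
        "",
    ]
    return "\n".join(lines)


# Example values: a local InfluxDB and the database name used in the
# dashboard query later in this post.
conf = render_telegraf_conf("http://127.0.0.1:8086", "docker_telegraf")
print(conf)
```

You can paste the printed text into telegraf.conf or simply use it as a checklist of the sections to edit by hand.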
Setting up Grafana
Grafana can be run as a Docker container, as a standalone application (macOS/Windows/Debian/Ubuntu) or in the cloud. Grafana comes in two different editions: a free edition and an enterprise edition.
To set up the standalone version of Grafana, you can follow the instructions given on the website.
Building your dashboard
Telegraf will do all the hard work of collecting metrics and sending them to your InfluxDB. To build our simple dashboard, we are going to show processor usage over time, and it is going to look something like this.
On the main page, click on "+" to create a dashboard. The idea is to:
1. Get data for a widget - by running a query, hence the "Add Query" command below. This is where you specify your query, which is SQL-like (with a twist, as you might need some understanding of the InfluxDB query language).
So click on "Add Query" and insert the following text, as shown in the diagram below:
SELECT mean("usage_system") AS "system", mean("usage_iowait") AS "iowait", mean("usage_user") AS "user", mean("usage_idle") AS "idle" FROM "docker_telegraf"."autogen"."cpu" WHERE host = '$hostname' AND "cpu" = 'cpu-total' GROUP BY time($timeinterval) FILL(previous)
Sometimes the query can get complicated; you can always use "Query Inspector" to debug or troubleshoot it. When you have your final query, just click on "Query Inspector", located at the top right, and a small window pops up.
As you can see from the query above, $timeinterval and $hostname are variables defined by the user. A variable can take a simple string value like "ABC" or a list of strings.
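To see what the final query looks like once the variables are filled in, the sketch below mimics the substitution with Python's string.Template. It only approximates what Grafana does internally, uses a shortened version of the query, covers the single-value case only, and the value "1m" is just an example.

```python
from string import Template

# Shortened version of the dashboard query, with the same two variables.
QUERY = Template(
    'SELECT mean("usage_system") AS "system", mean("usage_idle") AS "idle" '
    'FROM "docker_telegraf"."autogen"."cpu" '
    "WHERE host = '$hostname' AND \"cpu\" = 'cpu-total' "
    "GROUP BY time($timeinterval) FILL(previous)"
)


def expand(hostname: str, timeinterval: str) -> str:
    """Replace $hostname and $timeinterval with concrete values."""
    return QUERY.substitute(hostname=hostname, timeinterval=timeinterval)


expanded = expand("ABC", "1m")
print(expanded)
```

The expanded text is the kind of query you should expect to see in "Query Inspector", which makes it easier to spot a variable that was not set.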
2. Choose your visualization - which could be a panel, graph, stat (which is a panel) or another cool visualization. Click on the icon below (which is located on your far left).
The graph widget is the only widget that supports alerts at the moment. If you use the stat widget, you can only show information but cannot generate alerts.
Some notes:
Alerts are saved in a scheduling engine in Grafana and are isolated from the UI.
Queries that use variables are not supported in alerts. You cannot use a variable in your query if you are planning to use it for an alert.
3. Generate an alert - by defining what will trigger this alert.
To create an alert, click on the alarm bell icon and then click "Create Alert".
When an alert is triggered, there are many different notification channels that you can choose from, like Microsoft Teams, email or even Slack. Let's go ahead and define some alerts worth looking at.
The "alert" screen looks like the diagram below. Let's try to understand how the alert query window is structured.
The purple box allows us to configure what happens if there is no data available, for example because a Telegraf agent went down. Please note this is the first criterion that gets evaluated. If there is no data, there is no point in evaluating the other boxes.
The yellow box allows us to define the monitoring frequency over a period of time. In this case, we are saying: monitor every 1 minute for a period of 7m before checking whether the rule (green box) matches. If it does, fire an alert.
The green box is our criteria. If the data matches our criteria, then fire an alert.
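The three boxes can be read as a small decision procedure. Below is a simplified Python model of that reading, not Grafana's actual scheduler: the function names, the 80% threshold and the interpretation that the rule has to match on each of seven consecutive one-minute checks are all assumptions made for illustration.

```python
from typing import Sequence


def evaluate(samples: Sequence[float], threshold: float) -> str:
    """One evaluation pass: the no-data check comes first (purple box),
    then the threshold condition on the average (green box)."""
    if not samples:
        return "no_data"
    average = sum(samples) / len(samples)
    return "breach" if average > threshold else "ok"


def alert_state(passes: Sequence[str], for_passes: int) -> str:
    """Combine the pass outcomes (yellow box): fire only after the
    condition has matched on `for_passes` consecutive passes."""
    streak = 0
    for outcome in passes:
        streak = streak + 1 if outcome == "breach" else 0
    if passes and passes[-1] == "no_data":
        return "no_data"
    if streak >= for_passes:
        return "alerting"
    return "pending" if streak > 0 else "ok"


# Hypothetical CPU usage samples (percent), one list per one-minute pass.
history = [evaluate(minute, threshold=80.0) for minute in [[91.0, 95.0]] * 7]
print(alert_state(history, for_passes=7))
```

With seven breaching passes in a row the model reports "alerting"; with fewer it stays "pending", and an empty pass short-circuits to "no_data" without looking at the threshold.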
There are many details not included in this post to keep it simple. Check out our next post about how alerts in Grafana get triggered.
The final output from a dashboard could look something like this:
What's next
We will set up Grafana to monitor data / metrics from Azure Monitor using the Kusto Query Language.