Guide and Explanation for Metrics in Dropwizard
Every application should have a series of continuously updated metrics. This allows for real-time insights into, and alerts on, a service. The Java web framework, Dropwizard, which I’ve written about before, comes with the venerable Metrics library and about 75 predefined metrics baked in. All you need to do is persist and interpret these metrics, while adding your own.
Below, I walk through some of the default metrics, so you know what they are and which ones to monitor. The intended audience for this article is developers writing Dropwizard applications or those who administer Dropwizard applications.
Counters
Counters are the easiest metric to understand: they are a key-value pair where each value is a number that is incremented or decremented. I would not recommend relying on any of the counter metrics shown below for much of anything, because they only represent the state of the program at an instant, so information is “lost” between each time the metric is queried. For instance, an app that samples metrics every second, services 10 requests per second, and completes each request in 10ms will almost always show 0, and occasionally 1, for the following counters.
Reasoning: if the requests were served one right after the other, serially, then the total time spent serving responses is 100ms (10 requests * 10ms). 100ms is 0.1 of 1 second, so there’d be about a 10% chance of a counter showing 1. Since requests don’t all arrive at once and they are processed in parallel, this calculation isn’t entirely accurate, but the conclusion remains the same.
In the end, these counters may be misleading because of underestimation. If these counters show rapid growth, then you’re in trouble.
io.dropwizard.jetty.MutableServletContextHandler.active-dispatches: How many requests are currently being handled by threads dedicated to Jetty.
io.dropwizard.jetty.MutableServletContextHandler.active-requests: The number of requests currently being serviced at that instant. When a request first comes into the application, this counter is incremented, and after the server has handled the request, it is decremented. active-suspended + active-dispatches = active-requests
io.dropwizard.jetty.MutableServletContextHandler.active-suspended: How many requests are currently suspended. An app that doesn’t take advantage of Jersey’s Asynchronous Server API will always have a value of 0 and active-requests = active-dispatches.
To clarify the difference between active-requests and active-dispatches: once a request is suspended, it is no longer considered dispatched and active-dispatches is decremented. Once the suspended request resumes, active-dispatches is incremented again until the response is served.
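To make the instantaneous nature of these counters concrete, here is a minimal sketch using the Metrics library that Dropwizard bundles. The metric name and main method are made up for illustration; Dropwizard maintains its own counters internally under MutableServletContextHandler.
import com.codahale.metrics.Counter;
import com.codahale.metrics.MetricRegistry;

public class ActiveRequestsExample {
    public static void main(String[] args) throws InterruptedException {
        MetricRegistry registry = new MetricRegistry();
        // Hypothetical name, for illustration only.
        Counter activeRequests = registry.counter("example.active-requests");

        // Simulate handling a request: increment on arrival, decrement when done.
        activeRequests.inc();
        try {
            Thread.sleep(10); // pretend the request takes 10ms
        } finally {
            activeRequests.dec();
        }

        // Sampled afterwards, the counter is back to 0, which is why
        // short-lived work rarely shows up in these instantaneous values.
        System.out.println(activeRequests.getCount()); // 0
    }
}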
Gauges
Gauges are slightly different from counters. Instead of incrementing or decrementing, gauges report a single value; this value is not necessarily a number, though most of the time it is.
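A minimal sketch of registering a custom gauge with the Metrics library, assuming a made-up metric name and a simple in-memory queue:
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class QueueDepthGauge {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        Queue<String> jobs = new ConcurrentLinkedQueue<>();

        // A gauge simply reports a value whenever it is asked; here, the queue depth.
        registry.register("example.jobs.size", new Gauge<Integer>() {
            @Override
            public Integer getValue() {
                return jobs.size();
            }
        });

        jobs.add("first-job");
        System.out.println(registry.getGauges().get("example.jobs.size").getValue()); // 1
    }
}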
Direct and mapped buffers
There are two types of buffer metrics: direct and mapped, and each of them has a count, capacity, and used metric. These are probably among the least useful metrics for the average application, but if an application uses a lot of these buffers, it may be good to monitor them.
Memory
Monitoring the JVM memory usage is important for knowing when memory leaks occur or when more memory should be allocated to the service. To understand the metrics better, it is worth noting the difference between heap and non-heap memory. Heap memory is where class and array instances are kept. Non-heap memory is where metadata about classes and methods, as well as JIT optimizations, are stored. Dropwizard contains metrics on both sets of memory, but the one that you need to pay the most attention to is heap.used: if this metric grows without bound, a memory leak exists.
For convenience and accuracy, I am going to quote official documentation heavily on interpreting the different metric types. Quotes in this section come from the MemoryMXBean and MemoryUsage javadocs.
init
represents the initial amount of memory (in bytes) that the Java virtual machine requests from the operating system for memory management during startup. The Java virtual machine may request additional memory from the operating system and may also release memory to the system over time. The value of init may be undefined.
A useless metric. It is just the amount of memory asked for on startup, and it has nothing to do with the application. I say this because the JVM that I’m running several Dropwizard applications on defaults to a heap size of 185MB, which, as I’ll soon show, is a significant overestimation.
To see the default init heap size:
java -XX:+PrintFlagsFinal -version 2>&1 | grep InitialHeapSize
used
represents the amount of memory currently used (in bytes).
This metric will vary greatly based on the application, but for applications that don’t allocate that much, expect around 30MB of usage for both heap and non-heap. I have an application using 25MB of heap, yet since it initialized with 185MB, there is a large swath of free memory.
A related metric is the usage metric, which measures used over the max amount of memory the program can grow to. This is another metric that should be watched with a careful eye. With the example numbers that I have given, usage is 1% for heap and 25% for non-heap.
committed
represents the amount of memory (in bytes) that is guaranteed to be available for use by the Java virtual machine. The amount of committed memory may change over time (increase or decrease). The Java virtual machine may release memory to the system and committed could be less than init. committed will always be greater than or equal to used.
Pretty self-explanatory, but it is the amount of memory that the application has available to allocate. I see committed metrics of 155MB for heap and 30MB for non-heap.
max
represents the maximum amount of memory (in bytes) that can be used for memory management. Its value may be undefined. The maximum amount of memory may change over time if defined. The amount of used and committed memory will always be less than or equal to max if max is defined. A memory allocation may fail if it attempts to increase the used memory such that used > committed even if used <= max would still be true (for example, when the system is low on virtual memory).
I see max metrics of 2.5GB for heap and 100MB for non-heap, but this will obviously differ depending on the machine that Dropwizard is deployed on.
To see the default max heap size:
java -XX:+PrintFlagsFinal -version 2>&1 | grep MaxHeapSize
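These gauges ultimately come from the JVM’s MemoryMXBean, which is easy to inspect yourself. A minimal sketch follows; the printed numbers will of course depend on your machine and JVM flags.
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class MemoryMetrics {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();

        // The same init/used/committed/max values the Dropwizard gauges expose.
        System.out.printf("heap: init=%d used=%d committed=%d max=%d%n",
                heap.getInit(), heap.getUsed(), heap.getCommitted(), heap.getMax());

        // "usage" is simply used divided by max.
        System.out.printf("heap usage: %.2f%%%n", 100.0 * heap.getUsed() / heap.getMax());
        System.out.printf("non-heap used: %d%n", nonHeap.getUsed());
    }
}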
Threads
There are several metrics related to threads: which state they are in and how many threads are in each state.
The states are:
- new: A thread that has not yet started is in this state.
- runnable: A thread executing in the Java virtual machine is in this state.
- blocked: A thread that is blocked waiting for a monitor lock is in this state.
- waiting: A thread that is waiting indefinitely for another thread to perform a particular action is in this state.
- timed_waiting: A thread that is waiting for another thread to perform an action for up to a specified waiting time is in this state.
- terminated: A thread that has exited is in this state.
- daemon: A daemon thread does not prevent the JVM from shutting down (once all non-daemon threads have exited, the JVM can exit)
For more information on these states, visit the official documentation.
If there are any deadlocked threads, they will appear as part of an array in the format of %s locked on %s (owned by %s):%n%s, where thread, lock, and stack trace info are concatenated. You’ll know if there are any deadlocked threads because the deadlock health check will fail. A deadlock is a sign of a serious bug, so if you see it, fix it!
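Under the hood this information comes from the JVM’s ThreadMXBean. Here is a minimal sketch of detecting deadlocked threads yourself; it only illustrates the shape of the data the metric reports.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        // Returns null when no threads are deadlocked.
        long[] deadlockedIds = threads.findDeadlockedThreads();
        if (deadlockedIds == null) {
            System.out.println("No deadlocked threads");
            return;
        }

        // Print thread, lock, and owner info, similar to what the metric reports.
        for (ThreadInfo info : threads.getThreadInfo(deadlockedIds)) {
            System.out.printf("%s locked on %s (owned by %s)%n",
                    info.getThreadName(), info.getLockName(), info.getLockOwnerName());
        }
    }
}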
Queued Thread Pool
org.eclipse.jetty.util.thread.QueuedThreadPool.dw.jobs: “Number of jobs queued waiting for a thread”. You want this metric as low as possible, as you don’t want requests waiting to be served. You may get the occasional 1, but that is just when the metric grabbed its value during the minuscule amount of time a single job was in the queue, and it is nothing to worry about, as every request passes through the queue. To put it in perspective, taking the average of this metric over 24 hours in a production service yielded 0.0015.
org.eclipse.jetty.util.thread.QueuedThreadPool.dw.size: “The number of threads currently in the pool”. These are the threads in the (Jetty) web server that are servicing requests.
org.eclipse.jetty.util.thread.QueuedThreadPool.dw.utilization: the proportion of threads in the pool that are currently in use (i.e. not idle).
org.eclipse.jetty.util.thread.QueuedThreadPool.dw.utilization-max: the proportion of threads currently in use relative to the maximum number of threads the pool can grow to.
Someone posted on dropwizard-user wondering if they should be concerned that their utilization metric was close to one even though the server was essentially standing idle. They posted the following numbers.
{
"org.eclipse.jetty.util.thread.QueuedThreadPool.dw.size": {
"value": 13
},
"org.eclipse.jetty.util.thread.QueuedThreadPool.dw.utilization": {
"value": 0.9230769230769231
},
"org.eclipse.jetty.util.thread.QueuedThreadPool.dw.utilization-max": {
"value": 0.01171875
}
}
The answer is no, they should not be worried about it because the max utilization is low. Let’s put these numbers into perspective.
- A size of 13 means that the thread pool has 13 threads available for use.
- Since the current utilization of the thread pool is 0.923, we know 12 threads are in use.
- A thread pool close to 100% usage can be scary, but we know that the thread pool can grow. If twelve threads only make up ~1% of the maximum a thread pool can grow to, then we know that the thread pool can hold a maximum of 1024 threads, which is the Dropwizard default.
If the idle server started to receive requests, the thread pool utilization would drop as more threads are allocated, but the max utilization would increase.
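The arithmetic above can be written out directly. A small sketch using the numbers from that post:
public class ThreadPoolMath {
    public static void main(String[] args) {
        double size = 13;                          // dw.size: threads currently in the pool
        double utilization = 0.9230769230769231;   // dw.utilization
        double utilizationMax = 0.01171875;        // dw.utilization-max

        double threadsInUse = size * utilization;           // ~12 threads busy
        double maxPoolSize = threadsInUse / utilizationMax; // 12 / 0.01171875 = 1024

        System.out.printf("threads in use: %.0f%n", threadsInUse);
        System.out.printf("maximum pool size: %.0f%n", maxPoolSize);
    }
}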
Meters
Meter metrics let you know how many times an event has occurred and what the mean rate is, as well as weighted moving averages over 1, 5, and 15 minutes. The reason for the weighted moving averages, as stated in the Metrics documentation:
[The mean rate] represents the total rate for your application’s entire lifetime (e.g., the total number of requests handled, divided by the number of seconds the process has been running), it doesn’t offer a sense of recency.
Just by looking at the mean rate, one wouldn’t be able to discern, for instance, that an app handles drastically different amounts of traffic depending on the time.
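For your own events, a meter is a one-liner with the Metrics library. A minimal sketch, with a made-up metric name:
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

public class MeterExample {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        Meter uploads = registry.meter("example.uploads");

        // Mark the meter every time the event happens.
        uploads.mark();
        uploads.mark(3); // or mark several occurrences at once

        System.out.println(uploads.getCount());         // 4
        System.out.println(uploads.getOneMinuteRate()); // weighted moving average
    }
}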
There are a few logging metrics. There is a meter for each of the five logging levels (TRACE, DEBUG, INFO, WARN, and ERROR), and a single aggregate meter (ALL) that is the summation of all the logging levels. I recommend keeping an eye on the counts for WARN and ERROR. Ideally, both should be zero; otherwise I’ll be grepping the log file for more information. Occasionally, looking at ALL can let you know how spammy the logger is, and if you log to a file, it could sneakily steal all the disk space.
Jetty
There are a couple of Jetty meters. There are meters for each class of response status code (e.g. 1xx-responses, 2xx-responses, etc). All 5xx-responses should be investigated, but be wary of 4xx-responses as well: clients may think they are hitting a valid URL with valid data but still receive an error, so this meter can be used to monitor assumptions.
The other meters deal with asynchronous requests. Whenever a request has been asynchronously dispatched, async-dispatches is marked. You can think of it as how often async requests are served.
The other metric is async-timeouts, which is how many times a request has timed out after being suspended. By default, there is no timeout defined when using Jersey’s asynchronous API.
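Since no timeout is the default, it may be worth setting one explicitly. A minimal sketch of a Jersey asynchronous resource; the path, class name, and executor are made up for illustration:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.container.AsyncResponse;
import javax.ws.rs.container.Suspended;

@Path("/slow")
public class SlowResource {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    @GET
    public void get(@Suspended final AsyncResponse response) {
        // While the request is suspended, it counts toward active-suspended
        // instead of active-dispatches.
        response.setTimeout(5, TimeUnit.SECONDS); // otherwise there is no timeout

        executor.submit(() -> response.resume("done"));
    }
}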
Jetty Timers
Timers are the most descriptive type of metric, with both rate and duration information. The timers that Jetty defines here are more of a novelty than anything, as they take into account requests to all endpoints, so fast and slow endpoints are mushed together into a metric that isn’t very useful.
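Per-endpoint timers are much more useful, and Dropwizard will register one automatically for any resource method annotated with @Timed. A minimal sketch; the resource itself is made up:
import com.codahale.metrics.annotation.Timed;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/widgets")
@Produces(MediaType.APPLICATION_JSON)
public class WidgetResource {

    // Dropwizard registers a timer named after this class and method,
    // giving per-endpoint rate and duration stats instead of one big aggregate.
    @GET
    @Timed
    public String list() {
        return "[]";
    }
}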
Metrics Output
Creating and understanding metrics is only half the battle; there has to be a way to consume metrics. One way is to curl /metrics on the admin port of the application and parse out what you are looking for, but that is tedious and error prone. A better approach is to send metrics to Graphite, which can aggregate the metrics you want.
Also, if it fits your purpose, I recommend using the CSV reporter. Since there are about 75 metrics, there will be 75 different files (it is possible to exclude metrics) that will continuously grow, even across restarts, so the CSV reporter is only really useful when you have a single service and sufficient disk space. The Graphite reporter, on the other hand, writes to fixed-size databases, so make sure you configure Graphite correctly!
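Dropwizard can wire up reporters through its YAML configuration, but here is a minimal programmatic sketch using the Metrics library directly; the hostname, prefix, and output directory are placeholders:
import com.codahale.metrics.CsvReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;
import java.io.File;
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

public class Reporters {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();

        // Ship metrics to Graphite every minute.
        Graphite graphite = new Graphite(new InetSocketAddress("graphite.example.com", 2003));
        GraphiteReporter graphiteReporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("my-app")
                .build(graphite);
        graphiteReporter.start(1, TimeUnit.MINUTES);

        // Or dump one CSV file per metric into a directory.
        CsvReporter csvReporter = CsvReporter.forRegistry(registry)
                .build(new File("/tmp/metrics"));
        csvReporter.start(1, TimeUnit.MINUTES);
    }
}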
That said, here are some tips on working with metrics in CSV form.
The CSV metrics, by default, are output with UNIX timestamps, which are not human readable. To convert the UNIX timestamp into a more human readable format and output a nice-looking table:
awk -F, '{sub($1, strftime("%c", $1)); print}' "${CSV_FILE}" | column -s',' -t
The following command will find all the metrics since the application last started. This command works by traversing the file backwards, attempting to find a count that is greater than the previous count. Once a greater count is encountered, it stops printing.
head -n1 "${CSV_FILE}" &&
tac "${CSV_FILE}" | awk -F, 'BEGIN { c = -1 }
{if (c == -1 || c >= $2) { c = $2; print; } else {exit;}}' | tac
Comments
If you'd like to leave a comment, please email [email protected]
awesome and a much needed post about metrics .