Guide and Explanation for Metrics in Dropwizard
Every application should have a series of continuously updated metrics. This allows for real-time insights into, and alerts on, a service. The Java web framework, Dropwizard, which I’ve written about before, comes with the venerable Metrics library and about 75 predefined metrics baked in. All you need to do is persist and interpret these metrics, while adding your own.
Below, I walk through some of the default metrics, so you know what they are and which ones to monitor. The intended audience for this article is developers writing Dropwizard applications or those who administer Dropwizard applications.
Counters
Counters are the easiest metric to understand: they are a key-value pair where each value is a number that is incremented or decremented. I would not recommend relying on any of the counter metrics shown below for much of anything, because they only represent the state of the program at an instant, so information is “lost” between each time the metric is queried. For instance, an app that samples metrics every second, services 10 requests per second, and completes each request in 10ms will almost always show 0, and occasionally 1, for the following counters.
Reasoning: if the requests were served one right after the other, serially, then the total time spent serving responses is 100ms (10 requests * 10ms). 100ms is 0.1 of 1 second, so there’d be about a 10% chance of a counter showing 1. Since requests don’t all arrive at once and they are processed in parallel, this calculation isn’t entirely accurate, but the conclusion remains the same.
In the end, these counters may be misleading because of underestimation. If these counters show rapid growth, then you’re in trouble.
io.dropwizard.jetty.MutableServletContextHandler.active-dispatches: How many requests are currently being handled by threads dedicated to Jetty.
io.dropwizard.jetty.MutableServletContextHandler.active-requests: The number of requests currently being serviced at that instant. When a request first comes into the application, this counter is incremented, and after the server has handled the request, it is decremented. active-suspended + active-dispatches = active-requests
io.dropwizard.jetty.MutableServletContextHandler.active-suspended: How many requests are currently suspended. An app that doesn’t take advantage of Jersey’s Asynchronous Server API will always have a value of 0 and active-requests = active-dispatches.
To clarify the difference between active-requests and active-dispatches: once a request is suspended, it is no longer considered dispatched and active-dispatches is decremented. Once the suspended request resumes, active-dispatches is incremented again until the response is served.
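To make the instantaneous nature of these counters concrete, here is a minimal sketch using the Metrics library that Dropwizard bundles. The metric name and main method are made up for illustration; Dropwizard maintains its own counters internally under MutableServletContextHandler.
import com.codahale.metrics.Counter;
import com.codahale.metrics.MetricRegistry;

public class ActiveRequestsExample {
    public static void main(String[] args) throws InterruptedException {
        MetricRegistry registry = new MetricRegistry();
        // Hypothetical name, for illustration only.
        Counter activeRequests = registry.counter("example.active-requests");

        // Simulate handling a request: increment on arrival, decrement when done.
        activeRequests.inc();
        try {
            Thread.sleep(10); // pretend the request takes 10ms
        } finally {
            activeRequests.dec();
        }

        // Sampled afterwards, the counter is back to 0, which is why
        // short-lived work rarely shows up in these instantaneous values.
        System.out.println(activeRequests.getCount()); // 0
    }
}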
Gauges
Gauges are slightly different from counters. Instead of incrementing or decrementing, gauges report a single value; this value is not necessarily a number, though most of the time it is.
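A minimal sketch of registering a custom gauge with the Metrics library, assuming a made-up metric name and a simple in-memory queue:
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class QueueDepthGauge {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        Queue<String> jobs = new ConcurrentLinkedQueue<>();

        // A gauge simply reports a value whenever it is asked; here, the queue depth.
        registry.register("example.jobs.size", new Gauge<Integer>() {
            @Override
            public Integer getValue() {
                return jobs.size();
            }
        });

        jobs.add("first-job");
        System.out.println(registry.getGauges().get("example.jobs.size").getValue()); // 1
    }
}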
Direct and mapped buffers
There are two types of buffer metrics: direct and mapped, and each of them has a count, capacity, and used metric. These are probably among the least useful metrics for the average application, but if an application uses a lot of these buffers, it may be good to monitor them.
Memory
Monitoring the JVM memory usage is important for knowing when memory leaks occur or when more memory should be allocated to the service. To understand the metrics better, it is worth noting the difference between heap and non-heap memory. Heap memory is where class and array instances are kept. Non-heap memory is where metadata about classes and methods, as well as JIT optimizations, are stored. Dropwizard contains metrics on both sets of memory, but the one that you need to pay the most attention to is heap.used: if this metric grows without bound, a memory leak exists.
For convenience and accuracy, I am going to quote official documentation heavily on interpreting the different metric types. Quotes in this section come from the MemoryMXBean and MemoryUsage javadocs.
init
represents the initial amount of memory (in bytes) that the Java virtual machine requests from the operating system for memory management during startup. The Java virtual machine may request additional memory from the operating system and may also release memory to the system over time. The value of init may be undefined.
A useless metric. It is just the amount of memory asked for on startup, and it has nothing to do with the application. I say this because the JVM that I’m running several Dropwizard applications on defaults to a heap size of 185MB, which, as I’ll soon show, is a significant overestimation.
To see the default init heap size:
java -XX:+PrintFlagsFinal -version 2>&1 | grep InitialHeapSize
used
represents the amount of memory currently used (in bytes).
This metric will vary greatly based on the application, but for applications that don’t allocate that much, expect around 30MB of usage for both heap and non-heap. I have an application using 25MB of heap, yet since it initialized with 185MB, there is a large swath of free memory.
A related metric is the usage metric, which measures used over the max amount of memory the program can grow to. This is another metric that should be watched with a careful eye. With the example numbers that I have given, usage is 1% for heap and 25% for non-heap.
committed
represents the amount of memory (in bytes) that is guaranteed to be available for use by the Java virtual machine. The amount of committed memory may change over time (increase or decrease). The Java virtual machine may release memory to the system and committed could be less than init. committed will always be greater than or equal to used.
Pretty self-explanatory, but it is the amount of memory that the application has available to allocate. I see committed metrics of 155MB for heap and 30MB for non-heap.
max
represents the maximum amount of memory (in bytes) that can be used for memory management. Its value may be undefined. The maximum amount of memory may change over time if defined. The amount of used and committed memory will always be less than or equal to max if max is defined. A memory allocation may fail if it attempts to increase the used memory such that used > committed even if used <= max would still be true (for example, when the system is low on virtual memory).
I see max metrics of 2.5GB for heap and 100MB for non-heap, but this will obviously differ depending on the machine that Dropwizard is deployed on.
To see the default max heap size:
java -XX:+PrintFlagsFinal -version 2>&1 | grep MaxHeapSize
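These gauges ultimately come from the JVM’s MemoryMXBean, which is easy to inspect yourself. A minimal sketch follows; the printed numbers will of course depend on your machine and JVM flags.
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class MemoryMetrics {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();

        // The same init/used/committed/max values the Dropwizard gauges expose.
        System.out.printf("heap: init=%d used=%d committed=%d max=%d%n",
                heap.getInit(), heap.getUsed(), heap.getCommitted(), heap.getMax());

        // "usage" is simply used divided by max.
        System.out.printf("heap usage: %.2f%%%n", 100.0 * heap.getUsed() / heap.getMax());
        System.out.printf("non-heap used: %d%n", nonHeap.getUsed());
    }
}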
Threads
There are several metrics related to threads: which state they are in and how many threads are in each state.
The states are:
- new: A thread that has not yet started is in this state.
- runnable: A thread executing in the Java virtual machine is in this state.
- blocked: A thread that is blocked waiting for a monitor lock is in this state.
- waiting: A thread that is waiting indefinitely for another thread to perform a particular action is in this state.
- timed_waiting: A thread that is waiting for another thread to perform an action for up to a specified waiting time is in this state.
- terminated: A thread that has exited is in this state.
- daemon: A daemon thread does not prevent the JVM from shutting down (once all non-daemon threads have exited, the JVM can exit)
For more information on these states, visit the official documentation.
If there are any deadlocked threads, they will appear as part of an array in the format of %s locked on %s (owned by %s):%n%s, where thread, lock, and stack trace info are concatenated. You’ll know if there are any deadlocked threads because the deadlock health check will fail. A deadlock is a sign of a serious bug, so if you see it, fix it!
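Under the hood this information comes from the JVM’s ThreadMXBean. Here is a minimal sketch of detecting deadlocked threads yourself; it only illustrates the shape of the data the metric reports.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        // Returns null when no threads are deadlocked.
        long[] deadlockedIds = threads.findDeadlockedThreads();
        if (deadlockedIds == null) {
            System.out.println("No deadlocked threads");
            return;
        }

        // Print thread, lock, and owner info, similar to what the metric reports.
        for (ThreadInfo info : threads.getThreadInfo(deadlockedIds)) {
            System.out.printf("%s locked on %s (owned by %s)%n",
                    info.getThreadName(), info.getLockName(), info.getLockOwnerName());
        }
    }
}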
Queued Thread Pool
org.eclipse.jetty.util.thread.QueuedThreadPool.dw.jobs: “Number of jobs queued waiting for a thread”. You want this metric as low as possible, as you don’t want requests waiting to be served. You may get the occasional 1, but that is just when the metric grabbed its value during the minuscule amount of time a single job was in the queue, and it is nothing to worry about, as every request passes through the queue. To put it in perspective, taking the average of this metric over 24 hours in a production service yielded 0.0015.
org.eclipse.jetty.util.thread.QueuedThreadPool.dw.size: “The number of threads currently in the pool”. These are the threads in the (Jetty) web server that are servicing requests.
org.eclipse.jetty.util.thread.QueuedThreadPool.dw.utilization: the proportion of threads in the pool that are currently in use (i.e. not idle).
org.eclipse.jetty.util.thread.QueuedThreadPool.dw.utilization-max: the proportion of threads currently in use relative to the maximum number of threads the pool can grow to.
Someone posted on dropwizard-user wondering if they should be concerned that their utilization metric was close to one even though the server was essentially standing idle. They posted the following numbers.
{
"org.eclipse.jetty.util.thread.QueuedThreadPool.dw.size": {
"value": 13
},
"org.eclipse.jetty.util.thread.QueuedThreadPool.dw.utilization": {
"value": 0.9230769230769231
},
"org.eclipse.jetty.util.thread.QueuedThreadPool.dw.utilization-max": {
"value": 0.01171875
}
}
The answer is no, they should not be worried about it because the max utilization is low. Let’s put these numbers into perspective.
- A size of 13 means that the thread pool has 13 threads available for use.
- Since the current utilization of the thread pool is 0.923, we know 12 threads are in use.
- A thread pool close to 100% usage can be scary, but we know that the thread pool can grow. If twelve threads only make up ~1% of the maximum a thread pool can grow to, then we know that the thread pool can hold a maximum of 1024 threads, which is the Dropwizard default.
If the idle server started to receive requests, the thread pool utilization would drop as more threads are allocated, but the max utilization would increase.
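The arithmetic above can be written out directly. A small sketch using the numbers from that post:
public class ThreadPoolMath {
    public static void main(String[] args) {
        double size = 13;                          // dw.size: threads currently in the pool
        double utilization = 0.9230769230769231;   // dw.utilization
        double utilizationMax = 0.01171875;        // dw.utilization-max

        double threadsInUse = size * utilization;           // ~12 threads busy
        double maxPoolSize = threadsInUse / utilizationMax; // 12 / 0.01171875 = 1024

        System.out.printf("threads in use: %.0f%n", threadsInUse);
        System.out.printf("maximum pool size: %.0f%n", maxPoolSize);
    }
}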
Meters
Meter metrics let you know how many times an event has occurred and what the mean rate is, as well as weighted moving averages over 1, 5, and 15 minutes. The reason for the weighted moving averages, as stated in the Metrics documentation:
[The mean rate] represents the total rate for your application’s entire lifetime (e.g., the total number of requests handled, divided by the number of seconds the process has been running), it doesn’t offer a sense of recency.
Just by looking at the mean rate, one wouldn’t be able to discern, for instance, that an app handles drastically different amounts of traffic depending on the time.
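For your own events, a meter is a one-liner with the Metrics library. A minimal sketch, with a made-up metric name:
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

public class MeterExample {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        Meter uploads = registry.meter("example.uploads");

        // Mark the meter every time the event happens.
        uploads.mark();
        uploads.mark(3); // or mark several occurrences at once

        System.out.println(uploads.getCount());         // 4
        System.out.println(uploads.getOneMinuteRate()); // weighted moving average
    }
}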
There are a few logging metrics. There is a meter for each of the five logging levels (TRACE, DEBUG, INFO, WARN, and ERROR), and a single aggregate meter (ALL) that is the summation of all the logging levels. I recommend keeping an eye on the counts for WARN and ERROR. Ideally, both should be zero; otherwise I’ll be grepping the log file for more information. Occasionally, looking at ALL can let you know how spammy the logger is, and if you log to a file, it could sneakily steal all the disk space.
Jetty
There are a couple of Jetty meters. There are meters for each class of response status code (e.g. 1xx-responses, 2xx-responses, etc). All 5xx-responses should be investigated, but be wary of 4xx-responses as well: clients may think they are hitting a valid URL with valid data but still receive an error, so this meter can be used to monitor assumptions.
The other meters deal with asynchronous requests. Whenever a request has been asynchronously dispatched, async-dispatches is marked. You can think of it as how often async requests are served.
The other metric is async-timeouts, which is how many times a request has timed out after being suspended. By default, there is no timeout defined when using Jersey’s asynchronous API.
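Since no timeout is the default, it may be worth setting one explicitly. A minimal sketch of a Jersey asynchronous resource; the path, class name, and executor are made up for illustration:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.container.AsyncResponse;
import javax.ws.rs.container.Suspended;

@Path("/slow")
public class SlowResource {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    @GET
    public void get(@Suspended final AsyncResponse response) {
        // While the request is suspended, it counts toward active-suspended
        // instead of active-dispatches.
        response.setTimeout(5, TimeUnit.SECONDS); // otherwise there is no timeout

        executor.submit(() -> response.resume("done"));
    }
}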
Jetty Timers
Timers are the most descriptive type of metric, with both rate and duration information. The timers that Jetty defines here are more of a novelty than anything, as they take into account requests to all endpoints, so fast and slow endpoints are mushed together into a metric that isn’t very useful.
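Per-endpoint timers are much more useful, and Dropwizard will register one automatically for any resource method annotated with @Timed. A minimal sketch; the resource itself is made up:
import com.codahale.metrics.annotation.Timed;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/widgets")
@Produces(MediaType.APPLICATION_JSON)
public class WidgetResource {

    // Dropwizard registers a timer named after this class and method,
    // giving per-endpoint rate and duration stats instead of one big aggregate.
    @GET
    @Timed
    public String list() {
        return "[]";
    }
}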
Metrics Output
Creating and understanding metrics is only half the battle; there has to be a way to consume metrics. One way is to curl /metrics on the admin port of the application and parse out what you are looking for, but that is tedious and error prone. A better approach is to send metrics to Graphite, which can aggregate the metrics you want.
Also, if it fits your purpose, I recommend using the CSV reporter. Since there are about 75 metrics, there will be 75 different files (it is possible to exclude metrics) that will continuously grow, even across restarts, so the CSV reporter is only really useful when you have a single service and sufficient disk space. The Graphite reporter, on the other hand, writes to fixed-size databases, so make sure you configure Graphite correctly!
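Dropwizard can wire up reporters through its YAML configuration, but here is a minimal programmatic sketch using the Metrics library directly; the hostname, prefix, and output directory are placeholders:
import com.codahale.metrics.CsvReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;
import java.io.File;
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

public class Reporters {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();

        // Ship metrics to Graphite every minute.
        Graphite graphite = new Graphite(new InetSocketAddress("graphite.example.com", 2003));
        GraphiteReporter graphiteReporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("my-app")
                .build(graphite);
        graphiteReporter.start(1, TimeUnit.MINUTES);

        // Or dump one CSV file per metric into a directory.
        CsvReporter csvReporter = CsvReporter.forRegistry(registry)
                .build(new File("/tmp/metrics"));
        csvReporter.start(1, TimeUnit.MINUTES);
    }
}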
That said, here are some tips on working with metrics in CSV form.
The CSV metrics, by default, are output with UNIX timestamps, which are not human readable. To convert the UNIX timestamp into a more human readable format and output a nice-looking table:
awk -F, '{sub($1, strftime("%c", $1)); print}' "${CSV_FILE}" | column -s',' -t
The following command will find all the metrics since the application last started. This command works by traversing the file backwards, attempting to find a count that is greater than the previous count. Once a greater count is encountered, it stops printing.
head -n1 "${CSV_FILE}" &&
tac "${CSV_FILE}" | awk -F, 'BEGIN { c = -1 }
{if (c == -1 || c >= $2) { c = $2; print; } else {exit;}}' | tac
Comments
If you'd like to leave a comment, please email [email protected]
awesome and a much needed post about metrics .