Spring Boot 3 Observability with Grafana

This article will teach you how to configure observability for your Spring Boot applications. We assume that observability is understood as the interconnection between metrics, logging, and distributed tracing. In the end, it should allow you to monitor the state of your system to detect errors and latency.

There are some significant changes in the approach to observability between Spring Boot 2 and 3. Tracing will no longer be part of Spring Cloud through the Spring Cloud Sleuth project. The core of that project has been moved to Micrometer Tracing. You can read more about the reasons and future plans in this post on the Spring blog.

The main goal of that article is to give you a simple receipt for how to enable observability for your microservices written in Spring Boot using a new approach. In order to simplify our exercise, we will use a fully managed Grafana instance in their cloud. We will build a very basic architecture with two microservices running locally. Let’s take a moment to discuss our architecture in great detail.

Source Code

If you would like to try it by yourself, you may always take a look at my source code. In order to do that you need to clone my GitHub repository. It contains several tutorials. You need to go to the inter-communication directory. After that, you should just follow my instructions.

Spring Boot Observability Architecture

There are two applications: inter-callme-service and inter-caller-service. The inter-caller-service app calls the HTTP endpoint exposed by the inter-callme-service app. We run two instances of inter-callme-service. We will also configure a static load balancing between those two instances using Spring Cloud Load Balancer. All the apps will expose Prometheus metrics using Spring Boot Actuator and the Micrometer project. For tracing, we are going to use Open Telemetry with Micrometer Tracing and OpenZipkin. In order to send all the data including logs, metrics, and traces from our local Spring Boot instances to the cloud, we have to use Grafana Agent.

On the other hand, there is a stack responsible for collecting and visualizing all the data. As I mentioned before we will leverage Grafana Cloud for that. It is a very comfortable way since we don’t have to install and configure all the required tools. First of all, Grafana Cloud offers a ready instance of Prometheus responsible for collecting metrics. We also need a log aggregation tool for storing and querying logs from our apps. Grafana Cloud offers a preconfigured instance of Loki for that. Finally, we have a distributed tracing backend through the Grafana Tempo. Here’s the visualization of our whole architecture.

spring-boot-observability-arch

Enable Metrics and Tracing with Micrometer

In order to export metrics in Prometheus format, we need to include the micrometer-registry-prometheus dependency. For tracing context propagation we should add the micrometer-tracing-bridge-otel module. We should also export tracing spans using one of the formats supported by Grafana Tempo. It will be OpenZipkin through the opentelemetry-exporter-zipkin dependency.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-exporter-zipkin</artifactId>
</dependency>

We need to use the latest available version of Spring Boot 3. Currently, it is 3.0.0-RC1. As a release candidate that version is available in the Spring Milestone repository.

1
2
3
4
5
6
<parent>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-parent</artifactId>
  <version>3.0.0-RC1</version>
  <relativePath/>
</parent>

One of the more interesting new features in Spring Boot 3 is the support for Prometheus exemplars. Exemplars are references to data outside of the metrics published by an application. They allow linking metrics data to distributed traces. In that case, the published metrics contain a reference to the traceId. In order to enable exemplars for the particular metrics, we need to expose percentiles histograms. We will do that for http.server.requests metric (1). We will also send all the traces to Grafana Cloud by setting the sampling probability to 1.0 (2). Finally, just to verify it works properly we print traceId and spanId in the log line (3).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
spring:
  application:
    name: inter-callme-service
  output.ansi.enabled: ALWAYS

management:
  endpoints.web.exposure.include: '*'
  metrics.distribution.percentiles-histogram.http.server.requests: true # (1)
  tracing.sampling.probability: 1.0 # (2)

logging.pattern.console: "%clr(%d{HH:mm:ss.SSS}){blue} %clr(%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]){yellow} %clr(:){red} %clr(%m){faint}%n" # (3)

The inter-callme-service exposes the POST endpoint that just returns the message in the reversed order. We don’t need to add here anything, just standard Spring Web annotations.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
@RestController
@RequestMapping("/callme")
class CallmeController(private val repository: ConversationRepository) {

   private val logger: Logger = LoggerFactory
       .getLogger(CallmeController::class.java)

   @PostMapping("/call")
   fun call(@RequestBody request: CallmeRequest) : CallmeResponse {
      logger.info("In: {}", request)
      val response = CallmeResponse(request.id, request.message.reversed())
      repository.add(Conversation(request = request, response = response))
      return response
   }

}

Load Balancing with Spring Cloud

In the endpoint exposed by the inter-caller-service we just call the endpoint from inter-callme-service. We use Spring RestTemplate for that. You can also declare Spring Cloud OpenFeign client, but it seems it does not currently support Micrometer Tracing out-of-the-box.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
@RestController
@RequestMapping("/caller")
class CallerController(private val template: RestTemplate) {

   private val logger: Logger = LoggerFactory
      .getLogger(CallerController::class.java)

   private var id: Int = 0

   @PostMapping("/send/{message}")
   fun send(@PathVariable message: String): CallmeResponse? {
      logger.info("In: {}", message)
      val request = CallmeRequest(++id, message)
      return template.postForObject("http://inter-callme-service/callme/call",
         request, CallmeResponse::class.java)
   }

}

In this exercise, we will use a static client-side load balancer that distributes traffic across two instances of inter-callme-service. Normally, you would integrate Spring Cloud Load Balancer with service discovery based e.g. on Eureka. However, I don’t want to complicate our demo with external components in the architecture. Assuming we are running inter-callme-service on 55800 and 55900 here’s the load balancer configuration in the application.yml file:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
spring:
  application:
    name: inter-caller-service
  cloud:
    loadbalancer:
      cache:
        ttl: 1s
      instances:
        - name: inter-callme-service
          servers: localhost:55800, localhost:55900

Since there is no built-in static load balancer implementation we need to add some code. Firstly, we have to inject configuration properties into the Spring bean.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
@Configuration
@ConfigurationProperties("spring.cloud.loadbalancer")
class LoadBalancerConfigurationProperties {

   val instances: MutableList<ServiceConfig> = mutableListOf()

   class ServiceConfig {
      var name: String = ""
      var servers: String = ""
   }

}

Then we need to create a bean that implements the ServiceInstanceListSupplier interface. It just returns a list of ServiceInstance objects that represents all defined static addresses.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
class StaticServiceInstanceListSupplier(private val properties: LoadBalancerConfigurationProperties,
                                        private val environment: Environment): 
   ServiceInstanceListSupplier {

   override fun getServiceId(): String = environment
      .getProperty(LoadBalancerClientFactory.PROPERTY_NAME)!!

   override fun get(): Flux<MutableList<ServiceInstance>> {
      val serviceConfig: LoadBalancerConfigurationProperties.ServiceConfig? =
                properties.instances.find { it.name == serviceId }
      val list: MutableList<ServiceInstance> =
                serviceConfig!!.servers.split(",", ignoreCase = false, limit = 0)
                        .map { StaticServiceInstance(serviceId, it) }.toMutableList()
      return Flux.just(list)
   }

}

Finally, we need to enable Spring Cloud Load Balancer for the app and annotate RestTemplate with @LoadBalanced.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
@SpringBootApplication
@LoadBalancerClient(value = "inter-callme-service", configuration = [CustomCallmeClientLoadBalancerConfiguration::class])
class InterCallerServiceApplication {

    @Bean
    @LoadBalanced
    fun template(builder: RestTemplateBuilder): RestTemplate = 
       builder.build()

}

Here’s the client-side load balancer configuration. We are providing our custom StaticServiceInstanceListSupplier implementation as a default ServiceInstanceListSupplier. Then we set RandomLoadBalancer as a default implementation of a load-balancing algorithm.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
class CustomCallmeClientLoadBalancerConfiguration {

   @Autowired
   lateinit var properties: LoadBalancerConfigurationProperties

   @Bean
   fun clientServiceInstanceListSupplier(
      environment: Environment, context: ApplicationContext):     
      ServiceInstanceListSupplier {
        val delegate = StaticServiceInstanceListSupplier(properties, environment)
        val cacheManagerProvider = context
           .getBeanProvider(LoadBalancerCacheManager::class.java)
        return if (cacheManagerProvider.ifAvailable != null) {
            CachingServiceInstanceListSupplier(delegate, cacheManagerProvider.ifAvailable)
        } else delegate
    }

   @Bean
   fun loadBalancer(environment: Environment, 
      loadBalancerClientFactory: LoadBalancerClientFactory):
            ReactorLoadBalancer<ServiceInstance> {
        val name: String? = environment.getProperty(LoadBalancerClientFactory.PROPERTY_NAME)
        return RandomLoadBalancer(
           loadBalancerClientFactory
              .getLazyProvider(name, ServiceInstanceListSupplier::class.java),
            name!!
        )
    }
}

Testing Observability with Spring Boot

Let’s see how it works. In the first step, we are going to run two instances of inter-callme-service. Since we set a static value of the listen port we need to override the property server.port for each instance. We can do it with the env variable SERVER_PORT. Go to the inter-communication/inter-callme-service directory and run the following commands:

1
2
$ SERVER_PORT=55800 mvn spring-boot:run
$ SERVER_PORT=55900 mvn spring-boot:run

Then, go to the inter-communication/inter-caller-service directory and run a single instance on the default port 8080:

1
$ mvn spring-boot:run

Then, let’s call our endpoint POST /caller/send/{message} several times with parameters, for example:

1
$ curl -X POST http://localhost:8080/caller/call/hello1

Here are the logs from inter-caller-service with the highlighted value of the traceId parameter:

spring-boot-observability-traceid

Let’s take a look at the logs from inter-callme-service. As you see the traceId parameter is the same as the traceId for that request on the inter-caller-service side.

logs

Here are the logs from the second instance of inter-callme-service:

logs

ou could also try the same exercise with Spring Cloud OpenFeign. It is configured and ready to use. However, for me, it didn’t propagate the traceId parameter properly. Maybe, it is the case with the current non-GA versions of Spring Boot and Spring Cloud.

Let’s verify another feature - Prometheus exemplars. In order to do that we need to call the /actuator/prometheus endpoint with the Accept header that is asking for the OpenMetrics format. This is the same header Prometheus will use to scrape the metrics.

1
$ curl -H 'Accept: application/openmetrics-text; version=1.0.0' http://localhost:8080/actuator/prometheus

As you see several metrics for the result contain traceId and spanId parameters. These are our exemplars.

spring-boot-observability-openmetrics

Install and Configure Grafana Agent

Our sample apps are ready. Now, the main goal is to send all the collected observables to our account on Grafana Cloud. There are various ways of sending metrics, logs, and traces to the Grafana Stack. In this article, I will show you how to use the Grafana Agent for that. Firstly, we need to install it. You can find detailed installation instructions depending on your OS here. Since I’m using macOS I can do it with Homebrew:

1
$ brew install grafana-agent

Before we start the agent, we need to prepare a configuration file. The location of that file also depends on your OS. For me it is $(brew --prefix)/etc/grafana-agent/config.yml. The configuration YAML manifests contain information on how we want to collect and send metrics, traces, and logs. Let’s begin with the metrics. Inside the scrape_configs section, we need to set a list of endpoints for scraping (1) and a default path (2). Inside the remote_write section, we have to pass our Grafana Cloud instance auth credentials (3) and URL (4). By default, Grafana Agent does not send exemplars. Therefore we need to enable it with the send_exemplars property (5).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
metrics:
  configs:
    - name: integrations
      scrape_configs:
        - job_name: agent
          static_configs:
            - targets: ['127.0.0.1:55800','127.0.0.1:55900','127.0.0.1:8080'] # (1)
          metrics_path: /actuator/prometheus # (2)
      remote_write:
        - basic_auth: # (3)
            password: <YOUR_GRAFANA_CLOUD_TOKEN>
            username: <YOUR_GRAFANA_CLOUD_USER>
          url: https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push # (4)
          send_exemplars: true # (5)
  global:
    scrape_interval: 60s

You can find all information about your instance of Prometheus in the Grafana Cloud dashboard.

Grafana Cloud dashboard

In the next step, we prepare a configuration for collecting and sending logs to Grafana Loki. The same as before we need to set auth credentials (1) and the URL (2) of our Loki instance. The most important thing here is to pass the location of log files (3).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
logs:
  configs:
    - clients:
        - basic_auth: # (1)
            password: <YOUR_GRAFANA_CLOUD_TOKEN>
            username: <YOUR_GRAFANA_CLOUD_USER>
          # (2)
          url: https://logs-prod-eu-west-0.grafana.net/loki/api/v1/push
      name: springboot
      positions:
        filename: /tmp/positions.yaml
      scrape_configs:
        - job_name: springboot-json
          static_configs:
            - labels:
                __path__: ${HOME}/inter-*/spring.log # (3)
                job: springboot-json
              targets:
                - localhost
      target_config:
        sync_period: 10s

By default, Spring Boot logs only to the console and does not write log files. In our case, the Grafana Agent reads log lines from output files. In order to write log files, we need to set a logging.file.name or logging.file.path property. Since there are two instances of inter-callme-service we need to distinguish somehow their log files. We will use the server.port property for that. The logs inside files are stored in JSON format.

1
2
3
4
5
6
logging:
  pattern:
    console: "%clr(%d{HH:mm:ss.SSS}){blue} %clr(%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]){yellow} %clr(:){red} %clr(%m){faint}%n"
    file: "{\"timestamp\":\"%d{HH:mm:ss.SSS}\",\"level\":\"%p\",\"traceId\":\"%X{traceId:-}\",\"spanId\":\"%X{spanId:-}\",\"appName\":\"${spring.application.name}\",\"message\":\"%m\"}%n"
  file:
    path: ${HOME}/inter-callme-service-${server.port}

Finally, we will configure trace collecting. Besides auth credentials and URL of the Grafana Tempo instance, we need to enable OpenZipkin receiver (1).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
traces:
  configs:
    - name: default
      remote_write:
        - endpoint: tempo-eu-west-0.grafana.net:443
          basic_auth:
            username: <YOUR_GRAFANA_CLOUD_USER>
            password: <YOUR_GRAFANA_CLOUD_TOKEN>
      # (1)
      receivers:
        zipkin:

Then, we can start the agent with the following command:

1
$ brew services restart grafana-agent

Grafana agent contains a Zipkin collector that listens on the default port 9411. There is also an HTTP API exposed outside the agent on the port 12345 for verifying agent status.

Zipkin

For example, we can use Grafana Agent HTTP API to verify how many log files it is monitoring. To do that just call the endpoint GET agent/api/v1/logs/targets. As you see, for me, it is monitoring three files. So that’s what we exactly wanted to achieve since there are two running instances of inter-callme-service and a single instance of inter-caller-service.

spring-boot-observability-logs

Visualize Spring Boot Observability with Grafana Stack

One of the most important advantages of Grafana Cloud in our exercise is that we have all the required things configured. After you navigate to the Grafana dashboard you can display a list of available data sources. As you see, there are Loki, Prometheus, and Tempo already configured.

spring-boot-observability-datasources

By default, Grafana Cloud enables exemplars in the Prometheus data source. If you are running Grafana by yourself, don’t forget to enable it on your Prometheus data source.

Grafana

Let’s start with the logs. We will analyze exactly the same logs as in the section “Testing Observability with Spring Boot”. We will get all the logs sent by the Grafana Agent. As you probably remember, we formatted all the logs as JSON. Therefore, we will parse them using the Json parser on the server side. Thanks to that, we would be able to filter by all the log fields. For example, we can use the traceId label a filter expression: {job="springboot-json"} | json | traceId = 1bb1d7d78a5ac47e8ebc2da961955f87.

logs

Here’s a full list of logs without any filtering. The highlighted lines contain logs of two analyzed traces.

logs

In the next step, we will configure Prometheus metrics visualization. Since we enabled percentile histograms for the http.server.requests metrics, we have multiple buckets represented by the http_server_requests_seconds_bucket values. A set of multiple buckets _bucket with a label le which contains a count of all samples whose values are less than or equal to the numeric value contained in the le label. We need to count histograms for 90% and 60% percent of requests. Here are our Prometheus queries:

Prometheus

Here’s our histogram. Exemplars are shown as green diamonds.

spring-boot-observability-histogram

When you hover over the selected exemplar, you will see more details. It includes, for example, the traceId value.

logs

Finally, the last part of our exercise. We would like to analyze the particular traces with Grafana Tempo. The only thing you need to do is to choose the grafanacloud-*-traces data source and set the value of the searched traceId. Here’s a sample result.

logs

Final Thoughts

The first GA release of Spring Boot 3 is just around the corner. Probably one of the most important things you will have to handle during migration from Spring Boot 2 is observability. In this article, you can find a detailed description of the current Spring Boot approach. If you are interested in Spring Boot, it’s worth reading about its best practices for building microservices in this article.

Reference https://piotrminkowski.com/2022/11/03/spring-boot-3-observability-with-grafana/