Akka

Lightbend Telemetry can capture data for the following Akka-related features.

Cinnamon Akka module dependency

After adding the Cinnamon Agent as described in the setup, make sure that you add the Cinnamon Akka module dependency to your build file:

sbt
libraryDependencies += Cinnamon.library.cinnamonAkka
Maven
<dependency>
  <groupId>com.lightbend.cinnamon</groupId>
  <artifactId>cinnamon-akka_2.12</artifactId>
  <version>2.10.0</version>
</dependency>
Gradle
dependencies {
  compile group: 'com.lightbend.cinnamon', name: 'cinnamon-akka_2.12', version: '2.10.0'
}

Actor metrics

The following metrics are recorded for instrumented actors, with the type of metric in parentheses:

  • Running actors (counter) — the number of running actors (of an actor class or group).

  • Mailbox size (counter) — statistics for actor mailbox sizes.

  • Stash size (counter) — statistics for actor stash sizes.

  • Mailbox time (recorder) — statistics for the time that messages are in the mailbox.

  • Processed messages (rate) — the number of messages that actors have processed in the selected time frame.

  • Processing time (recorder) — statistics for the processing time of actors.

  • Sent messages (rate) — statistics for the number of sent messages per actor.

  • Dropped messages (rate) — statistics for the number of messages dropped from bounded mailboxes per actor.

All time-related metrics use nanoseconds unless specified otherwise.

Router metrics

The following router metrics are available:

  • Processed messages (rate) — the number of messages that routers have processed in the selected time frame.

  • Processing time (recorder) — statistics for the processing time of the router logic.

Note: Router metrics are only available for router actors, i.e. they are not available when routers are used directly.

Note: Use the setting routers = off to prevent router metrics from being created; see router exclude settings.

Actor remote metrics

The following remote metrics are recorded for instrumented actors, with the type of metric in parentheses:

  • Sent messages (rate) — statistics for the number of sent remote messages.

  • Sent message size (bytes) — statistics for remote sent message sizes.

  • Serialization time (recorder) — statistics for the time that serialization takes.

  • Received messages (rate) — statistics for the number of received remote messages.

  • Received message size (bytes) — statistics for remote received message sizes.

  • Deserialization time (recorder) — statistics for the time that deserialization takes.

  • Node quarantine (event) — node quarantine event information.

  • Phi accrual value (gauge) — statistics for the Phi accrual failure detector. A Phi value represents the connection between two nodes: self and remote. A self node can have a connection to any number of remote nodes, and each connection has its own Phi value. Internally in Akka, the Phi accrual value can become Double.Infinity. When this happens, Cinnamon converts the value to 1024*1024, because most visualizers cannot handle infinity. A reported value of 1048576 (1024*1024) therefore means that the Phi value has reached infinity.

  • Phi accrual threshold value (gauge) — the configured Phi accrual threshold value.

All time-related metrics use nanoseconds unless specified otherwise.

Note: Timing of serialization/deserialization is turned off by default. To enable it, you need to add this setting to your configuration: “cinnamon.akka.remote.serialization-timing = on”

Note: Phi accrual metrics and node quarantine events are turned off by default. To enable them, you need to add this setting to your configuration: “cinnamon.akka.remote.failure-detector-metrics = on”
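The two settings named in the notes above could be combined in a configuration file roughly as follows. This is a minimal sketch: only the setting paths come from the notes, while the nested-block layout is a formatting choice and the use of a file such as application.conf is an assumption.

```
cinnamon.akka.remote {
  # Record serialization and deserialization time (off by default)
  serialization-timing = on

  # Record Phi accrual metrics and node quarantine events (off by default)
  failure-detector-metrics = on
}
```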

Actor selection

Actor configuration supports selecting and grouping actors for instrumentation by actor class, package, subtree, or instance, so that telemetry and metric aggregation can be tailored to the application. Details on how to configure actor telemetry can be found under actor configuration.
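As an illustration of what such a selection might look like, the sketch below groups one package by class and reports one subtree per instance. The `actors` section, the `report-by` key, and the selection patterns are recalled from the actor configuration page rather than taken from this section, and the package and path names are hypothetical, so check everything against that page before use.

```
cinnamon.akka {
  actors {
    # Hypothetical package: aggregate metrics per actor class
    "com.example.orders.*" {
      report-by = class
    }
    # Hypothetical subtree: report each actor instance separately
    "/user/payments/*" {
      report-by = instance
    }
  }
}
```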

Actor events

Out-of-the-ordinary events are automatically recorded for instrumented actors. Events may trigger a debug snapshot when using the OverOps plugin.

  • Actor failure — when an actor fails, i.e. throws an exception.
    Event information:
    actor-ref — the actor failing
    cause — the exception being thrown

  • Unhandled message — when an actor does not handle a message sent to it.
    Event information:
    actor-ref — the actor not handling the message
    message — the message not being handled
    sender — the sender of the message

  • Dead letter — when a message is sent to an actor that no longer exists.
    Event information:
    recipient — the intended recipient of the message
    message — the message being sent
    actor-ref — the actor sending the message

  • Log warning — when an actor logs a warning.
    Event information:
    actor-ref — the actor logging the warning
    warning — the warning being logged

  • Log error — when an actor logs an error.
    Event information:
    actor-ref — the actor logging the error
    error — the error being logged

Cluster events

These are the types of Akka clustering events that Cinnamon observes:

  • Current cluster state event — a one time event, per cluster node, containing information about the state of the cluster.

  • Domain events — cluster domain events like leader changed, role leader changed or cluster shutting down.
    Event information:
    event — type of domain event, e.g. LeaderChanged
    role — the role for the domain event if any

  • Member events — cluster member events like member up, unreachable, reachable, exited or removed.
    Event information:
    event — type of member event, e.g. MemberUp
    version — node version
    member-status — member status, e.g. Joining
    previous-member-status — previous member status, e.g. Up

  • Singleton events — cluster singleton events with information about node, actor singleton class and name.
    Event information:
    event — type of singleton event, e.g. started
    class — singleton actor class
    actor — singleton actor name

  • Shard region events — cluster shard region started/stopped event with information about actor, node, type name and type of shard region (normal/proxy).
    Event information:
    event — type of shard region event, e.g. STARTED
    shard-region-actor — the actor controlling the shard region
    type-name — the entity type name of the shard region

  • Node unable to join event — cluster node unable to join event with information about seed nodes and number of attempts to join (only available for Akka version >= 2.4.17).

Note: Cluster events are turned off by default. To enable them, you need to add these settings to your configuration: “cinnamon.akka.cluster.domain-events = on”, “cinnamon.akka.cluster.member-events = on”, “cinnamon.akka.cluster.singleton-events = on” and/or “cinnamon.akka.cluster.shard-region-info = on”
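Taken together, the settings listed in the note above might be written as a single block like this. It is a sketch built only from the setting paths in the note; enable just the ones you need.

```
cinnamon.akka.cluster {
  # Each setting is off by default
  domain-events = on
  member-events = on
  singleton-events = on
  shard-region-info = on
}
```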

Cluster metrics

The following cluster metrics are recorded, with the type of metric in parentheses:

  • Shard region delivered messages (rate) — statistics for the number of messages that have been delivered by the shard region actor (regardless of where the shard resides).

  • Shard regions per node (gauge) — number of shard regions per node.

  • Shards per region (gauge) — number of shards per shard region.

  • Shard entities per shard (gauge) — number of shard entities per shard.

  • Reachable nodes (counter) — number of reachable nodes in the cluster at any point in time (see definition).

  • Unreachable nodes (counter) — number of unreachable nodes in the cluster at any point in time (see definition).

Note: Cluster-related metrics are turned off by default. To enable them, you need to add these settings to your configuration: “cinnamon.akka.cluster.shard-region-info = on” and “cinnamon.akka.cluster.node-metrics = on”
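In configuration-file form, the two cluster metric settings from the note above could look like the sketch below. The setting paths are taken from the note; the comments are a reading of which metrics each one covers and should be treated as an assumption.

```
cinnamon.akka.cluster {
  # Assumed to cover the shard region, shard, and entity metrics
  shard-region-info = on

  # Assumed to cover the reachable and unreachable node counters
  node-metrics = on
}
```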

Definition of reachable and unreachable nodes

Whether a node is reachable or unreachable is based on observations from the failure detector and is orthogonal to the cluster member states. In other words, a member in, for example, status Up can be either reachable or unreachable.

Split Brain Resolver events

Running Split Brain Resolver (SBR), a plugin available with the Lightbend Production Suite, in your cluster will ensure better resilience. If you have SBR running, Lightbend Telemetry will automatically keep track of any activity therein. If there is a split in your cluster, events will be generated.

  • Split brain resolver events — Split brain resolver decision taken event with information about the decision, reachable and unreachable nodes.
    Event information:
    decision — type of SBR decision, e.g. DownUnreachable
    nodes — the nodes that this node finds reachable
    unreachable-nodes — the nodes that this node finds unreachable

Note: Split brain resolver events are turned off by default. To enable them, you need to add this setting to your configuration: “cinnamon.akka.cluster.split-brain-resolver-events = on”

Persistence metrics

The following Persistence metrics are recorded, with the type of metric in parentheses. RecoveryPermitter is an internal Akka actor that keeps track of the number of recovery permits available; for more information, see Recovery.

  • Persistence recovery time (recorder) — time of recovery for a persistent actor.

  • Persistence recovery failure time (recorder) — time spent in recovery for a persistent actor when the recovery fails.

  • RecoveryPermitter used permits (recorder) — number of used permits.

  • RecoveryPermitter pending actors (recorder) — number of actors waiting for recovery.

  • RecoveryPermitter max permits value (gauge) — max number of permits in RecoveryPermitter (set via Akka configuration).

Note: Persistence metrics are turned off by default. To enable them, you need to add this setting to your configuration: “cinnamon.akka.persistence.metrics = on”

Persistence events

The following Persistence events are generated:

  • Persistence recovery failure — Event created whenever a message replay fails.
    Event information:
    actor-ref — the actor trying to recover
    failure — the exception that occurred
    event — the event that failed if available
    recovery-failure-time — the time in nanoseconds

  • Persistence persist failure — Event created whenever a message persist fails.
    Event information:
    actor-ref — the actor failing to persist an event
    failure — the exception that occurred
    event — the event that failed if available
    sequence-number — the event sequence number

  • Persistence persist rejection — Event created whenever a message persist is rejected.
    Event information:
    actor-ref — the actor failing to persist an event
    failure — the exception that occurred
    event — the event that failed if available
    sequence-number — the event sequence number

Note: Persistence events are turned off by default. To enable them, you need to add this setting to your configuration: “cinnamon.akka.persistence.events = on”
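Persistence metrics and persistence events are enabled independently. A sketch that turns both on, using only the two setting paths given in the notes of the persistence sections, could look like this:

```
cinnamon.akka.persistence {
  # Recovery time and RecoveryPermitter metrics (off by default)
  metrics = on

  # Recovery failure, persist failure, and persist rejection events (off by default)
  events = on
}
```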

Threshold events

Thresholds can be specified for some of the metrics. If a threshold is exceeded, an event is fired, which will also trigger OverOps debug snapshots. Alerts and integration with notification systems are available in OverOps. Thresholds are supported for:

  • Mailbox size — mailbox queue grows too large.
    Event information:
    actor-ref — the actor whose mailbox size has exceeded the limit
    message — the message being enqueued in the mailbox
    size — the mailbox size
    limit — the mailbox size limit

  • Stash size — stash queue grows too large.
    Event information:
    actor-ref — the actor whose stash size has exceeded the limit
    message — the message being stashed
    size — the stash size
    limit — the stash size limit

  • Mailbox time — message has been in the mailbox for too long.
    Event information:
    actor-ref — the actor whose mailbox time has exceeded the limit
    message — the message being dequeued from the mailbox
    nanos — the mailbox time
    threshold-nanos — the mailbox time limit

  • Processing time — message processing takes too long.
    Event information:
    actor-ref — the actor whose processing time has exceeded the limit
    message — the message that was just processed
    nanos — the processing time
    threshold-nanos — the processing time limit

  • Remote large message sent — a message larger than the threshold has been sent
    Event information:
    actor-ref — the actor who is sending the large message
    message-class — the message class of the large message
    size — the size in bytes of the large message
    recipient — the recipient of the large message

  • Remote large message received — a message larger than the threshold has been received
    Event information:
    actor-ref — the actor who is receiving the large message
    message-class — the message class of the large message
    size — the size in bytes of the large message
    sender — the sender of the large message

For more information see metric thresholds configuration.
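A threshold configuration for the four actor metrics listed above (mailbox size, stash size, mailbox time, processing time) might look like the following sketch. The `thresholds` block and its key names are assumptions recalled from the metric thresholds configuration page, not taken from this section, and the selected package and the limit values are purely illustrative. Settings for the remote large message thresholds are not shown because their names are not given here.

```
cinnamon.akka {
  actors {
    # Hypothetical actor selection
    "com.example.orders.*" {
      report-by = class
      thresholds {
        # Illustrative limits; tune for your application
        mailbox-size = 1000
        stash-size = 50
        mailbox-time = 3s
        processing-time = 500ms
      }
    }
  }
}
```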

Stopwatch

Stopwatch provides a timer that follows asynchronous flows. A Stopwatch can be started in one actor and then flow through to others via message sends. You can use it to gather time metrics for “hot paths” within message flows that cross multiple actors. Intervals are marked programmatically with start and stop points within the application using an Akka extension Stopwatch API. Time metrics are recorded for Stopwatches and threshold events can be configured. For more details see the Stopwatch extension.

  • Stopwatch events — the stopwatch time limit was breached.
    Event information:
    current-nanos — the current stopwatch time in nanoseconds
    threshold-nanos — the stopwatch threshold time in nanoseconds

Dispatcher metrics

The following metrics can be recorded for instrumented dispatchers, with the type of metric in parentheses:

Basic metrics

These are metrics that are built into the standard ForkJoinPool and ThreadPool ExecutorService implementations in Java and Scala. They are polled periodically by the instrumentation.

ForkJoinPool

  • Parallelism — the parallelism setting

  • Pool size (counter) — the current size of the thread pool

  • Active threads (counter) — an estimate of the number of threads running or stealing tasks

  • Running threads (counter) — an estimate of the number of threads not blocked in managed synchronization

  • Queued tasks (counter) — an estimate of the total number of tasks currently in queues

ThreadPool

  • Core pool size (counter) — the minimum size of the thread pool

  • Max pool size (counter) — the maximum size of the thread pool

  • Pool size (counter) — the current size of the thread pool

  • Active threads (counter) — an estimate of the number of threads running tasks

  • Processed tasks (counter) — an estimate of the number of processed tasks

Time metrics

Additional detailed time metrics for dispatchers.

  • Queue size (counter) — the number of tasks waiting to be processed

  • Queue time (recorder) — statistics for how long tasks are in the queue

  • Processing (counter) — how many tasks are being processed right now

  • Processing time (recorder) — statistics for how long the processing takes

All time-related metrics use nanoseconds unless specified otherwise.

Dispatcher selection

Dispatcher configuration supports selecting which dispatchers should be instrumented, and what type of instrumentation should be performed for them, so that telemetry can be tailored to the application. Details on how to configure dispatcher telemetry can be found under dispatcher configuration.
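For orientation, a dispatcher selection that enables both the basic metrics and the time metrics for every dispatcher might be sketched as below. The `dispatchers`, `basic-information`, `time-information`, and `names` keys are assumptions recalled from the dispatcher configuration page and do not appear in this section, so confirm them there before relying on this.

```
cinnamon.akka.dispatchers {
  # Assumed key: polled ForkJoinPool / ThreadPool metrics
  basic-information {
    names = ["*"]
  }
  # Assumed key: queue and processing time metrics
  time-information {
    names = ["*"]
  }
}
```

Selecting specific dispatcher names instead of the wildcard would presumably limit instrumentation overhead to the dispatchers of interest.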

Detailed information

For specific information on how to configure actors and dispatchers, see: