Create Monitoring Strategy

To detect any problems that arise in AMPS or in the underlying hardware, it's important to develop and implement a monitoring strategy.

AMPS is designed to work with your existing monitoring infrastructure, including systems such as ITRS Geneos, Grafana, and DataDog.

The AMPS Monitoring Guide describes the metrics available through the Administrative Interface. A complete monitoring strategy would include metrics tailored to the use case, and would use alerting thresholds based on the environment and the guarantees provided by the application.

This chapter suggests a minimal baseline set of metrics to track in a monitoring system. Not every metric applies to every installation, and most installations will need additional metrics that are relevant to the specific environment and the AMPS features the application uses. See the AMPS Monitoring Guide for details on the metrics available, and create a monitoring strategy that reflects your environment and how your application uses AMPS.

Event Logging

The AMPS error and event log contains an ordered log of significant events in AMPS. The detail provided depends on the verbosity at which the logging is configured.

For a production instance of AMPS, a logging level of info (or more verbose) is recommended.

Any event recorded with a severity level of error, critical, or emergency indicates that an operation has failed in a way that may have left an application with partial or incorrect data. These events should be investigated.

An event at an error or critical level does not necessarily mean that AMPS itself is malfunctioning. For example, if an application submits a command that AMPS doesn't recognize, AMPS logs an error level event because the application will not get the data it expects, even though AMPS is correctly rejecting an unknown command. There is no issue in the AMPS server itself, but the event shows that an application submitted an incorrect operation and is likely not functioning as expected, so it should still be investigated.

A robust monitoring strategy will monitor the event logs for events of error level and above so that those events can be investigated and corrected.
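As a rough illustration of this kind of log monitoring, the sketch below filters log lines down to those at error level or above. The line format (a severity name followed by a colon), the function name `alert_lines`, and the sample lines are all assumptions made for this example and are not taken from the AMPS documentation; adjust the pattern to the format your log actually uses.

```python
import re

# Severities the text above says should always be investigated.
ALERT_LEVELS = ("error", "critical", "emergency")

# Assumed format: the severity name appears as a word followed by a colon.
_SEVERITY = re.compile(r"\b(" + "|".join(ALERT_LEVELS) + r"):")


def alert_lines(lines):
    """Return (severity, line) pairs for lines at error level or above."""
    hits = []
    for line in lines:
        match = _SEVERITY.search(line)
        if match:
            hits.append((match.group(1), line.rstrip("\n")))
    return hits


# Hypothetical log lines, for illustration only.
sample = [
    "2024-01-01T00:00:00 info: instance started",
    "2024-01-01T00:00:05 error: unrecognized command from client",
    "2024-01-01T00:00:09 critical: example critical event",
]

for severity, line in alert_lines(sample):
    print(severity, "->", line)
```

A simple pattern like this also matches an info-level line whose message text happens to contain "error:", so a production filter would normally anchor the match to the severity field of the real log format.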

Baseline Host Metrics

Typically, a monitoring system will capture, at a minimum, the following metrics about host-level performance. Since these relate to the underlying system rather than AMPS itself, many sites already collect the equivalent of these statistics by default.

This is not a complete list of statistics available for the host, but provides a starting point for developing your monitoring plan.

Base Metrics Path: /amps/host

| Metric | Short Description |
| --- | --- |
| /memory/free | Amount of memory currently free. |
| /memory/in_use | Amount of memory in use. |
| /memory/swap_free | Amount of swap currently free. |
| /memory/swap_total | Total amount of swap. |
| /network/<if>/bytes_in | Total bytes in (by interface name). |
| /network/<if>/bytes_out | Total bytes out (by interface name). |
| /disks/<dev>/file_system_free_percent | Free space (by device). |
| /cpus/all/iowait_percent | Amount of CPU time waiting on I/O. |
| /cpus/all/idle_percent | Amount of CPU time idle. |
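Several of these values are most useful once they are combined or compared against a threshold. The sketch below derives a swap-used percentage from swap_total and swap_free and checks it, along with iowait_percent, against limits. It assumes that a sample has already been collected into a dictionary keyed by the metric paths above and that both swap values use the same unit. The function name and the threshold values are placeholders for illustration, not recommendations.

```python
def host_warnings(sample, max_swap_used_pct=10.0, max_iowait_pct=20.0):
    """Return human-readable warnings for one sample of host metrics.

    `sample` maps metric paths (relative to /amps/host) to numeric values.
    The default thresholds are arbitrary placeholders.
    """
    warnings = []

    total = sample["/memory/swap_total"]
    free = sample["/memory/swap_free"]
    # Guard against hosts configured without swap.
    used_pct = 100.0 * (total - free) / total if total > 0 else 0.0
    if used_pct > max_swap_used_pct:
        warnings.append(f"swap used {used_pct:.1f}%")

    iowait = sample["/cpus/all/iowait_percent"]
    if iowait > max_iowait_pct:
        warnings.append(f"iowait {iowait:.1f}%")

    return warnings


# Hypothetical sample values, for illustration only.
sample = {
    "/memory/swap_total": 8_000_000,
    "/memory/swap_free": 6_000_000,
    "/cpus/all/iowait_percent": 3.5,
}
print(host_warnings(sample))  # ['swap used 25.0%']
```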

Baseline Message Flow Metrics

The following metrics monitor overall message flow to the instance.

This is not a complete list of statistics available for message flow, but provides a starting point for developing your monitoring plan.

Base Metrics Path: /amps/instance/processors/all

| Metric | Short Description |
| --- | --- |
| /messages_received_per_sec | Incoming messages per second from all sources. |
| /denied_reads | Outgoing messages denied due to entitlement (on valid subscription). |
| /denied_writes | Incoming messages denied due to entitlement (publishes). |
| /throttle_count | Number of times that the processor had to wait to add a message to the processing pipeline due to the instance reaching capacity limits on the number of in-progress messages. This metric can indicate resource constraints on AMPS. |
| /last_active | The last active time for a message processor. If this time grows, either there is no traffic to the instance, or there is a delay in processing. |
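For a count such as throttle_count, the change between samples is usually more informative than the raw value. The sketch below computes that change for a series of samples. It assumes the metric is a cumulative counter that can restart from zero (for example, after an instance restart), which the description above does not state explicitly. The function name and sample values are hypothetical.

```python
def throttle_delta(previous, current):
    """Throttle events between two consecutive samples.

    Assumes a cumulative counter. If the counter goes backwards,
    treat it as a reset and count from zero.
    """
    if current < previous:
        return current
    return current - previous


# Hypothetical consecutive samples of throttle_count.
samples = [0, 0, 4, 4, 10]
deltas = [throttle_delta(a, b) for a, b in zip(samples, samples[1:])]
print(deltas)  # [0, 4, 0, 6]
```

A monitoring rule could then alert when the delta stays above zero for several consecutive intervals, rather than on any single nonzero value.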

SOW Topic Traffic Metrics

The following metrics monitor message flow for specific topics in the SOW (including Topics, Views, ConflatedTopics, and all replication models for Queues).

Depending on your application, of course, a given metric may not be relevant. (For example, if an application only uses queues, then the "update" metrics would not be relevant, since a message can be added to the queue or removed from the queue, but cannot be modified while in the queue.)

Base Metrics Path: /amps/instance/sow/<topic_name>!<message_type>

| Metric | Short Description |
| --- | --- |
| /inserts_per_sec | Number of new records added per second. |
| /updates_per_sec | Number of records updated per second. |
| /deletes_per_sec | Number of records deleted per second. |
| /queries_per_sec | Number of queries of the topic per second. |
| /insert_count | Total count of new records added to the topic. |
| /delete_count | Total count of records removed from the topic. |
| /update_count | Total count of updates to records in the topic. |

View-Specific Metrics

If your application uses views, the following metric, when combined with the general topic metrics above, can give you insight into view activity and the processing load for the view.

Base Metrics Path: /amps/instance/views/<topic_name>!<message_type>

| Metric | Short Description |
| --- | --- |
| /queue_depth | The current number of pending updates for the view. |

Comparing the queue depth in a statistics snapshot with the maximum values recorded for updates, inserts, and deletes over previous intervals can provide a rough approximation of the current latency for this view. For example, if the maximum total update rate for the topic, calculated as the sum of updates_per_sec, inserts_per_sec, and deletes_per_sec, is 15,000 per second, a current queue depth of 1,500 would be expected to be processed in roughly 100 ms (1,500 / 15,000 = 0.1 seconds).
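The arithmetic in that example can be sketched as follows. The function names and the per-interval history are hypothetical. The history values are chosen only so that the peak total matches the 15,000 per second used in the text, which does not say how that total splits across inserts, updates, and deletes.

```python
def peak_total_rate(intervals):
    """Highest combined rate seen across the recorded intervals.

    `intervals` is an iterable of
    (inserts_per_sec, updates_per_sec, deletes_per_sec) tuples.
    """
    return max(sum(rates) for rates in intervals)


def estimated_latency_ms(queue_depth, peak_rate):
    """Rough time to drain `queue_depth` pending updates at `peak_rate`."""
    if peak_rate <= 0:
        return None  # no recorded throughput to estimate from
    return 1000.0 * queue_depth / peak_rate


# Hypothetical history of per-interval rates for the underlying topic.
history = [(4000, 7000, 1000), (5000, 8000, 2000), (3000, 6000, 500)]
peak = peak_total_rate(history)           # 15000
print(estimated_latency_ms(1500, peak))   # 100.0
```

This is only an approximation: it assumes the view can sustain the highest rate previously recorded, which may not hold under the current load.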

Queue-Specific Metrics

If your application uses queues, the following minimal metrics monitor traffic for a queue. These should be monitored in addition to the general topic metrics above.

Base Metrics Path: /amps/instance/queues/<topic_name>!<message_type>

| Metric | Short Description |
| --- | --- |
| /seconds_behind | Age of oldest unacknowledged message in the queue. |
| /queue_depth | Number of messages in the queue. |

Replication Destination Metrics

The following metrics monitor traffic to an outgoing replication destination.

Base Metrics Path: /amps/instance/replication/<destination>

| Metric | Short Description |
| --- | --- |
| /is_connected | Whether or not this destination is currently connected. |
| /seconds_behind | Oldest message in the transaction log not yet sent to the downstream instance. Note: This is not an estimate of the time required to synchronize the downstream instance. |
| /messages_out_per_sec | Number of messages sent to the destination (per second). |

Replicated Queue Metrics

The following metrics monitor traffic for a replicated queue. These should be monitored in addition to the general replication metrics above and the instance-specific metrics for the queue.

Base Metrics Path: /amps/instance/queues/<topic_name>!<message_type>

| Metric | Short Description |
| --- | --- |
| /transferred_in | Number of messages that have had ownership transferred to this instance. |
| /transferred_out | Number of messages that this instance has previously owned, but granted ownership to another instance. |
| /owned | Number of messages currently owned. |

Application Connection Metrics

The following metrics monitor network activity for a client connection.

Base Metrics Path: /amps/instance/clients/<client_id>

| Metric | Short Description |
| --- | --- |
| /transport_rx_queue | Number of bytes in the transport receive queue for this connection. |
| /transport_tx_queue | Number of bytes in the transport send queue for this connection. |
| /bytes_out_per_sec | Number of bytes per second sent to the client. |
| /bytes_in_per_sec | Number of bytes per second received from the client. |
| /queue_depth_out | Number of messages buffered in AMPS for the client. |
| /queue_max_latency | Oldest message buffered in AMPS for the client. |

Copyright 2013-2024 60East Technologies, Inc.