Create Monitoring Strategy
To detect any problems that arise in AMPS or in the underlying hardware, it's important to develop and implement a monitoring strategy.
AMPS is designed to be able to work with your existing monitoring infrastructure, including systems such as ITRS Geneos, Grafana, DataDog, and so on.
The AMPS Monitoring Guide describes the metrics available through the Administrative Interface. A complete monitoring strategy would include metrics tailored to the use case, and would use alerting thresholds based on the environment and the guarantees provided by the application.
This chapter includes a suggested minimal set of metrics to be tracked in a monitoring system. A full monitoring strategy would likely include additional metrics that are relevant to the specific environment and AMPS features used for the application.
The metrics offered here are one suggested baseline set of metrics. Not every metric applies to every installation. For any given installation of AMPS, other metrics are likely also important. See the AMPS Monitoring Guide for details on the metrics available, and create a monitoring strategy that reflects your environment and how your application uses AMPS.
Event Logging
The AMPS error and event log contains an ordered log of significant events in AMPS. The detail provided depends on the verbosity at which the logging is configured.
For a production instance of AMPS, a logging level of info (or more verbose) is recommended.
Any event recorded with a severity level of error, critical, or emergency indicates that an operation has failed in such a way that an application may have received partial or incorrect data. These events should be investigated.
Events at an error or critical level do not necessarily mean that AMPS is not functioning as expected, but they should still be investigated. For example, if an application submits a command that AMPS doesn't recognize, that is logged as an error level event since the application will not get the expected data, even though AMPS is correctly rejecting an unknown command. This event indicates that an application submitted an incorrect operation and is likely not functioning as expected, even though there is no issue in the AMPS server itself.
A robust monitoring strategy will monitor the event logs for events of error level and above so that those events can be investigated and corrected.
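As a rough illustration of this kind of log monitoring, the sketch below filters log lines by severity. The line format (a severity word appearing as a whitespace-separated token) and the sample lines are assumptions made for illustration only, not the actual AMPS log format; adapt the parsing to what your instance writes.

```python
# Severity names mentioned in this chapter, ordered from least to most severe.
SEVERITIES = ["info", "error", "critical", "emergency"]


def severity_of(line):
    """Return the first known severity found as a token in the line, or None.

    Assumption: the severity appears as a standalone, whitespace-separated
    word (optionally wrapped in brackets or followed by a colon or comma).
    """
    for token in line.lower().split():
        word = token.strip("[]:,")
        if word in SEVERITIES:
            return word
    return None


def lines_at_or_above(lines, threshold="error"):
    """Return the lines whose severity is at or above the threshold."""
    minimum = SEVERITIES.index(threshold)
    flagged = []
    for line in lines:
        level = severity_of(line)
        if level is not None and SEVERITIES.index(level) >= minimum:
            flagged.append(line)
    return flagged


# Made-up sample lines, not real AMPS output.
sample = [
    "2024-01-01T00:00:00 info: client connected",
    "2024-01-01T00:00:01 error: unrecognized command",
    "2024-01-01T00:00:02 critical: disk write failed",
]
flagged = lines_at_or_above(sample)
```

In practice, the flagged lines would be forwarded to whatever alerting mechanism the monitoring system already provides.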
Baseline Host Metrics
Typically, a monitoring system will capture, at a minimum, the following metrics about host-level performance. Since these relate to the underlying system rather than AMPS itself, many sites already collect the equivalent of these statistics by default.
This is not a complete list of statistics available for the host, but provides a starting point for developing your monitoring plan.
Base Metrics Path: /amps/host
Metric | Short Description |
| Amount of memory currently free. |
| Amount of memory in use. |
| Amount of swap currently free. |
| Total amount of swap. |
| Total bytes in (by interface name). |
| Total bytes out (by interface name). |
| Free space (by device). |
| Amount of CPU time waiting on I/O. |
| Amount of CPU time idle. |
Baseline Message Flow Metrics
The following metrics monitor overall message flow to the instance.
This is not a complete list of statistics available for message flow, but provides a starting point for developing your monitoring plan.
Base Metrics Path: /amps/instance/processors/all
Metric | Short Description |
| Incoming messages per second from all sources. |
| Outgoing messages denied due to entitlement (on valid subscription). |
| Incoming messages denied due to entitlement (publishes). |
| Number of times that the processor had to wait to add a message to the processing pipeline due to the instance reaching capacity limits on the number of in-progress messages. This metric can indicate resource constraints on AMPS. |
| The last active time for a message processor. If this time grows, either there is no traffic to the instance, or there is a delay in processing. |
SOW Topic Traffic Metrics
The following metrics monitor message flow for specific topics in the SOW (including Topics, Views, ConflatedTopics, and all replication models for Queues).
Depending on your application, of course, a given metric may not be relevant. (For example, if an application only uses queues, then the "update" metrics would not be relevant, since a message can be added to the queue or removed from the queue, but cannot be modified while in the queue.)
Base Metrics Path: /amps/instance/sow/<topic_name>!<message_type>
Metric | Short Description |
| Number of new records added per second. |
| Number of records updated per second. |
| Number of records deleted per second. |
| Number of queries of the topic per second. |
| Total count of new records added to the topic. |
| Total count of records removed from the topic. |
| Total count of updates to records in the topic. |
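The per-topic base paths in this chapter share the pattern `<topic_name>!<message_type>`. A small helper along the following lines could build those paths for a monitoring script. The function name, the topic name `orders`, and the message type `json` are hypothetical, and the helper does not escape special characters in topic names.

```python
# Path segments that this chapter uses with the <topic_name>!<message_type> pattern.
CATEGORIES = ("sow", "views", "queues")


def topic_metrics_path(category, topic_name, message_type):
    """Build the base metrics path for a topic, view, or queue.

    Illustrative helper only: it joins the pieces exactly as the base paths
    in this chapter are written and performs no escaping.
    """
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return f"/amps/instance/{category}/{topic_name}!{message_type}"


# Hypothetical topic and message type.
path = topic_metrics_path("sow", "orders", "json")
```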
View-Specific Metrics
If your application uses views, the following metric, when combined with the general topic metrics above, can give you insight into activity on the view's update queue and the processing load for the view.
Base Metrics Path: /amps/instance/views/<topic_name>!<message_type>
Metric | Short Description |
| The current number of pending updates for the view. |
Comparing the queue depth in a statistics snapshot with maximum recorded values for update, insert, and delete for the previous intervals can provide a rough approximation of the current latency for this view. For example, if the maximum number of total updates per second, as calculated by the sum of updates_per_sec, inserts_per_sec, and deletes_per_sec for the topic, is 15,000, a current queue depth of 1,500 would be expected to be processed in 100 ms or less.
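The approximation described above is a simple division, sketched here. The function name is hypothetical, and the split of the 15,000 total across the three rates is made up for the example; only the total and the queue depth come from the text.

```python
def estimated_view_latency_seconds(queue_depth,
                                   max_updates_per_sec,
                                   max_inserts_per_sec,
                                   max_deletes_per_sec):
    """Rough estimate of the time needed to drain the pending updates for a view.

    Divides the current queue depth by the maximum observed total change rate
    (updates + inserts + deletes per second) for the underlying topic.
    """
    total_per_sec = (max_updates_per_sec
                     + max_inserts_per_sec
                     + max_deletes_per_sec)
    if total_per_sec <= 0:
        raise ValueError("total rate must be positive to estimate latency")
    return queue_depth / total_per_sec


# 9,000 + 4,000 + 2,000 = 15,000 changes per second; queue depth of 1,500.
estimate = estimated_view_latency_seconds(1500, 9000, 4000, 2000)
```

With these inputs the estimate is about 0.1 seconds, which matches the 100 ms figure in the example. This is only a rough guide, since it assumes the view keeps processing at the maximum recorded rate.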
Queue-Specific Metrics
If your application uses queues, the following minimal metrics monitor traffic for a queue. These should be monitored in addition to the general topic metrics above.
Base Metrics Path: /amps/instance/queues/<topic_name>!<message_type>
Metric | Short Description |
| Age of oldest unacknowledged message in the queue. |
| Number of messages in the queue. |
Replication Destination Metrics
The following metrics monitor traffic to an outgoing replication destination.
Base Metrics Path: /amps/instance/replication/<destination>
Metric | Short Description |
| Whether or not this destination is currently connected. |
| The current point in the transaction log that has been acknowledged by this destination. This is calculated as the difference in seconds between the time that the last message acknowledged by the destination was written to the transaction log and the time that the most recent message was written to the transaction log. That is, if the last message that the destination has acknowledged was written to the local transaction log at AMPS rounds any value below Note: This is not an estimate of the time required to synchronize the downstream instance. |
| Number of messages sent to the destination (per second). |
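The difference-in-seconds calculation described in the table above can be sketched as follows. The function name and the timestamps are hypothetical, and the sketch does not model any rounding that AMPS applies to the reported value.

```python
from datetime import datetime


def seconds_behind(last_acknowledged_written, most_recent_written):
    """Seconds between two transaction log write times.

    last_acknowledged_written: when the last message acknowledged by the
        destination was written to the local transaction log.
    most_recent_written: when the most recent message was written to the
        local transaction log.
    """
    return (most_recent_written - last_acknowledged_written).total_seconds()


# Hypothetical write times, 30 seconds apart.
acknowledged = datetime(2024, 1, 1, 12, 0, 0)
latest = datetime(2024, 1, 1, 12, 0, 30)
lag = seconds_behind(acknowledged, latest)
```

Here `lag` is 30.0 seconds. As the table notes, this measures how far apart the two points in the transaction log are; it is not an estimate of how long the downstream instance would need to catch up.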
Replicated Queue Metrics
The following metrics monitor traffic for a replicated queue. These should be monitored in addition to the general replication metrics above and the instance-specific metrics for the queue.
Base Metrics Path: /amps/instance/queues/<topic_name>!<message_type>
Metric | Short Description |
| Number of messages that have had ownership transferred to this instance. |
| Number of messages that this instance has previously owned, but granted ownership to another instance. |
| Number of messages currently owned. |
Application Connection Metrics
The following metrics monitor network activity for a client connection.
Base Metrics Path: /amps/instance/clients/<client_id>
Metric | Short Description |
| Number of bytes in the transport receive queue for this connection. |
| Number of bytes in the transport send queue for this connection. |
| Number of bytes per second sent to the client. |
| Number of bytes per second received from the client. |
| Number of messages buffered in AMPS for the client. |
| Oldest message buffered in AMPS for the client. |