Create Monitoring Strategy
To detect any problems that arise in AMPS or in the underlying hardware, it's important to develop and implement a monitoring strategy.
AMPS is designed to work with your existing monitoring infrastructure, including systems such as ITRS Geneos, Grafana, and DataDog.
The AMPS Monitoring Guide describes the metrics available through the Administrative Interface. A complete monitoring strategy would include metrics tailored to the use case, and would use alerting thresholds based on the environment and the guarantees provided by the application.
This chapter includes a suggested minimal set of metrics to be tracked in a monitoring system. A full monitoring strategy would likely include additional metrics that are relevant to the specific environment and AMPS features used for the application.
The metrics offered here are one suggested baseline set of metrics. Not every metric applies to every installation. For any given installation of AMPS, other metrics are likely also important. See the AMPS Monitoring Guide for details on the metrics available, and create a monitoring strategy that reflects your environment and how your application uses AMPS.
Event Logging
The AMPS error and event log contains an ordered log of significant events in AMPS. The detail provided depends on the verbosity at which the logging is configured.
For a production instance of AMPS, a logging level of `info` (or more verbose) is recommended.
Any event recorded with a severity level of `error`, `critical`, or `emergency` indicates that an operation has failed in a way that may have caused an application to receive partial or incorrect data, and should be investigated.
Events at an `error` or `critical` level do not necessarily mean that AMPS is not functioning as expected, but they should still be investigated. For example, if an application submits a command that AMPS doesn't recognize, that is logged as an `error` level event because the application will not get the expected data, even though AMPS is correctly rejecting an unknown command. The event indicates that the application submitted an incorrect operation and is likely not functioning as expected, even though there is no issue in the AMPS server itself.
A robust monitoring strategy will monitor the event logs for events of `error` level and above so that those events can be investigated and corrected.
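As a minimal sketch of the filtering described above, the following assumes that some site-specific parser has already turned log entries into `(severity, message)` pairs. The function name, the tuple shape, and the sample messages are hypothetical and are not part of AMPS; only the three severity names come from the text above.

```python
# Severity names taken from the text above. Treating them as a fixed
# alerting set is an assumption of this sketch.
ALERT_SEVERITIES = {"error", "critical", "emergency"}


def events_to_investigate(events):
    """Return the (severity, message) pairs at error level or above.

    `events` is an iterable of (severity, message) tuples produced by a
    hypothetical, site-specific log parser.
    """
    return [(sev, msg) for sev, msg in events if sev.lower() in ALERT_SEVERITIES]


# Hypothetical sample events, for illustration only.
sample = [
    ("info", "client connected"),
    ("error", "unrecognized command"),
    ("critical", "example critical event"),
]
flagged = events_to_investigate(sample)
```

In practice, the flagged events would be forwarded to whatever alerting system the site already uses.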
Baseline Host Metrics
Typically, a monitoring system will capture, at a minimum, the following metrics about host-level performance. Since these relate to the underlying system rather than AMPS itself, many sites already collect the equivalent of these statistics by default.
This is not a complete list of statistics available for the host, but provides a starting point for developing your monitoring plan.
Base Metrics Path: /amps/host
| Metric | Short Description |
| --- | --- |
| `/memory/free` | Amount of memory currently free. |
| `/memory/in_use` | Amount of memory in use. |
| `/memory/swap_free` | Amount of swap currently free. |
| `/memory/swap_total` | Total amount of swap. |
| `/network/<if>/bytes_in` | Total bytes in (by interface name). |
| `/network/<if>/bytes_out` | Total bytes out (by interface name). |
| `/disks/<dev>/file_system_free_percent` | Free space (by device). |
| `/cpus/all/iowait_percent` | Amount of CPU time waiting on I/O. |
| `/cpus/all/idle_percent` | Amount of CPU time idle. |
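Some host metrics are easier to alert on when combined. As one illustration, the sketch below derives a swap usage percentage from the `/memory/swap_free` and `/memory/swap_total` values. The function name is hypothetical, and the sketch assumes only that both values are reported in the same unit.

```python
def swap_used_percent(swap_free, swap_total):
    """Return the percentage of swap in use, given free and total swap.

    Both arguments must be in the same unit. A host with no swap
    configured (total of zero) is reported as 0.0 percent used.
    """
    if swap_total <= 0:
        return 0.0
    return 100.0 * (swap_total - swap_free) / swap_total


# Hypothetical sample values: 750 units free out of 1000 total.
used = swap_used_percent(750, 1000)
```

The appropriate alerting threshold for a value like this depends on the environment, so none is suggested here.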
Baseline Message Flow Metrics
The following metrics monitor overall message flow to the instance.
This is not a complete list of statistics available for message flow, but provides a starting point for developing your monitoring plan.
Base Metrics Path: /amps/instance/processors/all
| Metric | Short Description |
| --- | --- |
| `/messages_received_per_sec` | Incoming messages per second from all sources. |
| `/denied_reads` | Outgoing messages denied due to entitlement (on valid subscription). |
| `/denied_writes` | Incoming messages denied due to entitlement (publishes). |
| `/throttle_count` | Number of times that the processor had to wait to add a message to the processing pipeline due to the instance reaching capacity limits on the number of in-progress messages. This metric can indicate resource constraints on AMPS. |
| `/last_active` | The last active time for a message processor. If this time grows, either there is no traffic to the instance, or there is a delay in processing. |
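Because `/throttle_count` is described as a count of events, a monitoring system would typically alert on how much it grows between samples rather than on its absolute value. The sketch below shows one way to compute that growth. The function name is hypothetical, and the handling of a counter that goes backwards (treated here as a restart) is an assumption rather than documented AMPS behavior.

```python
def throttle_increase(previous_count, current_count):
    """Return the number of throttle events between two samples.

    Assumption: the counter only grows while the instance is running,
    so a smaller current value means the counter restarted from zero.
    """
    if current_count < previous_count:
        # Counter went backwards: count everything since the assumed restart.
        return current_count
    return current_count - previous_count


# Hypothetical samples taken one polling interval apart.
new_throttles = throttle_increase(10, 14)
```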
SOW Topic Traffic Metrics
The following metrics monitor message flow for specific topics in the SOW (including Topics, Views, ConflatedTopics, and all replication models for Queues).
Depending on your application, of course, a given metric may not be relevant. (For example, if an application only uses queues, then the "update" metrics would not be relevant, since a message can be added to the queue or removed from the queue, but cannot be modified while in the queue.)
Base Metrics Path: /amps/instance/sow/<topic_name>!<message_type>
| Metric | Short Description |
| --- | --- |
| `/inserts_per_sec` | Number of new records added per second. |
| `/updates_per_sec` | Number of records updated per second. |
| `/deletes_per_sec` | Number of records deleted per second. |
| `/queries_per_sec` | Number of queries of the topic per second. |
| `/insert_count` | Total count of new records added to the topic. |
| `/delete_count` | Total count of records removed from the topic. |
| `/update_count` | Total count of updates to records in the topic. |
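The total counters (`/insert_count`, `/delete_count`, `/update_count`) can also be used to derive an average rate over whatever interval the monitoring system samples at, which may be longer than the window behind the per-second metrics. The following is a minimal sketch of that calculation; the function name and the sample values are hypothetical.

```python
def average_rate(previous_count, current_count, interval_seconds):
    """Return the average events per second between two counter samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (current_count - previous_count) / interval_seconds


# Hypothetical /insert_count samples taken 60 seconds apart.
inserts_per_sec_avg = average_rate(1000, 1600, 60)
```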
View-Specific Metrics
If your application uses views, the following metric, when combined with the general topic metrics above, can give you insight into view activity and the processing load for the view.
Base Metrics Path: /amps/instance/views/<topic_name>!<message_type>
| Metric | Short Description |
| --- | --- |
| `/queue_depth` | The current number of pending updates for the view. |

Comparing the queue depth in a statistics snapshot with maximum recorded values for update, insert, and delete for the previous intervals can provide a rough approximation of the current latency for this view. For example, if the maximum number of total updates per second, as calculated by the sum of `updates_per_sec`, `inserts_per_sec`, and `deletes_per_sec` for the topic, is 15,000, a current queue depth of 1,500 would be expected to be processed in 100 ms or less.
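The latency approximation described above is a simple division of the view's queue depth by the combined peak rate. A minimal sketch follows; the function name is hypothetical, and the split of the 15,000 total across the three rates is invented for illustration, since the text gives only the sum.

```python
def estimated_view_latency_ms(queue_depth, max_updates_per_sec,
                              max_inserts_per_sec, max_deletes_per_sec):
    """Roughly estimate the time, in milliseconds, to drain a view's queue.

    Returns None when no peak rate has been recorded, since the
    estimate is undefined in that case.
    """
    total_per_sec = max_updates_per_sec + max_inserts_per_sec + max_deletes_per_sec
    if total_per_sec <= 0:
        return None
    return 1000.0 * queue_depth / total_per_sec


# Queue depth of 1,500 against a combined peak of 15,000 per second.
latency_ms = estimated_view_latency_ms(1500, 9000, 5000, 1000)
```

With these inputs the estimate is 100 ms, matching the example in the text. This remains a rough approximation, not a measurement.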
Queue-Specific Metrics
If your application uses queues, the following minimal metrics monitor traffic for a queue. These should be monitored in addition to the general topic metrics above.
Base Metrics Path: /amps/instance/queues/<topic_name>!<message_type>
| Metric | Short Description |
| --- | --- |
| `/seconds_behind` | Age of oldest unacknowledged message in the queue. |
| `/queue_depth` | Number of messages in the queue. |
Replication Destination Metrics
The following metrics monitor traffic to an outgoing replication destination.
Base Metrics Path: /amps/instance/replication/<destination>
| Metric | Short Description |
| --- | --- |
| `/is_connected` | Whether or not this destination is currently connected. |
| `/seconds_behind` | The current point in the transaction log that has been acknowledged by this destination. See the notes following this table. |
| `/messages_out_per_sec` | Number of messages sent to the destination (per second). |

The `/seconds_behind` value is calculated as the difference in seconds between the time that the last message acknowledged by the destination was written to the transaction log and the time that the most recent message was written to the transaction log.

That is, if the last message that the destination has acknowledged was written to the local transaction log at `12:00:01.100` (one second and 100 ms after 12:00) and the last message in the transaction log was written at `12:00:03.212`, the seconds behind shown in the current statistics would be approximately `2.112`. Acknowledgements are transmitted at a specific interval (1s by default) from the destination instance to the source instance. AMPS rounds any value below `1` to `0`.

Note: This is not an estimate of the time required to synchronize the downstream instance.
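The arithmetic behind the replication `/seconds_behind` value can be sketched as follows. This illustrates the calculation as described in this section and is not the AMPS implementation. The function name is hypothetical, and the calendar date in the example is arbitrary because the text gives only times of day.

```python
from datetime import datetime


def seconds_behind(last_acked_written, latest_written):
    """Return the gap in seconds between two transaction log write times.

    Values below 1 are reported as 0, following the rounding rule
    stated in the text.
    """
    delta = (latest_written - last_acked_written).total_seconds()
    return 0.0 if delta < 1 else delta


# Times from the example in the text; the date itself is arbitrary.
acked = datetime(2024, 1, 1, 12, 0, 1, 100000)   # 12:00:01.100
latest = datetime(2024, 1, 1, 12, 0, 3, 212000)  # 12:00:03.212
behind = seconds_behind(acked, latest)           # approximately 2.112
```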
Replicated Queue Metrics
The following metrics monitor traffic for a replicated queue. These should be monitored in addition to the general replication metrics above and the instance-specific metrics for the queue.
Base Metrics Path: /amps/instance/queues/<topic_name>!<message_type>
| Metric | Short Description |
| --- | --- |
| `/transferred_in` | Number of messages that have had ownership transferred to this instance. |
| `/transferred_out` | Number of messages that this instance has previously owned, but granted ownership to another instance. |
| `/owned` | Number of messages currently owned. |
Application Connection Metrics
The following metrics monitor network activity for a client connection.
Base Metrics Path: /amps/instance/clients/<client_id>
| Metric | Short Description |
| --- | --- |
| `/transport_rx_queue` | Number of bytes in the transport receive queue for this connection. |
| `/transport_tx_queue` | Number of bytes in the transport send queue for this connection. |
| `/bytes_out_per_sec` | Number of bytes per second sent to the client. |
| `/bytes_in_per_sec` | Number of bytes per second received from the client. |
| `/queue_depth_out` | Number of messages buffered in AMPS for the client. |
| `/queue_max_latency` | Oldest message buffered in AMPS for the client. |