Create Monitoring Strategy
To detect any problems that arise in AMPS or in the underlying hardware, it's important to develop and implement a monitoring strategy.
AMPS is designed to be able to work with your existing monitoring infrastructure, including systems such as ITRS Geneos, Grafana, DataDog, and so on.
The AMPS Monitoring Guide describes the metrics available through the Administrative Interface. A complete monitoring strategy would include metrics tailored to the use case, and would use alerting thresholds based on the environment and the guarantees provided by the application.
This chapter includes a suggested minimal set of metrics to be tracked in a monitoring system. A full monitoring strategy would likely include additional metrics that are relevant to the specific environment and AMPS features used for the application.
The metrics listed here are a suggested baseline. Not every metric applies to every installation, and for any given installation of AMPS, other metrics are likely to be important as well. See the AMPS Monitoring Guide for details on the metrics available, and create a monitoring strategy that reflects your environment and how your application uses AMPS.
Event Logging
The AMPS error and event log contains an ordered log of significant events in AMPS. The detail provided depends on the verbosity at which the logging is configured.
For a production instance of AMPS, a logging level of info (or more verbose) is recommended.
Any event recorded with a severity level of error, critical, or emergency indicates that an operation has failed in a way that may have caused an application to receive partial or incorrect data. These events should be investigated.
Events at an error or critical level do not necessarily mean that AMPS is not functioning as expected, but they should still be investigated. For example, if an application submits a command that AMPS doesn't recognize, that is logged as an error level event, since the application will not get the expected data, even though AMPS is correctly rejecting an unknown command. This event indicates that an application submitted an incorrect operation and is likely not functioning as expected, even though there is no issue in the AMPS server itself.
A robust monitoring strategy will monitor the event logs for events of error level and above so that those events can be investigated and corrected.
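As a minimal sketch of that filtering step, the following assumes log entries have already been parsed into (severity, message) pairs. The parsing itself depends on the configured log format and is not shown. The function name, variable names, and sample events are illustrative and are not part of AMPS.

```python
# Severity names at or above error, as listed in this chapter.
ALERT_SEVERITIES = {"error", "critical", "emergency"}


def events_to_investigate(events):
    """Return the (severity, message) pairs that warrant investigation.

    `events` is an iterable of (severity, message) tuples. How those tuples
    are extracted from the log file is an assumption left to the caller.
    """
    return [
        (severity, message)
        for severity, message in events
        if severity.strip().lower() in ALERT_SEVERITIES
    ]


# Hypothetical, hand-written sample events (not real AMPS log output).
sample = [
    ("info", "client connected"),
    ("error", "unrecognized command"),
    ("critical", "example critical event"),
]
flagged = events_to_investigate(sample)
```

In practice, a monitoring system would typically forward the flagged events to its own alerting pipeline rather than collect them in a list.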
Baseline Host Metrics
Typically, a monitoring system will capture, at a minimum, the following metrics about host-level performance. Since these relate to the underlying system rather than AMPS itself, many sites already collect the equivalent of these statistics by default.
This is not a complete list of statistics available for the host, but provides a starting point for developing your monitoring plan.
Base Metrics Path: /amps/host
Baseline Message Flow Metrics
The following metrics monitor overall message flow to the instance.
This is not a complete list of statistics available for message flow, but provides a starting point for developing your monitoring plan.
Base Metrics Path: /amps/instance/processors/all
SOW Topic Traffic Metrics
The following metrics monitor message flow for specific topics in the SOW (including Topics, Views, ConflatedTopics, and all replication models for Queues).
Depending on your application, of course, a given metric may not be relevant. (For example, if an application only uses queues, then the "update" metrics would not be relevant, since a message can be added to the queue or removed from the queue, but cannot be modified while in the queue.)
Base Metrics Path: /amps/instance/sow/<topic_name>!<message_type>
View-Specific Metrics
If your application uses views, the following metric, when combined with the general topic metrics above, can give you insight into queue activity and the processing load for the queue.
Base Metrics Path: /amps/instance/views/<topic_name>!<message_type>
Comparing the queue depth in a statistics snapshot with the maximum recorded values for update, insert, and delete for the previous intervals can provide a rough approximation of the current latency for this view. For example, if the maximum number of total updates per second, as calculated by the sum of updates_per_sec, inserts_per_sec, and deletes_per_sec for the topic, is 15,000, a current queue depth of 1500 would be expected to be processed in 100ms or less.
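The estimate above is simple arithmetic, and can be sketched as follows. The function name is hypothetical, and the split of the 15,000 per-second total across the three rates is made up for illustration. Only the total and the queue depth come from the example in the text.

```python
def estimated_drain_time_ms(queue_depth, max_updates_per_sec,
                            max_inserts_per_sec, max_deletes_per_sec):
    """Rough time, in milliseconds, to process the current queue depth.

    Divides the queue depth by the sum of the maximum recorded update,
    insert, and delete rates. This is an approximation, not a guarantee.
    """
    total_per_sec = (max_updates_per_sec
                     + max_inserts_per_sec
                     + max_deletes_per_sec)
    if total_per_sec <= 0:
        raise ValueError("total processing rate must be positive")
    return queue_depth * 1000.0 / total_per_sec


# Illustrative split of a 15,000/sec total; queue depth of 1500.
latency_ms = estimated_drain_time_ms(1500, 9000, 5000, 1000)
print(latency_ms)  # 100.0
```

Because the rates used are recorded maximums, the result is a best-case estimate. A view that is currently processing more slowly than its recorded peak will take longer to drain.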
Queue-Specific Metrics
If your application uses queues, the following minimal metrics monitor traffic for a queue. These should be monitored in addition to the general topic metrics above.
Base Metrics Path: /amps/instance/queues/<topic_name>!<message_type>
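The topic-scoped base paths in this chapter (for the SOW, views, and queues) share the same shape, so a monitoring script might build them with a small helper. This is a sketch only: the helper name is hypothetical, and the topic name orders and message type json are placeholder values. How the path is requested from the Administrative Interface (host, port, and output format) is not covered here.

```python
# Sections taken from the topic-scoped base paths listed in this chapter.
TOPIC_SECTIONS = ("sow", "views", "queues")


def topic_metrics_path(section, topic_name, message_type):
    """Fill in /amps/instance/<section>/<topic_name>!<message_type>."""
    if section not in TOPIC_SECTIONS:
        raise ValueError(f"unknown section: {section!r}")
    return f"/amps/instance/{section}/{topic_name}!{message_type}"


# Placeholder topic name and message type, for illustration only.
queue_path = topic_metrics_path("queues", "orders", "json")
print(queue_path)  # /amps/instance/queues/orders!json
```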
Replication Destination Metrics
The following metrics monitor traffic to an outgoing replication destination.
Base Metrics Path: /amps/instance/replication/<destination>
Replicated Queue Metrics
The following metrics monitor traffic for a replicated queue. These should be monitored in addition to the general replication metrics above and the instance-specific metrics for the queue.
Base Metrics Path: /amps/instance/queues/<topic_name>!<message_type>
Application Connection Metrics
The following metrics monitor network activity for a client connection.
Base Metrics Path: /amps/instance/clients/<client_id>