Deployment Checklist
Introduction and Overview
This document presents a basic checklist for successfully deploying AMPS. The checklist is meant to describe the general processes and considerations for an AMPS instance. Depending on the specific needs of your installation, there may be additional factors to consider.
Every deployment of AMPS should consider the factors in the checklist below. With this list, you can check off each step as it is completed:
Task | Completed? |
---|---|
Ensure Sufficient Capacity for the system, including capacity testing as necessary. | |
Verify the recommended settings for the host system, apply the System Configuration, and add the AMPS NUMA Configuration if necessary for the deployment. | |
Create a Maintenance Plan. | |
Create and implement a Monitoring Strategy. | |
Create a Patch and Upgrade Plan. | |
Create and test the support plan to define and verify the process for providing artifacts to 60East, responding to issues, and so on. | |
Ensure Sufficient Capacity
Successful deployment of any system includes making sure that server capacity and network capacity are sufficient to support the application.
60East recommends using the capacity planning metrics in the latest version of the AMPS User Guide to ensure sufficient memory, disk, and network capacity.
Notice that the advice in the AMPS User Guide also recommends allowing extra capacity to ensure that the system has "headroom" in the case of unexpected increases in traffic.
Capacity Testing
60East recommends testing capacity estimates by running the system in a test environment with hardware and networking as similar to the production environment as possible. In these tests, it is important to simulate peak production load (often 150-200% or more of the historical peak of the system) to ensure that the system can handle load with the expected SLA.
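As a rough illustration of the load target described above, the following Python sketch turns a historical peak into a range of rates to simulate. The function name and the sample peak of 40,000 messages per second are hypothetical; the 150% and 200% factors come from the guidance above and may need to be higher for your system.

```python
def peak_test_load(historical_peak_per_sec, low_factor=1.5, high_factor=2.0):
    """Return the (low, high) message rates to simulate in a capacity test.

    The default factors reflect the 150-200% guidance; raise them if the
    application expects larger bursts.
    """
    if historical_peak_per_sec < 0:
        raise ValueError("historical peak must be non-negative")
    return (historical_peak_per_sec * low_factor,
            historical_peak_per_sec * high_factor)


# Hypothetical historical peak of 40,000 messages per second.
low, high = peak_test_load(40_000)
print(f"Simulate between {low:,.0f} and {high:,.0f} messages/sec")
```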
Work with the consumers of the application to ensure that the application is not doing unnecessary work (for example, publishing messages that no subscriber is expected to consume, or publishing fields that are not used by subscribers).
Capacity for Virtual Machines
There is no difference in the capacity planning guidance for AMPS instances hosted on a virtual machine as compared with AMPS instances hosted on physical hardware.
When the AMPS instance will be deployed on a virtual machine (VM), consider both the capacity needs of the host system and the capacity needs of all of the virtual machines hosted on the system. In particular, 60East recommends that the total maximum capacity of all virtual machines hosted on the system, including "headroom", be less than the total capacity available on the host.
In cases where the physical hardware is overprovisioned for the demands of the virtual machines that are hosted on that hardware, performance can suffer. In an overprovisioning situation, AMPS and other applications often see performance degradation due to the physical hardware being unable to meet the demands of the VMs that the hardware hosts.
60East does not recommend overprovisioning systems when using virtual machines. Failing to provide enough capacity for all of the VMs on the physical host operating at peak capacity can lead to inconsistent or degraded performance.
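The recommendation above amounts to a simple inequality, which can be sketched in Python as follows. The function name, the 25% headroom figure, and the sample memory sizes are assumptions for illustration; use the headroom your own capacity planning calls for, and apply the same check to each resource (memory, CPU, network, storage).

```python
def host_has_capacity(host_capacity, vm_peak_demands, headroom=0.25):
    """Return True if every VM at peak, plus headroom, fits on the host.

    headroom is a fraction of each VM's peak demand (0.25 is an assumed
    value, not a 60East recommendation).
    """
    required = sum(demand * (1 + headroom) for demand in vm_peak_demands)
    return required < host_capacity


# Hypothetical host with 256 GB of memory.
print(host_has_capacity(256, [64, 64, 32]))    # three modest VMs fit
print(host_has_capacity(256, [128, 96, 64]))   # these VMs overcommit the host
```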
Apply System and AMPS Configuration
System Configuration
Once you have determined the required capacity for the system and the hardware is available, ensure that the system is configured to support AMPS.
60East recommends the following general settings:
- NUMA enabled in the OS
- Hyperthreading enabled
In addition, 60East recommends disabling any settings in the BIOS designed to save power, as these settings limit performance. While each BIOS differs in how these settings are changed, 60East recommends:
- Disable C-states (set C state to mode 0, active mode)
- Disable P-states (set P state to mode 0, highest frequency)
The AMPS User Guide contains detailed settings for tuning the Linux operating system for AMPS. Ensure that you have applied the recommended settings to your operating system installation.
AMPS Configuration
If the system will host multiple instances of AMPS, or any other process that is pinned to a specific core, disable the NUMA tuning in the AMPS engine to allow the operating system NUMA tuning to distribute AMPS instances across processors.
To disable AMPS NUMA tuning, include the following element in the AMPS configuration file:
<Tuning>
<NUMA>
<Enabled>disabled</Enabled>
</NUMA>
</Tuning>
Change the NUMA setting for the instance in cases where a given system will host multiple instances of AMPS, where AMPS (or another process) will be pinned to specific cores, or if AMPS will be running in a virtual environment (including a container).
Notice that this advice applies only to the NUMA tuning done by AMPS for the AMPS instance itself. 60East does not recommend disabling NUMA in the operating system.
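The conditions under which AMPS NUMA tuning should be disabled can be summarized as a single check. This Python sketch is only a restatement of the guidance in this section; the function and parameter names are hypothetical and are not part of AMPS.

```python
def should_disable_amps_numa(instances_on_host, pinned_to_cores, virtualized):
    """Return True if the AMPS configuration should disable NUMA tuning.

    instances_on_host: number of AMPS instances on the system
    pinned_to_cores:   AMPS or another process is pinned to specific cores
    virtualized:       AMPS runs in a VM or a container
    """
    return instances_on_host > 1 or pinned_to_cores or virtualized


# A single, unpinned instance on physical hardware keeps AMPS NUMA tuning.
print(should_disable_amps_numa(1, False, False))
# A containerized instance should disable it.
print(should_disable_amps_numa(1, False, True))
```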
Create Maintenance Plan
To keep AMPS running smoothly, 60East recommends developing a maintenance plan, and creating an action configuration to implement the maintenance plan.
AMPS actions provide a way for the server to perform specific tasks in response to specific events. The most common use for actions is to create a maintenance schedule, as described below.
Details and configuration for AMPS actions are provided in the Configuring AMPS for Automation with Actions section of the AMPS User Guide.
The following table describes maintenance for various features of AMPS. If your installation of AMPS doesn't use a given feature, there is no need to have a maintenance plan for that feature.
Feature | Maintenance Required | Actions |
---|---|---|
Statistics Database | Truncate unneeded statistics. 60East recommends retaining statistics for the period of time that you will need to have available for troubleshooting problems. For example, if your monitoring policies evaluate throughput twice a week, truncating statistics the day after each evaluation may be reasonable. | amps-action-do-truncate-statistics |
Transaction Log | Remove unneeded journal files. 60East recommends removing journal files when the messages contained within the file are no longer needed. | amps-action-do-remove-journal |
Event and Error Logs | Remove unneeded log files. 60East recommends retaining log files for the period of time that you will need to have available for troubleshooting problems. | amps-action-do-remove-files |
60East recommends that you consider the following maintenance actions, and whether they are appropriate for your installation:
Feature | Optional Maintenance | Actions |
---|---|---|
Transaction Log | Archiving journals to higher-capacity storage. Journal files moved to the archive directory are still part of the transaction log and are available for replay, replication, queue message delivery, and so on. 60East recommends moving journals to higher capacity storage in the event that high-speed storage does not have enough space to hold the journals required to meet the needs of the application for access to older messages. | amps-action-do-archive-journal |
Transaction Log | Compressing journals to save space. Compressed journal files are still part of the transaction log and are available for replay, replication, queue message delivery, and so on. 60East recommends compressing journals in cases where storage space is restricted, or where storage performance for older journals is low, such that spending the extra CPU time to uncompress a journal will provide higher performance than retrieving an uncompressed journal. The amount of space saved by compression depends on the data contained in each journal file. | amps-action-do-compress-journal |
State of the World (SOW) | Removing unnecessary records. 60East recommends deleting records from the SOW when they are no longer needed. Many times, this happens as a part of the workflow for the application. In cases where the workflow does not inherently remove unneeded data, consider using an action to maintain the topic. | amps-action-do-delete-sow |
Example Maintenance Plan
For example, the following maintenance plan runs each day, at 23:00 (11 PM local time).
The plan performs the following maintenance:
- Archives journal files older than three days
- Removes journal files older than 14 days
- Truncates statistics to keep the last 7 days of statistics
- Removes log files in the $AMPS_ADMIN/logs directory that are older than 7 days, and
- Removes any records in the Orders topic in the SOW that have a status of canceled and that have not been updated in the last two days.

There are a few important points to notice in the filter used to remove records from the Orders topic:
- Since the timestamps have a unit of seconds, the Filter describes 48 hours as 172800 seconds (60 seconds in a minute * 60 minutes in an hour * 48 hours).
- Since the configuration file is in XML format, the Filter uses the XML escape for the < symbol (&lt;). This is translated into a < symbol when the configuration file is parsed.
- Within an action, AMPS replaces the token {{AMPS_UNIX_TIMESTAMP}} with the point in time that the action begins running. We can use this to create a Filter that uses the correct value no matter when the filter is run.
<Actions>
<Action>
<On>
<Module>amps-action-on-schedule</Module>
<Options>
<Name>Nightly maintenance</Name>
<Every>23:00</Every>
</Options>
</On>
<!-- Archive journals older than three days. -->
<Do>
<Module>amps-action-do-archive-journal</Module>
<Options>
<Age>3d</Age>
</Options>
</Do>
<!-- Remove journals older than 14 days. -->
<Do>
<Module>amps-action-do-remove-journal</Module>
<Options>
<Age>14d</Age>
</Options>
</Do>
<!-- Remove statistics older than 7 days. -->
<Do>
<Module>amps-action-do-truncate-statistics</Module>
<Options>
<Age>7d</Age>
</Options>
</Do>
<!-- Remove event logs that are older than
seven days. -->
<Do>
<Module>amps-action-do-remove-files</Module>
<Options>
<Pattern>${AMPS_ADMIN}/logs/*.log</Pattern>
<Age>7d</Age>
</Options>
</Do>
<!-- Remove records from the Orders topic in the SOW
that are canceled and have not been updated
in two days. -->
<Do>
<Module>amps-action-do-delete-sow</Module>
<Options>
<Topic>Orders</Topic>
<MessageType>json</MessageType>
<Filter>/status = 'canceled'
AND LAST_UPDATED() &lt; ({{AMPS_UNIX_TIMESTAMP}} - 172800) </Filter>
</Options>
</Do>
</Action>
</Actions>
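The retention arithmetic used in the Orders filter is easy to get wrong when the retention period changes. The following Python sketch shows one way to check the number of seconds and the resulting cutoff time. The helper names are hypothetical; the sketch only reproduces the arithmetic and does not interact with AMPS.

```python
import time

SECONDS_PER_HOUR = 60 * 60  # 60 seconds in a minute * 60 minutes in an hour


def retention_seconds(hours):
    """Convert a retention period in hours to seconds."""
    return hours * SECONDS_PER_HOUR


def delete_cutoff(now_unix, hours):
    """Return the Unix timestamp before which records are eligible for removal.

    now_unix plays the role that {{AMPS_UNIX_TIMESTAMP}} plays in the filter.
    """
    return now_unix - retention_seconds(hours)


print(retention_seconds(48))                  # 172800, the value in the filter
print(delete_cutoff(int(time.time()), 48))    # cutoff if the action ran now
```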
Create Monitoring Strategy
To detect any problems that arise in AMPS or in the underlying hardware, it's important to develop and implement a monitoring strategy.
AMPS is designed to be able to work with your existing monitoring infrastructure, including systems such as ITRS Geneos, Grafana, DataDog, and so on.
The AMPS Monitoring Guide describes the metrics available through the Administrative Interface. A complete monitoring strategy would include metrics tailored to the use case, and would use alerting thresholds based on the environment and the guarantees provided by the application.
This chapter includes a suggested minimal set of metrics to be tracked in a monitoring system. A full monitoring strategy would likely include additional metrics that are relevant to the specific environment and AMPS features used for the application.
The metrics offered here are one suggested baseline set of metrics. Not every metric applies to every installation. For any given installation of AMPS, other metrics are likely also important. See the AMPS Monitoring Guide for details on the metrics available, and create a monitoring strategy that reflects your environment and how your application uses AMPS.
Event Logging
The AMPS error and event log contains an ordered log of significant events in AMPS. The detail provided depends on the verbosity at which the logging is configured.
For a production instance of AMPS, a logging level of info (or more verbose) is recommended.
Any event recorded with a severity level of error, critical, or emergency indicates that an operation has failed in such a way that an application may have received partial or incorrect data. These events should be investigated.
Events at an error or critical level do not necessarily mean that AMPS is not functioning as expected, but they should still be investigated. For example, if an application submits a command that AMPS doesn't recognize, that would be logged as an error level event since that application will not get the expected data, even though AMPS is correctly rejecting an unknown command. However, this event indicates that an application submitted an incorrect operation and is likely not functioning as expected, even though there is no issue in the AMPS server itself.
A robust monitoring strategy will monitor the event logs for events of error level and above so that those events can be investigated and corrected.
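As a minimal sketch of this kind of log monitoring, the Python below selects log lines at error level and above. It assumes that each log line contains the severity as a lowercase word followed by a colon; that layout and the sample lines are assumptions for illustration, so adjust the pattern to match the log format your instance actually produces.

```python
import re

# Severity levels that the guidance above says should be investigated.
ALERT_LEVELS = ("error", "critical", "emergency")

# ASSUMPTION: the severity appears in each line as "<level>:".
_LEVEL_RE = re.compile(r"\b(" + "|".join(ALERT_LEVELS) + r"):")


def events_to_investigate(lines):
    """Return the log lines recorded at error level or above."""
    return [line.rstrip("\n") for line in lines if _LEVEL_RE.search(line)]


# Hypothetical log lines, not real AMPS output.
sample = [
    "2024-01-01T00:00:00 info: instance started",
    "2024-01-01T00:00:05 error: unrecognized command from client",
    "2024-01-01T00:00:09 warning: slow client detected",
    "2024-01-01T00:00:12 critical: disk write failed",
]
for event in events_to_investigate(sample):
    print(event)
```

A pattern this simple can also match message text that happens to contain a word such as "error:", so a production monitor would normally anchor the match to the position of the severity field.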
Baseline Host Metrics
Typically, a monitoring system will capture, at a minimum, the following metrics about host-level performance. Since these relate to the underlying system rather than AMPS itself, many sites already collect the equivalent of these statistics by default.
This is not a complete list of statistics available for the host, but provides a starting point for developing your monitoring plan.
Base Metrics Path: /amps/host
Metric | Short Description |
---|---|
/memory/free | Amount of memory currently free. |
/memory/in_use | Amount of memory in use. |
/memory/swap_free | Amount of swap currently free. |
/memory/swap_total | Total amount of swap. |
/network/<if>/bytes_in | Total bytes in (by interface name). |
/network/<if>/bytes_out | Total bytes out (by interface name). |
/disks/<dev>/file_system_free_percent | Free space (by device). |
/cpus/all/iowait_percent | Amount of CPU time waiting on I/O. |
/cpus/all/idle_percent | Amount of CPU time idle. |
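Once these metrics are collected, a monitoring system typically compares them against alerting thresholds. The Python sketch below shows one possible shape for that check, operating on a snapshot that has already been retrieved. The threshold values, the function name, and the device name sda1 are hypothetical and should be replaced with values that suit your environment.

```python
BASE_PATH = "/amps/host"


def host_alerts(snapshot, min_disk_free=15.0, max_iowait=20.0, min_idle=10.0):
    """Return alert messages for a snapshot of host metrics.

    snapshot maps metric paths (relative to BASE_PATH) to numeric values.
    The default thresholds are assumed values for illustration only.
    """
    alerts = []
    for path, value in sorted(snapshot.items()):
        full_path = BASE_PATH + path
        if path.endswith("/file_system_free_percent") and value < min_disk_free:
            alerts.append(f"{full_path}: only {value}% free")
        elif path == "/cpus/all/iowait_percent" and value > max_iowait:
            alerts.append(f"{full_path}: {value}% of CPU time waiting on I/O")
        elif path == "/cpus/all/idle_percent" and value < min_idle:
            alerts.append(f"{full_path}: only {value}% idle")
    return alerts


# Hypothetical snapshot: the disk is nearly full, CPU is healthy.
snapshot = {
    "/disks/sda1/file_system_free_percent": 8.0,
    "/cpus/all/iowait_percent": 3.0,
    "/cpus/all/idle_percent": 70.0,
}
for alert in host_alerts(snapshot):
    print(alert)
```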
Baseline Message Flow Metrics
The following metrics monitor overall message flow to the instance.
This is not a complete list of statistics available for message flow, but provides a starting point for developing your monitoring plan.
Base Metrics Path: /amps/instance/processors/all
Metric | Short Description |
---|---|
/messages_received_per_sec | Incoming messages per second from all sources. |
/denied_reads | Outgoing messages denied due to entitlement (on valid subscription). |
/denied_writes | Incoming messages denied due to entitlement (publishes). |
/throttle_count | Number of times that the processor had to wait to add a message to the processing pipeline due to the instance reaching capacity limits on the number of in-progress messages. This metric can indicate resource constraints on AMPS. |
/last_active | The last active time for a message processor. If this time grows, either there is no traffic to the instance, or there is a delay in processing. |
SOW Topic Traffic Metrics
The following metrics monitor message flow for specific topics in the SOW (including Topics, Views, ConflatedTopics, and all replication models for Queues).
Depending on your application, of course, a given metric may not be relevant. (For example, if an application only uses queues, then the "update" metrics would not be relevant, since a message can be added to the queue or removed from the queue, but cannot be modified while in the queue.)
Base Metrics Path: /amps/instance/sow/<topic_name>!<message_type>
Metric | Short Description |
---|---|
/inserts_per_sec | Number of new records added per second. |
/updates_per_sec | Number of records updated per second. |
/deletes_per_sec | Number of records deleted per second. |
/queries_per_sec | Number of queries of the topic per second. |
/insert_count | Total count of new records added to the topic. |
/delete_count | Total count of records removed from the topic. |
/update_count | Total count of updates to records in the topic. |
View-Specific Metrics
If your application uses views, the following metric, when combined with the general topic metrics above, can give you insight into view activity and the processing load for the view.
Base Metrics Path: /amps/instance/views/<topic_name>!<message_type>
Metric | Short Description |
---|---|
/queue_depth | The current number of pending updates for the view. |
Comparing the queue depth in a statistics snapshot with maximum recorded values for update, insert, and delete for the previous intervals can provide a rough approximation of the current latency for this view. For example, if the maximum number of total updates per second, as calculated by the sum of updates_per_sec, inserts_per_sec, and deletes_per_sec for the topic, is 15,000, a current queue depth of 1500 would be expected to be processed in 100ms or less.
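The approximation above can be written as a short calculation. In this Python sketch, the function name is hypothetical and the split of the 15,000 total across updates, inserts, and deletes is invented for illustration; only the total and the queue depth come from the example in the text.

```python
def estimated_view_latency_ms(queue_depth, max_updates_per_sec,
                              max_inserts_per_sec, max_deletes_per_sec):
    """Roughly estimate how long the pending view updates take to process.

    The maximum total rate is the sum of the three per-second maximums
    recorded for the topic over previous intervals.
    """
    max_rate = max_updates_per_sec + max_inserts_per_sec + max_deletes_per_sec
    if max_rate <= 0:
        raise ValueError("maximum rate must be positive")
    return queue_depth * 1000 / max_rate


# 9,000 + 5,000 + 1,000 = 15,000 changes per second at most.
print(estimated_view_latency_ms(1500, 9000, 5000, 1000))  # 100.0
```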
Queue-Specific Metrics
If your application uses queues, the following minimal metrics monitor traffic for a queue. These should be monitored in addition to the general topic metrics above.
Base Metrics Path: /amps/instance/queues/<topic_name>!<message_type>
Metric | Short Description |
---|---|
/seconds_behind | Age of oldest unacknowledged message in the queue. |
/queue_depth | Number of messages in the queue. |
Replication Destination Metrics
The following metrics monitor traffic to an outgoing replication destination.
Base Metrics Path: /amps/instance/replication/<destination>
Metric | Short Description |
---|---|
/is_connected | Whether or not this destination is currently connected. |
/seconds_behind | The current point in the transaction log that has been acknowledged by this destination. This is calculated as the difference in seconds between the time that the last message acknowledged by the destination was written to the transaction log and the time that the most recent message was written to the transaction log. That is, if the last message that the destination has acknowledged was written to the local transaction log at [...]. AMPS rounds any value below [...]. Note: This is not an estimate of the time required to synchronize the downstream instance. |
/messages_out_per_sec | Number of messages sent to the destination (per second). |
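To make the definition of seconds_behind concrete, the following Python sketch computes the difference between the two transaction log write times described in the table. The timestamps are hypothetical, and truncating to whole seconds is an assumption of this sketch; the rounding that AMPS applies to small values is not modeled here.

```python
from datetime import datetime


def seconds_behind(last_acked_written, most_recent_written):
    """Seconds between the last acknowledged write and the most recent write.

    Both arguments are the times at which the messages were written to the
    local transaction log. The result is truncated to whole seconds
    (an assumption of this sketch) and is never negative.
    """
    delta = most_recent_written - last_acked_written
    return max(0, int(delta.total_seconds()))


# Hypothetical write times, 42 seconds apart.
acked = datetime(2024, 1, 1, 12, 0, 0)
latest = datetime(2024, 1, 1, 12, 0, 42)
print(seconds_behind(acked, latest))  # 42
```

As the table notes, this value measures how far back in the log the destination has acknowledged, not how long the destination will take to catch up.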
Replicated Queue Metrics
The following metrics monitor traffic for a replicated queue. These should be monitored in addition to the general replication metrics above and the instance-specific metrics for the queue.
Base Metrics Path: /amps/instance/queues/<topic_name>!<message_type>
Metric | Short Description |
---|---|
/transferred_in | Number of messages that have had ownership transferred to this instance. |
/transferred_out | Number of messages that this instance has previously owned, but granted ownership to another instance. |
/owned | Number of messages currently owned. |
Application Connection Metrics
The following metrics monitor network activity for a client connection.
Base Metrics Path: /amps/instance/clients/<client_id>
Metric | Short Description |
---|---|
/transport_rx_queue | Number of bytes in the transport receive queue for this connection. |
/transport_tx_queue | Number of bytes in the transport send queue for this connection. |
/bytes_out_per_sec | Number of bytes per second sent to the client. |
/bytes_in_per_sec | Number of bytes per second received from the client. |
/queue_depth_out | Number of messages buffered in AMPS for the client. |
/queue_max_latency | Age of the oldest message buffered in AMPS for the client. |
Create Patch and Upgrade Plan
60East releases ongoing hotfixes for supported versions of AMPS. 60East recommends that every AMPS deployment create a regular schedule for updating the version of AMPS in development, test, and production environments.
Hotfix releases are designed to address a single issue (although, of course, a single issue may present multiple different symptoms). A hotfix release typically involves simply deploying the updated AMPS distribution and restarting the server.
Every hotfix is a cumulative rollup. All hotfixes pass the release certification process, and all hotfixes produced are released as soon as they pass certification. This means that an installation of AMPS can treat any hotfix release as a cumulative upgrade, and that any hotfix release will also contain every previously-released fix. An installation can get the latest fixes at any point in time, without needing to raise a support case.
60East recommends that production deployments develop two processes for qualifying and deploying upgrades:
- Planned patch/upgrade - This process is followed for periodic updates and evaluates new releases of AMPS on an ongoing basis, with the intent of having qualified releases roll into production every few weeks or months. Most often, organizations will have patch or upgrade releases continually under evaluation in their development and test environments for regular deployment to production.
- Accelerated patch/upgrade - This process streamlines updates in the event that an issue affects a production deployment of AMPS. This is often an accelerated set of tests that provides a high level of confidence for a given fix in a shorter time than a full qualification cycle.
Regardless of the process you settle on, 60East recommends that you evaluate and deploy patched versions on a regular basis. Since every patch released by 60East is cumulative for a given version, there is no need to wait for "rollup" or "service pack" releases, which means that an installation can upgrade at any time.
60East typically recommends that critical applications upgrade on a monthly or quarterly cadence for planned upgrade. Less critical applications may be patched less frequently, but should still be updated at least every six months.
Once you set a cadence for planned upgrades, schedule testing and upgrades at that interval. Since any hotfix release can be considered a "quarterly rollup" or "cumulative update", when the planned patch window arrives, pick the most current hotfix to test, certify, and deploy.
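Picking the most current hotfix from a list of available releases is a small but error-prone step, since comparing version strings as text sorts 5.3.4.9 after 5.3.4.10. The Python sketch below compares versions numerically. It assumes version strings are purely numeric and dot-separated (the X.Y.Z.H form), and the version numbers shown are hypothetical.

```python
def parse_version(version):
    """Split a dotted version string such as '5.3.4.10' into integers."""
    return tuple(int(part) for part in version.split("."))


def latest_hotfix(available, release_line):
    """Return the most current hotfix in `available` for a release line.

    release_line is a version prefix such as '5.3.4'; only versions that
    start with that prefix are considered.
    """
    prefix = parse_version(release_line)
    candidates = [v for v in available
                  if parse_version(v)[:len(prefix)] == prefix]
    if not candidates:
        raise ValueError(f"no releases found for {release_line}")
    return max(candidates, key=parse_version)


# Hypothetical list of available releases.
releases = ["5.3.4.2", "5.3.4.10", "5.3.4.9", "5.3.3.50"]
print(latest_hotfix(releases, "5.3.4"))  # 5.3.4.10
```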
Create and Test Support Process
A robust deployment and operations plan includes being able to easily get support if and when it becomes necessary.
60East recommends creating a support plan and testing the plan, including the process for contacting 60East, running diagnostic tools, and providing diagnostic artifacts such as statistics and event logs.
Create Support Plan, Verify Process
Develop a plan and procedure for contacting support in the event that assistance is needed for a production application.
For each instance of AMPS, be sure that the team contacting support can provide the following information as necessary:
- Version of AMPS
- AMPS configuration files
- Event/error logging from AMPS
- Statistics database from AMPS
It may be helpful to develop a template that operations teams can use when working with 60East to ensure that the relevant information is captured quickly, without requiring additional questions from 60East support.
For example, you might use a template like:
Hi 60East:
We are running AMPS Version: <X.Y.Z.H>
The impact of this issue is: <describe impact>
We are seeing the following behavior: <describe issue>
What we would expect to see is: <describe expected behavior>
This gives 60East support enough background to immediately identify the best available engineer to work on the issue. More information (such as server logs and statistics) may still be needed, but this will help the assigned engineer understand whether those artifacts would be helpful.
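If the operations team opens support requests from a script or a ticketing tool, the template above can be filled in programmatically so that no field is forgotten. This Python sketch uses the standard library string.Template class; the function name and the sample values are hypothetical.

```python
from string import Template

SUPPORT_TEMPLATE = Template(
    "Hi 60East:\n"
    "We are running AMPS Version: $version\n"
    "The impact of this issue is: $impact\n"
    "We are seeing the following behavior: $behavior\n"
    "What we would expect to see is: $expected\n"
)


def support_request(version, impact, behavior, expected):
    """Fill in the support template; substitute() raises if a field is missing."""
    return SUPPORT_TEMPLATE.substitute(
        version=version, impact=impact, behavior=behavior, expected=expected
    )


# Hypothetical values for illustration.
print(support_request(
    version="5.3.4.10",
    impact="Order entry is delayed for one desk",
    behavior="Subscribers stop receiving updates after failover",
    expected="Subscribers resume from their bookmark after failover",
))
```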
Verify Access to Diagnostic Tools
The AMPS distribution includes diagnostic tools for inspecting AMPS artifacts (for example, amps-grep, amps-sqlite3, amps_journal_dump, and so on).
An operations team responsible for running AMPS will typically need access to these tools to be able to troubleshoot issues. These tools make it easier to work with artifacts that are stored in text format (such as the error and event logs), and make it possible to inspect artifacts that are stored in binary format (such as journal files and SOW files).
It may not be necessary for these tools to be available on production servers. Often, a good strategy is to be able to run the tools on non-production servers or a sandbox set aside for investigation and to be able to move artifacts to the sandbox or non-production server for analysis.
Test Support Channels Before Deployment
Before going into production, it is helpful to test that any team that will be responsible for troubleshooting AMPS operations has the ability to retrieve logs and statistics for the AMPS servers and has a process in place for providing those artifacts to 60East. If necessary, the 60East support team can assist with providing a "process test" support case for an end-to-end test of the process.
It is particularly important to verify that artifacts can be retrieved from the actual production servers that AMPS is deployed on before an issue emerges. At many sites, production servers have more restrictions than development or test servers. A process designed and verified in a test environment may need to be modified for production.
Once a process for retrieving artifacts from production is developed, document the process so that information can be efficiently transferred if an issue arises.
Verify Support Coverage / Plan
60East support responds as soon as possible to issues that emerge. During hours when coverage is provided (as specified in your license agreement), 60East provides a guaranteed SLA to respond to operational issues and work with you toward a resolution.
Outside of those hours, 60East still attempts to provide support for critical issues -- even if those issues arise outside of the covered support times that an installation has chosen. However, support outside of contracted times is provided on an "as available" basis, with no guaranteed response time or level of engineer availability.
Although 60East support typically responds to issues rapidly, teams should plan for the worst case: an issue arises outside of the agreed-upon support coverage hours, and 60East support is unable to respond until coverage begins. A team should have a strategy for managing issues that emerge until 60East can respond, or until coverage hours begin.
For example, if an application is intended to be available 7 days a week, but the team has chosen a support plan that provides support during weekday business hours, the operations team for the application should have a plan in place for managing issues that fall outside of coverage hours, since responses outside of contracted support hours may be slower than the typical response times.
Conclusion
This document lists the minimum set of considerations for deploying a production instance of AMPS.
By necessity, the checklist presents general guidance and is meant to be an outline to help with your planning and rollout process rather than attempting to cover all of the factors that might be involved in a particular deployment.
For assistance and evaluation of your individual deployment plan, contact 60East at https://support.crankuptheamps.com.