Operations Best Practices

This section covers a selection of best practices for deploying AMPS.

Monitoring

AMPS exposes monitoring statistics through a RESTful interface, described in the Monitoring AMPS section and the AMPS Monitoring Guide. The interface is available at the address specified in the Admin section of the configuration. It allows developers, administrators, and standard monitoring tools to inspect AMPS performance and resource consumption.

At times, AMPS will emit log messages reporting that a thread has encountered a deadlock or a stressful operation. These messages repeat and contain the word “stuck”. AMPS will attempt to resolve these issues; however, if a single thread remains stuck for 60 seconds, AMPS automatically writes a minidump to the configured minidump directory. 60East support can use this minidump to help locate the stuck thread or the stressful process.

Monitor the contents of dmesg on the instance for errors that affect the AMPS process. For example, if the operating system runs low on memory and begins shutting down processes, this information will be recorded in dmesg. Likewise, system events such as hardware failures that can affect AMPS are most likely to be recorded in the dmesg output.
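One way to automate this check is to filter kernel log text for out-of-memory kills, hardware errors, and lines that name the AMPS process. The following is a minimal sketch, not a definitive implementation: the process name `ampServer`, the pattern list, and the sample log lines are assumptions for illustration, and the exact wording of kernel messages varies between kernel versions.

```python
import re

# Assumed patterns; adjust them to the messages your kernel actually emits.
PATTERNS = [
    re.compile(r"out of memory", re.IGNORECASE),
    re.compile(r"oom-killer", re.IGNORECASE),
    re.compile(r"hardware error", re.IGNORECASE),
    re.compile(r"I/O error", re.IGNORECASE),
]


def suspicious_lines(dmesg_text, process_name="ampServer"):
    """Return kernel log lines that match a pattern or name the process."""
    hits = []
    for line in dmesg_text.splitlines():
        if process_name in line or any(p.search(line) for p in PATTERNS):
            hits.append(line)
    return hits


# Hypothetical sample input, standing in for real dmesg output.
sample = """\
[ 100.0] eth0: link up
[ 200.0] Out of memory: Killed process 4242 (ampServer)
[ 300.0] usb 1-1: new device
"""
flagged = suspicious_lines(sample)
```

In practice, the text passed to `suspicious_lines` would come from the output of the `dmesg` command, which may require elevated privileges on some systems.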

Another area to examine when monitoring AMPS is the last_active monitor for the processors. This can be found at the /amps/instance/processors/all/last_active URL in the monitoring interface. If the last_active value increases continually for more than one minute and there is a noticeable decline in the quality of service, then it may be best to fail over and restart the AMPS instance.
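The "continually increases for more than one minute" condition can be checked mechanically once the values are being sampled. The sketch below assumes that a monitoring tool has already collected `(timestamp, last_active)` pairs from the monitoring interface; how the values are fetched and parsed is left out, since the response format is not described in this section. The function name and the sample data are hypothetical.

```python
def continuously_increasing(samples, window_s=60.0):
    """Check for a sustained rise in last_active.

    samples: list of (timestamp_seconds, last_active_value), oldest first.
    Returns True if the value rose at every step over a trailing span
    longer than window_s seconds.
    """
    if len(samples) < 2:
        return False
    run_start = samples[-1][0]
    # Walk backwards through consecutive pairs while the value keeps rising.
    pairs = zip(reversed(samples[:-1]), reversed(samples[1:]))
    for (t_prev, v_prev), (_, v_next) in pairs:
        if v_next <= v_prev:
            break
        run_start = t_prev
    return samples[-1][0] - run_start > window_s


# Hypothetical samples taken roughly every 20 seconds.
rising = [(0, 5), (20, 10), (40, 15), (61, 20)]
recovered = [(0, 5), (20, 0), (40, 15), (61, 20)]
```

A check like this only covers the first half of the condition above. Whether quality of service has declined still needs to be judged from other metrics before deciding to fail over.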

Logging

60East recommends that an instance of AMPS used for production log at info level (at a minimum). This provides a basic record of the operations requested in AMPS, and is the minimum level of logging needed to troubleshoot most issues. Further, a production instance should have the capacity available to log at a more verbose level if necessary, for troubleshooting and diagnostic purposes.

An instance used for development or UAT purposes should typically log at trace level so that the interaction between an application and AMPS is captured.

60East also recommends capturing stdout and stderr for the AMPS process. This can provide information about operating system or runtime errors in the event that a problem occurs outside of the control of AMPS (and, therefore, cannot be recorded in the AMPS event log).

Stopping AMPS

To stop AMPS, ensure that AMPS runs the amps-action-do-shutdown action. By default, this action is run when AMPS receives SIGHUP, SIGINT, or SIGTERM. However, you can also configure an action to shut down AMPS in response to other conditions. For example, if your company policy is to reboot servers every Saturday night, and AMPS is not running as a system service (or daemon), you could schedule an AMPS shutdown every Saturday before the system reboot.
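As an illustration of the signal-based path, the sketch below sends SIGTERM to a process by PID. It uses a stand-in child process rather than a real AMPS instance, and the helper name `request_shutdown` is hypothetical. It shows only how the signal is delivered, not how AMPS handles it.

```python
import os
import signal
import subprocess
import sys


def request_shutdown(pid, sig=signal.SIGTERM):
    """Send a signal that, by default, causes AMPS to run its shutdown action."""
    os.kill(pid, sig)


# Demonstration with a stand-in child process, not a real AMPS instance.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
request_shutdown(child.pid)
exit_code = child.wait(timeout=10)  # negative signal number on POSIX systems
```

For a real instance, the PID would come from wherever your deployment records it, and a scheduled job could call a helper like this before a planned reboot.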

When AMPS is installed to run as a system service (or daemon), AMPS installs shutdown scripts that will cleanly stop AMPS during a system shutdown or reboot.

SOW Parameters

Choosing the ideal SlabSize for your SOW topic is a balance between the frequency of SOW expansion and storage space efficiency. A large SlabSize will preallocate space for records when AMPS begins writing to the SOW.

If detailed tuning is not necessary, 60East recommends leaving the SlabSize at the default size if your messages are smaller than the default SlabSize. If your messages are larger than the default SlabSize, a good starting point for the SlabSize is to set it to several times the maximum message size you expect to store in the SOW.

There are three considerations when setting the optimum SlabSize:

Frequency of allocations
Overall size of the SOW
Efficient use of space

A small SlabSize results in frequent extensions of your SOW topic. These frequent extensions can reduce throughput in a heavily loaded system and, in extreme cases, can exhaust the kernel limit on the number of regions that a process can map. Increasing the SlabSize will reduce the number of allocations.

When the SlabSize is large, the risk of a SOW resize affecting performance is reduced. Because each slab is larger, however, more space is consumed if you store only a small number of messages. This cost amortizes once the number of messages in the SOW exceeds the number of cores in the system multiplied by the number of messages that fit into a slab.
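The break-even point described above is simple arithmetic. The following sketch computes it under a simplifying assumption that is not from the AMPS documentation: every message occupies exactly its average size in a slab, with no per-record overhead. The function names and the example figures are illustrative.

```python
def messages_per_slab(slab_size_bytes, avg_message_bytes):
    """Approximate number of average-sized messages that fit in one slab."""
    return slab_size_bytes // avg_message_bytes


def break_even_messages(cores, slab_size_bytes, avg_message_bytes):
    """Message count beyond which the preallocated slab space is amortized."""
    return cores * messages_per_slab(slab_size_bytes, avg_message_bytes)


# Example: 16 cores, 1 MiB slabs, 512-byte messages.
per_slab = messages_per_slab(1024 * 1024, 512)         # 2048
break_even = break_even_messages(16, 1024 * 1024, 512)  # 32768
```

With these example figures, a SOW topic expected to hold far fewer than about 32,768 messages would carry proportionally more unused preallocated space.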

To use space most efficiently, set a SlabSize that minimizes the amount of unused space in a slab. For example, if your messages average 512 bytes but can reach a maximum of 1.2 MB, one approach would be to set a SlabSize of 2.5MB to hold approximately 5 average-sized messages and two of the larger-sized messages. Looking at the actual distribution of message sizes in the SOW (which can be done with the amps_sow_dump utility) can help you determine how best to size slabs for maximum space efficiency.

For optimizing the SlabSize, determine how important each aspect of SOW tuning is for your application, and adjust the configuration to balance allocation frequency, overall SOW size, and space to meet the needs of your application.

Because AMPS is highly parallelized, it operates more efficiently when it can run tasks in parallel. When considering options for SlabSize, be sure that the value you choose will result in a number of slabs that is at least equal to the number of cores in the system. A SlabSize setting that results in only a few slabs could reduce query performance. For example, a system with a single publisher and a SlabSize large enough to hold all of the records produced by that publisher doesn’t allow a query to be parallelized, since all of the records will be in a single slab.
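A rough way to sanity-check a candidate SlabSize against this guideline is to estimate the resulting slab count and compare it with the core count. The sketch below is a crude estimate under stated assumptions: records are packed at their average size with no overhead, and slabs fill evenly. Actual slab allocation in AMPS may differ. All names and figures are illustrative.

```python
import math


def estimated_slab_count(total_records, slab_size_bytes, avg_message_bytes):
    """Estimate how many slabs are needed to hold total_records."""
    per_slab = max(1, slab_size_bytes // avg_message_bytes)
    return math.ceil(total_records / per_slab)


def allows_parallel_queries(total_records, slab_size_bytes, avg_message_bytes, cores):
    """True if the estimated slab count is at least the number of cores."""
    count = estimated_slab_count(total_records, slab_size_bytes, avg_message_bytes)
    return count >= cores


# Example: one million 512-byte records on a 16-core host.
with_1mib_slabs = estimated_slab_count(1_000_000, 1024 * 1024, 512)   # 489
with_1gib_slabs = estimated_slab_count(1_000_000, 1024 ** 3, 512)     # 1
```

In this example, the 1 GiB setting reproduces the single-slab case described above, where a query cannot be parallelized.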

Slow Clients

As described in Slow Client Management, AMPS provides capacity limits for slow clients to reduce the memory resources consumed by slow clients. This section discusses tuning slow client handling to achieve your availability goals.

Slow Client Offlining for Large Result Sets

The default settings for AMPS work well in a wide variety of applications with minimal tuning.

If you have particularly large SOW topics and your application is disconnecting clients due to exceeding the offlining threshold when the clients are retrieving large SOW query result sets, 60East recommends the following settings as a baseline for further tuning:

Parameter	Recommendation
`MessageMemoryLimit`	This controls the maximum memory consumed by AMPS for client messages. You can increase this parameter to allow AMPS to use more memory for records. Notice, however, that memory devoted to client messages is unavailable for other purposes. Recommended starting point for tuning large result sets: 10% of the system memory (for example, on a server with 128GB of memory, start with a 13GB limit). 60East recommends tuning the `MessageDiskLimit` first. If necessary, increase this parameter by 1-2% at a time. Use caution with settings over 20%: devoting large amounts of memory to client messages may cause swapping and reduce, rather than increase, overall performance.
`MessageDiskLimit`	The maximum amount of space to consume for offline messages. Recommended starting point for tuning large result sets: average record size * number of expected records * number of simultaneous clients, or `MessageMemoryLimit`, whichever is greater.
`MessageDiskPath`	The path in which to store offline message files. 60East recommends that the message disk path be hosted on fast, high-capacity storage such as a PCIe-attached flash drive. The available storage capacity of the disk must be greater than the configured `MessageDiskLimit`. Pay attention to the performance characteristics of the device: for example, some devices suffer reduced performance when they run low on free space, so for those devices you would want to make sure that there is space available on the device even when AMPS is close to the `MessageDiskLimit`.
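The starting points in the table can be computed directly from a few sizing inputs. The sketch below does only that arithmetic; the function name and the example workload figures are assumptions for illustration, and the results are a baseline for tuning rather than final values.

```python
def starting_limits(system_memory_bytes, avg_record_bytes,
                    expected_records, simultaneous_clients):
    """Return (memory_limit, disk_limit) starting points, in bytes."""
    # MessageMemoryLimit: start at 10% of system memory.
    memory_limit = system_memory_bytes // 10
    # MessageDiskLimit: record size * records * clients, or the memory
    # limit, whichever is greater.
    disk_estimate = avg_record_bytes * expected_records * simultaneous_clients
    disk_limit = max(disk_estimate, memory_limit)
    return memory_limit, disk_limit


GB = 10 ** 9
# Example: 128GB server, 1 KiB records, 50 million records, 4 clients.
memory_limit, disk_limit = starting_limits(128 * GB, 1024, 50_000_000, 4)
```

For the 128GB example, this yields 12.8GB for the memory limit, which the table rounds to 13GB, and 204.8GB for the disk limit. The storage behind `MessageDiskPath` would need more than that amount of free capacity.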

60East recommends that you use these settings as a baseline for further tuning, bearing in mind the needs and expected messaging patterns of your application.

WAN Traffic and Slow Client Settings

In some installations, a single AMPS instance will serve both applications that are local to the instance and applications that retrieve data over a higher-latency network. For example, applications in a small regional office may use a server in another region over a WAN.

In these situations, consider either adjusting the slow client settings so that those clients can complete operations such as large SOW queries successfully, or creating a separate transport with higher capacity settings that is used only by the small number of clients that require these settings due to network limitations. In particular, if you set a ClientMessageAgeLimit for an instance or transport, ensure that this limit is large enough that the network can consume the results of the SOW queries that clients are expected to make within the allotted time.
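A back-of-the-envelope estimate can help when choosing such a limit. The sketch below estimates the minimum time a link needs to carry a result set and compares it with a candidate age limit. It assumes the link bandwidth is fully available to one client and ignores protocol overhead; the safety factor, function names, and example figures are assumptions for illustration.

```python
def min_transfer_seconds(result_bytes, bandwidth_bits_per_s, round_trip_s=0.0):
    """Lower bound on the time needed to deliver result_bytes over the link."""
    return result_bytes * 8 / bandwidth_bits_per_s + round_trip_s


def age_limit_is_sufficient(age_limit_s, result_bytes, bandwidth_bits_per_s,
                            safety_factor=2.0):
    """True if the limit covers the estimated transfer time with headroom."""
    needed = safety_factor * min_transfer_seconds(result_bytes, bandwidth_bits_per_s)
    return age_limit_s >= needed


# Example: a 2 GB SOW query result over a 100 Mbit/s WAN link.
seconds = min_transfer_seconds(2_000_000_000, 100_000_000)  # 160.0
```

In this example, a 120-second limit would be too short for the transfer to complete, while a 600-second limit leaves headroom even after applying the safety factor.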

Minidump

AMPS includes the ability to generate a minidump file, which 60East support can use to help troubleshoot a problematic instance.

The minidump captures thread state information: a snapshot of where in the source code each thread is, the call stack for each thread, and the register information for each frame of the call stack. A minidump also contains basic information about the system that AMPS was running on, such as the processor type and number of sockets. Minidumps do not contain other internal state of AMPS or the contents of application memory. Minidumps do not contain detailed information about the host system, and have no information about the state of the host or operating system. Instead, minidumps identify the point of failure to help 60East quickly narrow down the issue without generating large files or potentially compromising sensitive data.

Minidumps can be produced much faster than a standard core dump, and use significantly less space since the minidump contains only a small subset of the information a core dump would contain (see the ulimit section in Linux OS Settings for more configuration options). Because minidumps are relatively inexpensive, the AMPS server may produce minidumps for temporary conditions that the server subsequently recovers from. AMPS also allows creation of a minidump on demand.

Generation of a minidump file occurs in the following ways:

When AMPS detects a crash internally, a minidump file will automatically be generated. This includes cases where an AMPS thread or critical internal component has not reported progress for an extended period of time (typically 300 seconds).
When a user clicks the minidump link under amps/instance/administrator in the administrator console (see the AMPS Monitoring Reference for more information).
By sending the running AMPS process the SIGQUIT signal.
In response to a configured action.

If a thread fails to report progress with the AMPS thread monitor for approximately 60 seconds, a minidump will automatically be generated. Send this minidump to 60East support for evaluation, along with a description of the operations taking place at the time (typically, logs at info level or a more verbose level).

By default, minidumps are written to /tmp, but this can be changed by setting MiniDumpDirectory in the AMPS configuration. 60East recommends monitoring the minidump directory.
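One simple way to monitor the directory is to periodically list files that have appeared since the last check. The sketch below does this by modification time. Because the naming of minidump files is not specified here, it reports any regular file in the directory, which works best when MiniDumpDirectory points at a dedicated directory rather than /tmp. The file names in the demonstration are hypothetical.

```python
import os
import tempfile


def files_newer_than(directory, since_epoch_s):
    """Return sorted paths of regular files modified after since_epoch_s."""
    found = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file() and entry.stat().st_mtime > since_epoch_s:
                found.append(entry.path)
    return sorted(found)


# Demonstration in a temporary directory with hypothetical file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name, mtime in (("old.dump", 1000), ("new.dump", 2000)):
        path = os.path.join(demo_dir, name)
        open(path, "w").close()
        os.utime(path, (mtime, mtime))  # set access and modification times
    recent = files_newer_than(demo_dir, 1500)
    recent_names = [os.path.basename(p) for p in recent]
```

A monitoring job could record the time of each check, pass it as `since_epoch_s` on the next run, and raise an alert whenever the returned list is not empty.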

If minidumps occur, contact 60East support for diagnosis and troubleshooting. Bear in mind that minidumps are often a symptom of a slowdown in the server due to resource constraints rather than an indication that the server has exited.

Once a minidump is submitted to 60East (and acknowledged as received), there is no further need to retain that minidump. 60East recommends removing minidumps when they are no longer needed.

Deployment and Upgrade Plan

60East offers a deployment checklist for use when planning or upgrading an installation of AMPS. The checklist covers recommendations for operations considerations such as:

Capacity Planning
Operating System Configuration
AMPS Configuration
Developing and Configuring Maintenance Plans
Creating a Monitoring Strategy
Creating a Patching and Upgrade Plan
Creating a Support Plan and Verifying the Support Process

The checklist may not cover all aspects of deployment in a particular environment but can be used to create a checklist and deployment plan for your environment.
