Lease Explosions
As your Vault environment scales to meet deployment needs, it is important to avoid over-subscription. A lease explosion can occur when operators reach over-subscription and clients create leases much faster than Vault is set to revoke them. If this continues unchecked, the active node can run out of memory. Once a lease explosion occurs, mitigation is time-consuming and resource-intensive.
This document shows you how to prevent lease explosions, mitigate when a lease explosion occurs, and clean up your environment after a lease explosion.
Applications and users can overwhelm system resources through consistent and high-volume API requests, resulting in denial-of-service issues in some Vault nodes or even the entire Vault cluster. Review Vault resource quotas to learn more about enabling rate-limit quotas and lease-count quotas to protect against requests which could trigger lease explosions.
These are common observations and behaviors operators experience as their Vault deployment matures:
TTL values for dynamic secret leases or authentication tokens could be too high, resulting in unused leases consuming storage space while waiting to expire.
Rapid lease count growth disproportionate to the number of clients is a sign of misconfiguration or potential anti-patterns in client usage.
Lease revocation is failing. This can be caused by failures in an external service in the case of dynamic secrets.
Valid credentials which have already been leased are not being reused when possible. For example, a badly behaving app requests new credentials from Vault every time it starts instead of caching and reusing the ones it previously requested. This leads to a buildup of leases associated with otherwise unused credentials.
The Vault server is not processing lease revocations as quickly as they're expiring. Usually, this is due to insufficient IOPS for the storage backend.
You can approach lease explosions in three phases:
Preventing lease explosions
Mitigating lease explosions
Cleaning up after lease explosions
Preventing lease explosions
Prevention is the best tool against lease explosion. The following are three important areas you can focus on to prevent lease explosion in your Vault environment.
Although no technical maximum exists, high lease counts can cause degradation in system performance. We recommend short default time-to-live (TTL) values on tokens and leases to avoid a large backlog of unexpired leases or many simultaneous expirations. Review Vault lease limits to learn more.
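As a rough illustration of why short TTLs matter, the steady-state number of unexpired leases is approximately the lease creation rate multiplied by the TTL (Little's law). The sketch below assumes a hypothetical workload of 10 new leases per second and compares Vault's default system TTL of 768 hours with a 1 hour TTL. The helper name `estimate_leases` and the workload figures are illustrative assumptions, not measurements.

```shell
# Rough steady-state estimate: outstanding leases ~= creation rate x TTL.
# estimate_leases is a hypothetical helper; it ignores renewals and early revocation.
estimate_leases() {
  rate_per_second=$1   # new leases created per second
  ttl_hours=$2         # lease TTL in hours
  echo $(( rate_per_second * ttl_hours * 3600 ))
}

estimate_leases 10 768   # 768h TTL: prints 27648000
estimate_leases 10 1     # 1h TTL: prints 36000
```

Under these assumptions, the same client behavior leaves roughly 768 times more leases outstanding with the longer TTL.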
Client best practices
Ensure clients using Vault adhere to best practices for their authentication and secret retrieval, and do not make excessive dynamic secrets requests or service token authentications. Review Lease Concepts and Auth Concepts to learn more.
You should avoid these client behavior anti-patterns:
Long TTLs, which lead to a slow build-up toward over-subscription.
Acute aberrant client behavior, which leads to rapid over-subscription.
A combination of both.
AppRole
As Vault matures in your environment, review client behavior around machine-based authentication and make sure it follows best practices, because machine-based authentication typically has more impact on lease explosion than human-based authentication.
Monitoring key metrics
Proactive monitoring is key to identifying behavior and usage patterns before they become problematic. Review the following resources for more details:
Vault key metrics
Vault anti-patterns poor metrics
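One way to watch lease growth is to track the lease count gauge from Vault telemetry. The following is a minimal sketch, assuming the Prometheus metrics format is enabled in your telemetry configuration, that `VAULT_ADDR` and `VAULT_TOKEN` are exported, and that the gauge is exposed as `vault_expire_num_leases`. The helper names `fetch_metrics` and `lease_count` are hypothetical.

```shell
# Fetch Prometheus-format metrics from Vault.
# Assumes VAULT_ADDR and VAULT_TOKEN are exported and Prometheus retention is configured.
fetch_metrics() {
  curl --silent --header "X-Vault-Token: $VAULT_TOKEN" \
    "$VAULT_ADDR/v1/sys/metrics?format=prometheus"
}

# Read metrics text on stdin and print the value of the lease count gauge.
# Comment lines (# HELP, # TYPE) are skipped because their first field is "#".
lease_count() {
  awk '$1 ~ /^vault_expire_num_leases/ { print $NF; exit }'
}
```

You could run `fetch_metrics | lease_count` on a schedule and alert when the value grows faster than the number of clients would explain.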
Implementation guardrails
You can choose the appropriate token type for your use case, and use resource quotas as guardrails against lease explosion in your implementation.
TTLs
System-wide maximum TTL and system-wide default TTL
TTL values which you specify in the Vault server configuration file. In terms of precedence, Vault uses these values last, after mount TTLs and highly granular TTLs.
Mount maximum TTL and mount default TTL
TTL values specified on a per mount instance of auth method or secrets engine. In terms of precedence, these TTL values override system-wide TTLs, but are overridden by highly granular TTLs.
Highly granular TTLs, for example: Database secrets engine role default TTL and Database secrets engine role maximum TTL
These TTLs are specified on a role, group, or user level, and their values override both mount and system-wide TTL values.
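The precedence order described above can be sketched as a small lookup in which the most specific value that is set wins. This is a simplified model for illustration only: `resolve_ttl` is a hypothetical helper, the TTL values are made up, and the model does not cover how maximum TTLs cap default TTLs.

```shell
# Simplified model of TTL precedence: role (highly granular) > mount > system-wide.
# resolve_ttl is a hypothetical helper; pass an empty string for a level that is unset.
resolve_ttl() {
  role_ttl=$1
  mount_ttl=$2
  system_ttl=$3
  if [ -n "$role_ttl" ]; then
    echo "$role_ttl"
  elif [ -n "$mount_ttl" ]; then
    echo "$mount_ttl"
  else
    echo "$system_ttl"
  fi
}

resolve_ttl "1h" "24h" "768h"   # role value set: prints 1h
resolve_ttl ""   "24h" "768h"   # no role value: prints 24h
resolve_ttl ""   ""    "768h"   # only the system-wide value: prints 768h
```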
More details are available in the Token Time-To-Live, periodic tokens, and explicit max TTLs and Lease limits documentation.
You should also review the details in the Vault anti-patterns guide: not adjusting the default lease time for a clear explanation of the issue and solution.
The following are examples for setting default and maximum TTL values using the Vault API and CLI, which you can reference when setting values for your implementation.
Adjusting TTL values is not a retroactive operation. It affects only those leases or tokens issued after you make the changes.
Update the default TTL to 8 hours and maximum TTL to 12 hours on a username and password auth method user named "alice". The value of $VAULT_TOKEN should be that of a token with capabilities to perform the operations.
When you make the update through the API, the request is not expected to produce output, but you can read the user to confirm the settings.
When Alice next authenticates with Vault and gets a token, its default TTL value is set to 28800 seconds (8 hours) and the maximum TTL value is 43200 seconds (12 hours).
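The update and confirmation steps for the user "alice" can be sketched as follows. This sketch assumes that `VAULT_ADDR` and `VAULT_TOKEN` are exported, that the username and password auth method is mounted at `auth/userpass`, and that the user already exists. The function names are hypothetical wrappers added for illustration.

```shell
# Assumes VAULT_ADDR and VAULT_TOKEN are exported and userpass is mounted at auth/userpass.

# CLI: set the default token TTL to 8 hours and the maximum token TTL to 12 hours.
set_alice_ttl_cli() {
  vault write auth/userpass/users/alice token_ttl=8h token_max_ttl=12h
}

# API: the same update as an HTTP request.
set_alice_ttl_api() {
  curl --silent --header "X-Vault-Token: $VAULT_TOKEN" \
    --request POST \
    --data '{"token_ttl": "8h", "token_max_ttl": "12h"}' \
    "$VAULT_ADDR/v1/auth/userpass/users/alice"
}

# Read the user back to confirm the new TTL values.
read_alice() {
  vault read auth/userpass/users/alice
}
```

Call either `set_alice_ttl_cli` or `set_alice_ttl_api`, then `read_alice` to check the stored values.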
Resource Quotas
You can use quotas to control Vault resource usage in the form of API rate limiting quotas and lease count quotas. For the purposes of this overview, lease count quotas are most relevant as you can cap the maximum number of leases generated on a per-mount basis.
Use this feature for use cases where a hard limit to the number of leases makes sense. Also, be sure to monitor Vault audit device logs where Vault emits messages about failures related to exceeding the quota.
The following examples demonstrate creating a lease count quota on an instance of the AppRole auth method, for the role named "webapp", to restrict leases to no more than 100. The value of $VAULT_TOKEN should be that of a token capable of performing the operations.
When you use the API, create a payload file containing the lease quota parameters.
Write the webapp-tokens lease count quota. The API request is not expected to produce output, but you can read the quota to confirm the settings.
Confirm the settings.
The limit is set to 100 leases for the AppRole auth method role named webapp.
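The quota steps above can be sketched as follows. This sketch assumes that your Vault edition supports lease count quotas, that `VAULT_ADDR` and `VAULT_TOKEN` are exported, and that AppRole is mounted at `auth/approle`. The file name `payload.json` and the function names are hypothetical choices for illustration.

```shell
# Assumes VAULT_ADDR and VAULT_TOKEN are exported and AppRole is mounted at auth/approle.

# Create the payload file containing the lease quota parameters.
write_quota_payload() {
  cat > payload.json <<'EOF'
{
  "path": "auth/approle",
  "role": "webapp",
  "max_leases": 100
}
EOF
}

# API: write the webapp-tokens lease count quota from the payload file.
create_quota_api() {
  curl --silent --header "X-Vault-Token: $VAULT_TOKEN" \
    --request POST \
    --data @payload.json \
    "$VAULT_ADDR/v1/sys/quotas/lease-count/webapp-tokens"
}

# CLI: the equivalent write, without a payload file.
create_quota_cli() {
  vault write sys/quotas/lease-count/webapp-tokens \
    max_leases=100 path="auth/approle" role="webapp"
}

# Read the quota back to confirm the settings.
read_quota() {
  vault read sys/quotas/lease-count/webapp-tokens
}
```

For the API path, call `write_quota_payload` and then `create_quota_api`. For the CLI path, call `create_quota_cli`. In both cases, `read_quota` shows the stored limit.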
Enabling rate limit audit logging may affect Vault performance if the volume of rejected requests is large.
Token type
In some use cases, batch tokens can be a better fit than service tokens with respect to lease explosion. Review the following resources for help deciding when to use batch tokens and when to use service tokens:
Vault service tokens vs batch tokens
Service vs batch token lease handling
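If you decide that batch tokens fit a machine-based workload, one way to apply that choice is on the auth method role. The following is a minimal sketch, assuming AppRole is mounted at `auth/approle` and reusing the role name "webapp" from the quota example. The 20 minute TTL and the function names are illustrative assumptions.

```shell
# Assumes VAULT_ADDR and VAULT_TOKEN are exported and AppRole is mounted at auth/approle.

# Configure the webapp role to issue batch tokens with a short TTL (20m is illustrative).
use_batch_tokens() {
  vault write auth/approle/role/webapp token_type=batch token_ttl=20m
}

# Read the role back to confirm the token type.
show_role() {
  vault read auth/approle/role/webapp
}
```

Batch tokens cannot be renewed, so confirm that the clients using this role can re-authenticate when the token expires.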
Mitigating lease explosions
Ultimately, the number of leases a system can handle is unique to the Vault deployment and environment.
Increase resources
Increasing available resources in your Vault cluster can help mitigate lease explosion and allow for cluster recovery. Review hardware sizing, and focus on increasing available RAM.
Within Vault
Use the information from the Implementation guardrails section to adjust TTL values from the default values according to your use case needs.
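During mitigation, lowering TTLs at the mount level is one way to reduce the lifetime of newly issued leases across every role on a mount. The following is a minimal sketch using the `vault auth tune` and `vault secrets tune` commands. The function names, mount paths, and TTL values in the usage note are hypothetical, and the change applies only to leases and tokens issued afterward.

```shell
# Assumes VAULT_ADDR and VAULT_TOKEN are exported.
# Arguments for both helpers: mount path, default TTL, maximum TTL.

# Tune the default and maximum lease TTL on an auth method mount.
tune_auth_ttl() {
  vault auth tune -default-lease-ttl="$2" -max-lease-ttl="$3" "$1"
}

# Tune the default and maximum lease TTL on a secrets engine mount.
tune_secrets_ttl() {
  vault secrets tune -default-lease-ttl="$2" -max-lease-ttl="$3" "$1"
}
```

For example, `tune_auth_ttl approle/ 1h 4h` sets a 1 hour default and 4 hour maximum on an AppRole mount, and `tune_secrets_ttl database/ 1h 4h` does the same for a database secrets engine mount.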
External to Vault
You can use firewalls or load balancers to limit API calls to Vault from aberrant clients.
Review the knowledge base article on load balancing Vault to learn more.
Cleaning up environment after lease explosions
Once the acute event subsides, the Vault active node continues to purge leases. Sometimes the explosion is so large that you need to intervene manually to revoke leases. If you are running a version of Vault prior to 1.13.0, this lease revocation can cause further performance degradation.
Revoking or forcefully revoking leases is potentially a dangerous operation. You should ensure that you have recent valid snapshots of the cluster. Users of Vault versions prior to 1.13.0 on integrated storage must also perform freelist compaction.
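A manual cleanup could follow the order sketched below: take a snapshot, inspect the leases under a prefix, then revoke them. This sketch assumes that `VAULT_ADDR` and `VAULT_TOKEN` are exported, that the token has sufficient capabilities on the `sys/leases` paths, and that the cluster uses integrated storage for the snapshot step. The function names are hypothetical, and the prefix in the usage note is an example only.

```shell
# Assumes VAULT_ADDR and VAULT_TOKEN are exported and the cluster uses integrated storage.

# Take a snapshot before any bulk revocation. Argument: output file name.
snapshot_cluster() {
  vault operator raft snapshot save "$1"
}

# List the lease IDs under a prefix to gauge how many leases would be revoked.
list_leases() {
  vault list "sys/leases/lookup/$1"
}

# Revoke every lease under a prefix.
revoke_prefix() {
  vault lease revoke -prefix "$1"
}

# Last resort: remove the leases from Vault even if revocation in the
# external system fails, which can leave credentials behind in that system.
force_revoke_prefix() {
  vault lease revoke -force -prefix "$1"
}
```

For example, run `snapshot_cluster backup.snap`, then `list_leases database/creds/readonly`, then `revoke_prefix database/creds/readonly`, where `database/creds/readonly` is a hypothetical dynamic secrets path. Reserve `force_revoke_prefix` for cases where normal revocation keeps failing.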