HashiCorp Vault Dynamic Secret Engines
In this post I will cover HashiCorp Vault dynamic secret engines: their operating mechanism, use-case scenarios, and some of the pitfalls you need to be aware of before you start using them at scale. It might look oversimplified at times, but my goal is to give a conceptual understanding and a good head start to folks who have never used HashiCorp Vault in general or dynamic secret engines in particular.
Dynamic Secret Engines: What do they do?
In a nutshell, dynamic secret engines provide a way to off-load credential management from third-party apps/services (e.g. databases, message brokers, cloud providers, etc.), giving you in exchange a generic mechanism for creating, rotating, and revoking credentials with configurable TTLs, RBAC, etc., all done by contacting a single API over HTTPS.
Dynamic Secret Engines: How do they do it?
You would configure the secret engine with “admin”-like credentials (I will call them “root” credentials) for a target app/service, giving it permissions to create, update, or delete users, passwords, tokens, etc. Then the end user would access Vault to obtain short-lived credentials with a given access scope. Vault would track the status of such credentials via the “lease” mechanism and revoke them (by sending the appropriate instructions to the target app/service) when the lease expires.
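To make the flow concrete, here is a hypothetical sketch using the PostgreSQL database secrets engine (the connection URL, the “root” username/password, and the role name are placeholders, not part of this post’s demo):
vault secrets enable database
# Configure the engine with the long-lived "root" credentials of the target
vault write database/config/demo-postgres \
  plugin_name=postgresql-database-plugin \
  allowed_roles="readonly" \
  connection_url="postgresql://{{username}}:{{password}}@127.0.0.1:5432/postgres?sslmode=disable" \
  username="vault-root" \
  password="vault-root-password"
# Define a role: what Vault should execute on the target to create a short-lived user
vault write database/roles/readonly \
  db_name=demo-postgres \
  creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
  default_ttl=1h max_ttl=24h
# The end user simply reads the role path and gets credentials plus a lease
vault read database/creds/readonly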
The “root” credentials are quite privileged and long-lived at the same time. However, once they are written to Vault, they can’t be extracted in any way, which makes a leak an unlikely event and gives you some peace of mind.
Dynamic Secret Engines: Why and how would you use them?
Dynamic secret engines give you a powerful mechanism to take control of secrets sprawl and put an end to long-lived credentials. At the same time, Vault’s audit trail will hold a complete record of all access attempts. After the secret engine is configured, you create one or more Roles (some secret engines call them RoleSets), each associated with a specific access scope on the target. In other words, a Role/RoleSet is Vault’s internal representation of an identity in the target app/service (such as a user, service account, etc.). The capabilities of a Role/RoleSet depend greatly on the secret engine implementation and the capabilities of the target app/service. By reading from (or sometimes writing to) the Role/RoleSet path you obtain credentials for such an identity.
End-user access is controlled by Vault policies. You grant a user or group access to the specific path where the secret engine’s Roles/RoleSets are mounted. The granularity of the access granted is limited only by the target service’s capabilities and your imagination.
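As an illustration, a minimal policy sketch could look like this (the policy name is made up, and the path matches the GCP RoleSet that appears later in this post):
vault policy write gcp-project-viewer - <<EOF
# Allow generating service account keys from the "project_viewer" RoleSet
path "gcp/key/project_viewer" {
  capabilities = ["create", "update"]
}
EOF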
Practice
Now let’s look closely at the Google Cloud secret engine in action. We’ll start, as always, with setting up a Vault server. To keep it simple, I’ll use the ‘file’ storage backend and no SSL/TLS. Config file first:
cat <<EOF > config.hcl
listener "tcp" {
  address     = "127.0.0.1:8200"
  tls_disable = "true"
}
storage "file" {
  path = "$(pwd)/data"
}
disable_mlock = "true"
api_addr = "http://127.0.0.1:8200"
EOF
And then start the server:
vault server -config ./config.hcl
In another terminal window, point the CLI at the server, initialize and unseal it (just one key share will do for this test), and grab the root token for further configuration:
export VAULT_ADDR="http://127.0.0.1:8200"
vault operator init -key-shares=1 -key-threshold=1 -format=json | tee init_info.json
vault operator unseal $(jq -r ".unseal_keys_hex[0]" init_info.json)
export VAULT_TOKEN="$(jq -r ".root_token" init_info.json)"
Now enable the secret engine:
vault secrets enable gcp
At this point we’ll need a “root” GCP Service Account key. The Service Account must have the following permissions:
iam.serviceAccounts.create
iam.serviceAccounts.delete
iam.serviceAccounts.get
iam.serviceAccounts.list
iam.serviceAccounts.update
iam.serviceAccountKeys.create
iam.serviceAccountKeys.delete
iam.serviceAccountKeys.get
iam.serviceAccountKeys.list
Plus an additional set of permissions following this pattern:
<service>.<resource>.getIamPolicy
<service>.<resource>.setIamPolicy
The first block gives the “root” Service Account permissions to create child Service Accounts and their keys. The second block allows it to grant them roles according to the RoleSet specification.
The <service>.<resource> part defines which GCP services and resources Vault will be able to grant access to. I will be granting predefined roles, though in real use cases you would probably go with a Custom Role.
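For completeness, a custom role with roughly this permission set could be created along these lines (the role ID and title are made up; extend the permission list with the <service>.<resource>.getIamPolicy/setIamPolicy permissions you actually need — here the project-level ones, since the demo below grants roles at project level):
gcloud iam roles create vaultDynamicSecrets \
  --project vault-dynamic-secrets \
  --title "Vault dynamic secrets root" \
  --permissions iam.serviceAccounts.create,iam.serviceAccounts.delete,iam.serviceAccounts.get,iam.serviceAccounts.list,iam.serviceAccounts.update,iam.serviceAccountKeys.create,iam.serviceAccountKeys.delete,iam.serviceAccountKeys.get,iam.serviceAccountKeys.list,resourcemanager.projects.getIamPolicy,resourcemanager.projects.setIamPolicy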
gcloud iam service-accounts create vault-dyn-engine
SA_EMAIL="$(gcloud iam service-accounts list --project vault-dynamic-secrets --format='value(email)' | grep vault-dyn-engine)"
gcloud iam service-accounts keys create credentials.json --iam-account=$SA_EMAIL --project vault-dynamic-secrets
PROJECT_ID="$(jq -r '.project_id' credentials.json)"
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member=serviceAccount:$SA_EMAIL \
--role=roles/resourcemanager.projectIamAdmin
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member=serviceAccount:$SA_EMAIL \
--role=roles/iam.serviceAccountAdmin
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member=serviceAccount:$SA_EMAIL \
--role=roles/iam.serviceAccountKeyAdmin
Back to the Vault configuration. First write the “root” credentials into the engine, then create a RoleSet with the Project Viewer role:
vault write gcp/config credentials=@credentials.json
vault write gcp/roleset/project_viewer \
project="vault-dynamic-secrets" \
secret_type="service_account_key" \
bindings=-<<EOF
resource "//cloudresourcemanager.googleapis.com/projects/vault-dynamic-secrets" {
roles = ["roles/viewer"]
}
EOF
After this step you should be able to see a new service account in GCP:

The IAM roles granted to this Service Account will match the ones you put in the RoleSet:
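If you prefer the CLI over the Cloud Console, a quick check along these lines (assuming gcloud and jq are pointed at the same project) shows the same thing:
# The new service account created by Vault for the RoleSet should show up here
gcloud iam service-accounts list --project vault-dynamic-secrets
# And the project IAM policy should contain a roles/viewer binding for it
gcloud projects get-iam-policy vault-dynamic-secrets --format=json | jq '.bindings[] | select(.role == "roles/viewer")'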
Now the secret engine is ready for use. At this point you would create policies and grant your Vault users write permission to the path gcp/key/project_viewer. I’ll omit this part and use the same root token I used for configuration:
vault write -format=json gcp/key/project_viewer ttl=5s > result.json
jq -r ".data.private_key_data" result.json | base64 -d > generated_key.json
echo "Lease duration: $(jq -r '.lease_duration' result.json)"
I use a deliberately short TTL of 5 seconds, but it shows that keys are created and then destroyed shortly after:

And to finish the demo let’s delete the RoleSet:
vault delete gcp/roleset/project_viewer

The moment the RoleSet is deleted, the corresponding Service Account in GCP is deleted too.
Pitfalls and workarounds
Thinking about the lease mechanism and the decoupled nature of the target app/service and the secrets manager, two corner cases inevitably come to mind:
What will happen if Vault crashes, reboots, or simply gets sealed? Typically you would address this by configuring Vault for HA. The OSS Vault allows only Active/Standby mode, where only one Vault instance serves user requests at any given time. Enterprise Vault adds the “Performance Standby Nodes” feature, which allows standby nodes to serve read requests, increasing the request throughput your cluster can handle. But even if all Vault nodes somehow go down at the same time, you will be OK. Because secret lifetime is controlled by Vault, and Vault needs to send a request to the target app/service to revoke a secret, revocation simply stops working while Vault is down or sealed. The good news is that the moment you unseal it, Vault will reassess all its leases and revoke all secrets associated with expired leases.
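For reference, an HA setup typically swaps the single-node “file” storage used in this demo for a replicated backend such as Vault’s integrated Raft storage; a minimal per-node stanza might look roughly like this, alongside appropriate api_addr/cluster_addr settings (paths, node IDs, and addresses are placeholders):
storage "raft" {
  path    = "/opt/vault/data"   # each node keeps its own copy of the data
  node_id = "vault-node-1"      # must be unique per node
  retry_join {
    leader_api_addr = "http://vault-node-2:8200"
  }
}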
What will happen if the target app/service is down, or connectivity is disrupted in any way? At first glance this scenario looks similar to the first one: Vault is aware of all expired leases and can retry the revocation once connectivity is restored. In practice there is one significant limitation: Vault makes a limited number of attempts to revoke the secret and then just stops trying. In the test I performed with the Google Cloud secrets engine (and quick research confirmed that this holds for other dynamic secret engines too), Vault made 6 retries, starting with a 40-second back-off timer and then gradually increasing it (+10 sec, +20 sec, +40 sec, etc). What that means is that if connectivity to your target app/service is not restored within ~13 minutes, the secret will hang there indefinitely.
This is sad news indeed, but there are things you can do to address it:
- You can manually revoke such secrets (or build automation around the Vault API and a logging service that captures Vault logs and can detect failed revocation events), as sketched after this list
- If your Vault gets sealed and then unsealed, all expired leases will be reassessed and revoked accordingly. If you are using manual unsealing this doesn’t make things much better, but if you use auto-unsealing you can potentially build a workflow with a daily seal/unseal routine. Not a bulletproof solution, but it is something at least.
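A minimal sketch of that manual clean-up (the lease ID is a placeholder; in practice you would take it from the audit log or from a sys/leases/lookup listing):
# Revoke a single lease by its ID
vault lease revoke gcp/key/project_viewer/<lease_id>
# Or revoke everything issued under a prefix
vault lease revoke -prefix gcp/key/project_viewer
# If the target is unreachable and you only want Vault to forget the lease,
# there is a force mode; the credentials then have to be cleaned up on the
# target side manually
vault lease revoke -force -prefix gcp/key/project_viewer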
Conclusion
Vault dynamic secret engines are an extremely powerful and handy tool that can make the life of your SecOps teams much easier. Though, as is typically the case, they are not perfect: there are pitfalls and edge cases. The “maximum revocation attempts” issue in Vault’s Git repository is still open. Let’s hope it’ll be addressed and resolved in the near future. Meanwhile, if you are planning to use dynamic secret engines, consider whether app/service downtime is a frequent event for you and plan your remediation procedures accordingly.