# Tenant auto creation

## Tenant Onboarding Automation - Swimlane Flow Diagram

### Issue Reference: enclaive/emcp-be#2279

***

### Swimlane: Full Tenant Provisioning Flow

```mermaid
sequenceDiagram
    autonumber
    actor Admin as Super Admin
    participant UI as admin-ui<br/>(React)
    participant BE as emcp-be<br/>(Orchestrator)
    participant DB as MongoDB
    participant RS as Redis Streams
    participant KC as Keycloak
    participant VLT as Vault
    participant WR as workload-runner<br/>(NestJS)
    participant CF as Cloudflare
    participant GH as GitHub Actions
    participant ARGO as ArgoCD

    %% ──────────────────────────────────────────────
    rect rgb(240, 248, 255)
    Note over Admin,ARGO: PHASE 1 ── Request & Validation
    %% ──────────────────────────────────────────────

    Admin->>UI: Fills "Create Tenant" form<br/>(subdomain, admin email, plan)
    UI->>UI: Client-side validation<br/>- subdomain format (lowercase, no special chars)<br/>- email format<br/>- reserved word check (api, admin, www, mail...)

    UI->>BE: POST /api/admin/tenants<br/>{subdomain, adminEmail, plan, createdBy}

    BE->>BE: Server-side validation<br/>- sanitize input<br/>- verify caller is super-admin

    BE->>DB: Query: does subdomain already exist?
    DB-->>BE: No match found - subdomain available

    BE->>BE: Generate env-specific subdomains<br/>- {tenant}.enclaive.cloud (prod)<br/>- {tenant}.staging.enclaive.cloud<br/>- {tenant}.dev.enclaive.cloud

    BE->>DB: INSERT tenant record<br/>{id, subdomain, status: PROVISIONING,<br/>createdBy: adminId, createdAt: now(),<br/>steps: {keycloak: pending, vault: pending,<br/>dns: pending, ingress: pending,<br/>deployment: pending}}
    DB-->>BE: Tenant doc created (tenantId)

    BE->>DB: INSERT audit log<br/>{action: TENANT_CREATE, actor: adminId,<br/>timestamp, previousState: null, newState: PROVISIONING}

    BE-->>UI: 202 Accepted<br/>{tenantId, status: PROVISIONING}
    UI->>Admin: Show provisioning progress tracker<br/>(step indicators with live updates)
    end

    %% ──────────────────────────────────────────────
    rect rgb(255, 248, 240)
    Note over Admin,ARGO: PHASE 2 ── Keycloak Realm Provisioning
    %% ──────────────────────────────────────────────

    BE->>DB: UPDATE tenant.steps.keycloak = "in_progress"
    BE-->>UI: SSE push: {step: keycloak, status: in_progress}

    BE->>BE: Render realm JSON from Handlebars template<br/>- inject: subdomain, adminEmail, clientURLs<br/>- template includes: theme, token config, scopes

    BE->>KC: POST /admin/realms<br/>Create realm via keycloak-config-cli JAR
    KC-->>BE: 201 Realm "{tenant}" created

    BE->>KC: Configure realm settings<br/>- Login theme: enclaive-theme<br/>- Email provider: SMTP/SendGrid<br/>- Token lifespans (access: 5m, refresh: 30m, SSO: 8h)<br/>- Password policy: 8 chars, mixed case, special
    KC-->>BE: Realm settings applied

    BE->>KC: Create client scopes<br/>- openid, profile, email, roles
    KC-->>BE: Scopes created

    BE->>KC: Register OIDC clients<br/>- console-ui (public, PKCE)<br/>- admin-ui (public, PKCE)<br/>- api (confidential, client_credentials)
    KC-->>BE: Clients registered + client secrets returned

    BE->>KC: POST /admin/realms/{tenant}/users<br/>Create admin user (email, temp password,<br/>requiredActions: [UPDATE_PASSWORD])
    KC-->>BE: Admin user created (userId)

    BE->>KC: Assign roles to admin user<br/>- realm-admin (realm-management client)<br/>- view-users, manage-users, manage-clients
    KC-->>BE: Roles assigned

    BE->>DB: UPDATE tenant<br/>steps.keycloak = "completed"<br/>keycloakRealmId, oidcClientIds, adminUserId
    BE-->>UI: SSE push: {step: keycloak, status: completed}
    UI->>Admin: Keycloak step turns green
    end

    %% ──────────────────────────────────────────────
    rect rgb(240, 255, 240)
    Note over Admin,ARGO: PHASE 3 ── Vault Namespace & Secrets
    %% ──────────────────────────────────────────────

    BE->>DB: UPDATE tenant.steps.vault = "in_progress"
    BE-->>UI: SSE push: {step: vault, status: in_progress}

    BE->>VLT: PUT /v1/sys/namespaces/{tenant}<br/>Create isolated namespace
    VLT-->>BE: Namespace created

    BE->>VLT: POST /v1/{tenant}/sys/mounts/identity<br/>Enable identity engine
    VLT-->>BE: Identity engine mounted

    BE->>VLT: POST /v1/{tenant}/sys/mounts/secret<br/>Enable KV-v2 secrets engine
    VLT-->>BE: KV-v2 engine mounted at secret/

    BE->>VLT: Write secrets batch<br/>- secret/db → MongoDB connection string<br/>- secret/oidc → client secrets from Keycloak<br/>- secret/api-keys → Stripe key, Cloudflare token<br/>- secret/encryption → AES-256 key for data at rest
    VLT-->>BE: All secrets written

    BE->>VLT: PUT /v1/{tenant}/sys/policies/acl/tenant-policy<br/>Create access policy (path-based ACL)<br/>- allow read on secret/data/* (KV-v2 data path)<br/>- deny sys/*
    VLT-->>BE: Policy created

    BE->>VLT: POST /v1/{tenant}/auth/kubernetes/role/{tenant}-role<br/>Create K8s auth role<br/>- bound service account: {tenant}-sa<br/>- bound namespace: emcp-{env}<br/>- policies: tenant-policy<br/>- TTL: 1h
    VLT-->>BE: K8s auth role created

    BE->>VLT: Generate Vault Agent sidecar config<br/>- pod annotation template for injection<br/>- auto-auth via K8s service account
    VLT-->>BE: Agent config stored

    BE->>DB: UPDATE tenant<br/>steps.vault = "completed"<br/>vaultNamespace: "{tenant}"
    BE-->>UI: SSE push: {step: vault, status: completed}
    UI->>Admin: Vault step turns green
    end

    %% ──────────────────────────────────────────────
    rect rgb(255, 255, 235)
    Note over Admin,ARGO: PHASE 4 ── Cloudflare DNS Records (via workload-runner)
    %% ──────────────────────────────────────────────

    BE->>DB: UPDATE tenant.steps.dns = "in_progress"
    BE-->>UI: SSE push: {step: dns, status: in_progress}

    BE->>RS: XADD stream:dns-provisioning<br/>{tenantId, subdomain, environments: [prod, staging, dev],<br/>zoneId, recordType: CNAME, target: ingress-lb.enclaive.cloud}

    RS-->>WR: Consumer picks up DNS task

    WR->>CF: POST /zones/{zoneId}/dns_records<br/>{type: CNAME, name: "{tenant}.enclaive.cloud",<br/>content: "ingress-lb.enclaive.cloud", proxied: true}
    CF-->>WR: Record created (prod)

    WR->>CF: POST /zones/{zoneId}/dns_records<br/>{type: CNAME, name: "{tenant}.staging.enclaive.cloud",<br/>content: "ingress-lb.enclaive.cloud", proxied: true}
    CF-->>WR: Record created (staging)

    WR->>CF: POST /zones/{zoneId}/dns_records<br/>{type: CNAME, name: "{tenant}.dev.enclaive.cloud",<br/>content: "ingress-lb.enclaive.cloud", proxied: true}
    CF-->>WR: Record created (dev)

    %% ── RETRY LOGIC ──
    rect rgb(255, 230, 230)
    Note over WR,CF: RETRY LOGIC (Cloudflare is external ── most failure-prone)
    alt CF returns 429 (rate limit) or 5xx (server error)
        WR->>WR: Retry with exponential backoff<br/>Attempt 1: wait 2s<br/>Attempt 2: wait 4s<br/>Attempt 3: wait 8s
        WR->>CF: Retry failed DNS record creation
        CF-->>WR: Success on retry
    else CF returns 409 (record exists)
        WR->>WR: Idempotent ── skip, treat as success<br/>(record already provisioned from prior attempt)
    else All 3 retries exhausted
        WR->>RS: XADD stream:dns-result<br/>{tenantId, status: FAILED,<br/>error: "CF API unreachable after 3 retries",<br/>failedRecords: [...], succeededRecords: [...]}
        RS-->>BE: DNS failure event
        BE->>DB: UPDATE tenant<br/>steps.dns = "failed", lastError: "..."
        BE-->>UI: SSE push: {step: dns, status: failed, error: "..."}
        UI->>Admin: DNS step turns red<br/>Show "Retry" button
        Admin->>UI: Clicks "Retry DNS"
        UI->>BE: POST /api/admin/tenants/{id}/retry?step=dns
        Note over BE,CF: Restarts from Phase 4<br/>Only retries failed records (idempotent)
    end
    end

    WR->>RS: XADD stream:dns-result<br/>{tenantId, status: SUCCESS,<br/>records: [prod, staging, dev]}
    RS-->>BE: DNS success event

    BE->>DB: UPDATE tenant<br/>steps.dns = "completed"<br/>dnsRecords: [{env, recordId, fqdn}]
    BE-->>UI: SSE push: {step: dns, status: completed}
    UI->>Admin: DNS step turns green
    end

    %% ──────────────────────────────────────────────
    rect rgb(245, 240, 255)
    Note over Admin,ARGO: PHASE 5 ── GitHub Action: Ingress Generation
    %% ──────────────────────────────────────────────

    BE->>DB: UPDATE tenant.steps.ingress = "in_progress"
    BE-->>UI: SSE push: {step: ingress, status: in_progress}

    BE->>GH: POST /repos/enclaive/enclaive-deployment/actions/workflows/tenant-ingress.yml/dispatches<br/>{ref: "main", inputs: {<br/>  subdomain: "{tenant}",<br/>  environments: "prod,staging,dev",<br/>  ingressClass: "nginx",<br/>  tlsSecret: "{tenant}-tls",<br/>  certIssuer: "letsencrypt-prod"<br/>}}

    GH->>GH: Checkout enclaive-deployment
    GH->>GH: Create branch: tenant/{tenant}-ingress

    GH->>GH: Modify Helm values.yaml for each env<br/>- 203-emcp-prod/backend/values.yaml<br/>- 202-emcp-staging/backend/values.yaml<br/>- 201-emcp-dev/backend/values.yaml<br/>Add ingress entries:<br/>  hosts:<br/>    - host: {tenant}.enclaive.cloud<br/>      paths: [/]<br/>    - host: admin.{tenant}.enclaive.cloud<br/>      paths: [/]<br/>    - host: console.{tenant}.enclaive.cloud<br/>      paths: [/]<br/>  tls:<br/>    - secretName: {tenant}-tls<br/>      hosts: [{tenant}.enclaive.cloud, *.{tenant}.enclaive.cloud]

    GH->>GH: Commit + push branch
    GH->>GH: Open PR: "feat: add ingress for tenant {tenant}"

    GH->>GH: Auto-approve workflow triggers<br/>- Validate: only ingress fields changed<br/>- Validate: no unexpected file modifications<br/>- If validation passes: approve + merge

    alt Auto-approve validation fails
        GH-->>BE: Webhook: PR flagged for manual review
        BE->>DB: UPDATE tenant<br/>steps.ingress = "needs_review"
        BE-->>UI: SSE push: {step: ingress, status: needs_review,<br/>message: "PR needs manual approval"}
        UI->>Admin: Show warning + link to PR
    else Validation passes
        GH->>GH: Squash merge PR to main
        GH->>GH: Delete branch tenant/{tenant}-ingress
        GH-->>BE: Webhook: PR merged (merge_commit_sha)
        BE->>DB: UPDATE tenant<br/>steps.ingress = "completed"<br/>ingressPR: {number, mergeCommit}
        BE-->>UI: SSE push: {step: ingress, status: completed}
        UI->>Admin: Ingress step turns green
    end
    end

    %% ──────────────────────────────────────────────
    rect rgb(240, 250, 255)
    Note over Admin,ARGO: PHASE 6 ── ArgoCD Deployment & TLS
    %% ──────────────────────────────────────────────

    BE->>DB: UPDATE tenant.steps.deployment = "in_progress"
    BE-->>UI: SSE push: {step: deployment, status: in_progress}

    ARGO->>ARGO: Detect Git change on main branch<br/>(poll interval or webhook trigger)
    ARGO->>ARGO: Diff: new ingress entries detected
    ARGO->>ARGO: Auto-sync: apply Helm chart changes<br/>to prod/staging/dev namespaces

    ARGO->>ARGO: K8s applies new Ingress resources
    Note over ARGO: cert-manager detects new Ingress<br/>with tls annotation → provisions<br/>Let's Encrypt certificates<br/>(ACME DNS-01 challenge via Cloudflare,<br/>required for the wildcard host)

    ARGO-->>BE: Webhook: sync status = Synced<br/>(or BE polls ArgoCD API for sync status)

    %% ── HEALTH CHECK WITH RETRY ──
    rect rgb(230, 255, 230)
    Note over BE,ARGO: HEALTH CHECK POLLING (with retry + backoff)

    BE->>BE: Wait 30s for cert-manager + DNS propagation

    loop Poll every 15s ── max 20 attempts (5 min timeout)
        BE->>BE: GET https://{tenant}.enclaive.cloud/health
        BE->>BE: GET https://admin.{tenant}.enclaive.cloud/health
        BE->>BE: GET https://console.{tenant}.enclaive.cloud/health

        alt All 3 endpoints return 200
            BE->>BE: Health check PASSED
        else Any endpoint not ready
            BE->>BE: Increment attempt counter<br/>Log: "Attempt {n}/20 - waiting for endpoints"
            BE-->>UI: SSE push: {step: deployment,<br/>status: in_progress, attempt: n,<br/>detail: "Waiting for endpoints..."}
        end
    end

    alt Health check passed within timeout
        BE->>DB: UPDATE tenant<br/>steps.deployment = "completed"
        BE-->>UI: SSE push: {step: deployment, status: completed}
        UI->>Admin: Deployment step turns green
    else Timeout after 5 min
        BE->>DB: UPDATE tenant<br/>steps.deployment = "timeout"<br/>lastError: "Health check timeout after 20 attempts"
        BE-->>UI: SSE push: {step: deployment, status: timeout}
        UI->>Admin: Deployment step shows warning<br/>"Infrastructure deployed but endpoints not responding"<br/>Show "Retry Health Check" button
    end
    end
    end

    %% ──────────────────────────────────────────────
    rect rgb(240, 255, 245)
    Note over Admin,ARGO: PHASE 7 ── Finalization & Activation
    %% ──────────────────────────────────────────────

    BE->>DB: Create payment/billing settings<br/>{tenantId, plan, billingCycle: monthly,<br/>stripeCustomerId: null (created on first payment)}
    DB-->>BE: Billing record created

    BE->>DB: Sync default theme settings<br/>{tenantId, logo: default, primaryColor: "#1976d2",<br/>favicon: default, loginBackground: default}
    DB-->>BE: Theme synced

    BE->>DB: Sync default tenant settings<br/>{tenantId, features: [...defaultFeatures],<br/>limits: {maxUsers: planLimit, maxProjects: planLimit},<br/>notifications: {email: true, slack: false}}
    DB-->>BE: Settings synced

    BE->>RS: XADD stream:tenant-events<br/>{type: TENANT_CREATED, tenantId, subdomain,<br/>timestamp, createdBy: adminId}
    Note over RS: finops-service consumes this<br/>for billing initialization +<br/>welcome email via SendGrid

    BE->>DB: UPDATE tenant<br/>status: ACTIVE<br/>activatedAt: now()

    BE->>DB: INSERT audit log<br/>{action: TENANT_ACTIVATED, actor: system,<br/>timestamp, previousState: PROVISIONING,<br/>newState: ACTIVE, duration: totalMs}

    BE-->>UI: SSE push: {status: ACTIVE,<br/>urls: {<br/>  console: "https://console.{tenant}.enclaive.cloud",<br/>  admin: "https://admin.{tenant}.enclaive.cloud",<br/>  api: "https://{tenant}.enclaive.cloud/api"<br/>},<br/>credentials: {<br/>  adminEmail: "...",<br/>  tempPassword: "sent via email"<br/>}}

    UI->>Admin: All steps green!<br/>Show tenant URLs + "Copy credentials" button
    end
```
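
Phase 1's validation, subdomain generation, and initial tenant record can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the emcp-be implementation: the function names, the regular expression, and the 63-character limit are hypothetical (the diagram only says "lowercase, no special chars"), and the reserved-word set contains only the four words the diagram names, since its list is left open-ended.

```python
import re

# Reserved words named in the diagram. The source list is open-ended,
# so this set is illustrative and not complete.
RESERVED = {"api", "admin", "www", "mail"}

# Assumed format: lowercase letters, digits, and inner hyphens, 1-63 chars
# (the DNS label length limit).
SUBDOMAIN_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")


def validate_subdomain(subdomain: str) -> list[str]:
    """Return validation errors; an empty list means the subdomain is acceptable."""
    errors = []
    if not SUBDOMAIN_RE.fullmatch(subdomain):
        errors.append("invalid format")
    if subdomain in RESERVED:
        errors.append("reserved word")
    return errors


def env_fqdns(subdomain: str) -> dict[str, str]:
    """Build the per-environment hostnames listed in Phase 1."""
    return {
        "prod": f"{subdomain}.enclaive.cloud",
        "staging": f"{subdomain}.staging.enclaive.cloud",
        "dev": f"{subdomain}.dev.enclaive.cloud",
    }


def new_tenant_record(subdomain: str, admin_id: str) -> dict:
    """Initial tenant document: every provisioning step starts as pending."""
    steps = ["keycloak", "vault", "dns", "ingress", "deployment"]
    return {
        "subdomain": subdomain,
        "status": "PROVISIONING",
        "createdBy": admin_id,
        "steps": {step: "pending" for step in steps},
    }
```

Running the same checks on the server as on the client keeps the `POST /api/admin/tenants` endpoint safe against callers that bypass the form.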

***

### State Machine: Tenant Provisioning Lifecycle

```mermaid
stateDiagram-v2
    [*] --> VALIDATING: Admin submits form

    VALIDATING --> PROVISIONING: Validation OK ── record created in DB
    VALIDATING --> REJECTED: Subdomain taken / invalid input / not authorized

    state PROVISIONING {
        [*] --> KEYCLOAK_IN_PROGRESS

        KEYCLOAK_IN_PROGRESS --> KEYCLOAK_COMPLETED: Realm + user + roles + clients done
        KEYCLOAK_IN_PROGRESS --> STEP_FAILED: Keycloak API error

        KEYCLOAK_COMPLETED --> VAULT_IN_PROGRESS

        VAULT_IN_PROGRESS --> VAULT_COMPLETED: Namespace + secrets + policies done
        VAULT_IN_PROGRESS --> STEP_FAILED: Vault API error

        VAULT_COMPLETED --> DNS_IN_PROGRESS

        DNS_IN_PROGRESS --> DNS_COMPLETED: All CNAME records created
        DNS_IN_PROGRESS --> DNS_RETRYING: CF 429/5xx ── exponential backoff
        DNS_RETRYING --> DNS_IN_PROGRESS: Retry attempt
        DNS_RETRYING --> STEP_FAILED: Max retries (3) exhausted
        DNS_IN_PROGRESS --> STEP_FAILED: CF error (non-retryable)

        DNS_COMPLETED --> INGRESS_IN_PROGRESS

        INGRESS_IN_PROGRESS --> INGRESS_COMPLETED: PR auto-approved + merged
        INGRESS_IN_PROGRESS --> INGRESS_NEEDS_REVIEW: PR flagged (unexpected changes)
        INGRESS_NEEDS_REVIEW --> INGRESS_COMPLETED: Manual approval + merge
        INGRESS_IN_PROGRESS --> STEP_FAILED: GitHub API error

        INGRESS_COMPLETED --> DEPLOYMENT_IN_PROGRESS

        DEPLOYMENT_IN_PROGRESS --> HEALTH_CHECKING: ArgoCD sync completed
        DEPLOYMENT_IN_PROGRESS --> STEP_FAILED: ArgoCD sync failed

        HEALTH_CHECKING --> DEPLOYMENT_COMPLETED: All endpoints 200 OK
        HEALTH_CHECKING --> DEPLOYMENT_TIMEOUT: 5 min timeout exceeded
        DEPLOYMENT_TIMEOUT --> HEALTH_CHECKING: Admin clicks retry

        DEPLOYMENT_COMPLETED --> FINALIZING

        FINALIZING --> [*]: Billing + theme + settings + events done
    }

    PROVISIONING --> ACTIVE: All steps completed

    STEP_FAILED --> PROVISIONING: Admin clicks "Retry"<br/>(resumes from failed step)
    STEP_FAILED --> CANCELLED: Admin cancels

    REJECTED --> [*]
    CANCELLED --> [*]
    ACTIVE --> [*]
```
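
The step-level transitions inside `PROVISIONING` can be enforced with a simple lookup table. The sketch below copies the edges from the state diagram. The names `TRANSITIONS`, `is_allowed`, `advance`, and `InvalidTransition` are hypothetical and only illustrate the idea. It does not cover the admin-driven retry from `STEP_FAILED`, which would need the failed step to be stored separately.

```python
# Allowed transitions inside PROVISIONING, copied from the state diagram.
TRANSITIONS = {
    "KEYCLOAK_IN_PROGRESS": {"KEYCLOAK_COMPLETED", "STEP_FAILED"},
    "KEYCLOAK_COMPLETED": {"VAULT_IN_PROGRESS"},
    "VAULT_IN_PROGRESS": {"VAULT_COMPLETED", "STEP_FAILED"},
    "VAULT_COMPLETED": {"DNS_IN_PROGRESS"},
    "DNS_IN_PROGRESS": {"DNS_COMPLETED", "DNS_RETRYING", "STEP_FAILED"},
    "DNS_RETRYING": {"DNS_IN_PROGRESS", "STEP_FAILED"},
    "DNS_COMPLETED": {"INGRESS_IN_PROGRESS"},
    "INGRESS_IN_PROGRESS": {
        "INGRESS_COMPLETED", "INGRESS_NEEDS_REVIEW", "STEP_FAILED",
    },
    "INGRESS_NEEDS_REVIEW": {"INGRESS_COMPLETED"},
    "INGRESS_COMPLETED": {"DEPLOYMENT_IN_PROGRESS"},
    "DEPLOYMENT_IN_PROGRESS": {"HEALTH_CHECKING", "STEP_FAILED"},
    "HEALTH_CHECKING": {"DEPLOYMENT_COMPLETED", "DEPLOYMENT_TIMEOUT"},
    "DEPLOYMENT_TIMEOUT": {"HEALTH_CHECKING"},
    "DEPLOYMENT_COMPLETED": {"FINALIZING"},
}


class InvalidTransition(Exception):
    """Raised when a step tries to move along an edge the diagram does not define."""


def is_allowed(current: str, target: str) -> bool:
    """True when the diagram has an edge from current to target."""
    return target in TRANSITIONS.get(current, set())


def advance(current: str, target: str) -> str:
    """Return the new state, or raise if the transition is not in the diagram."""
    if not is_allowed(current, target):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target
```

Rejecting undefined edges at write time keeps the tenant record consistent even if two workers report on the same step out of order.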

***

### Service Responsibility Matrix

```mermaid
flowchart TB
    subgraph UI["admin-ui"]
        direction TB
        U1["Form + client validation"]
        U2["Progress tracker (SSE)"]
        U3["Retry buttons per step"]
        U4["Success: URLs + credentials"]
    end

    subgraph BE["emcp-be ── The Brain"]
        direction TB
        B1["API endpoint + auth guard"]
        B2["Subdomain validation"]
        B3["Keycloak provisioning service"]
        B4["Vault provisioning service"]
        B5["DNS dispatch (→ Redis Streams)"]
        B6["GitHub API client"]
        B7["Health check poller"]
        B8["Status manager (DB + SSE)"]
        B9["Audit logger"]
        B10["Billing/theme/settings init"]
    end

    subgraph WR["workload-runner"]
        direction TB
        W1["Redis Streams consumer"]
        W2["Cloudflare DNS module"]
        W3["Retry engine (exp. backoff)"]
    end

    subgraph EXT["External Systems"]
        direction TB
        E1["Keycloak 26"]
        E2["Vault (vhsm.vshopping.enclaive.cloud)"]
        E3["Cloudflare API"]
        E4["GitHub Actions (enclaive-deployment)"]
        E5["ArgoCD"]
        E6["cert-manager + Let's Encrypt"]
    end

    U1 --> B1
    B3 --> E1
    B4 --> E2
    B5 --> W1 --> W2 --> E3
    B6 --> E4 --> E5 --> E6
    B7 -.->|polls| E5
    B8 --> U2
```
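
The status manager (B8) pushes step updates to the progress tracker (U2) over Server-Sent Events. The sketch below shows how one such update could be framed on the wire, assuming JSON payloads shaped like those in the sequence diagram. The helper `sse_frame` and the optional event name are illustrative and not taken from emcp-be.

```python
import json


def sse_frame(payload: dict, event: str = "") -> str:
    """Serialize one Server-Sent Events frame; a blank line terminates the event."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    # json.dumps escapes newlines inside strings, so a single data line suffices.
    lines.append(f"data: {json.dumps(payload, separators=(',', ':'))}")
    return "\n".join(lines) + "\n\n"


# Example: the update sent when Phase 2 starts.
frame = sse_frame({"step": "keycloak", "status": "in_progress"})
```

On the client side, a browser `EventSource` would receive the JSON string in the event's `data` field and could update the matching step indicator.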

***

### Retry Strategy Summary

| Phase       | Retry Target   | Trigger                  | Strategy                | Max Attempts | Backoff          | On Exhaustion                    |
| ----------- | -------------- | ------------------------ | ----------------------- | :----------: | ---------------- | -------------------------------- |
| **Phase 2** | Keycloak API   | Connection timeout / 5xx | Simple retry            |       2      | 3s fixed         | STEP\_FAILED ── admin notified   |
| **Phase 3** | Vault API      | Connection timeout / 5xx | Simple retry            |       2      | 3s fixed         | STEP\_FAILED ── admin notified   |
| **Phase 4** | Cloudflare API | 429 rate limit / 5xx     | **Exponential backoff** |     **3**    | **2s → 4s → 8s** | STEP\_FAILED ── admin can retry  |
| **Phase 4** | Cloudflare API | 409 conflict (exists)    | Skip (idempotent)       |       -      | -                | Treat as success                 |
| **Phase 5** | GitHub API     | 403 / 5xx                | Simple retry            |       2      | 5s fixed         | STEP\_FAILED ── manual PR needed |
| **Phase 6** | Health check   | Endpoint not ready       | **Polling**             |    **20**    | **15s interval** | TIMEOUT ── admin can retry       |
| **Any**     | Admin retry    | Button click             | Resume from failed step |   Unlimited  | -                | Admin decides                    |
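
The two Phase 4 rows can be read as one decision procedure: retry 429 and 5xx responses with doubling waits, treat 409 as success, and give up after three retries. The sketch below is one way to express that in Python and is not the workload-runner's code. The names `create_record_with_retry`, `send` (assumed to perform one Cloudflare request and return its HTTP status code), and the injectable `sleep` are hypothetical. It also assumes one initial attempt followed by up to three retries.

```python
import time
from typing import Callable

# Statuses the table marks as retryable: 429 and any 5xx.
RETRYABLE = {429} | set(range(500, 600))


def create_record_with_retry(
    send: Callable[[], int],
    max_retries: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Apply the Phase 4 retry rules to the HTTP status returned by send().

    Returns "created", "exists" (409, treated as success), or "failed".
    """
    status = send()  # initial attempt
    for attempt in range(1, max_retries + 1):
        if status == 409:
            return "exists"  # idempotent: record already provisioned
        if status not in RETRYABLE:
            break  # success or a non-retryable error
        sleep(base_delay * 2 ** (attempt - 1))  # waits of 2s, 4s, 8s
        status = send()
    if status == 409:
        return "exists"
    return "created" if 200 <= status < 300 else "failed"
```

Passing `sleep` as a parameter lets a test record the waits instead of actually pausing, for example `create_record_with_retry(send, sleep=waits.append)`.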

***

### Steps Added (vs. Original List)

| Original Step    | What Was Added                                                                                                            |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------- |
| Keycloak created | + OIDC client registration, admin user creation, role assignment, theme & email config, token lifespans, password policy  |
| Vault created    | + Identity engine, KV-v2 engine, K8s auth role for pod injection, access policies, agent sidecar config                   |
| Domains created  | + Redis Streams dispatch (BE → WR), per-environment DNS (prod/staging/dev), idempotent duplicate handling                 |
| DNS → Git Action | + Branch creation, Helm values.yaml modification per env, auto-approve validation, squash merge                           |
| Deployment       | + ArgoCD sync detection, cert-manager TLS provisioning, structured health check polling                                   |
| Completed        | + Billing setup, theme sync, settings sync, tenant-events stream (→ finops-service), audit log with actor tracking        |
| *(new)*          | + Client-side validation, subdomain uniqueness check, env prefix generation, SSE progress updates, retry from failed step |
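
The Helm values modification in the "DNS → Git Action" row can be sketched as a small generator for the ingress entries shown in Phase 5. This is an illustration only: `ingress_values` and its `base_domain` parameter are hypothetical, and the TLS host list includes the tenant apex alongside the wildcard on the assumption that one certificate should cover all three hosts (a wildcard does not match the apex itself).

```python
def ingress_values(tenant: str, base_domain: str = "enclaive.cloud") -> dict:
    """Build the ingress section of values.yaml for one tenant and environment."""
    apex = f"{tenant}.{base_domain}"
    hosts = [apex, f"admin.{apex}", f"console.{apex}"]
    return {
        "hosts": [{"host": host, "paths": ["/"]} for host in hosts],
        "tls": [
            {
                "secretName": f"{tenant}-tls",
                # Apex plus wildcard, so the certificate covers every host above.
                "hosts": [apex, f"*.{apex}"],
            }
        ],
    }


# Example: entries for a tenant named "acme" in production.
values = ingress_values("acme")
```

Generating only this fragment, rather than rewriting the whole file, would keep the pull request diff limited to ingress fields, which is what the auto-approve validation checks.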

