# Mimir 6.x Upgrade Guide
This document covers the breaking changes and required actions when upgrading Grafana Mimir from 5.x to 6.x within Big Bang. It is organized into two sections: upstream breaking changes (with references to the authoritative Grafana documentation) and Big Bang-specific changes that are unique to this package’s implementation.
## Upstream Breaking Changes
The upstream Mimir 6.0 release introduces significant architectural changes. This section highlights the areas most relevant to Big Bang operators, but is not a replacement for the full upstream migration guide.
**Required reading:** Migrate from Helm chart 5.x to 6.0
### rollout-operator CRD Schema Changes
Mimir 6.0 ships with a new version of the rollout-operator that includes breaking schema changes to two CRDs:
- `replicatemplates.rollout-operator.grafana.com`
- `zoneawarepoddisruptionbudgets.rollout-operator.grafana.com`
Because Kubernetes does not allow in-place updates to CRD schemas when existing data cannot be migrated automatically, these CRDs must be deleted and re-applied before the Helm upgrade completes. Failure to do so will cause the upgrade to fail.
See the upstream migration guide for the full context on why these CRDs changed and what the new schema introduces.
Big Bang handles the CRD delete/re-apply automatically via the upgradeJob pre-upgrade Helm hook — see Automated CRD Upgrade Job for details and the manual fallback procedure.
### New Ingest Storage Architecture (Kafka)
Mimir 6.0 introduces a native ingest storage path backed by Kafka, replacing the classic gRPC push path between the distributor and the ingester. This is now the upstream-preferred architecture.
Key implications:
- The new architecture requires a Kafka broker (or Kafka-compatible endpoint) to be available.
- The classic gRPC push path remains supported but must be explicitly re-enabled if you are not adopting Kafka.
- The two modes are mutually exclusive: `ingest_storage.enabled: true` and `ingester.push_grpc_method_enabled: true` cannot both be active.
See upstream docs: Ingest storage overview
Big Bang defaults to the classic architecture rather than the upstream default — see Classic Architecture Default and Ingest Storage Adoption for details.
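If you want to pin the classic architecture explicitly in your overrides rather than relying on chart defaults, a minimal sketch using the two keys named above (key placement under `structuredConfig` mirrors the Big Bang package layout shown later in this guide; verify exact key names against the upstream configuration reference for your Mimir version):

```yaml
mimir:
  values:
    upstream:
      mimir:
        structuredConfig:
          ingest_storage:
            enabled: false            # stay on the classic path
          ingester:
            push_grpc_method_enabled: true  # keep the gRPC push path active
```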
### NGINX Deprecation
The standalone NGINX deployment model, deprecated in Mimir 5.x, is fully removed in 6.0 in favor of the unified gateway.
- Big Bang already defaults to the unified gateway (`upstream.gateway.enabled: true`). Operators who have not customized this are unaffected.
- If you have any explicit `nginx:` configuration in your overrides, you must complete the migration to the unified gateway before upgrading to 6.0.
See upstream migration guide: Migrate to unified proxy deployment
### Additional Upstream Changes
The following upstream changes are covered in detail in the Grafana migration guide linked above. Operators should review each section for applicability to their deployment:
- Zone-aware replication topology changes
- `store_gateway.sharding_ring` stability tuning recommendations for upgrades
- Removal of previously deprecated configuration keys
## Big Bang-Specific Changes
The following changes are unique to the Big Bang Mimir package and are not addressed in upstream documentation.
### 1. Automated CRD Upgrade Job
To eliminate the manual CRD deletion step required by the upstream migration guide, Big Bang includes a pre-upgrade Helm hook job (`upgradeJob`) that automatically handles the rollout-operator CRD lifecycle during a `helm upgrade`.
**What it does:**
- Detects whether the upgrade is transitioning from a chart version prior to `6.0.0-bb.0`.
- Deletes the old `replicatemplates` and `zoneawarepoddisruptionbudgets` CRDs.
- Applies the updated CRD definitions (bundled in the chart under `files/crds/`).
- Waits up to 60 seconds for both CRDs to reach the `Established` state before allowing the upgrade to proceed.
The job only fires when all three conditions are true:
- `helm upgrade` is being run (not an install)
- `upgradeJob.enabled: true` (the default)
- `upstream.rollout_operator.enabled: true` (the default)
This job is a one-time operation. It only triggers when upgrading from a chart version prior to `6.0.0-bb.0`. Subsequent upgrades within the 6.x line will not fire the job.
**Upgrade job network policy** (when `networkPolicies.enabled: true`):
When Big Bang network policies are enabled, the `upgradeJob` hook also creates a NetworkPolicy (`api-egress-upgrade-job`) scoped to the upgrade job pod, permitting it to reach the Kubernetes API. This policy is automatically deleted after the hook completes regardless of success or failure — it will not appear in steady-state.
By default, Big Bang sets `networkPolicies.controlPlaneCidr: 0.0.0.0/0`, resulting in a permissive egress rule (all destinations except the AWS metadata IP `169.254.169.254/32`). No action is required for standard deployments.
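For reference, the rendered hook policy is approximately the following sketch (the pod selector labels here are assumptions for illustration; the actual rendered labels come from the chart):

```yaml
# Approximation of the api-egress-upgrade-job policy rendered by the hook.
# Selector labels below are placeholders, not the chart's exact labels.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-egress-upgrade-job
  namespace: mimir
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: upgrade-job  # placeholder label
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0              # networkPolicies.controlPlaneCidr default
            except:
              - 169.254.169.254/32       # AWS metadata IP always excluded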
For hardened environments (GovCloud, IL4/IL5), it is strongly recommended to scope this to your cluster’s specific control plane IP before upgrading:
```yaml
networkPolicies:
  controlPlaneCidr: "172.16.0.1/32"  # replace with your actual control plane IP
```
Use `kubectl get endpoints -n default kubernetes` to find the correct value. If `networkPolicies.enabled: false`, this NetworkPolicy is not rendered and no action is needed.
**Disabling the job and upgrading manually:**
If you prefer to manage the CRD lifecycle yourself — for example, in an air-gapped environment where the job cannot reach the Kubernetes API — disable the job and follow the upstream manual procedure:
```yaml
# In your Big Bang values override
mimir:
  values:
    upgradeJob:
      enabled: false
```
Then perform the CRD deletion and re-application manually before running `helm upgrade`:
```shell
# Delete the old CRDs (safe — no CR data is stored in these CRDs)
kubectl delete crd replicatemplates.rollout-operator.grafana.com --ignore-not-found
kubectl delete crd zoneawarepoddisruptionbudgets.rollout-operator.grafana.com --ignore-not-found

# Apply the updated CRDs from the chart source
kubectl apply -f chart/files/crds/replica-templates.yaml
kubectl apply -f chart/files/crds/zone-aware-pod-disruption-budget.yaml

# Verify both CRDs are established before proceeding
kubectl wait --for=condition=established --timeout=60s \
  crd/replicatemplates.rollout-operator.grafana.com \
  crd/zoneawarepoddisruptionbudgets.rollout-operator.grafana.com
```
Upstream reference for the manual CRD procedure: Migrate from Helm chart 5.x to 6.0 — CRD upgrade steps
### 2. Classic Architecture Default and Ingest Storage Adoption
Unlike upstream, which enables ingest storage and Kafka by default in 6.0, Big Bang defaults to the classic gRPC push architecture (`ingest_storage.enabled: false`, `ingester.push_grpc_method_enabled: true`, `kafka.enabled: false`). This is an intentional deviation that preserves continuity for existing deployments upgrading from 5.x, giving operators the time to plan and provision a production Kafka backend at their own pace.
When you are ready to adopt ingest storage, note that the kafka-native image bundled in the chart is intended for demonstration and testing only — upstream explicitly states it is not suitable for production. For production deployments, use a cloud-managed Kafka service such as Amazon MSK, Confluent Cloud, or Azure Event Hubs, and configure Mimir to connect to it externally. Refer to the upstream ingest storage documentation for production Kafka configuration guidance.
For hardened environments (GovCloud, IL4/IL5), TLS encryption of the Kafka broker connection is required. Configure `ingest_storage.kafka.tls` in your `structuredConfig` and ensure your Kafka backend (e.g. Amazon MSK) has TLS/SASL enabled before pointing Mimir at it.
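Putting the pieces above together, adopting ingest storage against an external broker is a `structuredConfig` change. A hedged sketch — the broker endpoint and topic name are placeholders, and the exact key names under `ingest_storage.kafka` should be verified against the upstream configuration reference for your Mimir version:

```yaml
mimir:
  values:
    upstream:
      mimir:
        structuredConfig:
          ingest_storage:
            enabled: true
            kafka:
              address: b-1.example-msk.amazonaws.com:9094  # placeholder broker endpoint
              topic: mimir-ingest                          # placeholder topic name
              tls:
                enabled: true  # required for hardened environments
          ingester:
            push_grpc_method_enabled: false  # mutually exclusive with ingest storage
```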
### 3. rollout-operator Admission Webhook Impact on MinIO Tenant
Mimir 6.0 enables the rollout-operator by default. The rollout-operator installs namespace-scoped admission webhooks (`no-downscale`, `prepare-downscale`, `pod-eviction`) that intercept all StatefulSet UPDATE operations within the Mimir release namespace with `failurePolicy: Fail`.
**The problem:**
When using the Mimir-package MinIO Tenant (`minio-tenant.enabled: true`), the MinIO Tenant runs in the `mimir` namespace alongside Mimir. The rollout-operator’s admission webhooks intercept all StatefulSet UPDATE operations in that namespace, including those issued by the MinIO Operator when reconciling the Tenant. This causes:
- MinIO Operator reconciliation failures silently blocked at the webhook
- `Tenant` CR spec changes (scaling, configuration updates) that appear to apply but are never actuated
- Initial bucket creation that may fail or time out if MinIO does not fully initialize
**Option A — Same-namespace mitigation (Mimir-package MinIO Tenant):**
The package includes a `minio-tenant.bucketInit` Job that mitigates the initial bucket creation race by polling MinIO’s health endpoint and only creating buckets once MinIO is ready. Enable it alongside the Mimir-package MinIO Tenant:
```yaml
mimir:
  values:
    minio-tenant:
      enabled: true
      bucketInit:
        enabled: true
```
**Important:** The `bucketInit` job only addresses initial bucket creation. Subsequent `Tenant` spec changes made after initial deployment may still be silently blocked by the rollout-operator webhooks as long as MinIO and Mimir share a namespace. For production deployments requiring ongoing MinIO Tenant management, use Option B below.
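The mitigation the job performs amounts to a simple readiness gate: retry a health probe until it succeeds, then create the buckets. A minimal illustration of that pattern — the probe URL, service DNS name, and retry budget here are assumptions for demonstration, not the packaged job’s actual values:

```shell
#!/bin/sh
# wait_until MAX_ATTEMPTS DELAY_SECONDS CMD...
# Retries CMD until it succeeds or the attempt budget is exhausted.
# Returns 0 on success, 1 on timeout.
wait_until() {
  max_attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$max_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: gate bucket creation on MinIO's readiness endpoint
# (the service DNS name below is a placeholder for a Tenant in the mimir namespace):
# wait_until 30 2 curl -sf http://minio.mimir.svc.cluster.local:9000/minio/health/ready \
#   && mc mb myminio/mimir-blocks
```

The real job runs in-cluster with credentials mounted; this sketch only shows the poll-then-act ordering that avoids the race.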
**Option B — Big Bang MinIO Tenant (separate namespace, recommended for production):**
The recommended long-term solution is to use the Big Bang MinIO Tenant — the `minio` and `minioOperator` addons — which deploys MinIO in its own dedicated `minio` namespace. Because the rollout-operator webhooks are scoped to the `mimir` namespace, the MinIO Operator can reconcile the Tenant freely without interference.
**Credential management:** The `access_key_id` and `secret_access_key` values below must not be stored as plaintext in Git. Manage them via SOPS-encrypted values or an external secrets provider (e.g. Vault, AWS Secrets Manager via External Secrets Operator).
1. Enable the `minioOperator` and `minio` addons in your Big Bang values:
```yaml
addons:
  minioOperator:
    enabled: true
  minio:
    enabled: true
```
2. Disable the Mimir-package MinIO Tenant and point Mimir’s object storage at the Big Bang MinIO service:
```yaml
mimir:
  values:
    minio-tenant:
      enabled: false
    upstream:
      mimir:
        structuredConfig:
          common:
            storage:
              backend: s3
              s3:
                endpoint: minio.minio.svc.cluster.local:9000
                # Recommended: manage credentials via SOPS or an external secrets provider.
                # Plaintext values are acceptable for development/testing only — do not commit to Git.
                access_key_id: <your-access-key>
                secret_access_key: <your-secret-key>
                insecure: true
          blocks_storage:
            s3:
              bucket_name: mimir-blocks
          ruler_storage:
            s3:
              bucket_name: mimir-ruler
          alertmanager_storage:
            s3:
              bucket_name: mimir-alertmanager
```
When Istio hardening is enabled, you will also need an Istio `ServiceEntry` and a `NetworkPolicy` egress rule to allow Mimir pods to reach the MinIO service across namespaces. See `overview.md` for guidance on configuring Istio egress for external storage endpoints.
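As a starting point, a cross-namespace egress rule for this case can be sketched as a standard Kubernetes NetworkPolicy. The namespace label and port below assume a default Big Bang MinIO deployment listening on 9000; adjust the pod selector to match your Mimir labels:

```yaml
# Illustrative sketch — allow Mimir pods egress to the MinIO namespace on port 9000.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mimir-egress-minio
  namespace: mimir
spec:
  podSelector: {}        # all pods in the mimir namespace; narrow if desired
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: minio
      ports:
        - protocol: TCP
          port: 9000
```

The corresponding Istio `ServiceEntry` (if required by your hardening configuration) is deployment-specific; follow `overview.md` for its exact shape.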
**Option C — Disable the rollout-operator:**
**Important:** This option is only safe if zone-aware replication is not enabled for your ingesters (`upstream.ingester.zoneAwareReplication.enabled: false`) and store-gateways (`upstream.store_gateway.zoneAwareReplication.enabled: false`), which is the default. If zone-aware replication is active, disabling the rollout-operator removes the `no-downscale`, `prepare-downscale`, and `pod-eviction` safeguards against data loss during scale-down and is not recommended.
If neither Option A nor Option B fits your deployment, and zone-aware replication is not in use, you can disable the rollout-operator entirely. This removes the admission webhooks from the namespace, eliminating the interference with MinIO Operator reconciliation.
```yaml
mimir:
  values:
    upstream:
      rollout_operator:
        enabled: false
```