Package Troubleshooting📜

This guide helps you diagnose and resolve issues with Big Bang packages. Package problems range from deployment failures and configuration issues to network connectivity problems and policy violations.

Overview📜

Big Bang packages are deployed using Flux and can encounter various types of issues:

  • Deployment Issues: Pods failing to start, image pull errors, resource constraints
  • Configuration Problems: Invalid Helm values, schema validation failures
  • Network Connectivity: Service mesh issues, network policies, DNS resolution
  • Policy Violations: Kyverno admission controller blocks, security policy denials
  • Resource Issues: Insufficient resources, scaling problems, persistent volume issues

Quick Diagnostics📜

1. Check Package Status📜

Start by examining the overall package health:

# Check Flux HelmRelease status
kubectl get helmreleases -A

# Check specific package status
kubectl get helmrelease <package-name> -n bigbang -o yaml

# Check pod status for the package
kubectl get pods -n <package-namespace>

2. Review Events📜

Events provide immediate insight into recent issues:

# Get events for a specific namespace
kubectl get events -n <package-namespace> --sort-by='.lastTimestamp'

# Get events for a specific pod
kubectl describe pod <pod-name> -n <package-namespace>

# Get cluster-wide events
kubectl get events --all-namespaces --sort-by='.lastTimestamp'

Flux Troubleshooting📜

1. Check Flux Controllers📜

Verify Flux components are healthy:

# Check Flux system pods
kubectl get pods -n flux-system

# Check Flux controller logs
kubectl logs -n flux-system deployment/helm-controller
kubectl logs -n flux-system deployment/source-controller
kubectl logs -n flux-system deployment/kustomize-controller

2. HelmRelease Debugging📜

Examine HelmRelease status and conditions:

# Get detailed HelmRelease status
kubectl describe helmrelease <package-name> -n bigbang

# Check for reconciliation errors
kubectl get helmrelease <package-name> -n bigbang -o jsonpath='{.status.conditions[*].message}'

# Force reconciliation
flux reconcile helmrelease <package-name> -n bigbang

3. Common Flux Issues📜

Schema Validation Errors:

# Check for schema validation issues in HelmRelease status
kubectl get helmrelease <package-name> -n bigbang -o yaml | grep -A 10 "conditions:"

# Common schema errors indicate:
# - Invalid Helm values
# - Missing required fields
# - Type mismatches in configuration

Source Errors:

# Check GitRepository or HelmRepository status
kubectl get gitrepository -n flux-system
kubectl get helmrepository -n flux-system

# Check source controller logs for repository access issues
kubectl logs -n flux-system deployment/source-controller

Helm Installation Failures:

# Check Helm release status directly
helm list -A
helm status <release-name> -n <namespace>

# Get Helm release history
helm history <release-name> -n <namespace>

Kyverno Policy Troubleshooting📜

1. Check Policy Violations📜

Identify admission policy blocks:

# Check Kyverno admission controller logs
kubectl logs -n kyverno deployment/kyverno-admission-controller

# Get policy violation events
kubectl get events --all-namespaces | grep -i "blocked\|denied\|failed"

# Check specific policy status
kubectl get cpol  # ClusterPolicy
kubectl get pol -A  # Policy

2. Policy Reports📜

Review policy evaluation results:

# Get cluster policy reports
kubectl get cpolr  # ClusterPolicyReport

# Get namespace policy reports
kubectl get polr -A  # PolicyReport

# Detailed policy report for a specific resource
kubectl describe cpolr <report-name>

3. Kyverno Reporter Setup📜

Follow the Overview of Kyverno Reporter to set up detailed reporting and alerting for policy violations.
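
Even without Kyverno Reporter installed, the PolicyReport summary fields give a quick count of failing results. A minimal sketch using the standard wgpolicyk8s.io summary fields:

# Print the failing-result count for each namespaced PolicyReport
kubectl get polr -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\tfail="}{.summary.fail}{"\n"}{end}'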

4. Common Policy Issues📜

Resource Mutation Conflicts:

  • Check if multiple policies modify the same resource
  • Review policy precedence and ordering
  • Examine mutating vs. validating policies
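
To survey which policies carry mutate rules, one option is a quick jsonpath listing (a sketch; adjust the output format to taste):

# List each ClusterPolicy alongside its rule names to spot overlapping rules
kubectl get cpol -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.rules[*].name}{"\n"}{end}'

# Or search the rendered policies for mutate rules directly
kubectl get cpol -o yaml | grep -B 5 "mutate:"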

Review Kyverno Exceptions for guidance on handling necessary exceptions.
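
For illustration, a minimal PolicyException sketch (all names are placeholders; the supported API version depends on your Kyverno release):

apiVersion: kyverno.io/v2
kind: PolicyException
metadata:
  name: <exception-name>
  namespace: <namespace>
spec:
  exceptions:
    - policyName: <policy-name>
      ruleNames:
        - <rule-name>
  match:
    any:
      - resources:
          kinds:
            - Deployment
          namespaces:
            - <namespace>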

Network Connectivity Issues📜

For network-related package problems, refer to the networking troubleshooting guide, which covers:

  • Service Mesh Issues: Istio configuration, mTLS problems, traffic routing
  • Network Policies: Connectivity blocks, policy misconfigurations
  • DNS Resolution: Service discovery failures, external DNS issues
  • Ingress Problems: Load balancer issues, certificate problems
  • Service Entries: External service access, HTTPS/TLS configuration

Quick Network Checks📜

# Test service DNS resolution from inside a pod
kubectl exec -it <pod-name> -n <namespace> -- nslookup <service-name>

# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>

# Verify Istio sidecar injection (look for istio-proxy among the container names)
kubectl get pods -n <namespace> -o jsonpath='{.items[*].spec.containers[*].name}'

Resource and Scaling Issues📜

1. Resource Constraints📜

Check for resource-related problems:

# Check node resources
kubectl top nodes
kubectl describe nodes

# Check pod resource usage
kubectl top pods -A
kubectl describe pod <pod-name> -n <namespace>

# Check resource quotas
kubectl get resourcequota -A
kubectl describe resourcequota <quota-name> -n <namespace>

2. Persistent Volume Issues📜

Debug storage problems:

# Check PVC status
kubectl get pvc -A
kubectl describe pvc <pvc-name> -n <namespace>

# Check storage classes
kubectl get storageclass

# Check persistent volumes
kubectl get pv
kubectl describe pv <pv-name>

3. Scaling Problems📜

Address autoscaling issues:

# Check HPA status
kubectl get hpa -A
kubectl describe hpa <hpa-name> -n <namespace>

# Check VPA recommendations
kubectl get vpa -A
kubectl describe vpa <vpa-name> -n <namespace>

# Check deployment replica status
kubectl get deployment -n <namespace>
kubectl describe deployment <deployment-name> -n <namespace>

Observability and Monitoring📜

1. Check Monitoring Stack📜

Use Big Bang’s observability tools:

  • Grafana Dashboards: Review package-specific dashboards
  • Prometheus Metrics: Query application and infrastructure metrics
  • Tempo Tracing: Analyze request flows and performance
  • AlertManager: Check for active alerts
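
As a starting point in Prometheus or Grafana Explore, a couple of example PromQL queries (metric names assume kube-state-metrics, which ships with Big Bang's monitoring stack):

# Pods not in the Running phase in a package namespace
sum by (pod) (kube_pod_status_phase{namespace="<package-namespace>", phase!="Running"})

# Container restarts over the last hour
increase(kube_pod_container_status_restarts_total{namespace="<package-namespace>"}[1h])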

2. Application Logs📜

Examine application logs for errors:

# Get pod logs
kubectl logs <pod-name> -n <namespace>

# Get logs from all containers in a pod
kubectl logs <pod-name> -n <namespace> --all-containers

# Follow logs in real-time
kubectl logs -f <pod-name> -n <namespace>

# Get previous container logs (for crashed pods)
kubectl logs <pod-name> -n <namespace> --previous

3. Custom Metrics📜

Enable application-specific monitoring as described in the monitoring guide:

# Add Prometheus scraping annotations
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
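
Note that scrape annotations only take effect if your Prometheus is configured to honor them. Big Bang's monitoring package is based on the Prometheus Operator, which discovers targets via ServiceMonitors instead; a minimal sketch (names, labels, and the metrics port are placeholders):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: <app-name>
  namespace: <package-namespace>
spec:
  selector:
    matchLabels:
      app: <app-name>
  endpoints:
    - port: metrics
      path: /metrics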

Configuration and Immutability Issues📜

1. Configuration Drift📜

Detect drift with the Flux CLI and reconcile back to the Git-defined state:

# Inspect Flux resources and their conditions
flux get kustomizations -A
flux get helmreleases -A

# Inspect a specific resource for reconciliation status
kubectl get kustomization <name> -n <namespace> -o yaml
kubectl get helmrelease <name> -n <namespace> -o yaml

# Compare live cluster state against local manifests (detects drift;
# flux diff currently supports Kustomizations only)
flux diff kustomization <name> -n <namespace> --path <path-to-local-manifests>

# Remediate detected drift by forcing reconciliation from source
flux reconcile kustomization <name> -n <namespace> --with-source
flux reconcile helmrelease <name> -n <namespace> --with-source

# Refresh the source if Git/Helm repository changes need to be pulled
flux reconcile source git <repo-name> -n flux-system

Interpretation and guidance:

  • If flux diff shows differences, those resources have drifted (cluster state no longer matches Git/source).
  • Reconcile to reapply the Git-desired state; if the drift is intentional, update the Git source instead of reconciling.
  • Use consistent Kustomization/HelmRelease intervals and automation to reduce manual drift.
  • Review Flux resource status (conditions and the lastApplied/lastAttempted revisions) to determine why reconciliation failed and whether source updates are required.
  • Consider alerting on failed reconciliations or large diffs to catch drift early (a sketch follows).
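
A minimal sketch of a Flux Alert that surfaces failed reconciliations, assuming the notification controller is running and a Provider named <provider-name> (e.g., Slack or Mattermost) already exists; the API version depends on your Flux release:

apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: reconciliation-failures
  namespace: flux-system
spec:
  providerRef:
    name: <provider-name>
  eventSeverity: error
  eventSources:
    - kind: Kustomization
      name: '*'
    - kind: HelmRelease
      name: '*'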

2. Immutable Field Updates📜

Handle immutable field errors:

# Common immutable fields that cause issues:
# - Pod selectors in Deployments
# - Service ClusterIP
# - PVC storage size (depending on storage class)

# Solution: Delete and recreate the resource
kubectl delete deployment <deployment-name> -n <namespace>
# Flux will recreate the resource from the Git-defined state on the next reconciliation

3. Helm Value Validation📜

Validate Helm values before deployment:

# Dry-run Helm install
helm install <release-name> <chart> --dry-run --debug --values values.yaml

# Template and validate manifests
helm template <release-name> <chart> --values values.yaml | kubectl apply --dry-run=client -f -

Advanced Debugging📜

1. Debug Containers📜

Use debug containers for deeper investigation:

# Attach an ephemeral debug container to a running pod
kubectl debug <pod-name> -n <namespace> -it --image=busybox

# Debug with specific tools
kubectl debug <pod-name> -n <namespace> -it --image=nicolaka/netshoot

2. Package-Specific Issues📜

Image Pull Problems:

# Check image pull secrets
kubectl get secrets -n <namespace> | grep docker

# Verify registry access
kubectl describe pod <pod-name> -n <namespace>

Init Container Failures:

# Check init container logs
kubectl logs <pod-name> -n <namespace> -c <init-container-name>

# Check init container status
kubectl describe pod <pod-name> -n <namespace>

3. Rollback Procedures📜

When issues persist, consider rollback:

# Rollback Helm release
helm rollback <release-name> <revision> -n <namespace>

# Rollback via Flux (revert Git commit)
git revert <commit-hash>
git push origin main

Escalation and Support📜

1. Gather Debug Information📜

Before escalating, collect:

# Create debug bundle
kubectl cluster-info dump --output-directory=./debug-info

# Export relevant logs (kubectl logs requires a pod name or a label selector)
kubectl logs -n <namespace> -l <label-selector> --all-containers --prefix=true > package-logs.txt

# Export events
kubectl get events --all-namespaces --sort-by='.lastTimestamp' > events.txt

2. Community Resources📜

  • Check Big Bang documentation and troubleshooting guides
  • Search Big Bang GitLab issues for similar problems
  • Engage with the Big Bang community for complex issues
  • Review package-specific documentation and upstream issues

3. Preventive Measures📜

  • Implement comprehensive monitoring and alerting
  • Use staging environments for testing changes
  • Regularly review and update package configurations
  • Maintain backup and restore procedures
  • Document custom configurations and known issues

Remember to always test fixes in a non-production environment first and maintain detailed logs of troubleshooting steps for future reference.