so-l2l: in-place stop/restart via label-scoped cleanup (#743)
- `down()` scopes cleanup to a single stack via `app.kubernetes.io/stack` and keeps the namespace `Active` by default
- New `stop/down --delete-namespace` flag for opt-in full teardown
- `down()` is synchronous - waits until resources are actually gone before returning. Callers can drop their own wait loops
- `up()` skip-if-exists for Jobs completes the create-or-replace coverage (other kinds already had it)
- Orphan PVs from a prior `stop --delete-namespace` get cleaned on the next `stop --delete-volumes`
- Every k8s resource SO creates now carries `app.kubernetes.io/stack` via a new `ClusterInfo._stack_labels()` helper
- Closes so-l2l, so-076.2. Also includes pebble audit: closes so-c71, so-b2b, so-k1k; files so-328
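Typical invocations after this change (a sketch only; the deployment directory name is a placeholder, flags as introduced in this PR):

```
# in-place stop: labeled resources removed, namespace stays Active, kind cluster reused
laconic-so deployment --dir my-deployment stop --delete-volumes --skip-cluster-management

# opt-in full teardown, including the namespace
laconic-so deployment --dir my-deployment stop --delete-volumes --delete-namespace
```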
parent f40913d187
commit fc5dc80058
@@ -26,4 +26,16 @@
{"type":"create","timestamp":"2026-04-08T05:51:31.557582604Z","issue_id":"so-5cd","payload":{"description":"The DockerDeployer.up() in stack_orchestrator/deploy/compose/deploy_docker.py accepts image_overrides as a parameter but silently drops it — only k8s mode (deploy_k8s.py) actually applies overrides.\n\nImpact: the --image container=image CLI flag on 'laconic-so deployment start' is a no-op for compose-mode deployments. Spec-level image-overrides: keys are also ignored in compose mode (they reach up() via deployment.py but are never applied).\n\nUse case: gorchain-stacks test scripts build :local images via build-containers, but compose files reference ghcr.io/gorbagana-dev/*:latest (so prod pulls work). Without image override support in compose mode, tests either need to docker tag the builds or the compose file needs to be rewritten before start — both ugly workarounds for what should be a first-class mechanism.\n\nFix sketch: in DockerDeployer.up(), when image_overrides is non-empty, write a temporary docker-compose.override.yml with {services: {name: {image: ref}}} and construct a new DockerClient with compose_files + [override_path]. Keeps k8s path untouched, reuses existing --image CLI flag and spec-level image-overrides: plumbing.","priority":"2","title":"Compose deployer ignores image_overrides","type":"bug"}}
{"type": "create", "timestamp": "2026-04-13T09:54:05.207241Z", "issue_id": "so-c71", "payload": {"title": "extraPortMappings maps all compose ports unconditionally", "type": "bug", "priority": "2", "description": "Commit fb69cc58 added compose service port mapping to Kind extraPortMappings. The intent was to support network_mode: host services (RPC, gossip), but the implementation maps ALL compose ports unconditionally. Internal-only ports (postgres 5432, redis 6379) get exposed on the host, causing conflicts with local services. The port mapping should only apply to services with network_mode: host, or be controlled by a spec-level opt-in.", "source_commit": "fb69cc58"}}
{"type": "create", "timestamp": "2026-04-14T09:53:31.040118Z", "issue_id": "so-078", "payload": {"title": "Deployments should be self-sufficient: copy hooks into deployment dir", "type": "feature", "priority": "1", "description": "deploy/commands.py hooks are resolved from the stack repo at runtime via get_stack_path. The deployment dir has no copy. This means: (1) the repo must remain at the same path after deploy create, (2) deployment start/restart fail with 'stack does not exist' if cwd differs from deploy create time (stack-source in deployment.yml is relative), (3) deployments cannot be moved or run independently of the source repo. Fix: deploy create should copy deploy/commands.py into the deployment dir alongside compose files and configmaps. call_stack_deploy_start should load from the deployment dir. The deployment becomes self-sufficient."}}
{"type": "update", "timestamp": "2026-04-14T10:01:14.937483Z", "issue_id": "so-c71", "payload": {"status": "resolved", "resolution": "Fixed in commit e909357a on fix/extraport-host-only branch. Only map ports for services with network_mode: host. Ports 80/443 for Caddy always mapped."}}
{"type":"comment","timestamp":"2026-04-15T06:12:45.58660796Z","issue_id":"so-c71","payload":{"body":"Fixed in commit e909357a on fix/extraport-host-only branch. Only map ports for services with network_mode: host. Ports 80/443 for Caddy always mapped."}}
{"type":"close","timestamp":"2026-04-15T06:12:45.832454065Z","issue_id":"so-c71","payload":{}}
{"type":"comment","timestamp":"2026-04-15T06:18:02.64056792Z","issue_id":"so-b2b","payload":{"body":"Fixed. create_registry_secret() in deployment_create.py:583 reads image-pull-secret from spec, resolves token via token-env/token-file. Spec key renamed from registry-credentials to image-pull-secret (spec.py:140). Documented in docs/deployment_patterns.md with REGISTRY_TOKEN usage example."}}
{"type":"close","timestamp":"2026-04-15T06:18:02.965856003Z","issue_id":"so-b2b","payload":{}}
{"type":"comment","timestamp":"2026-04-15T06:18:04.543850719Z","issue_id":"so-k1k","payload":{"body":"Largely resolved. deployment restart (deployment.py:324) now uses 'git rev-parse --show-toplevel' to find repo_root dynamically (lines 364-378), removing the fixed 4-parents-up assumption. External stacks with varying nesting depths now work for restart. deploy create still uses get_stack_path(stack_name) which is a different mechanism but works correctly with --stack-path. Closing — the underlying breakage is gone."}}
{"type":"close","timestamp":"2026-04-15T06:18:04.856542806Z","issue_id":"so-k1k","payload":{}}
{"type":"comment","timestamp":"2026-04-15T06:18:08.436540869Z","issue_id":"so-076.2","payload":{"body":"Partially mitigated by commit cc6acd5f which flipped --skip-cluster-management default to true, so 'deployment stop' no longer destroys the cluster by default. Root fix still open: down() in deploy_k8s.py:904-936 unconditionally calls _delete_namespace() (line 929) and destroy_cluster() (line 936) when --perform-cluster-management is passed. No logic distinguishes shared vs dedicated clusters."}}
{"type":"comment","timestamp":"2026-04-15T06:18:11.374723274Z","issue_id":"so-l2l","payload":{"body":"Partially addressed. Readiness probes are now generated in cluster_info.py:652-671 (part C of the original fix). Parts A and B still open: up() does not use patch/apply (delete/recreate semantics remain), and down() still calls _delete_namespace() unconditionally at deploy_k8s.py:929 on every restart. A 'fix: never delete namespace on deployment down' commit (ae2cea34) exists on a remote branch but is not merged to main."}}
{"type":"create","timestamp":"2026-04-15T11:11:15.584733236Z","issue_id":"so-328","payload":{"description":"deployment restart runs create_operation(update=True) which uses copytree(dirs_exist_ok=True) to sync the stack repo into the deployment dir (deployment_create.py:1079, 1130). This is additive only — it overwrites and adds files, but never removes them. Two resulting gaps:\n\n1. Deletions don't propagate. If a script, configmap file, or compose service is removed from the stack repo, the deployment dir keeps it, and up_operation keeps applying it. The k8s ConfigMap retains removed keys; removed Deployments/Services are not cleaned up (up() is create/patch, not full reconciliation). Operators see stale files and orphan workloads that won't disappear without manual kubectl intervention or a full teardown.\n\n2. stack.yml structural changes don't auto-surface in the spec. If a stack.yml gains a new configmap declaration or a new compose file reference, restart doesn't pull it in unless the operator's spec.yml already references it. The spec is the contract, so additions to the stack aren't applied to live deployments just by pulling the repo.\n\nTeardown + redeploy is the only reliable way to clean up today, which is not practical in production.\n\nFix direction: create_operation(update=True) should treat the deployment dir as reconciled state — diff the desired tree (from the stack repo + spec) against what's on disk and remove files that no longer exist upstream. up_operation should then delete k8s resources (Deployments, Services, ConfigMaps) that are no longer declared by any compose/configmap source, likely scoped by an 'app.kubernetes.io/managed-by: laconic-so' label to avoid nuking unrelated resources. For new stack.yml entries, consider whether the spec needs operator action or whether restart can auto-detect and warn.","priority":"3","title":"deployment restart does not propagate repo deletions or new stack.yml entries","type":"bug"}}
{"type":"comment","timestamp":"2026-04-16T06:24:38.826132538Z","issue_id":"so-l2l","payload":{"body":"Fixed in so-l2l Parts A and B on this branch:\n\n**Part A (up() as create-or-update):** Deployments, Services, ConfigMaps, Secrets, Ingresses, and Endpoints already used create-or-replace in up(). Completed coverage by adding 409 skip-if-exists for Jobs (one-shot, re-run unwanted). Readiness probes (Part C) were already present.\n\n**Part B (down() preserves namespace):** _delete_labeled_resources now deletes by 'app.kubernetes.io/stack' label and keeps the namespace Active. Full-teardown option is a new --delete-namespace flag on stop/down. down() is synchronous (waits for resources to actually be gone before returning) so tests and ansible can assume clean state on return. Orphan PVs from prior --delete-namespace runs are also cleaned on subsequent stop --delete-volumes.\n\nrestart no longer calls down() at all (deployment.py:421-468), so the original wd-d92-style namespace termination race is structurally impossible. In-cluster rolling updates work via k8s native semantics when Deployment pod specs change via replace."}}
{"type":"close","timestamp":"2026-04-16T06:24:39.175431401Z","issue_id":"so-l2l","payload":{}}
{"type":"comment","timestamp":"2026-04-16T06:24:41.70556861Z","issue_id":"so-076.2","payload":{"body":"Fixed on chore/pebble-status-audit. stop now uses label-scoped cleanup (app.kubernetes.io/stack=\u003cstack\u003e) and keeps the namespace Active by default. The Kind cluster is not destroyed unless --perform-cluster-management is passed. Full namespace teardown is opt-in via the new --delete-namespace flag. Multiple stacks sharing a namespace/cluster are now cleaned up independently, not blown away en masse."}}
{"type":"close","timestamp":"2026-04-16T06:24:42.153940477Z","issue_id":"so-076.2","payload":{}}
@@ -55,7 +55,8 @@ class DockerDeployer(Deployer):
         except DockerException as e:
             raise DeployerException(e)
 
-    def down(self, timeout, volumes, skip_cluster_management):
+    def down(self, timeout, volumes, skip_cluster_management, delete_namespace=False):
+        # delete_namespace is k8s-only; ignored in compose mode.
         if not opts.o.dry_run:
             try:
                 return self.docker.compose.down(timeout=timeout, volumes=volumes)
@@ -172,7 +172,13 @@ def up_operation(
     )
 
 
-def down_operation(ctx, delete_volumes, extra_args_list, skip_cluster_management=False):
+def down_operation(
+    ctx,
+    delete_volumes,
+    extra_args_list,
+    skip_cluster_management=False,
+    delete_namespace=False,
+):
     timeout_arg = None
     if extra_args_list:
         timeout_arg = extra_args_list[0]
@@ -182,6 +188,7 @@ def down_operation(ctx, delete_volumes, extra_args_list, skip_cluster_management
         timeout=timeout_arg,
         volumes=delete_volumes,
         skip_cluster_management=skip_cluster_management,
+        delete_namespace=delete_namespace,
     )
 
@@ -24,7 +24,7 @@ class Deployer(ABC):
         pass
 
     @abstractmethod
-    def down(self, timeout, volumes, skip_cluster_management):
+    def down(self, timeout, volumes, skip_cluster_management, delete_namespace=False):
         pass
 
     @abstractmethod
@@ -157,13 +157,21 @@ def prepare(ctx, skip_cluster_management):
     default=True,
     help="Skip cluster initialization/tear-down (only for kind-k8s deployments)",
 )
+@click.option(
+    "--delete-namespace",
+    is_flag=True,
+    default=False,
+    help="Also delete the k8s namespace (full teardown)",
+)
 @click.argument("extra_args", nargs=-1) # help: command: down <service1> <service2>
 @click.pass_context
-def down(ctx, delete_volumes, skip_cluster_management, extra_args):
+def down(ctx, delete_volumes, skip_cluster_management, delete_namespace, extra_args):
     # Get the stack config file name
     # TODO: add cluster name and env file here
     ctx.obj = make_deploy_context(ctx)
-    down_operation(ctx, delete_volumes, extra_args, skip_cluster_management)
+    down_operation(
+        ctx, delete_volumes, extra_args, skip_cluster_management, delete_namespace
+    )
 
 
 # stop is the preferred alias for down
@@ -176,12 +184,20 @@ def down(ctx, delete_volumes, skip_cluster_management, extra_args):
     default=True,
     help="Skip cluster initialization/tear-down (only for kind-k8s deployments)",
 )
+@click.option(
+    "--delete-namespace",
+    is_flag=True,
+    default=False,
+    help="Also delete the k8s namespace (full teardown)",
+)
 @click.argument("extra_args", nargs=-1) # help: command: down <service1> <service2>
 @click.pass_context
-def stop(ctx, delete_volumes, skip_cluster_management, extra_args):
+def stop(ctx, delete_volumes, skip_cluster_management, delete_namespace, extra_args):
     # TODO: add cluster name and env file here
     ctx.obj = make_deploy_context(ctx)
-    down_operation(ctx, delete_volumes, extra_args, skip_cluster_management)
+    down_operation(
+        ctx, delete_volumes, extra_args, skip_cluster_management, delete_namespace
+    )
 
 
 @command.command()
@@ -118,6 +118,17 @@ class ClusterInfo:
         volumes.extend(named_volumes_from_pod_files(self.parsed_job_yaml_map))
         return volumes
 
+    def _stack_labels(self, extra: Optional[dict] = None) -> dict:
+        """Standard resource labels. Use on every k8s resource SO creates so
+        label-based cleanup (down by stack) can find them all.
+        """
+        labels = {"app": self.app_name}
+        if self.stack_name:
+            labels["app.kubernetes.io/stack"] = self.stack_name
+        if extra:
+            labels.update(extra)
+        return labels
+
     def get_nodeports(self):
         nodeports = []
         for pod_name in self.parsed_pod_yaml_map:
@@ -151,7 +162,7 @@ class ClusterInfo:
                     f"{self.app_name}-nodeport-"
                     f"{pod_port}-{protocol.lower()}"
                 ),
-                labels={"app": self.app_name},
+                labels=self._stack_labels(),
             ),
             spec=client.V1ServiceSpec(
                 type="NodePort",
@@ -268,7 +279,7 @@ class ClusterInfo:
         ingress = client.V1Ingress(
             metadata=client.V1ObjectMeta(
                 name=f"{self.app_name}-ingress",
-                labels={"app": self.app_name},
+                labels=self._stack_labels(),
                 annotations=ingress_annotations,
             ),
             spec=spec,
@@ -323,7 +334,7 @@ class ClusterInfo:
         service = client.V1Service(
             metadata=client.V1ObjectMeta(
                 name=f"{self.app_name}-service",
-                labels={"app": self.app_name},
+                labels=self._stack_labels(),
             ),
             spec=client.V1ServiceSpec(
                 type="ClusterIP",
@@ -355,10 +366,9 @@ class ClusterInfo:
             self.spec.get_volume_resources_for(volume_name) or global_resources
         )
 
-        labels = {
-            "app": self.app_name,
-            "volume-label": f"{self.app_name}-{volume_name}",
-        }
+        labels = self._stack_labels(
+            {"volume-label": f"{self.app_name}-{volume_name}"}
+        )
         if volume_path:
             storage_class_name = "manual"
             k8s_volume_name = f"{self.app_name}-{volume_name}"
@@ -418,7 +428,7 @@ class ClusterInfo:
         spec = client.V1ConfigMap(
             metadata=client.V1ObjectMeta(
                 name=f"{self.app_name}-{cfg_map_name}",
-                labels={"app": self.app_name, "configmap-label": cfg_map_name},
+                labels=self._stack_labels({"configmap-label": cfg_map_name}),
             ),
             binary_data=data,
         )
@@ -482,10 +492,9 @@ class ClusterInfo:
         pv = client.V1PersistentVolume(
             metadata=client.V1ObjectMeta(
                 name=f"{self.app_name}-{volume_name}",
-                labels={
-                    "app": self.app_name,
-                    "volume-label": f"{self.app_name}-{volume_name}",
-                },
+                labels=self._stack_labels(
+                    {"volume-label": f"{self.app_name}-{volume_name}"}
+                ),
             ),
             spec=spec,
         )
@@ -737,9 +746,7 @@ class ClusterInfo:
         Returns (annotations, labels, affinity, tolerations).
         """
         annotations = None
-        labels = {"app": self.app_name}
-        if self.stack_name:
-            labels["app.kubernetes.io/stack"] = self.stack_name
+        labels = self._stack_labels()
         affinity = None
         tolerations = None
 
@@ -920,21 +927,11 @@ class ClusterInfo:
             kind="Deployment",
             metadata=client.V1ObjectMeta(
                 name=deployment_name,
-                labels={
-                    "app": self.app_name,
-                    **(
-                        {
-                            "app.kubernetes.io/stack": self.stack_name,
-                        }
-                        if self.stack_name
-                        else {}
-                    ),
-                    **(
-                        {"app.kubernetes.io/component": pod_name}
-                        if multi_pod
-                        else {}
-                    ),
-                },
+                labels=self._stack_labels(
+                    {"app.kubernetes.io/component": pod_name}
+                    if multi_pod
+                    else None
+                ),
             ),
             spec=spec,
         )
@@ -1001,7 +998,7 @@ class ClusterInfo:
         service = client.V1Service(
             metadata=client.V1ObjectMeta(
                 name=f"{self.app_name}-{pod_name}-service",
-                labels={"app": self.app_name},
+                labels=self._stack_labels(),
             ),
             spec=client.V1ServiceSpec(
                 type="ClusterIP",
@@ -1054,14 +1051,9 @@ class ClusterInfo:
 
         # Use a distinct app label for job pods so they don't get
         # picked up by pods_in_deployment() which queries app={app_name}.
-        pod_labels = {
-            "app": f"{self.app_name}-job",
-            **(
-                {"app.kubernetes.io/stack": self.stack_name}
-                if self.stack_name
-                else {}
-            ),
-        }
+        # Use a distinct app label for job pods (see comment above) so we
+        # still build via _stack_labels then override.
+        pod_labels = self._stack_labels({"app": f"{self.app_name}-job"})
         template = client.V1PodTemplateSpec(
             metadata=client.V1ObjectMeta(labels=pod_labels),
             spec=client.V1PodSpec(
@@ -1076,14 +1068,7 @@ class ClusterInfo:
             template=template,
             backoff_limit=0,
         )
-        job_labels = {
-            "app": self.app_name,
-            **(
-                {"app.kubernetes.io/stack": self.stack_name}
-                if self.stack_name
-                else {}
-            ),
-        }
+        job_labels = self._stack_labels()
         job = client.V1Job(
             api_version="batch/v1",
             kind="Job",
@@ -1121,7 +1106,7 @@ class ClusterInfo:
         svc = client.V1Service(
             metadata=client.V1ObjectMeta(
                 name=name,
-                labels={"app": self.app_name},
+                labels=self._stack_labels(),
             ),
             spec=client.V1ServiceSpec(
                 type="ExternalName",
@@ -1138,7 +1123,7 @@ class ClusterInfo:
         svc = client.V1Service(
             metadata=client.V1ObjectMeta(
                 name=name,
-                labels={"app": self.app_name},
+                labels=self._stack_labels(),
             ),
             spec=client.V1ServiceSpec(
                 cluster_ip="None",
@@ -1156,7 +1141,7 @@ class ClusterInfo:
         svc = client.V1Service(
             metadata=client.V1ObjectMeta(
                 name=name,
-                labels={"app": self.app_name},
+                labels=self._stack_labels(),
             ),
             spec=client.V1ServiceSpec(
                 cluster_ip="None",
@@ -1199,7 +1184,7 @@ class ClusterInfo:
         secret = client.V1Secret(
             metadata=client.V1ObjectMeta(
                 name=secret_name,
-                labels={"app": self.app_name},
+                labels=self._stack_labels(),
             ),
             data=secret_data,
         )
@@ -189,7 +189,7 @@ class K8sDeployer(Deployer):
            ns = client.V1Namespace(
                metadata=client.V1ObjectMeta(
                    name=self.k8s_namespace,
-                   labels={"app": self.cluster_info.app_name},
+                   labels=self.cluster_info._stack_labels(),
                )
            )
            self.core_api.create_namespace(body=ns)
@@ -475,7 +475,7 @@ class K8sDeployer(Deployer):
         endpoints = client.V1Endpoints(
             metadata=client.V1ObjectMeta(
                 name=name,
-                labels={"app": self.cluster_info.app_name},
+                labels=self.cluster_info._stack_labels(),
             ),
             subsets=[
                 client.V1EndpointSubset(
@@ -535,7 +535,7 @@ class K8sDeployer(Deployer):
         endpoints = client.V1Endpoints(
             metadata=client.V1ObjectMeta(
                 name=name,
-                labels={"app": self.cluster_info.app_name},
+                labels=self.cluster_info._stack_labels(),
             ),
             subsets=[
                 client.V1EndpointSubset(
@@ -709,16 +709,27 @@ class K8sDeployer(Deployer):
         if opts.o.debug:
             print(f"Sending this job: {job}")
         if not opts.o.dry_run:
-            job_resp = self.batch_api.create_namespaced_job(
-                body=job, namespace=self.k8s_namespace
-            )
-            if opts.o.debug:
-                print("Job created:")
-                if job_resp.metadata:
-                    print(
-                        f" {job_resp.metadata.namespace} "
-                        f"{job_resp.metadata.name}"
-                    )
+            job_name = job.metadata.name
+            try:
+                job_resp = self.batch_api.create_namespaced_job(
+                    body=job, namespace=self.k8s_namespace
+                )
+                if opts.o.debug:
+                    print("Job created:")
+                    if job_resp.metadata:
+                        print(
+                            f" {job_resp.metadata.namespace} "
+                            f"{job_resp.metadata.name}"
+                        )
+            except ApiException as e:
+                if e.status == 409:
+                    # Job already exists from a prior run. Jobs are one-
+                    # shot — don't recreate on restart. Delete the Job
+                    # explicitly to re-run (stop --delete-volumes also
+                    # clears them via label-based cleanup).
+                    print(f"Job {job_name} already exists, skipping")
+                else:
+                    raise
 
     def _find_certificate_for_host_name(self, host_name):
         all_certificates = self.custom_obj_api.list_namespaced_custom_object(
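Per the comment in the hunk above, a Job that already exists is skipped rather than re-run; to re-run it, delete the Job first and start the deployment again. A rough sketch (the job and deployment names are illustrative):

```
kubectl delete job my-app-job -n my-namespace
laconic-so deployment --dir my-deployment start --skip-cluster-management
```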
@ -901,40 +912,261 @@ class K8sDeployer(Deployer):
|
||||||
|
|
||||||
call_stack_deploy_start(self.deployment_context)
|
call_stack_deploy_start(self.deployment_context)
|
||||||
|
|
||||||
def down(self, timeout, volumes, skip_cluster_management):
|
def down(
|
||||||
|
self, timeout, volumes, skip_cluster_management, delete_namespace=False
|
||||||
|
):
|
||||||
|
"""Tear down stack-labeled resources. Phases:
|
||||||
|
|
||||||
|
1. Delete namespaced resources (if namespace still exists).
|
||||||
|
2. Delete cluster-scoped PVs (if --delete-volumes, regardless of (1)).
|
||||||
|
3. Wait for everything we triggered to actually be gone.
|
||||||
|
4. Optionally delete the namespace itself (--delete-namespace).
|
||||||
|
5. Optionally destroy the kind cluster (--perform-cluster-management).
|
||||||
|
|
||||||
|
Steps 1-3 scope cleanup to a single stack via app.kubernetes.io/stack,
|
||||||
|
so multiple stacks sharing a namespace tear down independently.
|
||||||
|
"""
|
||||||
self.skip_cluster_management = skip_cluster_management
|
self.skip_cluster_management = skip_cluster_management
|
||||||
self.connect_api()
|
self.connect_api()
|
||||||
|
|
||||||
app_label = f"app={self.cluster_info.app_name}"
|
selector = self._stack_label_selector()
|
||||||
|
ns = self.k8s_namespace
|
||||||
|
ns_exists = self._namespace_exists(ns)
|
||||||
|
|
||||||
# PersistentVolumes are cluster-scoped (not namespaced), so delete by label
|
if ns_exists:
|
||||||
|
self._delete_namespaced_labeled_resources(ns, selector, volumes)
|
||||||
if volumes:
|
if volumes:
|
||||||
try:
|
self._delete_labeled_pvs(selector)
|
||||||
pvs = self.core_api.list_persistent_volume(label_selector=app_label)
|
self._wait_for_labeled_gone(
|
||||||
for pv in pvs.items:
|
ns, selector, delete_volumes=volumes, namespace_present=ns_exists
|
||||||
if opts.o.debug:
|
)
|
||||||
print(f"Deleting PV: {pv.metadata.name}")
|
|
||||||
try:
|
|
||||||
self.core_api.delete_persistent_volume(name=pv.metadata.name)
|
|
||||||
except ApiException as e:
|
|
||||||
_check_delete_exception(e)
|
|
||||||
except ApiException as e:
|
|
||||||
if opts.o.debug:
|
|
||||||
print(f"Error listing PVs: {e}")
|
|
||||||
|
|
||||||
# Delete the namespace to ensure clean slate.
|
if delete_namespace and ns_exists:
|
||||||
# Resources created by older laconic-so versions lack labels, so
|
self._delete_namespace()
|
||||||
# label-based deletion can't find them. Namespace deletion is the
|
self._wait_for_namespace_gone()
|
||||||
# only reliable cleanup.
|
|
||||||
self._delete_namespace()
|
|
||||||
# Wait for namespace to finish terminating before returning,
|
|
||||||
# so that up() can recreate it immediately.
|
|
||||||
self._wait_for_namespace_gone()
|
|
||||||
|
|
||||||
if self.is_kind() and not self.skip_cluster_management:
|
if self.is_kind() and not self.skip_cluster_management:
|
||||||
# Destroy the kind cluster
|
|
||||||
destroy_cluster(self.kind_cluster_name)
|
destroy_cluster(self.kind_cluster_name)
|
||||||
|
|
||||||
|
def _stack_label_selector(self) -> str:
|
||||||
|
"""Selector used for stack-scoped cleanup.
|
||||||
|
|
||||||
|
Prefer app.kubernetes.io/stack (per-stack) and fall back to the
|
||||||
|
legacy app= label (cluster-id scoped) for deployments that predate
|
||||||
|
the stack label.
|
||||||
|
"""
|
||||||
|
stack_name = self.cluster_info.stack_name
|
||||||
|
if stack_name:
|
||||||
|
return f"app.kubernetes.io/stack={stack_name}"
|
||||||
|
return f"app={self.cluster_info.app_name}"
|
||||||
|
|
||||||
|
def _namespace_exists(self, namespace: str) -> bool:
|
||||||
|
try:
|
||||||
|
self.core_api.read_namespace(name=namespace)
|
||||||
|
return True
|
||||||
|
except ApiException as e:
|
||||||
|
if e.status == 404:
|
||||||
|
if opts.o.debug:
|
||||||
|
print(f"Namespace {namespace} not found")
|
||||||
|
return False
|
||||||
|
raise
|
||||||
|
|
||||||
|
def _delete_namespaced_labeled_resources(
|
||||||
|
self, namespace: str, selector: str, delete_volumes: bool
|
||||||
|
):
|
||||||
|
"""Delete Ingresses, Deployments, Jobs, Services, ConfigMaps,
|
||||||
|
Secrets, Endpoints, Pods, and (if delete_volumes) PVCs in the
|
||||||
|
namespace. Order matters: Ingresses first so external traffic
|
||||||
|
stops, then workloads, then support objects, then Pods, then PVCs.
|
||||||
|
"""
|
||||||
|
if opts.o.dry_run:
|
||||||
|
print(
|
||||||
|
f"Dry run: would delete namespaced resources in {namespace} "
|
||||||
|
f"matching {selector}"
|
||||||
|
)
|
||||||
|
return
|
||||||
|
|
||||||
|
def swallow_404(fn):
|
||||||
|
try:
|
||||||
|
fn()
|
||||||
|
except ApiException as e:
|
||||||
|
if e.status not in (404, 405):
|
||||||
|
raise
|
||||||
|
|
||||||
|
# Ingresses first so external traffic stops before pods disappear.
|
||||||
|
swallow_404(
|
||||||
|
lambda: self.networking_api.delete_collection_namespaced_ingress(
|
||||||
|
namespace=namespace, label_selector=selector
|
||||||
|
)
|
||||||
|
)
|
||||||
|
# Deployments (owns ReplicaSets + Pods via GC).
|
||||||
|
swallow_404(
|
||||||
|
lambda: self.apps_api.delete_collection_namespaced_deployment(
|
||||||
|
namespace=namespace, label_selector=selector
|
||||||
|
)
|
||||||
|
)
|
||||||
|
# Jobs — propagation=Background cascades to child pods.
|
||||||
|
swallow_404(
|
||||||
|
lambda: self.batch_api.delete_collection_namespaced_job(
|
||||||
|
namespace=namespace,
|
||||||
|
label_selector=selector,
|
||||||
|
propagation_policy="Background",
|
||||||
|
)
|
||||||
|
)
|
||||||
|
# Services have no delete_collection on core_api; list + delete.
|
||||||
|
self._list_delete_namespaced(
|
||||||
|
namespace,
|
||||||
|
selector,
|
||||||
|
list_fn=self.core_api.list_namespaced_service,
|
||||||
|
delete_fn=self.core_api.delete_namespaced_service,
|
||||||
|
)
|
||||||
|
# ConfigMaps, Secrets.
|
||||||
|
swallow_404(
|
||||||
|
lambda: self.core_api.delete_collection_namespaced_config_map(
|
||||||
|
namespace=namespace, label_selector=selector
|
||||||
|
)
|
||||||
|
)
|
||||||
|
swallow_404(
|
||||||
|
lambda: self.core_api.delete_collection_namespaced_secret(
|
||||||
|
namespace=namespace, label_selector=selector
|
||||||
|
)
|
||||||
|
)
|
||||||
|
# Endpoints usually GC with Services, but we create a few directly
|
||||||
|
# (external-services) that aren't owned by a Service — clean those.
|
||||||
|
self._list_delete_namespaced(
|
||||||
|
namespace,
|
||||||
|
selector,
|
||||||
|
list_fn=self.core_api.list_namespaced_endpoints,
|
||||||
|
delete_fn=self.core_api.delete_namespaced_endpoints,
|
||||||
|
)
|
||||||
|
# Stray pods (owned pods are GC'd with their Deployment/Job).
|
||||||
|
swallow_404(
|
||||||
|
lambda: self.core_api.delete_collection_namespaced_pod(
|
||||||
|
namespace=namespace, label_selector=selector
|
||||||
|
)
|
||||||
|
)
|
||||||
|
if delete_volumes:
|
||||||
|
swallow_404(
|
||||||
|
lambda: self.core_api.delete_collection_namespaced_persistent_volume_claim( # noqa: E501
|
||||||
|
namespace=namespace, label_selector=selector
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
def _list_delete_namespaced(self, namespace, selector, list_fn, delete_fn):
|
||||||
|
"""List by selector and delete each item. Use for resources where
|
||||||
|
the k8s python client lacks delete_collection (Services, Endpoints).
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
items = list_fn(namespace=namespace, label_selector=selector).items
|
||||||
|
except ApiException as e:
|
||||||
|
if e.status == 404:
|
||||||
|
return
|
||||||
|
raise
|
||||||
|
for item in items:
|
||||||
|
try:
|
||||||
|
delete_fn(name=item.metadata.name, namespace=namespace)
|
||||||
|
except ApiException as e:
|
||||||
|
if e.status not in (404, 405):
|
||||||
|
raise
|
||||||
|
|
||||||
|
def _delete_labeled_pvs(self, selector: str):
|
||||||
|
"""Delete cluster-scoped PVs matching the stack label."""
|
||||||
|
if opts.o.dry_run:
|
||||||
|
print(f"Dry run: would delete PVs matching {selector}")
|
||||||
|
return
|
||||||
|
try:
|
||||||
|
pvs = self.core_api.list_persistent_volume(label_selector=selector)
|
||||||
|
except ApiException as e:
|
||||||
|
if opts.o.debug:
|
||||||
|
print(f"Error listing PVs: {e}")
|
||||||
|
return
|
||||||
|
for pv in pvs.items:
|
||||||
|
if opts.o.debug:
|
||||||
|
print(f"Deleting PV: {pv.metadata.name}")
|
||||||
|
try:
|
||||||
|
self.core_api.delete_persistent_volume(name=pv.metadata.name)
|
||||||
|
except ApiException as e:
|
||||||
|
_check_delete_exception(e)
|
||||||
|
|
||||||
|
def _wait_for_labeled_gone(
|
||||||
|
self,
|
||||||
|
namespace: str,
|
||||||
|
selector: str,
|
||||||
|
delete_volumes: bool,
|
||||||
|
namespace_present: bool,
|
||||||
|
timeout_seconds: int = 120,
|
||||||
|
):
|
||||||
|
"""Poll until every kind we triggered a delete for is gone.
|
||||||
|
|
||||||
|
delete_collection/delete are async — finalizers (PV bound-by-PVC,
|
||||||
|
PVC bound-by-VolumeAttachment, pod graceful shutdown) propagate
|
||||||
|
after the API call returns. Blocking here makes down() a
|
||||||
|
synchronous contract for callers (tests, ansible, cryovial).
|
||||||
|
"""
|
||||||
|
import time
|
||||||
|
|
||||||
|
listers = []
|
||||||
|
if namespace_present:
|
||||||
|
listers += [
|
||||||
|
("deployment", lambda: self.apps_api.list_namespaced_deployment(
|
||||||
|
namespace=namespace, label_selector=selector)),
|
||||||
|
("ingress", lambda: self.networking_api.list_namespaced_ingress(
|
||||||
|
namespace=namespace, label_selector=selector)),
|
||||||
|
("job", lambda: self.batch_api.list_namespaced_job(
|
||||||
|
namespace=namespace, label_selector=selector)),
|
||||||
|
("service", lambda: self.core_api.list_namespaced_service(
|
||||||
|
namespace=namespace, label_selector=selector)),
|
||||||
|
("configmap", lambda: self.core_api.list_namespaced_config_map(
|
||||||
|
namespace=namespace, label_selector=selector)),
|
||||||
|
("secret", lambda: self.core_api.list_namespaced_secret(
|
||||||
|
namespace=namespace, label_selector=selector)),
|
||||||
|
("pod", lambda: self.core_api.list_namespaced_pod(
|
||||||
|
namespace=namespace, label_selector=selector)),
|
||||||
|
]
|
||||||
|
if delete_volumes:
|
||||||
|
listers.append(
|
||||||
|
("persistentvolumeclaim",
|
||||||
|
lambda: self.core_api.list_namespaced_persistent_volume_claim(
|
||||||
|
namespace=namespace, label_selector=selector))
|
||||||
|
)
|
||||||
|
# PVs are cluster-scoped — wait for them even when the namespace
|
||||||
|
# is already gone (orphaned from a prior --delete-namespace).
|
||||||
|
if delete_volumes:
|
||||||
|
listers.append(
|
||||||
|
("persistentvolume",
|
||||||
|
lambda: self.core_api.list_persistent_volume(
|
||||||
|
label_selector=selector))
|
||||||
|
)
|
||||||
|
|
||||||
|
def remaining():
|
||||||
|
out = []
|
||||||
|
for kind, lister in listers:
|
||||||
|
try:
|
||||||
|
items = lister().items
|
||||||
|
except ApiException as e:
|
||||||
|
if e.status == 404:
|
||||||
|
continue
|
||||||
|
raise
|
||||||
|
if items:
|
||||||
|
out.append((kind, len(items)))
|
||||||
|
return out
|
||||||
|
|
||||||
|
deadline = time.monotonic() + timeout_seconds
|
||||||
|
while time.monotonic() < deadline:
|
||||||
|
left = remaining()
|
||||||
|
if not left:
|
||||||
|
return
|
||||||
|
if opts.o.debug:
|
||||||
|
print(f"Waiting for deletions: {left}")
|
||||||
|
time.sleep(2)
|
||||||
|
|
||||||
|
left = remaining()
|
||||||
|
if left:
|
||||||
|
print(
|
||||||
|
f"Warning: resources still present after {timeout_seconds}s: "
|
||||||
|
f"{left}"
|
||||||
|
)
|
||||||
|
|
||||||
def status(self):
|
def status(self):
|
||||||
self.connect_api()
|
self.connect_api()
|
||||||
# Call whatever API we need to get the running container list
|
# Call whatever API we need to get the running container list
|
||||||
|
|
|
||||||
|
|
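The synchronous-teardown contract called out in the commit message can be approximated from the outside by polling until nothing stack-labeled remains; a rough shell sketch (namespace and stack values are illustrative, not the exact helper in the diff above):

```
# rough equivalent of what down() now waits for before returning
until [ -z "$(kubectl get deploy,job,svc,pod,pvc -n my-namespace \
  -l app.kubernetes.io/stack=my-stack --no-headers 2>/dev/null)" ]; do
  sleep 2
done
```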
@@ -23,7 +23,7 @@ wait_for_pods_started () {
   done
   # Timed out, error exit
   echo "waiting for pods to start: FAILED"
-  delete_cluster_exit
+  cleanup_and_exit
 }
 
 wait_for_log_output () {
@@ -42,15 +42,42 @@ wait_for_log_output () {
   done
   # Timed out, error exit
   echo "waiting for pods log content: FAILED"
-  delete_cluster_exit
+  cleanup_and_exit
 }
 
 
-delete_cluster_exit () {
-  $TEST_TARGET_SO deployment --dir $test_deployment_dir stop --delete-volumes
+cleanup_and_exit () {
+  # Full teardown so CI runners don't leak namespaces/PVs between runs.
+  $TEST_TARGET_SO deployment --dir $test_deployment_dir \
+    stop --delete-volumes --delete-namespace --skip-cluster-management || true
   exit 1
 }
 
+assert_ns_phase () {
+  local expected=$1
+  local phase
+  phase=$(kubectl get namespace ${deployment_ns} -o jsonpath='{.status.phase}' 2>/dev/null || echo "Missing")
+  if [ "$phase" != "$expected" ]; then
+    echo "namespace phase test: FAILED (expected ${expected}, got ${phase})"
+    cleanup_and_exit
+  fi
+}
+
+# Count labeled resources in the deployment namespace. down() is
+# synchronous on its own cleanup (waits for PVCs/pods to terminate
+# before returning) so callers can assert immediately.
+# Usage: assert_no_labeled_resources <kind>
+assert_no_labeled_resources () {
+  local kind=$1
+  local count
+  count=$(kubectl get ${kind} -n ${deployment_ns} \
+    -l app.kubernetes.io/stack=test --no-headers 2>/dev/null | wc -l)
+  if [ "$count" -ne 0 ]; then
+    echo "labeled cleanup test: FAILED (${kind} still present: ${count})"
+    cleanup_and_exit
+  fi
+}
+
 # Note: eventually this test should be folded into ../deploy/
 # but keeping it separate for now for convenience
 TEST_TARGET_SO=$( ls -t1 ./package/laconic-so* | head -1 )
@@ -130,7 +157,7 @@ if [[ "$log_output_3" == *"filesystem is fresh"* ]]; then
 else
   echo "deployment logs test: FAILED"
   echo "$log_output_3"
-  delete_cluster_exit
+  cleanup_and_exit
 fi
 
 # Check the config variable CERC_TEST_PARAM_1 was passed correctly
@@ -138,7 +165,7 @@ if [[ "$log_output_3" == *"Test-param-1: PASSED"* ]]; then
   echo "deployment config test: passed"
 else
   echo "deployment config test: FAILED"
-  delete_cluster_exit
+  cleanup_and_exit
 fi
 
 # Check the config variable CERC_TEST_PARAM_2 was passed correctly from the compose file
@@ -155,7 +182,7 @@ if [[ "$log_output_4" == *"/config/test_config:"* ]] && [[ "$log_output_4" == *"
   echo "deployment ConfigMap test: passed"
 else
   echo "deployment ConfigMap test: FAILED"
-  delete_cluster_exit
+  cleanup_and_exit
 fi
 
 # Check that the bind-mount volume is mounted.
@@ -165,7 +192,7 @@ if [[ "$log_output_5" == *"/data: MOUNTED"* ]]; then
 else
   echo "deployment bind volumes test: FAILED"
   echo "$log_output_5"
-  delete_cluster_exit
+  cleanup_and_exit
 fi
 
 # Check that the provisioner managed volume is mounted.
@@ -175,7 +202,7 @@ if [[ "$log_output_6" == *"/data2: MOUNTED"* ]]; then
 else
   echo "deployment provisioner volumes test: FAILED"
   echo "$log_output_6"
-  delete_cluster_exit
+  cleanup_and_exit
 fi
 
 # --- New feature tests: namespace, labels, jobs, secrets ---
@@ -187,7 +214,7 @@ if [ "$ns_pod_count" -gt 0 ]; then
 else
   echo "namespace isolation test: FAILED"
   echo "Expected pod in namespace ${deployment_ns}"
-  delete_cluster_exit
+  cleanup_and_exit
 fi
 
 # Check that the stack label is set on the pod
@@ -196,7 +223,7 @@ if [ "$stack_label_count" -gt 0 ]; then
   echo "stack label test: passed"
 else
   echo "stack label test: FAILED"
-  delete_cluster_exit
+  cleanup_and_exit
 fi
 
 # Check that the job completed successfully
@@ -212,7 +239,7 @@ if [ "$job_status" == "1" ]; then
 else
   echo "job completion test: FAILED"
   echo "Job status.succeeded: ${job_status}"
-  delete_cluster_exit
+  cleanup_and_exit
 fi
 
 # Check that the secrets spec results in an envFrom secretRef on the pod
@@ -223,25 +250,24 @@ if [ "$secret_ref" == "test-secret" ]; then
 else
   echo "secrets envFrom test: FAILED"
   echo "Expected secretRef 'test-secret', got: ${secret_ref}"
-  delete_cluster_exit
+  cleanup_and_exit
 fi
 
-# Stop then start again and check the volume was preserved.
-# Use --skip-cluster-management to reuse the existing kind cluster instead of
-# destroying and recreating it (which fails on CI runners due to stale etcd/certs
-# and cgroup detection issues).
-# Use --delete-volumes to clear PVs so fresh PVCs can bind on restart.
-# Bind-mount data survives on the host filesystem; provisioner volumes are recreated fresh.
+# Stop with --delete-volumes (but not --delete-namespace) and verify:
+# - namespace stays Active (no termination race on restart)
+# - stack-labeled workloads are gone
+# - bind-mount data on the host survives; provisioner volumes are recreated
 $TEST_TARGET_SO deployment --dir $test_deployment_dir stop --delete-volumes --skip-cluster-management
-# Wait for the namespace to be fully terminated before restarting.
-# Without this, 'start' fails with 403 Forbidden because the namespace
-# is still in Terminating state.
-for i in {1..60}; do
-  if ! kubectl get namespace ${deployment_ns} 2>/dev/null | grep -q .; then
-    break
-  fi
-  sleep 2
+assert_ns_phase "Active"
+echo "stop preserves namespace test: passed"
+
+for kind in deployment job ingress service configmap secret pvc pod; do
+  assert_no_labeled_resources "$kind"
 done
+echo "stop cleans labeled resources test: passed"
 
+# Restart — no wait needed, the namespace is still Active.
 $TEST_TARGET_SO deployment --dir $test_deployment_dir start --skip-cluster-management
 wait_for_pods_started
 wait_for_log_output
@@ -252,7 +278,7 @@ if [[ "$log_output_10" == *"/data filesystem is old"* ]]; then
   echo "Retain bind volumes test: passed"
 else
   echo "Retain bind volumes test: FAILED"
-  delete_cluster_exit
+  cleanup_and_exit
 fi
 
 # Provisioner volumes are destroyed when PVs are deleted (--delete-volumes on stop).
@@ -263,9 +289,17 @@ if [[ "$log_output_11" == *"/data2 filesystem is fresh"* ]]; then
   echo "Fresh provisioner volumes test: passed"
 else
   echo "Fresh provisioner volumes test: FAILED"
-  delete_cluster_exit
+  cleanup_and_exit
 fi
 
-# Stop and clean up
-$TEST_TARGET_SO deployment --dir $test_deployment_dir stop --delete-volumes
+# Full teardown: --delete-namespace nukes the namespace after labeled cleanup.
+# Verify the namespace is actually gone.
+$TEST_TARGET_SO deployment --dir $test_deployment_dir \
+  stop --delete-volumes --delete-namespace --skip-cluster-management
+if kubectl get namespace ${deployment_ns} >/dev/null 2>&1; then
+  echo "delete-namespace test: FAILED (namespace still present)"
+  exit 1
+fi
+echo "delete-namespace test: passed"
+
 echo "Test passed"