so-l2l: in-place stop/restart via label-scoped cleanup (#743)

- `down()` scopes cleanup to a single stack via `app.kubernetes.io/stack` and keeps the namespace `Active` by default
- New `stop/down --delete-namespace` flag for opt-in full teardown
- `down()` is now synchronous: it waits until resources are actually gone before returning, so callers can drop their own wait loops
- `up()` skip-if-exists for Jobs completes the create-or-replace coverage (other kinds already had it)
- Orphan PVs from a prior `stop --delete-namespace` get cleaned on the next `stop --delete-volumes`
- Every k8s resource SO creates now carries `app.kubernetes.io/stack` via a new `ClusterInfo._stack_labels()` helper
- Closes so-l2l, so-076.2. Also includes pebble audit: closes so-c71, so-b2b, so-k1k; files so-328
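A minimal sketch of the label-scoped cleanup path described above, assuming the standard kubernetes Python client; the namespace and stack name are illustrative, not taken from this change:

from kubernetes import client, config

config.load_kube_config()
selector = "app.kubernetes.io/stack=my-stack"   # one stack only, not the whole namespace
apps = client.AppsV1Api()
core = client.CoreV1Api()
# Label-filtered collection deletes leave other stacks in the shared namespace untouched
apps.delete_collection_namespaced_deployment(namespace="my-ns", label_selector=selector)
core.delete_collection_namespaced_config_map(namespace="my-ns", label_selector=selector)
# The namespace itself stays Active unless full teardown is requested via --delete-namespace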
pull/744/head v1.1.0-fc5dc80-202604160643
prathamesh0 2026-04-16 12:10:04 +05:30 committed by GitHub
parent f40913d187
commit fc5dc80058
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
8 changed files with 414 additions and 127 deletions

View File

@ -26,4 +26,16 @@
{"type":"create","timestamp":"2026-04-08T05:51:31.557582604Z","issue_id":"so-5cd","payload":{"description":"The DockerDeployer.up() in stack_orchestrator/deploy/compose/deploy_docker.py accepts image_overrides as a parameter but silently drops it — only k8s mode (deploy_k8s.py) actually applies overrides.\n\nImpact: the --image container=image CLI flag on 'laconic-so deployment start' is a no-op for compose-mode deployments. Spec-level image-overrides: keys are also ignored in compose mode (they reach up() via deployment.py but are never applied).\n\nUse case: gorchain-stacks test scripts build :local images via build-containers, but compose files reference ghcr.io/gorbagana-dev/*:latest (so prod pulls work). Without image override support in compose mode, tests either need to docker tag the builds or the compose file needs to be rewritten before start — both ugly workarounds for what should be a first-class mechanism.\n\nFix sketch: in DockerDeployer.up(), when image_overrides is non-empty, write a temporary docker-compose.override.yml with {services: {name: {image: ref}}} and construct a new DockerClient with compose_files + [override_path]. Keeps k8s path untouched, reuses existing --image CLI flag and spec-level image-overrides: plumbing.","priority":"2","title":"Compose deployer ignores image_overrides","type":"bug"}} {"type":"create","timestamp":"2026-04-08T05:51:31.557582604Z","issue_id":"so-5cd","payload":{"description":"The DockerDeployer.up() in stack_orchestrator/deploy/compose/deploy_docker.py accepts image_overrides as a parameter but silently drops it — only k8s mode (deploy_k8s.py) actually applies overrides.\n\nImpact: the --image container=image CLI flag on 'laconic-so deployment start' is a no-op for compose-mode deployments. Spec-level image-overrides: keys are also ignored in compose mode (they reach up() via deployment.py but are never applied).\n\nUse case: gorchain-stacks test scripts build :local images via build-containers, but compose files reference ghcr.io/gorbagana-dev/*:latest (so prod pulls work). Without image override support in compose mode, tests either need to docker tag the builds or the compose file needs to be rewritten before start — both ugly workarounds for what should be a first-class mechanism.\n\nFix sketch: in DockerDeployer.up(), when image_overrides is non-empty, write a temporary docker-compose.override.yml with {services: {name: {image: ref}}} and construct a new DockerClient with compose_files + [override_path]. Keeps k8s path untouched, reuses existing --image CLI flag and spec-level image-overrides: plumbing.","priority":"2","title":"Compose deployer ignores image_overrides","type":"bug"}}
{"type": "create", "timestamp": "2026-04-13T09:54:05.207241Z", "issue_id": "so-c71", "payload": {"title": "extraPortMappings maps all compose ports unconditionally", "type": "bug", "priority": "2", "description": "Commit fb69cc58 added compose service port mapping to Kind extraPortMappings. The intent was to support network_mode: host services (RPC, gossip), but the implementation maps ALL compose ports unconditionally. Internal-only ports (postgres 5432, redis 6379) get exposed on the host, causing conflicts with local services. The port mapping should only apply to services with network_mode: host, or be controlled by a spec-level opt-in.", "source_commit": "fb69cc58"}} {"type": "create", "timestamp": "2026-04-13T09:54:05.207241Z", "issue_id": "so-c71", "payload": {"title": "extraPortMappings maps all compose ports unconditionally", "type": "bug", "priority": "2", "description": "Commit fb69cc58 added compose service port mapping to Kind extraPortMappings. The intent was to support network_mode: host services (RPC, gossip), but the implementation maps ALL compose ports unconditionally. Internal-only ports (postgres 5432, redis 6379) get exposed on the host, causing conflicts with local services. The port mapping should only apply to services with network_mode: host, or be controlled by a spec-level opt-in.", "source_commit": "fb69cc58"}}
{"type": "create", "timestamp": "2026-04-14T09:53:31.040118Z", "issue_id": "so-078", "payload": {"title": "Deployments should be self-sufficient: copy hooks into deployment dir", "type": "feature", "priority": "1", "description": "deploy/commands.py hooks are resolved from the stack repo at runtime via get_stack_path. The deployment dir has no copy. This means: (1) the repo must remain at the same path after deploy create, (2) deployment start/restart fail with 'stack does not exist' if cwd differs from deploy create time (stack-source in deployment.yml is relative), (3) deployments cannot be moved or run independently of the source repo. Fix: deploy create should copy deploy/commands.py into the deployment dir alongside compose files and configmaps. call_stack_deploy_start should load from the deployment dir. The deployment becomes self-sufficient."}} {"type": "create", "timestamp": "2026-04-14T09:53:31.040118Z", "issue_id": "so-078", "payload": {"title": "Deployments should be self-sufficient: copy hooks into deployment dir", "type": "feature", "priority": "1", "description": "deploy/commands.py hooks are resolved from the stack repo at runtime via get_stack_path. The deployment dir has no copy. This means: (1) the repo must remain at the same path after deploy create, (2) deployment start/restart fail with 'stack does not exist' if cwd differs from deploy create time (stack-source in deployment.yml is relative), (3) deployments cannot be moved or run independently of the source repo. Fix: deploy create should copy deploy/commands.py into the deployment dir alongside compose files and configmaps. call_stack_deploy_start should load from the deployment dir. The deployment becomes self-sufficient."}}
{"type": "update", "timestamp": "2026-04-14T10:01:14.937483Z", "issue_id": "so-c71", "payload": {"status": "resolved", "resolution": "Fixed in commit e909357a on fix/extraport-host-only branch. Only map ports for services with network_mode: host. Ports 80/443 for Caddy always mapped."}} {"type":"comment","timestamp":"2026-04-15T06:12:45.58660796Z","issue_id":"so-c71","payload":{"body":"Fixed in commit e909357a on fix/extraport-host-only branch. Only map ports for services with network_mode: host. Ports 80/443 for Caddy always mapped."}}
{"type":"close","timestamp":"2026-04-15T06:12:45.832454065Z","issue_id":"so-c71","payload":{}}
{"type":"comment","timestamp":"2026-04-15T06:18:02.64056792Z","issue_id":"so-b2b","payload":{"body":"Fixed. create_registry_secret() in deployment_create.py:583 reads image-pull-secret from spec, resolves token via token-env/token-file. Spec key renamed from registry-credentials to image-pull-secret (spec.py:140). Documented in docs/deployment_patterns.md with REGISTRY_TOKEN usage example."}}
{"type":"close","timestamp":"2026-04-15T06:18:02.965856003Z","issue_id":"so-b2b","payload":{}}
{"type":"comment","timestamp":"2026-04-15T06:18:04.543850719Z","issue_id":"so-k1k","payload":{"body":"Largely resolved. deployment restart (deployment.py:324) now uses 'git rev-parse --show-toplevel' to find repo_root dynamically (lines 364-378), removing the fixed 4-parents-up assumption. External stacks with varying nesting depths now work for restart. deploy create still uses get_stack_path(stack_name) which is a different mechanism but works correctly with --stack-path. Closing — the underlying breakage is gone."}}
{"type":"close","timestamp":"2026-04-15T06:18:04.856542806Z","issue_id":"so-k1k","payload":{}}
{"type":"comment","timestamp":"2026-04-15T06:18:08.436540869Z","issue_id":"so-076.2","payload":{"body":"Partially mitigated by commit cc6acd5f which flipped --skip-cluster-management default to true, so 'deployment stop' no longer destroys the cluster by default. Root fix still open: down() in deploy_k8s.py:904-936 unconditionally calls _delete_namespace() (line 929) and destroy_cluster() (line 936) when --perform-cluster-management is passed. No logic distinguishes shared vs dedicated clusters."}}
{"type":"comment","timestamp":"2026-04-15T06:18:11.374723274Z","issue_id":"so-l2l","payload":{"body":"Partially addressed. Readiness probes are now generated in cluster_info.py:652-671 (part C of the original fix). Parts A and B still open: up() does not use patch/apply (delete/recreate semantics remain), and down() still calls _delete_namespace() unconditionally at deploy_k8s.py:929 on every restart. A 'fix: never delete namespace on deployment down' commit (ae2cea34) exists on a remote branch but is not merged to main."}}
{"type":"create","timestamp":"2026-04-15T11:11:15.584733236Z","issue_id":"so-328","payload":{"description":"deployment restart runs create_operation(update=True) which uses copytree(dirs_exist_ok=True) to sync the stack repo into the deployment dir (deployment_create.py:1079, 1130). This is additive only — it overwrites and adds files, but never removes them. Two resulting gaps:\n\n1. Deletions don't propagate. If a script, configmap file, or compose service is removed from the stack repo, the deployment dir keeps it, and up_operation keeps applying it. The k8s ConfigMap retains removed keys; removed Deployments/Services are not cleaned up (up() is create/patch, not full reconciliation). Operators see stale files and orphan workloads that won't disappear without manual kubectl intervention or a full teardown.\n\n2. stack.yml structural changes don't auto-surface in the spec. If a stack.yml gains a new configmap declaration or a new compose file reference, restart doesn't pull it in unless the operator's spec.yml already references it. The spec is the contract, so additions to the stack aren't applied to live deployments just by pulling the repo.\n\nTeardown + redeploy is the only reliable way to clean up today, which is not practical in production.\n\nFix direction: create_operation(update=True) should treat the deployment dir as reconciled state — diff the desired tree (from the stack repo + spec) against what's on disk and remove files that no longer exist upstream. up_operation should then delete k8s resources (Deployments, Services, ConfigMaps) that are no longer declared by any compose/configmap source, likely scoped by an 'app.kubernetes.io/managed-by: laconic-so' label to avoid nuking unrelated resources. For new stack.yml entries, consider whether the spec needs operator action or whether restart can auto-detect and warn.","priority":"3","title":"deployment restart does not propagate repo deletions or new stack.yml entries","type":"bug"}}
{"type":"comment","timestamp":"2026-04-16T06:24:38.826132538Z","issue_id":"so-l2l","payload":{"body":"Fixed in so-l2l Parts A and B on this branch:\n\n**Part A (up() as create-or-update):** Deployments, Services, ConfigMaps, Secrets, Ingresses, and Endpoints already used create-or-replace in up(). Completed coverage by adding 409 skip-if-exists for Jobs (one-shot, re-run unwanted). Readiness probes (Part C) were already present.\n\n**Part B (down() preserves namespace):** _delete_labeled_resources now deletes by 'app.kubernetes.io/stack' label and keeps the namespace Active. Full-teardown option is a new --delete-namespace flag on stop/down. down() is synchronous (waits for resources to actually be gone before returning) so tests and ansible can assume clean state on return. Orphan PVs from prior --delete-namespace runs are also cleaned on subsequent stop --delete-volumes.\n\nrestart no longer calls down() at all (deployment.py:421-468), so the original wd-d92-style namespace termination race is structurally impossible. In-cluster rolling updates work via k8s native semantics when Deployment pod specs change via replace."}}
{"type":"close","timestamp":"2026-04-16T06:24:39.175431401Z","issue_id":"so-l2l","payload":{}}
{"type":"comment","timestamp":"2026-04-16T06:24:41.70556861Z","issue_id":"so-076.2","payload":{"body":"Fixed on chore/pebble-status-audit. stop now uses label-scoped cleanup (app.kubernetes.io/stack=\u003cstack\u003e) and keeps the namespace Active by default. The Kind cluster is not destroyed unless --perform-cluster-management is passed. Full namespace teardown is opt-in via the new --delete-namespace flag. Multiple stacks sharing a namespace/cluster are now cleaned up independently, not blown away en masse."}}
{"type":"close","timestamp":"2026-04-16T06:24:42.153940477Z","issue_id":"so-076.2","payload":{}}

View File

@ -55,7 +55,8 @@ class DockerDeployer(Deployer):
except DockerException as e:
raise DeployerException(e)
-def down(self, timeout, volumes, skip_cluster_management):
def down(self, timeout, volumes, skip_cluster_management, delete_namespace=False):
# delete_namespace is k8s-only; ignored in compose mode.
if not opts.o.dry_run:
try:
return self.docker.compose.down(timeout=timeout, volumes=volumes)

View File

@ -172,7 +172,13 @@ def up_operation(
)
-def down_operation(ctx, delete_volumes, extra_args_list, skip_cluster_management=False):
def down_operation(
ctx,
delete_volumes,
extra_args_list,
skip_cluster_management=False,
delete_namespace=False,
):
timeout_arg = None
if extra_args_list:
timeout_arg = extra_args_list[0]
@ -182,6 +188,7 @@ def down_operation(ctx, delete_volumes, extra_args_list, skip_cluster_management
timeout=timeout_arg,
volumes=delete_volumes,
skip_cluster_management=skip_cluster_management,
delete_namespace=delete_namespace,
)

View File

@ -24,7 +24,7 @@ class Deployer(ABC):
pass
@abstractmethod
-def down(self, timeout, volumes, skip_cluster_management):
def down(self, timeout, volumes, skip_cluster_management, delete_namespace=False):
pass
@abstractmethod

View File

@ -157,13 +157,21 @@ def prepare(ctx, skip_cluster_management):
default=True,
help="Skip cluster initialization/tear-down (only for kind-k8s deployments)",
)
@click.option(
"--delete-namespace",
is_flag=True,
default=False,
help="Also delete the k8s namespace (full teardown)",
)
@click.argument("extra_args", nargs=-1) # help: command: down <service1> <service2>
@click.pass_context
-def down(ctx, delete_volumes, skip_cluster_management, extra_args):
def down(ctx, delete_volumes, skip_cluster_management, delete_namespace, extra_args):
# Get the stack config file name
# TODO: add cluster name and env file here
ctx.obj = make_deploy_context(ctx)
-down_operation(ctx, delete_volumes, extra_args, skip_cluster_management)
down_operation(
ctx, delete_volumes, extra_args, skip_cluster_management, delete_namespace
)
# stop is the preferred alias for down
@ -176,12 +184,20 @@ def down(ctx, delete_volumes, skip_cluster_management, extra_args):
default=True,
help="Skip cluster initialization/tear-down (only for kind-k8s deployments)",
)
@click.option(
"--delete-namespace",
is_flag=True,
default=False,
help="Also delete the k8s namespace (full teardown)",
)
@click.argument("extra_args", nargs=-1) # help: command: down <service1> <service2>
@click.pass_context
-def stop(ctx, delete_volumes, skip_cluster_management, extra_args):
def stop(ctx, delete_volumes, skip_cluster_management, delete_namespace, extra_args):
# TODO: add cluster name and env file here
ctx.obj = make_deploy_context(ctx)
-down_operation(ctx, delete_volumes, extra_args, skip_cluster_management)
down_operation(
ctx, delete_volumes, extra_args, skip_cluster_management, delete_namespace
)
@command.command()

View File

@ -118,6 +118,17 @@ class ClusterInfo:
volumes.extend(named_volumes_from_pod_files(self.parsed_job_yaml_map))
return volumes
def _stack_labels(self, extra: Optional[dict] = None) -> dict:
"""Standard resource labels. Use on every k8s resource SO creates so
label-based cleanup (down by stack) can find them all.
"""
labels = {"app": self.app_name}
if self.stack_name:
labels["app.kubernetes.io/stack"] = self.stack_name
if extra:
labels.update(extra)
return labels
def get_nodeports(self):
nodeports = []
for pod_name in self.parsed_pod_yaml_map:
@ -151,7 +162,7 @@ class ClusterInfo:
f"{self.app_name}-nodeport-"
f"{pod_port}-{protocol.lower()}"
),
-labels={"app": self.app_name},
labels=self._stack_labels(),
),
spec=client.V1ServiceSpec(
type="NodePort",
@ -268,7 +279,7 @@ class ClusterInfo:
ingress = client.V1Ingress(
metadata=client.V1ObjectMeta(
name=f"{self.app_name}-ingress",
-labels={"app": self.app_name},
labels=self._stack_labels(),
annotations=ingress_annotations,
),
spec=spec,
@ -323,7 +334,7 @@ class ClusterInfo:
service = client.V1Service(
metadata=client.V1ObjectMeta(
name=f"{self.app_name}-service",
-labels={"app": self.app_name},
labels=self._stack_labels(),
),
spec=client.V1ServiceSpec(
type="ClusterIP",
@ -355,10 +366,9 @@ class ClusterInfo:
self.spec.get_volume_resources_for(volume_name) or global_resources
)
-labels = {
-"app": self.app_name,
-"volume-label": f"{self.app_name}-{volume_name}",
-}
labels = self._stack_labels(
{"volume-label": f"{self.app_name}-{volume_name}"}
)
if volume_path:
storage_class_name = "manual"
k8s_volume_name = f"{self.app_name}-{volume_name}"
@ -418,7 +428,7 @@ class ClusterInfo:
spec = client.V1ConfigMap(
metadata=client.V1ObjectMeta(
name=f"{self.app_name}-{cfg_map_name}",
-labels={"app": self.app_name, "configmap-label": cfg_map_name},
labels=self._stack_labels({"configmap-label": cfg_map_name}),
),
binary_data=data,
)
@ -482,10 +492,9 @@ class ClusterInfo:
pv = client.V1PersistentVolume(
metadata=client.V1ObjectMeta(
name=f"{self.app_name}-{volume_name}",
-labels={
-"app": self.app_name,
-"volume-label": f"{self.app_name}-{volume_name}",
-},
labels=self._stack_labels(
{"volume-label": f"{self.app_name}-{volume_name}"}
),
),
spec=spec,
)
@ -737,9 +746,7 @@ class ClusterInfo:
Returns (annotations, labels, affinity, tolerations).
"""
annotations = None
-labels = {"app": self.app_name}
-if self.stack_name:
-labels["app.kubernetes.io/stack"] = self.stack_name
labels = self._stack_labels()
affinity = None
tolerations = None
@ -920,21 +927,11 @@ class ClusterInfo:
kind="Deployment", kind="Deployment",
metadata=client.V1ObjectMeta( metadata=client.V1ObjectMeta(
name=deployment_name, name=deployment_name,
labels={ labels=self._stack_labels(
"app": self.app_name,
**(
{
"app.kubernetes.io/stack": self.stack_name,
}
if self.stack_name
else {}
),
**(
{"app.kubernetes.io/component": pod_name} {"app.kubernetes.io/component": pod_name}
if multi_pod if multi_pod
else {} else None
), ),
},
), ),
spec=spec, spec=spec,
) )
@ -1001,7 +998,7 @@ class ClusterInfo:
service = client.V1Service( service = client.V1Service(
metadata=client.V1ObjectMeta( metadata=client.V1ObjectMeta(
name=f"{self.app_name}-{pod_name}-service", name=f"{self.app_name}-{pod_name}-service",
labels={"app": self.app_name}, labels=self._stack_labels(),
), ),
spec=client.V1ServiceSpec( spec=client.V1ServiceSpec(
type="ClusterIP", type="ClusterIP",
@ -1054,14 +1051,9 @@ class ClusterInfo:
# Use a distinct app label for job pods so they don't get # Use a distinct app label for job pods so they don't get
# picked up by pods_in_deployment() which queries app={app_name}. # picked up by pods_in_deployment() which queries app={app_name}.
pod_labels = { # Use a distinct app label for job pods (see comment above) so we
"app": f"{self.app_name}-job", # still build via _stack_labels then override.
**( pod_labels = self._stack_labels({"app": f"{self.app_name}-job"})
{"app.kubernetes.io/stack": self.stack_name}
if self.stack_name
else {}
),
}
template = client.V1PodTemplateSpec( template = client.V1PodTemplateSpec(
metadata=client.V1ObjectMeta(labels=pod_labels), metadata=client.V1ObjectMeta(labels=pod_labels),
spec=client.V1PodSpec( spec=client.V1PodSpec(
@ -1076,14 +1068,7 @@ class ClusterInfo:
template=template, template=template,
backoff_limit=0, backoff_limit=0,
) )
job_labels = { job_labels = self._stack_labels()
"app": self.app_name,
**(
{"app.kubernetes.io/stack": self.stack_name}
if self.stack_name
else {}
),
}
job = client.V1Job( job = client.V1Job(
api_version="batch/v1", api_version="batch/v1",
kind="Job", kind="Job",
@ -1121,7 +1106,7 @@ class ClusterInfo:
svc = client.V1Service( svc = client.V1Service(
metadata=client.V1ObjectMeta( metadata=client.V1ObjectMeta(
name=name, name=name,
labels={"app": self.app_name}, labels=self._stack_labels(),
), ),
spec=client.V1ServiceSpec( spec=client.V1ServiceSpec(
type="ExternalName", type="ExternalName",
@ -1138,7 +1123,7 @@ class ClusterInfo:
svc = client.V1Service( svc = client.V1Service(
metadata=client.V1ObjectMeta( metadata=client.V1ObjectMeta(
name=name, name=name,
labels={"app": self.app_name}, labels=self._stack_labels(),
), ),
spec=client.V1ServiceSpec( spec=client.V1ServiceSpec(
cluster_ip="None", cluster_ip="None",
@ -1156,7 +1141,7 @@ class ClusterInfo:
svc = client.V1Service( svc = client.V1Service(
metadata=client.V1ObjectMeta( metadata=client.V1ObjectMeta(
name=name, name=name,
labels={"app": self.app_name}, labels=self._stack_labels(),
), ),
spec=client.V1ServiceSpec( spec=client.V1ServiceSpec(
cluster_ip="None", cluster_ip="None",
@ -1199,7 +1184,7 @@ class ClusterInfo:
secret = client.V1Secret( secret = client.V1Secret(
metadata=client.V1ObjectMeta( metadata=client.V1ObjectMeta(
name=secret_name, name=secret_name,
labels={"app": self.app_name}, labels=self._stack_labels(),
), ),
data=secret_data, data=secret_data,
) )
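Illustrative values only (app name and stack name assumed), showing the three label shapes the _stack_labels() helper above produces:

base = {"app": "laconic-abc123", "app.kubernetes.io/stack": "test"}   # every resource
pod = dict(base, **{"app.kubernetes.io/component": "db"})              # multi-pod Deployment metadata
job = dict(base, app="laconic-abc123-job")                             # Job pods: "app" overridden so pods_in_deployment() skips them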

View File

@ -189,7 +189,7 @@ class K8sDeployer(Deployer):
ns = client.V1Namespace(
metadata=client.V1ObjectMeta(
name=self.k8s_namespace,
-labels={"app": self.cluster_info.app_name},
labels=self.cluster_info._stack_labels(),
)
)
self.core_api.create_namespace(body=ns)
@ -475,7 +475,7 @@ class K8sDeployer(Deployer):
endpoints = client.V1Endpoints(
metadata=client.V1ObjectMeta(
name=name,
-labels={"app": self.cluster_info.app_name},
labels=self.cluster_info._stack_labels(),
),
subsets=[
client.V1EndpointSubset(
@ -535,7 +535,7 @@ class K8sDeployer(Deployer):
endpoints = client.V1Endpoints(
metadata=client.V1ObjectMeta(
name=name,
-labels={"app": self.cluster_info.app_name},
labels=self.cluster_info._stack_labels(),
),
subsets=[
client.V1EndpointSubset(
@ -709,6 +709,8 @@ class K8sDeployer(Deployer):
if opts.o.debug:
print(f"Sending this job: {job}")
if not opts.o.dry_run:
job_name = job.metadata.name
try:
job_resp = self.batch_api.create_namespaced_job(
body=job, namespace=self.k8s_namespace
)
@ -719,6 +721,15 @@ class K8sDeployer(Deployer):
f" {job_resp.metadata.namespace} " f" {job_resp.metadata.namespace} "
f"{job_resp.metadata.name}" f"{job_resp.metadata.name}"
) )
except ApiException as e:
if e.status == 409:
# Job already exists from a prior run. Jobs are one-
# shot — don't recreate on restart. Delete the Job
# explicitly to re-run (stop --delete-volumes also
# clears them via label-based cleanup).
print(f"Job {job_name} already exists, skipping")
else:
raise
def _find_certificate_for_host_name(self, host_name): def _find_certificate_for_host_name(self, host_name):
all_certificates = self.custom_obj_api.list_namespaced_custom_object( all_certificates = self.custom_obj_api.list_namespaced_custom_object(
@ -901,16 +912,174 @@ class K8sDeployer(Deployer):
call_stack_deploy_start(self.deployment_context)
-def down(self, timeout, volumes, skip_cluster_management):
def down(
self, timeout, volumes, skip_cluster_management, delete_namespace=False
):
"""Tear down stack-labeled resources. Phases:
1. Delete namespaced resources (if namespace still exists).
2. Delete cluster-scoped PVs (if --delete-volumes, regardless of (1)).
3. Wait for everything we triggered to actually be gone.
4. Optionally delete the namespace itself (--delete-namespace).
5. Optionally destroy the kind cluster (--perform-cluster-management).
Steps 1-3 scope cleanup to a single stack via app.kubernetes.io/stack,
so multiple stacks sharing a namespace tear down independently.
"""
self.skip_cluster_management = skip_cluster_management
self.connect_api()
-app_label = f"app={self.cluster_info.app_name}"
selector = self._stack_label_selector()
ns = self.k8s_namespace
ns_exists = self._namespace_exists(ns)
-# PersistentVolumes are cluster-scoped (not namespaced), so delete by label
if ns_exists:
self._delete_namespaced_labeled_resources(ns, selector, volumes)
if volumes:
self._delete_labeled_pvs(selector)
self._wait_for_labeled_gone(
ns, selector, delete_volumes=volumes, namespace_present=ns_exists
)
if delete_namespace and ns_exists:
self._delete_namespace()
self._wait_for_namespace_gone()
if self.is_kind() and not self.skip_cluster_management:
destroy_cluster(self.kind_cluster_name)
def _stack_label_selector(self) -> str:
"""Selector used for stack-scoped cleanup.
Prefer app.kubernetes.io/stack (per-stack) and fall back to the
legacy app= label (cluster-id scoped) for deployments that predate
the stack label.
"""
stack_name = self.cluster_info.stack_name
if stack_name:
return f"app.kubernetes.io/stack={stack_name}"
return f"app={self.cluster_info.app_name}"
def _namespace_exists(self, namespace: str) -> bool:
try:
-pvs = self.core_api.list_persistent_volume(label_selector=app_label)
self.core_api.read_namespace(name=namespace)
return True
except ApiException as e:
if e.status == 404:
if opts.o.debug:
print(f"Namespace {namespace} not found")
return False
raise
def _delete_namespaced_labeled_resources(
self, namespace: str, selector: str, delete_volumes: bool
):
"""Delete Ingresses, Deployments, Jobs, Services, ConfigMaps,
Secrets, Endpoints, Pods, and (if delete_volumes) PVCs in the
namespace. Order matters: Ingresses first so external traffic
stops, then workloads, then support objects, then Pods, then PVCs.
"""
if opts.o.dry_run:
print(
f"Dry run: would delete namespaced resources in {namespace} "
f"matching {selector}"
)
return
def swallow_404(fn):
try:
fn()
except ApiException as e:
if e.status not in (404, 405):
raise
# Ingresses first so external traffic stops before pods disappear.
swallow_404(
lambda: self.networking_api.delete_collection_namespaced_ingress(
namespace=namespace, label_selector=selector
)
)
# Deployments (owns ReplicaSets + Pods via GC).
swallow_404(
lambda: self.apps_api.delete_collection_namespaced_deployment(
namespace=namespace, label_selector=selector
)
)
# Jobs — propagation=Background cascades to child pods.
swallow_404(
lambda: self.batch_api.delete_collection_namespaced_job(
namespace=namespace,
label_selector=selector,
propagation_policy="Background",
)
)
# Services have no delete_collection on core_api; list + delete.
self._list_delete_namespaced(
namespace,
selector,
list_fn=self.core_api.list_namespaced_service,
delete_fn=self.core_api.delete_namespaced_service,
)
# ConfigMaps, Secrets.
swallow_404(
lambda: self.core_api.delete_collection_namespaced_config_map(
namespace=namespace, label_selector=selector
)
)
swallow_404(
lambda: self.core_api.delete_collection_namespaced_secret(
namespace=namespace, label_selector=selector
)
)
# Endpoints usually GC with Services, but we create a few directly
# (external-services) that aren't owned by a Service — clean those.
self._list_delete_namespaced(
namespace,
selector,
list_fn=self.core_api.list_namespaced_endpoints,
delete_fn=self.core_api.delete_namespaced_endpoints,
)
# Stray pods (owned pods are GC'd with their Deployment/Job).
swallow_404(
lambda: self.core_api.delete_collection_namespaced_pod(
namespace=namespace, label_selector=selector
)
)
if delete_volumes:
swallow_404(
lambda: self.core_api.delete_collection_namespaced_persistent_volume_claim( # noqa: E501
namespace=namespace, label_selector=selector
)
)
def _list_delete_namespaced(self, namespace, selector, list_fn, delete_fn):
"""List by selector and delete each item. Use for resources where
the k8s python client lacks delete_collection (Services, Endpoints).
"""
try:
items = list_fn(namespace=namespace, label_selector=selector).items
except ApiException as e:
if e.status == 404:
return
raise
for item in items:
try:
delete_fn(name=item.metadata.name, namespace=namespace)
except ApiException as e:
if e.status not in (404, 405):
raise
def _delete_labeled_pvs(self, selector: str):
"""Delete cluster-scoped PVs matching the stack label."""
if opts.o.dry_run:
print(f"Dry run: would delete PVs matching {selector}")
return
try:
pvs = self.core_api.list_persistent_volume(label_selector=selector)
except ApiException as e:
if opts.o.debug:
print(f"Error listing PVs: {e}")
return
for pv in pvs.items:
if opts.o.debug:
print(f"Deleting PV: {pv.metadata.name}")
@ -918,22 +1087,85 @@ class K8sDeployer(Deployer):
self.core_api.delete_persistent_volume(name=pv.metadata.name)
except ApiException as e:
_check_delete_exception(e)
def _wait_for_labeled_gone(
self,
namespace: str,
selector: str,
delete_volumes: bool,
namespace_present: bool,
timeout_seconds: int = 120,
):
"""Poll until every kind we triggered a delete for is gone.
delete_collection/delete are async finalizers (PV bound-by-PVC,
PVC bound-by-VolumeAttachment, pod graceful shutdown) propagate
after the API call returns. Blocking here makes down() a
synchronous contract for callers (tests, ansible, cryovial).
"""
import time
listers = []
if namespace_present:
listers += [
("deployment", lambda: self.apps_api.list_namespaced_deployment(
namespace=namespace, label_selector=selector)),
("ingress", lambda: self.networking_api.list_namespaced_ingress(
namespace=namespace, label_selector=selector)),
("job", lambda: self.batch_api.list_namespaced_job(
namespace=namespace, label_selector=selector)),
("service", lambda: self.core_api.list_namespaced_service(
namespace=namespace, label_selector=selector)),
("configmap", lambda: self.core_api.list_namespaced_config_map(
namespace=namespace, label_selector=selector)),
("secret", lambda: self.core_api.list_namespaced_secret(
namespace=namespace, label_selector=selector)),
("pod", lambda: self.core_api.list_namespaced_pod(
namespace=namespace, label_selector=selector)),
]
if delete_volumes:
listers.append(
("persistentvolumeclaim",
lambda: self.core_api.list_namespaced_persistent_volume_claim(
namespace=namespace, label_selector=selector))
)
# PVs are cluster-scoped — wait for them even when the namespace
# is already gone (orphaned from a prior --delete-namespace).
if delete_volumes:
listers.append(
("persistentvolume",
lambda: self.core_api.list_persistent_volume(
label_selector=selector))
)
def remaining():
out = []
for kind, lister in listers:
try:
items = lister().items
except ApiException as e:
if e.status == 404:
continue
raise
if items:
out.append((kind, len(items)))
return out
deadline = time.monotonic() + timeout_seconds
while time.monotonic() < deadline:
left = remaining()
if not left:
return
if opts.o.debug:
-print(f"Error listing PVs: {e}")
print(f"Waiting for deletions: {left}")
time.sleep(2)
-# Delete the namespace to ensure clean slate.
-# Resources created by older laconic-so versions lack labels, so
-# label-based deletion can't find them. Namespace deletion is the
-# only reliable cleanup.
-self._delete_namespace()
-# Wait for namespace to finish terminating before returning,
-# so that up() can recreate it immediately.
-self._wait_for_namespace_gone()
-if self.is_kind() and not self.skip_cluster_management:
-# Destroy the kind cluster
-destroy_cluster(self.kind_cluster_name)
left = remaining()
if left:
print(
f"Warning: resources still present after {timeout_seconds}s: "
f"{left}"
)
def status(self):
self.connect_api()
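Illustrative caller-side check (client setup, namespace, and stack name assumed, not part of this diff): because down() blocks until the labeled kinds report empty, a test or automation step can assert the post-stop state immediately after it returns:

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
selector = "app.kubernetes.io/stack=test"
# No polling needed after stop: labeled pods are already gone...
assert not core.list_namespaced_pod(namespace="my-ns", label_selector=selector).items
# ...and the namespace survives unless --delete-namespace was passed
assert core.read_namespace(name="my-ns").status.phase == "Active"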

View File

@ -23,7 +23,7 @@ wait_for_pods_started () {
done
# Timed out, error exit
echo "waiting for pods to start: FAILED"
-delete_cluster_exit
cleanup_and_exit
}
wait_for_log_output () { wait_for_log_output () {
@ -42,15 +42,42 @@ wait_for_log_output () {
done
# Timed out, error exit
echo "waiting for pods log content: FAILED"
-delete_cluster_exit
cleanup_and_exit
}
-delete_cluster_exit () {
cleanup_and_exit () {
-$TEST_TARGET_SO deployment --dir $test_deployment_dir stop --delete-volumes
# Full teardown so CI runners don't leak namespaces/PVs between runs.
$TEST_TARGET_SO deployment --dir $test_deployment_dir \
stop --delete-volumes --delete-namespace --skip-cluster-management || true
exit 1
}
assert_ns_phase () {
local expected=$1
local phase
phase=$(kubectl get namespace ${deployment_ns} -o jsonpath='{.status.phase}' 2>/dev/null || echo "Missing")
if [ "$phase" != "$expected" ]; then
echo "namespace phase test: FAILED (expected ${expected}, got ${phase})"
cleanup_and_exit
fi
}
# Count labeled resources in the deployment namespace. down() is
# synchronous on its own cleanup (waits for PVCs/pods to terminate
# before returning) so callers can assert immediately.
# Usage: assert_no_labeled_resources <kind>
assert_no_labeled_resources () {
local kind=$1
local count
count=$(kubectl get ${kind} -n ${deployment_ns} \
-l app.kubernetes.io/stack=test --no-headers 2>/dev/null | wc -l)
if [ "$count" -ne 0 ]; then
echo "labeled cleanup test: FAILED (${kind} still present: ${count})"
cleanup_and_exit
fi
}
# Note: eventually this test should be folded into ../deploy/ # Note: eventually this test should be folded into ../deploy/
# but keeping it separate for now for convenience # but keeping it separate for now for convenience
TEST_TARGET_SO=$( ls -t1 ./package/laconic-so* | head -1 ) TEST_TARGET_SO=$( ls -t1 ./package/laconic-so* | head -1 )
@ -130,7 +157,7 @@ if [[ "$log_output_3" == *"filesystem is fresh"* ]]; then
else else
echo "deployment logs test: FAILED" echo "deployment logs test: FAILED"
echo "$log_output_3" echo "$log_output_3"
delete_cluster_exit cleanup_and_exit
fi fi
# Check the config variable CERC_TEST_PARAM_1 was passed correctly # Check the config variable CERC_TEST_PARAM_1 was passed correctly
@ -138,7 +165,7 @@ if [[ "$log_output_3" == *"Test-param-1: PASSED"* ]]; then
echo "deployment config test: passed" echo "deployment config test: passed"
else else
echo "deployment config test: FAILED" echo "deployment config test: FAILED"
delete_cluster_exit cleanup_and_exit
fi fi
# Check the config variable CERC_TEST_PARAM_2 was passed correctly from the compose file # Check the config variable CERC_TEST_PARAM_2 was passed correctly from the compose file
@ -155,7 +182,7 @@ if [[ "$log_output_4" == *"/config/test_config:"* ]] && [[ "$log_output_4" == *"
echo "deployment ConfigMap test: passed" echo "deployment ConfigMap test: passed"
else else
echo "deployment ConfigMap test: FAILED" echo "deployment ConfigMap test: FAILED"
delete_cluster_exit cleanup_and_exit
fi fi
# Check that the bind-mount volume is mounted. # Check that the bind-mount volume is mounted.
@ -165,7 +192,7 @@ if [[ "$log_output_5" == *"/data: MOUNTED"* ]]; then
else else
echo "deployment bind volumes test: FAILED" echo "deployment bind volumes test: FAILED"
echo "$log_output_5" echo "$log_output_5"
delete_cluster_exit cleanup_and_exit
fi fi
# Check that the provisioner managed volume is mounted. # Check that the provisioner managed volume is mounted.
@ -175,7 +202,7 @@ if [[ "$log_output_6" == *"/data2: MOUNTED"* ]]; then
else else
echo "deployment provisioner volumes test: FAILED" echo "deployment provisioner volumes test: FAILED"
echo "$log_output_6" echo "$log_output_6"
delete_cluster_exit cleanup_and_exit
fi fi
# --- New feature tests: namespace, labels, jobs, secrets --- # --- New feature tests: namespace, labels, jobs, secrets ---
@ -187,7 +214,7 @@ if [ "$ns_pod_count" -gt 0 ]; then
else else
echo "namespace isolation test: FAILED" echo "namespace isolation test: FAILED"
echo "Expected pod in namespace ${deployment_ns}" echo "Expected pod in namespace ${deployment_ns}"
delete_cluster_exit cleanup_and_exit
fi fi
# Check that the stack label is set on the pod # Check that the stack label is set on the pod
@ -196,7 +223,7 @@ if [ "$stack_label_count" -gt 0 ]; then
echo "stack label test: passed" echo "stack label test: passed"
else else
echo "stack label test: FAILED" echo "stack label test: FAILED"
delete_cluster_exit cleanup_and_exit
fi fi
# Check that the job completed successfully # Check that the job completed successfully
@ -212,7 +239,7 @@ if [ "$job_status" == "1" ]; then
else else
echo "job completion test: FAILED" echo "job completion test: FAILED"
echo "Job status.succeeded: ${job_status}" echo "Job status.succeeded: ${job_status}"
delete_cluster_exit cleanup_and_exit
fi fi
# Check that the secrets spec results in an envFrom secretRef on the pod # Check that the secrets spec results in an envFrom secretRef on the pod
@ -223,25 +250,24 @@ if [ "$secret_ref" == "test-secret" ]; then
else else
echo "secrets envFrom test: FAILED" echo "secrets envFrom test: FAILED"
echo "Expected secretRef 'test-secret', got: ${secret_ref}" echo "Expected secretRef 'test-secret', got: ${secret_ref}"
delete_cluster_exit cleanup_and_exit
fi fi
-# Stop then start again and check the volume was preserved.
-# Use --skip-cluster-management to reuse the existing kind cluster instead of
-# destroying and recreating it (which fails on CI runners due to stale etcd/certs
-# and cgroup detection issues).
-# Use --delete-volumes to clear PVs so fresh PVCs can bind on restart.
-# Bind-mount data survives on the host filesystem; provisioner volumes are recreated fresh.
# Stop with --delete-volumes (but not --delete-namespace) and verify:
# - namespace stays Active (no termination race on restart)
# - stack-labeled workloads are gone
# - bind-mount data on the host survives; provisioner volumes are recreated
$TEST_TARGET_SO deployment --dir $test_deployment_dir stop --delete-volumes --skip-cluster-management
-# Wait for the namespace to be fully terminated before restarting.
-# Without this, 'start' fails with 403 Forbidden because the namespace
-# is still in Terminating state.
-for i in {1..60}; do
-if ! kubectl get namespace ${deployment_ns} 2>/dev/null | grep -q .; then
-break
-fi
-sleep 2
-done
assert_ns_phase "Active"
echo "stop preserves namespace test: passed"
for kind in deployment job ingress service configmap secret pvc pod; do
assert_no_labeled_resources "$kind"
done
echo "stop cleans labeled resources test: passed"
# Restart — no wait needed, the namespace is still Active.
$TEST_TARGET_SO deployment --dir $test_deployment_dir start --skip-cluster-management
wait_for_pods_started wait_for_pods_started
wait_for_log_output wait_for_log_output
@ -252,7 +278,7 @@ if [[ "$log_output_10" == *"/data filesystem is old"* ]]; then
echo "Retain bind volumes test: passed" echo "Retain bind volumes test: passed"
else else
echo "Retain bind volumes test: FAILED" echo "Retain bind volumes test: FAILED"
delete_cluster_exit cleanup_and_exit
fi fi
# Provisioner volumes are destroyed when PVs are deleted (--delete-volumes on stop). # Provisioner volumes are destroyed when PVs are deleted (--delete-volumes on stop).
@ -263,9 +289,17 @@ if [[ "$log_output_11" == *"/data2 filesystem is fresh"* ]]; then
echo "Fresh provisioner volumes test: passed" echo "Fresh provisioner volumes test: passed"
else else
echo "Fresh provisioner volumes test: FAILED" echo "Fresh provisioner volumes test: FAILED"
delete_cluster_exit cleanup_and_exit
fi fi
# Stop and clean up # Full teardown: --delete-namespace nukes the namespace after labeled cleanup.
$TEST_TARGET_SO deployment --dir $test_deployment_dir stop --delete-volumes # Verify the namespace is actually gone.
$TEST_TARGET_SO deployment --dir $test_deployment_dir \
stop --delete-volumes --delete-namespace --skip-cluster-management
if kubectl get namespace ${deployment_ns} >/dev/null 2>&1; then
echo "delete-namespace test: FAILED (namespace still present)"
exit 1
fi
echo "delete-namespace test: passed"
echo "Test passed" echo "Test passed"