Troubleshooting
Check node status
Check the node status with:
[root@speech-platform ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
speech-platform.localdomain Ready control-plane,master 9s v1.27.6+k3s1
If the node is not in the Ready state, something is usually wrong.
Note: The node list can be empty (No resources found) or the node can be in the
NotReady state while the virtual appliance is starting up. This is normal and
should resolve within a few moments.
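The readiness check can be scripted. A minimal sketch that parses a `kubectl get nodes` listing and reports any node that is not Ready; a captured listing is used here so the snippet is self-contained, but in practice you would pipe the live command output in instead:

```shell
# Report any node whose STATUS column is not "Ready".
# Stand-in for: kubectl get nodes | awk '...'
nodes='NAME STATUS ROLES AGE VERSION
speech-platform.localdomain Ready control-plane,master 9s v1.27.6+k3s1'

echo "$nodes" | awk 'NR > 1 && $2 != "Ready" { print $1 " is " $2; bad = 1 }
                     END { if (!bad) print "all nodes Ready" }'
```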
The node also has to have enough free disk and memory capacity. When it does not, pressure events are emitted. Run the following command to see the node conditions:
[root@speech-platform disks]# kubectl describe node | grep -A 6 Conditions:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Mon, 29 Apr 2024 08:13:54 +0000 Mon, 29 Apr 2024 07:46:39 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 29 Apr 2024 08:13:54 +0000 Mon, 29 Apr 2024 08:06:45 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 29 Apr 2024 08:13:54 +0000 Mon, 29 Apr 2024 07:46:39 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 29 Apr 2024 08:13:54 +0000 Mon, 29 Apr 2024 07:46:39 +0000 KubeletReady kubelet is posting ready status
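To check the conditions at a glance, you can filter the table for any pressure condition that is True. A sketch over a captured sample (this one deliberately contains a pressure condition to illustrate a match); in practice, pipe the `kubectl describe node` output into the filter instead:

```shell
# Print any *Pressure condition whose Status column is "True".
conditions='MemoryPressure False
DiskPressure True
PIDPressure False
Ready True'

echo "$conditions" | awk '$1 ~ /Pressure$/ && $2 == "True" { print $1 " detected" }'
```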
Disk pressure
A DiskPressure node event is emitted when Kubernetes is running out of disk
capacity in the /var filesystem. The node conditions look like this:
[root@speech-platform disks]# kubectl describe node | grep -A 6 Conditions:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Mon, 29 Apr 2024 08:13:54 +0000 Mon, 29 Apr 2024 07:46:39 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure True Mon, 29 Apr 2024 08:13:54 +0000 Mon, 29 Apr 2024 08:06:45 +0000 KubeletHasDiskPressure kubelet has disk pressure
PIDPressure False Mon, 29 Apr 2024 08:13:54 +0000 Mon, 29 Apr 2024 07:46:39 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 29 Apr 2024 08:13:54 +0000 Mon, 29 Apr 2024 07:46:39 +0000 KubeletReady kubelet is posting ready status
Follow the procedure for extending the disks.
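A quick way to see whether /var is running low on space is to compare filesystem usage against a threshold. The sketch below runs on a captured df-style sample; on the appliance you would feed it live `df` output, and the 80% threshold here is illustrative, not the kubelet's exact eviction setting:

```shell
# Flag filesystems above 80% usage from df-style columns.
df_sample='Filesystem Use% Mounted
/dev/sda2 91% /var
/dev/sda1 40% /boot'

echo "$df_sample" | awk 'NR > 1 { p = $2; sub("%", "", p); if (p + 0 > 80) print $3 " is at " $2 }'
```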
Memory pressure
A MemoryPressure node event is emitted when Kubernetes is running out of free memory. The node conditions look like this:
[root@speech-platform disks]# kubectl describe node | grep -A 6 Conditions:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure True Mon, 29 Apr 2024 08:50:50 +0000 Mon, 29 Apr 2024 08:50:50 +0000 KubeletHasInsufficientMemory kubelet has insufficient memory available
DiskPressure False Mon, 29 Apr 2024 08:50:50 +0000 Mon, 29 Apr 2024 08:33:08 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 29 Apr 2024 08:50:50 +0000 Mon, 29 Apr 2024 08:33:08 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 29 Apr 2024 08:50:50 +0000 Mon, 29 Apr 2024 08:33:08 +0000 KubeletReady kubelet is posting ready status
You need to grant more memory to the virtual appliance.
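Before resizing, it can help to quantify how tight memory actually is. A sketch that computes the in-use percentage from `free -m` style output; the captured sample below stands in for running `free -m` on the appliance:

```shell
# Estimate memory usage (used/total) from free(1)-style columns.
free_sample='              total        used        free
Mem:           7963        7410         553'

echo "$free_sample" | awk '/^Mem:/ { printf "%.0f%% of memory in use\n", 100 * $3 / $2 }'
```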
View pod logs
Logs are stored in /data/log/pods/ or in /data/logs/containers. You can view
them via filebrowser if needed.
Alternatively, you can display logs with the kubectl command:
[root@speech-platform ~]# kubectl -n speech-platform logs -f voiceprint-extraction-7867578b97-w7bzd
[2024-04-29 08:59:10.250] [Configuration] [info] model: /models/xl-5.0.0.model
[2024-04-29 08:59:10.250] [Configuration] [info] port: 8080
[2024-04-29 08:59:10.250] [Configuration] [info] device: cpu
[2024-04-29 08:59:10.250] [critical] base64_decode: invalid character ''<''
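When a pod produces many log lines, filtering by severity helps surface the failure quickly. A minimal sketch over a captured excerpt; the bracketed severity tags follow the format shown above, and in practice you would pipe `kubectl logs` output into the grep:

```shell
# Keep only warning/error/critical lines from the pod log.
log='[2024-04-29 08:59:10.250] [Configuration] [info] model: /models/xl-5.0.0.model
[2024-04-29 08:59:10.250] [Configuration] [info] port: 8080
[2024-04-29 08:59:10.250] [critical] base64_decode: invalid character'

echo "$log" | grep -E '\[(warning|error|critical)\]'
```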
Changes in configuration are not applied
When to use this: Use this when you have made changes to
/data/speech-platform/speech-platform-values.yaml
but they do not seem to take
effect (e.g., new settings aren't reflected in the application, services don’t
start properly, etc.).
Why this happens: The Helm controller automatically watches for changes in the config file. If the YAML configuration file is invalid, the update job fails and the system continues running the old config or fails to deploy completely.
How to troubleshoot: If the configuration is incorrect, the update job will not complete successfully, and the underlying pod will either restart or be in an error state. The pod status will reflect this issue.
1. Check the Helm install job status:
kubectl get pods -n kube-system | grep -i helm-install
[root@speech-platform disks]# kubectl get pods -n kube-system | grep -i helm-install
helm-install-filebrowser-2b7pn 0/1 Completed 0 51m
helm-install-ingress-nginx-m87d4 0/1 Completed 0 51m
helm-install-nginx-nrcvk 0/1 Completed 0 51m
helm-install-dcgm-exporter-fjqzz 0/1 Completed 0 51m
helm-install-kube-prometheus-stack-jn5bz 0/1 Completed 0 51m
helm-install-keda-vsn95 0/1 Completed 0 51m
helm-install-speech-platform-9l9vj 0/1 Error 4 (46s ago) 6m15s
2. Inspect the logs of the failing job:
kubectl logs -f <failing-job-name> -n kube-system
[root@speech-platform disks]# kubectl logs -f helm-install-speech-platform-9l9vj -n kube-system
...
...
...
Upgrading speech-platform
+ helm_v3 upgrade --namespace speech-platform speech-platform https://10.43.0.1:443/static/phonexia-charts/speech-platform-0.0.0-36638f5-helm.tgz --values /config/values-10_HelmChartConfig.yaml
Error: failed to parse /config/values-10_HelmChartConfig.yaml: error converting YAML to JSON: yaml: line 494: could not find expected ':'
3. Next step: Validate the YAML (see next section).
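The failing job can also be picked out mechanically from the pod listing. A sketch over a captured sample; in practice, pipe `kubectl get pods -n kube-system --no-headers` into the filter:

```shell
# Print helm-install pods whose status is not Completed.
pods='helm-install-filebrowser-2b7pn 0/1 Completed 0 51m
helm-install-speech-platform-9l9vj 0/1 Error 4 6m15s'

echo "$pods" | awk '/^helm-install/ && $3 != "Completed" { print $1 ": " $3 }'
```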
Check configuration file validity
This section describes how to check if your configuration is valid and how to identify which line in the configuration is incorrect.
When to use this: Whenever changes are made to speech-platform-values.yaml, or if a Helm update job fails due to YAML syntax issues.
Why this matters: Helm requires a valid YAML configuration file to parse and apply configuration. A missing colon, incorrect indentation, or misplaced value can break the deployment.
1. How to validate the config:
Run:
yq .spec.valuesContent /data/speech-platform/speech-platform-values.yaml | yq .
If the configuration file is valid, the content of the file will be printed. Otherwise, the line number with an error will be printed out as follows:
[root@speech-platform ~]# yq .spec.valuesContent /data/speech-platform/speech-platform-values.yaml | yq .
Error: bad file '-': yaml: line 253: could not find expected ':'
The actual configuration is nested under spec.valuesContent, usually starting on line 7. If you see an error on line 253, add 7 (253 + 7 = 260) to get the actual line in the file.
2. View the lines around the error:
Run:
cat -n /data/speech-platform/speech-platform-values.yaml | grep 260 -B 10 -A 10
[root@speech-platform ~]# cat -n /data/speech-platform/speech-platform-values.yaml | grep 260 -B 10 -A 10
250
251 model:
252 volume:
253 hostPath:
254 path: /data/models/enhanced_speech_to_text_built_on_whisper
255
256 # Name of a model file inside the volume, for example "large_v2-1.0.0.model"
257 file: "large_v2-1.0.1.model"
258 license:
259 value:
260 "eyJ2ZX...=="
261
262 # Uncomment this to grant access to GPU on whisper pod
263 resources:
264 limits:
265 nvidia.com/gpu: "1"
266
267 # Uncomment this to run whisper on GPU
268 runtimeClassName: "nvidia"
269
270 service:
3. Fix the error
Example: This is invalid:
value:
"eyJ2ZX...=="
Correct form:
value: "eyJ2ZX...=="
Line 260 contains only the license key. The error message
could not find expected ':'
is accurate, because there is no : on this line. One line above (259) there is
a key named value, which should contain the license. However, the license
itself is on line 260, which makes the file invalid YAML. To fix it, simply
merge lines 259 and 260. The resulting file should look like this:
[root@speech-platform ~]# cat -n /data/speech-platform/speech-platform-values.yaml | grep 260 -B 10 -A 10
250
251 model:
252 volume:
253 hostPath:
254 path: /data/models/enhanced_speech_to_text_built_on_whisper
255
256 # Name of a model file inside the volume, for example "large_v2-1.0.0.model"
257 file: "large_v2-1.0.1.model"
258 license:
259 value: "eyJ2ZX...=="
260
261 # Uncomment this to grant access to GPU on whisper pod
262 resources:
263 limits:
264 nvidia.com/gpu: "1"
265
266 # Uncomment this to run whisper on GPU
267 runtimeClassName: "nvidia"
268
269 service:
270 clusterIP: "None"
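The line-number arithmetic and context printing above can be scripted. A sketch using a generated sample file, since the real values file is not available here; note that `sed -n 'X,Yp'` prints an exact line range, avoiding the false matches that `grep 260` can produce on any line merely containing "260":

```shell
# Map a yq-reported line inside spec.valuesContent to the file line
# (offset 7, as described above), then print the surrounding context.
yq_line=253
file_line=$(( yq_line + 7 ))           # 260
seq 1 300 > /tmp/values-sample.txt     # stand-in for the real values file
sed -n "$(( file_line - 10 )),$(( file_line + 10 ))p" /tmp/values-sample.txt
```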
Disable DNS resolving for specific domains
When to use this: Use this when you see long response times, timeout errors, or task processing delays due to DNS lookup issues, particularly when using DHCP or custom DNS setups.
Why this happens: This happens when DHCP is used for IP address assignment of
the virtual appliance, which usually configures a nameserver and search domains
in /etc/resolv.conf:
nameserver 192.168.137.1
search localdomain
First, check the CoreDNS logs:
kubectl -n kube-system logs -l k8s-app=kube-dns
The following lines in the logs indicate this issue:
2024-06-05T11:00:49.55751974Z stdout F [ERROR] plugin/errors: 2 speech-platform-envoy.localdomain. AAAA: read udp 10.42.0.27:60352->192.168.137.1:53: i/o timeout
2024-06-05T11:00:51.546562499Z stdout F [ERROR] plugin/errors: 2 speech-platform-envoy.localdomain. AAAA: read udp 10.42.0.27:40254->192.168.137.1:53: i/o timeout
2024-06-05T11:00:51.548101103Z stdout F [ERROR] plugin/errors: 2 speech-platform-envoy.localdomain. AAAA: read udp 10.42.0.27:47838->192.168.137.1:53: i/o timeout
2024-06-05T11:00:51.558720939Z stdout F [ERROR] plugin/errors: 2 speech-platform-envoy.localdomain. AAAA: read udp 10.42.0.27:39526->192.168.137.1:53: i/o timeout
2024-06-05T11:00:53.547326187Z stdout F [ERROR] plugin/errors: 2 speech-platform-envoy.localdomain. AAAA: read udp 10.42.0.27:58487->192.168.137.1:53: i/o timeout
2024-06-05T11:00:53.548836432Z stdout F [ERROR] plugin/errors: 2 speech-platform-envoy.localdomain. AAAA: read udp 10.42.0.27:46303->192.168.137.1:53: i/o timeout
Communication within the virtual appliance does not use FQDNs, which means that
each DNS name is resolved against all search domains. Internal Kubernetes
domains (<namespace>.svc.cluster.local, svc.cluster.local and cluster.local)
are resolved immediately by CoreDNS; non-Kubernetes domains are resolved by the
nameserver provided by DHCP. If access to that nameserver is blocked (for
example, by a firewall), resolving a single name can take up to 10 seconds,
which can significantly increase task processing duration.
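To see why a single short name can trigger many upstream lookups, consider how the resolver expands it: each search domain is appended in turn before the bare name is tried. A simple illustration; the search list used here is an assumption combining the /etc/resolv.conf above with the standard Kubernetes in-pod search path:

```shell
# Candidate FQDNs the resolver tries for one short name.
# The search list below is an illustrative assumption.
name=speech-platform-envoy
for domain in speech-platform.svc.cluster.local svc.cluster.local cluster.local localdomain; do
  echo "$name.$domain"
done
```

The last candidate, speech-platform-envoy.localdomain, is exactly the query seen timing out in the CoreDNS logs above.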
How to resolve: To avoid this issue, you can either allow communication from virtual appliance to DHCP-configured DNS server or configure kubernetes resolver to skip lookup for DHCP-provided domain(s):
1. Create a DNS override file:
[Virtual appliance] Manually create the file /data/speech-platform/coredns-custom.yaml
with the following content. Replace <domain1.com> and <domain2.com> with the
domains you want to disable lookup for:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  custom.server: |
    <domain1.com>:53 {
        log
    }
    <domain2.com>:53 {
        log
    }
2. Restart CoreDNS to apply the change:
kubectl -n kube-system rollout restart deploy/coredns
3. Verify that the CoreDNS pod is healthy and running:
kubectl -n kube-system get pods -l k8s-app=kube-dns
4. Restart all speech-platform pods:
kubectl -n speech-platform rollout restart deploy
kubectl -n speech-platform rollout restart sts