Troubleshooting
Check node status
Check the node status with:
[root@speech-platform ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
speech-platform.localdomain Ready control-plane,master 9s v1.27.6+k3s1
If the node is not in the Ready state, something is usually wrong.
Note: The node list can be empty (No resources found) or the node can be in the
NotReady state while the virtual appliance is starting up. This is normal and
should resolve within a few moments.
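The readiness check can be scripted. A minimal sketch that parses a `kubectl get nodes` listing and reports any node that is not Ready; a captured listing is used here so the snippet is self-contained, but in practice you would pipe the live command output in instead:

```shell
# Report any node whose STATUS column is not "Ready".
# Stand-in for: kubectl get nodes | awk '...'
nodes='NAME STATUS ROLES AGE VERSION
speech-platform.localdomain Ready control-plane,master 9s v1.27.6+k3s1'

echo "$nodes" | awk 'NR > 1 && $2 != "Ready" { print $1 " is " $2; bad = 1 }
                     END { if (!bad) print "all nodes Ready" }'
```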
The node also has to have enough free disk and memory capacity. When it does not, pressure events are emitted. Run the following command to see the node conditions:
[root@speech-platform disks]# kubectl describe node | grep -A 6 Conditions:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Mon, 29 Apr 2024 08:13:54 +0000 Mon, 29 Apr 2024 07:46:39 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 29 Apr 2024 08:13:54 +0000 Mon, 29 Apr 2024 08:06:45 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 29 Apr 2024 08:13:54 +0000 Mon, 29 Apr 2024 07:46:39 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 29 Apr 2024 08:13:54 +0000 Mon, 29 Apr 2024 07:46:39 +0000 KubeletReady kubelet is posting ready status
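To check the conditions at a glance, you can filter the table for any pressure condition that is True. A sketch over a captured sample (this one deliberately contains a pressure condition to illustrate a match); in practice, pipe the `kubectl describe node` output into the filter instead:

```shell
# Print any *Pressure condition whose Status column is "True".
conditions='MemoryPressure False
DiskPressure True
PIDPressure False
Ready True'

echo "$conditions" | awk '$1 ~ /Pressure$/ && $2 == "True" { print $1 " detected" }'
```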
Disk pressure
A DiskPressure node event is emitted when Kubernetes is running out of disk
capacity in the /var filesystem. The node conditions look like this:
[root@speech-platform disks]# kubectl describe node | grep -A 6 Conditions:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Mon, 29 Apr 2024 08:13:54 +0000 Mon, 29 Apr 2024 07:46:39 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure True Mon, 29 Apr 2024 08:13:54 +0000 Mon, 29 Apr 2024 08:06:45 +0000 KubeletHasDiskPressure kubelet has disk pressure
PIDPressure False Mon, 29 Apr 2024 08:13:54 +0000 Mon, 29 Apr 2024 07:46:39 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 29 Apr 2024 08:13:54 +0000 Mon, 29 Apr 2024 07:46:39 +0000 KubeletReady kubelet is posting ready status
Follow the procedure for extending the disks.
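A quick way to see whether /var is running low on space is to compare filesystem usage against a threshold. The sketch below runs on a captured df-style sample; on the appliance you would feed it live `df` output, and the 80% threshold here is illustrative, not the kubelet's exact eviction setting:

```shell
# Flag filesystems above 80% usage from df-style columns.
df_sample='Filesystem Use% Mounted
/dev/sda2 91% /var
/dev/sda1 40% /boot'

echo "$df_sample" | awk 'NR > 1 { p = $2; sub("%", "", p); if (p + 0 > 80) print $3 " is at " $2 }'
```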
Memory pressure
A MemoryPressure node event is emitted when Kubernetes is running out of free memory. The node conditions look like this:
[root@speech-platform disks]# kubectl describe node | grep -A 6 Conditions:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure True Mon, 29 Apr 2024 08:50:50 +0000 Mon, 29 Apr 2024 08:50:50 +0000 KubeletHasInsufficientMemory kubelet has insufficient memory available
DiskPressure False Mon, 29 Apr 2024 08:50:50 +0000 Mon, 29 Apr 2024 08:33:08 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 29 Apr 2024 08:50:50 +0000 Mon, 29 Apr 2024 08:33:08 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 29 Apr 2024 08:50:50 +0000 Mon, 29 Apr 2024 08:33:08 +0000 KubeletReady kubelet is posting ready status
You need to grant more memory to the virtual appliance.
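Before resizing, it can help to quantify how tight memory actually is. A sketch that computes the in-use percentage from `free -m` style output; the captured sample below stands in for running `free -m` on the appliance:

```shell
# Estimate memory usage (used/total) from free(1)-style columns.
free_sample='              total        used        free
Mem:           7963        7410         553'

echo "$free_sample" | awk '/^Mem:/ { printf "%.0f%% of memory in use\n", 100 * $3 / $2 }'
```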
View pod logs
Logs are stored in /data/log/pods/ or in /data/logs/containers. You can view
them via filebrowser if needed.
Alternatively, you can display logs with the kubectl command:
[root@speech-platform ~]# kubectl -n speech-platform logs -f voiceprint-extraction-7867578b97-w7bzd
[2024-04-29 08:59:10.250] [Configuration] [info] model: /models/xl-5.0.0.model
[2024-04-29 08:59:10.250] [Configuration] [info] port: 8080
[2024-04-29 08:59:10.250] [Configuration] [info] device: cpu
[2024-04-29 08:59:10.250] [critical] base64_decode: invalid character ''<''
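When a pod produces many log lines, filtering by severity helps surface the failure quickly. A minimal sketch over a captured excerpt; the bracketed severity tags follow the format shown above, and in practice you would pipe `kubectl logs` output into the grep:

```shell
# Keep only warning/error/critical lines from the pod log.
log='[2024-04-29 08:59:10.250] [Configuration] [info] model: /models/xl-5.0.0.model
[2024-04-29 08:59:10.250] [Configuration] [info] port: 8080
[2024-04-29 08:59:10.250] [critical] base64_decode: invalid character'

echo "$log" | grep -E '\[(warning|error|critical)\]'
```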
Changes in configuration are not applied
When to use this: Use this when you have made changes to
/data/speech-platform/speech-platform-values.yaml
but they do not seem to take
effect (e.g., new settings aren't reflected in the application, services don’t
start properly, etc.).
Why this happens: The Helm controller automatically watches for changes in the config file. If the YAML configuration file is invalid, the update job fails and the system continues running the old config or fails to deploy completely.
How to troubleshoot: If the configuration is incorrect, the update job will not complete successfully, and the underlying pod will either restart or be in an error state. The pod status will reflect this issue.
1. Check the Helm install job status:
kubectl get pods -n kube-system | grep -i helm-install
[root@speech-platform disks]# kubectl get pods -n kube-system | grep -i helm-install
helm-install-filebrowser-2b7pn 0/1 Completed 0 51m
helm-install-ingress-nginx-m87d4 0/1 Completed 0 51m
helm-install-nginx-nrcvk 0/1 Completed 0 51m
helm-install-dcgm-exporter-fjqzz 0/1 Completed 0 51m
helm-install-kube-prometheus-stack-jn5bz 0/1 Completed 0 51m
helm-install-keda-vsn95 0/1 Completed 0 51m
helm-install-speech-platform-9l9vj 0/1 Error 4 (46s ago) 6m15s
2. Inspect the logs of the failing job:
kubectl logs -f <failing-job-name> -n kube-system
[root@speech-platform disks]# kubectl logs -f helm-install-speech-platform-9l9vj -n kube-system
...
...
...
Upgrading speech-platform
+ helm_v3 upgrade --namespace speech-platform speech-platform https://10.43.0.1:443/static/phonexia-charts/speech-platform-0.0.0-36638f5-helm.tgz --values /config/values-10_HelmChartConfig.yaml
Error: failed to parse /config/values-10_HelmChartConfig.yaml: error converting YAML to JSON: yaml: line 494: could not find expected ':'
3. Next step: Validate the YAML (see next section).
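The failing job can also be picked out mechanically from the pod listing. A sketch over a captured sample; in practice, pipe `kubectl get pods -n kube-system --no-headers` into the filter:

```shell
# Print helm-install pods whose status is not Completed.
pods='helm-install-filebrowser-2b7pn 0/1 Completed 0 51m
helm-install-speech-platform-9l9vj 0/1 Error 4 6m15s'

echo "$pods" | awk '/^helm-install/ && $3 != "Completed" { print $1 ": " $3 }'
```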
Check configuration file validity
This section describes how to check if your configuration is valid and how to identify which line in the configuration is incorrect.
When to use this: Whenever changes are made to speech-platform-values.yaml, or if a Helm update job fails due to YAML syntax issues.
Why this matters: Helm requires a valid YAML configuration file to parse and apply configuration. A missing colon, incorrect indentation, or misplaced value can break the deployment.
1. How to validate the config:
Run:
yq .spec.valuesContent /data/speech-platform/speech-platform-values.yaml | yq .
If the configuration file is valid, the content of the file will be printed. Otherwise, the line number with an error will be printed out as follows:
[root@speech-platform ~]# yq .spec.valuesContent /data/speech-platform/speech-platform-values.yaml | yq .
Error: bad file '-': yaml: line 253: could not find expected ':'
The actual configuration is nested under spec.valuesContent, usually starting on line 7. If you see an error on line 253, add 7 (253 + 7 = 260) to get the actual line in the file.
2. View the lines around the error:
Run:
cat -n /data/speech-platform/speech-platform-values.yaml | grep 260 -B 10 -A 10
[root@speech-platform ~]# cat -n /data/speech-platform/speech-platform-values.yaml | grep 260 -B 10 -A 10
250
251 model:
252 volume:
253 hostPath:
254 path: /data/models/enhanced_speech_to_text_built_on_whisper
255
256 # Name of a model file inside the volume, for example "large_v2-1.0.0.model"
257 file: "large_v2-1.0.1.model"
258 license:
259 value:
260 "eyJ2ZX...=="
261
262 # Uncomment this to grant access to GPU on whisper pod
263 resources:
264 limits:
265 nvidia.com/gpu: "1"
266
267 # Uncomment this to run whisper on GPU
268 runtimeClassName: "nvidia"
269
270 service:
3. Fix the error
Example: This is invalid:
value:
"eyJ2ZX...=="
Correct form:
value: "eyJ2ZX...=="
Line 260 contains only the license key. The error message
could not find expected ':'
is accurate, because there is no : on this line. One line above (259) there is
a key named value, which should contain the license. However, the license
itself is on line 260, which makes the file invalid YAML. To fix it, simply
merge lines 259 and 260. The resulting file should look like this:
[root@speech-platform ~]# cat -n /data/speech-platform/speech-platform-values.yaml | grep 260 -B 10 -A 10
250
251 model:
252 volume:
253 hostPath:
254 path: /data/models/enhanced_speech_to_text_built_on_whisper
255
256 # Name of a model file inside the volume, for example "large_v2-1.0.0.model"
257 file: "large_v2-1.0.1.model"
258 license:
259 value: "eyJ2ZX...=="
260
261 # Uncomment this to grant access to GPU on whisper pod
262 resources:
263 limits:
264 nvidia.com/gpu: "1"
265
266 # Uncomment this to run whisper on GPU
267 runtimeClassName: "nvidia"
268
269 service:
270 clusterIP: "None"
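The line-number arithmetic and context printing above can be scripted. A sketch using a generated sample file, since the real values file is not available here; note that `sed -n 'X,Yp'` prints an exact line range, avoiding the false matches that `grep 260` can produce on any line merely containing "260":

```shell
# Map a yq-reported line inside spec.valuesContent to the file line
# (offset 7, as described above), then print the surrounding context.
yq_line=253
file_line=$(( yq_line + 7 ))           # 260
seq 1 300 > /tmp/values-sample.txt     # stand-in for the real values file
sed -n "$(( file_line - 10 )),$(( file_line + 10 ))p" /tmp/values-sample.txt
```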
Disable DNS resolving for specific domains
When to use this: Use this when you see long response times, timeout errors, or task processing delays due to DNS lookup issues, particularly when using DHCP or custom DNS setups.
Why this happens: This happens when DHCP is used for IP address assignment of
the virtual appliance, which usually configures a nameserver and search domains
in /etc/resolv.conf:
nameserver 192.168.137.1
search localdomain
First, check the CoreDNS logs:
kubectl -n kube-system logs -l k8s-app=kube-dns
The following lines in the logs indicate this issue:
2024-06-05T11:00:49.55751974Z stdout F [ERROR] plugin/errors: 2 speech-platform-envoy.localdomain. AAAA: read udp 10.42.0.27:60352->192.168.137.1:53: i/o timeout
2024-06-05T11:00:51.546562499Z stdout F [ERROR] plugin/errors: 2 speech-platform-envoy.localdomain. AAAA: read udp 10.42.0.27:40254->192.168.137.1:53: i/o timeout
2024-06-05T11:00:51.548101103Z stdout F [ERROR] plugin/errors: 2 speech-platform-envoy.localdomain. AAAA: read udp 10.42.0.27:47838->192.168.137.1:53: i/o timeout
2024-06-05T11:00:51.558720939Z stdout F [ERROR] plugin/errors: 2 speech-platform-envoy.localdomain. AAAA: read udp 10.42.0.27:39526->192.168.137.1:53: i/o timeout
2024-06-05T11:00:53.547326187Z stdout F [ERROR] plugin/errors: 2 speech-platform-envoy.localdomain. AAAA: read udp 10.42.0.27:58487->192.168.137.1:53: i/o timeout
2024-06-05T11:00:53.548836432Z stdout F [ERROR] plugin/errors: 2 speech-platform-envoy.localdomain. AAAA: read udp 10.42.0.27:46303->192.168.137.1:53: i/o timeout
Communication within the virtual appliance does not use FQDNs, which means that
each DNS name is resolved against all search domains. Internal Kubernetes
domains (<namespace>.svc.cluster.local, svc.cluster.local and cluster.local)
are resolved immediately by CoreDNS; non-Kubernetes domains are resolved by the
nameserver provided by DHCP. If access to that nameserver is blocked (for
example, by a firewall), resolving a single name can take up to 10 seconds,
which can significantly increase task processing duration.
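To see why a single short name can trigger many upstream lookups, consider how the resolver expands it: each search domain is appended in turn before the bare name is tried. A simple illustration; the search list used here is an assumption combining the /etc/resolv.conf above with the standard Kubernetes in-pod search path:

```shell
# Candidate FQDNs the resolver tries for one short name.
# The search list below is an illustrative assumption.
name=speech-platform-envoy
for domain in speech-platform.svc.cluster.local svc.cluster.local cluster.local localdomain; do
  echo "$name.$domain"
done
```

The last candidate, speech-platform-envoy.localdomain, is exactly the query seen timing out in the CoreDNS logs above.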
How to resolve: To avoid this issue, you can either allow communication from virtual appliance to DHCP-configured DNS server or configure kubernetes resolver to skip lookup for DHCP-provided domain(s):
1. Create a DNS override file:
[Virtual appliance] Manually create the file /data/speech-platform/coredns-custom.yaml
with the following content. Replace <domain1.com> and <domain2.com> with the
domains you want to disable lookup for:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  custom.server: |
    <domain1.com>:53 {
        log
    }
    <domain2.com>:53 {
        log
    }
2. Restart CoreDNS to apply the change:
kubectl -n kube-system rollout restart deploy/coredns
3. Verify that the CoreDNS pod is healthy and running:
kubectl -n kube-system get pods -l k8s-app=kube-dns
4. Restart all speech-platform pods:
kubectl -n speech-platform rollout restart deploy
kubectl -n speech-platform rollout restart sts