Oh no! - Blog

Oh no!

Today Google emailed me to say the site was returning a lot of 500s. I took a look and found the whole site was down.

And somehow everything in the database was gone. I logged into the server and it turned out all the volume mounts had dropped.

Ah, the server must have spontaneously combusted. I checked uptime:

05:01:13 up 4 days, 1:57, 1 user, load average: 2.78, 1.69, 1.50

It rebooted 4 days ago, and the log doesn't say anything in particular about what happened.

Jun 20 02:56:50 ssdnodes-601dac9c59cbb kubelet[532]: E0620 02:56:50.978009     532 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"app\" with CrashLoopBackOff: \"back-off 2m40s restarting>
Jun 20 02:56:53 ssdnodes-601dac9c59cbb kubelet[532]: E0620 02:56:53.792150     532 summary_sys_containers.go:48] "Failed to get system container stats" err="failed to get cgroup stats for \"/system.slice/docker.service\": failed >
Jun 20 02:56:53 ssdnodes-601dac9c59cbb kubelet[532]: E0620 02:56:53.979162     532 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"init-directories\" with ImagePullBackOff: \"Back-off pul>
Jun 20 02:57:00 ssdnodes-601dac9c59cbb kubelet[532]: I0620 02:57:00.976996     532 scope.go:110] "RemoveContainer" containerID="d6110fd909aaee861e308763b80b4ca4fbc0c73596f8b1c6d95a0ca2d41585a3"
Jun 20 02:57:00 ssdnodes-601dac9c59cbb kubelet[532]: E0620 02:57:00.977418     532 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"astrocraft-br-minecraft-bedrock\" with CrashLoopBackOff:>
Jun 20 02:57:03 ssdnodes-601dac9c59cbb kubelet[532]: E0620 02:57:03.866636     532 summary_sys_containers.go:48] "Failed to get system container stats" err="failed to get cgroup stats for \"/system.slice/docker.service\": failed >
Jun 20 02:57:04 ssdnodes-601dac9c59cbb kubelet[532]: E0620 02:57:04.980161     532 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"init-directories\" with ImagePullBackOff: \"Back-off pul>
Jun 20 02:57:05 ssdnodes-601dac9c59cbb kubelet[532]: I0620 02:57:05.977615     532 scope.go:110] "RemoveContainer" containerID="18be9e74e0be9d0d443da209e7436259e3debb3f357aede55aa74913f131b0c7"
Jun 20 02:57:05 ssdnodes-601dac9c59cbb kubelet[532]: E0620 02:57:05.977873     532 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"app\" with CrashLoopBackOff: \"back-off 2m40s restarting>
-- Boot b19fe9c889514469a40cc519127da728 --
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: Linux version 5.10.0-10-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.84-1 (2021-12-08)
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.10.0-10-amd64 root=UUID=cded99bf-532b-445b-81cd-6423d2885be2 ro quiet
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: xstate_offset[5]:  960, xstate_sizes[5]:   64

(The timestamps above are in UTC, so the reboot happened at around 11:00 AM HKT on 2024-06-20. This one is probably on SSDNodes.)
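For the record, the UTC to HKT conversion can be double-checked with a few lines of Python. This is only a minimal sketch: the year 2024 is taken from the date of this post (the journal lines themselves carry no year), and `utc_to_hkt` is a name made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# HKT is UTC+8 all year round (no daylight saving), so a fixed offset is enough.
HKT = timezone(timedelta(hours=8), name="HKT")


def utc_to_hkt(ts: datetime) -> datetime:
    """Convert a timezone-aware UTC timestamp to Hong Kong Time."""
    return ts.astimezone(HKT)


# First kernel line after the "-- Boot" marker: Jun 20 03:03:26.
# The year is an assumption taken from the post date.
boot_utc = datetime(2024, 6, 20, 3, 3, 26, tzinfo=timezone.utc)
print(utc_to_hkt(boot_utc).isoformat())  # 2024-06-20T11:03:26+08:00
```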

Anyway, first let's get the volumes mounted again.

Hm? Wrong password? How about this one? Also wrong. And this one... wrong too.

I forgot the password!

No way around it, I'll have to restore from backup.

The backups from the last few days are unusable... wait, why are all the backups from June 10 to 20 missing?

Ah! That must be from when the k8s cert expired a while back. After I issued a new certificate, the whole cluster apparently went GG.
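An expiring cert is easy to miss, so a small check like the one below might have caught it earlier. This is a rough sketch, not what was actually running on the cluster: the `notAfter` value is a made-up example (the post only tells us backups stop around June 10), and `days_until_expiry` is a hypothetical helper. The input uses the `Mon DD HH:MM:SS YYYY GMT` format that `openssl x509 -enddate` prints after `notAfter=`.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days left before a certificate's notAfter time (negative once expired)."""
    # ssl.cert_time_to_seconds parses "Mon DD HH:MM:SS YYYY GMT" into epoch seconds.
    expiry = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expiry - now).total_seconds() / 86400


# Hypothetical expiry date, for illustration only.
example_not_after = "Jun 10 00:00:00 2024 GMT"
checked_at = datetime(2024, 6, 20, tzinfo=timezone.utc)

remaining = days_until_expiry(example_not_after, checked_at)
if remaining < 0:
    print(f"certificate expired {-remaining:.0f} days ago")
else:
    print(f"certificate expires in {remaining:.0f} days")
```

On a kubeadm-managed cluster, `kubeadm certs check-expiration` reports the same information for all control-plane certificates.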

~ 10 days later ~

It rebooted again. What is it this time?

Hmm... it might be a system problem. I haven't done system updates in ages anyway, so I might as well update everything in one go.

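Oh no! kubelet is at 1.23, so old that even the deb repo for it is gone! After digging around online, it looks like I have to install 1.23 manually and then upgrade step by step...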
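"Step by step" here means one minor version at a time, since kubeadm does not support skipping minor versions when upgrading. A tiny sketch of what that path looks like follows. The target version 1.26 is just an assumed example, not a decision made in this post, and `upgrade_path` is a hypothetical helper.

```python
def upgrade_path(current: str, target: str) -> list[str]:
    """List every minor version to pass through, one hop at a time."""
    cur_major, cur_minor = (int(part) for part in current.split("."))
    tgt_major, tgt_minor = (int(part) for part in target.split("."))
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError(f"cannot plan an upgrade from {current} to {target}")
    # Each hop is a separate upgrade; minor versions cannot be skipped.
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


# 1.23 is the version from this post; 1.26 is an assumed target for illustration.
print(upgrade_path("1.23", "1.26"))  # ['1.24', '1.25', '1.26']
```

Every entry in that list is its own round of package installs and `kubeadm upgrade`, which is why the path from 1.23 looks so long.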

Oh no! From 1.24 onward, Docker is no longer supported as a container runtime (dockershim was removed)!

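This one is tricky. It looks like the only option is to tear the whole thing down and rebuild? After all, rebuilding in place seems faster than setting up a separate cluster and migrating over.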

What a hassle. I'll set this aside for now and find a holiday to deal with it properly.

Tag(s): diary

斟酌鵬兄

Fri Jul 05 2024 17:59:59 GMT+0000 (Coordinated Universal Time)

Last modified: Fri Jul 05 2024 18:00:15 GMT+0000 (Coordinated Universal Time)