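Uh-oh!
Today Google emailed me to say the site was returning a lot of 500s. I took a look and found the whole site was down.

Then, for some reason, everything in the database was gone. I logged into the server and found that all the volume mounts had dropped.

Ah, the server must have spontaneously combusted. I checked uptime: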

05:01:13 up 4 days, 1:57, 1 user, load average: 2.78, 1.69, 1.50

It rebooted 4 days ago, and the log doesn't say anything specific about what happened.
Jun 20 02:56:50 ssdnodes-601dac9c59cbb kubelet[532]: E0620 02:56:50.978009     532 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"app\" with CrashLoopBackOff: \"back-off 2m40s restarting>
Jun 20 02:56:53 ssdnodes-601dac9c59cbb kubelet[532]: E0620 02:56:53.792150     532 summary_sys_containers.go:48] "Failed to get system container stats" err="failed to get cgroup stats for \"/system.slice/docker.service\": failed >
Jun 20 02:56:53 ssdnodes-601dac9c59cbb kubelet[532]: E0620 02:56:53.979162     532 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"init-directories\" with ImagePullBackOff: \"Back-off pul>
Jun 20 02:57:00 ssdnodes-601dac9c59cbb kubelet[532]: I0620 02:57:00.976996     532 scope.go:110] "RemoveContainer" containerID="d6110fd909aaee861e308763b80b4ca4fbc0c73596f8b1c6d95a0ca2d41585a3"
Jun 20 02:57:00 ssdnodes-601dac9c59cbb kubelet[532]: E0620 02:57:00.977418     532 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"astrocraft-br-minecraft-bedrock\" with CrashLoopBackOff:>
Jun 20 02:57:03 ssdnodes-601dac9c59cbb kubelet[532]: E0620 02:57:03.866636     532 summary_sys_containers.go:48] "Failed to get system container stats" err="failed to get cgroup stats for \"/system.slice/docker.service\": failed >
Jun 20 02:57:04 ssdnodes-601dac9c59cbb kubelet[532]: E0620 02:57:04.980161     532 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"init-directories\" with ImagePullBackOff: \"Back-off pul>
Jun 20 02:57:05 ssdnodes-601dac9c59cbb kubelet[532]: I0620 02:57:05.977615     532 scope.go:110] "RemoveContainer" containerID="18be9e74e0be9d0d443da209e7436259e3debb3f357aede55aa74913f131b0c7"
Jun 20 02:57:05 ssdnodes-601dac9c59cbb kubelet[532]: E0620 02:57:05.977873     532 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"app\" with CrashLoopBackOff: \"back-off 2m40s restarting>
-- Boot b19fe9c889514469a40cc519127da728 --
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: Linux version 5.10.0-10-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.84-1 (2021-12-08)
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.10.0-10-amd64 root=UUID=cded99bf-532b-445b-81cd-6423d2885be2 ro quiet
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
Jun 20 03:03:26 ssdnodes-601dac9c59cbb kernel: x86/fpu: xstate_offset[5]:  960, xstate_sizes[5]:   64
(The timestamps above are in UTC, so the reboot happened at around 11:00 AM HKT on 2024-06-20. This one is probably SSDNodes' fault.)
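As a quick sanity check on the time zone arithmetic, here is a minimal Python sketch. It takes the first kernel timestamp after the boot marker (03:03:26 UTC on June 20, with the year 2024 taken from the note above) and converts it to HKT, assuming a fixed UTC+8 offset. It also adds the `uptime` reading of 4 days, 1:57 to estimate when `uptime` was run. The variable names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# HKT is a fixed UTC+8 offset (Hong Kong does not observe daylight saving time).
HKT = timezone(timedelta(hours=8), name="HKT")

# First kernel log line after the "-- Boot ... --" marker, in UTC.
boot_utc = datetime(2024, 6, 20, 3, 3, 26, tzinfo=timezone.utc)
boot_hkt = boot_utc.astimezone(HKT)

# "up 4 days, 1:57" from the uptime output.
up = timedelta(days=4, hours=1, minutes=57)
uptime_checked_utc = boot_utc + up

print(boot_hkt.isoformat())
print(uptime_checked_utc.isoformat())
```

If the server clock is in UTC, the second value should land within about a minute of the 05:01:13 shown by `uptime`, since `uptime` only reports whole minutes.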

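Anyway, first things first: get the volumes mounted again.

Hm? Wrong password? How about this one? That's wrong too. This one... also wrong.

I forgot the password!

No choice, I'll have to restore from backup.
The backups from the last few days are unusable... wait, why are the backups for June 10 to 20 all missing?

Ah! This must be because the k8s cert expired a while back. After I issued a new certificate, the whole cluster apparently went GG.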

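~10 days later~

It rebooted again. What is it this time?

Hmm... maybe it's a system problem. I haven't done a system update in ages anyway, so I might as well update everything in one go.

Uh-oh! The kubelet version is 1.23, so old that even the deb repo is gone! After digging around online, it looks like I have to install 1.23 manually and then upgrade step by step...

Uh-oh! From 1.24 onward, Docker is no longer supported as a container runtime (dockershim was removed)!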
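The "install 1.23 and push it up slowly" plan can be sketched as a list of hops. This is only a rough sketch: it assumes one minor version per hop (kubeadm does not support skipping minor versions) and flags the hop where Docker stops working without extra setup. The target version `1.30` and the function name `upgrade_hops` are hypothetical placeholders.

```python
def upgrade_hops(current: str, target: str) -> list[str]:
    """List each minor version to pass through, one hop at a time."""
    cur_major, cur_minor = (int(p) for p in current.split("."))
    tgt_major, tgt_minor = (int(p) for p in target.split("."))
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("only forward upgrades within one major version are handled")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


# 1.30 is a hypothetical target, used only for illustration.
for version in upgrade_hops("1.23", "1.30"):
    # dockershim was removed in 1.24, so that hop needs a CRI runtime
    # such as containerd (or the cri-dockerd adapter) in place first.
    note = "  <- switch container runtime before this hop" if version == "1.24" else ""
    print(version + note)
```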

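This is a bit tricky. It looks like the only option is to tear the whole thing down and rebuild? After all, rebuilding from scratch seems faster than setting up another cluster and migrating over.

What a hassle. I'll set this aside for now and find a holiday to do it properly.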
Tag(s): diary
斟酌 鵬兄
Fri Jul 05 2024 17:59:59 GMT+0000 (Coordinated Universal Time)
Last modified: Fri Jul 05 2024 18:00:15 GMT+0000 (Coordinated Universal Time)