cross-posted from: https://lemmy.ml/post/37817953
Hi all, my GPU crashes when I am running software with a high GPU load (in this case an AI model). It also happens with games, after a seemingly random amount of time (I can play for about 30 minutes and then crash, or sometimes not crash at all).
Here is my journalctl log:
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: Dumping IP State
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: Dumping IP State Completed
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=618, emitted seq=620
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: Process python pid 4571 thread python pid 5777
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: GPU reset begin!
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: device lost from bus!
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: [drm] device wedged, but recovered through reset
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: [drm] *ERROR* [CRTC:61:crtc-0] flip_done timed out
I tried to check the path /sys/class/drm/card1/device/devcoredump/data after a reboot, but there isn't anything there (in fact, the devcoredump folder doesn't even exist).
My specs:
Distro: Arch
Kernel: 6.17.3.arch2-1
Driver: Mesa 1:25.2.4-2
GPU: RX 580
CPU: R5 5500
PSU: EVGA 650 N1 650W
I am on the latest version of my BIOS.
Edit: my
Is there anything I can do to diagnose the issue? Any help is appreciated. Thank you!
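On the missing devcoredump folder: sysfs lives in memory, so the dump cannot survive a reboot, and as far as I know the kernel also discards it on its own after a few minutes. Below is a minimal Python sketch that copies the dump somewhere persistent while the system is still up. It assumes the card1 path from the log above and root privileges; the function name `save_coredump` and the destination directory are made up for illustration.

```python
import time
from pathlib import Path
from typing import Optional

# Path taken from the kernel log above; the card index may differ on other systems.
DUMP_PATH = Path("/sys/class/drm/card1/device/devcoredump/data")


def save_coredump(src: Path, dest_dir: Path) -> Optional[Path]:
    """Copy the dump at `src` into `dest_dir`.

    Returns the path of the copy, or None if no dump is present.
    """
    if not src.exists():
        return None
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"amdgpu-coredump-{stamp}.dump"
    # Read in chunks: sysfs files do not report a reliable size up front.
    with src.open("rb") as fin, dest.open("wb") as fout:
        while True:
            chunk = fin.read(65536)
            if not chunk:
                break
            fout.write(chunk)
    return dest


if __name__ == "__main__":
    # Hypothetical destination; run as root right after the reset, before rebooting.
    saved = save_coredump(DUMP_PATH, Path("/var/tmp/gpu-dumps"))
    print(saved if saved else "no devcoredump present")
```

This only helps if the machine stays reachable after the reset (for example over SSH), since the copy has to happen before the next reboot.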
Hi, when I run an AI model it crashes once the GPU temperature has been around 82C for more than a few seconds. Is that because of the temperature, or is the GPU defective? As for the other info you asked about: I use Arch, and I am on kernel 6.17.3.arch2-1 and mesa 1:25.2.4-2.
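One way to check whether the crashes really track temperature is to log it continuously until the reset happens. Below is a rough Python sketch, assuming the card shows up as card1 (as in the log) and exposes the usual amdgpu hwmon file `temp1_input`, which reports millidegrees Celsius. The function names and the log file are made up for illustration.

```python
import glob
import os
import time
from typing import Optional


def find_temp_input(card: str = "card1") -> Optional[str]:
    """Locate the first hwmon temperature file for the given DRM card, if any."""
    pattern = f"/sys/class/drm/{card}/device/hwmon/hwmon*/temp1_input"
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else None


def read_temp_c(path: str) -> float:
    """hwmon reports an integer in millidegrees Celsius; convert to degrees."""
    with open(path) as f:
        return int(f.read().strip()) / 1000.0


def log_temps(sensor: str, log_path: str, interval_s: float = 1.0,
              samples: Optional[int] = None) -> None:
    """Append timestamped readings to log_path.

    Runs until interrupted when samples is None.
    """
    taken = 0
    with open(log_path, "a") as log:
        while samples is None or taken < samples:
            log.write(f"{time.strftime('%H:%M:%S')} {read_temp_c(sensor):.1f}\n")
            # Flush and sync every line so the last reading survives a hard hang.
            log.flush()
            os.fsync(log.fileno())
            taken += 1
            time.sleep(interval_s)
```

Calling `log_temps(find_temp_input(), "gpu-temp.log")` before starting the load leaves a log whose last lines show the temperature right before the crash. If the resets happen at clearly different temperatures from run to run, heat alone is a less likely explanation.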
One possibly expensive way to find out is to add a better cooling solution and see if it stays stable.
With what, and how, do you run an LLM on an RX 580? I thought ROCm was for RX 6xxx and newer?
I run it on Vulkan with llama-cpp.