AMDGPU crashes under high load: black screen and GPU fans go crazy.

Kiuyn@lemmy.ml · edit-2 13 hours ago

IceVAN@beehaw.org · 4 hours ago

Do you have a powerful enough, decent, not-too-old PSU?

Kiuyn@lemmy.ml · 4 hours ago

My PSU is one year old, a 650 W unit (EVGA 650 N1). The problem is that there seems to be a lot of criticism of it (people say it is really bad).

Mike Wooskey@lemmy.thewooskeys.com · 5 hours ago

When I experienced the same symptoms, I eventually found out it was because ROCm didn’t support having an AMD GPU as well as an AMD iGPU (iGPU is an integrated GPU, on the motherboard). Once I disabled the iGPU, those symptoms stopped.

I don’t remember how I disabled the iGPU. It might have been in the BIOS settings, or it might have been a kernel parameter in /etc/default/grub.

If it doesn’t fix your issue, you can just re-enable the iGPU.
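For anyone checking whether a kernel parameter is already set in the GRUB config, here is a minimal Python sketch. It assumes the usual `GRUB_CMDLINE_LINUX_DEFAULT` variable in /etc/default/grub; the function name `kernel_params` is made up for illustration, and it does not know which parameter (if any) disables an iGPU.

```python
import shlex


def kernel_params(grub_text, key="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return the kernel parameters assigned to `key` in a grub defaults file."""
    params = []
    for line in grub_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            # shlex removes the shell quoting; a later assignment overrides an earlier one.
            params = " ".join(shlex.split(value)).split()
    return params


if __name__ == "__main__":
    try:
        with open("/etc/default/grub") as fh:
            print(kernel_params(fh.read()))
    except OSError as exc:
        print(f"could not read grub defaults: {exc}")
```

On distros that use GRUB, changes to this file normally only take effect after regenerating the GRUB config and rebooting.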

Kiuyn@lemmy.ml · 5 hours ago

Hi, ty for the comment. I don’t have an iGPU though, so I don’t think it is my issue.

PetteriPano@lemmy.world · 8 hours ago

I have two machines running the latest kernels on EndeavourOS. One with a Radeon RX 7900 XTX has no issues.

The other one has a Radeon RX 6650 XT, which, since a week or two ago, has started getting kworker threads stuck while throwing errors about fence queues. The load average can climb into the hundreds (there is no real load, just blocked threads) until the machine crashes.

As I recall there was an amdgpu firmware update around the time it started happening, but the changelog on the amdgpu kernel driver hints at solving similar issues.
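To see whether a high load average really comes from blocked threads rather than real work, one rough approach is to list tasks in uninterruptible sleep (state `D`), which Linux counts toward the load average. This is a sketch only: `blocked_tasks` is a hypothetical helper, and it looks at /proc/<pid>/stat entries, so threads inside a multi-threaded process are not listed individually (kernel workers such as kworker each have their own entry).

```python
import os
from pathlib import Path


def blocked_tasks(proc_root="/proc"):
    """Return sorted (pid, name) pairs for tasks in uninterruptible sleep (state D)."""
    blocked = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            stat = (entry / "stat").read_text()
        except OSError:
            continue  # the task exited while we were scanning
        # Format is "pid (comm) state ..."; comm may contain spaces or
        # parentheses, so split on the last ")".
        left, _, rest = stat.rpartition(")")
        fields = rest.split()
        if fields and fields[0] == "D":
            blocked.append((int(entry.name), left.partition("(")[2]))
    return sorted(blocked)


if __name__ == "__main__":
    print("load average:", os.getloadavg())
    for pid, name in blocked_tasks():
        print(pid, name)
```

If the list is full of kworker entries while CPU usage is low, that matches the "blocked threads, no real load" picture described above.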

Kiuyn@lemmy.ml · 5 hours ago

Hi, thank you for the information. I will try reverting some firmware versions and using the LTS kernel to see if the issue still persists.

Semperverus@lemmy.world · edit-2 8 hours ago

This sometimes happens to me when I run games in 4K at max settings with a 7900 XTX. So far I have not found anything that prevents it, and I’m starting to suspect my power supply or my house’s wiring might be the issue. It almost seems like a voltage sag.

Kiuyn@lemmy.ml · 5 hours ago

I also started to suspect my PSU, because the EVGA 650 N1 is notorious for being a bad PSU (I only found out quite a long time after I bought it). The problem is that my GPU is also second-hand, so I am not sure rn TBH.

frongt@lemmy.zip · 13 hours ago

Distro? Driver version? Temperature? Is it receiving enough power?

If everything checks out, it might just be defective.

Kiuyn@lemmy.ml · edit-2 13 hours ago

Hi, when I run an AI model it crashes once the GPU temp has been around 82 °C for more than a few seconds. Is that because of the temperature, or is the GPU defective? As for the other info you asked about: I use Arch, and I am on kernel 6.17.3.arch2-1 and mesa 1:25.2.4-2.
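To log temperature and power draw in the moments before a crash, a small Python sketch can read the amdgpu hwmon files in sysfs. Assumptions: the card exposes `temp1_input` (millidegrees Celsius) and `power1_average` (microwatts) under /sys/class/hwmon, which may differ by card and kernel; `amdgpu_sensors` is a hypothetical helper name.

```python
from pathlib import Path


def amdgpu_sensors(hwmon_root="/sys/class/hwmon"):
    """Return one dict per amdgpu hwmon device with temperature and power if present."""
    readings = []
    root = Path(hwmon_root)
    if not root.is_dir():
        return readings
    for dev in sorted(root.iterdir()):
        name_file = dev / "name"
        if not name_file.is_file() or name_file.read_text().strip() != "amdgpu":
            continue
        entry = {"device": dev.name}
        temp = dev / "temp1_input"  # millidegrees Celsius
        if temp.is_file():
            entry["temp_c"] = int(temp.read_text()) / 1000.0
        power = dev / "power1_average"  # microwatts; assumed present, not on every card
        if power.is_file():
            entry["power_w"] = int(power.read_text()) / 1_000_000.0
        readings.append(entry)
    return readings


if __name__ == "__main__":
    for reading in amdgpu_sensors():
        print(reading)
```

Running this in a loop while the model is loaded would show whether the crash lines up with a temperature limit or with a power spike, which helps separate a cooling problem from a PSU problem.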

glitching@lemmy.ml · 8 hours ago

How, and with what, do you run an LLM on an RX 580? I thought ROCm was for RX 6xxx and newer?

Kiuyn@lemmy.ml · edit-2 5 hours ago

I run it on Vulkan with llama.cpp.

PeeOnYou [he/him]@lemmygrad.ml · 13 hours ago

What’s your power situation? Maybe your PSU isn’t supplying enough power when everything is cranking away?

just_another_person@lemmy.world · 14 hours ago

See my other comment: https://lemmy.world/comment/20051971