cross-posted from: https://lemmy.ml/post/37817953

Hi all, my system crashes when I am using software with high GPU load (in this case an AI model). It also happens with games, after a random amount of time (I can play for about 30 minutes and then crash, or sometimes not crash at all).

Here is my journalctl log:

Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: Dumping IP State
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: Dumping IP State Completed
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=618, emitted seq=620
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu:  Process python pid 4571 thread python pid 5777
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: GPU reset begin!
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: device lost from bus!
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: [drm] device wedged, but recovered through reset
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: [drm] *ERROR* [CRTC:61:crtc-0] flip_done timed out

I tried to check the path /sys/class/drm/card1/device/devcoredump/data after rebooting, but there isn't anything there (in fact, the devcoredump folder doesn't even exist).
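A note on why the file was gone: as far as I understand the kernel's devcoredump mechanism, the dump is held in memory only, so it does not survive a reboot, and the kernel also discards it on its own after a few minutes. It would have to be copied right after the reset, before rebooting. A minimal sketch of that copy, using the path from the log above; the `save_coredump` helper and the destination directory are made up for illustration, and reading the file will likely need root:

```python
import shutil
import time
from pathlib import Path
from typing import Optional

# Path taken from the kernel log above; the card index may differ on other systems.
DUMP = Path("/sys/class/drm/card1/device/devcoredump/data")


def save_coredump(src: Path, dest_dir: Path) -> Optional[Path]:
    """Copy the dump into dest_dir; return the new path, or None if no dump exists."""
    if not src.exists():
        return None
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"amdgpu-coredump-{time.strftime('%Y%m%d-%H%M%S')}.dump"
    # sysfs files may not report a meaningful size, so stream until EOF
    # instead of relying on stat().
    with src.open("rb") as fin, dest.open("wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


if __name__ == "__main__":
    # /var/tmp/gpu-dumps is an arbitrary example location on persistent storage.
    saved = save_coredump(DUMP, Path("/var/tmp/gpu-dumps"))
    print(saved if saved else "no devcoredump present right now")
```

If the desktop is frozen after the reset, running this over SSH from another machine may still work, since the log says the device recovered.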

My specs:
Distro: Arch
Kernel: 6.17.3.arch2-1
Driver: Mesa 1:25.2.4-2
GPU: RX 580
CPU: R5 5500
PSU: EVGA 650 N1 (650 W)
BIOS: latest version

Edit: my

Is there anything I can do to diagnose the issue? Any help is appreciated. Thank you!

  • frongt@lemmy.zip
    22 hours ago

    Distro? Driver version? Temperature? Is it receiving enough power?

    If everything checks out, it might just be defective.

    • Kiuyn@lemmy.mlOP
      22 hours ago

      Hi, when I run an AI model it crashes once the GPU temperature has been around 82 °C for more than a few seconds. Is that because of the temperature, or is the GPU defective? For the other info you asked about: I use Arch, and I am on kernel 6.17.3.arch2-1 and Mesa 1:25.2.4-2.

      • eldavi@lemmy.ml
        3 hours ago

        One possibly expensive way to find out is to add a better cooling solution to the card and see whether it stays stable.
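A cheaper first check might be to log the GPU temperature during a run and compare the last readings with the timestamp of the `ring ... timeout` line in the journal. A rough sketch, assuming the card exposes the standard hwmon `temp1_input` file (millidegrees Celsius) under the `card1` device path that appears in the log; the function names and the one-second interval are arbitrary choices for illustration:

```python
import time
from pathlib import Path
from typing import List, Optional, Tuple

# Device directory inferred from the log in the original post (card1).
DEVICE = Path("/sys/class/drm/card1/device")


def find_temp_file(device: Path) -> Optional[Path]:
    """Return the first hwmon temp1_input file under the device, if any."""
    matches = sorted(device.glob("hwmon/hwmon*/temp1_input"))
    return matches[0] if matches else None


def read_celsius(temp_file: Path) -> float:
    """hwmon reports temperatures in millidegrees Celsius."""
    return int(temp_file.read_text().strip()) / 1000.0


def log_temperature(
    temp_file: Path, samples: int, interval: float = 1.0
) -> List[Tuple[str, float]]:
    """Print and return timestamped readings, one per interval."""
    readings = []
    for _ in range(samples):
        # Same timestamp layout as the journalctl lines, for easy comparison.
        stamp = time.strftime("%b %d %H:%M:%S")
        celsius = read_celsius(temp_file)
        # flush so each line is written out immediately
        print(f"{stamp} gpu_temp={celsius:.1f}C", flush=True)
        readings.append((stamp, celsius))
        time.sleep(interval)
    return readings


if __name__ == "__main__":
    sensor = find_temp_file(DEVICE)
    if sensor is None:
        print("no hwmon temperature sensor found under", DEVICE)
    else:
        log_temperature(sensor, samples=3600)
```

Redirecting the output to a file while the AI model runs would leave a record that can be read after the crash. If crashes also show up at clearly lower temperatures, heat alone is probably not the whole story.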