@rjurney
Last active September 26, 2023 20:45
Heracles Ubuntu 20.04 Logs - Machine that Shuts Down for No Reason
# ChatGPT says:
1) The nvidia-fs module (NVIDIA's GPUDirect Storage kernel driver) could not retrieve NUMA (Non-Uniform Memory Access) node information for several PCI devices. On a NUMA system each processor reaches its own local memory faster than memory attached to another processor, so drivers look up a device's node in order to place buffers close to it. This might be related to the multi-GPU setup you mentioned.
2) The NVIDIA kernel module for your graphics cards is being loaded. The version is 530.30.02, and it was compiled on Wed Feb 22 04:11:39 UTC 2023.
3) This is a warning from the ACPI subsystem in the kernel, related to device configuration and power management. The specific warning pertains to an argument type mismatch for a given method. Such ACPI warnings can result from inconsistencies between the computer's BIOS/UEFI firmware and the OS's expectations.
4) Lastly, the nvidia-drm, nvidia_uvm, and other NVIDIA related entries refer to the NVIDIA graphics driver's components being initialized. The Direct Rendering Manager (drm) is used in graphics pipelines, while nvidia_uvm refers to NVIDIA's Unified Virtual Memory, a feature of their CUDA platform.
Given these logs and the repeated ACPI errors (AE_ALREADY_EXISTS), I suspect there might be a slight incompatibility or configuration issue between your system's BIOS/UEFI firmware and the installed version of Ubuntu. This could be amplified by the presence of multiple identical GPUs. It might help to check for a BIOS/UEFI update, as such updates can sometimes improve compatibility and resolve ACPI-related issues.
If the system is working fine, these could be benign warnings. If you're facing any actual issues, I'd suggest checking NVIDIA forums and Ubuntu forums with your GPU model and the specific errors for more tailored advice.
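One way to follow up on the nvidia-fs NUMA warnings is to look at what the kernel itself reports for those PCI devices. The sketch below is only an illustration: it assumes the usual Linux sysfs layout, where each device directory under /sys/bus/pci/devices/ has a numa_node file, and the names read_numa_node and DEVICES are made up here. A value of -1 generally means the kernel has no NUMA node recorded for the device, which is common on single-socket machines.

```python
from pathlib import Path

# A few PCI addresses copied from the nvidia-fs warnings in the logs.
DEVICES = ["0000:01:00.0", "0000:02:00.0", "0000:03:00.0"]


def read_numa_node(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the NUMA node the kernel reports for a PCI device.

    Returns None when the sysfs file is missing or unreadable.
    The sysfs_root parameter is a hypothetical hook so the function
    can be pointed at a test directory.
    """
    path = Path(sysfs_root) / pci_address / "numa_node"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    for addr in DEVICES:
        print(addr, read_numa_node(addr))
```

If every device prints -1, the warnings are probably just nvidia-fs noting that there is no NUMA topology to look up, though that is an inference and not something these logs confirm.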
------------------------Logs-follow---------------------------
[ 2.201430] kernel: EXT4-fs (nvme0n1p2): re-mounted. Opts: errors=remount-ro. Quota mode: none.
[ 2.201984] kernel: usb 1-11: new full-speed USB device number 6 using xhci_hcd
[ 2.202214] systemd[1]: Finished Remount Root and Kernel File Systems.
[ 2.202618] systemd[1]: Activating swap /swapfile...
[ 2.202831] systemd[1]: Condition check resulted in Rebuild Hardware Database being skipped.
[ 2.203450] systemd[1]: Starting Load/Save Random Seed...
[ 2.203995] systemd[1]: Starting Create System Users...
[ 2.206252] kernel: nvidia_fs: loading out-of-tree module taints kernel.
[ 2.206284] kernel: nvidia_fs: module verification failed: signature and/or required key missing - tainting kernel
[ 2.206428] kernel: nvidia_fs: Initializing nvfs driver module
...
[ 2.206429] kernel: nvidia_fs: registered correctly with major number 236
[ 2.206769] kernel: nvidia-fs:warning: error retrieving numa node for device 0000:02:00.0
[ 2.206772] kernel: nvidia-fs:warning: error retrieving numa node for device 0000:03:00.0
[ 2.206791] kernel: nvidia-fs:warning: error retrieving numa node for device 0000:02:00.0
[ 2.206806] kernel: nvidia-fs:warning: error retrieving numa node for device 0000:06:00.0
[ 2.206820] kernel: nvidia-fs:warning: error retrieving numa node for device 0000:02:00.0
[ 2.206829] kernel: nvidia-fs:warning: error retrieving numa node for device 0000:07:00.0
[ 2.206837] kernel: nvidia-fs:warning: error retrieving numa node for device 0000:02:00.0
[ 2.206846] kernel: nvidia-fs:warning: error retrieving numa node for device 0000:09:00.0
[ 2.206857] kernel: nvidia-fs:warning: error retrieving numa node for device 0000:01:00.0
[ 2.206866] kernel: nvidia-fs:warning: error retrieving numa node for device 0000:03:00.0
[ 2.206874] kernel: nvidia-fs:warning: error retrieving numa node for device 0000:01:00.0
[ 2.206883] kernel: nvidia-fs:warning: error retrieving numa node for device 0000:06:00.0
[ 2.206891] kernel: nvidia-fs:warning: error retrieving numa node for device 0000:01:00.0
[ 2.206900] kernel: nvidia-fs:warning: error retrieving numa node for device 0000:07:00.0
[ 2.206908] kernel: nvidia-fs:warning: error retrieving numa node for device 0000:01:00.0
[ 2.206917] kernel: nvidia-fs:warning: error retrieving numa node for device 0000:09:00.0
...
[ 3.030228] kernel: thermal thermal_zone2: failed to read out thermal zone (-61)
[ 3.040338] kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 530.30.02 Wed Feb 22 04:11:39 UTC 2023
[ 3.047667] kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 530.30.02 Wed Feb 22 03:45:40 UTC 2023
[ 3.049037] kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 3.066900] kernel: ACPI Warning: \_SB.PC00.PEG1.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20210730/nsarguments-61)
[ 3.082046] kernel: apple 0003:05AC:0220.0003: input,hidraw4: USB HID v1.11 Device [Apple, Inc Apple Keyboard] on usb-0000:00:14.0-7.2/input1
[ 3.090402] kernel: Bluetooth: hci0: MSFT filter_enable is already on
[ 3.177673] kernel: iwlwifi 0000:08:00.0: loaded PNVM version 0x5a8dfca
[ 3.189916] kernel: iwlwifi 0000:08:00.0: Detected RF GF, rfid=0x10d000
[ 3.242652] kernel: NET: Registered PF_ALG protocol family
[ 3.245306] kernel: intel_rapl_common: Found RAPL domain package
[ 3.245307] kernel: intel_rapl_common: Found RAPL domain core
[ 3.249626] kernel: aufs 5.15.5-20211129
[ 3.259982] kernel: iwlwifi 0000:08:00.0: base HW address: 2c:0d:a7:de:22:0e
[ 3.265433] kernel: ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20210730/dsfield-184)
[ 3.265451] kernel: ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20210730/dswload2-477)
[ 3.265461] kernel:
[ 3.265462] kernel: No Local Variables are initialized for Method [_DSM]
[ 3.265462] kernel:
[ 3.265463] kernel: Initialized Arguments for Method [_DSM]: (4 arguments defined for method invocation)
[ 3.265463] kernel: Arg0: 00000000be1f3f11 <Obj> Buffer(16) 75 0B A5 D4 C7 65 F7 46
[ 3.265467] kernel: Arg1: 00000000f2d76158 <Obj> Integer 0000000000000102
[ 3.265469] kernel: Arg2: 000000003988fe13 <Obj> Integer 0000000000000010
[ 3.265470] kernel: Arg3: 000000005055a5cc <Obj> Buffer(4) 00 10 52 44
[ 3.265472] kernel:
[ 3.265473] kernel: ACPI Error: Aborting method \_SB.PC00.PEG1.PEGP._DSM due to previous error (AE_ALREADY_EXISTS) (20210730/psparse-529)
[ 3.265617] kernel: ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20210730/dsfield-184)
[ 3.265630] kernel: ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20210730/dswload2-477)
[ 3.265639] kernel:
[ 3.265639] kernel: No Local Variables are initialized for Method [_DSM]
[ 3.265640] kernel:
[ 3.265640] kernel: Initialized Arguments for Method [_DSM]: (4 arguments defined for method invocation)
[ 3.265640] kernel: Arg0: 000000005055a5cc <Obj> Buffer(16) 75 0B A5 D4 C7 65 F7 46
[ 3.265643] kernel: Arg1: 000000003988fe13 <Obj> Integer 0000000000000102
[ 3.265644] kernel: Arg2: 00000000f2d76158 <Obj> Integer 0000000000000010
[ 3.265646] kernel: Arg3: 00000000be1f3f11 <Obj> Buffer(4) 00 20 52 44
[ 3.265648] kernel:
[ 3.265648] kernel: ACPI Error: Aborting method \_SB.PC00.PEG1.PEGP._DSM due to previous error (AE_ALREADY_EXISTS) (20210730/psparse-529)
[ 3.265790] kernel: ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20210730/dsfield-184)
[ 3.265802] kernel: ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20210730/dswload2-477)
... lots of ...
[ 3.269891] kernel: No Local Variables are initialized for Method [_DSM]
[ 3.269891] kernel:
[ 3.269891] kernel: Initialized Arguments for Method [_DSM]: (4 arguments defined for method invocation)
[ 3.269891] kernel: Arg0: 000000005055a5cc <Obj> Buffer(16) 75 0B A5 D4 C7 65 F7 46
[ 3.269894] kernel: Arg1: 000000003988fe13 <Obj> Integer 0000000000000102
[ 3.269896] kernel: Arg2: 00000000f2d76158 <Obj> Integer 0000000000000010
[ 3.269897] kernel: Arg3: 00000000be1f3f11 <Obj> Buffer(4) 00 80 52 44
[ 3.269899] kernel:
[ 3.269899] kernel: ACPI Error: Aborting method \_SB.PC00.PEG1.PEGP._DSM due to previous error (AE_ALREADY_EXISTS) (20210730/psparse-529)
[ 3.271233] kernel: ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20210730/dsfield-184)
[ 3.272020] kernel: ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20210730/dswload2-477)
[ 3.272424] kernel:
[ 3.272424] kernel: No Local Variables are initialized for Method [_DSM]
[ 3.272424] kernel:
[ 3.272425] kernel: Initialized Arguments for Method [_DSM]: (4 arguments defined for method invocation)
[ 3.272425] kernel: Arg0: 00000000be1f3f11 <Obj> Buffer(16) 75 0B A5 D4 C7 65 F7 46
[ 3.272428] kernel: Arg1: 00000000f2d76158 <Obj> Integer 0000000000000102
[ 3.272429] kernel: Arg2: 000000003988fe13 <Obj> Integer 0000000000000010
[ 3.272430] kernel: Arg3: 000000005055a5cc <Obj> Buffer(4) 00 90 52 44
[ 3.272432] kernel:
[ 3.272433] kernel: ACPI Error: Aborting method \_SB.PC00.PEG1.PEGP._DSM due to previous error (AE_ALREADY_EXISTS) (20210730/psparse-529)
[ 3.273390] kernel: ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20210730/dsfield-184)
[ 3.274433] kernel: ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20210730/dswload2-477)
... ending in ...
[ 3.930684] kernel: No Local Variables are initialized for Method [_DSM]
[ 3.930684] kernel:
[ 3.930685] kernel: Initialized Arguments for Method [_DSM]: (4 arguments defined for method invocation)
[ 3.930686] kernel: Arg0: 0000000075b229ee <Obj> Buffer(16) 75 0B A5 D4 C7 65 F7 46
[ 3.930689] kernel: Arg1: 00000000c31cd1cf <Obj> Integer 0000000000000102
[ 3.930691] kernel: Arg2: 00000000e42d8972 <Obj> Integer 0000000000000010
[ 3.930692] kernel: Arg3: 000000005055a5cc <Obj> Buffer(4) 00 00 50 4F
[ 3.930694] kernel:
[ 3.930695] kernel: ACPI Error: Aborting method \_SB.PC00.PEG1.PEGP._DSM due to previous error (AE_ALREADY_EXISTS) (20210730/psparse-529)
[ 4.363840] kernel: NFSD: Using UMH upcall client tracking operations.
[ 4.363845] kernel: NFSD: starting 90-second grace period (net f0000000)
[ 4.836443] kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[ 4.836781] kernel: [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
[ 4.837468] kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:02:00.0 on minor 1
[ 4.854094] kernel: nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[ 4.856824] kernel: nvidia-uvm: Loaded the UVM driver, major device number 505.
heracles Tue Sep 26 13:24:52 2023 530.30.02
[0] NVIDIA GeForce RTX 3060 | 40'C, 0 % | 613 / 12288 MB | gdm(101M) rjurney(207M) rjurney(44M) rjurney(2M)
[1] NVIDIA GeForce RTX 3060 | 39'C, 0 % | 254 / 12288 MB | gdm(4M) rjurney(4M)
# Note: ChatGPT says - The kernel log entries you provided are related to ACPI (Advanced Configuration and Power Interface) and the NVIDIA GPU driver on your system.
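To see how much of a kernel log falls into the ACPI and NVIDIA categories, the lines can be tallied by message type. This is a rough sketch, not a diagnostic tool: the category names in PATTERNS are invented for illustration, and the substrings are simply the ones that appear in the logs in this gist.

```python
import sys
from collections import Counter

# Hypothetical category names mapped to substrings seen in these logs.
PATTERNS = {
    "acpi_already_exists": "AE_ALREADY_EXISTS",
    "acpi_warning": "ACPI Warning",
    "nvidia_fs_numa": "error retrieving numa node",
    "thermal_zone": "failed to read out thermal zone",
}


def tally(lines):
    """Count how many lines contain each substring in PATTERNS."""
    counts = Counter()
    for line in lines:
        for name, needle in PATTERNS.items():
            if needle in line:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Reads log text from standard input and prints one count per category.
    for name, count in tally(sys.stdin).most_common():
        print(f"{count:6d}  {name}")
```

Assuming the script is saved under a name such as tally_kernel_log.py, it could be fed the previous boot's kernel messages with `journalctl -k -b -1 | python3 tally_kernel_log.py`, which is the boot of interest when the machine has shut down unexpectedly.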
------------------------Logs-follow---------------------------
Sep 25 11:22:10 localhost kernel: [ 3.924315] ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20210730/dsfield-184)
Sep 25 11:22:10 localhost kernel: [ 3.925958] ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20210730/dswload2-477)
Sep 25 11:22:10 localhost kernel: [ 3.927158]
Sep 25 11:22:10 localhost kernel: [ 3.927160] No Local Variables are initialized for Method [_DSM]
Sep 25 11:22:10 localhost kernel: [ 3.927160]
Sep 25 11:22:10 localhost kernel: [ 3.927161] Initialized Arguments for Method [_DSM]: (4 arguments defined for method invocation)
Sep 25 11:22:10 localhost kernel: [ 3.927162] Arg0: 000000005055a5cc <Obj> Buffer(16) 75 0B A5 D4 C7 65 F7 46
Sep 25 11:22:10 localhost kernel: [ 3.927170] Arg1: 00000000e42d8972 <Obj> Integer 0000000000000102
Sep 25 11:22:10 localhost kernel: [ 3.927173] Arg2: 00000000c31cd1cf <Obj> Integer 0000000000000010
Sep 25 11:22:10 localhost kernel: [ 3.927176] Arg3: 0000000075b229ee <Obj> Buffer(4) 00 00 4B 56
Sep 25 11:22:10 localhost kernel: [ 3.927180]
Sep 25 11:22:10 localhost kernel: [ 3.927182] ACPI Error: Aborting method \_SB.PC00.PEG1.PEGP._DSM due to previous error (AE_ALREADY_EXISTS) (20210730/psparse-529)
Sep 25 11:22:10 localhost kernel: [ 3.928723] ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20210730/dsfield-184)
Sep 25 11:22:10 localhost kernel: [ 3.929891] ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20210730/dswload2-477)
Sep 25 11:22:10 localhost kernel: [ 3.930683]
Sep 25 11:22:10 localhost kernel: [ 3.930684] No Local Variables are initialized for Method [_DSM]
Sep 25 11:22:10 localhost kernel: [ 3.930684]
Sep 25 11:22:10 localhost kernel: [ 3.930685] Initialized Arguments for Method [_DSM]: (4 arguments defined for method invocation)
Sep 25 11:22:10 localhost kernel: [ 3.930686] Arg0: 0000000075b229ee <Obj> Buffer(16) 75 0B A5 D4 C7 65 F7 46
Sep 25 11:22:10 localhost kernel: [ 3.930689] Arg1: 00000000c31cd1cf <Obj> Integer 0000000000000102
Sep 25 11:22:10 localhost kernel: [ 3.930691] Arg2: 00000000e42d8972 <Obj> Integer 0000000000000010
Sep 25 11:22:10 localhost kernel: [ 3.930692] Arg3: 000000005055a5cc <Obj> Buffer(4) 00 00 50 4F
Sep 25 11:22:10 localhost kernel: [ 3.930694]
Sep 25 11:22:10 localhost kernel: [ 3.930695] ACPI Error: Aborting method \_SB.PC00.PEG1.PEGP._DSM due to previous error (AE_ALREADY_EXISTS) (20210730/psparse-529)
Sep 25 11:22:11 localhost kernel: [ 4.363840] NFSD: Using UMH upcall client tracking operations.
Sep 25 11:22:11 localhost kernel: [ 4.363845] NFSD: starting 90-second grace period (net f0000000)
Sep 25 11:22:11 localhost kernel: [ 4.836443] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
Sep 25 11:22:11 localhost kernel: [ 4.836781] [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
Sep 25 11:22:11 localhost kernel: [ 4.837468] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:02:00.0 on minor 1
Sep 25 11:22:11 localhost kernel: [ 4.854094] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
Sep 25 11:22:11 localhost kernel: [ 4.856824] nvidia-uvm: Loaded the UVM driver, major device number 505.
Sep 25 11:22:17 localhost kernel: [ 10.854179] atlantic 0000:07:00.0 eno1: atlantic: link change old 0 new 1000
Sep 25 11:22:17 localhost kernel: [ 10.854376] IPv6: ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready
Sep 25 11:22:19 localhost kernel: [ 12.336963] TCP: eno1: Driver has suspect GRO implementation, TCP performance may be compromised.
Sep 25 11:22:22 localhost kernel: [ 15.411945] bpfilter: Loaded bpfilter_umh pid 1352
Sep 25 11:22:22 localhost kernel: [ 15.646997] kauditd_printk_skb: 44 callbacks suppressed
Sep 25 11:22:22 localhost kernel: [ 15.647004] audit: type=1400 audit(1695666142.455:55): apparmor="STATUS" operation="profile_load" profile="unconfined" name="docker-default" pid=1396 comm="apparmor_parser"
Sep 25 11:22:23 localhost kernel: [ 16.452563] atlantic 0000:07:00.0 eno1: atlantic: link change old 1000 new 0
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Fri Aug 18 12:49:41 2023
installer version: 535.98
PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
nvidia-installer command line:
./nvidia-installer
Using: nvidia-installer ncurses v6 user interface
-> Detected 24 CPUs online; setting concurrency level to 24.
ERROR: An NVIDIA kernel module 'nvidia-modeset' appears to already be loaded in your kernel. This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading. Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver. If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occurred that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
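The installer error above comes down to which NVIDIA kernel modules are loaded and whether anything still holds a reference to them. A minimal sketch for inspecting that, assuming the standard /proc/modules layout (name, size, use count, dependent modules, state, address); the function name loaded_nvidia_modules is hypothetical:

```python
def loaded_nvidia_modules(proc_modules_text):
    """Parse /proc/modules text and return {name: (use_count, dependents)}
    for every module whose name starts with "nvidia".

    use_count is None when the kernel prints "-" in that column, which
    happens on kernels built without module unloading support.
    """
    result = {}
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or not fields[0].startswith("nvidia"):
            continue
        name, _size, refcount, users = fields[:4]
        use_count = int(refcount) if refcount.isdigit() else None
        # The dependents column is a comma-separated list with a trailing
        # comma, or a single "-" when no other module depends on this one.
        dependents = [u for u in users.split(",") if u and u != "-"]
        result[name] = (use_count, dependents)
    return result


if __name__ == "__main__":
    with open("/proc/modules") as f:
        modules = loaded_nvidia_modules(f.read())
    for name, (count, users) in modules.items():
        print(f"{name}: use count {count}, used by {users or 'no other module'}")
```

A non-zero use count on nvidia_modeset would be consistent with the X server and GNOME processes that nvidia-smi lists further down, which matches the installer's suggestion to stop GPU programs before upgrading.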
Tue Sep 26 12:04:24 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060         On | 00000000:01:00.0 Off |                  N/A |
|  0%   39C    P8               14W / 170W|    367MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060         On | 00000000:02:00.0 Off |                  N/A |
|  0%   38C    P8               14W / 170W|     10MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1482      G   /usr/lib/xorg/Xorg                          101MiB |
|    0   N/A  N/A      5615      G   /usr/lib/xorg/Xorg                          207MiB |
|    0   N/A  N/A      5745      G   /usr/bin/gnome-shell                         44MiB |
|    0   N/A  N/A      6065      G   /usr/libexec/gnome-initial-setup              2MiB |
|    1   N/A  N/A      1482      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      5615      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+