Please write out the GPU model to make debugging easier


Message boards : Wish list : Please write out the GPU model to make debugging easier

JStateson
Joined: 16 Jan 14
Posts: 17
Credit: 27,583,507
RAC: 14,866
Message 7728 - Posted: 8 Feb 2023, 18:57:46 UTC

Last modified: 8 Feb 2023, 18:58:19 UTC
On two systems, I have several Nvidia boards: GTX 1060 (3 GB and 6 GB), 1660, 1070, and P102-100. Occasionally a work unit fails to run with a message like this one:

<core_client_version>7.21.0</core_client_version>
<![CDATA[
<message>
The system cannot find the file specified.
 (0x2) - exit code 2 (0x2)</message>
<stderr_txt>

Error: Number of lc points is greater than POINTS_MAX = 1000
</stderr_txt>
]]>


I have no idea which board had the problem.
Keith Myers
Joined: 16 Nov 22
Posts: 99
Credit: 58,060,173
RAC: 399,135
Message 7729 - Posted: 8 Feb 2023, 20:38:29 UTC - in response to Message 7728.  
Is this just for errored tasks? Don't you get the board identified in the stderr.txt output for validated tasks?
Like this:

<core_client_version>7.19.0</core_client_version>
<![CDATA[
<stderr_txt>
BOINC client version 7.19.0
BOINC GPU type 'NVIDIA', deviceId=0, slot=14
CUDA version: 11080
CUDA Device number: 0
CUDA Device: NVIDIA GeForce RTX 3080 12037MB
CUDA Device driver: 525.78.01
Compute capability: 8.6
Shared memory per Block | per SM: 49152 | 102400
Multiprocessors: 70
Resident blocks per multiprocessor: 16
Grid dim: 1120 = 70*16
Block dim: 128
11:17:52 (229427): called boinc_finish(0)

</stderr_txt>
]]>

And this is what I get for an errored task:

<core_client_version>7.19.0</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 48606.68 (138002300.00G/2839.16G)</message>
<stderr_txt>
BOINC client version 7.19.0
BOINC GPU type 'NVIDIA', deviceId=1, slot=2
malloc(): invalid size (unsorted)
SIGABRT: abort called
BOINC client version 7.19.0
BOINC GPU type 'NVIDIA', deviceId=0, slot=2
malloc(): invalid size (unsorted)
SIGABRT: abort called

</stderr_txt>
]]>

A proud member of the OFA (Old Farts Association)
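For anyone who wants to pull the device assignment out of a saved stderr dump automatically, here is a minimal sketch. It assumes the application writes a line of the form `BOINC GPU type '...', deviceId=N, slot=M`, as in the output quoted above; the function name `find_gpu_assignments` is made up for illustration and is not part of BOINC. Note that it only recovers the device number, not the board model, because the errored output never gets as far as printing the model.

```python
import re

# Matches the startup line seen in the stderr output quoted above.
# This format is an assumption based on those samples only.
GPU_LINE = re.compile(r"BOINC GPU type '([^']+)', deviceId=(\d+), slot=(\d+)")

def find_gpu_assignments(stderr_text):
    """Return (gpu_type, device_id, slot) tuples in order of appearance."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in GPU_LINE.finditer(stderr_text)
    ]

# The errored-task output from this post, used as sample input.
sample = """BOINC client version 7.19.0
BOINC GPU type 'NVIDIA', deviceId=1, slot=2
malloc(): invalid size (unsorted)
SIGABRT: abort called
BOINC client version 7.19.0
BOINC GPU type 'NVIDIA', deviceId=0, slot=2
malloc(): invalid size (unsorted)
SIGABRT: abort called
"""

for gpu_type, device_id, slot in find_gpu_assignments(sample):
    print(f"{gpu_type} device {device_id} in slot {slot}")
```

On the sample this reports two attempts, first on device 1 and then on device 0, which suggests the task was restarted on a different GPU before it failed.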
JStateson
Message 7730 - Posted: 9 Feb 2023, 17:21:02 UTC - in response to Message 7729.  
Hi Keith!

I am making some progress. I found the Windows version of nvidia-smi, and it shows that one of my GPUs is "lost":

C:\Program Files\NVIDIA Corporation>nvidia-smi
Unable to determine the device handle for GPU0000:02:00.0: GPU is lost.  Reboot the system to recover this GPU


I have removed that GPU. Windows Device Manager gave no indication of any problem.
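To match a PCI address from an nvidia-smi error to a board model, something along these lines might help. It is a rough sketch: it relies on `nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv,noheader`, and the helper names and the sample output are invented for illustration (they are not my real boards). The query may itself fail while a GPU is in the "lost" state, so it is probably best to save the list while everything is healthy. The nvidia-smi index is also not guaranteed to match the BOINC deviceId.

```python
import csv
import io
import subprocess

def parse_gpu_list(csv_text):
    """Map PCI bus ID -> (index, name) from nvidia-smi CSV output without a header."""
    gpus = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or unexpected lines
        index, name, bus_id = (field.strip() for field in row)
        gpus[bus_id] = (int(index), name)
    return gpus

def find_by_bus_suffix(gpus, suffix):
    """Return (bus_id, index, name) for bus IDs ending in suffix, ignoring case."""
    return [
        (bus_id, index, name)
        for bus_id, (index, name) in gpus.items()
        if bus_id.lower().endswith(suffix.lower())
    ]

def query_gpus():
    """Run nvidia-smi and parse its output (needs nvidia-smi on the PATH)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,pci.bus_id", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_list(result.stdout)

# Hypothetical output, made up for this example.
sample_output = """0, NVIDIA GeForce GTX 1060 6GB, 00000000:01:00.0
1, NVIDIA GeForce GTX 1070, 00000000:02:00.0
"""

gpus = parse_gpu_list(sample_output)
print(find_by_bus_suffix(gpus, "0000:02:00.0"))
```

The suffix match is there because the error message above prints the address with a shorter domain prefix (0000:02:00.0) than the query output typically does. On a real system, call `query_gpus()` instead of parsing the sample text.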
