Just a data point...
Hi All,
I ran a distributed test across several machines today, each one tasked with computing one hundred 512-bit strong prime numbers into an array. The computation spreads the array calculation across background threads using MP:FUNCALL-ASYNC so that all the CPU cores, including the HyperThread cores, are kept close to 100% busy. The machines are all Mac Minis and iMacs of various ages. Most were running various versions of OS X, while one remote box was a Mac Mini running Win 7/64.
On three of the machines, the computation proceeded without incident. On the fourth machine (the local Mac Mini), after several repeated runs, several of the async threads started signaling errors stating that the destination array variable was unbound. That array variable (xv below) would have been among the closure variables of each execution instance:
(let ((xv (make-array nbr)))
  (loop for ix from 0 below nbr do
    (setf (aref xv ix) ix))
  (um:npvmap (lambda (ix)
               (let ((ans (generate-safe-prime 512)))
                 (format t "~&~3D ~128,'0x" ix ans)
                 ans))
             xv)
  xv)
There is nothing wrong with the code. A gazillion cycles have been successfully performed on it. And in general, I find the MP stuff in LW to be extremely robust.
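For readers who do not have the UM package, here is a minimal sketch of the general technique described above: a destructive, parallel map over a vector. It is not the source of um:npvmap; the name sketch-pvmap, the use of a mailbox to wait for completion, and the sequential fallback outside LispWorks are all assumptions made for illustration. It also assumes that multiprocessing is already running, as it is in the LispWorks IDE.

```lisp
;; Hypothetical sketch, not the real um:npvmap.
;; Replaces each element of VEC with (funcall FN element) and returns VEC.
;; In LispWorks each element is handed to MP:FUNCALL-ASYNC; in any other
;; implementation the work simply runs sequentially in the calling thread.
(defun sketch-pvmap (fn vec)
  (let ((n (length vec))
        ;; Workers post to this mailbox when they finish (LispWorks only).
        #+lispworks (done (mp:make-mailbox)))
    (dotimes (i n)
      ;; Take a fresh binding per iteration: whether DOTIMES rebinds its
      ;; variable each time is implementation-dependent, and each closure
      ;; must see its own index.
      (let ((ix i))
        (flet ((work ()
                 (setf (aref vec ix) (funcall fn (aref vec ix)))))
          #+lispworks
          (mp:funcall-async
           (lambda ()
             ;; Always signal completion, even if WORK unwinds abnormally.
             (unwind-protect (work)
               (mp:mailbox-send done ix))))
          #-lispworks
          (work))))
    ;; Wait until every worker has reported back before returning VEC.
    #+lispworks
    (dotimes (i n)
      (mp:mailbox-read done))
    vec))
```

With this sketch, (sketch-pvmap (lambda (x) (* x x)) (vector 0 1 2 3)) returns #(0 1 4 9). Note that every worker closure captures both the vector and its own index, which is the same shape of closure in which xv was reported as unbound.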
About the only explanation I can think of is that my CPU and memory overheated after so much work, so that the internal timing became skewed, or else that a row-hammer sort of situation developed.
This is just a point of curiosity for now. Exiting and restarting LW seems to have cleared it up, which makes me think it was memory corruption in the GC (i.e., a row-hammer incident).
Has anyone else ever had curious incidents like these?
- DM
_______________________________________________
Lisp Hug - the mailing list for LispWorks users
lisp-hug@lispworks.com
http://www.lispworks.com/support/lisp-hug.html