Mar 18, 2016

Then why's the paper[1] say different? The odds of a double flip are smaller than those of a single flip: a single flip gets corrected, a double stops the box (depending on the BIOS etc.), and an even rarer triple may not get detected at all. The odds are exactly why it just takes longer. This paper reports observing it happen.
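To make the corrected / detected / missed distinction concrete, here is a toy sketch of a SECDED (single-error-correct, double-error-detect) code. It uses an extended Hamming(8,4) code over 4 data bits, which is an assumption chosen for brevity; real ECC DIMMs typically protect 64 data bits with 8 check bits, and the names `encode` and `decode` are made up for this example. The same three behaviours show up either way.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    code = [0] * 8  # index 0 = overall parity, 1..7 = Hamming positions
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):  # each parity bit covers positions with that bit set
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2  # overall parity makes total weight even
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # XOR of set positions is 0 for a valid codeword
    odd = sum(code) % 2  # 1 if an odd number of bits flipped

    if syndrome == 0 and not odd:
        status = "ok"
    elif odd:
        # Assume exactly one flip and fix it (syndrome 0 = the parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome: detectable, not fixable.
        return "uncorrectable", None

    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions flipped."""
    out = list(code)
    for p in positions:
        out[p] ^= 1
    return out


word = encode(0b1011)
print(decode(flip(word, 5)))        # one flip: fixed
print(decode(flip(word, 5, 6)))     # two flips: flagged, box would halt
print(decode(flip(word, 3, 5, 6)))  # three flips: "corrected" to the wrong data
```

The third case is the nasty one: the decoder sees odd parity, assumes a single flip, "fixes" a bit, and hands back wrong data while reporting a correctable error.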


Mar 17, 2016

The memory can recover one, detect two, and will throw up loud warning bells to any competent admin as long as it is doing so.

I don't have personal experience here, but one of the important claims in the paper is that this warning is not given on all servers:

  Unfortunately, server vendors routinely use a technique   
  called ECC threshold or the 'leaky bucket' algorithm where 
  they count ECC errors for a period of time and report them 
  only if they reach certain levels of failure. From what we 
  understand, this threshold is commonly above 100 per hour, 
  but this remains a trade secret and varies based on the 
  server vendor. So, to see ECC errors (MCE in Linux or
  WHEA in Windows), there generally needs to be 100 bit flips 
  per hour or greater. This makes “seeing” Rowhammer on 
  server error logs more difficult.

  In addition, we have observed some server vendors will 
  NEVER report ECC events back to the OS, although they might 
  get logged into IPMI. Typically, users expect to see 
  correctable ECC errors logged directly to the OS or that 
  halt the system when they cannot be corrected. During our 
  investigation into this phenomenon, we even encountered one 
  server that neither reported ECC events to the OS nor
  halted when bit flips were not correctable. The end result 
  was data corruption at the application level.
  This is something, in our opinion, that should never happen 
  on an ECC protected server system.
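For anyone who hasn't run into the "leaky bucket" idea the quote mentions, a minimal sketch follows. The paper says the real thresholds are vendor trade secrets, so the threshold of 100, the leak rate of 100 per hour, and the `LeakyBucket` class itself are all assumptions for illustration, not any vendor's firmware logic.

```python
class LeakyBucket:
    """Toy error counter that only reports once errors arrive faster than they drain."""

    def __init__(self, threshold=100, leak_per_hour=100.0):
        self.threshold = threshold              # assumed reporting level
        self.leak_per_second = leak_per_hour / 3600.0
        self.level = 0.0                        # current bucket fill
        self.last_time = 0.0                    # time of last update, in seconds

    def record_error(self, now):
        """Record one corrected ECC error at time `now`; return True if it would be reported."""
        elapsed = now - self.last_time
        # Drain the bucket for the time that has passed, never below empty.
        self.level = max(0.0, self.level - elapsed * self.leak_per_second)
        self.last_time = now
        self.level += 1.0
        return self.level >= self.threshold


# 50 corrected errors per hour for 10 hours: the bucket drains faster than it fills.
slow = LeakyBucket()
slow_reports = [slow.record_error(72.0 * (i + 1)) for i in range(500)]

# 200 corrected errors within one second: the bucket overflows.
burst = LeakyBucket()
burst_reports = [burst.record_error(i / 200.0) for i in range(200)]

print(any(slow_reports), any(burst_reports))
```

Under these assumed numbers, a steady trickle of 500 corrected flips never shows up in the OS log, which is the visibility problem the paper is describing.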

Mar 17, 2016

Link to actual paper -