Cosmic Ray Showers Crash Supercomputers. Here's What to Do About It
The Cray-1 supercomputer, the world’s fastest back in the 1970s, does not look like a supercomputer. It looks like a mod version of that carnival ride The Round Up, the one where you stand, strapped in, as it dizzies you up. It’s surrounded by a padded bench that conceals its power supplies, like a cake donut, if the hole were capable of providing insights about nuclear weapons.
After Seymour Cray first built this computer, he gave Los Alamos National Laboratory a six-month free trial. But during that half-year, a funny thing happened: The computer experienced 152 unattributable memory errors. Later, researchers would learn that cosmic-ray neutrons can slam into processor parts, corrupting their data. The higher you are, and the bigger your computers, the more significant a problem this is. And Los Alamos—7,300 feet up and home to some of the world’s swankiest processors—is a prime target.
The world has changed a lot since then, and so have computers. But space has not. And so Los Alamos has had to adapt—having its engineers account for space particles in its hard- and software. “This is not really a problem we’re having,” explains Nathan DeBardeleben of the High Performance Computing Design group. “It’s a problem we’re keeping at bay.”
For modern supercomputers, starting with one called Q, this is a big deal. Installed in 2003, Q was much quicker than the Cray-1, and it churned through calculations on the country's nest egg of nuclear weapons. But it crashed more than expected—the first failures that caused Los Alamos scientists to really worry about cosmic rays, charged particles that stream in from outer space. They collide with atoms in the atmosphere, and the whole mess breaks apart into smaller particles. “They literally make these showers that just rain down on us,” says Sean Blanchard, also of the High Performance Computing Design group. And some of those raindrops are neutrons—which are bad news.
“They can cause computer memory to flip bits,” says DeBardeleben, “a 0 to 1 or 1 to 0.” That doesn’t much matter for your home computer. But Los Alamos has big number-crunchers. The early-aughts' Q, for instance, called to mind grocery-store aisles. And today, the facility has rows of computer racks covering an area the size of a football field, and all the computers on that football field may be working on the same problem. Just as a football field sees a larger volume of rain than a back yard, supercomputers see more cosmic-ray neutrons than your MacBook.
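To see what a single flipped bit can do to a number, here is a minimal Python sketch; the value and the choice of which bit to flip are purely illustrative:

```python
import struct

# A 64-bit floating-point number as a supercomputer might hold it in memory.
value = 3.141592653589793

# Reinterpret the float's bytes as an integer so individual bits are visible.
bits = int.from_bytes(struct.pack(">d", value), "big")

# A neutron strike is, in effect, an unwanted XOR of a single bit.
flipped = bits ^ (1 << 52)  # flip the lowest bit of the exponent field

corrupted = struct.unpack(">d", flipped.to_bytes(8, "big"))[0]
print(value, "->", corrupted)  # one flipped bit, and pi roughly doubles
```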
After Q, the lab's engineers truly understood that neutrons are not neutral parties, so now they try to preempt problems. Before Los Alamos installs new equipment, like its Trinity machine, engineers perform a kind of cosmic stress-test, placing the electronics in a beam of neutrons—many more than cascade from the sky at any given time—and watching what happens. “We take parts and make them radioactive and make them crash,” explains Blanchard. They will also soon place neutron detectors inside the supercomputing center, to measure the strength of the storm. If you know how many neutrons you’re getting, and you know how they make computer parts behave, "you can predict the lifetime of your electronics,” says Suzanne Nowicki, a physicist in the lab’s space science and applications group.
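A back-of-the-envelope version of that prediction looks something like the sketch below. Every number in it is a placeholder, not a Los Alamos measurement; the point is how an accelerated beam test scales back down to real-world conditions:

```python
# Accelerated neutron-beam test: expose a part to a beam far more intense
# than the natural shower, count the upsets, then scale back down.
errors_in_beam = 40        # upsets observed during the test (made up)
beam_hours = 2.0           # hours the part sat in the beam (made up)
acceleration = 1.0e6       # beam flux vs. natural flux at 7,300 feet (made up)

# Expected upset rate per part under natural conditions.
natural_rate = (errors_in_beam / beam_hours) / acceleration  # upsets per hour

# A supercomputer is thousands of such parts all catching rain at once.
parts_in_machine = 20_000
machine_rate = parts_in_machine * natural_rate

print(f"Per part: {natural_rate:.1e} upsets per hour")
print(f"Whole machine: one upset every {1 / machine_rate:.1f} hours on average")
```

A rate far too small to ever notice in a single chip becomes a routine event across a football field of them.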
Supercomputers are usually smart enough to know if something has gone wrong, to feel that flipped bit like you’d feel someone tugging on a single strand of hair. And when that happens, the system usually just reports the error and rights itself. But sometimes, says Blanchard, the computer is more pessimistic. “I have an error. Too many bits flipped,” he mimics. “I can’t fix it, but I wanted you to know it happened.”
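That detect-and-report behavior typically comes from error-correcting code (ECC) memory, which stores extra check bits alongside every word of data: enough to pin down and fix a single flipped bit, and to notice, though not fix, two. Here is a generic toy version in Python, on just four data bits rather than the 64-bit words real hardware protects; it is a sketch of the idea, not a description of Los Alamos's actual machines:

```python
# Toy SECDED code: single-error correction, double-error detection,
# built from a Hamming(7,4) code plus one overall parity bit.

def encode(d):                 # d is a list of 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p4 = d[1] ^ d[2] ^ d[3]
    code = [p1, p2, d[0], p4, d[1], d[2], d[3]]   # Hamming(7,4) layout
    overall = sum(code) % 2                        # extra parity bit
    return code + [overall]

def check(word):
    syndrome = 0
    for position, bit in enumerate(word[:7], start=1):
        if bit:
            syndrome ^= position          # points at a single flipped bit
    even_overall = sum(word) % 2 == 0
    if syndrome == 0 and even_overall:
        return "clean"
    if not even_overall:                  # an odd number of flips: assume one
        return f"one bit flipped at position {syndrome}: correctable"
    return "two bits flipped: uncorrectable, report it and crash"

word = encode([1, 0, 1, 1])
word[2] ^= 1                              # one neutron strike
print(check(word))
word[5] ^= 1                              # a second strike on the same word
print(check(word))
```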
When that occurs at Los Alamos, they crash the computers—intentionally. It's like falling down on purpose when you're skiing, because that will hurt less than whatever else is about to happen. But you don't have to walk back to the top of the slope and start all over again: The engineers have created "checkpoints" throughout the quest for answers. It's like the save-spots in video games: If you die, you don't have to start all over. You start at the last spot where you cached your achievements. Supercomputers can do the same kind of save.
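In code, the checkpoint-and-restart idea can be as simple as the sketch below; the file name and the stand-in functions are hypothetical, not how the lab's software is actually written:

```python
import os
import pickle

CHECKPOINT = "state.ckpt"          # hypothetical checkpoint file

def initial_state():
    return 0.0                     # stand-in for the real starting conditions

def advance(state):
    return state + 1.0             # stand-in for one step of the real calculation

def run(total_steps=1_000_000, save_every=10_000):
    # Resume from the last save-spot if one exists; otherwise start fresh.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            step, state = pickle.load(f)
    else:
        step, state = 0, initial_state()

    while step < total_steps:
        state = advance(state)
        step += 1
        if step % save_every == 0:
            # Write to a temporary file first, then rename, so a crash
            # mid-save can never destroy the previous good checkpoint.
            with open(CHECKPOINT + ".tmp", "wb") as f:
                pickle.dump((step, state), f)
            os.replace(CHECKPOINT + ".tmp", CHECKPOINT)
    return state
```

If a flipped bit forces a deliberate crash partway through, the job restarts from the most recent save and loses minutes of work instead of days.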
The real problem, though, is "silent data corruption." That's when the bits flip and no one notices. The answer that you think is right may actually be a neutron-induced dream. That's why the preemptive work is so important: The engineers know what to expect, and how often, and can keep an eye out for it. With that knowledge, the team hopes to turn what could have been silent errors into screaming errors. But if something does slip through, it's possible the flesh will catch it. Usually, Los Alamos doesn't say, "Here's your answer!" until an actual human checks the results to see whether they make sense.
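One generic way to turn a silent error into a screaming one, not necessarily the lab's own method, is to run the critical piece of a calculation twice and refuse to accept answers that disagree:

```python
def run_checked(compute, *args):
    """Run the same calculation twice and compare the results.
    A disagreement means a bit probably flipped somewhere, so raise a loud
    error instead of quietly returning a wrong answer."""
    first = compute(*args)
    second = compute(*args)
    if first != second:
        raise RuntimeError("results disagree: possible silent data corruption")
    return first

# Stand-in for a real simulation step:
print(run_checked(sum, range(1_000_000)))
```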
That personal intervention happens partly because Los Alamos does crucial research on topics that affect many other people. “The laboratory—the Department of Energy in general—studies climate change, new drugs, epidemiology, the spreading of diseases, wildfire modeling, all kinds of disease modeling, materials science, fragility of new metals,” explains Blanchard. And, as Blanchard adds after this list, Los Alamos exists in the first place because humans (some of them here, actually, at this lab) created nuclear weapons. “We’re a nuclear weapons lab,” Blanchard says. “Our job is stockpile stewardship. Our job is to make sure it’s secure and works as designed and doesn’t work when it isn’t supposed to.”
Because of nuclear test bans, the only legit way to stop worrying and learn to steward the bomb supply is to simulate—on a supercomputer—what’s going on inside. And so this place that concerns itself with radiation on Earth also must concern itself with radiation from space. Because whatever work supercomputers do in the future, one thing is certain: “They’re a bigger target every year,” says Blanchard.