A rack of electronic equipment attached to an oscilloscope.

It was DNS power sequencing

A headshot of Liam McSherry

Liam

Looking back, I often find a lot of the ‘critical’ issues I dealt with were much simpler than they felt at the time.

I worked at Oxford Ionics for a little while, where they're building quantum computers based on trapped ions (part of a test system, Sandcastle, is pictured).

Almost everything at OI was driven by the science, so things getting in the way of the science tended to be high priority. We were in the process of rolling out Kasli-SoC cards, little embedded systems that run the hard real-time parts of ‘experiments,’ when we noticed reports of failures coming in. This was a two-for-one on importance—we were being limited by the lower-powered Kasli cards and, now, it looked like we might not have a viable solution—so it fell to me as the resident FPGA person to figure out what the actual situation was and fix it.

As ever, the first step in fixing a bug is finding it.

A Kasli-SoC, in action.

Kasli-SoC has 12 headers for connecting other cards, and the bug-reporter had noticed that the failures only seemed to happen with certain headers connected to distinct banks of pins. That's better than nothing: even though their concern about signalling standards definitely isn't the issue,[1] it could still be something to do with the banks.

What followed was a mind-numbing process of attaching some Fastino DACs to some Kasli-SoCs and power-cycling them hundreds of times in different positions. But, intermittently, we did see failures! And there was a very loose correlation with the pin banks, as had been reported.

Frustratingly, why we were seeing failures wasn't obvious.

The correlation was a red herring, later tests would show. There was no fault in the Kasli-SoC header circuits or how they'd been laid out, all power supplies were firmly within spec, and there were no strange transients during power-up. We took apart some cables to directly probe the LVDS in-circuit, but the waveforms were compliant and almost entirely unremarkable.[3] All the while, the failures were sporadic enough that running any test took ages—so long, as it happens, that a fix to the originally reported issue was figured out: at some point, differential terminations for Mirny and Sampler, two other kinds of card you could plug into Kasli-SoC, were forgotten.

Most likely, this distorted the signal enough that it was no longer proper LVDS.

But wait—they are enabled for Fastino. So why could we produce these failures?

The fix for Mirny and Sampler was a real help here as it told me I needed to stop looking at Kasli-SoC. All the evidence was telling me it was fine, and the difference in test setups was that I had been using Fastino. As a first step I repeated the power-cycle testing, this time with Fastinos from different batches to try to rule out a manufacturing defect. This was probably a misguided choice and a bit of a waste of time, but the issue was still reproducible so it gave me confidence that there really was a Fastino issue.
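To give a feel for why sporadic failures make this kind of testing so slow, here is a rough sketch of how many clean power cycles in a row it takes before "no failures seen" starts to mean something. It assumes each power cycle fails independently with a fixed probability, which is a simplification, and the failure rates below are hypothetical rather than measured figures.

```python
import math


def cycles_for_confidence(failure_rate: float, confidence: float = 0.95) -> int:
    """Clean power cycles needed before a failure-free run is convincing.

    Assumes each cycle fails independently with probability `failure_rate`.
    The chance of n clean cycles in a row is (1 - failure_rate) ** n, and we
    want that chance to fall to (1 - confidence) or below.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    # Hypothetical failure rates, chosen only for illustration.
    for rate in (0.10, 0.02, 0.005):
        n = cycles_for_confidence(rate)
        print(f"failure rate {rate:.1%}: {n} clean cycles for 95% confidence")
```

The rarer the failure, the more cycles every single experiment costs, which is why any change to the test setup took so long to evaluate.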

My next idea was properly simple: the FPGA design for Fastino is open source, so I added three toggling output pins:

  1. One from the 35.71MHz link word clock, wired combinatorially through a buffer
  2. One from the 250MHz link bit clock, toggling a register each cycle
  3. One from the 47.62MHz DAC serial clock, toggling a register each cycle
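As a quick aid for reading the scope, the sketch below works out what frequency each debug pin should show when everything is healthy. A buffered clock appears at its own frequency, while a register that flips on every clock cycle produces a square wave at half the clock frequency. The pin labels are only for illustration, and the clock values are the ones listed above.

```python
def expected_pin_mhz(clock_mhz: float, registered: bool) -> float:
    """Frequency seen on a debug pin driven from `clock_mhz`.

    A combinatorial buffer passes the clock straight through. A register
    toggled once per cycle completes one full period every two cycles,
    so it shows half the clock frequency.
    """
    return clock_mhz / 2 if registered else clock_mhz


# (label, clock frequency in MHz, driven through a register?)
DEBUG_PINS = [
    ("link word clock, buffered", 35.71, False),
    ("link bit clock, toggled register", 250.0, True),
    ("DAC serial clock, toggled register", 47.62, True),
]

if __name__ == "__main__":
    for label, clock_mhz, registered in DEBUG_PINS:
        pin_mhz = expected_pin_mhz(clock_mhz, registered)
        print(f"{label}: expect about {pin_mhz:.2f} MHz on the pin")
```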

I knew that the failures were ‘all or nothing,’ and that Kasli-SoC was outputting the right clock. I also knew that, under failure, we couldn't even control the front-panel LEDs, so the comms link was entirely down. I suspected that one of the PLLs on Fastino was failing to lock. A mix of combinatorial and sequential logic would show that: if a PLL was the culprit, only the sequential logic would do nothing.

An oscilloscope trace of nothing happening.
An oscilloscope trace showing (top 4 traces) the output clock from Kasli-SoC, (next 4) the Fastino outputs from my additions, and (centre) detailed view of an output clock from Kasli-SoC.

After a minor struggle to provoke another failure, none of the outputs toggled!

This was interesting because the combinatorial buffer had failed. There isn't much to go wrong there: the signal enters the FPGA, is buffered, and exits again. It seemed unlikely to be a faulty buffer or FPGA fabric, so that could mean they weren't being set up properly. But it's a buffer, and the user-proddable configuration is largely just ‘in’ and ‘out.’ And I'd copied my change from a working example. The actual configuration is done when the FPGA loads its bitstream.
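The reasoning here can be laid out as a small decision table. The sketch below is only a summary of the deduction, not part of the Fastino design: the function name and the wording of each conclusion are made up for illustration, and it assumes the incoming clock from Kasli-SoC is known to be good and that the two registers are clocked from PLL-derived clocks.

```python
def diagnose(buffer_toggles: bool, bit_reg_toggles: bool, dac_reg_toggles: bool) -> str:
    """Map the three debug-pin observations to the most likely suspect.

    Assumes the input clock is present and correct, so a silent
    combinatorial buffer cannot be blamed on the clock source.
    """
    if buffer_toggles and bit_reg_toggles and dac_reg_toggles:
        return "all clocks present: look elsewhere"
    if buffer_toggles:
        # The clock gets into the FPGA, but something derived from it is dead.
        return "registered logic stalled: suspect a PLL failing to lock"
    if not bit_reg_toggles and not dac_reg_toggles:
        # Not even a plain buffer works, so the design itself is likely absent.
        return "nothing toggles: suspect the FPGA is not configured"
    return "inconsistent result: check the probing"


if __name__ == "__main__":
    # The observation described above: none of the outputs toggled.
    print(diagnose(False, False, False))
```

Seen this way, the dead buffer is what moves suspicion away from the PLLs and towards bitstream loading.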

Fastino uses a Lattice iCE40HX, which is very small and fairly simple. It can act as an SPI master and pull bitstreams from a flash chip. Could something be going wrong there?

In honesty, at this point I didn't really know what I was looking for. We had a device intermittently failing to ‘boot’ and the only real moving part was the storage. That sort of thing is usually timing, right? So I looked over the FPGA and SPI flash datasheets for timing parameters until a little note, tucked away in the conditions for the iCE40's power supply ramp rate, caught my eye:

Configuring from MSPI. Vcc and Vpp_SPI to be powered 0.25 ms before Vpp_2V5.

Lattice DS 02029, pg. 22

So, there was a specific power supply startup order for the SPI master.
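The datasheet note boils down to a simple timing check, sketched below. The function name and the example rail times are hypothetical, and the sketch leaves open what voltage counts as "powered" for each rail, so the times are whatever moments you take each rail to be up.

```python
REQUIRED_LEAD_MS = 0.25  # from the datasheet note quoted above


def mspi_sequencing_margin_ms(t_vcc_ms: float, t_vpp_spi_ms: float,
                              t_vpp_2v5_ms: float) -> float:
    """Margin against the sequencing rule, in milliseconds.

    Each argument is the time at which that rail counts as powered.
    Both Vcc and Vpp_SPI must be up REQUIRED_LEAD_MS before Vpp_2V5,
    so the later of the two sets the actual lead time.
    Zero or positive means the rule is met; negative means it is violated.
    """
    later_early_rail_ms = max(t_vcc_ms, t_vpp_spi_ms)
    lead_ms = t_vpp_2v5_ms - later_early_rail_ms
    return lead_ms - REQUIRED_LEAD_MS


if __name__ == "__main__":
    # Hypothetical rail timings, for illustration only.
    examples = {
        "2.5 V rail well after the others": (1.0, 1.0, 2.0),
        "2.5 V rail close behind the others": (1.0, 1.0, 1.1),
    }
    for label, times in examples.items():
        margin = mspi_sequencing_margin_ms(*times)
        verdict = "ok" if margin >= 0 else "violated"
        print(f"{label}: margin {margin:+.2f} ms ({verdict})")
```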

And while I didn't know Fastino like the back of my hand, I did know that it didn't use a fancy PMIC or anything to do sequencing. From there, since Fastino is open source hardware as well as RTL, I should only need to look at the schematic to be able to tell whether this sequencing is happening. Helpfully, the schematics even include a note.

Power supply sequencing note on the Fastino schematics.
A note in the Fastino schematics listing in which order the power supplies start up.

Unfortunately, this note confirms that things are not ideal.

For the iCE40, Vcc and Vpp_SPI are both the 3.3-volt supply, which starts at the same time as the 2.5-volt one. While Fastino does derive its 2.5-volt supply from its 3.3-volt supply, and while the AUR9705 regulator does have a soft start, it seems like it isn't deterministic enough to ensure consistently correct start-up.

At that point at Oxford Ionics, we had 30–40 Fastinos in a mix of quantum systems, test systems, and storage. Replacing them would be painful, a new PCB spin would be slow, and in the interim we had the issue that we couldn't tell when a Fastino was broken. ARTIQ didn't support the data-return link to Kasli-SoC and in our use-case[4] we couldn't easily tell experimentally when one of them had broken. Plus, since this was a Fastino fault and not a Kasli-SoC fault, it meant that our existing Kasli-based racks could be vulnerable to it, too. That's a fair few compounding factors, so what could we do to resolve things sensibly?

Well, in the spirit of things being simpler than they first seem, it didn't matter:

But, even if simple, it was satisfying to close the book on it.

It put me in mind of the sysadmin's haiku:

It's not DNS

There's no way it's DNS

It was DNS