At Riverlane, we’re building the quantum error correction stack to make reliable quantum computers from unreliable qubits.
Building the quantum error correction stack requires a range of multidisciplinary work from the cutting edge of theoretical physics to solving the world’s most complex engineering challenges.
My work sits on the engineering side. In this new blog series, Engineering Quantum Error Correction, the team and I will take a deep dive into the problems that must be solved to make quantum computers useful, sooner.
In this post, we’re examining deadtime and a new patent application from Riverlane that tackles this issue head on, helping to speed up your computations to the point where you won't even have time to put the kettle on (sorry). Here's how:
Quantum error correction is a set of techniques used to protect the information stored in qubits from errors and decoherence caused by noise. To achieve quantum error correction at scale (and with the required levels of accuracy, speed and resource efficiency) is a monumental engineering challenge.
Our Quantum Error Correction Stack broadly comprises two systems: Control and Decode. Control reads and writes the qubits, similar to the read and write functions in a classical computer. Decode ensures the quality and reliability of the qubits. You can think of them together as a Solid-State Disk controller: one part reads and writes the memory cells; another identifies and corrects any errors encountered.
In this post, we’ll look at Control.
Qubits must be controlled and manipulated using control and measurement signals to run their complex calculations. These signals are high accuracy, high speed Radio Frequency (RF) pulses, produced using a combination of off-the-shelf and custom hardware.
The electronic systems behind these control and measurement signals serve as a bridge between quantum programming language and quantum information processors.
In Figure 1, we have a typical configuration. The interface CPU (Central Processing Unit) is where the user essentially writes the algorithms to control the qubits. It can be a user’s laptop or a compute node of an HPC cluster. This code is then compiled and sent to a second CPU that handles the configuration of one or more FPGAs (field-programmable gate arrays).
FPGAs are programmable integrated circuits. Because the circuitry inside an FPGA is not hard-etched, it can be reprogrammed and updated as needed to best meet the needs of the Control system.
In a super-dynamic quantum ecosystem, this flexibility is key and alternative electronics are not an option. ASIC (application-specific integrated circuit) solutions are still cost-prohibitive, for example, and CPUs are not equipped with the necessary components to do the job.
Figure 1: Standard quantum stack implementation.
The FPGA itself interfaces with the qubits and implements the quantum gates, i.e., a transformation of the state of the underlying qubits. As a NOT operation transforms a ‘0’ into a ‘1’, quantum gates transform inputs into outputs, thus becoming building blocks for full quantum computations (jargon: quantum circuits).
A quantum circuit could take anything from a few microseconds to hundreds of milliseconds to run, depending on the qubit type, the complexity of the calculation (jargon: circuit depth) and other factors.
This poses an interesting challenge: how can you minimise the deadtime between consecutive quantum routines and consequently increase the number of routines run per hour?
Or phrased from a business perspective: how can you maximise your return on investment?
This is a tangible problem: waiting times on these systems can be extremely long (minutes or even hours), and that has a knock-on effect on productivity (but gives an excuse for a long tea break).
The deadtime challenge
If we can reduce deadtime between quantum routines, then the efficiency of our quantum computers increases. We achieve more runs per day and that means research can move faster as the time between developing and testing an idea shortens.
Reducing deadtime is a difficult problem. When we looked at it for the first time, we realised that even well-known approaches will fall short:
- Boost utilisation by running parallel jobs on the same QPU. If we could allow multiple computations on the same quantum computer to happen in parallel, we would increase utilisation per hour and reduce the impact of deadtime. But quantum processing units are not yet ready for multi-user access. User A’s computations would easily “contaminate” User B’s ones because qubits are highly sensitive to their environmental conditions. It would be like allowing multiple users to have undisciplined and concurrent access to a shared memory in a classical computer. Things would quickly get messy and (for a quantum computer) the calculation would be rendered useless.
- Increase the speed at which we talk to the Quantum Stack. We need to efficiently transmit data from the classical CPU to the front-end electronics of a quantum computer in the QPU. Increasing the speed of the links is not per se the solution: in a distributed, heterogeneous system, any bottleneck in the data transmission chain will invalidate any speed-up associated with faster links.
Robust partitioning of quantum resources could help with challenge #1 and a multi-chip quantum processor (like the one developed by our partner, Rigetti), might lead the way to user partitioning. Our work in this area will also lead to more flexible control stacks (which we will write about in a future blog post).
When it comes to challenge #2, this is a reincarnation of the classic consumer/producer problem: if the consumer is not ready for the data, then the system will either stall or throw an error.
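To make the consumer/producer framing concrete, here is a minimal Python sketch. It is purely illustrative and not Riverlane's implementation: the `link` queue and the `send` helper are hypothetical stand-ins for a communication link with a small receive buffer. When the consumer stops draining the buffer, the producer either has to wait (stall) or gets a rejection (error).

```python
import queue

# Hypothetical stand-in for a link with a two-entry receive buffer.
link = queue.Queue(maxsize=2)

def send(config, block=False, timeout=None):
    """Producer side: return True if the consumer-side buffer accepted the data."""
    try:
        link.put(config, block=block, timeout=timeout)
        return True
    except queue.Full:
        # The consumer was not ready: report an error instead of stalling.
        return False

# The consumer never drains the buffer, so the third send is rejected.
results = [send(f"config-{i}") for i in range(3)]
print(results)  # [True, True, False]

link.get()               # the consumer finally takes one item...
print(send("config-3"))  # ...so there is room again: True
```

Calling `send(..., block=True)` instead would model the "stall" case: the producer simply waits until the consumer frees a slot.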
Why might a consumer not be ready? This comes down to the way in which the control and measurement signals are run in the electronic system. All incarnations of quantum stacks run a quantum routine in stages (as shown in Figure 1). These stages are:
- The quantum circuit (written by the user) is compiled either on their machine or on an HPC compute node and passed to the device CPU (0).
- The device CPU talks to various FPGAs and sends them their new configuration (1,2,3).
- The compiled code is converted into FPGAs/low-level electronics instructions/read-write operations (3).
- This kickstarts the activities and execution of the quantum routines and the resulting measurement data is passed back to the device CPU where the next control signal is compiled (4).
- This loop is executed until the calculation (aka all the quantum routines) is complete.
Various solutions could reduce this deadtime, including:
- When a quantum routine is run, we can parallelise the compilation phase code (i.e., allowing multiple routines to be compiled in parallel) and store this information in a ready-to-transmit buffer. When one routine is complete, we are ready to fire out the next one with a reduced deadtime.
- The transmission between the user CPU and the device CPU can be accelerated with limited effort by leveraging low-latency, high-speed connections.
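The first idea above, compiling several routines in parallel and parking the results in a ready-to-transmit buffer, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: `compile_routine` is a hypothetical placeholder for a real compiler, and the circuit strings are made up.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def compile_routine(circuit: str) -> bytes:
    """Hypothetical compiler: turn a circuit description into a binary payload."""
    return circuit.upper().encode("ascii")

def precompile(circuits, workers=4):
    """Compile all circuits in parallel and queue the payloads in submission order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves input order, so routines stay in sequence.
        return deque(pool.map(compile_routine, circuits))

ready_to_transmit = precompile(["h q0", "cx q0 q1", "measure q0"])

while ready_to_transmit:
    payload = ready_to_transmit.popleft()  # the next routine is already compiled
    print(payload)
```

The point of the buffer is that the cost of compilation is paid ahead of time, so sending the next routine only involves popping an entry that already exists.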
But there are more bottlenecks to overcome – bottlenecks that our Control system is uniquely able to address.
The device CPU needs to propagate (part of) the configuration to multiple FPGAs, wait for the internal configuration to complete and finally start the quantum routine when all FPGAs are configured by sending some form of start command.
The first bottleneck occurs when we configure multiple FPGAs. Configuration entails setting various internal memories sequentially, normally using high-throughput internal connections (up to 4GB/s, or 8 times as fast as a standard SSD). Even setting 4KB of memory requires 2μs (more specifically, 500 clock cycles at 250MHz).
Our second bottleneck occurs when we buffer the configuration on the FPGA. Transmitting the same 4KB over an ultra-low-latency link will result in approximately 1μs of deadtime.
Even if we used highly optimised network interfaces, this still brings us to a total of 3μs of deadtime.
But there’s one last bottleneck to consider: when the configuration is complete a “job done” signal must go back to the device CPU. This will require transmitting back a small ACK (message sent to indicate whether the data packet was received), adding approximately 100ns to the deadtime. So, we now have a total of 3.1μs.
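The running total can be reproduced with a few lines of arithmetic. The sketch below only restates the figures quoted above; the variable names are ours.

```python
CLOCK_HZ = 250e6     # FPGA clock quoted in the text
WRITE_CYCLES = 500   # clock cycles needed to set 4KB of memory

configure_s = WRITE_CYCLES / CLOCK_HZ  # 500 cycles at 250MHz = 2 microseconds
transmit_s = 1e-6                      # approx. 4KB over an ultra-low-latency link
ack_s = 100e-9                         # approx. cost of the "job done" ACK

total_s = configure_s + transmit_s + ack_s
print(f"configure: {configure_s * 1e6:.1f} us")
print(f"total:     {total_s * 1e6:.1f} us")
```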
The solution? Build a fully streaming interface for configurations. Figure 2 depicts our approach.
Figure 2: Optimised stack for high throughput execution
The compilation (0) remains unchanged using this approach and improvements happen under the hood. The best way to explain these changes is to bring in the concept of a distributed system digital twin.
Digital twins are normally virtual representations (software) of a physical system (hardware or system). They allow different forms of control and monitoring of resources and have been instrumental in fields like factory automation. Our Control Stack is a fully distributed system that can be captured as a centralised software model, i.e., as a digital twin. But how does it help save resources and reduce deadtime?
By knowing the current configuration at any point in time, we can implement a three-step approach to remove existing bottlenecks.
First, we no longer need to transmit a full configuration using this approach (3-4). If only a subset of the memory changes, then we only need to transmit that increment. The Digital Twin (2) always has an up-to-date view of the system configuration and keeps track of all the changes happening (as it drives them).
As a result, the updates become much smaller and each FPGA can store a larger number of updates, creating a queue of ready-to-execute instructions at the lowest level of the hierarchy (i.e., where they need to be consumed).
This is important because internal FPGA memory is a precious commodity that must be prioritised for storing low-level signals and results. Now, we can keep this incremental information in the configurations buffer (5) that is only one stop away from the final destination.
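As a rough illustration of the incremental idea, the Python sketch below keeps a software mirror of one configuration memory and emits only the bytes that differ from the last known state. The `DigitalTwin` class, its `delta` method and the byte-level granularity are all assumptions made for this example; they are not taken from Riverlane's design.

```python
class DigitalTwin:
    """Hypothetical software mirror of one FPGA's configuration memory."""

    def __init__(self, size):
        self.memory = bytearray(size)  # assume the device starts zeroed

    def delta(self, new_config: bytes):
        """Return {address: value} for the bytes that differ, then update the mirror."""
        if len(new_config) != len(self.memory):
            raise ValueError("configuration size mismatch")
        changes = {
            addr: value
            for addr, (old, value) in enumerate(zip(self.memory, new_config))
            if old != value
        }
        self.memory[:] = new_config  # the twin drives the change, so it stays in sync
        return changes

twin = DigitalTwin(size=8)
first = twin.delta(bytes([1, 2, 3, 4, 0, 0, 0, 0]))
second = twin.delta(bytes([1, 2, 9, 4, 0, 0, 0, 7]))
print(first)   # {0: 1, 1: 2, 2: 3, 3: 4}
print(second)  # {2: 9, 7: 7}
```

In this toy case the second update carries two entries instead of eight, which is the kind of saving that lets more updates fit in a buffer close to where they are consumed.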
Second, we are removing the impact of retransmissions. In the classical scheme, if we send a configuration and a network error occurs, we need to retransmit and hold until completed.
Now, by leveraging our multi-stage buffering approach we can reduce (close to zero) the chances of not having the next entry in the configuration buffer ready when needed (by allowing sufficiently deep buffers).
This potentially confines the configuration latency to the first execution, whilst the remaining quantum routines will be executed back-to-back. In other words, there is a small period of deadtime when the configuration first runs – but subsequent routines are not affected by deadtime.
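A toy timing model shows why buffering confines the latency to the first execution. It is a deliberately simplified sketch: the function name, the 10μs routine duration and the assumption that the next configuration is delivered entirely while the previous routine runs are all illustrative choices, not measured values.

```python
def total_deadtime_us(n_routines, run_us, config_us, buffered):
    """Idle time on the QPU across n_routines under a simple timing model."""
    if not buffered:
        # Every routine waits for its own configuration to arrive.
        return n_routines * config_us
    # The first routine still waits. Later configurations are delivered while
    # the previous routine runs, so only any excess over run_us is exposed.
    exposed = max(0.0, config_us - run_us)
    return config_us + (n_routines - 1) * exposed

unbuffered = total_deadtime_us(1000, run_us=10.0, config_us=3.1, buffered=False)
buffered = total_deadtime_us(1000, run_us=10.0, config_us=3.1, buffered=True)
print(f"unbuffered: {unbuffered:.1f} us, buffered: {buffered:.1f} us")
```

If a routine were shorter than the configuration latency, the `exposed` term would become non-zero, which is where deeper buffers and smaller incremental updates help.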
Third, if we combine the digital twin with the compilation information (i.e., the expected duration of the quantum routine), then this information can be fed back to the compilation stage to further optimise compiler buffers, implement smart scheduling policies and more. The entire system can optimise itself, further reducing deadtime and improving the overall user experience by reducing the time users wait for the results to come back.
If we run the same analysis as before, and we assume that on average we can reduce the size of the configuration down to 2KB (i.e., only 50% of the configuration needs to change on average), we get down to 1μs per configuration, speeding up the system by a factor of 3.
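One way to reproduce these numbers is sketched below. It rests on two assumptions that the text does not spell out: that the memory write time scales linearly with the amount of data changed, and that transmission and the ACK overlap with the previous routine thanks to buffering, so only the write stays on the critical path.

```python
BASELINE_US = 3.1    # total deadtime from the earlier analysis
FULL_WRITE_US = 2.0  # time to set the full 4KB of FPGA memory

def incremental_write_us(changed_fraction):
    """Assumes write time scales linearly with the fraction of memory changed."""
    return FULL_WRITE_US * changed_fraction

streamed_us = incremental_write_us(0.5)  # 2KB of the 4KB changes
speedup = BASELINE_US / streamed_us
print(f"{streamed_us:.1f} us per configuration, speed-up of about {speedup:.1f}x")
```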
In the current era of prototyping of new algorithms and solutions, where fast turnaround is key, this speedup will help you get more “innovation x $” done.
So, you won’t have an excuse to put the kettle on (sorry, again) as your computation won’t have to wait as long in a queue and you will get back your results much sooner than before. You can find out more about our Control system here.