Categories
Offsites

Introducing Omnimattes: A New Approach to Matte Generation using Layered Neural Rendering

Image and video editing operations often rely on accurate mattes — images that define a separation between foreground and background. While recent computer vision techniques can produce high-quality mattes for natural images and videos, allowing real-world applications such as generating synthetic depth-of-field, editing and synthesising images, or removing backgrounds from images, one fundamental piece is missing: the various scene effects that the subject may generate, like shadows, reflections, or smoke, are typically overlooked.

In “Omnimatte: Associating Objects and Their Effects in Video”, presented at CVPR 2021, we describe a new approach to matte generation that leverages layered neural rendering to separate a video into layers called omnimattes that include not only the subjects but also all of the effects related to them in the scene. Whereas a typical state-of-the-art segmentation model extracts masks for the subjects in a scene, for example, a person and a dog, the method proposed here can isolate and extract additional details associated with the subjects, such as shadows cast on the ground.

A state-of-the-art segmentation network (e.g., MaskRCNN) takes an input video (left) and produces plausible masks for people and animals (middle), but misses their associated effects. Our method produces mattes that include not only the subjects, but their shadows as well (right; individual channels for person and dog visualized as blue and green).

Also unlike segmentation masks, omnimattes can capture partially-transparent, soft effects such as reflections, splashes, or tire smoke. Like conventional mattes, omnimattes are RGBA images that can be manipulated using widely-available image or video editing tools, and can be used wherever conventional mattes are used, for example, to insert text into a video underneath a smoke trail.

Layered Decomposition of Video
To generate omnimattes, we split the input video into a set of layers: one for each moving subject, and one additional layer for stationary background objects. In the example below, there is one layer for the person, one for the dog, and one for the background. When merged together using conventional alpha blending, these layers reproduce the input video.

Besides reproducing the video, the decomposition must capture the correct effects in each layer. For example, if the person’s shadow appears in the dog’s layer, the merged layers would still reproduce the input video, but inserting an additional element between the person and dog would produce an obvious error. The challenge is to find a decomposition where each subject’s layer captures only that subject’s effects, producing a true omnimatte.

Our solution is to apply our previously developed layered neural rendering approach to train a convolutional neural network (CNN) to map the subject’s segmentation mask and a background noise image into an omnimatte. Due to their structure, CNNs are naturally inclined to learn correlations between image effects, and the stronger the correlation between the effects, the easier for the CNN to learn. In the above video, for example, the spatial relationships between the person and their shadow, and the dog and its shadow, remain similar as they walk from right to left. The relationships change more (hence, the correlations are weaker) between the person and the dog’s shadow, or the dog and the person’s shadow. The CNN learns the stronger correlations first, leading to the correct decomposition.

The omnimatte system is shown in detail below. In a preprocess, the user chooses the subjects and specifies a layer for each. A segmentation mask for each subject is extracted using an off-the-shelf segmentation network, such as MaskRCNN, and camera transformations relative to the background are found using standard camera stabilization tools. A random noise image is defined in the background reference frame and sampled using the camera transformations to produce per-frame noise images. The noise images provide image features that are random but consistently track the background over time, providing a natural input for the CNN to learn to reconstruct the background colors.

The rendering CNN takes as input the segmentation mask and the per-frame noise images and produces the RGB color images and alpha maps, which capture the transparency of each layer. These outputs are merged using conventional alpha-blending to produce the output frame. The CNN is trained from scratch to reconstruct the input frames by finding and associating the effects not captured in a mask (e.g., shadows, reflections or smoke) with the given foreground layer, and to ensure the subject’s alpha roughly includes the segmentation mask. To make sure the foreground layers only capture the foreground elements and none of the stationary background, a sparsity loss is also applied on the foreground alpha.

A new rendering network is trained for each video. Because the network is only required to reconstruct the single input video, it is able to capture fine structures and fast motion in addition to separating the effects of each subject, as seen below. In the walking example, the omnimatte includes the shadow cast on the slats of the park bench. In the tennis example, the thin shadow and even the tennis ball are captured. In the soccer example, the shadow of the player and the ball are decomposed into their proper layers (with a slight error when the player’s foot is occluded by the ball).

This basic model already works well, but one can improve the results by augmenting the input of the CNN with additional buffers such as optical flow or texture coordinates.

Applications
Once the omnimattes are generated, how can they be used? As shown above, we can remove objects, simply by removing their layer from the composition. We can also duplicate objects, by repeating their layer in the composition. In the example below, the video has been “unwrapped” into a panorama, and the horse duplicated several times to produce a stroboscopic photograph effect. Note that the shadow that the horse casts on the ground and onto the obstacle is correctly captured.

A more subtle, but powerful application is to retime the subjects. Manipulation of time is widely used in film, but usually requires separate shots for each subject and a controlled filming environment. A decomposition into omnimattes makes retiming effects possible for everyday videos using only post-processing, simply by independently changing the playback rate of each layer. Since the omnimattes are standard RGBA images, this retiming edit can be done using conventional video editing software.

The video below is decomposed into three layers, one for each child. The children’s initial, unsynchronized jumps are aligned by simply adjusting the playback rate of their layers, producing realistic retiming for the splashes and reflections in the water.

In the original video (left), each child jumps at a different time. After editing (right), everyone jumps together.

It’s important to consider that any novel technique for manipulating images should be developed and applied responsibly, as it could be misused to produce fake or misleading information. Our technique was developed in accordance with our AI Principles and only allows rearrangement of content already present in the video, but even simple rearrangement can significantly alter the effect of a video, as shown in these examples. Researchers should be aware of these risks.

Future Work
There are a number of exciting directions to improve the quality of the omnimattes. On a practical level, this system currently only supports backgrounds that can be modeled as panoramas, where the position of the camera is fixed. When the camera position moves, the panorama model cannot accurately capture the entire background, and some background elements may clutter the foreground layers (sometimes visible in the above figures). Handling fully general camera motion, such as walking through a room or down a street, would require a 3D background model. Reconstruction of 3D scenes in the presence of moving objects and effects is still a difficult research challenge, but one that has seen promising recent progress.

On a theoretical level, the ability of CNNs to learn correlations is powerful, but still somewhat mysterious, and does not always lead to the expected layer decomposition. While our system allows for manual editing when the automatic result is imperfect, a better solution would be to fully understand the capabilities and limitations of CNNs to learn image correlations. Such an understanding could lead to improved denoising, inpainting, and many other video editing applications besides layer decomposition.

Acknowledgements
Erika Lu, from the University of Oxford, developed the omnimatte system during two internships at Google, in collaboration with Google researchers Forrester Cole, Tali Dekel, Michael Rubinstein, William T. Freeman and David Salesin, and University of Oxford researchers Weidi Xie and Andrew Zisserman.

Thank you to the friends and families of the authors who agreed to appear in the example videos. The “horse jump low”, “lucia”, and “tennis” videos are from the DAVIS 2016 dataset. The soccer video is used by permission from Online Soccer Skills. The car drift video was licensed from Shutterstock.

Categories
Offsites

Recreating Natural Voices for People with Speech Impairments

On June 2nd, 2021, Major League Baseball in the United States celebrated Lou Gehrig Day, commemorating both the day in 1925 that Lou Gehrig became the Yankees’ starting first baseman, and the day in 1941 that he passed away from amyotrophic lateral sclerosis (ALS, also known as Lou Gehrig’s disease) at the age of 37. ALS is a progressive neurodegenerative disease that affects motor neurons, which connect the brain with the muscles throughout the body, and govern muscle control and voluntary movements. When voluntary muscle control is affected, people may lose their ability to speak, eat, move and breathe.

In honor of Lou Gehrig, former NFL player and ALS advocate Steve Gleason, who lost his ability to speak due to ALS, recited Gehrig’s famous “Luckiest Man” speech at the June 2nd event using a recreation of his voice generated by a machine learning (ML) model. Gleason’s voice recreation was developed in collaboration with Google’s Project Euphonia, which aims to empower people who have impaired speaking ability due to ALS to better communicate using their own voices.

Steve Gleason, who lost his voice to ALS, worked with Google’s Project Euphonia to generate a speech in his own voice in honor of Lou Gehrig. A portion of Gleason’s speech was broadcast in ballparks across the country during the 4th inning on June 2nd, 2021.

Today we describe PnG NAT, the model adopted by Project Euphonia to recreate Steve Gleason’s voice. PnG NAT is a new text-to-speech synthesis (TTS) model that merges two state-of-the-art technologies, PnG BERT and Non-Attentive Tacotron (NAT), into a single model. It demonstrates significantly better quality and fluency than previous technologies, and represents a promising approach that can be extended to a wider array of users.

Recreating a Voice
Non-Attentive Tacotron (NAT) is the successor to Tacotron 2, a sequence-to-sequence neural TTS model proposed in 2017. Tacotron 2 used an attention module to connect the input text sequence and the output speech spectrogram frame sequence, so that the model knows which part of the text to pay attention to when generating each time step of the synthesized speech spectrogram. Tacotron 2 was the first TTS model that was able to synthesize speech that sounds as natural as a person speaking. However, with extensive experimentation we discovered that there is a small probability that the model can suffer from robustness issues — such as babbling, repeating, or skipping part of the text — due to the inherent flexibility of the attention mechanism.

NAT improves upon Tacotron 2 by replacing the attention module with a duration-based upsampler, which predicts a duration for each input phoneme and upsamples the encoded phoneme representation so that the output length corresponds to the length of the predicted speech spectrogram. Such a change both resolves the robustness issue, and improves the naturalness of the synthesized speech. This approach also enables precise control of the speech duration for each phoneme of the input text while still maintaining highly natural synthesis quality. Because recordings of people with ALS often exhibit disfluent speech, this ability to exert per-phoneme control is key for achieving the fluency of the recreated voice.

Non-Attentive Tacotron (NAT) model.

While NAT addresses the robustness issue and enables precise duration control in neural TTS, we build upon it to further improve the natural language understanding of the TTS input. For this, we apply PnG BERT, which uses an approach similar to BERT, but is specifically designed for TTS. It is pre-trained with self-supervision on both the phoneme representation and the grapheme representation of the same content from a large text corpus, and then is used as the encoder of the TTS model. This results in a significant improvement of the prosody and pronunciation of the synthesized speech, especially in difficult cases.

Take, for example, the following audio, which was synthesized from a regular NAT model that takes only phonemes as input:

In comparison, the audio synthesized from PnG NAT on the same input text includes an additional pause that makes the meaning more clear.

The input text to both models is, “To cancel the payment, press one; or to continue, two.” Notice the different pause lengths before the ending “two” in the two versions. The word “two” in the version output by the regular NAT model could be confused for “too”. Because “too” and “two” have identical pronunciation (and thus the same phoneme representation), the regular NAT model does not understand which of the two is appropriate, and assumes it to be the word that more frequently follows a comma, “too”. In contrast, the PnG NAT model can more easily tell the difference, because it takes graphemes in addition to phonemes as input, and thus makes more appropriate pause.

The PnG NAT model integrates the pre-trained PnG BERT model as the encoder to the NAT model. The hidden representations output from the encoder are used by NAT to predict the duration of each phoneme, and are then upsampled to match the length of the audio spectrogram, as outlined above. In the final step, a non-attentive decoder converts the upsampled hidden representations into audio speech spectrograms, which are finally converted into audio waveforms by a neural vocoder.

PnG BERT and the pre-training objectives. Yellow boxes represent phonemes, and pink boxes represent graphemes.
PnG NAT: PnG BERT replaces the original encoder in the NAT model. The random masking for the Masked Language Model (MLM) pre-training is removed.

To recreate Steve Gleason’s voice, we first trained a PnG NAT model with recordings from 31 professional speakers, and then fine-tuned it with 30 minutes of Gleason’s recordings. Because these latter recordings were made after he was diagnosed with ALS, they exhibit signs of slurring. The fine tuned model was able to synthesize speech that sounds very similar to these recordings. However, because the symptoms of ALS were already present in Gleason’s speech, they exhibited some similar disfluencies.

To mitigate this, we leveraged the phoneme duration control of NAT as well as the model trained with professional speakers. We first predicted the durations of each phoneme for both a professional speaker and for Gleason, and then used the geometric mean of the two durations for each phoneme to guide the NAT output. As a result, the model is able to speak in Gleason’s voice, but more fluently than in the original recordings.

Here is the full version of the synthesized Lou Gehrig speech in Gleason’s voice:

<!– As a comparison, following is one of Gleason’s recordings that was used to train the model:

–>

Besides recreating voices for people with ALS, PnG NAT is also powering voices for a variety of customers through Google Cloud Custom Voice.

Project Euphonia
Of the millions of people around the world who have neurologic conditions that may impact their speech, such as ALS, cerebral palsy or Down syndrome, many may find it difficult to be understood, which can make face-to-face communication challenging. Using voice-activated technologies can be frustrating too, as they don’t always work reliably. Project Euphonia is a Google Research initiative focused on helping people with impaired speech be better understood. The team is researching ways to improve speech recognition for individuals with speech impairments (see recent blog post and segment in TODAY show), as well as customized text-to-speech technology (see Age of AI documentary featuring former NFL player Tim Shaw).

Acknowledgements
Many people across Google Research, Google Cloud and Consumer Apps, and Google Accessibility teams contributed to this project and the event, including Michael Brenner, Bob MacDonald, Heiga Zen, Yu Zhang, Jonathan Shen, Isaac Elias‎, Yonghui Wu, Anne Keck, Danielle Notaro, Kevin Hogan, Zack Kaplan, KR Liu, Kyndra Price, Zoe Ortiz.

Categories
Offsites

Beyond alchemy: A first look at geometric deep learning

Geometric deep learning is a “program” that aspires to situate deep learning architectures and techniques in a framework of mathematical priors. The priors, such as various types of invariance, first arise in some physical domain. A neural network that well matches the domain will preserve as many invariances as possible. In this post, we present a very conceptual, high-level overview, and highlight a few applications.

Categories
Offsites

SoundStream: An End-to-End Neural Audio Codec

Audio codecs are used to efficiently compress audio to reduce either storage requirements or network bandwidth. Ideally, audio codecs should be transparent to the end user, so that the decoded audio is perceptually indistinguishable from the original and the encoding/decoding process does not introduce perceivable latency.

Over the past few years, different audio codecs have been successfully developed to meet these requirements, including Opus and Enhanced Voice Services (EVS). Opus is a versatile speech and audio codec, supporting bitrates from 6 kbps (kilobits per second) to 510 kbps, which has been widely deployed across applications ranging from video conferencing platforms, like Google Meet, to streaming services, like YouTube. EVS is the latest codec developed by the 3GPP standardization body targeting mobile telephony. Like Opus, it is a versatile codec operating at multiple bitrates, 5.9 kbps to 128 kbps. The quality of the reconstructed audio using either of these codecs is excellent at medium-to-low bitrates (12–20 kbps), but it degrades sharply when operating at very low bitrates (⪅3 kbps). While these codecs leverage expert knowledge of human perception as well as carefully engineered signal processing pipelines to maximize the efficiency of the compression algorithms, there has been recent interest in replacing these handcrafted pipelines by machine learning approaches that learn to encode audio in a data-driven manner.

Earlier this year, we released Lyra, a neural audio codec for low-bitrate speech. In “SoundStream: an End-to-End Neural Audio Codec”, we introduce a novel neural audio codec that extends those efforts by providing higher-quality audio and expanding to encode different sound types, including clean speech, noisy and reverberant speech, music, and environmental sounds. SoundStream is the first neural network codec to work on speech and music, while being able to run in real-time on a smartphone CPU. It is able to deliver state-of-the-art quality over a broad range of bitrates with a single trained model, which represents a significant advance in learnable codecs.

Learning an Audio Codec from Data
The main technical ingredient of SoundStream is a neural network, consisting of an encoder, decoder and quantizer, all of which are trained end-to-end. The encoder converts the input audio stream into a coded signal, which is compressed using the quantizer and then converted back to audio using the decoder. SoundStream leverages state-of-the-art solutions in the field of neural audio synthesis to deliver audio at high perceptual quality, by training a discriminator that computes a combination of adversarial and reconstruction loss functions that induce the reconstructed audio to sound like the uncompressed original input. Once trained, the encoder and decoder can be run on separate clients to efficiently transmit high-quality audio over a network.

SoundStream training and inference. During training, the encoder, quantizer and decoder parameters are optimized using a combination of reconstruction and adversarial losses, computed by a discriminator, which is trained to distinguish between the original input audio and the reconstructed audio. During inference, the encoder and quantizer on a transmitter client send the compressed bitstream to a receiver client that can then decode the audio signal.

Learning a Scalable Codec with Residual Vector Quantization
The encoder of SoundStream produces vectors that can take an indefinite number of values. In order to transmit them to the receiver using a limited number of bits, it is necessary to replace them by close vectors from a finite set (called a codebook), a process known as vector quantization. This approach works well at bitrates around 1 kbps or lower, but quickly reaches its limits when using higher bitrates. For example, even at a bitrate as low as 3 kbps, and assuming the encoder produces 100 vectors per second, one would need to store a codebook with more than 1 billion vectors, which is infeasible in practice.

In SoundStream, we address this issue by proposing a new residual vector quantizer (RVQ), consisting of several layers (up to 80 in our experiments). The first layer quantizes the code vectors with moderate resolution, and each of the following layers processes the residual error from the previous one. By splitting the quantization process in several layers, the codebook size can be reduced drastically. As an example, with 100 vectors per second at 3 kbps, and using 5 quantizer layers, the codebook size goes from 1 billion to 320. Moreover, we can easily increase or decrease the bitrate by adding or removing quantizer layers, respectively.

Because network conditions can vary while transmitting audio, ideally a codec should be “scalable” so that it can change its bitrate from low to high depending on the state of the network. While most traditional codecs are scalable, previous learnable codecs need to be trained and deployed specifically for each bitrate.

To circumvent this limitation, we leverage the fact that the number of quantization layers in SoundStream controls the bitrate, and propose a new method called “quantizer dropout”. During training, we randomly drop some quantization layers to simulate a varying bitrate. This pushes the decoder to perform well at any bitrate of the incoming audio stream, and thus helps SoundStream to become “scalable” so that a single trained model can operate at any bitrate, performing as well as models trained specifically for these bitrates.

Comparison of SoundStream models (higher is better) that are trained at 18 kbps with quantizer dropout (bitrate scalable), without quantizer dropout (not bitrate scalable) and evaluated with a variable number of quantizers, or trained and evaluated at a fixed bitrate (bitrate specific). The bitrate-scalable model (a single model for all bitrates) does not lose any quality when compared to bitrate-specific models (a different model for each bitrate), thanks to quantizer dropout.

A State-of-the-Art Audio Codec
SoundStream at 3 kbps outperforms Opus at 12 kbps and approaches the quality of EVS at 9.6 kbps, while using 3.2x–4x fewer bits. This means that encoding audio with SoundStream can provide a similar quality while using a significantly lower amount of bandwidth. Moreover, at the same bitrate, SoundStream outperforms the current version of Lyra, which is based on an autoregressive network. Unlike Lyra, which is already deployed and optimized for production usage, SoundStream is still at an experimental stage. In the future, Lyra will incorporate the components of SoundStream to provide both higher audio quality and reduced complexity.

SoundStream at 3kbps vs. state-of-the-art codecs. MUSHRA score is an indication of subjective quality (the higher the better).

The demonstration of SoundStream’s performance compared to Opus, EVS, and the original Lyra codec is presented in these audio examples, a selection of which are provided below.

Speech

Reference
Lyra (3kbps)
Opus (6kbps)
EVS (5.9kbps)
SoundStream (3kbps)  

Music

Reference
Lyra (3kbps)
Opus (6kbps)
EVS (5.9kbps)
SoundStream (3kbps)  

Joint Audio Compression and Enhancement
In traditional audio processing pipelines, compression and enhancement (the removal of background noise) are typically performed by different modules. For example, it is possible to apply an audio enhancement algorithm at the transmitter side, before audio is compressed, or at the receiver side, after audio is decoded. In such a setup, each processing step contributes to the end-to-end latency. Conversely, we design SoundStream in such a way that compression and enhancement can be carried out jointly by the same model, without increasing the overall latency. In the following examples, we show that it is possible to combine compression with background noise suppression, by activating and deactivating denoising dynamically (no denoising for 5 seconds, denoising for 5 seconds, no denoising for 5 seconds, etc.).

Original noisy audio  
Denoised output*
* Demonstrated by turning denoising on and off every 5 seconds.

Conclusion
Efficient compression is necessary whenever one needs to transmit audio, whether when streaming a video, or during a conference call. SoundStream is an important step towards improving machine learning-driven audio codecs. It outperforms state-of-the-art codecs, such as Opus and EVS, can enhance audio on demand, and requires deployment of only a single scalable model, rather than many.

SoundStream will be released as a part of the next, improved version of Lyra. By integrating SoundStream with Lyra, developers can leverage the existing Lyra APIs and tools for their work, providing both flexibility and better sound quality. We will also release it as a separate TensorFlow model for experimentation.

AcknowledgmentsThe work described here was authored by Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund and Marco Tagliasacchi. We are grateful for all discussions and feedback on this work that we received from our colleagues at Google.

Categories
Offsites

Demonstrating the Fundamentals of Quantum Error Correction

The Google Quantum AI team has been building quantum processors made of superconducting quantum bits (qubits) that have achieved the first beyond-classical computation, as well as the largest quantum chemical simulations to date. However, current generation quantum processors still have high operational error rates — in the range of 10-3 per operation, compared to the 10-12 believed to be necessary for a variety of useful algorithms. Bridging this tremendous gap in error rates will require more than just making better qubits — quantum computers of the future will have to use quantum error correction (QEC).

The core idea of QEC is to make a logical qubit by distributing its quantum state across many physical data qubits. When a physical error occurs, one can detect it by repeatedly checking certain properties of the qubits, allowing it to be corrected, preventing any error from occurring on the logical qubit state. While logical errors may still occur if a series of physical qubits experience an error together, this error rate should exponentially decrease with the addition of more physical qubits (more physical qubits need to be involved to cause a logical error). This exponential scaling behavior relies on physical qubit errors being sufficiently rare and independent. In particular, it’s important to suppress correlated errors, where one physical error simultaneously affects many qubits at once or persists over many cycles of error correction. Such correlated errors produce more complex patterns of error detections that are more difficult to correct and more easily cause logical errors.

Our team has recently implemented the ideas of QEC in our Sycamore architecture using quantum repetition codes. These codes consist of one-dimensional chains of qubits that alternate between data qubits, which encode the logical qubit, and measure qubits, which we use to detect errors in the logical state. While these repetition codes can only correct for one kind of quantum error at a time1, they contain all of the same ingredients as more sophisticated error correction codes and require fewer physical qubits per logical qubit, allowing us to better explore how logical errors decrease as logical qubit size grows.

In “Removing leakage-induced correlated errors in superconducting quantum error correction”, published in Nature Communications, we use these repetition codes to demonstrate a new technique for reducing the amount of correlated errors in our physical qubits. Then, in “Exponential suppression of bit or phase flip errors with repetitive error correction”, published in Nature, we show that the logical errors of these repetition codes are exponentially suppressed as we add more and more physical qubits, consistent with expectations from QEC theory.

Layout of the repetition code (21 qubits, 1D chain) and distance-2 surface code (7 qubits) on the Sycamore device.

Leaky Qubits
The goal of the repetition code is to detect errors on the data qubits without measuring their states directly. It does so by entangling each pair of data qubits with their shared measure qubit in a way that tells us whether those data qubit states are the same or different (i.e., their parity) without telling us the states themselves. We repeat this process over and over in rounds that last only one microsecond. When the measured parities change between rounds, we’ve detected an error.

However, one key challenge stems from how we make qubits out of superconducting circuits. While a qubit needs only two energy states, which are usually labeled |0⟩ and |1⟩, our devices feature a ladder of energy states, |0⟩, |1⟩, |2⟩, |3⟩, and so on. We use the two lowest energy states to encode our qubit with information to be used for computation (we call these the computational states). We use the higher energy states (|2⟩, |3⟩ and higher) to help achieve high-fidelity entangling operations, but these entangling operations can sometimes allow the qubit to “leak” into these higher states, earning them the name leakage states.

Population in the leakage states builds up as operations are applied, which increases the error of subsequent operations and even causes other nearby qubits to leak as well — resulting in a particularly challenging source of correlated error. In our early 2015 experiments on error correction, we observed that as more rounds of error correction were applied, performance declined as leakage began to build.

Mitigating the impact of leakage required us to develop a new kind of qubit operation that could “empty out” leakage states, called multi-level reset. We manipulate the qubit to rapidly pump energy out into the structures used for readout, where it will quickly move off the chip, leaving the qubit cooled to the |0⟩ state, even if it started in |2⟩ or |3⟩. Applying this operation to the data qubits would destroy the logical state we’re trying to protect, but we can apply it to the measure qubits without disturbing the data qubits. Resetting the measure qubits at the end of every round dynamically stabilizes the device so leakage doesn’t continue to grow and spread, allowing our devices to behave more like ideal qubits.

Applying the multi-level reset gate to the measure qubits almost totally removes leakage, while also reducing the growth of leakage on the data qubits.

Exponential Suppression
Having mitigated leakage as a significant source of correlated error, we next set out to test whether the repetition codes give us the predicted exponential reduction in error when increasing the number of qubits. Every time we run our repetition code, it produces a collection of error detections. Because the detections are linked to pairs of qubits rather than individual qubits, we have to look at all of the detections to try to piece together where the errors have occurred, a procedure known as decoding. Once we’ve decoded the errors, we then know which corrections we need to apply to the data qubits. However, decoding can fail if there are too many error detections for the number of data qubits used, resulting in a logical error.

To test our repetition codes, we run codes with sizes ranging from 5 to 21 qubits while also varying the number of error correction rounds. We also run two different types of repetition codes — either a phase-flip code or bit-flip code — that are sensitive to different kinds of quantum errors. By finding the logical error probability as a function of the number of rounds, we can fit a logical error rate for each code size and code type. In our data, we see that the logical error rate does in fact get suppressed exponentially as the code size is increased.

Probability of getting a logical error after decoding versus number of rounds run, shown for various sizes of phase-flip repetition code.

We can quantify the error suppression with the error scaling parameter Lambda (Λ), where a Lambda value of 2 means that we halve the logical error rate every time we add four data qubits to the repetition code. In our experiments, we find Lambda values of 3.18 for the phase-flip code and 2.99 for the bit-flip code. We can compare these experimental values to a numerical simulation of the expected Lambda based on a simple error model with no correlated errors, which predicts values of 3.34 and 3.78 for the bit- and phase-flip codes respectively.

Logical error rate per round versus number of qubits for the phase-flip (X) and bit-flip (Z) repetition codes. The line shows an exponential decay fit, and Λ is the scale factor for the exponential decay.

This is the first time Lambda has been measured in any platform while performing multiple rounds of error detection. We’re especially excited about how close the experimental and simulated Lambda values are, because it means that our system can be described with a fairly simple error model without many unexpected errors occurring. Nevertheless, the agreement is not perfect, indicating that there’s more research to be done in understanding the non-idealities of our QEC architecture, including additional sources of correlated errors.

What’s Next
This work demonstrates two important prerequisites for QEC: first, the Sycamore device can run many rounds of error correction without building up errors over time thanks to our new reset protocol, and second, we were able to validate QEC theory and error models by showing exponential suppression of error in a repetition code. These experiments were the largest stress test of a QEC system yet, using 1000 entangling gates and 500 qubit measurements in our largest test. We’re looking forward to taking what we learned from these experiments and applying it to our target QEC architecture, the 2D surface code, which will require even more qubits with even better performance.


1A true quantum error correcting code would require a two dimensional array of qubits in order to correct for all of the errors that could occur. 

Categories
Offsites

The C4_200M Synthetic Dataset for Grammatical Error Correction

Grammatical error correction (GEC) attempts to model grammar and other types of writing errors in order to provide grammar and spelling suggestions, improving the quality of written output in documents, emails, blog posts and even informal chats. Over the past 15 years, there has been a substantial improvement in GEC quality, which can in large part be credited to recasting the problem as a “translation” task. When introduced in Google Docs, for example, this approach resulted in a significant increase in the number of accepted grammar correction suggestions.

One of the biggest challenges for GEC models, however, is data sparsity. Unlike other natural language processing (NLP) tasks, such as speech recognition and machine translation, there is very limited training data available for GEC, even for high-resource languages like English. A common remedy for this is to generate synthetic data using a range of techniques, from heuristic-based random word- or character-level corruptions to model-based approaches. However, such methods tend to be simplistic and do not reflect the true distribution of error types from actual users.

In “Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models”, presented at the EACL 16th Workshop on Innovative Use of NLP for Building Educational Applications, we introduce tagged corruption models. Inspired by the popular back-translation data synthesis technique for machine translation, this approach enables the precise control of synthetic data generation, ensuring diverse outputs that are more consistent with the distribution of errors seen in practice. We used tagged corruption models to generate a new 200M sentence dataset, which we have released in order to provide researchers with realistic pre-training data for GEC. By integrating this new dataset into our training pipeline, we were able to significantly improve on GEC baselines.

Tagged Corruption Models
The idea behind applying a conventional corruption model to GEC is to begin with a grammatically correct sentence and then to “corrupt” it by adding errors. A corruption model can be easily trained by switching the source and target sentences in existing GEC datasets, a method that previous studies have shown that can be very effective for generating improved GEC datasets.

A conventional corruption model generates an ungrammatical sentence (red) given a clean input sentence (green).

The tagged corruption model that we propose builds on this idea by taking a clean sentence as input along with an error type tag that describes the kind of error one wishes to reproduce. It then generates an ungrammatical version of the input sentence that contains the given error type. Choosing different error types for different sentences increases the diversity of corruptions compared to a conventional corruption model.

Tagged corruption models generate corruptions (red) for the clean input sentence (green) depending on the error type tag. A determiner error may lead to dropping the “a”, whereas a noun-inflection error may produce the incorrect plural “sheeps”.

To use this model for data generation we first randomly selected 200M clean sentences from the C4 corpus, and assigned an error type tag to each sentence such that their relative frequencies matched the error type tag distribution of the small development set BEA-dev. Since BEA-dev is a carefully curated set that covers a wide range of different English proficiency levels, we expect its tag distribution to be representative for writing errors found in the wild. We then used a tagged corruption model to synthesize the source sentence.

Synthetic data generation with tagged corruption models. The clean C4 sentences (green) are paired with the corrupted sentences (red) in the synthetic GEC training corpus. The corrupted sentences are generated using a tagged corruption model by following the error type frequencies in the development set (bar chart).

Results
In our experiments, tagged corruption models outperformed untagged corruption models on two standard development sets (CoNLL-13 and BEA-dev) by more than three F0.5-points (a standard metric in GEC research that combines precision and recall with more weight on precision), advancing the state-of-the-art on the two widely used academic test sets, CoNLL-14 and BEA-test.

In addition, the use of tagged corruption models not only yields gains on standard GEC test sets, it is also able to adapt GEC systems to the proficiency levels of users. This could be useful, for example, because the error tag distribution for native English writers often differs significantly from the distributions for non-native English speakers. For example, native speakers tend to make more punctuation and spelling mistakes, whereas determiner errors (e.g., missing or superfluous articles, like “a”, “an” or “the”) are more common in text from non-native writers.

Conclusion
Neural sequence models are notoriously data-hungry, but the availability of annotated training data for grammatical error correction is rare. Our new C4_200M corpus is a synthetic dataset containing diverse grammatical errors, which yields state-of-the-art performance when used to pre-train GEC systems. By releasing the dataset we hope to provide GEC researchers with a valuable resource to train strong baseline systems.

Categories
Offsites

torch: Just-in-time compilation (JIT) for R-less model deployment

Using the torch just-in-time (JIT) compiler, it is possible to query a model trained in R from a different language, provided that language can make use of the low-level libtorch library. This post shows how. In addition, we try to untangle a bit of the terminological jumble surrounding the topic.

Categories
Offsites

A Dataset Exploration Case Study with Know Your Data

Data underlies much of machine learning (ML) research and development, helping to structure what a machine learning algorithm learns and how models are evaluated and benchmarked. However, data collection and labeling can be complicated by unconscious biases, data access limitations and privacy concerns, among other challenges. As a result, machine learning datasets can reflect unfair social biases along dimensions of race, gender, age, and more.

Methods of examining datasets that can surface information about how different social groups are represented within are a key component of ensuring development of ML models and datasets is aligned with our AI Principles. Such methods can inform the responsible use of ML datasets and point toward potential mitigations of unfair outcomes. For example, prior research has demonstrated that some object recognition datasets are biased toward images sourced from North America and Western Europe, prompting Google’s Crowdsource effort to balance out image representations in other parts of the world.

Today, we demonstrate some of the functionality of a dataset exploration tool, Know Your Data (KYD), recently introduced at Google I/O, using the COCO Captions dataset as a case study. Using this tool, we find a range of gender and age biases in COCO Captions — biases that can be traced to both dataset collection and annotation practices. KYD is a dataset analysis tool that complements the growing suite of responsible AI tools being developed across Google and the broader research community. Currently, KYD only supports analysis of a small set of image datasets, but we’re working hard to make the tool accessible beyond this set.

Introducing Know Your Data
Know Your Data helps ML research, product and compliance teams understand datasets, with the goal of improving data quality, and thus helping to mitigate fairness and bias issues. KYD offers a range of features that allow users to explore and examine machine learning datasets — users can filter, group, and study correlations based on annotations already present in a given dataset. KYD also presents automatically computed labels from Google’s Cloud Vision API, providing users with a simple way to explore their data based on signals that weren’t originally present in the dataset.

A KYD Case Study
As a case study, we explore some of these features using the COCO Captions dataset, an image dataset that contains five human-generated captions for each of over 300k images. Given the rich annotations provided by free-form text, we focus our analysis on signals already present within the dataset.

Exploring Gender Bias
Previous research has demonstrated undesirable gender biases within computer vision datasets, including pornographic imagery of women and image label correlations that align with harmful gender stereotypes. We use KYD to explore gender biases within COCO Captions by examining gendered correlations within the image captions. We find a gender bias in the depiction of different activities across the images in the dataset, as well as biases relating to how people of different genders are described by annotators.

The first part of our analysis aimed to surface gender biases with respect to different activities depicted in the dataset. We examined images captioned with words describing different activities and analyzed their relation to gendered caption words, such as “man” or “woman”. The KYD Relations tab makes it easy to examine the relation between two different signals in a dataset by visualizing the extent to which two signals co-occur more (or less) than would be expected by chance. Each cell indicates either a positive (blue color) or negative (orange color) correlation between two specific signal values along with the strength of that correlation.

KYD also allows users to filter rows of a relations table based on substring matching. Using this functionality, we initially probed for caption words containing “-ing”, as a simple way to filter by verbs. We immediately saw strong gendered correlations:

Using KYD to analyze the relationship between any word and gendered words. Each cell shows if the two respective words co-occur in the same caption more (up arrow) or less often (down arrow) than pure chance.

Digging further into these correlations, we found that several activities stereotypically associated with women, such as “shopping” and “cooking”, co-occur with images captioned with “women” or “woman” at a higher rate than with images captioned with “men” or “man”. In contrast captions describing many physically intensive activities, such as “skateboarding”, “surfing”, and “snowboarding”, co-occur with images captioned with “man” or “men” at higher rates.

While individual image captions may not use stereotypical or derogatory language, such as with the example below, if certain gender groups are over (or under) represented within a particular activity across the whole dataset, models developed from the dataset risk learning stereotypical associations. KYD makes it easy to surface, quantify, and make plans to mitigate this risk.

An image with one of the captions: “Two women cooking in a beige and white kitchen.” Image licensed under CC-BY 2.0.

In addition to examining biases with respect to the social groups depicted with different activities, we also explored biases in how annotators described the appearance of people they perceived as male or female. Inspired by media scholars who have examined the “male gaze” embedded in other forms of visual media, we examined the frequency with which individuals perceived as women in COCO are described using adjectives that position them as an object of desire. KYD allowed us to easily examine co-occurrences between words associated with binary gender (e.g. “female/girl/woman” vs. “male/man/boy”) and words associated with evaluating physical attractiveness. Importantly, these are captions written by human annotators, who are making subjective assessments about the gender of people in the image and choosing a descriptor for attractiveness. We see that the words “attractive”, “beautiful”, “pretty”, and “sexy” are overrepresented in describing people perceived as women as compared to those perceived as men, confirming what prior work has said about how gender is viewed in visual media.

A screenshot from KYD showing the relationship between words that describe attractiveness and gendered words. For example, “attractive” and “male/man/boy” co-occur 12 times, but we expect ~60 times by chance (the ratio is 0.2x). On the other hand, “attractive” and “female/woman/girl” co-occur 2.62 times more than chance.

KYD also allows us to manually inspect images for each relation by clicking on the relation in question. For example, we can see images whose captions include female terms (e.g. “woman”) and the word “beautiful”.

Exploring Age Bias
Adults older than 65 have been shown to be underrepresented in datasets relative to their presence in the general population — a first step toward improving age representation is to allow developers to assess it in their datasets. By looking at caption words describing different activities and analyzing their relation to caption words describing age, KYD helped us to assess the range of example captions depicting older adults. Having example captions of adults in a range of environments and activities is important for a variety of tasks, such as image captioning or pedestrian detection.

The first trend that KYD made clear is how rarely annotators described people as older adults in captions detailing different activities. The relations tab also shows a trend wherein “elderly”, “old”, and “older” tend not to occur with verbs that describe a variety of physical activities that might be important for a system to be able to detect. Important to note is that, relative to “young”, “old” is more often used to describe things other than people, such as belongings or clothing, so these relations are also capturing some uses that don’t describe people.

The relationship between words associated with age and movement from a screenshot of KYD.

The underrepresentation of captions containing the references to older adults that we examined here could be rooted in a relative lack of images depicting older adults as well as in a tendency for annotators to omit older age-related terms when describing people in images. While manual inspection of the intersection of “old” and “running” shows a negative relation, we notice that it shows no older people and a number of locomotives. KYD makes it easy to quantitatively and qualitatively inspect relations to identify dataset strengths and areas for improvement.

Conclusion
Understanding the contents of ML datasets is a critical first step to developing suitable strategies to mitigate the downstream impact of unfair dataset bias. The above analysis points towards several potential mitigations. For example, correlations between certain activities and social groups, which can lead trained models to reproduce social stereotypes, can be potentially mitigated by “dataset balancing” — increasing the representation of under-represented group/activity combinations. However, mitigations focused exclusively on dataset balancing are not sufficient, as our analysis of how different genders are described by annotators demonstrated. We found annotators’ subjective judgements of people portrayed in images were reflected within the final dataset, suggesting a deeper look at methods of image annotations are needed. One solution for data practitioners who are developing image captioning datasets is to consider integrating guidelines that have been developed for writing image descriptions that are sensitive to race, gender, and other identity categories.

The above case studies highlight only some of the KYD features. For example, Cloud Vision API signals are also integrated into KYD and can be used to infer signals that annotators haven’t labeled directly. We encourage the broader ML community to perform their own KYD case studies and share their findings.

KYD complements other dataset analysis tools being developed across the ML community, including Google’s growing Responsible AI toolkit. We look forward to ML practitioners using KYD to better understand their datasets and mitigate potential bias and fairness concerns. If you have feedback on KYD, please write to knowyourdata-feedback@google.com.

Acknowledgements
The analysis and write-up in this post were conducted with equal contribution by Emily Denton, Mark Díaz, and Alex Hanna. We thank Marie Pellat, Ludovic Peran, Daniel Smilkov, Nikhil Thorat and Tsung-Yi for their contributions to and reviews of this post.

Categories
Offsites

Improved Detection of Elusive Polyps via Machine Learning

With the increasing ability to consistently and accurately process large amounts of data, particularly visual data, computer-aided diagnostic systems are more frequently being used to assist physicians in their work. This, in turn, can lead to meaningful improvements in health care. An example of where this could be especially useful is in the diagnosis and treatment of colorectal cancer (CRC), which is especially deadly and results in over 900K deaths per year, globally. CRC originates in small pre-cancerous lesions in the colon, called polyps, the identification and removal of which is very successful in preventing CRC-related deaths.

The standard procedure used by gastroenterologists (GIs) to detect and remove polyps is the colonoscopy, and about 19 million such procedures are performed annually in the US alone. During a colonoscopy, the gastroenterologist uses a camera-containing probe to check the intestine for pre-cancerous polyps and early signs of cancer, and removes tissue that looks worrisome. However, complicating factors, such as incomplete detection (in which the polyp appears within the field of view, but is missed by the GI, perhaps due to its size or shape) and incomplete exploration (in which the polyp does not appear in the camera’s field of view), can lead to a high fraction of missed polyps. In fact, studies suggest that 22%–28% of polyps are missed during colonoscopies, of which 20%–24% have the potential to become cancerous (adenomas).

Today, we are sharing progress made in using machine learning (ML) to help GIs fight colorectal cancer by making colonoscopies more effective. In “Detection of Elusive Polyps via a Large Scale AI System”, we present an ML model designed to combat the problem of incomplete detection by helping the GI detect polyps that are within the field of view. This work adds to our previously published work that maximizes the coverage of the colon during the colonoscopy by flagging for GI follow-up areas that may have been missed. Using clinical studies, we show that these systems significantly improve polyp detection rates.

Incomplete Exploration
To help the GI detect polyps that are outside the field of view, we previously developed an ML system that reduces the rate of incomplete exploration by estimating the fractions of covered and non-covered regions of a colon during a colonoscopy. This earlier work uses computer vision and geometry in a technique we call colonoscopy coverage deficiency via depth, to compute segment-by-segment coverage for the colon. It does so in two phases: first computing depth maps for each frame of the colonoscopy video, and then using these depth maps to compute the coverage in real time.

The ML system computes a depth image (middle) from a single RGB image (left). Then, based on the computation of depth images for a video sequence, it calculates local coverage (right), and detects where the coverage has been deficient and a second look is required (blue color indicates observed segments where red indicates uncovered ones). You can learn more about this work in our previous blog post.

This segment-by-segment work yields the ability to estimate what fraction of the current segment has been covered. The helpfulness of such functionality is clear: during the procedure itself, a physician may be alerted to segments with deficient coverage, and can immediately return to review these areas, potentially reducing the rates of missed polyps due to incomplete exploration.

Incomplete Detection
In our most recent paper, we look into the problem of incomplete detection. We describe an ML model that aids a GI in detecting polyps that are within the field of view, so as to reduce the rate of incomplete detection. We developed a system that is based on convolutional neural networks (CNN) with an architecture that combines temporal logic with a single frame detector, resulting in more accurate detection.

This new system has two principal advantages. The first is that the system improves detection performance by reducing the number of false negatives detections of elusive polyps, those polyps that are particularly difficult for GIs to detect. The second advantage is the very low false positive rate of the system. This low false positive rate makes these systems more likely to be adopted in the clinic.

Examples of the variety of polyps detected by the ML system.

We trained the system on 3600 procedures (86M video frames) and tested it on 1400 procedures (33M frames). All the videos and metadata were de-identified. The system detected 97% of the polyps (i.e., it yielded 97% sensitivity) at 4.6 false alarms per procedure, which is a substantial improvement over previously published results. Of the false alarms, follow-up review showed that some were, in fact, valid polyp detections, indicating that the system was able to detect polyps that were missed by the performing endoscopist and by those who annotated the data. The performance of the system on these elusive polyps suggests its generalizability in that the system has learned to detect examples that were initially missed by all who viewed the procedure.

We evaluated the system performance on polyps that are in the field of view for less than five seconds, which makes them more difficult for the GI to detect, and for which models typically have much lower sensitivity. In this case the system attained a sensitivity that is about three times that of the sensitivity that the original procedure achieved. When the polyps were present in the field of view for less than 2 seconds, the difference was even more stark — the system exhibited a 4x improvement in sensitivity.

It is also interesting to note that the system is fairly insensitive to the choice of neural network architecture. We used two architectures: RetinaNet and  LSTM-SSD. RetinaNet is a leading technique for object detection on static images (used for video by applying it to frames in a consecutive fashion). It is one of the top performers on a variety of benchmarks, given a fixed computational budget, and is known for balancing speed of computation with accuracy. LSTM-SSD is a true video object detection architecture, which can explicitly account for the temporal character of the video (e.g., temporal consistency of detections, ability to deal with blur and fast motion, etc.). It is known for being robust and very computationally lightweight and can therefore run on less expensive processors. Comparable results were also obtained on the much heavier Faster R-CNN architecture. The fact that results are similar across different architectures implies that one can choose the network meeting the available hardware specifications.

Prospective Clinical Research Study
As part of the research reported in our detection paper we ran a clinical validation on 100 procedures in collaboration with Shaare Zedek Medical Center in Jerusalem, where our system was used in real time to help GIs. The system helped detect an average of one polyp per procedure that would have otherwise been missed by the GI performing the procedure, while not missing any of the polyps detected by the GIs, and with 3.8 false alarms per procedure. The feedback from the GIs was consistently positive.

We are encouraged by the potential helpfulness of this system for improving polyp detection, and we look forward to working together with the doctors in the procedure room to further validate this research.

Acknowledgements
The research was conducted by teams from Google Health and Google Research, Israel with support from Verily Life Sciences, and in collaboration with Shaare Zedek Medical Center. Verily is advancing this research via a newly established center in Israel, led by Ehud Rivlin. This research was conducted by Danny Veikherman, Tomer Golany, Dan M. Livovsky, Amit Aides, Valentin Dashinsky, Nadav Rabani, David Ben Shimol, Yochai Blau, Liran Katzir, Ilan Shimshoni, Yun Liu, Ori Segol, Eran Goldin, Greg Corrado, Jesse Lachter, Yossi Matias, Ehud Rivlin, and Daniel Freedman. Our appreciation also goes to several institutions and GIs who provided advice along the way and tested our system prototype. We would like to thank all of our team members and collaborators who worked on this project with us, including: Chen Barshai, Nia Stoykova, and many others.

Categories
Offsites

Two New Datasets for Conversational NLP: TimeDial and Disfl-QA

A key challenge in natural language processing (NLP) is building conversational agents that can understand and reason about different language phenomena that are unique to realistic speech. For example, because people do not always premeditate exactly what they are going to say, a natural conversation often includes interruptions to speech, called disfluencies. Such disfluencies can be simple (like interjections, repetitions, restarts, or corrections), which simply break the continuity of a sentence, or more complex semantic disfluencies, in which the underlying meaning of a phrase changes. In addition, understanding a conversation also often requires knowledge of temporal relationships, like whether an event precedes or follows another. However, conversational agents built on today’s NLP models often struggle when confronted with temporal relationships or with disfluencies, and progress on improving their performance has been slow. This is due, in part, to a lack of datasets that involve such interesting conversational and speech phenomena.

To stir interest in this direction within the research community, we are excited to introduce TimeDial, for temporal commonsense reasoning in dialog, and Disfl-QA, which focuses on contextual disfluencies. TimeDial presents a new multiple choice span filling task targeted for temporal understanding, with an annotated test set of over ~1.1k dialogs. Disfl-QA is the first dataset containing contextual disfluencies in an information seeking setting, namely question answering over Wikipedia passages, with ~12k human annotated disfluent questions. These benchmark datasets are the first of their kind and show a significant gap between human performance and current state of the art NLP models.

TimeDial
While people can effortlessly reason about everyday temporal concepts, such as duration, frequency, or relative ordering of events in a dialog, such tasks can be challenging for conversational agents. For example, current NLP models often make a poor selection when tasked with filling in a blank (as shown below) that assumes a basic level of world knowledge for reasoning, or that requires understanding explicit and implicit inter-dependencies between temporal concepts across conversational turns.

It is easy for a person to judge that “half past one” and “quarter to two” are more plausible options to fill in the blank than “half past three” and “half past nine”. However, performing such temporal reasoning in the context of a dialog is not trivial for NLP models, as it requires appealing to world knowledge (i.e., knowing that the participants are not yet late for the meeting) and understanding the temporal relationship between events (“half past one” is before “three o’clock”, while “half past three” is after it). Indeed, current state-of-the-art models like T5 and BERT end up picking the wrong answers — “half past three” (T5) and “half past nine” (BERT).

The TimeDial benchmark dataset (derived from the DailyDialog multi-turn dialog corpus) measures models’ temporal commonsense reasoning abilities within a dialog context. Each of the ~1.5k dialogs in the dataset is presented in a multiple choice setup, in which one temporal span is masked out and the model is asked to find all correct answers from a list of four options to fill in the blank.

In our experiments we found that while people can easily answer these multiple choice questions (at 97.8% accuracy), state-of-the-art pre-trained language models still struggle on this challenge set. We experiment across three different modeling paradigms: (i) classification over the provided 4 options using BERT, (ii) mask filling for the masked span in the dialog using BERT-MLM, (iii) generative methods using T5. We observe that all the models struggle on this challenge set, with the best variant only scoring 73%.

Model   2-best Accuracy
Human   97.8%
BERT – Classification   50.0%
BERT – Mask Filling   68.5%
T5 – Generation   73.0%

Qualitative error analyses show that the pre-trained language models often rely on shallow, spurious features (particularly text matching), instead of truly doing reasoning over the context. It is likely that building NLP models capable of performing the kind of temporal commonsense reasoning needed for TimeDial requires rethinking how temporal objects are represented within general text representations.

Disfl-QA
As disfluency is inherently a speech phenomenon, it is most commonly found in text output from speech recognition systems. Understanding such disfluent text is key to building conversational agents that understand human speech. Unfortunately, research in the NLP and speech community has been impeded by the lack of curated datasets containing such disfluencies, and the datasets that are available, like Switchboard, are limited in scale and complexity. As a result, it’s difficult to stress test NLP models in the presence of disfluencies.

Disfluency   Example
Interjection   When is, uh, Easter this year?
Repetition   When is EasEaster this year?
Correction   When is Lent, I mean Easter, this year?
Restart   How much, no wait, when is Easter this year?
Different kinds of disfluencies. The reparandum (words intended to be corrected or ignored; in red), interregnum (optional discourse cues; in grey) and repair (the corrected words; in blue).

Disfl-QA is the first dataset containing contextual disfluencies in an information seeking setting, namely question answering over Wikipedia passages from SQuAD. Disfl-QA is a targeted dataset for disfluencies, in which all questions (~12k) contain disfluencies, making for a much larger disfluent test set than prior datasets. Over 90% of the disfluencies in Disfl-QA are corrections or restarts, making it a much more difficult test set for disfluency correction. In addition, compared to earlier disfluency datasets, it contains a wider variety of semantic distractors, i.e., distractors that carry semantic meaning as opposed to simpler speech disfluencies. 

Passage: …The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (“Norman” comes from “Norseman”) raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, …
Q1:   In what country is Normandy located?   France ✓
DQ1:   In what country is Norse found no wait Normandy not Norse?   Denmark X
Q2:   When were the Normans in Normandy?   10th and 11th centuries ✓
DQ2:   From which countries no tell me when were the Normans in Normandy?   Denmark, Iceland and Norway X
A passage and questions (Qi) from SQuAD dataset, along with their disfluent versions (DQi), consisting of semantic distractors (like “Norse” and “from which countries”) and predictions from a T5 model.

Here, the first question (Q1) is seeking an answer about the location of Normandy. In the disfluent version (DQ1) Norse is mentioned before the question is corrected. The presence of this correctional disfluency confuses the QA model, which tends to rely on shallow textual cues from the question for making predictions.

Disfl-QA also includes newer phenomena, such as coreference (expression referring to the same entity) between the reparandum and the repair.

SQuAD    Disfl-QA
Who does BSkyB have an operating license from?    Who removed [BSkyB’s] operating license, no scratch that, who do [they] have [their] operating license from?

Experiments show that the performance of existing state-of-the-art language model–based question answering systems degrades significantly when tested on Disfl-QA and heuristic disfluencies (presented in the paper) in a zero-shot setting.

Dataset   F1
SQuAD   89.59
Heuristics   65.27 (-24.32)
Disfl-QA   61.64 (-27.95)

We show that data augmentation methods partially recover the loss in performance and also demonstrate the efficacy of using human-annotated training data for fine-tuning. We argue that researchers need large-scale disfluency datasets in order for NLP models to be robust to disfluencies.

Conclusion
Understanding language phenomena that are unique to human speech, like disfluencies and temporal reasoning, among others, is a key ingredient for enabling more natural human–machine communication in the near future. With TimeDial and Disfl-QA, we aim to fill a major research gap by providing these datasets as testbeds for NLP models, in order to evaluate their robustness to ubiquitous phenomena across different tasks. It is our hope that the broader NLP community will devise generalized few-shot or zero-shot approaches to effectively handle these phenomena, without requiring task-specific human-annotated training datasets, constructed specifically for these challenges.

Acknowledgments
The TimeDial work has been a team effort involving Lianhui Qi, Luheng He, Yenjin Choi, Manaal Faruqui and the authors. The Disfl-QA work has been a collaboration involving Jiacheng Xu, Diyi Yang, Manaal Faruqui.