The topic of AI fairness metrics is as important to society as it is confusing. Confusing it is due to a number of reasons: terminological proliferation, abundance of formulae, and last not least the impression that everyone else seems to know what they’re talking about. This text hopes to counteract some of that confusion by starting from a common-sense approach of contrasting two basic positions: On the one hand, the assumption that dataset features may be taken as reflecting the underlying concepts ML practitioners are interested in; on the other, that there inevitably is a gap between concept and measurement, a gap that may be bigger or smaller depending on what is being measured. In contrasting these fundamental views, we bring together concepts from ML, legal science, and political philosophy.
Category: Offsites
Stuff I find valuable at key websites
Supervised learning, the machine learning task of training predictive models using data points with known outcomes (i.e., labeled data), is generally the preferred approach in industry because of its simplicity. However, supervised learning requires accurately labeled data, the collection of which is often labor intensive. In addition, as model efficiency improves with better architectures, algorithms, and hardware (GPUs / TPUs), training large models to achieve better quality becomes more accessible, which, in turn, requires even more labeled data for continued progress.
To mitigate such data acquisition challenges, semi-supervised learning, a machine learning paradigm that combines a small amount of labeled data with a large amount of unlabeled data, has recently seen success with methods such as UDA, SimCLR, and many others. In our previous work, we demonstrated for the first time that a semi-supervised learning approach, Noisy Student, can achieve state-of-the-art performance on ImageNet, a large-scale academic benchmark for image classification, by utilizing many more unlabeled examples.
Inspired by these results, today we are excited to present semi-supervised distillation (SSD), a simplified version of Noisy Student, and demonstrate its successful application to the language domain. We apply SSD to language understanding within the context of Google Search, resulting in high performance gains. This is the first successful instance of semi-supervised learning applied at such a large scale and demonstrates the potential impact of such approaches for production-scale systems.
Noisy Student Training
Prior to our development of Noisy Student, there was a large body of research into semi-supervised learning. In spite of this extensive research, however, such systems typically worked well only in the low-data regime, e.g., CIFAR, SVHN, and 10% ImageNet. When labeled data were abundant, such models were unable to compete with fully supervised learning systems, which prevented semi-supervised approaches from being applied to important applications in production, such as search engines and self-driving cars. This shortcoming motivated our development of Noisy Student Training, a semi-supervised learning approach that worked well in the high-data regime, and at the time achieved state-of-the-art accuracy on ImageNet using 130M additional unlabeled images.
Noisy Student Training has 4 simple steps:
- Train a classifier (the teacher) on labeled data.
- The teacher then infers pseudo-labels on a much larger unlabeled dataset.
- Then, it trains a larger classifier on the combined labeled and pseudo-labeled data, while also adding noise (noisy student).
- (Optional) Going back to step 2, the student may be used as a new teacher.
![]() |
An illustration of Noisy Student Training through four simple steps. We use two types of noise: model noise (Dropout, Stochastic Depth) and input noise (data augmentation, such as RandAugment). |
One can view Noisy Student as a form of self-training, because the model generates pseudo-labels with which it retrains itself to improve performance. A surprising property of Noisy Student Training is that the trained models work extremely well on robustness test sets for which it was not optimized, including ImageNet-A, ImageNet-C, and ImageNet-P. We hypothesize that the noise added during training not only helps with the learning, but also makes the model more robust.
![]() |
Examples of images that are classified incorrectly by the baseline model, but correctly by Noisy Student. Left: An unmodified image from ImageNet-A. Middle and Right: Images with noise added, selected from ImageNet-C. For more examples including ImageNet-P, please see the paper. |
Connections to Knowledge Distillation
Noisy Student is similar to knowledge distillation, which is a process of transferring knowledge from a large model (i.e., the teacher) to a smaller model (the student). The goal of distillation is to improve speed in order to build a model that is fast to run in production without sacrificing much in quality compared to the teacher. The simplest setup for distillation involves a single teacher and uses the same data, but in practice, one can use multiple teachers or a separate dataset for the student.
![]() |
Simple illustrations of Noisy Student and knowledge distillation. |
Unlike Noisy Student, knowledge distillation does not add noise during training (e.g., data augmentation or model regularization) and typically involves a smaller student model. In contrast, one can think of Noisy Student as the process of “knowledge expansion”.
Semi-Supervised Distillation
Another strategy for training production models is to apply Noisy Student training twice: first to get a larger teacher model T’ and then to derive a smaller student S. This approach produces a model that is better than either training with supervised learning or with Noisy Student training alone. Specifically, when applied to the vision domain for a family of EfficientNet models, ranging from EfficientNet-B0 with 5.3M parameters to EfficientNet-B7 with 66M parameters, this strategy achieves much better performance for each given model size (see Table 9 of the Noisy Student paper for more details).
Noisy Student training needs data augmentation, e.g., RandAugment (for vision) or SpecAugment (for speech), to work well. But in certain applications, e.g., natural language processing, such types of input noise are not readily available. For those applications, Noisy Student Training can be simplified to have no noise. In that case, the above two-stage process becomes a simpler method, which we call Semi-Supervised Distillation (SSD). First, the teacher model infers pseudo-labels on the unlabeled dataset from which we then train a new teacher model (T’) that is of equal-or-larger size than the original teacher model. This step, which is essentially self-training, is then followed by knowledge distillation to produce a smaller student model for production.
![]() |
An illustration of Semi-Supervised Distillation (SSD), a 2-stage process that self-trains an equal-or-larger teacher (T’) before distilling to a student (S). |
Improving Search
Having succeeded in the vision domain, an application in the language understanding domain, like Google Search, is a logical next step with broader user impact. In this case, we focus on an important ranking component in Search, which builds on BERT to better understand languages. This task turns out to be well-suited for SSD. Indeed, applying SSD to the ranking component to better understand the relevance of candidate search results to queries achieved one of the highest performance gains among top launches at Search in 2020. Below is an example of a query where the improved model demonstrates better language understanding.
![]() |
With the implementation of SSD, Search is able to find documents that are more relevant to user queries. |
Future Research & Challenges
We have presented a successful instance of semi-supervised distillation (SSD) in the production scale setting of Search. We believe SSD will continue changing the landscape of machine learning usage in the industry from predominantly supervised learning to semi-supervised learning. While our results are promising, there is still much research needed in how to efficiently utilize unlabeled examples in the real world, which is often noisy, and apply them to various domains.
Acknowledgements
Zhenshuai Ding, Yanping Huang, Elizabeth Tucker, Hai Qian, and Steve He contributed immensely to this successful launch. The project would not have succeeded without contributions from members of both the Brain and Search teams: Shuyuan Zhang, Rohan Anil, Zhifeng Chen, Rigel Swavely, Chris Waterson, Avinash Atreya. Thanks to Qizhe Xie and Zihang Dai for feedback on the work. Also, thanks to Quoc Le, Yonghui Wu, Sundeep Tirumalareddy, Alexander Grushetsky, Pandu Nayak for their leadership support.
It is widely accepted that the enormous growth of deep reinforcement learning research, which combines traditional reinforcement learning with deep neural networks, began with the publication of the seminal DQN algorithm. This paper demonstrated the potential of this combination, showing that it could produce agents that could play a number of Atari 2600 games very effectively. Since then, there have been several approaches that have built on and improved the original DQN. The popular Rainbow algorithm combined a number of these recent advances to achieve state-of-the-art performance on the ALE benchmark. This advance, however, came at a very high computational cost, which has the unfortunate side effect of widening the gap between those with ample access to computational resources and those without.
In “Revisiting Rainbow: Promoting more Insightful and Inclusive Deep Reinforcement Learning Research”, to be presented at ICML 2021, we revisit this algorithm on a set of small- and medium-sized tasks. We first discuss the computational cost associated with the Rainbow algorithm. We explore how the same conclusions regarding the benefits of combining the various algorithmic components can be reached with smaller-scale experiments, and further generalize that idea to how research done on a smaller computational budget can provide valuable scientific insights.
The Cost of Rainbow
A major reason for the computational cost of Rainbow is that the standards in academic publishing often require evaluating new algorithms on large benchmarks like ALE, which consists of 57 Atari 2600 games that reinforcement learning agents may learn to play. For a typical game, it takes roughly five days to train a model using a Tesla P100 GPU. Furthermore, if one wants to establish meaningful confidence bounds, it is common to perform at least five independent runs. Thus, to train Rainbow on the full suite of 57 games required around 34,200 GPU hours (or 1425 days) in order to provide convincing empirical performance statistics. In other words, such experiments are only feasible if one is able to train on multiple GPUs in parallel, which can be prohibitive for smaller research groups.
Revisiting Rainbow
As in the original Rainbow paper, we evaluate the effect of adding the following components to the original DQN algorithm: double Q-learning, prioritized experience replay, dueling networks, multi-step learning, distributional RL, and noisy nets.
We evaluate on a set of four classic control environments, which can be fully trained in 10-20 minutes (compared to five days for ALE games):
![]() |
Upper left: In CartPole, the task is to balance a pole on a cart that the agent can move left and right. Upper right: In Acrobot, there are two arms and two joints, where the agent applies force to the joint between the two arms in order to raise the lower arm above a threshold. Lower left: In LunarLander, the agent is meant to land the spaceship between the two flags. Lower right: In MountainCar, the agent must build up momentum between two hills to drive to the top of the rightmost hill. |
We investigated the effect of both independently adding each of the components to DQN, as well as removing each from the full Rainbow algorithm. As in the original Rainbow paper, we find that, in aggregate, the addition of each of these algorithms does improve learning over the base DQN. However, we also found some important differences, such as the fact that distributional RL — commonly thought to be a positive addition on its own — does not always yield improvements on its own. Indeed, in contrast to the ALE results in the Rainbow paper, in the classic control environments, distributional RL only yields an improvement when combined with another component.
![]() |
Each plot shows the training progress when adding the various components to DQN. The x-axis is training steps,the y-axis is performance (higher is better). |
![]() |
Each plot shows the training progress when removing the various components from Rainbow. The x-axis is training steps,the y-axis is performance (higher is better). |
We also re-ran the Rainbow experiments on the MinAtar environment, which consists of a set of five miniaturized Atari games, and found qualitatively similar results. The MinAtar games are roughly 10 times faster to train than the regular Atari 2600 games on which the original Rainbow algorithm was evaluated, but still share some interesting aspects, such as game dynamics and having pixel-based inputs to the agent. As such, they provide a challenging mid-level environment, in between the classic control and the full Atari 2600 games.
When viewed in aggregate, we find our results to be consistent with those of the original Rainbow paper — the impact resulting from each algorithmic component can vary from environment to environment. If we were to suggest a single agent that balances the tradeoffs of the different algorithmic components, our version of Rainbow would likely be consistent with the original, in that combining all components produces a better overall agent. However, there are important details in the variations of the different algorithmic components that merit a more thorough investigation.
Beyond the Rainbow
When DQN was introduced, it made use of the Huber loss and the RMSProp Optimizer. It has been common practice for researchers to use these same choices when building on DQN, as most of their effort is spent on other algorithmic design decisions. In the spirit of reassessing these assumptions, we revisited the loss function and optimizer used by DQN on a lower-cost, small-scale classic control and MinAtar environments. We ran some initial experiments using the Adam optimizer, which has lately been the most popular optimizer choice, combined with a simpler loss function, the mean-squared error loss (MSE). Since the selection of optimizer and loss function is often overlooked when developing a new algorithm, we were surprised to see that we observed a dramatic improvement on all the classic control and MinAtar environments.
We thus decided to evaluate the different ways of combining the two optimizers (RMSProp and Adam) with the two losses (Huber and MSE) on the full ALE suite (60 Atari 2600 games). We found that Adam+MSE is a superior combination than RMSProp+Huber.
![]() |
Measuring the improvement Adam+MSE gives over the default DQN settings (RMSProp + Huber); higher is better. |
Additionally, when comparing the various optimizer-loss combinations, we find that when using RMSProp, the Huber loss tends to perform better than MSE (illustrated by the gap between the solid and dotted orange lines).
![]() |
Normalized scores aggregated over all 60 Atari 2600 games, comparing the different optimizer-loss combinations. |
Conclusion
On a limited computational budget we were able to reproduce, at a high-level, the findings of the Rainbow paper and uncover new and interesting phenomena. Evidently it is much easier to revisit something than to discover it in the first place. Our intent with this work, however, was to argue for the relevance and significance of empirical research on small- and medium-scale environments. We believe that these less computationally intensive environments lend themselves well to a more critical and thorough analysis of the performance, behaviors, and intricacies of new algorithms.
We are by no means calling for less emphasis to be placed on large-scale benchmarks. We are simply urging researchers to consider smaller-scale environments as a valuable tool in their investigations, and reviewers to avoid dismissing empirical work that focuses on smaller-scale environments. By doing so, in addition to reducing the environmental impact of our experiments, we will get both a clearer picture of the research landscape and reduce the barriers for researchers from diverse and often underresourced communities, which can only help make our community and scientific advances stronger.
Acknowledgments
Thank you to Johan, the first author of this paper, for his hard work and persistence in seeing this through! We would also like to thank Marlos C. Machado, Sara Hooker, Matthieu Geist, Nino Vieillard, Hado van Hasselt, Eleni Triantafillou, and Brian Tanner for their insightful comments on this work.
We are excited to announce the availability of sparklyr.sedona, a sparklyr extension making geospatial functionalities of the Apache Sedona library easily accessible from R.
Sparklyr 1.7 delivers much-anticipated improvements, including R interfaces for image and binary data sources, several new spark_apply() capabilities, and better integration with sparklyr extensions.
CodeReading – 4. Python Code Style
Code Reading은 잘 작성되어 있는 프레임워크, 라이브러리, 툴킷 등의 다양한 프로젝트의 내부를 살펴보는 시리즈 입니다. 프로젝트의 아키텍처, 디자인철학이나 코드 스타일 등을 살펴보며, 구체적으로 하나하나 살펴보는 것이 아닌 전반적이면서 간단하게 살펴봅니다.
이번 포스트에서는 프로젝트는 아니지만, 코드 자체와 관련이 깊은 Python 언어의 코드 스타일에 대해서 알아봅니다.
Series.
- CodeReading – 1. PyTorch
- CodeReading – 2. Flask
- CodeReading – 3. ABC
- CodeReading – 4. Python Code Style
Code Style의 중요성
코드스타일이 의미하는 것은 무엇이고, 왜 필요할까요?
코드스타일은 코드 가독성을 위해서 네이밍, 라인수, Indentation 등의 코드의 형식을 맞추는 것을 의미합니다. 이 활동은 특히 ‘협업’에 중점두고 있습니다. 즉 개인이 아닌 팀으로 일하는 경우에 사용하는 하나의 룰 혹은 가이드입니다.
이 Code Reading 시리즈에서 코드를 읽고 있는 것처럼, 코드는 기계 뿐만 아니라 사람들에게도 읽히게 됩니다. 특히 가장 많이 보게 되는 코드는 같이 협업을 하는 동료의 코드일 것 입니다. 이렇게 읽어야 하는 코드가 개인의 코드스타일에 따라서 작성되어 있다면 어떨까요? 스타일에 따라서 다르겠지만, 기본적으로 코드를 해석하는데 시간이 더 걸리게 될 것입니다. 하지만 팀 내부적으로 코드 스타일을 맞춰놓았다면, 훨씬 빠르게 로직을 이해할 수 있고 새롭게 기능을 추가하거나, 코드 리뷰를 하는 등의 다양한 활동을 조금 더 쉽게 진행할 수 있을 것 입니다.
이렇게 명확한 장점을 가지고 있는 코드스타일을 도입하는데 있어서, 한가지 주의할 점이 있습니다. 코드 스타일을 지원하는 다양한 툴이 있기도 하고, 회사마다 스타일 가이드가 있습니다. 즉 누군가에게 조금 더 선호되는 스타일이 있을 수 있지만, 딱 ‘이 스타일이 정답이다’ 라고 말할 수 없는 논술 문제에 가깝습니다. 그래서 팀 내부적으로 의논을 통해, 우리 팀에 맞는 스타일을 모두의 동의하에 정의하고 서로 맞춰나가는 것이 가장 중요합니다.
그렇다면 Pyhon Code Style에 대해서 조금 더 자세히 알아볼까요?
PEP8
Python 언어의 코드 스타일의 가장 기본이 되는 것은 PEP8 입니다. PEP(Python Enhance Proposal)은 공식적으로 운영되고 있는 개선 제안서로서, 그 중에서 PEP8 은 Style Guide for Python Code 이라는 제목으로 작성이 되어있습니다.
여기에는 다음과 같은 가이드들이 작성되어 있습니다. 간략하게 다뤄보면 다음과 같습니다.
- Indentation
# Correct:
# Aligned with opening delimiter.
foo = long_function_name(var_one, var_two,
var_three, var_four)
# Add 4 spaces (an extra level of indentation) to distinguish arguments from the rest.
def long_function_name(
var_one, var_two, var_three,
var_four):
print(var_one)
# -------------------------
# Wrong:
# Arguments on first line forbidden when not using vertical alignment.
foo = long_function_name(var_one, var_two,
var_three, var_four)
# Further indentation required as indentation is not distinguishable.
def long_function_name(
var_one, var_two, var_three,
var_four):
print(var_one)
indentation의 경우, 함수의 arguments 들이 명확히 구별되도록 가이드하고 있습니다.
- Tabs vs Spaces
PEP8에서는 Space를 써야한다고 이야기하고 있고, 두가지를 혼합해서 사용하는 것을 금지하고 있습니다! (이 포스트 시작에 있는 그림과 연결되죠)
- 권장사항
아래와 같이 empty sequence 는 False 인 것을 이용해서, length를 기반으로 체크하지 않도록 권장하기도 합니다.
# Correct:
if not seq:
if seq:
# Wrong:
if len(seq):
if not len(seq):
가장 기본이 되는 스타일 가이드이기 때문에 전체를 한번 읽어보실 것을 추천드립니다. 대부분의 내용은 지금도 사용되고 있습니다. 다만, PEP8이 제안될 당시와 비교했을 때 개발 환경은 많이 달라졌습니다. 작은 모니터에서 다양한 크기의 모니터를 여러개 연결해서 개발을 하는 것이 흔해진 상황이죠. 그래서 Maximum Line Length의 경우에는 의견이 분분하기도 합니다. 모든 행을 79자 제한으로 제안되었지만, 더 길게 셋팅을 하는 경우도 흔하게 찾아볼 수 있습니다.
Tools
코드스타일에 관련한 도구들은 위의 PEP8을 기본으로 지원하면서 환경셋팅을 통해서 원하는대로 수정할 수 있도록 되어있습니다. 여러가지 도구들이 있지만, 최근에 가장 많이 사용하고 있는 black과 isort 두가지를 대표적으로 소개해드려고 합니다.
black
The Uncompromising Code Formatter
black은 비교적 최근(2018년)에 개발된 오픈소스로 정해진 코드의 규격에 맞춰서 정리해주는 Code Formatter 입니다. black의 가장 큰 특징은 바로 ‘타협하지 않는’ 입니다. 타협하지 않는 부분은 바로 PEP8 에서 커버하지 않고 있는, 사람마다 각각 스타일이 다른 부분입니다. 그래서 black의 공식 문서에서도 PEP을 준수하는 엄격한 하위집단이라고 이야기 하고 있습니다.
그러면 black에서 추구하는 코드 스타일에 대해서 간단하게 살펴보겠습니다.
- Maximum Line Length : 88 (80 에서 10% 늘어난 값)
- 줄 바꿈 방식
# [1] in:
ImportantClass.important_method(exc, limit, lookup_lines, capture_locals, extra_argument)
# [1] out:
ImportantClass.important_method(
exc, limit, lookup_lines, capture_locals, extra_argument
)
# [2] in:
def very_important_function(template: str, *variables, file: os.PathLike, engine: str, header: bool = True, debug: bool = False):
"""Applies `variables` to the `template` and writes to `file`."""
with open(file, 'w') as f:
...
# [2] out:
def very_important_function(
template: str,
*variables,
file: os.PathLike,
engine: str,
header: bool = True,
debug: bool = False,
):
"""Applies `variables` to the `template` and writes to `file`."""
with open(file, "w") as f:
...
# [3] in:
if some_long_rule1
and some_long_rule2:
...
# [3] out:
if (
some_long_rule1
and some_long_rule2
):
...
줄 길이에 따라서 위와 같이, 들여쓰기[1]가 되거나 파라미터가 더욱 많은 경우에는 하나하나 새로운 줄[2]에 위치하게 됩니다. 그리고 문장 길이를 맞추기 위해서 사용하는 백슬래시()가 파싱오류 등의 이슈가 있기 때문에 이를 사용하지 않는다고 말하고 있습니다.
- String: 작은따옴표(‘) 보다는 큰따옴표(“)를 선호
doc string에서도 큰따옴표(“)를 사용하고 있고, 큰따옴표로 구성된 empty string의 경우 (“”) 혼동을 줄 여지가 없기 때문이죠.
- Call chains
def example(session):
result = (
session.query(models.Customer.id)
.filter(
models.Customer.account_id == account_id,
models.Customer.email == email_address,
)
.order_by(models.Customer.id.asc())
.all()
)
마지막으로 많은 언어에서 사용되고 있는 call chain 형식 또한 사용하고 있습니다. 저는 개인적으로 Java 와 Scala에서 이런 패턴을 많이 봐서 더 익숙하기도 하고, 어떤 로직인지 이해가기가 쉬워서 선호하는 스타일이기도 합니다.
그 외에도 다양한 주제들이 있으니 문서를 참고해보시기를 추천드립니다!
black 에서 추구하고 있는 코드스타일이 모호한 부분들을 다루고 있는 것처럼, 모두가 여기서 제안하는 스타일이 만족스럽지 않을 수 있습니다. 협업을 위한 코드스타일에 좋은 가이드이자 시작점이 아닐까 싶습니다.
isort
isort는 import 를 관리해주는 Code Formatter 입니다. 아래 예시를 보시면 바로 감이 잡히실 것 같네요.
# Before isort:
from my_lib import Object
import os
from my_lib import Object3
from my_lib import Object2
import sys
from third_party import lib15, lib1, lib2, lib3, lib4, lib5, lib6, lib7, lib8, lib9, lib10, lib11, lib12, lib13, lib14
import sys
from __future__ import absolute_import
from third_party import lib3
print("Hey")
print("yo")
# ----------------------------------------------------------------------
# After isort:
from __future__ import absolute_import
import os
import sys
from third_party import (lib1, lib2, lib3, lib4, lib5, lib6, lib7, lib8,
lib9, lib10, lib11, lib12, lib13, lib14, lib15)
from my_lib import Object, Object2, Object3
print("Hey")
print("yo")
위의 예제에서 보는 것처럼, import 되는 package의 종류를 아래 4가지로 관리하고 있습니다.
__future__
: Python2 에서 3의 기능을 사용하기 위해서 Import 하는 모듈built-in modules
: os, sys, collections 등 Python의 기본 내장 모듈third party
: pip를 통해서 설치한 외부 library 들 (e.g. requests, pytorch 등)my library
: 스스로 만든 package
black 과의 큰 차이점은 다양한 옵션들을 제공하는 것입니다. 위의 예시에서 third_party
의 모듈 여러개를 import 하는 방식에서 한 줄로 나열하는 방식, 하나씩 줄을 바꾸는 경우 등 다양한 옵션들이 있는 것을 확인하실 수 있습니다.
# 0 - Grid
from third_party import (lib1, lib2, lib3,
lib4, lib5, ...)
# 1 - Vertical
from third_party import (lib1,
lib2,
lib3
lib4,
lib5,
...)
...
# 3 - Vertical Hanging Indent (Black Style)
from third_party import (
lib1,
lib2,
lib3,
lib4,
)
...
관련해서 옵션이 11개 있으니, 원하는 대로 설정해서 사용하시면 될 것 같습니다. 만약에 위에서 소개했던 black과 같이 사용한다면 3번 옵션으로 셋팅해서 사용하면 스타일을 유지할 수 있게 됩니다. isort 에서는 다른 Code Formatter 라이브러리들끼리는 서로가 호환되도록 개발을 모두 해놓았기 때문에 간단한 설정으로 양쪽 모두 사용하실 수 있을 것 입니다.
끝으로
이번 포스트에서는 Code Style을 왜 맞추어야 하는지 그리고 Python 코드 스타일 가이드 PEP8에서 시작해서 쉽게 적용해볼 수 있는 Tool 을 소개시켜 드렸습니다. 서문에서 이야기 한 것처럼, 코드 스타일은 다른 무엇이 아닌 협업을 위한 것 입니다. 그래서 코드 스타일 자체보다는 이것을 도입하기 까지의 과정이 더 중요하다고 생각이 드네요. ‘함께’ 더 잘 일하기 위한 것이니까요.
References
In the last two decades, dramatic advances in compute and connectivity have allowed game developers to create works of ever-increasing scope and complexity. Simple linear levels have evolved into photorealistic open worlds, procedural algorithms have enabled games with unprecedented variety, and expanding internet access has transformed games into dynamic online services. Unfortunately, scope and complexity have grown more rapidly than the size of quality assurance teams or the capabilities of traditional automated testing. This poses a challenge to both product quality (such as delayed releases and post-launch patches) and developer quality of life.
Machine learning (ML) techniques offer a possible solution, as they have demonstrated the potential to profoundly impact game development flows – they can help designers balance their game and empower artists to produce high-quality assets in a fraction of the time traditionally required. Furthermore, they can be used to train challenging opponents that can compete at the highest levels of play. Yet some ML techniques can pose requirements that currently make them impractical for production game teams, including the design of game-specific network architectures, the development of expertise in implementing ML algorithms, or the generation of billions of frames of training data. Conversely, game developers operate in a setting that offers unique advantages to leverage ML techniques, such as direct access to the game source, an abundance of expert demonstrations, and the uniquely interactive nature of video games.
Today, we present a ML-based system that game developers can use to quickly and efficiently train game-testing agents, helping developers find serious bugs quickly while allowing human testers to focus on more complex and intricate problems. The resulting solution requires no ML expertise, works on many of the most popular game genres, and can train an ML policy, which generates game actions from game state, in less than an hour on a single game instance. We have also released an open source library that demonstrates a functional application of these techniques.
![]() |
Supported genres include arcade, action/adventure, and racing games. |
The Right Tool for the Right Job
The most elemental form of video game testing is to simply play the game. A lot. Many of the most serious bugs (such as crashes or falling out of the world) are easy to detect and fix; the challenge is finding them within the vast state space of a modern game. As such, we decided to focus on training a system that could “just play the game” at scale.
We found that the most effective way to do this was not to try to train a single, super effective agent that could play the entire game from end-to-end, but to provide developers with the ability to train an ensemble of game-testing agents, each of which could effectively accomplish tasks of a few minutes each, which game developers refer to as “gameplay loops”.
![]() |
These core gameplay behaviors are often expensive to program through traditional means, but are much more efficient to train than a single end-to-end ML model. In practice, commercial games create longer loops by repeating and remixing core gameplay loops, which means that developers can test large stretches of gameplay by combining ML policies with a small amount of simple scripting.
Simulation-centric, Semantic API
One of the most fundamental challenges in applying ML to game development is bridging the chasm between the simulation-centric world of video games and the data-centric world of ML. Rather than ask developers to directly convert the game state into custom, low-level ML features (which would be too labor intensive) or attempting to learn from raw pixels (which would require too much data to train), our system provides developers with an idiomatic, game-developer friendly API that allows them to describe their game in terms of the essential state that a player observes and the semantic actions they can perform. All of this information is expressed via concepts that are familiar to game developers, such as entities, raycasts, 3D positions and rotations, buttons and joysticks.
As you can see in the example below, the API allows the specification of observations and actions in just a few lines of code.
![]() |
Example actions and observations for a racing game. |
From API to Neural NetworkThis high level, semantic API is not just easy to use but also allows the system to flexibly adapt to the specific game being developed – the specific combination of API building blocks employed by the game developer informs our choice of network architecture, since it provides information about the type of gaming scenario in which the system is deployed. Some examples of this include: handling action outputs differently depending on whether they represent a digital button or analog joystick, or using techniques from image processing to handle observations that result from an agent probing its environment with raycasts (similar to how autonomous vehicles probe their environment with LIDAR).
Our API is sufficiently general to allow modeling of many common control-schemes (the configuration of action outputs that control movement) in games, such as first-person games, third-person games with camera-relative controls, racing games, twin stick shooters, etc. Since 3D movement and aiming are often an integral aspect of gameplay in general, we create networks that automatically tend towards simple behaviors such as aiming, approach or avoidance in these games. The system accomplishes this by analyzing the game’s control scheme to create neural network layers that perform custom processing of observations and actions in that game. For example, positions and rotations of objects in the world are automatically translated into directions and distances from the point of view of the AI-controlled game entity. This transformation typically increases the speed of learning and helps the learned network generalize better.
Learning From The Experts in Real Time
After generating a neural network architecture, the network needs to be trained to play the game using an appropriate choice of learning algorithm.
Reinforcement learning (RL), in which an ML policy is trained directly to maximize a reward, may seem like the obvious choice since they have been successfully used to train highly competent ML policies for games. However, RL algorithms tend to require more data than a single game instance can produce in a reasonable amount of time, and achieving good results in a new domain often requires hyperparameter tuning and strong ML domain knowledge.
Instead, we found that imitation learning (IL), which trains ML policies based by observing experts play the game, works well for our use case. Unlike RL, where the agent needs to discover a good policy on its own, IL only needs to recreate the behavior of a human expert. Since game developers and testers are experts in their own games, they can easily provide demonstrations of how to play the game.
We use an IL approach inspired by the DAgger algorithm, which allows us to take advantage of video games’ most compelling quality – interactivity. Thanks to the reductions in training time and data requirements enabled by our semantic API, training is effectively realtime, giving a developer the ability to fluidly switch between providing gameplay demonstrations and watching the system play. This results in a natural feedback loop, in which a developer iteratively provides corrections to a continuous stream of ML policies.
From the developer’s perspective, providing a demonstration or a correction to faulty behavior is as simple as picking up the controller and starting to play the game. Once they are done, they can put the controller down and watch the ML policy play. The result is a training experience that is real-time, interactive, highly experiential, and, very often, more than a little fun.
![]() |
ML policy for an FPS game, trained with our system. |
Conclusion
We present a system which combines a high-level semantic API with a DAgger-inspired interactive training flow that enables training of useful ML policies for video game testing in a wide variety of genres. We have released an open source library as a functional illustration of our system. No ML expertise is required and training of agents for test applications often takes less than an hour on a single developer machine. We hope that this work will help inspire the development of ML techniques that can be deployed in real-world game-development flows in ways that are accessible, effective, and fun to use.
Acknowledgements
We’d like to thank the core members of the project: Dexter Allen, Leopold Haller, Nathan Martz, Hernan Moraldo, Stewart Miles and Hina Sakazaki. Training algorithms are provided by TF Agents, and on-device inference by TF Lite. Special thanks to our research advisors, Olivier Bachem, Erik Frey, and Toby Pohlen, and to Eugene Brevdo, Jared Duke, Oscar Ramirez and Neal Wu who provided helpful guidance and support.
Despite recent leaps in imaging technology, especially on mobile devices, image noise and limited sharpness remain two of the most important levers for improving the visual quality of a photograph. These are particularly relevant when taking pictures in poor light conditions, where cameras may compensate by increasing the ISO or slowing the shutter speed, thereby exacerbating the presence of noise and, at times, increasing image blur. Noise can be associated with the particle nature of light (shot noise) or be introduced by electronic components during the readout process (read noise). The captured noisy signal is then processed by the camera image processor (ISP) and later may be further enhanced, amplified, or distorted by a photographic editing process. Image blur can be caused by a wide variety of phenomena, from inadvertent camera shake during capture, an incorrect setting of the camera’s focus (automatic or not), or due to the finite lens aperture, sensor resolution or the camera’s image processing.
It is far easier to minimize the effects of noise and blur within a camera pipeline, where details of the sensor, optical hardware and software blocks are understood. However, when presented with an image produced from an arbitrary (possibly unknown) camera, improving noise and sharpness becomes much more challenging due to the lack of detailed knowledge and access to the internal parameters of the camera. In most situations, these two problems are intrinsically related: noise reduction tends to eliminate fine structures along with unwanted details, while blur reduction seeks to boost structures and fine details. This interconnectedness increases the difficulty of developing image enhancement techniques that are computationally efficient to run on mobile devices.
Today, we present a new approach for camera-agnostic estimation and elimination of noise and blur that can improve the quality of most images. We developed a pull-push denoising algorithm that is paired with a deblurring method, called polyblur. Both of these components are designed to maximize computational efficiency, so users can successfully enhance the quality of a multi-megapixel image in milliseconds on a mobile device. These noise and blur reduction strategies are critical components of the recent Google Photos editor updates, which includes “Denoise” and “Sharpen” tools that enable users to enhance images that may have been captured under less than ideal conditions, or with older devices that may have had more noisy sensors or less sharp optics.
![]() |
A demonstration of the “Denoise” and “Sharpen” tools now available in the Google Photos editor. |
How Noisy is An Image?
In order to accurately process a photographic image and successfully reduce the unwanted effects of noise and blur, it is vitally important to first characterize the types and levels of noise and blur found in the image. So, a camera-agnostic approach for noise reduction begins by formulating a method to gauge the strength of noise at the pixel level from any given image, regardless of the device that created it. The noise level is modeled as a function of the brightness of the underlying pixel. That is, for each possible brightness level, the model estimates a corresponding noise level in a manner agnostic to either the actual source of the noise or the processing pipeline.
To estimate this brightness-based noise level, we sample a number of small patches across the image and measure the noise level within each patch, after roughly removing any underlying structure in the image. This process is repeated at multiple scales, making it robust to artifacts that may arise from compression, image resizing, or other non-linear camera processing operations.
Reducing Noise Selectively with a Pull-Push Method
We take advantage of self-similarity of patches across the image to denoise with high fidelity. The general principle behind such so-called “non-local” denoising is that noisy pixels can be denoised by averaging pixels with similar local structure. However, these approaches typically incur high computational costs because they require a brute force search for pixels with similar local structure, making them impractical for on-device use. In our “pull-push” approach1, the algorithmic complexity is decoupled from the size of filter footprints thanks to effective information propagation across spatial scales.
The first step in pull-push is to build an image pyramid (i.e., multiscale representation) in which each successive level is generated recursively by a “pull” filter (analogous to downsampling). This filter uses a per-pixel weighting scheme to selectively combine existing noisy pixels together based on their patch similarities and estimated noise, thus reducing the noise at each successive, “coarser” level. Pixels at coarser levels (i.e., with lower resolution) pull and aggregate only compatible pixels from higher resolution, “finer” levels. In addition to this, each merged pixel in the coarser layers also includes an estimated reliability measure computed from the similarity weights used to generate it. Thus, merged pixels provide a simple per-pixel, per-level characterization of the image and its local statistics. By efficiently propagating this information through each level (i.e., each spatial scale), we are able to track a model of the neighborhood statistics for increasingly larger regions in a multiscale manner.
After the pull stage is evaluated to the coarsest level, the “push” stage fuses the results, starting from the coarsest level and generating finer levels iteratively. At a given scale, the push stage generates “filtered” pixels following a process similar to that of the pull stage, but going from coarse to finer levels. The pixels at each level are fused with those of coarser levels by doing a weighted average of same-level pixels along with coarser-level filtered pixels using the respective reliability weights. This enables us to reduce pixel noise while preserving local structure, because only average reliable information is included. This selective filtering and reliability (i.e. information) multiscale propagation is what makes push-pull different from existing frameworks.
The pull-push approach has a low computational cost, because the algorithm to selectively filter similar pixels over a very large neighborhood has a complexity that is only linear with the number of image pixels. In practice, the quality of this denoising approach is comparable to traditional non-local methods with much larger kernel footprints, but operates at a fraction of the computational cost.
![]() |
![]() |
Image enhanced using the pull-push denoising method. |
How Blurry Is an Image?
An image with poor sharpness can be thought of as being a more pristine latent image that was operated on by a blur kernel. So, if one can identify the blur kernel, it can be used to reduce the effect. This is referred to as “deblurring”, i.e., the removal or reduction of an undesired blur effect induced by a particular kernel on a particular image. In contrast, “sharpening” refers to applying a sharpening filter, built from scratch and without reference to any particular image or blur kernel. Typical sharpening filters are also, in general, local operations that do not take account of any other information from other parts of the image, whereas deblurring algorithms estimate the blur from the whole image. Unlike arbitrary sharpening, which can result in worse image quality when applied to an image that is already sharp, deblurring a sharp image with a blur kernel accurately estimated from the image itself will have very little effect.
We specifically target relatively mild blur, as this scenario is more technically tractable, more computationally efficient, and produces consistent results. We model the blur kernel as an anisotropic (elliptical) Gaussian kernel, specified by three parameters that control the strength, direction and aspect ratio of the blur.
Computing and removing blur without noticeable delay for the user requires an algorithm that is much more computationally efficient than existing approaches, which typically cannot be executed on a mobile device. We rely on an intriguing empirical observation: the maximal value of the image gradient across all directions at any point in a sharp image follows a particular distribution. Finding the maximum gradient value is efficient, and can yield a reliable estimate of the strength of the blur in the given direction. With this information in hand, we can directly recover the parameters that characterize the blur.
Polyblur: Removing Blur by Re-blurring
To recover the sharp image given the estimated blur, we would (in theory) need to solve a numerically unstable inverse problem (i.e., deblurring). The inversion problem grows exponentially more unstable with the strength of the blur. As such, we target the case of mild blur removal. That is, we assume that the image at hand is not so blurry as to be beyond practical repair. This enables a more practical approach — by carefully combining different re-applications of an operator we can approximate its inverse.
![]() |
![]() |
Mild blur, as shown in these examples, can be effectively removed by combining multiple applications of the estimated blur. |
This means, rather counterintuitively, that we can deblur an image by re-blurring it several times with the estimated blur kernel. Each application of the (estimated) blur corresponds to a first order polynomial, and the repeated applications (adding or subtracting) correspond to higher order terms in a polynomial. A key aspect of this approach, which we call polyblur, is that it is very fast, because it only requires a few applications of the blur itself. This allows it to operate on megapixel images in a fraction of a second on a typical mobile device. The degree of the polynomial and its coefficients are set to invert the blur without boosting noise and other unwanted artifacts.
![]() |
The deblurred image is generated by adding and subtracting multiple re-applications of the estimated blur (polyblur). |
Integration with Google Photos
The innovations described here have been integrated and made available to users in the Google Photos image editor in two new adjustment sliders called “Denoise” and “Sharpen”. These features allow users to improve the quality of everyday images, from any capture device. The features often complement each other, allowing both denoising to reduce unwanted artifacts, and sharpening to bring clarity to the image subjects. Try using this pair of tools in tandem in your images for best results. To learn more about the details of the work described here, check out our papers on polyblur and pull-push denoising. To see some examples of the effect of our denoising and sharpening up close, have a look at the images in this album.
Acknowledgements
The authors gratefully acknowledge the contributions of Ignacio Garcia-Dorado, Ryan Campbell, Damien Kelly, Peyman Milanfar, and John Isidoro. We are also thankful for support and feedback from Navin Sarma, Zachary Senzer, Brandon Ruffin, and Michael Milne.
1 The original pull-push algorithm was developed as an efficient scattered data interpolation method to estimate and fill in the missing pixels in an image where only a subset of the pixels are specified. Here, we extend its methodology and present a data-dependent multiscale algorithm for denoising images efficiently. ↩
In fall of 2019, we demonstrated that the Sycamore quantum processor could outperform the most powerful classical computers when applied to a tailor-made problem. The next challenge is to extend this result to solve practical problems in materials science, chemistry and physics. But going beyond the capabilities of classical computers for these problems is challenging and will require new insights to achieve state-of-the-art accuracy. Generally, the difficulty in performing quantum simulations of such physical problems is rooted in the wave nature of quantum particles, where deviations in the initial setup, interference from the environment, or small errors in the calculations can lead to large deviations in the computational result.
In two upcoming publications, we outline a blueprint for achieving record levels of precision for the task of simulating quantum materials. In the first work, we consider one-dimensional systems, like thin wires, and demonstrate how to accurately compute electronic properties, such as current and conductance. In the second work, we show how to map the Fermi-Hubbard model, which describes interacting electrons, to a quantum processor in order to simulate important physical properties. These works take a significant step towards realizing our long-term goal of simulating more complex systems with practical applications, like batteries and pharmaceuticals.
![]() |
A bottom view of one of the quantum dilution refrigerators during maintenance. During the operation, the microwave wires that are floating in this image are connected to the quantum processor, e.g., the Sycamore chip, bringing the temperature of the lowest stage to a few tens of milli-degrees above absolute zero temperature. |
Computing Electronic Properties of Quantum Materials
In “Accurately computing electronic properties of a quantum ring”, to be published in Nature, we show how to reconstruct key electronic properties of quantum materials. The focus of this work is on one-dimensional conductors, which we simulate by forming a loop out of 18 qubits on the Sycamore processor in order to mimic a very narrow wire. We illustrate the underlying physics through a series of simple text-book experiments, starting with a computation of the “band-structure” of this wire, which describes the relationship between the energy and momentum of electrons in the metal. Understanding such structure is a key step in computing electronic properties such as current and conductance. Despite being an 18-qubit algorithm consisting of over 1,400 logical operations, a significant computational task for near-term devices, we are able to achieve a total error as low as 1%.
The key insight enabling this level of accuracy stems from robust properties of the Fourier transform. The quantum signal that we measure oscillates in time with a small number of frequencies. Taking a Fourier transform of this signal reveals peaks at the oscillation frequencies (in this case, the energy of electrons in the wire). While experimental imperfections affect the height of the observed peaks (corresponding to the strength of the oscillation), the center frequencies are robust to these errors. On the other hand, the center frequencies are especially sensitive to the physical properties of the wire that we hope to study (e.g., revealing small disorders in the local electric field felt by the electrons). The essence of our work is that studying quantum signals in the Fourier domain enables robust protection against experimental errors while providing a sensitive probe of the underlying quantum system.
Quantum Simulation of the Fermi-Hubbard Model
In “Observation of separated dynamics of charge and spin in the Fermi-Hubbard model”, we focus on the dynamics of interacting electrons. Interactions between particles give rise to novel phenomena such as high temperature superconductivity and spin-charge separation. The simplest model that captures this behavior is known as the Fermi-Hubbard model. In materials such as metals, the atomic nuclei form a crystalline lattice and electrons hop from lattice site to lattice site carrying electrical current. In order to accurately model these systems, it is necessary to include the repulsion that electrons feel when getting close to one another. The Fermi-Hubbard model captures this physics with two simple parameters that describe the hopping rate (J) and the repulsion strength (U).
We realize the dynamics of this model by mapping the two physical parameters to logical operations on the qubits of the processor. Using these operations, we simulate a state of the electrons where both the electron charge and spin densities are peaked near the center of the qubit array. As the system evolves, the charge and spin densities spread at different rates due to the strong correlations between electrons. Our results provide an intuitive picture of interacting electrons and serve as a benchmark for simulating quantum materials with superconducting qubits.
Conclusion
Quantum processors hold the promise to solve computationally hard tasks beyond the capability of classical approaches. However, in order for these engineered platforms to be considered as serious contenders, they must offer computational accuracy beyond the current state-of-the-art classical methods. In our first experiment, we demonstrate an unprecedented level of accuracy in simulating simple materials, and in our second experiment, we show how to embed realistic models of interacting electrons into a quantum processor. It is our hope that these experimental results help progress the goal of moving beyond the classical computing horizon.
Advances on neural machine translation (NMT) have enabled more natural and fluid translations, but they still can reflect the societal biases and stereotypes of the data they’re trained on. As such, it is an ongoing goal at Google to develop innovative techniques to reduce gender bias in machine translation, in alignment with our AI Principles.
One research area has been using context from surrounding sentences or passages to improve gender accuracy – this is a challenge because traditional NMT methods translate sentences individually, but gendered information is not always explicitly stated in each individual sentence. For example, in the following passage in Spanish (a language where subjects aren’t always explicitly mentioned), the first sentence refers explicitly to Marie Curie as the subject, but the second one doesn’t explicitly mention the subject. In isolation, this second sentence could refer to a person of any gender. When translating to English, however, a pronoun needs to be picked, and the information needed for an accurate translation is in the first sentence.
Spanish Text | Translation to English |
Marie Curie nació en Varsovia. Fue la primera persona en recibir dos premios Nobel en distintas especialidades. | Marie Curie was born in Warsaw. She was the first person to receive two Nobel Prizes in different specialties. |
Advancing translation techniques beyond single sentences requires new metrics for measuring progress and new datasets with the most common context-related errors. Adding to this challenge is the fact that translation errors related to gender (such as picking the correct pronoun or having gender agreement) are particularly sensitive because they may directly refer to people and how they self identify.
To help facilitate progress against the common challenges on contextual translation (e.g., pronoun drop, gender agreement and accurate possessives), we are releasing the Translated Wikipedia Biographies dataset, which can be used to evaluate the gender bias of translation models. Our intent with this release is to support long-term improvements on ML systems focused on pronouns and gender in translation by providing a benchmark in which translations’ accuracy can be measured pre- and post-model changes.
A Source of Common Translation Errors
Because they are well-written, geographically diverse, contain multiple sentences, and refer to subjects in the third person (so contain plenty of pronouns), Wikipedia biographies offer a high potential for common translation errors associated with gender. These often occur when articles refer to a person explicitly in early sentences of a paragraph, but there is no explicit mention of the person in later sentences. Some examples:
Translation Error | Text | Translation | ||
Pro-drop in Spanish → English | Marie Curie nació en Varsovia. Recibió el Premio Nobel en 1903 y en 1911. | Marie Curie was born in Warsaw. He received the Nobel Prize in 1903 and in 1911. | ||
Neutral possessives in Spanish → English | Marie Curie nació en Varsovia. Su carrera profesional fue desarrollada en Francia. | Marie Curie was born in Warsaw. His professional career was developed in France. | ||
Gender agreement in English → German | Marie Curie was born in Warsaw. The distinguished scientist received the Nobel Prize in 1903 and in 1911. | Marie Curie wurde in Varsovia geboren. Der angesehene Wissenschaftler erhielt 1903 und 1911 den Nobelpreis. | ||
Gender agreement in English → Spanish | Marie Curie was born in Warsaw. The distinguished scientist received the Nobel Prize in 1903 and in 1911. | Marie Curie nació en Varsovia. El distinguido científico recibió el Premio Nobel en 1903 y en 1911. | ||
Building the Dataset
The Translated Wikipedia Biographies dataset has been designed to analyze common gender errors in machine translation such as those illustrated above. Each instance of the dataset represents a person (identified in the biographies as feminine or masculine), a rock band or a sports team (considered genderless). Each instance is represented by a long text translation of 8 to 15 connected sentences referring to that central subject (the person, rock band, or sports team). Articles are written in native English and have been professionally translated to Spanish and German. For Spanish, translations were optimized for pronoun-drop, so the same set could be used to analyze pro-drop (Spanish → English) and gender agreement (English → Spanish).
The dataset was built by selecting a group of instances that has equal representation across geographies and genders. To do this, we extracted biographies from Wikipedia according to occupation, profession, job and/or activity. To ensure an unbiased selection of occupations, we chose 9 occupations that represented a range of stereotypical gender associations (either feminine, masculine, or neither) based on Wikipedia statistics. Then, to mitigate any geography-based bias, we divided all these instances based on geographical diversity. For each occupation category, we looked to have one candidate per region (using regions from census.gov as a proxy of geographical diversity). When an instance was associated with a region, we checked that the selected person had a relevant relationship with a country that belongs to a designated region (nationality, place of birth, lived for a big portion of their life, etc.). By using this criteria, the dataset contains entries about individuals from more than 90 countries and all regions of the world.
Although gender is non-binary, we focused on having equal representation of “feminine” and “masculine” entities. It’s worth mentioning that because the entities are represented as such on Wikipedia, the set doesn’t include individuals that identify as non-binary, as unfortunately there are not enough instances currently represented in Wikipedia to accurately reflect the non-binary community. To label each instance as “feminine” or “masculine” we relied on the biographical information from Wikipedia, which contained gender-specific references to the person (she, he, woman, son, father, etc.).
After applying all these filters, we randomly selected an instance for each occupation-region-gender triplet. For each occupation, there are 2 biographies (one masculine and one feminine), for each of the 7 geographic regions.
Finally, we added 12 instances with no gender. We picked rock bands and sports teams because they are usually referred to by non-gendered third person pronouns (such as “it” or singular “they”). The purpose of including these instances is to study over triggering (i.e., when models learn that they are rewarded for producing gender-specific pronouns, soproduce these pronouns in cases where they shouldn’t).
Results and Applications
This dataset enables a new method of evaluation for gender bias reduction in machine translations (introduced in a previous post). Because each instance refers to a subject with a known gender, we can compute the accuracy of the gender-specific translations that refer to this subject. This computation is easier when translating into English (cases of languages with prodrop or neutral pronouns) since computation is mainly based on gender-specific pronouns in English. In these cases, the gender datasets have allowed us to observe a 67% reduction in errors on context-aware models vs. previous models. As mentioned before, the neutral entities have allowed us to discover cases of over triggering like the usage of feminine or masculine pronouns to refer to genderless entities. This new dataset also enables new research directions into the performance of different models across types of occupations or geographic regions.
As an example, the dataset allowed us to discover the following improvements in an excerpt of the translated biography of Marie Curie from Spanish.
![]() |
Translation result with the previous NMT model. |
![]() |
Translation result with the new contextual model. |
Conclusion
This Translated Wikipedia Biographies dataset is the result of our own studies and work on identifying biases associated with gender and machine translation. This set focuses on a specific problem related to gender bias and doesn’t aim to cover the whole problem. It’s worth mentioning that by releasing this dataset, we don’t aim to be prescriptive in determining what’s the optimal approach to address gender bias. This contribution aims to foster progress on this challenge across the global research community.
Acknowledgements
The datasets were built with help from Anja Austermann, Melvin Johnson, Michelle Linch, Mengmeng Niu, Mahima Pushkarna, Apu Shah, Romina Stella, and Kellie Webster.