Human-level control through deep reinforcement learning - Reading Notes

This post translates and studies the Google DeepMind paper "Human-level control through deep reinforcement learning".
A complete TensorFlow implementation of an AI agent for the Atari game Breakout will be attached afterwards.


The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.


Rooted in psychological and neuroscientific perspectives on animal behaviour, the theory of reinforcement learning provides a normative account of how agents may optimize their control of an environment.

However, to use reinforcement learning successfully in situations approaching real-world complexity, agents face a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations.

Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems. The former is evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal-difference reinforcement learning algorithms.

Although reinforcement learning agents have achieved some success in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network (DQN), that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning.

We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, and using the same algorithm, network architecture and hyperparameters, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent capable of learning to excel at a diverse array of challenging tasks.


We set out to create a single algorithm that would be able to develop a wide range of competencies on a varied range of challenging tasks, a central goal of general artificial intelligence that has eluded previous efforts. To achieve this, we developed a novel agent, a deep Q-network (DQN), which is able to combine reinforcement learning with a class of artificial neural network known as deep neural networks. Notably, recent advances in deep neural networks, in which several layers of nodes are used to build up progressively more abstract representations of the data, have made it possible for artificial neural networks to learn concepts such as object categories directly from raw sensory data. We use one particularly successful architecture, the deep convolutional network, which uses hierarchical layers of tiled convolutional filters to mimic the effects of receptive fields (inspired by Hubel and Wiesel's seminal work on feedforward processing in early visual cortex), thereby exploiting the local spatial correlations present in images, and building in robustness to natural transformations such as changes of viewpoint or scale.


We set out to create a single algorithm able to develop a wide range of competencies on a varied range of challenging tasks, a central goal of general artificial intelligence that had eluded previous efforts.

To this end, we developed a novel agent, the deep Q-network (DQN), which combines reinforcement learning with a class of artificial neural networks known as deep neural networks.

Notably, recent advances in deep neural networks, in which several layers of nodes are used to build up progressively more abstract representations of the data, have made it possible for artificial neural networks to learn concepts such as object categories directly from raw sensory data.

We use one particularly successful architecture, the deep convolutional network, which uses hierarchical layers of tiled convolutional filters to mimic the effects of receptive fields, an idea inspired by Hubel and Wiesel's seminal work on feedforward processing in the early visual cortex. This lets the network exploit the local spatial correlations present in images and builds in robustness to natural transformations such as changes of viewpoint or scale.

We consider tasks in which the agent interacts with an environment through a sequence of observations, actions and rewards. The goal of the agent is to select actions in a fashion that maximizes cumulative future reward. More formally, we use a deep convolutional neural network to approximate the optimal action-value function, which is the maximum sum of rewards r_t discounted by 𝛄 at each time step t, achievable by a behaviour policy 𝛑 = P(a|s), after making an observation (s) and taking an action (a).

For such tasks, the agent interacts with an environment through a sequence of observations, actions and rewards, and its goal is to select actions in a way that maximizes cumulative future reward. More formally, we use a deep convolutional neural network to approximate the optimal action-value function (see the formula below), which is the maximum sum of rewards r_t, discounted by 𝛄 at each time step t, achievable by a behaviour policy 𝛑 = P(a|s) after making an observation (s) and taking an action (a).
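The equation image from the original post is not reproduced here; written out from the paper's description above, the optimal action-value function is:

$$
Q^*(s, a) = \max_{\pi} \mathbb{E}\left[\, r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \;\middle|\; s_t = s,\ a_t = a,\ \pi \,\right]
$$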

Reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator such as a neural network is used to represent the action-value (also known as Q) function. This instability has several causes: the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy and therefore change the data distribution, and the correlations between the action-values (Q) and the target values r + 𝛄 max Q(s', a'). We address these instabilities with a novel variant of Q-learning, which uses two key ideas. First, we used a biologically inspired mechanism termed experience replay that randomizes over the data, thereby removing correlations in the observation sequence and smoothing over changes in the data distribution (see below for details). Second, we used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target. While other stable methods exist for training neural networks in the reinforcement learning setting, such as neural fitted Q-iteration, these methods involve the repeated training of networks de novo on hundreds of iterations. Consequently, these methods, unlike our algorithm, are too inefficient to be used successfully with large neural networks. We parameterize an approximate value function Q(s, a; 𝛉i) using the deep convolutional neural network shown in Fig. 1, in which 𝛉i are the parameters (that is, weights) of the Q-network at iteration i. To perform experience replay we store the agent's experiences et = (st, at, rt, st+1) at each time step t in a data set Dt = {e1, …, et}. During learning, we apply Q-learning updates on samples (or minibatches) of experience (s, a, r, s') ~ U(D), drawn uniformly at random from the pool of stored samples. The Q-learning update at iteration i uses the loss function shown below, in which 𝛄 is the discount factor determining the agent's horizon, 𝛉i are the parameters of the Q-network at iteration i and 𝛉i- are the network parameters used to compute the target at iteration i. The target network parameters 𝛉i- are only updated with the Q-network parameters (𝛉i) every C steps and are held fixed between individual updates (see Methods).

When a nonlinear function approximator such as a neural network is used to represent the action-value (also known as Q) function, reinforcement learning is known to be unstable or even to diverge. This instability has several causes: (1) the correlations present in the sequence of observations; (2) the fact that small updates to Q may significantly change the policy and therefore change the data distribution; and (3) the correlations between the action-values (Q) and the target values r + 𝛄 max Q(s', a'). We address these instabilities with a novel variant of Q-learning that uses two key ideas. First, we use a biologically inspired mechanism called experience replay, which randomizes over the data, thereby removing correlations in the observation sequence and smoothing over changes in the data distribution (see below for details). Second, we use an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target. Although other stable methods exist for training neural networks in the reinforcement learning setting, such as neural fitted Q-iteration, these methods involve repeatedly training networks from scratch over hundreds of iterations. Consequently, unlike our algorithm, they are too inefficient to be used successfully with large neural networks. We parameterize an approximate value function Q(s, a; 𝛉i) using the deep convolutional neural network shown in Fig. 1, in which 𝛉i are the parameters (that is, weights) of the Q-network at iteration i. To perform experience replay, we store the agent's experiences et = (st, at, rt, st+1) at each time step t in a data set Dt = {e1, …, et}. During learning, we apply Q-learning updates on samples (or minibatches) of experience (s, a, r, s') ~ U(D), drawn uniformly at random from the pool of stored samples. The Q-learning update at iteration i uses the following loss function:
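The loss-function image from the original post is missing; reconstructed from the description above and the paragraph that follows, the DQN loss at iteration i is:

$$
L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[\left(r + \gamma \max_{a'} Q\big(s', a'; \theta_i^{-}\big) - Q\big(s, a; \theta_i\big)\right)^2\right]
$$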

where 𝛄 is the discount factor determining the agent's horizon, 𝛉i are the parameters of the Q-network at iteration i, and 𝛉i- are the network parameters used to compute the target at iteration i. The target network parameters 𝛉i- are only updated with the Q-network parameters (𝛉i) every C steps and are held fixed between individual updates (see Methods).
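To make the two stabilizing ideas concrete, here is a minimal Python sketch (class and function names are my own, not the paper's code): a uniform replay buffer and a periodic target-network synchronization. The default values are only illustrative.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size store of transitions e_t = (s_t, a_t, r_t, s_{t+1}, done)."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks up the temporal correlations in the observation sequence.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

def maybe_sync_target(step, q_net, target_net, C=10_000):
    """Copy the Q-network weights into the target network every C steps.

    Assumes tf.keras-style models exposing get_weights()/set_weights().
    """
    if step % C == 0:
        target_net.set_weights(q_net.get_weights())
```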


Figure 1 | Schematic illustration of the convolutional neural network. The details of the architecture are explained in the Methods. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map 𝛟, followed by three convolutional layers (note: snaking blue line symbolizes sliding of each filter across input image) and two fully connected layers with a single output for each valid action. Each hidden layer is followed by a rectifier nonlinearity (that is, max(0, x)).

Figure 1 | Schematic illustration of the convolutional neural network. The details of the architecture are explained in the Methods. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map 𝛟, followed by three convolutional layers (note: the snaking blue line symbolizes the sliding of each filter across the input image) and two fully connected layers with a single output for each valid action. Each hidden layer is followed by a rectifier nonlinearity (that is, max(0, x)).

To evaluate our DQN agent, we took advantage of the Atari 2600 platform, which offers a diverse array of tasks (n = 49) designed to be difficult and engaging for human players. We used the same network architecture, hyperparameter values (see Extended Data Table 1) and learning procedure throughout, taking high-dimensional data (210 × 160 colour video at 60 Hz) as input, to demonstrate that our approach robustly learns successful policies over a variety of games based solely on sensory inputs with only very minimal prior knowledge (that is, merely the input data were visual images, and the number of actions available in each game, but not their correspondences; see Methods). Notably, our method was able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner, illustrated by the temporal evolution of two indices of learning (the agent's average score-per-episode and average predicted Q-values; see Fig. 2 and Supplementary Discussion for details).

To evaluate our DQN agent, we took advantage of the Atari 2600 platform, which offers a diverse array of tasks (n = 49) designed to be difficult and engaging for human players. We used the same network architecture, hyperparameter values (see Extended Data Table 1) and learning procedure throughout, taking high-dimensional data (210 × 160 colour video at 60 Hz) as input, to demonstrate that our approach robustly learns successful policies over a variety of games based solely on sensory inputs with only very minimal prior knowledge (that is, merely the visual images as input data and the number of actions available in each game, but not their correspondences; see Methods). Notably, our method was able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner, as illustrated by the temporal evolution of two indices of learning (the agent's average score per episode and its average predicted Q-values; see Fig. 2 and the Supplementary Discussion for details).

We compared DQN with the best performing methods from the reinforcement learning literature on the 49 games where results were available. In addition to the learned agents, we also report scores for a professional human games tester playing under controlled conditions and a policy that selects actions uniformly at random (Extended Data Table 2 and Fig. 3, denoted by 100% (human) and 0% (random) on y axis; see Methods). Our DQN method outperforms the best existing reinforcement learning methods on 43 of the games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches (for example, refs 12, 15). Furthermore, our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games (29 games; see Fig. 3, Supplementary Discussion and Extended Data Table 2). In additional simulations (see Supplementary Discussion and Extended Data Tables 3 and 4), we demonstrate the importance of the individual core components of the DQN agent (the replay memory, separate target Q-network and deep convolutional network architecture) by disabling them and demonstrating the detrimental effects on performance.

We compared DQN with the best-performing methods from the reinforcement learning literature on the 49 games for which results were available. In addition to the learned agents, we also report scores for a professional human games tester playing under controlled conditions and for a policy that selects actions uniformly at random (Extended Data Table 2 and Fig. 3, denoted by 100% (human) and 0% (random) on the y axis; see Methods). Our DQN method outperforms the best existing reinforcement learning methods on 43 of the games, without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches (for example, refs 12, 15). Furthermore, our DQN agent performed at a level comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score in more than half of the games (29 games; see Fig. 3, the Supplementary Discussion and Extended Data Table 2). In additional simulations (see the Supplementary Discussion and Extended Data Tables 3 and 4), we demonstrate the importance of the individual core components of the DQN agent (the replay memory, the separate target Q-network and the deep convolutional network architecture) by disabling them and showing the detrimental effects on performance.

Figure 2 | Training curves tracking the agent's average score and average predicted action-value. a, Each point is the average score achieved per episode after the agent is run with an ε-greedy policy (ε = 0.05) for 520k frames on Space Invaders. b, Average score achieved per episode for Seaquest. c, Average predicted action-value on a held-out set of states on Space Invaders. Each point on the curve is the average of the action-value Q computed over the held-out set of states. Note that Q-values are scaled due to clipping of rewards (see Methods). d, Average predicted action-value on Seaquest. See Supplementary Discussion for details.

Figure 2 | Training curves tracking the agent's average score and average predicted action-value.
a, Each point is the average score achieved per episode after the agent is run with an ε-greedy policy (ε = 0.05) for 520k frames on Space Invaders. b, Average score achieved per episode on Seaquest. c, Average predicted action-value on a held-out set of states on Space Invaders; each point on the curve is the average of the action-value Q computed over the held-out set of states. Note that the Q-values are scaled because the rewards are clipped (see Methods). d, Average predicted action-value on Seaquest. See the Supplementary Discussion for details.

Figure 3 | Comparison of the DQN agent with the best reinforcement learning methods in the literature. The performance of DQN is normalized with respect to a professional human games tester (that is, 100% level) and random play (that is, 0% level). Note that the normalized performance of DQN, expressed as a percentage, is calculated as: 100 × (DQN score - random play score) / (human score - random play score). It can be seen that DQN outperforms competing methods (also see Extended Data Table 2) in almost all the games, and performs at a level that is broadly comparable with or superior to a professional human games tester (that is, operationalized as a level of 75% or above) in the majority of games. Audio output was disabled for both human players and agents. Error bars indicate s.d. across the 30 evaluation episodes, starting with different initial conditions.

Figure 3 | Comparison of the DQN agent with the best reinforcement learning methods in the literature.
The performance of DQN is normalized with respect to a professional human games tester (that is, the 100% level) and random play (that is, the 0% level). Note that the normalized performance of DQN, expressed as a percentage, is calculated as 100 × (DQN score - random play score) / (human score - random play score). It can be seen that DQN outperforms competing methods (see also Extended Data Table 2) in almost all the games, and performs at a level broadly comparable with or superior to a professional human games tester (operationalized as a level of 75% or above) in the majority of games. Audio output was disabled for both the human players and the agents. Error bars indicate the standard deviation across the 30 evaluation episodes, which start from different initial conditions.
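As a trivial restatement of that normalization (the helper name is my own, for illustration only):

```python
def normalized_score(dqn_score, random_score, human_score):
    """Express a DQN score as a percentage of the human-vs-random gap."""
    return 100.0 * (dqn_score - random_score) / (human_score - random_score)
```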

We next examined the representations learned by DQN that underpinned the successful performance of the agent in the context of the game Space Invaders (see Supplementary Video 1 for a demonstration of the performance of DQN), by using a technique developed for the visualization of high-dimensional data called 't-SNE' (ref. 25) (Fig. 4). As expected, the t-SNE algorithm tends to map the DQN representation of perceptually similar states to nearby points. Interestingly, we also found instances in which the t-SNE algorithm generated similar embeddings for DQN representations of states that are close in terms of expected reward but perceptually dissimilar (Fig. 4, bottom right, top left and middle), consistent with the notion that the network is able to learn representations that support adaptive behaviour from high-dimensional sensory inputs. Furthermore, we also show that the representations learned by DQN are able to generalize to data generated from policies other than its own: in simulations we presented as input to the network game states experienced during human and agent play, recorded the representations of the last hidden layer, and visualized the embeddings generated by the t-SNE algorithm (Extended Data Fig. 1 and Supplementary Discussion). Extended Data Fig. 2 provides an additional illustration of how the representations learned by DQN allow it to accurately predict state and action values.

We next examined the representations that DQN learned and that underpinned its successful performance in the game Space Invaders (see Supplementary Video 1 for a demonstration of DQN's performance), using a technique for visualizing high-dimensional data called 't-SNE' (Fig. 4). As expected, the t-SNE algorithm tends to map the DQN representations of perceptually similar states to nearby points. Interestingly, we also found instances in which the t-SNE algorithm generated similar embeddings for DQN representations of states that are close in terms of expected reward but perceptually dissimilar (Fig. 4, bottom right, top left and middle), consistent with the notion that the network is able to learn representations that support adaptive behaviour from high-dimensional sensory inputs. Furthermore, we show that the representations learned by DQN generalize to data generated from policies other than its own: in simulations, we presented as input to the network game states experienced during human and agent play, recorded the representations of the last hidden layer, and visualized the embeddings generated by the t-SNE algorithm (Extended Data Fig. 1 and Supplementary Discussion). Extended Data Fig. 2 provides an additional illustration of how the representations learned by DQN allow it to accurately predict state and action values.
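For readers who want to reproduce this kind of visualization, here is a small sketch using scikit-learn's t-SNE on last-hidden-layer activations; the random features in the example merely stand in for real DQN activations, and the function name is my own.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_hidden_features(features, perplexity=30, seed=0):
    """Project last-hidden-layer activations (shape [N, 512]) down to 2-D with t-SNE."""
    features = np.asarray(features, dtype=np.float32)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(features)  # shape (N, 2), one point per game state

if __name__ == "__main__":
    # Random features standing in for real 512-unit DQN activations.
    fake_features = np.random.rand(500, 512).astype(np.float32)
    points = embed_hidden_features(fake_features)
    print(points.shape)  # (500, 2); colour each point by the predicted state value V(s)
```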

It is worth noting that the games in which DQN excels are extremely varied in their nature, from side-scrolling shooters (River Raid) to boxing games (Boxing) and three-dimensional car-racing games (Enduro). Indeed, in certain games DQN is able to discover a relatively long-term strategy (for example, Breakout: the agent learns the optimal strategy, which is to first dig a tunnel around the side of the wall allowing the ball to be sent around the back to destroy a large number of blocks; see Supplementary Video 2 for illustration of the development of DQN's performance over the course of training). Nevertheless, games demanding more temporally extended planning strategies still constitute a major challenge for all existing agents including DQN (for example, Montezuma's Revenge).

It is worth noting that the games at which DQN excels are extremely varied in nature, from side-scrolling shooters (River Raid) to boxing games (Boxing) and three-dimensional car-racing games (Enduro). Indeed, in certain games DQN is able to discover a relatively long-term strategy (for example, in Breakout the agent learns the optimal strategy of first digging a tunnel around the side of the wall, allowing the ball to be sent around the back to destroy a large number of blocks; see Supplementary Video 2 for an illustration of how DQN's performance develops over the course of training). Nevertheless, games demanding more temporally extended planning strategies still constitute a major challenge for all existing agents, including DQN (for example, Montezuma's Revenge).

Figure 4 | Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced while playing Space Invaders. The plot was generated by letting the DQN agent play for 2 h of real game time and running the t-SNE algorithm on the last hidden layer representations assigned by DQN to each experienced game state. The points are coloured according to the state values (V, maximum expected reward of a state) predicted by DQN for the corresponding game states (ranging from dark red (highest V) to dark blue (lowest V)). The screenshots corresponding to a selected number of points are shown. The DQN agent predicts high state values for both full (top right screenshots) and nearly complete screens (bottom left screenshots) because it has learned that completing a screen leads to a new screen full of enemy ships. Partially completed screens (bottom screenshots) are assigned lower state values because less immediate reward is available. The screens shown on the bottom right and top left and middle are less perceptually similar than the other examples but are still mapped to nearby representations and similar values because the orange bunkers do not carry great significance near the end of a level. With permission from Square Enix Limited.

Figure 4 | Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced while playing Space Invaders.
The plot was generated by letting the DQN agent play for 2 h of real game time and running the t-SNE algorithm on the last-hidden-layer representations assigned by DQN to each experienced game state. The points are coloured according to the state values (V, the maximum expected reward of a state) predicted by DQN for the corresponding game states, ranging from dark red (highest V) to dark blue (lowest V). The screenshots corresponding to a selected number of points are shown. The DQN agent predicts high state values for both full (top right screenshots) and nearly completed screens (bottom left screenshots), because it has learned that completing a screen leads to a new screen full of enemy ships. Partially completed screens (bottom screenshots) are assigned lower state values because less immediate reward is available. The screens shown on the bottom right, top left and middle are less perceptually similar than the other examples but are still mapped to nearby representations and similar values, because the orange bunkers do not carry great significance near the end of a level. With permission from Square Enix Limited.

In this work, we demonstrate that a single architecture can successfully learn control policies in a range of different environments with only very minimal prior knowledge, receiving only the pixels and the game score as inputs, and using the same algorithm, network architecture and hyperparameters on each game, privy only to the inputs a human player would have. In contrast to previous work, our approach incorporates 'end-to-end' reinforcement learning that uses reward to continuously shape representations within the convolutional network towards salient features of the environment that facilitate value estimation. This principle draws on neurobiological evidence that reward signals during perceptual learning may influence the characteristics of representations within primate visual cortex. Notably, the successful integration of reinforcement learning with deep network architectures was critically dependent on our incorporation of a replay algorithm involving the storage and representation of recently experienced transitions. Convergent evidence suggests that the hippocampus may support the physical realization of such a process in the mammalian brain, with the time-compressed reactivation of recently experienced trajectories during offline periods (for example, waking rest) providing a putative mechanism by which value functions may be efficiently updated through interactions with the basal ganglia. In the future, it will be important to explore the potential use of biasing the content of experience replay towards salient events, a phenomenon that characterizes empirically observed hippocampal replay, and relates to the notion of 'prioritized sweeping' in reinforcement learning. Taken together, our work illustrates the power of harnessing state-of-the-art machine learning techniques with biologically inspired mechanisms to create agents that are capable of learning to master a diverse array of challenging tasks.

In this work, we demonstrate that a single architecture can successfully learn control policies in a range of different environments with only very minimal prior knowledge, receiving only the pixels and the game score as inputs, using the same algorithm, network architecture and hyperparameters on each game, and privy only to the inputs a human player would have. In contrast to previous work, our approach incorporates 'end-to-end' reinforcement learning, which uses reward to continuously shape the representations within the convolutional network towards the salient features of the environment that facilitate value estimation. This principle draws on neurobiological evidence that reward signals during perceptual learning may influence the characteristics of representations within the primate visual cortex. Notably, the successful integration of reinforcement learning with deep network architectures depended critically on our incorporation of a replay algorithm involving the storage and representation of recently experienced transitions. Convergent evidence suggests that the hippocampus may support the physical realization of such a process in the mammalian brain, with the time-compressed reactivation of recently experienced trajectories during offline periods (for example, waking rest) providing a putative mechanism by which value functions may be efficiently updated through interactions with the basal ganglia. In the future, it will be important to explore the potential of biasing the content of experience replay towards salient events, a phenomenon that characterizes empirically observed hippocampal replay and relates to the notion of 'prioritized sweeping' in reinforcement learning. Taken together, our work illustrates the power of harnessing state-of-the-art machine learning techniques with biologically inspired mechanisms to create agents capable of learning to master a diverse array of challenging tasks.

Preprocessing. Working directly with raw Atari 2600 frames, which are 210 × 160 pixel images with a 128-colour palette, can be demanding in terms of computation and memory requirements. We apply a basic preprocessing step aimed at reducing the input dimensionality and dealing with some artefacts of the Atari 2600 emulator. First, to encode a single frame we take the maximum value for each pixel colour value over the frame being encoded and the previous frame. This was necessary to remove flickering that is present in games where some objects appear only in even frames while other objects appear only in odd frames, an artefact caused by the limited number of sprites the Atari 2600 can display at once. Second, we then extract the Y channel, also known as luminance, from the RGB frame and rescale it to 84 × 84. The function 𝛟 from Algorithm 1 described below applies this preprocessing to the m most recent frames and stacks them to produce the input to the Q-function, in which m = 4, although the algorithm is robust to different values of m (for example, 3 or 5).

Preprocessing. Working directly with raw Atari 2600 frames, which are 210 × 160 pixel images with a 128-colour palette, can be demanding in terms of computation and memory. We apply a basic preprocessing step aimed at reducing the input dimensionality and dealing with some artefacts of the Atari 2600 emulator. First, to encode a single frame, we take the maximum value of each pixel colour over the frame being encoded and the previous frame. This is necessary to remove the flickering present in games where some objects appear only in even frames while others appear only in odd frames, an artefact caused by the limited number of sprites the Atari 2600 can display at once. Second, we extract the Y channel (luminance) from the RGB frame and rescale it to 84 × 84. The function 𝛟 from Algorithm 1, described below, applies this preprocessing to the m most recent frames and stacks them to produce the input to the Q-function, in which m = 4, although the algorithm is robust to different values of m (for example, 3 or 5).
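A minimal sketch of this preprocessing, assuming OpenCV (cv2) for the resize and the standard 0.299/0.587/0.114 luminance weights (the paper does not specify the exact resize or luminance implementation; function names are my own):

```python
from collections import deque

import cv2
import numpy as np

def preprocess_frame(frame, prev_frame):
    """Max over two consecutive RGB frames, convert to luminance, resize to 84x84."""
    # Pixel-wise max removes flicker from sprites drawn only on alternating frames.
    merged = np.maximum(frame, prev_frame)
    # Approximate the Y (luminance) channel with the usual RGB weights.
    luminance = merged @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return cv2.resize(luminance, (84, 84), interpolation=cv2.INTER_AREA)

def stack_frames(frame_history, m=4):
    """Stack the m most recent preprocessed frames into an 84x84xm network input."""
    return np.stack(list(frame_history)[-m:], axis=-1)

# Usage sketch: keep a deque of preprocessed frames and feed the stack to the Q-network.
history = deque(maxlen=4)
```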

Code availability. The source code can be accessed at https://sites.google.com/a/deepmind.com/dqn for non-commercial uses only.

Code availability. The source code can be accessed at https://sites.google.com/a/deepmind.com/dqn for non-commercial uses only.

Model architecture. There are several possible ways of parameterizing Q using a neural network. Because Q maps history–action pairs to scalar estimates of their Q-value, the history and the action have been used as inputs to the neural network by some previous approaches. The main drawback of this type of architecture is that a separate forward pass is required to compute the Q-value of each action, resulting in a cost that scales linearly with the number of actions. We instead use an architecture in which there is a separate output unit for each possible action, and only the state representation is an input to the neural network. The outputs correspond to the predicted Q-values of the individual actions for the input state. The main advantage of this type of architecture is the ability to compute Q-values for all possible actions in a given state with only a single forward pass through the network.

The exact architecture, shown schematically in Fig. 1, is as follows. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map 𝛟. The first hidden layer convolves 32 filters of 8 × 8 with stride 4 with the input image and applies a rectifier nonlinearity. The second hidden layer convolves 64 filters of 4 × 4 with stride 2, again followed by a rectifier nonlinearity. This is followed by a third convolutional layer that convolves 64 filters of 3 × 3 with stride 1 followed by a rectifier. The final hidden layer is fully connected and consists of 512 rectifier units. The output layer is a fully connected linear layer with a single output for each valid action. The number of valid actions varied between 4 and 18 on the games we considered.

Model architecture. There are several possible ways of parameterizing Q using a neural network. Because Q maps history-action pairs to scalar estimates of their Q-value, some previous approaches have used both the history and the action as inputs to the neural network. The main drawback of this type of architecture is that a separate forward pass is required to compute the Q-value of each action, resulting in a cost that scales linearly with the number of actions. We instead use an architecture in which there is a separate output unit for each possible action and only the state representation is an input to the neural network. The outputs correspond to the predicted Q-values of the individual actions for the input state. The main advantage of this type of architecture is the ability to compute the Q-values for all possible actions in a given state with only a single forward pass through the network.

The exact architecture, shown schematically in Fig. 1, is as follows. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map 𝛟. The first hidden layer convolves 32 filters of 8 × 8 with stride 4 over the input image and applies a rectifier nonlinearity. The second hidden layer convolves 64 filters of 4 × 4 with stride 2, again followed by a rectifier nonlinearity. This is followed by a third convolutional layer that convolves 64 filters of 3 × 3 with stride 1, followed by a rectifier. The final hidden layer is fully connected and consists of 512 rectifier units. The output layer is a fully connected linear layer with a single output for each valid action. The number of valid actions varied between 4 and 18 in the games we considered.
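Since this post promises TensorFlow code later, here is a minimal tf.keras (TF 2.x assumed) sketch of the architecture described above; anything beyond the layer sizes stated in the text (for example, the optimizer or initializers) is deliberately left unspecified.

```python
import tensorflow as tf

def build_q_network(n_actions, input_shape=(84, 84, 4)):
    """Q-network from Fig. 1: three conv layers, one 512-unit dense layer, linear outputs."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu",
                               input_shape=input_shape),
        tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(64, 3, strides=1, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(n_actions),  # one linear output (Q-value) per valid action
    ])

# Example: Breakout exposes 4 valid actions.
q_net = build_q_network(n_actions=4)
q_net.summary()
```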

We performed experiments on 49 Atari 2600 games where results were available for all other comparable methods. A different network was trained on each game: the same network architecture, learning algorithm and hyperparameter settings (see Extended Data Table 1) were used across all games, showing that our approach is robust enough to work on a variety of games while incorporating only minimal prior knowledge (see below). While we evaluated our agents on unmodified games, we made one change to the reward structure of the games during training only. As the scale of scores varies greatly from game to game, we clipped all positive rewards at 1 and all negative rewards at -1, leaving 0 rewards unchanged. Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games. At the same time, it could affect the performance of our agent since it cannot differentiate between rewards of different magnitude. For games where there is a life counter, the Atari 2600 emulator also sends the number of lives left in the game, which is then used to mark the end of an episode during training.

In these experiments, we used the RMSProp algorithm (see http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) with minibatches of size 32. The behaviour policy during training was ε-greedy with ε annealed linearly from 1.0 to 0.1 over the first million frames, and fixed at 0.1 thereafter. We trained for a total of 50 million frames (that is, around 38 days of game experience in total) and used a replay memory of the 1 million most recent frames.

Following previous approaches to playing Atari 2600 games, we also use a simple frame-skipping technique. More precisely, the agent sees and selects actions on every kth frame instead of every frame, and its last action is repeated on skipped frames. Because running the emulator forward for one step requires much less computation than having the agent select an action, this technique allows the agent to play roughly k times more games without significantly increasing the runtime. We use k = 4 for all games.

The values of all the hyperparameters and optimization parameters were selected by performing an informal search on the games Pong, Breakout, Seaquest, Space Invaders and Beam Rider. We did not perform a systematic grid search owing to the high computational cost. These parameters were then held fixed across all other games. The values and descriptions of all hyperparameters are provided in Extended Data Table 1.

Our experimental setup amounts to using the following minimal prior knowledge: that the input data consisted of visual images (motivating our use of a convolutional deep network), the game-specific score (with no modification), number of actions, although not their correspondences (for example, specification of the up ‘button’) and the life count.

We performed experiments on 49 Atari 2600 games for which results were available for all other comparable methods. A different network was trained on each game: the same network architecture, learning algorithm and hyperparameter settings (see Extended Data Table 1) were used across all games, showing that our approach is robust enough to work on a variety of games while incorporating only minimal prior knowledge (see below). Although we evaluated our agents on unmodified games, we made one change to the reward structure of the games during training only. Because the scale of scores varies greatly from game to game, we clipped all positive rewards at +1 and all negative rewards at -1, leaving rewards of 0 unchanged. Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games. At the same time, it could affect the performance of our agent, since it cannot differentiate between rewards of different magnitudes. For games with a life counter, the Atari 2600 emulator also sends the number of lives left in the game, which is then used to mark the end of an episode during training.
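Because Atari scores are integers, the clipping described above reduces to a sign function; a one-line sketch (function name is my own):

```python
import numpy as np

def clip_reward(reward):
    """Map positive rewards to +1, negative rewards to -1, and leave 0 unchanged."""
    return float(np.sign(reward))
```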

In these experiments, we used the RMSProp algorithm (see http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) with minibatches of size 32. The behaviour policy during training was ε-greedy, with ε annealed linearly from 1.0 to 0.1 over the first million frames and fixed at 0.1 thereafter. We trained for a total of 50 million frames (that is, around 38 days of game experience in total) and used a replay memory of the 1 million most recent frames.
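A sketch of the linear ε annealing schedule described above (function name is my own):

```python
def epsilon_at(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon from `start` to `end` over the first million frames."""
    if frame >= anneal_frames:
        return end
    return start + (end - start) * frame / anneal_frames
```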

Following previous approaches to playing Atari 2600 games, we also use a simple frame-skipping technique. More precisely, the agent sees and selects actions on every kth frame rather than on every frame, and its last action is repeated on the skipped frames. Because running the emulator forward for one step requires much less computation than having the agent select an action, this technique lets the agent play roughly k times more games without significantly increasing the runtime. We use k = 4 for all games.
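A minimal sketch of the frame-skipping loop, assuming a Gym-style `env.step(action)` interface (an assumption for illustration; the paper drives the Atari 2600 emulator directly):

```python
def skip_step(env, action, k=4):
    """Repeat `action` for k emulator frames and accumulate the reward."""
    total_reward, done = 0.0, False
    for _ in range(k):
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return observation, total_reward, done, info
```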

The values of all the hyperparameters and optimization parameters were selected by performing an informal search on the games Pong, Breakout, Seaquest, Space Invaders and Beam Rider. We did not perform a systematic grid search owing to the high computational cost. These parameters were then held fixed across all the other games. The values and descriptions of all the hyperparameters are provided in Extended Data Table 1.

Our experimental setup amounts to using the following minimal prior knowledge: that the input data consist of visual images (motivating our use of a convolutional deep network), the game-specific score (with no modification), the number of actions (although not their correspondences, for example, which one is the 'up' button) and the life count.

Evaluation procedure. The trained agents were evaluated by playing each game 30 times for up to 5 min each time with different initial random conditions ('no-op'; see Extended Data Table 1) and an ε-greedy policy with ε = 0.05. This procedure is adopted to minimize the possibility of overfitting during evaluation. The random agent served as a baseline comparison and chose a random action at 10 Hz, which is every sixth frame, repeating its last action on intervening frames. 10 Hz is about the fastest that a human player can select the 'fire' button, and setting the random agent to this frequency avoids spurious baseline scores in a handful of the games. We did also assess the performance of a random agent that selected an action at 60 Hz (that is, every frame). This had a minimal effect: changing the normalized DQN performance by more than 5% in only six games (Boxing, Breakout, Crazy Climber, Demon Attack, Krull and Robotank), and in all these games DQN outperformed the expert human by a considerable margin.

The professional human tester used the same emulator engine as the agents, and played under controlled conditions. The human tester was not allowed to pause, save or reload games. As in the original Atari 2600 environment, the emulator was run at 60 Hz and the audio output was disabled: as such, the sensory input was equated between human player and agents. The human performance is the average reward achieved from around 20 episodes of each game lasting a maximum of 5 min each, following around 2 h of practice playing each game.

Evaluation procedure. The trained agents were evaluated by playing each game 30 times, for up to 5 min each time, with different initial random conditions ('no-op'; see Extended Data Table 1) and an ε-greedy policy with ε = 0.05. This procedure was adopted to minimize the possibility of overfitting during evaluation. A random agent served as a baseline comparison; it chose a random action at 10 Hz, that is, on every sixth frame, repeating its last action on the intervening frames. 10 Hz is about the fastest a human player can press the 'fire' button, and setting the random agent to this frequency avoids spurious baseline scores in a handful of games. We also assessed the performance of a random agent that selected an action at 60 Hz (that is, on every frame). This had a minimal effect: it changed the normalized DQN performance by more than 5% in only six games (Boxing, Breakout, Crazy Climber, Demon Attack, Krull and Robotank), and in all of these games DQN outperformed the expert human by a considerable margin.
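A sketch of this evaluation protocol under the same Gym-style interface assumed earlier; the helper names and the exact range of initial 'no-op' actions are assumptions for illustration, not taken from the paper's code.

```python
import random

def evaluate(env, greedy_action, n_actions, episodes=30,
             epsilon=0.05, max_frames=5 * 60 * 60, max_noops=30):
    """Play `episodes` games of at most 5 minutes (at 60 Hz) with an epsilon-greedy policy."""
    scores = []
    for _ in range(episodes):
        observation = env.reset()
        # Start each episode with a random number of 'no-op' actions (action 0)
        # so that evaluation begins from different initial conditions.
        for _ in range(random.randint(1, max_noops)):
            observation, _, _, _ = env.step(0)
        total, done, frames = 0.0, False, 0
        while not done and frames < max_frames:
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = greedy_action(observation)
            observation, reward, done, _ = env.step(action)
            total += reward
            frames += 1
        scores.append(total)
    return scores
```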

The professional human tester used the same emulator engine as the agents and played under controlled conditions. The human tester was not allowed to pause, save or reload games. As in the original Atari 2600 environment, the emulator ran at 60 Hz and the audio output was disabled; as such, the sensory input was equated between the human player and the agents. Human performance is the average reward achieved over around 20 episodes of each game, each lasting a maximum of 5 min, following around 2 h of practice on each game.