# NIUHE

### Recognizing Image Features and Patterns

HKUST CSIT5401 Recognition System lecture notes 1. Review notes for the recognition systems course.

## Go

### Paper Translation: Mastering the Game of Go without Human Knowledge

#### 1. Preface

A long-standing goal of artificial intelligence is an algorithm that learns superhuman proficiency from scratch in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance, or domain knowledge beyond the game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and the winner of its games. This neural network improves the strength of the tree search, resulting in higher-quality move selection and stronger self-play in the next iteration. Starting from scratch, our new program AlphaGo Zero achieved superhuman performance, winning 100-0 against the previously published, champion-defeating AlphaGo.

## Introduction

In the last lecture, we learned the policy directly from experience. In earlier lectures, we learned the value function directly from experience. In this lecture, we will learn a model directly from experience and use planning to construct a value function or policy, integrating learning and planning into a single architecture.

Model-Based RL

• Learn a model from experience
• Plan a value function (and/or policy) from the model
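
The two steps above can be sketched in a few lines. This is a minimal tabular illustration (the toy MDP, class names, and function names are my own, not from the notes): experience tuples update transition counts and mean rewards, and a value function is then planned from the learned model by value iteration.

```python
from collections import defaultdict

class TabularModel:
    """Learn a tabular model from experience: transition counts and mean rewards."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                 # (s, a) -> total reward
        self.visits = defaultdict(int)                       # (s, a) -> visit count

    def update(self, s, a, r, s_next):
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def transition_probs(self, s, a):
        n = self.visits[(s, a)]
        return {s2: c / n for s2, c in self.counts[(s, a)].items()}

    def expected_reward(self, s, a):
        return self.reward_sum[(s, a)] / self.visits[(s, a)]

def plan_values(model, states, actions, gamma=0.9, sweeps=50):
    """Plan a value function from the learned model by value iteration."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            q_values = []
            for a in actions:
                if model.visits[(s, a)] == 0:
                    continue  # never tried this action: the model knows nothing
                q = model.expected_reward(s, a) + gamma * sum(
                    p * V[s2] for s2, p in model.transition_probs(s, a).items())
                q_values.append(q)
            if q_values:
                V[s] = max(q_values)
    return V
```

For example, after observing that action `'a'` moves state 0 to state 1 with reward 1, and leaves state 1 in place with reward 0, `plan_values` recovers the corresponding optimal values without any further environment interaction.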

## Introduction

This lecture covers methods that optimise the policy directly. Instead of working with a value function, as we have so far, we gather experience and use it to update the policy in the direction that improves it.

In the last lecture, we approximated the value or action-value function using parameters $\theta$:
$$V_\theta(s)\approx V^\pi(s), \qquad Q_\theta(s, a)\approx Q^\pi(s, a)$$
A policy was then generated directly from the value function, e.g. using $\epsilon$-greedy.
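
The $\epsilon$-greedy construction can be sketched as follows (a toy example; the `Q` table and state names are illustrative, not from the notes): with probability $\epsilon$ pick a uniformly random action, otherwise act greedily with respect to the approximate action-value function.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Generate an action from the action-value function Q."""
    if random.random() < epsilon:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: Q[(state, a)])       # exploit

Q = {("s0", "left"): 0.2, ("s0", "right"): 0.8}
action = epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.0)
# With epsilon = 0 the choice is purely greedy: "right"
```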

In this lecture we will directly parametrise the policy:
$$\pi_\theta(s, a)=\mathbb{P}[a\mid s, \theta]$$
We will focus again on $\color{red}{\mbox{model-free}}$ reinforcement learning.
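
One common way to parametrise $\pi_\theta(s, a)$ is a softmax (Gibbs) policy over linear features. The sketch below is a minimal illustration under that assumption (the feature map `phi` and the one-hot toy features are my own, not from the notes): action preferences are $\theta \cdot \phi(s, a)$, and the policy is proportional to their exponentials.

```python
import math

def softmax_policy(theta, phi, state, actions):
    """pi_theta(s, a) proportional to exp(theta . phi(s, a))."""
    prefs = [sum(t * f for t, f in zip(theta, phi(state, a))) for a in actions]
    m = max(prefs)                            # subtract max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return {a: e / z for a, e in zip(actions, exps)}

# Toy one-hot features: phi(s, a) simply indicates the action taken.
phi = lambda s, a: [1.0, 0.0] if a == "left" else [0.0, 1.0]
probs = softmax_policy([0.0, 0.0], phi, "s0", ["left", "right"])
# Equal preferences give a uniform policy: {"left": 0.5, "right": 0.5}
```

Unlike $\epsilon$-greedy, this policy is differentiable in $\theta$, which is what allows it to be optimised directly.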

## Introduction

This lecture introduces how to scale our algorithms up to large, practical RL problems using value function approximation.

Reinforcement learning can be used to solve large problems, e.g.

• Backgammon: $10^{20}$ states
• Computer Go: $10^{170}$ states
• Helicopter: continuous state space
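
For state spaces this large, a table with one entry per state is infeasible, so the value function is approximated, e.g. linearly as $V_\theta(s) = \theta \cdot \phi(s)$. The sketch below illustrates one standard choice, a semi-gradient TD(0) update on linear features (the function names and toy numbers are illustrative, not from the notes):

```python
def v_hat(theta, features):
    """Linear value approximation: V_theta(s) = theta . phi(s)."""
    return sum(t * f for t, f in zip(theta, features))

def td0_update(theta, phi_s, r, phi_next, alpha=0.1, gamma=0.9):
    """Semi-gradient TD(0): theta <- theta + alpha * delta * phi(s),
    where delta = r + gamma * V(s') - V(s) is the TD error."""
    td_error = r + gamma * v_hat(theta, phi_next) - v_hat(theta, phi_s)
    return [t + alpha * td_error * f for t, f in zip(theta, phi_s)]

theta = [0.0, 0.0]
theta = td0_update(theta, [1.0, 0.0], r=1.0, phi_next=[0.0, 1.0])
# theta moves toward the observed return: [0.1, 0.0]
```

The same parameter vector generalises across all states sharing features, which is what makes $10^{170}$-state problems like Go approachable at all.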
