
Random sample consensus

Random sample consensus (RANSAC) is an iterative method for eliminating the influence of outliers on a fitted model.

The algorithm rests on a basic assumption: the normal samples in a dataset (inliers) fit the given model well, while anomalous samples do not; for example, plugging an outlier into the model produces a large loss.

Algorithm

The algorithm has two main steps:

  1. Randomly draw a small subset of the full dataset, assume all of its points are inliers, and fit the model on it. The subset size is the minimum number of samples needed to fit the model.
  2. Test every remaining sample against this model. If a sample's loss is below some threshold, it is considered consistent with the sampled points, i.e. it is also an inlier; if its loss exceeds the threshold, it is considered an outlier.

The set of all inliers is called the consensus set. The two steps above are repeated until the consensus set contains enough samples or the maximum number of iterations is reached.

In more detail:

  1. Sample a subset of the original data and assume it consists of inliers.
  2. Fit the model on this subset.
  3. Test the remaining samples against the model. A point that fits well, judged by its loss being below some threshold, is added to the consensus set.
  4. If the consensus set contains enough samples, the model is considered good: it fits most of the inliers.
  5. Refit the model on the union of the consensus set and the sampled subset, which may improve it further; if the overall loss is smaller than in the previous iteration, update the best-fit model.
  6. Repeat steps 1-5 until the consensus set is large enough or the maximum number of iterations is reached, then return the best-fit model.

Pseudocode:

Given:
    data – a set of observed data points
    model – a model that can be fitted to data points
    n – the minimum number of data values required to fit the model
    k – the maximum number of iterations allowed in the algorithm
    t – a threshold value for determining when a data point fits a model
    d – the number of close data values required to assert that a model fits well to data

Return:
    bestfit – model parameters which best fit the data (or null if no good model is found)

iterations = 0
bestfit = null
besterr = something really large
while iterations < k {
    maybeinliers = n randomly selected values from data
    maybemodel = model parameters fitted to maybeinliers
    alsoinliers = empty set
    for every point in data not in maybeinliers {
        if point fits maybemodel with an error smaller than t
            add point to alsoinliers
    }
    if the number of elements in alsoinliers is > d {
        % this implies that we may have found a good model
        % now test how good it is
        bettermodel = model parameters fitted to all points in maybeinliers and alsoinliers
        thiserr = a measure of how well bettermodel fits these points
        if thiserr < besterr {
            bestfit = bettermodel
            besterr = thiserr
        }
    }
    increment iterations
}
return bestfit

Python code:

import copy
import random as rnd
import numpy as np


def loss(y_true, y_pred):
    # squared error between one observation and its prediction
    return (y_true - y_pred) ** 2


def ransac(data, model, n, k, t, d):
    '''
    data – a set of observed data points, given as (xs, ys)
    model – a model that can be fitted to data points
            (anything with sklearn-style fit/predict)
    n – the minimum number of data values required
        to fit the model
    k – the maximum number of iterations allowed
        in the algorithm
    t – a threshold value for determining
        when a data point fits a model
    d – the number of close data values required to
        assert that a model fits well to data
    Return:
    bestfit – model parameters which best fit the data
        (or None if no good model is found)
    '''
    train_data_x = [[x] for x in data[0]]  # one feature vector per sample
    train_data_y = data[1]
    iterations = 0
    bestfit = None
    besterr = float('inf')
    while iterations < k:
        # draw n distinct indices (sampling without replacement)
        indexes = rnd.sample(range(len(train_data_x)), n)
        maybeinliers_x = [train_data_x[i] for i in indexes]
        maybeinliers_y = [train_data_y[i] for i in indexes]
        # fit a copy so that a model kept from an earlier iteration
        # is not refitted in place
        maybemodel = copy.deepcopy(model).fit(maybeinliers_x, maybeinliers_y)
        alsoinliers = set()
        predict = maybemodel.predict(train_data_x)
        for i in range(len(train_data_x)):
            if i in indexes:
                continue  # skip the points used to fit maybemodel
            if loss(train_data_y[i], predict[i]) < t:
                alsoinliers.add((train_data_x[i][0], train_data_y[i]))
        if len(alsoinliers) > d:
            # this implies that we may have found a good model
            # now test how good it is
            inliers_x = [[point[0]] for point in alsoinliers]
            inliers_y = [point[1] for point in alsoinliers]
            inliers_x.extend(maybeinliers_x)
            inliers_y.extend(maybeinliers_y)
            bettermodel = copy.deepcopy(model).fit(inliers_x, inliers_y)
            predict = bettermodel.predict(inliers_x)
            thiserr = np.sqrt(np.average((np.array(inliers_y) - predict) ** 2))
            if thiserr < besterr:
                besterr = thiserr
                bestfit = bettermodel
        iterations += 1
    return bestfit

DEMO

A set of points that can be fitted by a straight line was generated, together with random noise, in a \(1:1\) ratio, as shown in the figure below:

Results of fitting directly with linear regression versus the RANSAC algorithm:
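
The demo can be reproduced along these lines (a minimal sketch, assuming the ransac function defined above and scikit-learn's LinearRegression as the model; the original data and parameter values are not given, so the numbers here are illustrative):

import numpy as np
import random as rnd
from sklearn.linear_model import LinearRegression

rnd.seed(0)
np.random.seed(0)

# 50 points near the line y = 2x + 1 (inliers) plus 50 scattered
# points (outliers), a 1:1 ratio as in the demo
m = 50
xs_in = np.random.uniform(0, 10, m)
ys_in = 2 * xs_in + 1 + np.random.normal(0, 0.3, m)
xs_out = np.random.uniform(0, 10, m)
ys_out = np.random.uniform(-10, 30, m)
xs = np.concatenate([xs_in, xs_out])
ys = np.concatenate([ys_in, ys_out])

# ordinary least squares is pulled toward the outliers
lr = LinearRegression().fit(xs.reshape(-1, 1), ys)

# RANSAC: 2 points per draw (minimum for a line), 100 iterations,
# squared-error threshold 1.0, at least 20 additional inliers required
best = ransac((xs, ys), LinearRegression(), n=2, k=100, t=1.0, d=20)

print('least squares:', lr.coef_[0], lr.intercept_)
print('RANSAC:', best.coef_[0], best.intercept_)

With these settings the RANSAC line should stay close to the true slope 2 and intercept 1, while plain least squares drifts toward the outliers.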

Parameters

The RANSAC algorithm has several hyperparameters that need tuning:

  • t, d: these two parameters must be tuned for the specific application and dataset.

  • k (maximum number of iterations): can be determined theoretically, as follows (a numeric check is sketched after this list).

    Define \(p\) as the probability that some iteration of RANSAC draws only inliers. When that happens, the iteration fits a good model, so \(p\) is also the probability that RANSAC produces a good result. Further define \(w\) as the probability of drawing an inlier in a single draw, i.e. \(w\) = number of inliers in data / number of points in data. If each iteration draws \(n\) samples, the probability that all of them are inliers is \(w^n\) (assuming sampling with replacement), so the probability that at least one is an outlier is \(1-w^n\). Over \(k\) iterations, the probability that every draw contains at least one outlier is \((1-w^n)^k\), which should equal \(1-p\): \[ 1-p=(1-w^n)^k \] Taking the logarithm of both sides gives \[ k = \frac{\log(1-p)}{\log(1-w^n)} \] In practice sampling is done without replacement, so the standard deviation of \(k\) should be added to the estimate above, where \[ SD(k) = \frac{\sqrt{1-w^n}}{w^n} \]
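
As a quick numeric check of the formula above (a small sketch; the values of w, n, and p are illustrative, not taken from the demo), an inlier ratio of 0.5 with 2 points per draw needs about 16 iterations to reach a 99% success probability, plus roughly 3.5 for the standard deviation:

import numpy as np

p = 0.99  # desired probability that at least one draw is all inliers
w = 0.5   # fraction of inliers in the data
n = 2     # points drawn per iteration (minimum to fit a line)

k = np.log(1 - p) / np.log(1 - w ** n)
sd = np.sqrt(1 - w ** n) / w ** n
print(k, sd)  # k = 16.0, SD(k) = 3.46 (approximately)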

Advantages and disadvantages

Advantages:

  • It estimates model parameters robustly, fitting the model with high accuracy even when a significant fraction of the data consists of outliers.

Disadvantages:

  • There is no upper bound on running time; the algorithm does not converge on its own and is stopped only by the iteration limit. A good model may be found on the first iteration or only on the last.

  • Its parameters (several thresholds) must be tuned by hand for each problem.

  • It can only estimate one model at a time.

  • It does not always find a suitable model, especially when outliers make up more than 50% of the samples, as shown in the figure below:

References

Random sample consensus. Wikipedia. https://en.wikipedia.org/wiki/Random_sample_consensus
