# NIUHE

## 应用介绍 Applications

### 相似文档 Similar Documents

1. Shingling : 把文档，邮件等等转换成集合。
2. Minhashing : 把大的集合转换成短的标志(signature)并且保持着相似性。
3. Locality-sensitive hashing : 专注于可能相似的标志对儿(pairs of signature)。

## Shingles

k-shinglek-gram 是指文档中连续出现的 k 个字符构成的序列。

### 压缩

Two documents could (rarely) appear to have shingles in common, when in fact only the hash values were shared.

## Minhashing

### Jaccard Similarity

$Sim(C_1, C_2) =\frac{ |C_1 \cap C_2|} {|C_1 \cup C_2|}$

### From Sets to Boolean Matrices

• 矩阵的行表示集合中的各个元素，比如每个k-shingles

• 矩阵的列表示各个集合，比如每个文档的k-shingles集合

• 如果行 $e$ 是列 $S$ 的一个成员，那么在$e$$S$列的值为$1$，否则为$0$

• 两列之间的相似度就是Jaccard 相似度。

• 通常情况下，这会是一个稀疏矩阵。

### 性质

Why?

$h(C_1) = h(C_2)$只有在a类型的行中出现，所以$h(C_1) = h(C_2)$这个事件就相当于a类型的行出现，那么$h(C_1) = h(C_2)$的概率就是a类型的行出现的概率，即$\frac{a}{a+b+c}$

• The similarity of signatures is the fraction of the minhash functions in which they agree.
• Thinking of signatures as columns of integers, the similarity of signatures is the fraction of rows in which they agree.
• Thus, the expected similarity of two signatures equals the Jaccard similarity of the columns or sets that the signatures represent.
• And the longer the signatures, the smaller will be the expected error.

Minhash的例子 输入矩阵和上面那个例子相同：

### LSH

• 选一个相似度标准 $t$，并且 $t < 1$，如果两个文档的相似度大于 $t$，则认为这两个文档相似。
• 如果列$c$和列$d$被视为候选文档对，那么他们一定要满足 $M(i, c) = M(i, d) >= t$ ，其中$M$是标记矩阵。

#### LSH for Minhashing Signatures

Partion Into Bands

Example - Bands

### LSH Summary

Tune to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures.

Check that candidate pairs really do have similar signatures.

Optional: In another pass through data, check that the remaining candidate pairs really represent similar sets . ## Applications of LSH

### 实体解析 Entity Resolution

Typically, we want to merge records if their values in corresponding fields are similar.

Matching Customer Records

1. 把经过LSH筛选的所有候选记录的分数算出来，分高的就认为是同一个人。

### 指纹匹配 Fingerprint Matching

LSH for Fingerprints

• 指纹 = 方格的集合
• 不需要minhash，因为方格数不是很多，而且也不是稀疏矩阵
• 用一个位向量(bit-vector)表示一个指纹：如果那个方格有minutiae，则这个位的值是1。这种方法相对于直接存整数节省了很多空间。
• 随机从位向量中取3个集合，每个集合有3个方格（3位）。
• 对于每一个集合，只有三位全部为1才被认为是Candidate Pairs 。
• Funny sort of ‘bucketization.”

### Finding Duplicate News Articles

Problem : the same article, say from the Associated Press, appears on the Web site of many newspapers, but looks quite different.

Special Shingling Technique

Why it Works

• By requiring each shingle to have a stop word, they biased the mapping from documents to shingles so it picked more shingles from the article than from the ads.
• Pages with the same article, but different ads, have higher Jaccard similarity than those with the same ads, different articles.

Enter LSH

• Their first attempt at minhashing was very inefficient.
• They were unaware of the importance of doing the minhashing row by row.
• Since their data was column by column, they needed to sort once before minhashing.