From b05487b6cacc17578541f8b5096bd2424ff3b0dd Mon Sep 17 00:00:00 2001
From: nadare
Date: Wed, 19 Apr 2023 14:48:07 +0000
Subject: [PATCH] add embedding tips

---
 docs/faiss_tips_en.md | 45 +++++++++++++++++++++++++++++++++++++++++--
 docs/faiss_tips_ja.md | 43 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 85 insertions(+), 3 deletions(-)

diff --git a/docs/faiss_tips_en.md b/docs/faiss_tips_en.md
index 4f0f513..e91de9e 100644
--- a/docs/faiss_tips_en.md
+++ b/docs/faiss_tips_en.md
@@ -17,7 +17,7 @@ The number of IVF divisions (n_ivf) is N//39, and n_probe uses int(np.power(n_iv
 
 In this article, I will first explain the meaning of these parameters, and then write advice for developers to create a better index.
 
-# 手法の解説
+# Explanation of the method
 ## index factory
 An index factory is a unique faiss notation that expresses a pipeline that connects multiple approximate neighborhood search methods as a string. This allows you to try various approximate neighborhood search methods simply by changing the index factory string.
 
@@ -102,4 +102,45 @@ https://github.com/facebookresearch/faiss/wiki/Fast-accumulation-of-PQ-and-AQ-co
 
 ## RFlat
 RFlat is an instruction to recalculate the rough distance calculated by FastScan with the exact distance specified by the third argument of index factory.
-When getting k neighbors, k*k_factor points are recalculated.
\ No newline at end of file
+When getting k neighbors, k*k_factor points are recalculated.
+
+# Techniques for embedding
+## alpha query expansion
+Query expansion is a technique used in search systems: in full-text search, for example, a few words are added to the entered query sentence to improve search accuracy. Several variants have also been proposed for vector search; among them, α-query expansion is known as a highly effective method that requires no additional training. It is described in papers such as [Attention-Based Query Expansion Learning](https://arxiv.org/abs/2007.08019) and was also used in the [2nd place solution of the kaggle shopee competition](https://www.kaggle.com/code/lyakaap/2nd-place-solution/notebook).
+
+α-query expansion adds each vector's neighboring vectors to it, weighted by the similarity raised to the power α. A code example follows; it replaces big_npy with its α-query-expanded version.
+
+```python
+import faiss
+import numpy as np
+
+alpha = 3.
+num_expand = 64  # number of neighbors to mix in (example value)
+batch_size = 65536  # rows processed per iteration (example value)
+
+index = faiss.index_factory(256, "IVF512,PQ128x4fs,RFlat")
+# L2-normalize so that inner products become cosine similarities
+original_norm = np.maximum(np.linalg.norm(big_npy, ord=2, axis=1, keepdims=True), 1e-9)
+big_npy /= original_norm
+index.train(big_npy)
+index.add(big_npy)
+dist, neighbor = index.search(big_npy, num_expand)
+
+expand_arrays = []
+ixs = np.arange(big_npy.shape[0])
+for i in range(-(-big_npy.shape[0] // batch_size)):
+    ix = ixs[i * batch_size:(i + 1) * batch_size]
+    # weight each neighbor by its similarity to the source vector, raised to alpha
+    weight = np.power(np.einsum("nd,nmd->nm", big_npy[ix], big_npy[neighbor[ix]]), alpha)
+    expand_arrays.append(np.sum(big_npy[neighbor[ix]] * np.expand_dims(weight, axis=2), axis=1))
+big_npy = np.concatenate(expand_arrays, axis=0)
+
+# normalize the expanded vectors again
+big_npy = big_npy / np.maximum(np.linalg.norm(big_npy, ord=2, axis=1, keepdims=True), 1e-9)
+```
+
+This technique can be applied both to the query used for searching and to the DB being searched.
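+
+As a minimal sketch of the query-side variant: expand_query below is a hypothetical helper (not part of the original code) and assumes the normalized index and big_npy built above. Expanded queries are then passed to index.search as usual.
+
+```python
+def expand_query(query, index, db_npy, alpha=3., num_expand=64):
+    # hypothetical helper (not in the original tips): expand a batch of
+    # query vectors with their DB neighbors, weighted by similarity ** alpha
+    query = query / np.maximum(np.linalg.norm(query, ord=2, axis=1, keepdims=True), 1e-9)
+    _, neighbor = index.search(query, num_expand)
+    weight = np.power(np.einsum("nd,nmd->nm", query, db_npy[neighbor]), alpha)
+    expanded = np.sum(db_npy[neighbor] * np.expand_dims(weight, axis=2), axis=1)
+    # re-normalize the expanded queries
+    return expanded / np.maximum(np.linalg.norm(expanded, ord=2, axis=1, keepdims=True), 1e-9)
+```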
+
+## Compress embedding with MiniBatch KMeans
+If total_fea.npy is too large, it is possible to reduce the number of vectors using KMeans.
+The following code compresses the embedding. For n_clusters, specify the number of vectors you want to keep; for batch_size, specify 256 * the number of CPU cores to fully benefit from CPU parallelization.
+
+```python
+import multiprocessing
+from sklearn.cluster import MiniBatchKMeans
+
+kmeans = MiniBatchKMeans(n_clusters=10000, batch_size=256 * multiprocessing.cpu_count(), init="random")
+kmeans.fit(big_npy)
+# the cluster centers become the compressed embedding
+sample_npy = kmeans.cluster_centers_
+```
\ No newline at end of file
diff --git a/docs/faiss_tips_ja.md b/docs/faiss_tips_ja.md
index 0b63610..e838494 100644
--- a/docs/faiss_tips_ja.md
+++ b/docs/faiss_tips_ja.md
@@ -102,4 +102,45 @@ https://github.com/facebookresearch/faiss/wiki/Fast-accumulation-of-PQ-and-AQ-co
 
 ## RFlat
 RFlat is an instruction to recalculate the rough distance calculated by FastScan with the exact distance specified by the third argument of index factory.
-When getting k neighbors, k*k_factor points are recalculated.
\ No newline at end of file
+When getting k neighbors, k*k_factor points are recalculated.
+
+# Techniques for embedding
+## alpha query expansion
+Query expansion is a technique used in search systems: in full-text search, for example, a few words are added to the entered query sentence to improve search accuracy. Several variants have also been proposed for vector search; among them, α-query expansion is known as a highly effective method that requires no additional training. It is described in papers such as [Attention-Based Query Expansion Learning](https://arxiv.org/abs/2007.08019) and was also used in the [2nd place solution of the kaggle shopee competition](https://www.kaggle.com/code/lyakaap/2nd-place-solution/notebook).
+
+α-query expansion adds each vector's neighboring vectors to it, weighted by the similarity raised to the power α. A code example follows; it replaces big_npy with its α-query-expanded version.
+
+```python
+import faiss
+import numpy as np
+
+alpha = 3.
+num_expand = 64  # number of neighbors to mix in (example value)
+batch_size = 65536  # rows processed per iteration (example value)
+
+index = faiss.index_factory(256, "IVF512,PQ128x4fs,RFlat")
+# L2-normalize so that inner products become cosine similarities
+original_norm = np.maximum(np.linalg.norm(big_npy, ord=2, axis=1, keepdims=True), 1e-9)
+big_npy /= original_norm
+index.train(big_npy)
+index.add(big_npy)
+dist, neighbor = index.search(big_npy, num_expand)
+
+expand_arrays = []
+ixs = np.arange(big_npy.shape[0])
+for i in range(-(-big_npy.shape[0] // batch_size)):
+    ix = ixs[i * batch_size:(i + 1) * batch_size]
+    # weight each neighbor by its similarity to the source vector, raised to alpha
+    weight = np.power(np.einsum("nd,nmd->nm", big_npy[ix], big_npy[neighbor[ix]]), alpha)
+    expand_arrays.append(np.sum(big_npy[neighbor[ix]] * np.expand_dims(weight, axis=2), axis=1))
+big_npy = np.concatenate(expand_arrays, axis=0)
+
+# normalize the expanded vectors again
+big_npy = big_npy / np.maximum(np.linalg.norm(big_npy, ord=2, axis=1, keepdims=True), 1e-9)
+```
+
+This technique can be applied both to the query used for searching and to the DB being searched.
+
+## Compress embedding with MiniBatch KMeans
+If total_fea.npy is too large, it is possible to reduce the number of vectors using KMeans.
+The following code compresses the embedding. For n_clusters, specify the number of vectors you want to keep; for batch_size, specify 256 * the number of CPU cores to fully benefit from CPU parallelization.
+
+```python
+import multiprocessing
+from sklearn.cluster import MiniBatchKMeans
+
+kmeans = MiniBatchKMeans(n_clusters=10000, batch_size=256 * multiprocessing.cpu_count(), init="random")
+kmeans.fit(big_npy)
+# the cluster centers become the compressed embedding
+sample_npy = kmeans.cluster_centers_
+```
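+
+As a follow-up sketch (an assumption, not part of the original tips), the compressed sample_npy can simply take the place of big_npy when building the index, cast to float32 as faiss expects:
+
+```python
+import faiss
+import numpy as np
+
+# build the index on the compressed vectors instead of the full embedding
+sample_npy = sample_npy.astype(np.float32)
+index = faiss.index_factory(256, "IVF512,PQ128x4fs,RFlat")
+index.train(sample_npy)
+index.add(sample_npy)
+```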