The documentation [1] describes keep-rank-count as follows:

How many documents to keep the first phase top rank values for. Default value is 10000.

And it describes rerank-count as follows:

Specifies the number of hits to be re-ranked in the second phase. Default value is 100. This can also be set in the query. Note that this value is local to each node involved in a query.

Honestly, it is hard to tell from these what either setting actually does.

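In fact, the two should be looked at together. Vespa's ranking is split into first-phase and second-phase. The first phase also carries the job of cutting down the data volume, so that when Vespa holds a very large amount of data, the full computation does not have to run on every document. keep-rank-count and rerank-count both govern what the first phase hands over to the second phase (in the rank profile, keep-rank-count is set under first-phase and rerank-count under second-phase). Discussing their effects one at a time makes them easier to understand.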

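keep-rank-count affects how each Vespa search node (a.k.a. content node) collects hits in the first phase: following the rank profile's computation, a node retains at most keep-rank-count scores (together with their docids). The default of 10,000, for example, means every node holds the scores (and docids) of 10,000 documents. For anything beyond those 10,000, the score is dropped and only the docid is left. As a result, only 10,000 documents are available for the more complex processing in the second phase. This also shows up in Vespa's final response: hits beyond keep-rank-count display a score of -infinite, because the score was discarded during processing and the result no longer knows what it was.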
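To make the description above concrete, here is a toy model in Python of what a single node does with keep-rank-count. It is only a sketch of the behavior as I understand it, not Vespa's implementation: the function name `first_phase_keep` and the sample hits are made up for illustration, and `-math.inf` stands in for the -infinite that shows up in the response.

```python
import math

def first_phase_keep(hits, keep_rank_count):
    """Toy model of one node (hypothetical helper, not a Vespa API).

    hits: list of (docid, score) pairs from the first-phase expression.
    Returns the hits sorted by score, best first. Only the top
    keep_rank_count hits keep their score; the rest keep their docid
    but the score is replaced with -inf.
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    return [
        (docid, score if position < keep_rank_count else -math.inf)
        for position, (docid, score) in enumerate(ranked)
    ]

# Four made-up hits on one node, keeping only the two best scores.
hits = [(1, 0.9), (2, 0.1), (3, 0.5), (4, 0.7)]
result = first_phase_keep(hits, keep_rank_count=2)
for docid, score in result:
    print(docid, score)
```

Docids 1 and 4 keep their scores, while docids 3 and 2 are still in the list but only with -inf.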

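rerank-count decides how many documents remain when leaving the first phase and entering the second phase. Note that this number, too, is counted per search node. For example, with 10 search nodes and rerank-count set to 100, up to 1,000 documents in total are sent into the second phase.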
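The per-node arithmetic can be written down as a small Python sketch. The helper `total_reranked` is hypothetical, and it adds one assumption of my own that the text above does not state: a node that matched fewer documents than rerank-count can only contribute what it matched.

```python
def total_reranked(matches_per_node, rerank_count):
    """Hypothetical helper: total hits entering second-phase across nodes.

    matches_per_node: number of matching documents on each search node.
    Each node contributes at most rerank_count hits (assumption: and
    never more than it actually matched).
    """
    return sum(min(matches, rerank_count) for matches in matches_per_node)

# 10 search nodes, each with plenty of matches, rerank-count = 100.
print(total_reranked([5000] * 10, rerank_count=100))  # 10 * 100 = 1000

# If one node matched only 50 documents, the total is lower.
print(total_reranked([50, 5000], rerank_count=100))   # 50 + 100 = 150
```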

Putting keep-rank-count and rerank-count together: suppose my rank profile sets keep-rank-count=100 and rerank-count=500, and a query matches 1,000 documents. What is the result? The second phase sees 500 documents, but only 100 of those 500 carry a relevance score. The other 400 show a relevance of -infinite.
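The numbers in this example can be reproduced with a small single-node simulation in Python. Again, this is a sketch that follows the behavior described in this post, not Vespa's actual code: `second_phase_input` is a made-up name, the scores are synthetic, and `-math.inf` plays the role of -infinite.

```python
import math

def second_phase_input(scores, keep_rank_count, rerank_count):
    """Toy single-node model (hypothetical, not a Vespa API).

    Sorts first-phase scores best first, keeps the score only for the
    top keep_rank_count hits, then passes the top rerank_count hits on.
    """
    ranked = sorted(scores, reverse=True)
    kept = [
        score if position < keep_rank_count else -math.inf
        for position, score in enumerate(ranked)
    ]
    return kept[:rerank_count]

# 1,000 matching documents with synthetic first-phase scores.
scores = [i / 1000 for i in range(1000)]
candidates = second_phase_input(scores, keep_rank_count=100, rerank_count=500)

with_score = sum(1 for s in candidates if s != -math.inf)
without_score = len(candidates) - with_score
print(len(candidates), with_score, without_score)  # 500 100 400
```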

References

[1] Search Definition Reference

黑毛到白毛的攻城獅之路

Wednesday, September 18, 2019

Vespa's keep-rank-count and rerank-count
