黑毛到白毛的攻城獅之路: Vespa's New Feature: hash dictionary

Tuesday, January 25, 2022

Vespa's New Feature: hash dictionary

While browsing the Vespa blog [1], I noticed that Vespa added a hash dictionary feature in May 2021, so I am writing down the details of that feature here.

What is a hash dictionary?

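In Vespa's design, a field configured as an attribute can additionally be given the fast-search setting, which makes Vespa automatically build an index for that field to speed up searches. Originally, fast-search could only build this index with a b-tree data structure. Now we can choose a hash table instead, which can perform somewhat better than a b-tree in certain scenarios.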

Limitations of the hash dictionary

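In my testing so far, a hash dictionary has to be configured as cased, and cased must be set on both the dictionary and the match settings for the schema to pass validation. I don't think the documentation [2] points this out very clearly, though…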

field id type string {
    indexing: summary | attribute
    attribute: fast-search
    dictionary {
        hash
        cased
    }
    match: cased
}

Effect of the hash dictionary

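To compare the effect, we first need to look at what it is being compared against, namely the default btree dictionary. For an engineer, seeing the two keywords b-tree and hash is probably enough to guess the difference! In short, it is the difference between O(log n) and O(1) lookups XD. Beyond that, because of the hash dictionary limitation described above, configuring a hash dictionary in Vespa also raises a case-sensitivity issue that needs to be considered.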
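To make the O(log n) versus O(1) contrast concrete, here is a minimal Python sketch. It is not how Vespa implements its dictionaries: a sorted list with binary search stands in for the b-tree, a built-in dict stands in for the hash table, and the names `SortedDictionary`, `HashDictionary`, and the sample postings are made up for illustration.

```python
import bisect


class SortedDictionary:
    """Toy stand-in for a b-tree dictionary: ordered keys, O(log n) lookup."""

    def __init__(self, postings):
        # postings maps a term to the list of document ids that contain it.
        self._keys = sorted(postings)
        self._values = [postings[key] for key in self._keys]

    def lookup(self, term):
        # Binary search for the exact term.
        i = bisect.bisect_left(self._keys, term)
        if i < len(self._keys) and self._keys[i] == term:
            return self._values[i]
        return []

    def prefix(self, prefix):
        # Ordered keys allow a prefix scan: find the start, then walk forward.
        i = bisect.bisect_left(self._keys, prefix)
        hits = []
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            hits.append(self._keys[i])
            i += 1
        return hits


class HashDictionary:
    """Toy stand-in for a hash dictionary: O(1) average exact lookup only."""

    def __init__(self, postings):
        self._table = dict(postings)

    def lookup(self, term):
        return self._table.get(term, [])


# Hypothetical sample data, only for this sketch.
postings = {"B00GQ22Y6Y": [1], "B00GQ22Y6Z": [2], "A000000001": [3]}
btree_like = SortedDictionary(postings)
hash_like = HashDictionary(postings)

print(btree_like.lookup("B00GQ22Y6Y"), hash_like.lookup("B00GQ22Y6Y"))
print(btree_like.prefix("B00GQ22Y6"))
```

In this toy model both structures return the same result for an exact lookup, while only the ordered one can answer a prefix scan without visiting every key. That trade-off is a property of the sketch, shown here only to illustrate why the choice of data structure matters.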

Let's first look at the btree case. With the following configuration, the btree's default behavior is uncased, which means "bear" = "BEAR" = "Bear".

field id type string {
    indexing: summary | attribute
    attribute: fast-search
}
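One way to picture uncased matching is that both the stored term and the query term are normalized before they are compared. The sketch below is a toy model in Python, not Vespa's implementation: `matches` is a hypothetical helper, and lowercasing is an assumed normalization chosen for illustration.

```python
def matches(stored: str, query: str, cased: bool = False) -> bool:
    """Toy model of attribute matching.

    cased=False mimics an uncased mode by lowercasing both sides
    (an assumed normalization, not Vespa's actual code path).
    cased=True compares the two strings exactly.
    """
    if cased:
        return stored == query
    return stored.lower() == query.lower()


variants = ["bear", "BEAR", "Bear"]
uncased_hits = [v for v in variants if matches("bear", v)]
cased_hits = [v for v in variants if matches("bear", v, cased=True)]

print(uncased_hits)
print(cased_hits)
```

In this model every variant matches in uncased mode, while cased mode only accepts the exact spelling of the stored term.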

Testing this in the environment built from [3], there is a document with asin: "B00GQ22Y6Y" that looks like this:

{
    "pathId": "/document/v1/item/item/docid/B00GQ22Y6Y",
    "id": "id:item:item::B00GQ22Y6Y",
    "fields": {
        "title": "Trendy Style Hand-knit Warm Lining inside Winter Bucket Hat w. Cute Flower-Purple #H01",
        "asin": "B00GQ22Y6Y",
        ...(skipped)...
    }
}

At this point, both of the following YQL queries find this document. This is because the default setting is uncased, so the query matches regardless of letter case.

SELECT * FROM item WHERE asin contains "b00gq22y6y";
SELECT * FROM item WHERE asin contains "B00GQ22Y6Y";

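For repeating this comparison from a script, the two queries can be turned into request URLs for Vespa's HTTP query API, which takes the YQL string in the `yql` parameter of `/search/`. The sketch below only builds the URLs and sends nothing. The host and port in `ENDPOINT` and the helper name `build_query_url` are assumptions for illustration, not details taken from the test environment.

```python
from urllib.parse import urlencode

# Assumed local endpoint; adjust to wherever the Vespa container listens.
ENDPOINT = "http://localhost:8080/search/"


def build_query_url(asin: str) -> str:
    """Build a query URL for one asin lookup (hypothetical helper)."""
    yql = f'SELECT * FROM item WHERE asin contains "{asin}";'
    # urlencode percent-encodes the quotes, the asterisk, and the semicolon.
    return f"{ENDPOINT}?{urlencode({'yql': yql})}"


urls = [build_query_url(asin) for asin in ("b00gq22y6y", "B00GQ22Y6Y")]
for url in urls:
    print(url)
```

Each URL can then be fetched with any HTTP client, and the hit counts of the two responses compared.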
However, because a hash dictionary requires the cased setting, things behave differently after switching to the following hash dictionary configuration:

field id type string {
    indexing: summary | attribute
    attribute: fast-search
    dictionary {
        hash
        cased
    }
    match: cased
}

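With this configuration, asin contains "b00gq22y6y" actually finds the document, while asin contains "B00GQ22Y6Y" does not… This result really surprised me. I am not sure whether it is a bug or whether I am using the feature incorrectly.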

References

