黑毛到白毛的攻城獅之路

Software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification.
- Bertrand Meyer

Junior programmers create simple solutions to simple problems. Senior programmers create complex solutions to complex problems. Great programmers find simple solutions to complex problems.
- Charles Connell

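Note 1: In posts from before 2015, most of this blog's sample code is indented with full-width spaces. If you want to use it, replace the full-width spaces with half-width spaces using your text editor's find-and-replace.
Note 2: For the licensing of this blog's content, see the license notice at the bottom of the blog.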

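Sunday, October 23, 2022

Markdown Here Style Notes

A record of the settings I currently use for the Markdown Here plugin.

Syntax highlighting theme: Monokai Sublime

CSS:
On top of the default template, this mainly overrides the blockquote rules once more, so that the blockquote style the blog already uses stays unchanged.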

/*
 * NOTE:
 * - The use of browser-specific styles (-moz-, -webkit-) should be avoided.
 *   If used, they may not render correctly for people reading the email in
 *   a different browser than the one from which the email was sent.
 * - The use of state-dependent styles (like a:hover) don't work because they
 *   don't match at the time the styles are made explicit. (In email, styles
 *   must be explicitly applied to all elements -- stylesheets get stripped.)
 */

/* This is the overall wrapper, it should be treated as the `body` section. */
.markdown-here-wrapper {
}

/* To add site specific rules, you can use the `data-md-url` attribute that we
   add to the wrapper element. Note that rules like this are used depending
   on the URL you're *sending* from, not the URL where the recipient views it.
*/
/* .markdown-here-wrapper[data-md-url*="mail.yahoo."] ul { color: red; } */

pre, code {
  font-size: 0.85em;
  font-family: Consolas, Inconsolata, Courier, monospace;
}

code {
  margin: 0 0.15em;
  padding: 0 0.3em;
  white-space: pre-wrap;
  border: 1px solid #EAEAEA;
  background-color: #F8F8F8;
  border-radius: 3px;
  display: inline; /* added to fix Yahoo block display of inline code */
}

pre {
  font-size: 1em;
  line-height: 1.2em;
}

pre code {
  white-space: pre;
  overflow: auto; /* fixes issue #70: Firefox/Thunderbird: Code blocks with horizontal scroll would have bad background colour */
  border-radius: 3px;
  border: 1px solid #CCC;
  padding: 0.5em 0.7em;
  display: block !important; /* added to counteract the Yahoo-specific `code` rule; without this, code blocks in Blogger are broken */
}

/* In edit mode, Wordpress uses a `* { font: ...;} rule+style that makes highlighted
code look non-monospace. This rule will override it. */
.markdown-here-wrapper[data-md-url*="wordpress."] code span {
  font: inherit;
}

/* Wordpress adds a grey background to `pre` elements that doesn't go well with
our syntax highlighting. */
.markdown-here-wrapper[data-md-url*="wordpress."] pre {
  background-color: transparent;
}

/* This spacing has been tweaked to closely match Gmail+Chrome "paragraph" (two line breaks) spacing.
Note that we only use a top margin and not a bottom margin -- this prevents the
"blank line" look at the top of the email (issue #243).
*/
p {
  /* !important is needed here because Hotmail/Outlook.com uses !important to
     kill the margin in <p>. We need this to win. */
  margin: 0 0 1.2em 0 !important;
}

table, pre, dl, blockquote, q, ul, ol {
  margin: 1.2em 0;
}

ul, ol {
  padding-left: 2em;
}

li {
  margin: 0.5em 0;
}

/* Space paragraphs in a list the same as the list itself. */
li p {
  /* Needs !important to override rule above. */
  margin: 0.5em 0 !important;
}

/* Smaller spacing for sub-lists */
ul ul, ul ol, ol ul, ol ol {
  margin: 0;
  padding-left: 1em;
}

/* Use Roman numerals for sub-ordered-lists. (Like Github.) */
ol ol, ul ol {
  list-style-type: lower-roman;
}

/* Use letters for sub-sub-ordered lists. (Like Github.) */
ul ul ol, ul ol ol, ol ul ol, ol ol ol {
  list-style-type: lower-alpha;
}

dl {
  padding: 0;
}

dl dt {
  font-size: 1em;
  font-weight: bold;
  font-style: italic;
}

dl dd {
  margin: 0 0 1em;
  padding: 0 1em;
}

blockquote, q {
  border-left: 4px solid #DDD;
  padding: 0 1em;
  color: #777;
  quotes: none;
}

blockquote::before, blockquote::after, q::before, q::after {
  content: none;
}

blockquote{border-left: 15px solid #c76c0c;font-family: Georgia, serif;font-size:15px;text-align: justify;background: #fff;line-height: 1.2;
display:block;margin: 0 0 20px;padding: 15px 20px 15px 45px;position: relative; font-size: 16px;color: #666;
border-right: 2px solid #c76c0c;box-shadow: 2px 2px 15px #ccc;-webkit-box-shadow: 2px 2px 15px #ccc;-moz-box-shadow: 2px 2px 15px #ccc;}
blockquote::before{font-size: 50px;position: absolute;color: #999;left: 10px;top:5px;font-weight: bold;content: "\201C";}
blockquote a{cursor: pointer;color: #c76c0c;text-decoration: none;background: #eee;padding: 0 3px;}
blockquote a:hover{color: #555;}

h1, h2, h3, h4, h5, h6 {
  margin: 1.3em 0 1em;
  padding: 0;
  font-weight: bold;
}

h1 {
  font-size: 1.6em;
  border-bottom: 1px solid #ddd;
}

h2 {
  font-size: 1.4em;
  border-bottom: 1px solid #eee;
}

h3 {
  font-size: 1.3em;
}

h4 {
  font-size: 1.2em;
}

h5 {
  font-size: 1em;
}

h6 {
  font-size: 1em;
  color: #777;
}

table {
  padding: 0;
  border-collapse: collapse;
  border-spacing: 0;
  font-size: 1em;
  font: inherit;
  border: 0;
}

tbody {
  margin: 0;
  padding: 0;
  border: 0;
}

table tr {
  border: 0;
  border-top: 1px solid #CCC;
  background-color: white;
  margin: 0;
  padding: 0;
}

table tr:nth-child(2n) {
  background-color: #F8F8F8;
}

table tr th, table tr td {
  font-size: 1em;
  border: 1px solid #CCC;
  margin: 0;
  padding: 0.5em 1em;
}

table tr th {
  font-weight: bold;
  background-color: #F0F0F0;
}

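Sunday, September 4, 2022

[Notes] Building the Vocabulary

When preprocessing data for an IR system, we want to build posting lists, so we first need a vocabulary of terms; only then can we build the posting list for each term. Several common problems come up at this stage.

This post briefly records the considerations the book raises.

1. Tokenization

Tokenization is the technique of splitting a sentence and extracting units from it. An extracted unit is called a token, and is sometimes loosely called a word or a term, but in practice we often want to distinguish token from those other names. The book defines a token as follows:

A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.

Besides token, there are two more names: type and term. The book defines them as follows:

A type is the class of all tokens containing the same character sequence.

A term is a (perhaps normalized) type that is included in the IR system’s dictionary.

The book's own example probably makes this easier to understand. Suppose we want to tokenize the following sentence:

to sleep perchance to dream

Here there are 5 tokens: to, sleep, perchance, to, dream. There are only 4 types (to, sleep, perchance, dream), because to appears twice. If the IR system handles stop words, we might treat to as a stop word and remove it, in which case only 3 terms end up indexed: sleep, perchance, dream.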
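The counting above can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes plain whitespace splitting and a one-word stop list (STOP_WORDS and tokens_types_terms are names I made up for illustration), not the tokenizer of any real IR system.

```python
# Hypothetical stop-word list for illustration; a real system would use a larger one.
STOP_WORDS = {"to"}

def tokens_types_terms(text, stop_words=STOP_WORDS):
    """Return (tokens, types, terms) for a whitespace-tokenized string."""
    tokens = text.split()                              # every occurrence counts
    types = list(dict.fromkeys(tokens))                # distinct sequences, first-seen order
    terms = [t for t in types if t not in stop_words]  # what would reach the dictionary
    return tokens, types, terms

tokens, types, terms = tokens_types_terms("to sleep perchance to dream")
print(len(tokens), tokens)
print(len(types), types)
print(len(terms), terms)
```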

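So how do we usually tokenize? At first glance it seems enough to split on whitespace and strip out the punctuation. In practice, though, some harder problems show up, and we need to take other factors into account. For example:

The apostrophe (') in English can mark a possessive or a contraction.
Different languages have their own characteristics.
Specific domains may have their own special types.
The hyphen (-) also has several different uses in English.
Multi-word terms cannot be handled by splitting on whitespace alone.

Each of these cases is described briefly below.

1.1. Apostrophe (')

The ' character has several uses in English, as in the following sentence:

Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.

Three different uses appear here at once:

The ' in O'Neill is part of the name.
The ' in boys' and Chile's marks the possessive.
The ' in aren't marks a contraction.

How ' is handled can affect matching later in the IR system. For example, if we split on ', aren't becomes aren and t, which looks odd. Likewise, O'Neill becomes O and Neill, so a search for o'neill can fail to match the corresponding token if the query is not tokenized the same way.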
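To make the difference concrete, here is a small sketch comparing two apostrophe strategies on the sentence above. Both regular expressions are my own illustrative choices, not rules taken from the book.

```python
import re

SENTENCE = "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."

def split_on_non_alnum(text):
    """Naive strategy: every non-alphanumeric character is a separator."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def keep_inner_apostrophes(text):
    """Alternative: keep an apostrophe only when it sits between two letters."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)*", text)]

naive = split_on_non_alnum(SENTENCE)
kept = keep_inner_apostrophes(SENTENCE)
print(naive)
print(kept)
```

With the naive strategy the index contains o and neill but no o'neill, so the query has to go through the same function before it can match.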

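1.2. Language Characteristics

Tokenization behavior depends heavily on the language, and approaches differ considerably between languages. Chinese and Japanese, for example, are not delimited by whitespace; the characters all run together.

1.3. Domain Characteristics

In specific domains, domain-specific vocabulary may need to be recognized as terms, for example C++ and C#, or the aircraft model B-52 (a strategic bomber). Likewise, URLs (https://xxx), e-mail addresses (abc@abc.com), IP addresses (a.b.c.d), and so on may also need to be treated as single terms.

1.4. Hyphenation (-)

Handling hyphens in English can get complicated. The book lists several uses of the hyphen:

Separating vowels between word parts, as in co-education. This example really stands for coeducation.
Joining several nouns into a new name, for example Hewlett-Packard.
Grouping several words together, for example the hold-him-back-and-drag-him-away maneuver. This example should be split into separate terms.

1.5. Special Vocabulary

I think this kind of information can also count as a domain characteristic. One example from the book is New York University versus York University. If they are not each treated as a single term, and are instead split purely on whitespace, a search for York University will also match New York University, which is clearly not what the user is looking for.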
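The York University problem is easy to reproduce. The sketch below assumes a toy AND match over whitespace-split, lowercased terms; the two documents and the helper names are made up for illustration.

```python
# Two hypothetical documents, keyed by document id.
DOCS = {
    1: "New York University admissions",
    2: "York University admissions",
}

def whitespace_terms(text):
    """Lowercase the text and split it on whitespace into a set of terms."""
    return set(text.lower().split())

def and_match(query, docs):
    """Return ids of documents that contain every whitespace-split query term."""
    wanted = whitespace_terms(query)
    return sorted(doc_id for doc_id, text in docs.items()
                  if wanted <= whitespace_terms(text))

print(and_match("York University", DOCS))
print(and_match("New York University", DOCS))
```

The first query returns both documents, because nothing in the index records that New York University is a single unit.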

2. Stop Words

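The purpose of stop words is to drop terms that occur very frequently but have little to do with the document itself. Because removing stop words sometimes loses meaning, modern IR systems tend to keep them and assign them a lower weight instead, which reduces their influence. For example, if to is removed from flights to London, the meaning of the phrase is lost.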
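One common way to give frequent words a lower weight is inverse document frequency (IDF). The notes above do not say which weighting scheme is meant, so IDF is my assumption here, and the four documents are made up.

```python
import math

# Hypothetical mini corpus; "to" appears in every document.
DOCS = [
    "flights to london",
    "trains to paris",
    "flights from london to paris",
    "hotels close to london",
]

def idf(term, docs=DOCS):
    """Inverse document frequency: log(N / df), or 0.0 for an unseen term."""
    df = sum(1 for doc in docs if term in doc.split())
    return math.log(len(docs) / df) if df else 0.0

for term in ("to", "london", "flights"):
    print(term, round(idf(term), 3))
```

The word to stays in the index, so a phrase such as flights to London can still be matched, but it contributes almost nothing to the score.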

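There are actually two more subsections, but I'll add notes on them when I have time….
That said, search is an extremely common activity these days, yet from what I have been reading lately, the field feels somewhat niche?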

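Sunday, June 5, 2022

Repairing the Windows 10 Boot Partition

I was upgrading a family member's PC, and this upgrade also moved the system drive to an SSD, so the data on the old hard drive had to be migrated to the new SSD. At first I figured I only needed to back up the data on the C drive, so I found a backup tool, Macrium Reflect [1], to image the drive, put the image on a USB stick, took it home, and started assembling the machine and restoring the system drive from the image.

When I actually went to restore, though, I found a problem: I had not backed up the boot partition.... My idea at the time was to install a fresh Windows 10 first, so that the installer would repartition the disk and create the boot partition and whatever else is supposed to be there. I don't really know what goes in there, after all, so letting the Windows 10 installer handle it seemed safer. Once that was done, I would use Macrium Reflect to write the data from the image over the C drive. Perfect!

The fantasy was lovely and reality was cruel: after overwriting, Windows would not boot at all. In hindsight, the likely cause is that the configuration in the boot partition (the disk uses GPT here, because I wanted to boot via UEFI) identifies the system drive by UUID, so the UUID recorded in the boot partition probably no longer matched the actual UUID of the C drive after I wrote the image's data over it. And so began the process of figuring out how to repair the boot partition.

I didn't fully understand what the steps were doing, to be honest XD. In short, you open a Command Prompt from the Startup Repair screen and mount the hidden boot partition so its files become accessible. Then you run bootrec /fixboot to repair it [2]. As I recall, though, this step failed for me with an access denied error, and I don't think I ever managed to resolve it.

What finally worked, I believe, was mounting the boot partition at V: and running bcdboot C:\windows /s V: /f UEFI [3]. I think this adds a new boot entry to the boot process, because on the next reboot Windows showed two Windows installations to choose from. After that, I used msconfig [4] inside Windows to remove the boot entry that was not in use.

References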

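Thursday, May 19, 2022

Realtek and Intel Drivers on Windows 10

I recently replaced my motherboard with an MSI MAG H670 TOMAHAWK WIFI DDR4. I expected that nothing special would be needed afterwards and Windows 10 would run smoothly on the new hardware. Reality was not as nice as I imagined, and I ended up with two driver problems.

The first: everything was fine at first, but after a reboot the computer had no sound! Device Manager reported that the device was working properly, and the speaker icon in the taskbar showed up normally. After a very long search, I roughly narrowed it down to some conflict or problem with the Realtek HD Audio driver. I am apparently not the only one; quite a few people seem to have hit it, and one user has even compiled a list of working drivers [2].... The conclusion, though, is that none of the fixes I found online [1-3] worked for me. My sound only came back because I wiped the whole drive and reinstalled Windows 10. Since the reinstall I have been using the driver built into Windows, because I don't dare install the vendor driver again for fear of falling back into the no-sound-after-reboot loop.... All I need is sound anyway; I don't really need any audio effects.

Then, because I had reinstalled Windows 10, I ran into another problem that I found even more bizarre.... The MSI MAG H670 TOMAHAWK WIFI DDR4 has an onboard Intel® I225V 2.5Gbps network chip and an Intel® Wi-Fi 6 wireless chip. Once Windows 10 was installed and I tried to install the network drivers, the Wi-Fi driver told me the Windows version was too old to install, and the I225V driver said it could not find a matching chip. I was completely baffled.... After another long search I found the tutorial in [4]. It turns out both problems are the same one: Windows 10 has to be upgraded to version 1809 (the October 2018 update) or later before the drivers can be installed. My Windows 10 image was the original release, 1507, so they would not install.... One fix to try is [4], which explains how to edit the INF file yourself to force the driver installation. My eventual solution, though, was to dig out a very old USB wireless adapter, get online with it, and start running Windows Update....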

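Saturday, April 9, 2022

Vespa's Timeout Mechanism

When Vespa processes a query, it has a default timeout mechanism that returns whatever results have already been collected when time runs out, instead of discarding them and replying with a 504 timeout. This behavior reflects the thinking behind modern reactive systems. This post briefly introduces the timeout mechanism [1] and mentions a case I ran into recently.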
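On the client side, a partial result can be told apart from a complete one by looking at the coverage information in the response. The sketch below assumes the JSON response carries a root.coverage.degraded.timeout flag, which is how I understand Vespa's result format; the sample response is hand-written and the helper name is made up.

```python
def timed_out_partially(response):
    """Return True if a query response reports timeout-degraded coverage.

    Assumes the response has the shape root -> coverage -> degraded -> timeout.
    """
    coverage = response.get("root", {}).get("coverage", {})
    return bool(coverage.get("degraded", {}).get("timeout", False))

# Hand-written sample shaped like a query response; all values are made up.
sample = {
    "root": {
        "coverage": {"coverage": 80, "full": False, "degraded": {"timeout": True}},
        "children": [],
    }
}
print(timed_out_partially(sample))
```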

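Monday, February 21, 2022

S3 May Return a 403 Error When a File Does Not Exist

A note to self: checking whether a file exists in S3 requires the s3:ListBucket permission. Without it, the error S3 returns for a missing file is 403 rather than 404. The 403 indicates that the request would need to list the bucket but lacks the permission to do so….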
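As a rough illustration, an IAM policy that lets a reader see 404 for missing keys grants s3:ListBucket on the bucket itself in addition to s3:GetObject on the objects. The sketch below just builds that policy document as a Python dict; the bucket name and the helper name are placeholders I made up.

```python
import json

def read_policy(bucket):
    """Build an IAM policy that allows reading objects and listing the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # object-level action: applies to keys inside the bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {   # bucket-level action: applies to the bucket ARN, not to its keys
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }

policy = read_policy("my-example-bucket")  # placeholder bucket name
print(json.dumps(policy, indent=2))
```

Note that the two statements use different resources: the object ARN ends with /*, while the bucket ARN does not.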

References

AWS S3 IAM errors with missing files: 404 expected, 403 returned

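Sunday, February 20, 2022

Custom Output in JMeter

JMeter's default package can produce reports with figures such as throughput, latency, and status code for the system under test, but things get a little more troublesome when what you want to output is extracted from the API response. It can still be done, though there appear to be some limitations. This post briefly records what is needed.

Defining custom variables with sample_variables

JMeter has a sample_variables property that can be passed in when starting a test. When JMeter writes the JTL at the end, it also writes out the variables listed in sample_variables.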

jmeter -Jsample_variables=price -n -t mytest.jmx -l test_result.csv

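In the command above, the custom variable is price. As long as the test writes a value into the price variable while it runs, JMeter includes price when it writes the JTL. The output looks something like this: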

timeStamp,elapsed,label,responseCode,responseMessage,threadName,dataType,success,failureMessage,bytes,sentBytes,grpThreads,allThreads,URL,Latency,IdleTime,Connect,"price"
1645111313217,1292,Random Commerce Request,200,OK,Thread Group 1-6,text,true,,946,150,10,10,https://random-data-api.com/api/commerce/random_commerce,1285,0,1010,65.85

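Notice the extra "price" column at the end.

Generating a report for custom variables

This seems to have some limitations. What I originally wanted was a table like the aggregate report that computes the mean, standard deviation, median, percentiles, and so on for the custom variable. As far as I can tell, though, JMeter can only include custom variables when generating the HTML report, and the report layout does not seem to leave much room for adjustment either. I don't know whether some other plugin could help; I have not found one so far….

Just for the record: report generation currently does not work with Java 17 and throws an error. Switching to Java 8 or 11 works.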
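As a workaround outside JMeter, the statistics mentioned above can be computed directly from the CSV-format JTL. This is a minimal sketch using only the Python standard library; it assumes the custom column is named price and holds numbers, and the sample rows are made up (most JTL columns are omitted for brevity).

```python
import csv
import io
import statistics

def summarize(jtl_file, column="price"):
    """Compute summary statistics for one numeric column of a CSV-format JTL."""
    reader = csv.DictReader(jtl_file)
    values = [float(row[column]) for row in reader if row.get(column)]
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    pct = statistics.quantiles(values, n=100, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "median": statistics.median(values),
        "p90": pct[89],
        "p99": pct[98],
    }

# Made-up rows in the same CSV layout as a JTL.
SAMPLE = io.StringIO(
    'timeStamp,elapsed,label,"price"\n'
    "1,100,req,10\n"
    "2,110,req,20\n"
    "3,120,req,30\n"
    "4,130,req,40\n"
    "5,140,req,50\n"
)
stats = summarize(SAMPLE)
print(stats)
```

For a real run, replace SAMPLE with the result file, for example `with open("test_result.csv", newline="") as f: print(summarize(f))`.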

References

sample_variables property

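Saturday, February 5, 2022

Preparing a Vespa Test Environment

I originally wanted to record the process of building a Vespa container, but after flipping through an earlier post [1], I found that although some details differ, it is largely the same, so I'll keep this one short. The content is pretty much what is written on Vespa's GitHub anyway 😆.

This post is really a preparation step. I want to record some Vespa experiment data soon, but I can't put company data on the blog (it's not that the company forbids it; applying and getting it reviewed just feels like a hassle and I can't be bothered 🙈), so I want to test with a simple Vespa container. For that, I need a Vespa environment on my own machine with some reasonable test data loaded. The Vespa team provides an e-commerce sample on their GitHub [2] that looks decent, so I plan to use it to set up the initial environment.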

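Tuesday, January 25, 2022

Vespa's New Feature: hash dictionary

While browsing the Vespa blog [1], I saw that Vespa added the hash dictionary feature in May 2021, so here are some notes on the details of this feature.

What is a hash dictionary?

In Vespa's design, a field configured as an attribute can additionally be given the fast-search setting, which makes Vespa build an index for the field automatically to speed up search. Previously Vespa's fast-search could only build this with a b-tree data structure, but now we can choose a hash table instead, which performs somewhat better than the b-tree in particular scenarios.

Limitations of the hash dictionary

My testing so far shows that a hash dictionary has to be configured as cased, and cased must be set on both the dictionary and match settings to pass validation. I don't think the documentation [2] points this out very clearly….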

field id type string {
    indexing: summary | attribute
    attribute: fast-search
    dictionary {
        hash
        cased
    }
    match: cased
}

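Effect of the hash dictionary

To compare the effect, we first need to look at what it is being compared against, namely the default btree dictionary. For an engineer, seeing the keywords b-tree and hash is probably enough to guess the difference! In short, it is O(log n) versus O(1) XD. Beyond that, because of the hash dictionary limitation described above, configuring a hash dictionary in Vespa also raises a case-sensitivity issue that has to be considered.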
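The two differences (lookup cost and case handling) can be modeled in plain Python. This is only a conceptual sketch and not how Vespa implements its dictionaries: a sorted list with binary search stands in for an uncased ordered dictionary, and a set stands in for a cased hash dictionary. Apart from B00GQ22Y6Y, the keys are made up.

```python
import bisect

VALUES = ["B00GQ22Y6Y", "B00ABC1234", "B01XYZ9876"]  # last two keys are hypothetical

# Sorted, lowercased keys stand in for an uncased ordered (b-tree-like) dictionary.
uncased_sorted = sorted(v.lower() for v in VALUES)
# A plain set stands in for a cased hash dictionary.
cased_hash = set(VALUES)

def lookup_uncased_sorted(key):
    """Binary search over normalized keys: O(log n), case-insensitive."""
    k = key.lower()
    i = bisect.bisect_left(uncased_sorted, k)
    return i < len(uncased_sorted) and uncased_sorted[i] == k

def lookup_cased_hash(key):
    """Hash lookup on the raw key: O(1) on average, case-sensitive."""
    return key in cased_hash

for key in ("B00GQ22Y6Y", "b00gq22y6y"):
    print(key, lookup_uncased_sorted(key), lookup_cased_hash(key))
```

This only models the data structures; it says nothing about how Vespa normalizes the query or the stored value before the lookup.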

First, the btree case. With the following settings, the default btree behavior is uncased, meaning "bear" = "BEAR" = "Bear".

field id type string {
    indexing: summary | attribute
    attribute: fast-search
}

Testing this in the environment built in [3], there is a document with asin: "B00GQ22Y6Y" that looks like this:

{
    "pathId": "/document/v1/item/item/docid/B00GQ22Y6Y",
    "id": "id:item:item::B00GQ22Y6Y",
    "fields": {
        "title": "Trendy Style Hand-knit Warm Lining inside Winter Bucket Hat w. Cute Flower-Purple #H01",
        "asin": "B00GQ22Y6Y",
        ...(skipped)...
    }
}

At this point both of the following YQL queries find the document. This is mainly because the default setting is uncased, so the lookup succeeds regardless of case.

SELECT * FROM item WHERE asin contains "b00gq22y6y";
SELECT * FROM item WHERE asin contains "B00GQ22Y6Y";

However, a hash dictionary requires the cased setting, so after switching to the following hash dictionary configuration, things behave differently:

field id type string {
    indexing: summary | attribute
    attribute: fast-search
    dictionary {
        hash
        cased
    }
    match: cased
}

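The result is that asin contains "b00gq22y6y" finds the document, while asin contains "B00GQ22Y6Y" does not…. This result surprised me quite a bit; I don't know whether it is a bug or whether I am using it incorrectly.

References

Limitations of Gradle's 'plugins' Block

I was slow to pick up Gradle and only started at the end of last year. Recently, while trying to set up a small project of my own, I ran into a strange problem. I used the gradle init command to generate the first version of the Gradle configuration, and then tried to add the java plugin in settings.gradle in the root directory, like this: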

plugins {
    id "java"
}

repositories {
    mavenCentral()
}

rootProject.name = 'sample-vespa-data-feeder'
include('app')

Unexpectedly (to me, anyway XD…), I got the following error message:

An exception occurred applying plugin request [id: 'java']
> Failed to apply plugin 'org.gradle.java'.
   > Could not create plugin of type 'JavaPlugin'.
      > Unable to determine constructor argument #3: missing parameter of type JvmPluginServices, or no service of type JvmPluginServices.

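While poking around, I came across [1] and suddenly realized that the plugins block is the newer syntax, and that it apparently cannot be used this way at the root. More precisely, a plugins block in settings.gradle applies settings plugins, whereas java is a project plugin, so it belongs in a build.gradle. Once I moved both plugins {} and repositories {} into app/build.gradle, everything worked….

BTW, a small aside: when I generated the configuration with gradle init I chose single project, yet what it produced still has a multi-project structure. It makes me wonder whether Gradle's design really has no such thing as a single project….?

References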

What the difference in applying gradle plugin
