黑毛到白毛的攻城獅之路: Java 的 String Matching

2012年9月19日星期三

Java 的 String Matching

最近比較悠閒，打算找個時間來想看看要怎麼用 Java 來找專案中沒有再用到的東西。
基本概念就是要搜尋某個資料夾裡的所有檔案
然後對這些檔案的檔名，去其他檔案裡面搜尋，看有沒有被 reference 到。
其實就是 String Matching 的方法～。

搜尋資料時，找到 [1] 這個討論地感覺蠻完整的網路資料！（然後我就懶得再找其他資料了 XD）
[1] 主要是討論如何提昇 Java 中的 String operation
其中 Pattern First Exact String Matching 的段落，就是專門在討論在 Java 中做 String Matching 的問題。

Java 原生並沒有任何特殊的 String Matching，使用 indexOf() 或 lastIndexOf() 等都是暴力法搜尋
因此 [1] 找到了一個電子書「Exact String Matching Algorithms [2]」，並實作了上面提到的多種 String Matching 的演算法。
PS. [1] 的作者超佛心！m(_ _)m.....

而為了要同時比較 Java 原生的 String Matching 方法，跟其他演算法的差異
[1] 也用原生的方法寫了以下兩種字串比對的 method
第一種是用 Java 的 indexOf() 函式，而第二種是用 Java 的 Regular Expression 比對：

Method #1 – The indexOf() approach

public static List findAll(String pattern, String source) {
　　List idx = new ArrayList<Integer>();
　　int id = -1;
　　int shift = pattern.length();
　　int scnIdx = -shift;
　　while (scnIdx != -1 || id == -1) {
　　　　idx.add(scnIdx);
　　　　id = scnIdx + shift;
　　　　scnIdx = source.indexOf(pattern, id);
　　}
　　idx.remove(0);
　　return idx;
}

Method #2 – The Matcher find() approach

public static List findAll(String pattern, String source) {
　　List idx = new ArrayList<Integer>();
　　Pattern ptrn = Pattern.compile(pattern);
　　Matcher mtch = ptrn.matcher(source);
　　while(mtch.find())
　　　　idx.add(mtch.start());
　　return idx;
}

效率比較的結果如下圖，其中藍色的部份是前處理的時間，橘色的是實際做字串比對的時間：

轉載自 [1]
圖中最下面的 JVM、JVM2 應該就是上面提到的使用 Java 原生方法的 Method #1、#2。
[1] 的結論是，從圖中可以看出，實際上 Java 原生的方法並不算慢
如果要處理的是中小型文件，那直接用 Java 原生的方法也不會慢太多（甚至可能比較快）
但要處理大型文件或者文件數量龐大時，建議就可以從 [1] 列出來的一堆演算法中尋找合適的演算法來實作。

參考資料：
1、Java Best Practices – String performance and Exact String Matching
2、Exact String Matching Algorithms

沒有留言:

張貼留言

2012年9月19日 星期三

Java 的 String Matching

沒有留言:

2012年9月19日星期三