Friday, April 19, 2013

Splitting and Merging Files in Java

The goal is to split a file into several pieces of a specified length and then merge them back together on the fly.
Reference [4] covers most of this; the basic idea is that, while splitting, you count the bytes written, and once the specified length is reached you switch the output to the next file.
For merging, you can make good use of the SequenceInputStream described in [4]!
Since SequenceInputStream is meant to be fed a series of InputStream objects,
you can also write your own customized subclass of InputStream and feed it in whenever you need some special behavior.
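
Just to illustrate the basic pattern from [4] (the part-file names a.part1 / a.part2 below are made up for this sketch), concatenating two files with SequenceInputStream looks roughly like this:

import java.io.*;
import java.util.Vector;

public static void mergeTwoParts() throws IOException {
  Vector<InputStream> streams = new Vector<InputStream>();
  streams.add(new FileInputStream("a.part1")); // Hypothetical part files.
  streams.add(new FileInputStream("a.part2"));

  // elements() returns the Enumeration<InputStream> that SequenceInputStream expects.
  InputStream in = new SequenceInputStream(streams.elements());
  OutputStream out = new FileOutputStream("a.merged");

  byte[] buffer = new byte[2048];
  int len;
  while((len = in.read(buffer)) != -1) // Reads a.part1 to its end, then continues with a.part2.
    out.write(buffer, 0, len);

  in.close();
  out.close();
}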

For the splitting part, my test program is as follows:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.LinkedList;

public LinkedList<ChunkFile> splitChunkFiles (File file, int lengthOfChunk, File outputDir) {
  if(outputDir == null || !outputDir.exists()) {
    System.err.println("Output directory " + outputDir + " does not exist.");
    return null;
  }
  
  LinkedList<ChunkFile> chunks = new LinkedList<ChunkFile>();
  
  final int overlap = (lengthOfChunk > 51200) ? 51200 : (int) (lengthOfChunk / 5);
  System.out.println("Length of the overall file: " + file.length());
  System.out.println("Length of a chunk file: " + lengthOfChunk + " bytes");
  //System.out.println("Number of chunks: " + chunks.length);
  System.out.println("Length of overlapping: " + overlap + " bytes\n");
  
  // Input
  FileInputStream fis = null;
  //BufferedInputStream bis = null;
  byte[] buffer = new byte[2048];
  int lastIndex = 0;
  int offset = -1; // temp var for storing the index of start.
  int len = -1;
  
  // Output
  FileOutputStream fos = null;
  
  try {
    fis = new FileInputStream(file);
    //bis = new BufferedInputStream(fis);
    int i = 0;
    
    while(true) {
      File outputFile = new File(outputDir, file.getName() + ".part" + (i+1));
      int read = 0; // Record the number of bytes read.
      
      try {
        fos = new FileOutputStream(outputFile);
        offset = lastIndex;
        System.out.println("Current available bytes: " + fis.available());
        System.out.println("Processing file " + outputFile.getName() + ": " + offset + "-" + (offset+lengthOfChunk-1) + " (" + lengthOfChunk + ")");
        
        // Reset the number of read.
        read = 0;
        while( (len = fis.read(buffer)) != -1) {
          // Trim the final read so this chunk does not exceed lengthOfChunk bytes.
          if((read + len) > lengthOfChunk) {
            len = lengthOfChunk - read;
          }
          fos.write(buffer, 0, len);
          fos.flush();
          
          read += len;
          if(read >= lengthOfChunk) break;
        }
      } catch (Exception e) {
        e.printStackTrace();
      } finally {
        if(fos != null) fos.close();
        
        // Move the cursor back.
        System.out.println("Read " + read + " bytes for chunk file " + outputFile.getName());
        lastIndex = lastIndex + read - overlap;
        System.out.println("Move cursor to " + lastIndex);
        
        ++i;
        fis.close();
        fis = new FileInputStream(file);
        fis.skip(lastIndex);
      }
      
      System.out.print("Output " + outputFile.getAbsolutePath() + "\t[");
      long chunkLength = -1;
      if(outputFile.length() < lengthOfChunk) {
        chunkLength = outputFile.length();
        System.out.print("offset: " + offset + ", length: " + chunkLength);
        System.out.println("]\tLength: " + outputFile.length() + "\n");
        
        chunks.addLast(new ChunkFile(outputFile, 0, chunkLength, offset));
        break;
      }
      else {
        chunkLength = outputFile.length() - overlap;
        System.out.print("offset: " + offset + ", length: " + chunkLength);
        System.out.println("]\tLength: " + outputFile.length() + "\n");
        
        chunks.addLast(new ChunkFile(outputFile, 0, chunkLength, offset));
      }
    }
  } catch (IOException e) {
    e.printStackTrace();
  } finally {
    try {
      //if(bis != null) bis.close();
      if(fis != null) fis.close();
    } catch (IOException e) {}
    
    System.out.println("=============================");
    long totalSize = 0;
    for(int i=0 ; i<chunks.size() ; i++) {
      ChunkFile chunkFile = chunks.get(i);
      long start = totalSize; // Non-overlapping start of this chunk in the original file.
      totalSize += chunkFile.length;
      System.out.println(chunkFile.file.getName() + "\t" + start + " ~ " + (totalSize-1));
    }
    System.out.println("=============================");
    for(int i=0 ; i<chunks.size() ; i++) {
      ChunkFile chunkFile = chunks.get(i);
      System.out.println(chunkFile.file.getName() + "\t" + chunkFile.offset + " ; Length " + chunkFile.length);
    }
    System.out.println("=============================");
  }
  
  return chunks;
}

public class ChunkFile {
  public File file = null;       // The chunk file on disk.
  public int offset = -1;        // Offset within the chunk file to start reading from when merging.
  public int actualOffset = -1;  // Start position of this chunk in the original file.
  public long length = -1;       // Number of bytes of this chunk to use when merging.
  
  public ChunkFile(File f, int off, long len, int act_off) {
    this.file = f;
    this.offset = off;
    this.length = len;
    this.actualOffset = act_off;
  }
}

The code block is this long because the point of my test program is to produce chunk files whose contents overlap XD
For example, the first chunk is bytes 0~1024 of the original file and the second chunk is bytes 512~1536,
so the 512 bytes in the middle overlap~.
The program output then has to show exactly which (non-overlapping) range of the original file each chunk represents.
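
For reference, a call to the method above might look like this (the file names are placeholders; with a 10 MB chunk size the overlap is capped at 51200 bytes by the formula at the top of the method):

File source = new File("input.bin"); // Hypothetical source file.
File outputDir = new File("out");    // Must already exist, otherwise the method returns null.
LinkedList<ChunkFile> chunks = splitChunkFiles(source, 10 * 1024 * 1024, outputDir);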

If you just want a simple, small example, I recommend going straight to the example in [4].

As for merging the files, the code I wrote can't be posted online XD, so again I recommend going straight to the example in [4].
My merging code mainly relies on the test code above emitting the correct merge information, so that the merging code knows which part of each chunk file it needs to read;
I then customize my own FileInputStream and give it the range to read.
When merging, just call SequenceInputStream.read(byte[] buf) the way [4] describes;
it automatically concatenates the given InputStreams and streams out their data (note 1), so when the program has to merge the files on the fly and output the result,
you can simply treat it as a single stream and run an ordinary InputStream-style loop to read through all the files in order.
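
Since I can't post my actual merging code, the following is only a rough sketch of the idea under my own assumptions: a hypothetical RangedFileInputStream that exposes just length bytes starting at a given offset of a file, built from the ChunkFile records produced above and fed to SequenceInputStream in the same way as the earlier snippet.

import java.io.*;
import java.util.List;
import java.util.Vector;

// A FileInputStream that only exposes `length` bytes starting at `offset` of the underlying file.
class RangedFileInputStream extends FileInputStream {
  private long remaining;

  public RangedFileInputStream(File file, long offset, long length) throws IOException {
    super(file);
    super.skip(offset);      // Jump to the start of the readable range.
    this.remaining = length; // Bytes still allowed to be read before reporting EOF.
  }

  @Override
  public int read(byte[] b, int off, int len) throws IOException {
    if(remaining <= 0) return -1; // Range exhausted: pretend end-of-stream.
    int n = super.read(b, off, (int) Math.min(len, remaining));
    if(n > 0) remaining -= n;
    return n;
  }

  @Override
  public int read(byte[] b) throws IOException {
    return read(b, 0, b.length);
  }

  @Override
  public int read() throws IOException {
    if(remaining <= 0) return -1;
    int b = super.read();
    if(b != -1) remaining--;
    return b;
  }
}

Merging is then just the usual SequenceInputStream loop, with one ranged stream per chunk:

public static void mergeChunks(List<ChunkFile> chunks, OutputStream out) throws IOException {
  Vector<InputStream> streams = new Vector<InputStream>();
  for(ChunkFile c : chunks)
    streams.add(new RangedFileInputStream(c.file, c.offset, c.length)); // Only the non-overlapping part.

  SequenceInputStream sis = new SequenceInputStream(streams.elements());
  byte[] buffer = new byte[2048];
  int len;
  while((len = sis.read(buffer)) != -1) // Each chunk is read to its (virtual) end, in order.
    out.write(buffer, 0, len);
  sis.close();
}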

Note 1: In the file case, when the previous file reaches its end, SequenceInputStream first returns the number of bytes read so far
instead of immediately pulling in the next InputStream to fill the rest of the buffer, so it is really not a complicated implementation.
If you are interested, go read the source code of SequenceInputStream yourself; it is quite simple and easy to follow XD

References:
1. How to Split Large Files Into Smaller Chunks, with Java
2. Java - Reading file chunk by chunk
3. How to chain multiple different InputStreams into one InputStream
4. Java Gossip: SequenceInputStream
