Class UnicodeCompressor

java.lang.Object
com.ibm.icu.text.UnicodeCompressor
All Implemented Interfaces:
SCSU

public final class UnicodeCompressor extends Object implements SCSU
A compression engine implementing the Standard Compression Scheme for Unicode (SCSU) as outlined in Unicode Technical Report #6.

The SCSU works by using dynamically positioned windows, each consisting of 128 consecutive Unicode characters. During compression, characters within a window are encoded in the compressed stream as the bytes 0x80 - 0xFF. The SCSU provides transparency for the characters (bytes) between U+0000 - U+00FF. The SCSU approximates the storage size of traditional character sets, for example 1 byte per character for ASCII or Latin-1 text, and 2 bytes per character for CJK ideographs.
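The window mapping described above can be illustrated with a minimal sketch. It is not ICU4J code: the class ScsuWindowSketch and its methods are hypothetical names, and the window offset 0x0400 is chosen only for illustration. The sketch assumes a 128-character window whose members map one-to-one onto the byte values 0x80 - 0xFF.

```java
// Hypothetical illustration of SCSU-style window mapping; not part of ICU4J.
final class ScsuWindowSketch {
    // A window covers 128 consecutive code points.
    static final int WINDOW_SIZE = 128;

    // True if c lies inside the window that starts at windowOffset.
    static boolean inWindow(int c, int windowOffset) {
        return c >= windowOffset && c < windowOffset + WINDOW_SIZE;
    }

    // Map a character inside the window to a byte value in 0x80..0xFF.
    static int encodeInWindow(int c, int windowOffset) {
        if (!inWindow(c, windowOffset)) {
            throw new IllegalArgumentException("character is outside the window");
        }
        return 0x80 + (c - windowOffset);
    }

    // Inverse mapping: recover the character from a byte value in 0x80..0xFF.
    static int decodeFromWindow(int b, int windowOffset) {
        return windowOffset + (b - 0x80);
    }

    public static void main(String[] args) {
        int offset = 0x0400; // assumed window position, for illustration only
        int c = 0x0416;      // CYRILLIC CAPITAL LETTER ZHE
        int b = encodeInWindow(c, offset);
        System.out.printf("U+%04X -> 0x%02X%n", c, b); // prints "U+0416 -> 0x96"
    }
}
```

A run of text that stays inside one window therefore costs one byte per character, which is where the storage figures quoted above come from.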

USAGE

The static methods on UnicodeCompressor may be used in a straightforward manner to compress simple strings:

  String s = ... ; // get string from somewhere
  byte [] compressed = UnicodeCompressor.compress(s);
 

The static methods have a fairly large memory footprint. For finer-grained control over memory usage, UnicodeCompressor offers more powerful APIs allowing iterative compression:

  // Compress an array "chars" of length "len" using a buffer of 512 bytes
  // to the OutputStream "out"

  UnicodeCompressor myCompressor         = new UnicodeCompressor();
  final int         BUFSIZE              = 512;
  byte []           byteBuffer           = new byte [ BUFSIZE ];
  int               bytesWritten         = 0;
  int []            unicharsRead         = new int [1];
  int               totalCharsCompressed = 0;
  int               totalBytesWritten    = 0;

  do {
    // do the compression
    bytesWritten = myCompressor.compress(chars, totalCharsCompressed, 
                                         len, unicharsRead,
                                         byteBuffer, 0, BUFSIZE);

    // do something with the current set of bytes
    out.write(byteBuffer, 0, bytesWritten);

    // update the no. of characters compressed
    totalCharsCompressed += unicharsRead[0];

    // update the no. of bytes written
    totalBytesWritten += bytesWritten;

  } while(totalCharsCompressed < len);

  myCompressor.reset(); // reuse compressor
 
  • Field Details

    • sSingleTagTable

      private static boolean[] sSingleTagTable
      For quick identification of a byte as a single-byte mode tag
    • sUnicodeTagTable

      private static boolean[] sUnicodeTagTable
      For quick identification of a byte as a Unicode mode tag
    • fCurrentWindow

      private int fCurrentWindow
      Alias to current dynamic window
    • fOffsets

      private int[] fOffsets
      Dynamic compression window offsets
    • fMode

      private int fMode
      Current compression mode
    • fIndexCount

      private int[] fIndexCount
      Keeps count of times character indices are encountered
    • fTimeStamps

      private int[] fTimeStamps
      The time stamps indicate when a window was last defined
    • fTimeStamp

      private int fTimeStamp
      The current time stamp
  • Constructor Details

    • UnicodeCompressor

      public UnicodeCompressor()
      Create a UnicodeCompressor. Sets all windows to their default values.
  • Method Details

    • compress

      public static byte[] compress(String buffer)
      Compress a string into a byte array.
      Parameters:
      buffer - The string to compress.
      Returns:
      A byte array containing the compressed characters.
    • compress

      public static byte[] compress(char[] buffer, int start, int limit)
      Compress a Unicode character array into a byte array.
      Parameters:
      buffer - The character buffer to compress.
      start - The start of the character run to compress.
      limit - The limit of the character run to compress.
      Returns:
      A byte array containing the compressed characters.
    • compress

      public int compress(char[] charBuffer, int charBufferStart, int charBufferLimit, int[] charsRead, byte[] byteBuffer, int byteBufferStart, int byteBufferLimit)
      Compress a Unicode character array into a byte array. This function will only consume input that can be completely output.
      Parameters:
      charBuffer - The character buffer to compress.
      charBufferStart - The start of the character run to compress.
      charBufferLimit - The limit of the character run to compress.
      charsRead - A one-element array. If not null, on return its first element holds the number of characters read from charBuffer.
      byteBuffer - A buffer to receive the compressed data. This buffer must be at minimum four bytes in size.
      byteBufferStart - The starting offset to which to write compressed data.
      byteBufferLimit - The limiting offset for writing compressed data.
      Returns:
      The number of bytes written to byteBuffer.
    • reset

      public void reset()
      Reset the compressor to its initial state.
    • makeIndex

      private static int makeIndex(int c)
      Create the index value for a character. For more information on this function, refer to table X-3 of UTR #6.
      Parameters:
      c - The character in question.
      Returns:
      An index for c
    • inDynamicWindow

      private boolean inDynamicWindow(int c, int whichWindow)
      Determine if a character is in a dynamic window.
      Parameters:
      c - The character to test
      whichWindow - The dynamic window to test
      Returns:
      true if c will fit in whichWindow, false otherwise.
    • inStaticWindow

      private static boolean inStaticWindow(int c, int whichWindow)
      Determine if a character is in a static window.
      Parameters:
      c - The character to test
      whichWindow - The static window to test
      Returns:
      true if c will fit in whichWindow, false otherwise.
    • isCompressible

      private static boolean isCompressible(int c)
      Determine if a character is compressible.
      Parameters:
      c - The character to test.
      Returns:
      true if c is compressible, false otherwise.
    • findDynamicWindow

      private int findDynamicWindow(int c)
      Determine whether a dynamic window containing a given character is defined.
      Parameters:
      c - The character in question
      Returns:
      The dynamic window containing c, or INVALIDWINDOW if not defined.
    • findStaticWindow

      private static int findStaticWindow(int c)
      Determine whether a static window containing a given character is defined.
      Parameters:
      c - The character in question
      Returns:
      The static window containing c, or INVALIDWINDOW if not defined.
    • getLRDefinedWindow

      private int getLRDefinedWindow()
      Find the least-recently defined window.
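
The window lookup and least-recently-defined selection described in the method summaries above can be sketched as follows. This is an illustration under stated assumptions, not the ICU4J implementation: the class WindowSelectionSketch, its methods, the sentinel value -1, the three starting offsets, and the tie-breaking rule (lowest index wins) are all hypothetical. Aligning a new window to a multiple of 128 is also a simplification; the actual scheme restricts window positions to those listed in UTR #6.

```java
// Hypothetical sketch of dynamic-window bookkeeping; not part of ICU4J.
final class WindowSelectionSketch {
    static final int WINDOW_SIZE = 128;
    static final int INVALID_WINDOW = -1; // assumed sentinel for "no window found"

    final int[] offsets;    // start of each dynamic window
    final int[] timeStamps; // when each window was last defined
    int timeStamp = 0;      // the current time stamp

    WindowSelectionSketch(int[] initialOffsets) {
        offsets = initialOffsets.clone();
        timeStamps = new int[offsets.length]; // all windows start at stamp 0
    }

    // Return the index of the window containing c, or INVALID_WINDOW.
    int findWindow(int c) {
        for (int i = 0; i < offsets.length; i++) {
            if (c >= offsets[i] && c < offsets[i] + WINDOW_SIZE) {
                return i;
            }
        }
        return INVALID_WINDOW;
    }

    // Return the window with the smallest time stamp (lowest index on ties).
    int leastRecentlyDefined() {
        int which = 0;
        for (int i = 1; i < timeStamps.length; i++) {
            if (timeStamps[i] < timeStamps[which]) {
                which = i;
            }
        }
        return which;
    }

    // Redefine the least-recently defined window so that it contains c.
    int defineWindowFor(int c) {
        int which = leastRecentlyDefined();
        offsets[which] = c & ~(WINDOW_SIZE - 1); // simplification: align down to 128
        timeStamps[which] = ++timeStamp;
        return which;
    }

    public static void main(String[] args) {
        WindowSelectionSketch s =
            new WindowSelectionSketch(new int[] {0x0080, 0x0400, 0x0600});
        System.out.println(s.findWindow(0x0416));      // prints 1
        System.out.println(s.findWindow(0x0915));      // prints -1
        System.out.println(s.defineWindowFor(0x0915)); // prints 0
        System.out.println(s.findWindow(0x0915));      // prints 0
        System.out.println(s.defineWindowFor(0x3042)); // prints 1
    }
}
```

Stamping a window each time it is redefined means the window replaced next is always the one whose definition is oldest, which keeps recently introduced scripts available for single-byte encoding.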