Class TokenStreamToAutomaton

java.lang.Object
org.apache.lucene.analysis.TokenStreamToAutomaton
Direct Known Subclasses:
ConcatenateGraphFilter.EscapingTokenStreamToAutomaton

public class TokenStreamToAutomaton extends Object
Consumes a TokenStream and creates an Automaton whose transition labels are the UTF-8 bytes (or Unicode code points if unicodeArcs is true) of each term from the TermToBytesRefAttribute. A POS_SEP transition is inserted between adjacent tokens, and a HOLE transition is inserted for each hole (a position that has no token).
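To make the label layout concrete, the following is a minimal, stdlib-only sketch of the label sequence along the single path of a linear token stream. It is a simplified model, not the Lucene implementation: the class name `TokenLabelSketch`, the `labels` helper, the two constant values, and the exact placement of separators around a hole are all assumptions made for illustration, and it ignores PositionLengthAttribute (so it cannot represent token graphs).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class TokenLabelSketch {
    // Placeholder label values assumed for this sketch; real code should refer to
    // TokenStreamToAutomaton.POS_SEP and TokenStreamToAutomaton.HOLE instead.
    static final int POS_SEP = 0x1f;
    static final int HOLE = 0x1e;

    /**
     * Returns the labels for a linear token sequence. posIncs[i] is the position
     * increment of tokens[i]; a value greater than 1 means positions were skipped
     * (for example, by a stop-word filter).
     */
    static List<Integer> labels(String[] tokens, int[] posIncs) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.add(POS_SEP); // separator between adjacent positions
            }
            // Assumed layout: one HOLE plus a separator per skipped position.
            for (int skipped = 1; skipped < posIncs[i]; skipped++) {
                out.add(HOLE);
                out.add(POS_SEP);
            }
            // Each UTF-8 byte of the term becomes one unsigned label.
            for (byte b : tokens[i].getBytes(StandardCharsets.UTF_8)) {
                out.add(b & 0xff);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(labels(new String[] {"ab", "c"}, new int[] {1, 1}));
        System.out.println(labels(new String[] {"a", "b"}, new int[] {1, 2}));
    }
}
```

In this model, a position increment of 2 on the second token yields one HOLE label between the two terms, which is how a removed stop word stays visible in the resulting path.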
  • Field Details

    • preservePositionIncrements

      private boolean preservePositionIncrements
    • finalOffsetGapAsHole

      private boolean finalOffsetGapAsHole
    • unicodeArcs

      private boolean unicodeArcs
    • POS_SEP

      public static final int POS_SEP
      The transition label inserted between two adjacent tokens.
    • HOLE

      public static final int HOLE
      We add this arc to represent a hole.
  • Constructor Details

    • TokenStreamToAutomaton

      public TokenStreamToAutomaton()
      Sole constructor.
  • Method Details

    • setPreservePositionIncrements

      public void setPreservePositionIncrements(boolean enablePositionIncrements)
      Whether to generate holes in the automaton for missing positions, true by default.
    • setFinalOffsetGapAsHole

      public void setFinalOffsetGapAsHole(boolean finalOffsetGapAsHole)
      If true, any final offset gaps will result in adding a position hole.
    • setUnicodeArcs

      public void setUnicodeArcs(boolean unicodeArcs)
      Whether to make transition labels Unicode code points instead of UTF-8 bytes, false by default.
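The difference between the two labeling modes shows up only for non-ASCII terms. The sketch below is a stdlib-only illustration; `ArcLabelSketch` and its two helper methods are hypothetical names, not part of Lucene. It compares the labels a term would contribute under each setting.

```java
import java.nio.charset.StandardCharsets;

final class ArcLabelSketch {
    /** Labels when arcs are UTF-8 bytes: one unsigned byte value per arc. */
    static int[] utf8Labels(String term) {
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        int[] labels = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            labels[i] = bytes[i] & 0xff;
        }
        return labels;
    }

    /** Labels when arcs are Unicode code points: one code point per arc. */
    static int[] codePointLabels(String term) {
        return term.codePoints().toArray();
    }

    public static void main(String[] args) {
        String term = "caf\u00e9"; // "café"
        System.out.println(utf8Labels(term).length);      // byte-labeled arcs
        System.out.println(codePointLabels(term).length); // code-point-labeled arcs
    }
}
```

With byte labels, a single accented character spans several transitions; with code point labels, it is exactly one transition, which is usually what character-level automaton operations expect.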
    • changeToken

      protected BytesRef changeToken(BytesRef in)
      Subclass and implement this if you need to change the token (such as escaping certain bytes) before it's turned into a graph.
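As an illustration of the kind of byte escaping a subclass might perform, here is a stdlib-only sketch working on a plain byte array. The names `EscapeSketch`, `RESERVED`, and `ESCAPE`, and the escaping scheme itself, are assumptions for this example; in a real subclass the same logic would sit in an override of changeToken(BytesRef) and read the bytes, offset, and length fields of the incoming BytesRef.

```java
import java.io.ByteArrayOutputStream;

final class EscapeSketch {
    // Hypothetical byte values: a byte that must not appear unescaped in a term,
    // and the byte used to escape it. Both are assumptions for this sketch.
    static final byte RESERVED = 0x1f;
    static final byte ESCAPE = 0x00;

    /** Prefixes every reserved byte (and every escape byte) with ESCAPE. */
    static byte[] escape(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length + 2);
        for (byte b : in) {
            if (b == RESERVED || b == ESCAPE) {
                out.write(ESCAPE);
            }
            out.write(b);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] escaped = escape(new byte[] {'a', 0x1f, 'b'});
        System.out.println(escaped.length); // one byte longer than the input
    }
}
```

Escaping the escape byte as well keeps the transformation reversible, so a consumer can always tell a literal reserved byte from a structural one.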
    • toAutomaton

      public Automaton toAutomaton(TokenStream in) throws IOException
      Pulls the graph (including PositionLengthAttribute) from the provided TokenStream, and creates the corresponding automaton where arcs are bytes (or Unicode code points if unicodeArcs = true) from each term.
      Throws:
      IOException
    • addHoles

      private static void addHoles(Automaton.Builder builder, RollingBuffer<TokenStreamToAutomaton.Position> positions, int pos)