Summary
Defines how a document will be tokenized. Tokens
are fragments of a text that can be defined as
words, lines, sentences, paragraphs, characters or
specific patterns.
Fields
Words
Splits the text using each word as a token.
Lines
Splits the text at the end of each line, so that each line becomes a token.
Sentences
Splits the text at the end of each sentence, so that each sentence becomes a token.
Paragraphs
Splits the text at the end of each paragraph, so that each paragraph becomes a token.
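The four options above (Words, Lines, Sentences, Paragraphs) can be approximated outside the tool with a short Python sketch. This is only an illustration, not the tool's implementation: the function names are hypothetical, and the rules for what ends a word, sentence, or paragraph are assumptions stated in the comments.

```python
import re

text = "First line. Still first paragraph.\nSecond line!\n\nNew paragraph here?"

def split_words(s):
    # Assumption: a word is a maximal run of non-whitespace characters.
    return s.split()

def split_lines(s):
    # Each line, as delimited by line breaks, becomes one token.
    # Blank lines are kept as empty tokens in this sketch.
    return s.splitlines()

def split_sentences(s):
    # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
    return [t for t in re.split(r"(?<=[.!?])\s+", s) if t]

def split_paragraphs(s):
    # Assumption: paragraphs are separated by one or more blank lines.
    return [t for t in re.split(r"\n\s*\n", s) if t.strip()]

print(split_words(text))
print(split_lines(text))
print(split_sentences(text))
print(split_paragraphs(text))
```

Real text often breaks these simple rules (abbreviations such as "e.g." end with a period, for example), so the tool's own sentence and paragraph detection may give different results.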
Characters
Splits the text at the specified characters. Separate the characters
with spaces. To include a space as one of the characters, use ^s.
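As a rough sketch of how such a character list might be interpreted, the Python function below parses a space-separated list and treats ^s as a literal space. The function name split_on_characters is hypothetical, and dropping empty tokens is an assumption; the tool itself may behave differently.

```python
import re

def split_on_characters(text, spec):
    # spec mirrors the Characters field: delimiters separated by spaces,
    # with ^s standing for a literal space.
    delimiters = [" " if item == "^s" else item for item in spec.split()]
    if not delimiters:
        # Nothing to split on: the whole text is a single token.
        return [text]
    pattern = "|".join(re.escape(d) for d in delimiters)
    # Assumption: empty tokens between adjacent delimiters are dropped.
    return [t for t in re.split(pattern, text) if t]

print(split_on_characters("a,b;c d", ", ; ^s"))
```

Here the list ", ; ^s" splits on commas, semicolons, and spaces.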
Pattern
No help written for this yet
Unix Format
Splits the text by tokens found with a Unix format regular expression.
Regular Exp
Splits the text by tokens found with a regular expression. Do not use
\d, \D, \w, \W, etc. Instead, use [0-9], [^0-9], [a-zA-Z], or [^a-zA-Z].
Using \n, \r, or \t is fine.
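The following Python sketch shows patterns written with explicit character classes in place of the shorthand forms. It is an illustration only: Python's own re module does accept \d and \w, and the sample text and variable names are made up for this example. Whether the tool treats matches as tokens or as separators is not stated here, so both uses are shown.

```python
import re

text = "Order 66 shipped\tto Bay 12\nOrder 7 pending"

# [0-9]+ used in place of \d+ : each run of digits is a match.
digit_runs = re.findall(r"[0-9]+", text)

# [a-zA-Z]+ used in place of a \w-style pattern : each run of letters is a match.
letter_runs = re.findall(r"[a-zA-Z]+", text)

# \t and \n are allowed, here used as separators.
fields = re.split(r"[\t\n]", text)

print(digit_runs)
print(letter_runs)
print(fields)
```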
Separate on Tags
Splits the text at the specified elements.
Note: The text can only be split by one element at a time.
Keep Element Tags with Token
Deselect this box if you would like the
specified element to be removed from the results.
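The effect of the Keep Element Tags with Token option can be pictured with the Python sketch below. It assumes that each occurrence of the specified element becomes one token and that deselecting the option strips the opening and closing tags; that reading, the function name split_on_element, and the keep_tags parameter are all assumptions made for illustration. The regular expression does not handle nested elements of the same name.

```python
import re

def split_on_element(text, element, keep_tags=True):
    # Match <element ...>...</element>, non-greedy, across line breaks.
    name = re.escape(element)
    pattern = re.compile(rf"<{name}\b[^>]*>(.*?)</{name}>", re.DOTALL)
    if keep_tags:
        # Token includes the surrounding tags.
        return [m.group(0) for m in pattern.finditer(text)]
    # Token is only the content between the tags.
    return [m.group(1) for m in pattern.finditer(text)]

doc = "<doc><p>one</p><p class='x'>two</p></doc>"
print(split_on_element(doc, "p", keep_tags=True))
print(split_on_element(doc, "p", keep_tags=False))
```

Only one element name is passed per call, matching the note that the text can only be split by one element at a time.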