Having trouble?
Try the newer version at the University of Alberta:
http://taporware.ualberta.ca

Tokenize Text with Specified Separators
Summary

This tool splits a text document at specified points, called tokens. Tokens can be words, lines, sentences, or paragraphs, as well as specific characters or patterns. The results can be listed with the token removed, or with the text split before or after the token.
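
The three result modes can be sketched in Python as follows. This is a minimal illustration, not the tool's own code: the function name `split_with_mode` and the mode names are hypothetical, and it assumes that "before the split" means the token starts the next piece while "after the split" means the token ends the current piece.

```python
import re

def split_with_mode(text, pattern, mode="remove"):
    """Split text at every match of the regular expression `pattern`.

    mode="remove": the token is dropped from the results
    mode="before": the split falls before the token, so it starts the next piece
    mode="after":  the split falls after the token, so it ends the current piece
    (These mode names are assumptions made for this sketch.)
    """
    if mode not in ("remove", "before", "after"):
        raise ValueError("unknown mode: " + mode)
    pieces, start = [], 0
    for match in re.finditer(pattern, text):
        if match.start() == match.end():
            continue  # skip empty matches, they carry no token
        if mode == "after":
            pieces.append(text[start:match.end()])
            start = match.end()
        else:
            pieces.append(text[start:match.start()])
            start = match.end() if mode == "remove" else match.start()
    pieces.append(text[start:])
    return [piece for piece in pieces if piece]  # drop empty pieces

print(split_with_mode("a,b,c", ",", "remove"))
print(split_with_mode("a,b,c", ",", "before"))
print(split_with_mode("a,b,c", ",", "after"))
```

With a comma as the token, the three calls show the same text with the comma dropped, attached to the start of the following piece, and attached to the end of the preceding piece.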

Note: The input should be plain text. If you submit an XML or HTML document, the tool strips all the tags and then processes the result as plain text. For best results with XML or HTML documents, use the XML-specific or HTML-specific tools instead.

For more details, see here.

Walkthrough

Example: fetch text from http://www.gutenberg.org/dirs/etext91/peter16.txt; display list of sentences found in the text.
  1. Source text
    1. Enter `http://www.gutenberg.org/dirs/etext91/peter16.txt` in the Text source URL field;
  2. Token types
    1. Select the sentences option.
  3. Results
    No help written for this yet.
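
For readers who want to reproduce the walkthrough outside the tool, the steps above can be approximated in Python. This is a rough sketch under stated assumptions: `fetch_text` and `split_sentences` are hypothetical helpers, and the sentence rule used here (split after `.`, `!`, or `?` followed by whitespace) is a simple heuristic, not necessarily the rule the tool applies.

```python
import re
from urllib.request import urlopen

def fetch_text(url):
    """Download a plain text document (hypothetical helper; needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

def split_sentences(text):
    """Return the sentences in text, keeping the closing punctuation.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    Abbreviations such as 'Mr.' will therefore be split incorrectly.
    """
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    parts = re.split(r"(?<=[.!?])\s+", flat)
    return [part for part in parts if part]

sample = "Once upon a time. There were\nfour rabbits! Really?"
for number, sentence in enumerate(split_sentences(sample), start=1):
    print(number, sentence)
```

To run it against the walkthrough document, pass the result of `fetch_text("http://www.gutenberg.org/dirs/etext91/peter16.txt")` to `split_sentences`.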
» Source text *
  Example: http://taporware.mcmaster.ca/sampleDocs/plainText.txt


Summary

Determines the text source. Text can be obtained from a URL or by uploading a file.

Fields

Source URL
Text from the entered URL will be used as the data source for the analysis.

Local file
Use this field to upload a local file for analysis.

Treat XML/HTML as plain text
Enabling this option will strip tags from an HTML or XML document. <p> and <br /> in HTML documents and all tags in XML documents are converted to new lines (i.e. \n).
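
The tag handling described for this option can be sketched in Python. This is an approximation for illustration only: `strip_markup` and its `is_xml` parameter are hypothetical names, and it assumes that HTML tags other than `<p>` and `<br />` are removed without leaving a line break.

```python
import re

def strip_markup(text, is_xml=False):
    """Remove tags from an HTML or XML string (hypothetical helper).

    XML:  every tag becomes a new line.
    HTML: <p>, </p> and <br /> become new lines; other tags are dropped
          (the handling of other tags is an assumption of this sketch).
    """
    if is_xml:
        return re.sub(r"<[^>]+>", "\n", text)
    text = re.sub(r"</?p\b[^>]*>|<br\s*/?>", "\n", text, flags=re.IGNORECASE)
    return re.sub(r"<[^>]+>", "", text)

print(repr(strip_markup("<p>Hello<br />world</p><b>!</b>")))
print(repr(strip_markup("<a><b>x</b></a>", is_xml=True)))
```

A regular expression is enough for a sketch like this, but it does not handle comments, CDATA sections, or a `>` inside an attribute value; a real parser is the safer choice for arbitrary documents.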
» Token types *





Summary

Determines which kind of token is used to split the text.

Fields

Words
Splits the text using each word as a token.

Lines
Splits the text using the end of each line as a token.

Sentences
Splits the text using the end of each sentence as a token.

Paragraphs
Splits the text using the end of each paragraph as a token.

Characters
Splits the text at the specified characters. Separate the characters with spaces; to use a space itself as a separator, enter ^s.

Pattern
No help written for this yet.

Unix Format
Splits the text at tokens matched by a Unix-format regular expression.

Regular Exp
Splits the text by tokens found with a regular expression. Do not use \d, \D, \w, \W etc. Instead, please use [0-9], [^0-9], [a-zA-Z] or [^a-zA-Z]. Using \n, \r, or \t is fine.
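
Two of the fields above lend themselves to a short Python sketch: the Characters field with its `^s` convention, and the Regular Exp field with its explicit character classes. All helper names here are hypothetical and the code only mirrors the behavior described in this help text; it is not the tool's implementation.

```python
import re

def parse_character_field(field):
    """Turn a space-separated Characters entry such as ', ; ^s' into separators."""
    return [" " if item == "^s" else item for item in field.split()]

def split_at_characters(text, field):
    """Split text at any of the separators listed in the Characters entry."""
    separators = parse_character_field(field)
    if not separators:
        return [text]
    pattern = "|".join(re.escape(sep) for sep in separators)
    return [piece for piece in re.split(pattern, text) if piece]

# Explicit classes that the help text asks for in place of the shorthands.
EXPLICIT_CLASSES = {
    r"\d": "[0-9]",
    r"\D": "[^0-9]",
    r"\w": "[a-zA-Z]",
    r"\W": "[^a-zA-Z]",
}

def to_explicit_classes(pattern):
    """Naively rewrite shorthand classes into the explicit forms.

    Caveat: plain text replacement; it does not handle a shorthand that sits
    inside a bracket expression or follows an escaped backslash.
    """
    for shorthand, explicit in EXPLICIT_CLASSES.items():
        pattern = pattern.replace(shorthand, explicit)
    return pattern

print(split_at_characters("a,b;c d", ", ; ^s"))
print(to_explicit_classes(r"\d+"))
print(re.split(to_explicit_classes(r"\d+"), "abc123def45gh"))
```

Note that `[a-zA-Z]` is narrower than what `\w` matches in most regular expression engines, which also accept digits and the underscore, so a rewritten pattern may split in slightly different places.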
» Results
Summary

Allows the user to choose how the results will be formatted and whether they should be displayed in a new browser window.

Fields

Display options
Offers a variety of options on how results will be presented.

Display as
Determines the format in which results will be delivered.

Open results in new window
Checking this box displays the results in a new window. This option is selected by default. Some pop-up blockers prevent new windows from being created; if that happens, deselect this option.
* indicates a required field.

 

 

TAPoRware Project, McMaster University,