Having trouble?
Try the newer version at the University of Alberta:
http://taporware.ualberta.ca

Tokenize HTML Document
Summary

This tool splits an HTML document at specified points, or tokens. These tokens can be words, lines, sentences, and paragraphs, as well as certain characters, patterns, or tags. The results can be listed with the token removed, before the split, or after the split.
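The three result styles can be illustrated with a short Python sketch. The function name `split_with_mode` and the mode names `remove`, `previous`, and `next` are hypothetical labels chosen for this example; they are not taken from the tool itself.

```python
import re

def split_with_mode(text, pattern, mode="remove"):
    """Split text at every match of pattern.

    mode (hypothetical names):
      "remove"   -> the separator is dropped
      "previous" -> the separator stays at the end of the preceding piece
      "next"     -> the separator is moved to the start of the following piece
    """
    pieces = []
    start = 0
    carry = ""  # separator waiting to be prepended when mode == "next"
    for m in re.finditer(pattern, text):
        body = carry + text[start:m.start()]
        carry = ""
        if mode == "previous":
            pieces.append(body + m.group())
        elif mode == "next":
            pieces.append(body)
            carry = m.group()
        else:
            pieces.append(body)
        start = m.end()
    tail = carry + text[start:]
    if tail:
        pieces.append(tail)
    return pieces
```

For the input `"a,b,c"` split on `","`, the three modes give `["a", "b", "c"]`, `["a,", "b,", "c"]`, and `["a", ",b", ",c"]` respectively.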


Walkthrough

Example: fetch HTML from http://www.w3.org/; extract text between <body> and </body> tags; split resulting text into sentences; preserve separator in results.
  1. Source text
    1. Enter `http://www.w3.org/' in the URL field.
  2. Subtext limited to
    1. Enter `body' in the Elements field.
  3. Token types
    1. Select Sentences
  4. Results
    1. Select Keep with previous token from the Display options drop-down menu.
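The walkthrough above can be approximated offline with the Python standard library. This is only a sketch of the same steps, not the tool's implementation: the class `ElementText`, the function `sentences_from_html`, and the sentence rule (a run of `.`, `!`, or `?` ends a sentence) are assumptions made for illustration. Fetching the page is left out so the example runs without network access.

```python
import re
from html.parser import HTMLParser

class ElementText(HTMLParser):
    """Collect text that appears inside any of the named elements."""

    def __init__(self, elements):
        super().__init__()
        # Comma-separated element names, as in the Elements field.
        self.elements = {e.strip().lower() for e in elements.split(",") if e.strip()}
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.elements:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.elements and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # An empty element list means "keep all text".
        if self.depth > 0 or not self.elements:
            self.chunks.append(data)

def sentences_from_html(html, elements="body"):
    parser = ElementText(elements)
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.chunks).split())  # normalise whitespace
    # Assumed sentence rule: the terminator stays with the sentence it ends
    # ("keep with previous token").
    found = re.findall(r"[^.!?]+[.!?]+|[^.!?]+$", text)
    return [s.strip() for s in found if s.strip()]
```

Given a page whose body contains `First one. Second one!` and `Third?`, the function returns the three sentences with their punctuation attached, and text in the `<head>` is ignored.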
*
» Source text
  Example: http://www.w3.org/
Summary

Determines the HTML source. HTML can be obtained from a URL or by uploading a file.

Fields

Source URL
HTML from the entered URL will be used as the data source for the analysis.

Local file
Use this field to upload a local HTML file for analysis.
*
» Subtext limited to
(separate multiple elements with a `,')
Summary

Limits the included text to text that appears within the specified tag(s). Separate multiple tags with commas. Leaving this field empty includes all of the document's text.

Fields

Elements
The text extraction will be restricted to the tag(s) entered here. Multiple tags should be separated by commas.
*
» Token types
(separate characters with spaces; to separate by whitespace, use ^s)
Summary

Defines how a document will be tokenized. Tokens are fragments of a text that can be defined as words, lines, sentences, paragraphs, characters or specific patterns.

Fields

Words
Splits the text using each word as a token.

Lines
Splits the text using the end of each line as a token.

Sentences
Splits the text using the end of each sentence as a token.

Paragraphs
Splits the text using the end of each paragraph as a token.

Characters
Splits the text at the specified characters (separated by spaces). To include a space use ^s.
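As a rough illustration of the Characters field, the sketch below turns a space-separated character list into a splitter. The function names are hypothetical, and treating `^s` as "any whitespace" is an assumption based on the hint next to the field.

```python
import re

def char_splitter(spec):
    """Build a regex from a space-separated character list; ^s means whitespace."""
    parts = []
    for item in spec.split():
        if item == "^s":
            parts.append(r"\s")  # assumed meaning of ^s
        else:
            parts.append(re.escape(item))  # treat everything else literally
    if not parts:
        raise ValueError("no split characters given")
    return re.compile("|".join(parts))

def split_on_chars(text, spec):
    """Split text at the given characters and drop empty pieces."""
    return [piece for piece in char_splitter(spec).split(text) if piece]
```

For example, `split_on_chars("a,b;c d", ", ; ^s")` splits at the comma, the semicolon, and the space.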

Pattern
No help written for this yet

Unix Format
Splits the text by tokens found with a Unix-format regular expression.

Regular Exp
Splits the text by tokens found with a regular expression. Do not use \d, \D, \w, \W etc. Instead, please use [0-9], [^0-9], [a-zA-Z] or [^a-zA-Z]. Using \n, \r, or \t is fine.
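One way to follow this advice is to rewrite a pattern before pasting it into the field. The helper below is a hypothetical sketch: it applies the substitutions listed above and does not handle shorthand classes that already sit inside square brackets.

```python
import re

# Substitutions taken from the help text. Note that [a-zA-Z] is narrower than
# \w in most regex engines, where \w also matches digits and the underscore.
EXPLICIT = {
    r"\d": "[0-9]",
    r"\D": "[^0-9]",
    r"\w": "[a-zA-Z]",
    r"\W": "[^a-zA-Z]",
}

def expand_shorthand(pattern):
    """Replace the four shorthand classes with explicit character classes.

    Limitation: shorthands inside [...] and escaped backslashes are not handled.
    """
    return re.sub(r"\\[dDwW]", lambda m: EXPLICIT[m.group()], pattern)
```

For instance, `expand_shorthand(r"\d+")` yields `[0-9]+`, while `\n`, `\r`, and `\t` pass through unchanged.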

Separate on Tags
Splits the text at the specified elements.
Note: The text can only be split by one element at a time.

Keep Element Tags with Token
Deselect this box if you would like the specified element to be removed from the results.
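The two tag options can be pictured with the following sketch. It assumes the split happens at each opening tag of a single element, which the help text does not state explicitly; the function name and the `keep_tags` parameter are hypothetical. It relies on `re.split` accepting zero-width matches, which requires Python 3.7 or later.

```python
import re

def split_on_element(html, element, keep_tags=True):
    """Split html in front of each opening tag of one element.

    keep_tags=False strips that element's opening and closing tags from the
    pieces, mirroring a deselected "Keep Element Tags with Token" box.
    """
    name = re.escape(element.strip())
    opening = r"<%s\b[^>]*>" % name
    closing = r"</%s\s*>" % name
    # Zero-width lookahead: cut just before each opening tag.
    pieces = re.split(r"(?=%s)" % opening, html, flags=re.IGNORECASE)
    if not keep_tags:
        both = r"%s|%s" % (opening, closing)
        pieces = [re.sub(both, "", p, flags=re.IGNORECASE) for p in pieces]
    return [p for p in pieces if p.strip()]
```

With `"<p>One</p><p class='x'>Two</p>"` and element `p`, the pieces are the two paragraphs, with or without their `<p>` tags depending on `keep_tags`.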
» Results
Summary

Allows the user to choose how the results will be formatted and whether they should be displayed in a new browser window.

Fields

Display options
Sets how the results are presented, for example Keep with previous token, which leaves each separator attached to the text before it.

Display as
Determines the format in which results will be delivered.

Open results in new window
Checking this box displays the results in a new window. This option is selected by default. If a pop-up blocker prevents the window from opening, deselect this option.
`*' indicates a required field

TAPoRware Project, McMaster University,