presage
0.9.2~beta
#include <tokenizer.h>
Classes | |
class | StreamGuard |
Public Member Functions | |
Tokenizer (std::istream &stream, const std::string blankspaces, const std::string separators) | |
virtual | ~Tokenizer () |
virtual int | countTokens ()=0 |
virtual bool | hasMoreTokens () const =0 |
virtual std::string | nextToken ()=0 |
virtual double | progress () const =0 |
void | blankspaceChars (const std::string) |
std::string | blankspaceChars () const |
void | separatorChars (const std::string) |
std::string | separatorChars () const |
void | lowercaseMode (const bool) |
bool | lowercaseMode () const |
std::string | streamToString () const |
Protected Member Functions | |
bool | isBlankspace (const int character) const |
bool | isSeparator (const int character) const |
Protected Attributes | |
std::istream & | stream |
std::ios::iostate | sstate |
std::streamoff | offbeg |
std::streamoff | offend |
std::streamoff | offset |
Private Attributes | |
std::string | blankspaces |
std::string | separators |
bool | lowercase |
The Tokenizer class takes an input stream and parses it into "tokens", allowing the tokens to be read one at a time.
The parsing process is controlled by two character classification sets, both supplied to the constructor: the blankspace characters and the separator characters.
Each byte read from the input stream is regarded as a character in the range '\u0000' through '\u00FF'.
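In addition, an instance has a lowercase mode flag, which is set and read through lowercaseMode() and consulted by nextToken() in the implementing classes.
A typical application first constructs an instance of this class (in practice one of the implementing classes, ForwardTokenizer or ReverseTokenizer, since Tokenizer declares pure virtual members), supplying the input stream to be tokenized, the set of blankspaces, and the set of separators. It then loops, calling nextToken() for as long as hasMoreTokens() returns true.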
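The loop described above can be sketched as follows. This is a minimal illustration only: `SketchTokenizer` and `collectTokens` are hypothetical names, not part of presage, and the stand-in class mirrors only the hasMoreTokens()/nextToken() shape. It handles blankspaces but ignores separators and lowercase mode.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a concrete tokenizer. It is NOT presage's
// ForwardTokenizer; it only mimics the hasMoreTokens()/nextToken() interface.
class SketchTokenizer {
public:
    SketchTokenizer(std::istream& stream, const std::string& blankspaces)
        : stream_(stream), blankspaces_(blankspaces) {
        skipBlankspaces();  // position the stream on the first token, if any
    }

    // True while at least one more character (hence one more token) remains.
    bool hasMoreTokens() const {
        return stream_.peek() != std::char_traits<char>::eof();
    }

    // Reads characters up to the next blankspace or end of stream.
    std::string nextToken() {
        std::string token;
        int c;
        while ((c = stream_.peek()) != std::char_traits<char>::eof() &&
               !isBlankspace(c)) {
            token.push_back(static_cast<char>(stream_.get()));
        }
        skipBlankspaces();  // leave the stream on the following token
        return token;
    }

private:
    bool isBlankspace(int c) const {
        return blankspaces_.find(static_cast<char>(c)) != std::string::npos;
    }

    void skipBlankspaces() {
        int c;
        while ((c = stream_.peek()) != std::char_traits<char>::eof() &&
               isBlankspace(c)) {
            stream_.get();
        }
    }

    std::istream& stream_;
    std::string blankspaces_;
};

// The "typical application" loop: construct, then drain the tokenizer.
std::vector<std::string> collectTokens(std::istream& in,
                                       const std::string& blankspaces) {
    SketchTokenizer tokenizer(in, blankspaces);
    std::vector<std::string> tokens;
    while (tokenizer.hasMoreTokens()) {
        tokens.push_back(tokenizer.nextToken());
    }
    return tokens;
}
```

With presage itself, the same loop would be written against one of the implementing classes rather than this stand-in.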
Definition at line 64 of file tokenizer.h.
Tokenizer::Tokenizer | ( | std::istream & stream, const std::string blankspaces, const std::string separators | ) |
Definition at line 27 of file tokenizer.cpp.
References blankspaceChars(), blankspaces, offbeg, offend, offset, separatorChars(), separators, sstate, and stream.
Tokenizer::~Tokenizer | ( | ) |
virtual |
Definition at line 53 of file tokenizer.cpp.
void Tokenizer::blankspaceChars | ( | const std::string | chars | ) |
Sets blankspace characters.
std::string Tokenizer::blankspaceChars | ( | ) | const |
Gets blankspace characters.
Definition at line 66 of file tokenizer.cpp.
References blankspaces.
Referenced by Tokenizer().
virtual int Tokenizer::countTokens | ( | ) |
pure virtual |
Returns the number of tokens left.
Implemented in ForwardTokenizer, and ReverseTokenizer.
virtual bool Tokenizer::hasMoreTokens | ( | ) | const
pure virtual |
Tests if there are more tokens.
Implemented in ForwardTokenizer, and ReverseTokenizer.
bool Tokenizer::isBlankspace | ( | const int | character | ) | const
protected |
Definition at line 91 of file tokenizer.cpp.
References blankspaces.
Referenced by ForwardTokenizer::nextToken(), and ReverseTokenizer::nextToken().
bool Tokenizer::isSeparator | ( | const int | character | ) | const
protected |
Definition at line 101 of file tokenizer.cpp.
References separators.
Referenced by ForwardTokenizer::nextToken(), and ReverseTokenizer::nextToken().
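As an illustration of how the two classification sets could drive isBlankspace() and isSeparator(), here is a small sketch. It assumes a plain membership test against each set; the names `inCharSet`, `CharClass`, and `classify` are hypothetical, and the precedence of blankspaces over separators is an assumption rather than documented behaviour.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if `character` is a byte value contained in `set`.
// Values outside 0..255 (for example an end-of-file marker) never match.
bool inCharSet(const std::string& set, int character) {
    if (character < 0 || character > 255) {
        return false;
    }
    return set.find(static_cast<char>(character)) != std::string::npos;
}

enum class CharClass { Blankspace, Separator, Regular };

// Classifies one character against the two sets. Checking blankspaces first
// is an assumption made for this sketch.
CharClass classify(int character,
                   const std::string& blankspaces,
                   const std::string& separators) {
    if (inCharSet(blankspaces, character)) {
        return CharClass::Blankspace;
    }
    if (inCharSet(separators, character)) {
        return CharClass::Separator;
    }
    return CharClass::Regular;
}
```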
void Tokenizer::lowercaseMode | ( | const bool | value | ) |
Sets lowercase mode.
Definition at line 81 of file tokenizer.cpp.
References lowercase.
Referenced by ContextChangeDetector::change(), ContextTracker::learn(), and main().
bool Tokenizer::lowercaseMode | ( | ) | const |
Gets lowercase mode.
Definition at line 86 of file tokenizer.cpp.
References lowercase.
Referenced by ForwardTokenizer::nextToken(), and ReverseTokenizer::nextToken().
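Since lowercaseMode() is consulted by nextToken() in both implementing classes, the flag presumably decides whether a token is lowercased before it is returned. A minimal sketch of that idea, assuming byte-wise conversion with `std::tolower`; `applyLowercaseMode` is a hypothetical helper, not a presage function.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: returns the token unchanged when `lowercase` is false,
// otherwise converts each byte with std::tolower (current C locale).
std::string applyLowercaseMode(std::string token, bool lowercase) {
    if (lowercase) {
        std::transform(token.begin(), token.end(), token.begin(),
                       [](unsigned char c) {
                           return static_cast<char>(std::tolower(c));
                       });
    }
    return token;
}
```

The lambda takes `unsigned char` because passing a negative `char` value to `std::tolower` is undefined behaviour.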
virtual std::string Tokenizer::nextToken | ( | ) |
pure virtual |
Returns the next token.
Implemented in ForwardTokenizer, and ReverseTokenizer.
virtual double Tokenizer::progress | ( | ) | const
pure virtual |
Returns progress percentage.
Implemented in ForwardTokenizer, and ReverseTokenizer.
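The page does not say how progress is computed. One plausible reading, given that the implementations reference offbeg, offend, and offset, is the share of the stream already consumed. The sketch below is an assumption for a forward scan only: `progressPercentage` is a hypothetical function, and neither the 0 to 100 scale nor the handling of an empty stream is taken from presage.

```cpp
#include <cassert>
#include <ios>

// Hypothetical formula: percentage of the [offbeg, offend] span that lies
// before the current offset when scanning forwards.
double progressPercentage(std::streamoff offbeg,
                          std::streamoff offend,
                          std::streamoff offset) {
    const std::streamoff span = offend - offbeg;
    if (span <= 0) {
        return 100.0;  // assumption: an empty stream counts as fully processed
    }
    return 100.0 * static_cast<double>(offset - offbeg) /
           static_cast<double>(span);
}
```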
void Tokenizer::separatorChars | ( | const std::string | chars | ) |
Sets separator characters.
std::string Tokenizer::separatorChars | ( | ) | const |
Gets separator characters.
Definition at line 76 of file tokenizer.cpp.
References separators.
Referenced by Tokenizer().
std::string Tokenizer::streamToString | ( | ) | const
inline |
Definition at line 109 of file tokenizer.h.
std::string Tokenizer::blankspaces
private |
Definition at line 154 of file tokenizer.h.
Referenced by blankspaceChars(), isBlankspace(), and Tokenizer().
bool Tokenizer::lowercase
private |
Definition at line 157 of file tokenizer.h.
Referenced by lowercaseMode().
std::streamoff Tokenizer::offbeg
protected |
Definition at line 146 of file tokenizer.h.
Referenced by ForwardTokenizer::countTokens(), ForwardTokenizer::ForwardTokenizer(), ReverseTokenizer::hasMoreTokens(), ReverseTokenizer::nextToken(), ReverseTokenizer::progress(), streamToString(), and Tokenizer().
std::streamoff Tokenizer::offend
protected |
Definition at line 147 of file tokenizer.h.
Referenced by ReverseTokenizer::countTokens(), ForwardTokenizer::hasMoreTokens(), ForwardTokenizer::nextToken(), ReverseTokenizer::nextToken(), ForwardTokenizer::progress(), ReverseTokenizer::progress(), ReverseTokenizer::ReverseTokenizer(), streamToString(), and Tokenizer().
std::streamoff Tokenizer::offset
protected |
Definition at line 148 of file tokenizer.h.
Referenced by ForwardTokenizer::countTokens(), ReverseTokenizer::countTokens(), ForwardTokenizer::ForwardTokenizer(), ReverseTokenizer::hasMoreTokens(), ForwardTokenizer::hasMoreTokens(), ReverseTokenizer::nextToken(), ForwardTokenizer::nextToken(), ForwardTokenizer::progress(), ReverseTokenizer::progress(), ReverseTokenizer::ReverseTokenizer(), and Tokenizer().
std::string Tokenizer::separators
private |
Definition at line 155 of file tokenizer.h.
Referenced by isSeparator(), separatorChars(), and Tokenizer().
std::ios::iostate Tokenizer::sstate
protected |
Definition at line 145 of file tokenizer.h.
Referenced by Tokenizer(), and ~Tokenizer().
std::istream & Tokenizer::stream
protected |
Definition at line 144 of file tokenizer.h.
Referenced by ForwardTokenizer::countTokens(), ReverseTokenizer::countTokens(), ForwardTokenizer::nextToken(), ReverseTokenizer::nextToken(), ReverseTokenizer::ReverseTokenizer(), streamToString(), Tokenizer(), and ~Tokenizer().
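That sstate and stream are touched by both Tokenizer() and ~Tokenizer() suggests the stream's state is saved on construction and restored on destruction. The RAII sketch below shows that general pattern under this assumption. `StreamStateGuardSketch` is a hypothetical class and is not the nested StreamGuard class listed on this page, whose behaviour is not documented here.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical RAII guard: remembers a stream's state flags and read position
// on construction and puts both back on destruction.
class StreamStateGuardSketch {
public:
    explicit StreamStateGuardSketch(std::istream& s)
        : stream_(s), state_(s.rdstate()), pos_(s.tellg()) {}

    ~StreamStateGuardSketch() {
        stream_.clear();  // reset flags so that seeking is possible
        if (pos_ != std::streampos(-1)) {
            stream_.seekg(pos_);  // return to the remembered read position
        }
        stream_.clear(state_);  // finally restore the original state flags
    }

    StreamStateGuardSketch(const StreamStateGuardSketch&) = delete;
    StreamStateGuardSketch& operator=(const StreamStateGuardSketch&) = delete;

private:
    std::istream& stream_;
    std::ios::iostate state_;  // captured before tellg(), which may alter flags
    std::streampos pos_;
};
```

A caller would create the guard, read from the stream freely (even to end of file), and rely on the destructor to leave the stream as it was found.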