cmdTokenizer object | en_us.t[4589] |
cmdTokenizer : Tokenizer
Superclass Tree
cmdTokenizer
    Tokenizer
        object
Property Summary
patAlphaDashAlpha
patPunct
patSpelledTens
patSpelledUnits
rules_
Method Summary
acceptAbbrTok
buildOrigText
tokCvtAbbr
tokCvtApostropheS
tokCvtSpelledNumber
Inherited from Tokenizer:
deleteRule
deleteRuleAt
insertRule
insertRuleAt
tokCvtLower
tokCvtSkip
tokenize
Property Details
patAlphaDashAlpha | en_us.t[4755] |
patPunct | en_us.t[4870] |
patSpelledTens | en_us.t[4866] |
patSpelledUnits | en_us.t[4868] |
rules_ OVERRIDDEN | en_us.t[4590] |
Method Details
acceptAbbrTok (txt) | en_us.t[4767] |
buildOrigText (toks) | en_us.t[4806] |
tokCvtAbbr (txt, typ, toks) | en_us.t[4787] |
When we find an abbreviation, we enter it as two tokens: the abbreviated word minus its trailing period, followed by the period itself. We mark the period as an "abbreviation period" so that grammar rules can treat it as part of an abbreviation. Because it is also a regular period, grammar rules that treat periods as ordinary punctuation can still match the result. This ensures that we try it both ways, as an abbreviation and as a word followed by punctuation, and pick the interpretation that gives the best result.
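The splitting step described above can be sketched in a few lines. This is an illustration in Python, not the library's TADS 3 source. The token layout `(value, type, original text)`, the type names `TOK_WORD` and `TOK_ABBR_PERIOD`, the abbreviation set, and the helper names `accept_abbr` and `cvt_abbr` are all assumptions made for the example.

```python
# Hypothetical token type names, chosen for this sketch only.
TOK_WORD = "word"
TOK_ABBR_PERIOD = "abbrPeriod"

# Assumed abbreviation list; the real tokenizer makes this decision
# in acceptAbbrTok, whose logic is not shown in this reference page.
KNOWN_ABBREVIATIONS = {"mr", "mrs", "dr", "st"}


def accept_abbr(txt):
    """Return True if txt (with its trailing period) is a known abbreviation."""
    return txt.endswith(".") and txt[:-1].lower() in KNOWN_ABBREVIATIONS


def cvt_abbr(txt, typ, toks):
    """Append the word without its period, then the period as its own token."""
    word = txt[:-1]
    # The abbreviated word keeps the token type the matching rule assigned.
    toks.append((word, typ, word))
    # The period gets a distinct type so grammar rules can read it either
    # as part of the abbreviation or as ordinary punctuation.
    toks.append((".", TOK_ABBR_PERIOD, "."))


toks = []
if accept_abbr("Dr."):
    cvt_abbr("Dr.", TOK_WORD, toks)
```

Emitting the period as a separate, specially typed token is what lets the parser try both readings: a grammar rule may consume the word and period together as an abbreviation, or treat the period as a sentence break.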
tokCvtApostropheS (txt, typ, toks) | en_us.t[4717] |
tokCvtSpelledNumber (txt, typ, toks) | en_us.t[4741] |