Abstract: Tokenization is an important early step in natural language processing (NLP) tasks. The idea is to split the input sentence into smaller units, called tokens, for further processing. Words ...
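As a concrete illustration of the splitting step the abstract describes, here is a minimal Python sketch. The abstract does not prescribe a particular method, so this uses a simple regular-expression tokenizer as an assumed baseline; the helper name tokenize is hypothetical, not from the paper.

    import re

    def tokenize(text: str) -> list[str]:
        """Split text into word and punctuation tokens (illustrative baseline only)."""
        # \w+ matches runs of word characters; [^\w\s] matches a single
        # non-space, non-word character (e.g. punctuation).
        return re.findall(r"\w+|[^\w\s]", text)

    if __name__ == "__main__":
        print(tokenize("Tokenization splits the input sentence into tokens."))
        # ['Tokenization', 'splits', 'the', 'input', 'sentence', 'into', 'tokens', '.']

Real tokenizers are considerably more involved (handling clitics, hyphenation, multiword units, or subword vocabularies); this sketch only shows the basic idea of mapping a sentence to a token sequence.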