Tokenization is the process of dividing text into smaller parts called tokens. It plays a vital role in applications such as text classification, language translation, chatbots, and sentiment analysis.
The nltk.tokenize module provides, among others, the following functions:
- word_tokenize: divides text into words.
- sent_tokenize: divides text into sentences.
from nltk.tokenize import word_tokenize, sent_tokenize

# Sample text to tokenize (both functions rely on NLTK's punkt models).
text_sample = "Hi this is nltk! It used for tokenization."

words = word_tokenize(text_sample)      # split into word and punctuation tokens
sentences = sent_tokenize(text_sample)  # split into sentences

print(words)
print(sentences)
Output:
['Hi', 'this', 'is', 'nltk', '!', 'It', 'used', 'for', 'tokenization', '.']
['Hi this is nltk!', 'It used for tokenization.']
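Both functions depend on NLTK's pretrained punkt sentence tokenizer, which must be downloaded once per environment. The sketch below shows the one-time download and how the two functions compose, splitting the sample text into sentences and then each sentence into words (the variable name words_per_sentence is illustrative):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time download of the punkt tokenizer models (safe to re-run;
# some recent NLTK versions name this resource 'punkt_tab').
nltk.download('punkt')

text_sample = "Hi this is nltk! It used for tokenization."

# Sentences first, then words within each sentence.
words_per_sentence = [word_tokenize(sentence) for sentence in sent_tokenize(text_sample)]
print(words_per_sentence)
# [['Hi', 'this', 'is', 'nltk', '!'], ['It', 'used', 'for', 'tokenization', '.']]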