The Python programming language is a valuable tool for scientific research. The Python Natural Language Toolkit (NLTK) is one of the best packages for exploratory natural language parsing and understanding (Bird 2005). Python supports modules, packages, and code reuse, and it has extensive documentation that teaches the concepts behind the language processing tasks it supports. This project chose Python because its concepts are simple and its learning curve is shallow. Python also offers good string-handling functionality and permits data and methods to be used and reused easily (Bird 2005).


The Natural Language Toolkit consists of programs and libraries for classification, tokenization, stemming, tagging, and similar algorithms, all written in Python. Bird (2005) inspired the criteria for using NLTK, stating that “NLTK is a suite of program modules, data sets and tutorials supporting research and teaching in computational linguistics and natural language processing. NLTK is written in Python and distributed under the GPL open source license. Over the past year the toolkit has been rewritten, simplifying many linguistic data structures and taking advantage of recent enhancements in the Python language.”
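To illustrate the tokenization and stemming facilities mentioned above, the following is a minimal sketch using two standard NLTK classes, `TreebankWordTokenizer` and `PorterStemmer`; the sample sentence is illustrative only, and the `nltk` package is assumed to be installed.

```python
# A minimal sketch of NLTK tokenization and stemming,
# assuming the nltk package is installed.
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem import PorterStemmer

tokenizer = TreebankWordTokenizer()
stemmer = PorterStemmer()

text = "Tokenization and stemming are basic language processing tasks."
tokens = tokenizer.tokenize(text)          # split the sentence into word tokens
stems = [stemmer.stem(t) for t in tokens]  # reduce each token to its stem
print(stems)
```

The same pattern extends to the tagging and classification modules NLTK provides on top of these basic steps.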

Sci-kit Learn

Sci-kit Learn is a machine learning library in Python. It is simple and efficient enough to be used by non-experts. Its goal is to provide useful machine learning tools that are accessible across different scientific areas. It is a library that brings machine learning algorithms to a general-purpose high-level language, and it includes classical learning algorithms, model evaluation and selection tools, and preprocessing procedures (Sci-kit Learn 2015a). Sci-kit Learn also includes a collection of functions that a user can import into Python (Buitinck et al. 2013).
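The import-and-use workflow described above can be sketched as follows; this is a toy example with made-up documents and labels, assuming the `scikit-learn` package is installed, and is not the data or pipeline of the current research.

```python
# A minimal sketch of the scikit-learn workflow: vectorize text,
# fit a classical learning algorithm, and predict. Toy data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["good movie", "bad movie", "good book", "bad book"]
labels = ["pos", "neg", "pos", "neg"]

vectorizer = CountVectorizer()             # turn raw text into count features
X = vectorizer.fit_transform(docs)
clf = MultinomialNB().fit(X, labels)       # train a classical classifier
print(clf.predict(vectorizer.transform(["good story"])))
```

The uniform `fit`/`predict` interface is what makes the library accessible to non-experts: swapping in a different algorithm changes only the classifier line.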

Previous Research

Paul (2012) worked on a similar project on the same data using the RapidMiner software. In that research, he concluded that the software did not achieve good accuracy or speed when running each algorithm.
The text classification using data mining project (Kamruzzaman et al. 2005) used Naïve Bayes, decision tree, and genetic algorithms for classifying text data, and demonstrated 90% accuracy.
Kotsiantis (2007) used various supervised machine learning classification techniques, such as the Naïve Bayes classifier, SVM, and ensembles of classifiers. He concluded that an ensemble of classifiers achieved the best possible classification accuracy.

In the Improved k-Nearest Neighbor Classification Using Genetic Algorithm project, Suguna and Thanushkodi (2010) declared that although KNN has some limitations, it is one of the most popular neighborhood classifiers in pattern recognition. In their research they combined KNN with a genetic algorithm (GKNN) to increase the accuracy of KNN. Breiman (2001) showed random forests to be an effective tool for text mining with a low error rate.
Kamruzzaman et al. (2005) discuss an efficient technique for text classification; the article highlights the benefit of larger data sets with more classes for better accuracy. Suguna and Thanushkodi (2010) emphasize the KNN classifier as the most popular neighborhood classifier in pattern recognition. Bijalwan et al. (2014) used Naïve Bayes and KNN classifiers for classifying text data, and demonstrated 99% accuracy with KNN. Breiman (2001) concluded that “Random forests are an effective tool in prediction. Because of the Law of Large Numbers they do not over fit”. Paul (2012) found that RapidMiner was not a suitable software for text mining and running classification algorithms. Based on these findings, the current research uses the Python programming language, and the paper stresses the following classifiers:
Naïve Bayes: the Naïve Bayes classifier is a simple yet accurate classifier.
Random Forest: this classifier is a combination of trees, all of which have the same distribution.
K-Nearest Neighbor: a new pattern is classified into the class with the most members present among the K nearest neighbors (Suguna and Thanushkodi 2010).
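The three classifiers listed above can be compared side by side with a short sketch; this uses scikit-learn's bundled iris data purely for illustration, so the dataset and any accuracy values it prints are not results of the current research.

```python
# A hedged sketch comparing the three classifiers named above
# on scikit-learn's bundled iris data (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "K-Nearest Neighbor": KNeighborsClassifier(n_neighbors=5),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)               # train on the training split
    print(name, clf.score(X_test, y_test))  # accuracy on the held-out split
```

For KNN, `n_neighbors=5` is an arbitrary illustrative choice of K; in practice K is tuned, which is the limitation the GKNN work of Suguna and Thanushkodi (2010) addresses.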