Rapid developments in computer and networking technologies have increased the popularity of the Web, which in turn has led to more and more information being published on it. However, the explosive growth of information on the Web has brought some search problems: (i) general-purpose search engines often return too many irrelevant results when users search for specific information on a given topic, and (ii) the number of pages to be indexed by Web search systems increases day by day, which makes it difficult to keep both automated and human-maintained indices up to date. To overcome these search problems, vertical search engines were proposed; these traverse a subset of the Web to gather only documents on a specific topic and to identify the promising links that lead to on-topic documents (Altıngövde, Özel, Ulusoy, Özsoyoğlu, & Özsoyoğlu, 2001; Chakrabarti, Van den Berg, & Dom, 1999; De Bra & Post, 1994; Menczer, Pant, & Srinivasan, 2004; Pinkerton, 1994). During the focused crawling process of a vertical search engine, an automatic classification mechanism is required to determine whether the Web page being considered is “on the specific topic” or not (Qi & Davison, 2009).
Automatic Web page classification is a supervised learning problem in which a set of labeled Web documents is used to train a classifier, and the classifier is then employed to assign one or more predefined category labels to future Web pages (Qi & Davison, 2009). In the Web page classification process, every term and every HTML tag on each Web page can be considered an attribute, which makes the number of features large.
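To illustrate why the feature space grows so quickly, the following minimal sketch counts every term and every HTML tag of a page as a separate attribute. It is not the feature extraction used in this study: the class name `TermTagCounter`, the `term:`/`tag:` naming scheme, the simple tokenization, and the sample page are all assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TermTagCounter(HTMLParser):
    """Counts each term and each opening HTML tag as a separate attribute.

    Hypothetical helper for illustration; the naming scheme is an assumption.
    """

    def __init__(self):
        super().__init__()
        self.features = Counter()  # attribute name -> occurrence count

    def handle_starttag(self, tag, attrs):
        # Every opening tag becomes an attribute such as "tag:h1".
        self.features[f"tag:{tag}"] += 1

    def handle_data(self, data):
        # Every whitespace-separated token becomes an attribute such as
        # "term:course", after lowercasing and trimming punctuation.
        for raw in data.lower().split():
            term = raw.strip(".,;:!?()\"'")
            if term:
                self.features[f"term:{term}"] += 1


def extract_features(html):
    """Return the term and tag counts for one Web page."""
    parser = TermTagCounter()
    parser.feed(html)
    parser.close()
    return parser.features


# Made-up sample page, not taken from any of the collections.
page = (
    "<html><head><title>CS 101 Course</title></head>"
    "<body><h1>Course Syllabus</h1>"
    "<p>Course homework and exams.</p></body></html>"
)
features = extract_features(page)
print(len(features), "distinct attributes")
```

Even this three-sentence page yields more than a dozen distinct attributes, so a collection of real Web pages can easily produce many thousands of features.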
Tests were conducted on three different collections. The small collection consists of Computer Science-related conference home pages obtained from the Open Directory Project Web site (http://www.dmoz.org), and the two larger collections consist of course home pages and student home pages from the WebKB dataset (Craven et al., 1998).
In this study, our aim is to determine the “role” of a Web page (i.e., functional classification), for example to decide whether the Web page is a “student home page”, a “course page”, or a “department home page”. To do so, we assign a single class label (e.g., “course page”) to each Web page, and we perform binary classification, in which each instance is assigned to exactly one of two classes (e.g., “course page” or “not course page”). This kind of classification problem arises especially in the focused crawling systems of vertical search engines.
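The binary setting described above can be sketched as follows: role labels are mapped to a positive class for the target role and a negative class for everything else. The function name `to_binary_labels` and the sample labels are hypothetical and serve only as an illustration.

```python
def to_binary_labels(roles, target):
    """Map each role label to 1 if it equals `target`, otherwise 0."""
    return [1 if role == target else 0 for role in roles]


# Made-up role labels for four pages (illustrative only).
roles = [
    "course page",
    "student home page",
    "department home page",
    "course page",
]

# 1 means "course page", 0 means "not course page".
binary = to_binary_labels(roles, "course page")
print(binary)
```

A classifier trained on such labels answers a single yes/no question per page, which matches the decision a focused crawler has to make for each page it visits.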