I reviewed this paper for my Research Methodology assignment and also to help me in doing my thesis. Like usual, I apologize for any sentence indicating as plagiarism.
Author: Thorsten Joachims
Year : 1998
Appear in: Proceeding of the European Conference on Machine Learning, Berlin, Germany
Page : 137-142
Text categorization aims in classifying documents into fixed number of categories which are previously defined. Developing manual text classifiers is difficult and time-consuming, however, by using machine learning it is beneficial to determine classifiers from examples which automatically applies the category. From reasons previously stated, this paper elaborates the advantageous of Support Vector Machines (SVMs) for text categorization.
Transforming documents into an appropriate representation for the learning algorithm and the classification task is the first step in text categorization. Each distinct word wi in documents which occurs for certain number of times is corresponded to a feature. The word considered as features if it appears in the training set at least 3 times and it is not stop-word (like “and”, “or”, etc). This model of representation leads to thousands of dimension features spaces which needs feature subset selection to improve generalization accuracy and to avoid overfitting. To select a subset of features this paper applies the information gain criterion as recommended by Y. Yang and J. Pedersen.
According to V. Vapnik, Support Vector Machines are based on the Structural Risk Minimization principle from computational learning theory. Structural Risk Minimization means obtaining hypothesis h such that the lowest true error can be guaranteed, where true error of h is the probability that h will make an error on randomly selected examples. SVMs have the ability to independently learn the feature space dimensions, thus make the hypothesis complexity can be measured by the margin which separate the data, not the number of features.
In this paper, also, Joachims explains the properties of text so that it applies well with the Support Vector Machines. Firstly, text has high dimensionality of input space which means it has significant number of features. However, SVMs exploit overfitting protection, hence they are possible to manage such problems. The second reasons is text only has small amount of irrelevant features. In order to have good performance, classifiers should capable of handling dense features. Third, document vectors are said to be sparse. According to Kivinen et al. logarithmic mistake bound models are applicable for such problems and it has similar inductive bias like SVMs. Giving these evidences, SVMs are expected to handle problems with dense concepts and sparse instances. Lastly, Joachims states that most text categorization are linearly separable. From the experiments done, Joachims proves that the two collection of data set used as evaluation are mostly linearly separable. Since the idea of SVMs is to find linear separators, this argument applies to SVM as well.
As previously stated, Joachims uses two data set collection in his experiments, the “ModApte” split of the Reuters-21578 dataset compiled by David Lewis and the Ohsumed corpus compiled by William Hersh. The experiments are done by comparing the performance of SVMs using polynomial and radial basic function (RBF) kernels with other four conventional methods that popularly used as text classifiers; naïve Bayes classifiers, the Rocchio algorithm, k-nearest neighbor classifiers, and the C4.5 decision tree with each method represents different machine learning concept.
The first dataset, “the ModApte”, consists of 9603 training set and 3299 test set leading to 9962 distinct terms in the training set aftermath. Meanwhile, in the second dataset, the Ohsumed collection, from 50216 documents which have abstracts, only the first and the second 10000 are used for training and testing respectively resulting in 15561 distinct terms afterwards.
The result on the Reuters corpus are shown in table below containing the precision/recall-breakeven point of the ten most frequent Reuters categories to measure the performance and microaveraged performance over all Reuters categories as tools to obtain a single performance over all binary classification task.
From the table we can infer that among the conventional methods, k-nearest neighbor classifiers performs the best with microaveraged of 82.3 and compare to all conventional methods, SVMs performs much better with microaveraged of 86.0 for polynomial and 86.4 for radial basic function. In SVM polynomial degree 5, although using all 9962 features and hypothesis spaces are complex, there is no overfitting occurs. SVMs are more expensive than naïve Bayes, Rocchio and k-NN in the training time, however, compare to C4.5, SVMs are almost similar.
In conclusions, SVMs are proved to be applicable for text categorization. SVMs have the capability to generalize well in high dimensional feature spaces so that it requires no feature selection. Besides, SVMs are robust, outperforming other conventional methods in all experiments. Moreover, SVMs can obtain good parameter automatically eliminating the requirement for parameter tuning. Finally, in my opinion, although this paper is very good in providing evidences, it still lack of explanations. I have to read several times in order to understand it since it only provides very short information. For the expert, it may be easy to understand, but for the first-timer like me, it is very hard to understand. Also, the examples provide are not complete. The author only provide result from one dataset only, making it even harder to get the important idea of the paper itself.