[Review] Text Categorization with Support Vector Machines: Learning with Many Relevant Features

I reviewed this paper for my Research Methodology assignment and to help with my thesis. As usual, I apologize for any sentence that may read as plagiarism.

Author: Thorsten Joachims

Year : 1998

Appeared in: Proceedings of the European Conference on Machine Learning (ECML '98), Chemnitz, Germany

Pages: 137-142

Text categorization aims to classify documents into a fixed number of predefined categories. Building text classifiers by hand is difficult and time-consuming; with machine learning, classifiers can instead be learned from examples and then applied to assign categories automatically. For these reasons, this paper examines the advantages of Support Vector Machines (SVMs) for text categorization.

The first step in text categorization is transforming documents into a representation suitable for the learning algorithm and the classification task. Each distinct word wi occurring in the documents corresponds to a feature, with its number of occurrences as the feature value. A word counts as a feature only if it appears at least 3 times in the training set and is not a stop word (like “and”, “or”, etc.). This representation leads to feature spaces with thousands of dimensions, which calls for feature subset selection to improve generalization accuracy and avoid overfitting. To select a subset of features, the paper applies the information gain criterion, as recommended by Y. Yang and J. Pedersen.
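
To make the information gain criterion concrete, here is a minimal sketch of ranking binary term-presence features by information gain. The corpus, labels, and function name are my own hypothetical illustration, not from the paper.

```python
import math
from collections import Counter

def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'
    with respect to the class labels: H(labels) - H(labels | term)."""
    n = len(docs)
    def entropy(subset_labels):
        if not subset_labels:
            return 0.0
        counts = Counter(subset_labels)
        return -sum((v / len(subset_labels)) * math.log2(v / len(subset_labels))
                    for v in counts.values())
    present = [y for d, y in zip(docs, labels) if term in d]
    absent  = [y for d, y in zip(docs, labels) if term not in d]
    conditional = (len(present) / n) * entropy(present) + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional

# Toy corpus (hypothetical): each document is a set of tokens.
docs = [{"wheat", "price"}, {"wheat", "export"}, {"oil", "price"}, {"oil", "export"}]
labels = ["grain", "grain", "energy", "energy"]

# "wheat" perfectly predicts the class, so its gain equals the label entropy (1 bit);
# "price" is uninformative, so its gain is 0.
print(information_gain(docs, labels, "wheat"))  # 1.0
print(information_gain(docs, labels, "price"))  # 0.0
```

Selecting a feature subset then amounts to keeping the terms with the highest gain.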

According to V. Vapnik, Support Vector Machines are based on the Structural Risk Minimization (SRM) principle from computational learning theory. SRM means finding a hypothesis h for which the lowest true error can be guaranteed, where the true error of h is the probability that h will make an error on a randomly selected example. SVMs can learn independently of the dimensionality of the feature space: the complexity of the hypothesis is measured by the margin with which it separates the data, not by the number of features.
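
To illustrate the margin idea, here is a small sketch computing the geometric margin of a separating hyperplane on toy 2-D data. The points and the candidate hyperplane are hypothetical examples of mine, not from the paper; the key point is that this quantity does not depend on the number of dimensions.

```python
import math

# Hypothetical linearly separable 2-D points with labels +1 / -1.
points = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((-2.0, -2.0), -1), ((-3.0, -1.0), -1)]

# A candidate separating hyperplane w.x + b = 0.
w, b = (1.0, 1.0), 0.0

def geometric_margin(points, w, b):
    """Smallest signed distance from any training point to the hyperplane.
    SRM-style bounds for SVMs depend on this margin, not on dimensionality."""
    norm = math.hypot(*w)
    return min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm for (x, y) in points)

print(geometric_margin(points, w, b))  # 2*sqrt(2), the distance of the closest points
```

An SVM chooses, among all separating hyperplanes, the one that maximizes this margin.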

Joachims also explains the properties of text that make it well suited to Support Vector Machines. First, text has a high-dimensional input space, i.e., a very large number of features; since SVMs have built-in overfitting protection, they can handle such problems. Second, text has few irrelevant features, so a good classifier must be able to handle dense concepts. Third, document vectors are sparse. According to Kivinen et al., algorithms with a logarithmic mistake bound suit such problems, and they have an inductive bias similar to that of SVMs. Given this evidence, SVMs are expected to handle problems with dense concepts and sparse instances. Finally, Joachims states that most text categorization tasks are linearly separable, and his experiments show that the two evaluation collections are indeed mostly linearly separable. Since the idea of SVMs is to find linear separators, this argument applies to SVMs as well.

As previously stated, Joachims uses two dataset collections in his experiments: the “ModApte” split of the Reuters-21578 dataset compiled by David Lewis, and the Ohsumed corpus compiled by William Hersh. The experiments compare the performance of SVMs using polynomial and radial basis function (RBF) kernels against four conventional methods popularly used as text classifiers: the naïve Bayes classifier, the Rocchio algorithm, the k-nearest-neighbor classifier, and the C4.5 decision tree, each representing a different machine learning approach.

The first dataset, the “ModApte” split, consists of 9,603 training documents and 3,299 test documents, yielding 9,962 distinct terms in the training set. In the second dataset, the Ohsumed collection, of the 50,216 documents that have abstracts, the first 10,000 are used for training and the second 10,000 for testing, yielding 15,561 distinct terms.

The results on the Reuters corpus are shown in the table below, which reports the precision/recall breakeven point for the ten most frequent Reuters categories, along with the microaveraged performance over all Reuters categories as a single performance figure across all binary classification tasks.

[Table: precision/recall breakeven points for the ten most frequent Reuters categories, plus the microaverage over all categories.]

From the table we can infer that among the conventional methods, the k-nearest-neighbor classifier performs best, with a microaverage of 82.3, while the SVMs perform substantially better than all conventional methods, with microaverages of 86.0 for the polynomial kernel and 86.4 for the radial basis function kernel. Even with a polynomial kernel of degree 5, using all 9,962 features and a correspondingly complex hypothesis space, no overfitting occurs. In training time, SVMs are more expensive than naïve Bayes, Rocchio, and k-NN, but roughly comparable to C4.5.
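
For readers unfamiliar with microaveraging: it pools the binary decisions of all categories into one contingency table before computing precision and recall (the breakeven point is then the value at which precision equals recall as the decision threshold varies). A minimal sketch, using hypothetical per-category counts of my own, not the paper's actual numbers:

```python
# Hypothetical (true positives, false positives, false negatives) per category.
per_category = {
    "earn":  (100, 10, 5),
    "acq":   (80,  20, 10),
    "wheat": (30,  5,  15),
}

# Pool the counts over all categories, then compute precision/recall once.
tp = sum(c[0] for c in per_category.values())  # 210
fp = sum(c[1] for c in per_category.values())  # 35
fn = sum(c[2] for c in per_category.values())  # 30

micro_precision = tp / (tp + fp)  # 210 / 245
micro_recall    = tp / (tp + fn)  # 210 / 240
print(round(micro_precision, 3), round(micro_recall, 3))  # 0.857 0.875
```

Because the counts are pooled, frequent categories dominate the microaverage, which is why it serves as a single overall figure across all binary tasks.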

In conclusion, SVMs are shown to be well suited to text categorization. They generalize well in high-dimensional feature spaces, so no feature selection is required. They are also robust, outperforming the other conventional methods in all experiments. Moreover, SVMs find good parameter settings automatically, eliminating the need for manual parameter tuning. Finally, in my opinion, although this paper is very good at providing evidence, it is short on explanation. I had to read it several times to understand it, since the information it gives is very brief. For an expert it may be easy to follow, but for a first-timer like me it was very hard. Also, the examples it provides are incomplete: the author presents results from only one dataset, which makes the key idea of the paper even harder to grasp.


Final Exam

After weeks packed with strange projects that I could barely wrap my head around, we have finally reached the end of the seventh semester, which will close with final exams. As usual, finals run for about two weeks, with one course per day. This time I have a total of only 8 courses, which means 8 days, in some unpredictable magical order, to get through. As it turns out, one course is missing from the exam schedule: Software Engineering. Which is quite convenient, given my cluelessness in that subject; I would happily do seven celebratory backflips if that course really ended up having no final exam.

Still, it is rather suspicious. The lecturer of that very course already told us that for the final exam we would be allowed to bring a single A4 cheat sheet, handwritten on both sides. So how did the course's very existence get forgotten on the final exam schedule?

This is something that must be followed up. But my battery is dead. It needs recharging first.

Multimedia Project Result

I’ve told you I made an animation for my multimedia assignment, didn’t I?

Here is what the video looks like.

The theme is global warming. We made it using stop-motion and cutout animation techniques. Our goal is to encourage young people, teens like us, to preserve the earth.

The video is about 2:41 long and made from about 200 pictures for the stop-motion animation. For the cutout animation, we used Flash as the tool. The audio in this clip was taken from the Ratatouille original soundtrack, “Remy Drives a Linguini”.

Enjoy our video.

Fundamentals of Multimedia on One’s Movement

My house is big, though not as big as a football field. But I understand the point of building a house almost the size of a football field, because I bump into furniture all the time when I walk around.

That proved very true just now: right after the maghrib prayer I left my room to go watch TV again, when out of nowhere my phone rang loudly, like when Philipp Lahm scored against Portugal and took Germany to the Euro 2008 semifinals. Startled, I sprinted back to my room to find the phone and grab it as fast as possible. It turned out the phone was sitting next to my laptop, on the desk beside the bookshelf. Eager and a little frantic, I took the middle route, leaping straight over the bed. And this is where the point of building a house the size of a football field went to waste. While crossing the bed to reach the phone, my one and only left foot kicked the edge of the Fundamentals of Multimedia book I had just picked up from the copy shop last Wednesday, and it did not stop there: the book flew onto my chair, about 30 centimeters from the edge of the bed. Let me say that again: my foot slammed into the edge of a sharp, heavy, blunt, 7-centimeter-thick book. I hope you can imagine how that felt.

I managed to grab the phone, but as a result the tip of that one and only left foot hurt unbearably at the little toe. Hopping around like a rabbit chased by an African tiger, I answered the call, which turned out to be from Odi. I struggled to keep from screaming in pain, but in the end my mouth, also the only one I have in this world, started groaning, leaving Odi confused: what on earth was wrong with this weird older sibling?

The first thing after I hung up, I limped straight to the kitchen, opened the freezer, and grabbed a block of ice to pacify my throbbing little toe. It felt decent; the pain faded a little, though not entirely.

Poor Dad. He went to all the trouble of building a house the size of a football field, hoping his kid could walk around it safely, but instead the kid crashed into his own book, which he himself had put on his own bed. It had nothing to do with the football-field-sized house at all.

Windows Installation Process: FAILED

Give me a round of applause, because I have no idea what happened to my computer, the Windows part to be exact. My laptop now runs Ubuntu.

Oh well, it’s fine. There’s only a week of classes left anyway. Lucky it happened now; if it were mid-semester I would have panicked beyond measure. The point is, now I just need to install Flash and the work is done.

So what will become of my laptop? Will it be Ubuntu-based forever? Ah, nein. Once all my work is done, I’ll take this laptop to the HP center and let the HP people there fix it. Sometime around finals, or next week, so it gets sorted quickly and I can have some peace.

Linux is not bad anyway. Good practice using it, so I won’t be confused later when I’m writing my thesis.

No Laptop, NO LIFE

I cursed every stupid sector in my brain that came up with the idea of making yet another partition on my hard disk, which was already dual boot. IT ALREADY HAD UBUNTU INSTALLED!!

Because of that idea, I couldn’t boot from my hard disk. I panicked.
I thought I had lost everything, though I don’t mind about the data, since I already have a backup. What I was most worried about was my HD, my lovely 80-gigabyte hard disk drive.

Thank God Bonggas came to the rescue. At least now I know what the problem is, and I know the HD can still be saved. What’s the plan next? I’m swapping HDs again. I grabbed my dad’s external HD, moved the data off it, and formatted it. And so I became a technician again, pulling out the laptop’s HD and swapping in the new one.

Once the HD was swapped, I thought my problems were over, since all that was left was installing Windows. But then the second shit happened: I couldn’t install Windows, because in the middle of the process it just stopped on its own and started over from the beginning.

OH SHIT. I must have racked up too many sins. God is punishing me rather cruelly by taking the life of my laptop, which to me is like my own soul. My second life is in that black thing. I love my laptop and I don’t plan to lose her this soon. I still need her to do my thesis. How can I do my thesis without her?

Tomorrow I’ll try installing with Firman’s Windows CD, and borrow Firman’s cradle while I’m at it. I hope it installs, because if it doesn’t, I’M DEAD MEAT.

Please pray for me.

Today’s Plan

A bright, cheerful, scorching Sunday, the sun hanging right above my head.

There are several things that need to be done today.
First, I have to finish the project off. We’ll shoot today until it’s done, then continue with the Flash animation tonight.
I’m the animator!!

Second..
Watch Twilight!! Yeah, because yesterday we didn’t end up going, since it was already late and I really couldn’t be bothered to come home late at night, so the movie got pushed to today, even though yesterday I was so annoyed I wanted to bite Rona.
Why did the DotA game take so long?? It was supposed to be over by a quarter to five..!!

Suddenly I hate Dota.

Anyway, everything has to get done, so the project gets finished quickly too. I hate it when things keep getting postponed.