
Introduction

In today’s society, a large portion of the world’s population gets its news on electronic devices. Many of the major newspapers have websites that can be used on phones and tablets, in which their articles are displayed in a flow, typically sorted by time of publication. Not all readers are interested in all of the articles published each day; the problem is that a user has to go through the different pages and read through the news articles until they find the ones they are interested in. In this project we explore the possibility of using text classification to teach a machine to select the class of an article. With this information a newspaper can divide articles into categories and display all articles about a given class, e.g. Politics, Sports, or Entertainment.

1.1 Background

Machine learning is a technique in which a machine learns about an area from the data it is fed. The learning can be either supervised or unsupervised, or a mix of both. Unsupervised learning concerns finding natural categorizations in data: the machine does not know what categories an object belongs to or what classifications exist. In supervised learning, pre-defined classes are used and the machine knows which categories an object belongs to. Text classification is the process of classifying a text document into categories, and it is done using machine learning. A document can belong to one, many, or no categories; in this project each document always belongs to exactly one category. There are many different methods available for text classification. Naïve Bayes is a family of models for text classification in which naive assumptions are made in order to categorize text into different groups; one area in which Naïve Bayes has shown great results is spam filtering. The naive assumption is that words are individually significant and have no connection to each other.
Through this assumption one can determine what category a text belongs to through statistical significance.

1.2 Document Classification

Document classification is the task of grouping documents into categories based upon their content. It is a significant learning problem that is at the core of many information management and retrieval tasks, and it performs an essential role in various applications that deal with organizing, classifying, searching, and concisely representing large amounts of information. It is a longstanding and well-studied problem in information retrieval.

Automatic document classification can be broadly divided into three categories: supervised, unsupervised, and semi-supervised document classification. In supervised document classification, some mechanism external to the classification model (generally a human) provides information related to the correct classification; this makes it easy to test the accuracy of the classification model. In unsupervised document classification, no information is provided by any external mechanism whatsoever. In semi-supervised document classification, part of the documents are labeled by an external mechanism.

There are two main factors which make document classification a challenging task: (a) feature extraction and (b) topic ambiguity. First, feature extraction deals with extracting the right set of features, ones that accurately describe the document and help in building a good classification model. Second, many broad-topic documents are themselves so complicated that it becomes difficult to place them in any specific category. Consider, say, a document about theocracy.
For such a document, it would be tough to decide whether it should be placed under the category of politics or religion. Broad-topic documents may also contain terms that have different meanings in different contexts and that appear multiple times within a document in different contexts.

Never before has document classification been as imperative as it is at the moment. The expansion of the internet has resulted in a significant increase in the unstructured data generated and consumed. Thus there is a dire need for content-based document classification so that documents can be efficiently located by the consumers who want them. Search engines were developed precisely for this job. In their early days, search engines like Yahoo and HotBot worked by constructing indices and finding the information requested by the user; however, it was not uncommon for them to return a list of documents with poor correlation to the query. This has led to research on and development of intelligent agents that make use of machine learning to classify documents. Some of the techniques employed for document classification are the Naïve Bayes classifier, Support Vector Machines, decision trees, and neural networks. Some of the applications that make use of these techniques are listed below:

· Email routing: Routing an email to a general address, or to a specific address or mailbox, depending on the topic of the email.

· Language identification: Automatically determining the language of a text. This is useful in many cases, one of them being the direction in which the language should be processed. Most languages are read and written from left to right and top to bottom, but there are some exceptions; for example, Hebrew and Arabic are processed from right to left.
This knowledge can then be used, together with language identification, to correctly process text in any language.

· Readability assessment: Automatically determining how readable a document is for an audience of a certain age.

· Sentiment analysis: Determining the sentiment of a speaker or writer based on the content of the document.

Machine Learning Approaches

Support Vector Machine

A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes. SVMs are more commonly used in classification problems, and that is what we focus on in this section. SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes. Support vectors are the data points nearest to the hyperplane: the points of a dataset that, if removed, would alter the position of the dividing hyperplane. Because of this, they can be considered the critical elements of a dataset. As a simple example, for a classification task with only two features, you can think of a hyperplane as a line that linearly separates and classifies a set of data. Intuitively, the further from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We therefore want our data points to be as far away from the hyperplane as possible, while still being on the correct side of it. When new test data is added, the side of the hyperplane it lands on decides the class that we assign to it.

In practice the classes are usually not linearly separable. In such cases a higher-order function can split the dataset. To accomplish a nonlinear SVM classifier, the so-called kernel trick is applied: a function is applied to the dataset which maps the points of the nonlinear dataset to points in a space where they are linearly separable. Quite simple functions, such as the square or the square root, can already map the data to such a space.
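The kernel idea just described can be illustrated with a small hand-rolled sketch. The data points and the threshold below are made up for illustration; an actual SVM library applies the kernel implicitly rather than computing the mapped features explicitly.

```python
# Points inside vs. outside a circle cannot be split by a straight line in
# the original (x, y) space, but after the explicit feature map
# (x, y) -> (x^2, y^2) the rule "inside the circle" becomes the linear
# half-plane u + v < r^2 in the new space.
points = [(0.1, 0.2, "in"), (-0.3, 0.1, "in"), (0.9, 0.8, "out"), (-0.7, -0.9, "out")]

def mapped(x, y):
    # Explicit feature map; a kernel function performs this step implicitly.
    return (x * x, y * y)

for x, y, label in points:
    u, v = mapped(x, y)
    predicted = "in" if u + v < 0.5 else "out"  # linear rule in mapped space
    print(label, predicted)  # each predicted label matches the true one
```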
This computation is done implicitly, so the user does not have to map the data to a linear space themselves; the only input required is the function type with its corresponding parameters.

Most often, datasets are not so nicely distributed that the classes can be separated by a line or a higher-order function. Real datasets contain random errors or noise, which makes them less clean. Although it is possible to create a model that perfectly separates such data, it is not desirable, because such a model overfits the training data. Overfitting is caused by incorporating the random errors or noise into the model; the model is then not generic and makes significantly more errors on other datasets. Creating simpler models keeps the model from overfitting, so the complexity of the model has to be balanced between fitting the training data and staying generic. This can be achieved by allowing models that make errors: an SVM may make some errors on the training data to avoid overfitting, while still trying to minimize the number of errors made. A model that perfectly splits noisy training data, including a few random outlying data points, is overfitting and will make more errors on the test set. Sometimes it is even impossible to train a model that achieves a perfect separation; this happens when two data points have an identical feature vector but a different class label.

K-nearest neighbor

Instance-based classifiers such as the kNN classifier operate on the premise that the classification of unknown instances can be done by relating the unknown to the known according to some distance/similarity function.
The intuition is that two instances that lie far apart in the instance space, as defined by the appropriate distance function, are less likely to belong to the same class than two closely situated instances. Classification (generalization) using an instance-based classifier can be a simple matter of locating the nearest neighbour in instance space and labelling the unknown instance with the same class label as that of the located (known) neighbour. This approach is often referred to as a nearest neighbour classifier. The downside of this simple approach is the lack of robustness that characterizes the resulting classifiers: the high degree of local sensitivity makes nearest neighbour classifiers highly susceptible to noise in the training data.

More robust models can be achieved by locating k neighbours, where k > 1, and letting a majority vote decide the outcome of the class labelling. A higher value of k results in a smoother, less locally sensitive, function. The nearest neighbour classifier can be regarded as a special case of the more general k-nearest neighbours classifier, hereafter referred to as a kNN classifier. The drawback of increasing the value of k is of course that as k approaches n, where n is the size of the instance base, the performance of the classifier approaches that of the most straightforward statistical baseline: the assumption that all unknown instances belong to the class most frequently represented in the training data.

This problem can be avoided by limiting the influence of distant instances. One way of doing so is to assign a weight to each vote, where the weight is a function of the distance between the unknown and the known instance. By letting each weight be defined by the inverse squared distance between the known and unknown instances, votes cast by distant instances will have very little influence on the decision process compared to instances in the near neighbourhood.
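A minimal sketch of this weighting scheme, with made-up feature vectors and labels, might look like the following:

```python
# Distance-weighted kNN voting: each of the k nearest neighbours votes for
# its label with weight 1/d^2, so distant neighbours barely influence the
# outcome. The training data below is invented purely for illustration.
from collections import defaultdict

def knn_predict(train, query, k=3):
    """train: list of (vector, label) pairs; query: a feature vector."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    neighbours = sorted(train, key=lambda item: sqdist(item[0], query))[:k]
    votes = defaultdict(float)
    for vec, label in neighbours:
        d2 = sqdist(vec, query)
        # Weight = inverse squared distance; an exact match dominates the vote.
        votes[label] += 1.0 / d2 if d2 > 0 else float("inf")
    return max(votes, key=votes.get)

train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((5.0, 5.0), "B"), ((5.2, 4.8), "B")]
print(knn_predict(train, (0.2, 0.1)))  # a query near the "A" cluster -> A
```

With k=3 the query above picks up two "A" neighbours and one "B" neighbour, and the inverse-squared-distance weights make the faraway "B" vote negligible.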
Distance-weighted voting usually serves as a good middle ground as far as local sensitivity is concerned.

3.2 Naïve Bayes

The Naïve Bayes model is a family of classification models that makes what is called a naive assumption: that attributes (words) are independent of each other, which means that the order of the attributes (words) does not matter. This is also the case when we use the setting NGRAM=2; the only difference is that two words are grouped together and then considered as one attribute. The Naïve Bayes model we used in our experiments is commonly called a Multinomial Naïve Bayes model; it takes into account the number of occurrences of an attribute in a document. The Naïve Bayes framework is provided by a simple theorem of probability known as Bayes' rule:

P(c|x) = P(x|c)P(c) / P(x)

This can be explained by a simple example. Say that we have two classification categories with a total of 250 documents: a document can be classified as either Interesting or Not Interesting. For each document, we count how many times each word is used and save the counts in a table.

Classification    Football  Politics  I    Love  Total
Interesting       100       10        50   20    150
Not Interesting   10        100       50   10    100

Table 3.1: Naïve Bayes example

Table 3.1 shows, for instance, that in the 150 articles that are interesting, the word football is mentioned 100 times. If we want to classify a text, we use Bayes' rule to calculate the probability of the text belonging to each of the classification categories and choose the one with the highest value. The probability that "I love football" belongs to the Not Interesting category is:

P(NotInteresting|I, Love, Football) = P(I|NotInteresting) P(Love|NotInteresting) P(Football|NotInteresting) P(NotInteresting) / (P(I) P(Love) P(Football)) = 0.5 × 0.1 × 0.1 × 0.4 / P(evidence) = 0.002 / P(evidence)

The corresponding value for the Interesting category is 0.0172458 / P(evidence). Since 0.0172458 > 0.002, we classify the text "I love football" as Interesting.

Dataset

The dataset used for this project is a collection of articles provided by BBC News, downloaded from "Insight Resources". This collection of articles is provided for use as a benchmark for machine learning research; all rights, including copyright, in the content of the original articles are owned by the BBC. The dataset consists of 2225 documents from the BBC news website, corresponding to stories in five topical areas from 2004-2005. The articles belong to five class labels: Business, Entertainment, Politics, Sport, and Tech.

5.5 Feature Representation of Documents

This is one of the most important tasks of document classification. In feature representation, documents are converted into feature vectors. There are many approaches to doing this.

5.5.1 Binary Vectorizer

One of the simplest representations is a binary feature vector. In this method, each word of the vocabulary that occurs in the document at least once is counted as positive (1), whereas each word that does not occur is counted as (0). Thus, each document is represented as a vector over the vocabulary, with each word's value mapped to either 0 (if it does not occur in that document) or 1 (if it does occur in the document). Since a document typically contains only a small fraction of the words in the vocabulary, most of the entries in the feature vector will have the value '0', and a lot of storage space is wasted. To overcome this limitation, we make use of a sparse matrix, in which we store only the words whose value is non-zero, resulting in significant storage savings.

5.5.2 Count Vectorizer

Though the binary feature vector is one of the simplest representations, it does not perform that well: it captures whether certain words exist in the document, but it fails to capture the frequency of those words.
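The contrast between the two representations can be sketched on a toy corpus. The corpus below is made up, and the vectorizers are hand-rolled for illustration; in practice a library implementation (e.g. scikit-learn's CountVectorizer) would be used, which also handles the sparse storage mentioned above.

```python
# Binary vs. count representation of documents over a shared vocabulary.
docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({w for d in docs for w in d.split()})

def binary_vector(doc):
    # 1 if the word occurs at least once, 0 otherwise.
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]

def count_vector(doc):
    # Number of occurrences of each vocabulary word.
    words = doc.split()
    return [words.count(w) for w in vocab]

print(vocab)                   # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(binary_vector(docs[0]))  # [1, 0, 1, 1, 1, 1]  -- presence only
print(count_vector(docs[0]))   # [1, 0, 1, 1, 1, 2]  -- 'the' occurs twice
```

Note how the binary vector loses the fact that 'the' occurs twice in the first document, which is exactly the limitation the count vectorizer addresses.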
For this reason, the count vectorizer (also termed a term-frequency vectorizer) is generally preferred. In the count vectorizer, we capture not just the existence of each word in a given document but also how many times it occurs: for each word in the vocabulary that occurs in a document, we record the number of occurrences. Thus, the document is represented as a vector of word counts. A sparse matrix is used for this approach as well, since most of the words in the vocabulary will likely have a frequency of '0'.

5.5.3 TfIdf Vectorizer

The count vectorizer captures more detail than the simpler binary vectorizer, but it too has a limitation: although it considers the frequency of words occurring in a document, it does so irrespective of how rare or common each word is. To overcome this limitation, the TfIdf (Term Frequency-Inverse Document Frequency) vectorizer can be used. The TfIdf vectorizer considers the inverse document frequency (the distinguishing weight of a word) along with the frequency of each word occurring in a document when forming the feature vector. Say we have a document that contains the word 'catch' 10 times and the word 'baseball' 2 times. If we just used term frequency, we would give more weight to the word 'catch' than to the word 'baseball', since it occurs more frequently in the document. However, the word 'catch' might occur frequently across multiple categories, whereas the word 'baseball' might occur only in the few categories related to sports or baseball; the word 'baseball' is therefore a more distinguishing feature of the document. The inverse document frequency measures this distinguishability of a word, and we multiply it with the term frequency to get the new weight of each word in the document.
TfIdf is calculated as follows:

TF (Term Frequency) = number of times the term/word occurs in the document.

IDF (Inverse Document Frequency) = log(N / (1 + |{d ∈ D : t ∈ d}|))

Here, N is the total number of documents in the corpus and |{d ∈ D : t ∈ d}| is the number of documents in which term t appears. TfIdf is then calculated as:

TfIdf = TF × IDF

If a word occurs in few documents, its distinguishability, or in more technical terms its IDF value, will be high, whereas a word that occurs across many documents will have a low IDF value. Thus, with TfIdf the feature vector is not based solely on the term frequency of words but on the product of term frequency and IDF value.
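A minimal sketch of this computation, applying the formula above to a made-up three-document corpus, is shown below. Note that library implementations (e.g. scikit-learn's TfidfVectorizer) use slightly different smoothing, so the exact values differ.

```python
# TfIdf per the formula in the text: tf(t, d) * log(N / (1 + df(t))),
# where df(t) is the number of documents containing term t.
import math

docs = [
    "catch baseball catch catch".split(),  # 'catch' frequent in this document
    "catch throw".split(),
    "politics speech".split(),
]
N = len(docs)

def tfidf(term, doc):
    tf = doc.count(term)                  # term frequency in this document
    df = sum(term in d for d in docs)     # documents containing the term
    return tf * math.log(N / (1 + df))

d = docs[0]
print(round(tfidf("catch", d), 3))     # 0.0   -- common across documents
print(round(tfidf("baseball", d), 3))  # 0.405 -- rarer, hence higher weight
```

Even though 'catch' occurs three times in the first document and 'baseball' only once, 'baseball' ends up with the higher weight because it appears in fewer documents, which is exactly the behaviour described above.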
