The world of Artificial Intelligence and Machine Learning has emerged from what is
referred to as the AI winter of the 1980s, when the technology of the time was
not powerful or advanced enough. In more recent years the power of computing and
data science has improved dramatically, hence the renewed interest in Artificial
Intelligence research and development. The main aim is to predict the stage of
breast cancer with a measurable degree of accuracy. In this research paper we use
machine learning classification techniques, namely Support Vector Machine,
K-Nearest-Neighbors and Naïve Bayes, to develop the required predictive models.
In our analysis we found the technique
which has the highest accuracy rate is
by followed by and
Keywords: Machine learning, Classification, Support Vector Machine, K-Nearest-Neighbors, Naïve Bayes
Breast cancer is one of the oldest diseases known to mankind; it can be traced
back to ancient times and is most common among women. Early prediction or
detection of breast cancer is therefore crucial to the survival of a person who
has been diagnosed with the disease. Breast cancer refers to a malignant tumour
that has developed from cells in the breast. It was reported that over 508 000
women died of breast cancer during 2011, and survival rates are very low,
especially in developing nations, mainly due to lack of
In this paper, we are trying to solve a breast cancer prediction problem using
various Machine Learning techniques such as K-Nearest-Neighbors, Support Vector
Machines, Decision Trees, Neural Networks and Naïve Bayes, to name a few.
This will be done by analysing data collected by Dr. William H. Wolberg from the
University of Wisconsin Hospitals, Madison. The approach taken in this research
paper is from a classification standpoint, where we use three classification
techniques: Support Vector Machine, K-Nearest-Neighbors and Naïve Bayes.
In order to get a full understanding of the problem and possible solutions, a
literature review was conducted, which showed that Machine Learning is not new
to the world of cancer research. There have been many studies on cancer
prediction using machine learning techniques such as decision trees, statistical
approaches and artificial neural networks.
As mentioned above, there are various types and stages of breast cancer, so
defining the problem is crucial to finding the best possible results. In this
paper, we predict whether the cancer is benign or malignant.
The main aim of this paper is to find the most effective and efficient
classifier for predicting breast cancer in terms of accuracy.
Due to the vast amounts of structured and unstructured data available, finding
appropriate data in the correct format to build a Machine Learning model that
will help to predict the type of breast cancer was a challenge. The dataset
used is the Wisconsin Breast Cancer (Original) dataset from the UCI Machine
Learning Repository, which has 699 instances and 11 attributes. The first
attribute is an identification attribute and does not form part of the data our
model needs to learn from, so we removed it, as it would otherwise have
produced misleading predictions.
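The removal of the identification attribute described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a comma-separated layout with the sample ID first, nine feature values next, and the class label last. The rows are made-up values, and `split_row` is a hypothetical helper name.

```python
# Illustrative rows in the assumed layout: id, nine features, class label.
# The values are made up for this sketch, not copied from the dataset.
RAW_ROWS = [
    "1001,5,1,1,1,2,1,3,1,1,2",
    "1002,8,10,10,8,7,10,9,7,1,4",
    "1003,4,2,1,1,2,?,3,1,1,2",
]


def split_row(line):
    """Drop the leading ID and separate the features from the class label."""
    fields = line.strip().split(",")
    features = fields[1:-1]  # skip the ID (first field) and the label (last)
    label = int(fields[-1])
    return features, label


parsed = [split_row(line) for line in RAW_ROWS]
for features, label in parsed:
    print(features, label)
```

Keeping the ID would give a classifier a column of arbitrary large numbers with no relationship to the diagnosis, which is why it is dropped before training.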
As with most datasets, there are missing values: 16 instances in Groups 1 to 6
each contain a single missing attribute value (denoted by "?" in the raw data),
which we replaced with -99999. The quality of data also plays a major role in
the machine learning
Class distribution: Benign 458 (65.5%) and Malignant 241 (34.5%).
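The missing-value replacement and the class percentages can be checked with a short sketch. The placeholder -99999 and the counts 458 and 241 come from the text. The "?" marker for a missing value is an assumption about the raw file, and `fill_missing` and `class_percentages` are hypothetical helper names.

```python
MISSING_MARKER = "?"   # assumption: how the raw file marks a missing value
PLACEHOLDER = -99999   # replacement value stated in the text


def fill_missing(features):
    """Convert string fields to integers, replacing missing markers."""
    return [PLACEHOLDER if f == MISSING_MARKER else int(f) for f in features]


def class_percentages(counts):
    """Return each class's share of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


row = fill_missing(["4", "2", "1", "1", "2", "?", "3", "1", "1"])
shares = class_percentages({"benign": 458, "malignant": 241})
print(row)
print(shares)
```

With 699 instances in total, the benign class is the majority at roughly two thirds of the data, so a classifier that always predicts benign would already be right about 65.5% of the time. Reported accuracies are best read against that baseline.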
Having a good understanding of what machine learning is and how it works is
important to understanding the methods used to predict the outcome of the
dataset. Machine learning is a branch of Artificial Intelligence which gives
computer systems the ability to learn and improve from experience without being
explicitly programmed. Machine learning programs employ algorithms to process,
train on and test data. These algorithms are normally categorized as supervised
learning, unsupervised learning and reinforcement learning. Classification is
one of the most important tasks in supervised learning.
Training and testing on the same data would be a serious mistake: a model
evaluated on the data it was trained on can simply memorize that data and will
not be able to handle new instances, which is known as overfitting. To avoid
this, we used cross-validation to separate the training data from the testing
data, so that the reported accuracy reflects how the model performs on unseen data.
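The idea behind cross-validation can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure used in the paper: the text does not give the number of folds, so `k=5` and the fixed shuffle seed are arbitrary choices, and `k_fold_indices` and `accuracy` are hypothetical helper names.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", sorted(train_idx), "test:", sorted(test_idx))
```

Each instance is used for testing exactly once and never appears in the training set of the fold that tests it. Averaging `accuracy` over the folds gives an estimate of performance on unseen data.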
In this paper the first classification technique used was Naïve Bayes.
This algorithm is statistically based: it uses the probability of each feature
or attribute value occurring in a particular class in order to make a
prediction. The calculation assumes that the attributes are independent of each
other given the class.
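The independence assumption is easiest to see in code, where the per-attribute log-likelihoods are simply added together. The sketch below assumes a Gaussian likelihood for each attribute, which the text does not specify. The function names and the two-feature toy data are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate per-class priors and per-feature mean and variance."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(X)
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # small term avoids zero variance
        model[label] = (prior, stats)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        # Independence assumption: per-feature log-likelihoods are summed.
        for value, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data: label 2 stands for benign, 4 for malignant (made-up values).
X = [[1, 1], [2, 1], [1, 2], [8, 9], [9, 8], [10, 10]]
y = [2, 2, 2, 4, 4, 4]
nb_model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(nb_model, [2, 2]))
print(predict_gaussian_nb(nb_model, [9, 9]))
```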
The second classifier utilized was K-Nearest-Neighbors, an algorithm that bases
its predictions on the entire training dataset. To classify a new, unseen
instance, it searches the training dataset for the k most similar instances and
returns the most common class among them as the prediction. In other words,
given a test instance, find its k nearest examples.
Figure: K-Nearest-Neighbors example (source: scikit-learn).
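The neighbour search and majority vote described above can be sketched as follows. The text does not state the distance measure or the value of k, so the Euclidean distance and `k=3` are assumptions. `knn_predict` is a hypothetical name and the toy data is made up.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data: label 2 stands for benign, 4 for malignant (made-up values).
train_X = [[1, 1], [2, 1], [1, 2], [8, 9], [9, 8], [10, 10]]
train_y = [2, 2, 2, 4, 4, 4]
print(knn_predict(train_X, train_y, [2, 2]))
print(knn_predict(train_X, train_y, [9, 9]))
```

There is no training step beyond storing the data, so the cost is paid at prediction time, when every stored instance is compared with the query.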
The final classifier used was the Support Vector Machine, an algorithm that
can also be used on regression problems and can help to identify outliers
within a dataset. For classification, it finds the boundary that separates the
two classes with the widest margin and assigns each new example to one side or
the other, which makes it a non-probabilistic classifier.
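A linear Support Vector Machine can be sketched with sub-gradient descent on the regularised hinge loss. This is an illustration only: the text does not say which kernel or solver was used, and library implementations solve the problem differently. The learning rate, regularisation strength, epoch count, function names and toy data are all assumptions.

```python
def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=50):
    """Fit weights w and bias b by sub-gradient descent on the hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for row, label in zip(X, y):  # labels must be -1 or +1
            margin = label * (sum(wi * xi for wi, xi in zip(w, row)) + b)
            if margin < 1:  # inside the margin or misclassified
                w = [wi + lr * (label * xi - lam * wi) for wi, xi in zip(w, row)]
                b += lr * label
            else:  # only the regulariser contributes
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def svm_predict(w, b, row):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    score = sum(wi * xi for wi, xi in zip(w, row)) + b
    return 1 if score >= 0 else -1


# Toy data: -1 stands for benign, +1 for malignant (made-up, centred values).
svm_X = [[-2, -1], [-1, -2], [-2, -2], [2, 1], [1, 2], [2, 2]]
svm_y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(svm_X, svm_y)
print(svm_predict(w, b, [3, 3]))
print(svm_predict(w, b, [-3, -3]))
```

The prediction is the sign of a score, not a probability, which is what makes the classifier non-probabilistic.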