I got the chance to go through this paper on machine learning, which was recommended to me by my mentor. It is a very good paper for getting the whole perspective on machine learning: the author summarizes the supervised learning methods very well and explains when you should consider using each of them.
It took me some time to go through the whole paper, and I will need to come back to various topics in it to build a better understanding. I highly recommend this paper to everyone who is starting their machine learning journey and would like to see the importance of each supervised learning algorithm.
These are some of the points that I would like to summarize:
Supervised Learning: the instances are given with known labels.
Unsupervised Learning: the instances are unlabeled.
The main aim of going through the various ML algorithms is to understand the strengths and limitations of each one. While solving a problem, we should consider using a combination of algorithms for better accuracy; just using one algorithm and trying to increase its accuracy may not always help.
Issues with supervised learning:
The first step towards using ML is to collect the dataset.
While preparing the dataset, try to find out which fields are the most informative. If that information is not available, we can use the brute-force method, i.e. measure everything available in the hope that the right features can be isolated. The challenge with brute force is that it requires pre-processing.
Some of the methods for pre-processing and cleansing the dataset described by the author are:
- Variable-by-variable data cleansing
- Use of visualization tools.
- Instance selection to handle noise. This is more of an optimization problem; for sampling the instances, methods such as random sampling and stratified sampling can be used.
- Handling incomplete feature values. Methods like ignoring features with unknown values, selecting the most common feature value, mean substitution, regression or classification imputation, and treating missing values as special values can be used (see the sketch after this list).
- Feature subset selection
- Feature construction/transformation
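As a concrete illustration of mean substitution and most-common-value imputation, here is a minimal Python sketch. The paper itself prescribes no tool; scikit-learn and the toy matrix below are my own assumptions.

import numpy as np
from sklearn.impute import SimpleImputer

# toy feature matrix with missing entries marked as np.nan
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# mean substitution: each missing value is replaced by its column mean
print(SimpleImputer(strategy="mean").fit_transform(X))

# "select the most common feature value" corresponds to strategy="most_frequent"
print(SimpleImputer(strategy="most_frequent").fit_transform(X))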
Algorithm Selection:
The author talks about the importance of using the right algorithm. Prediction accuracy is the main criterion for classifier evaluation. The techniques generally used to estimate classifier accuracy are: splitting the training set (holdout), cross-validation, and the leave-one-out method. Leave-one-out is a special case of cross-validation. Its disadvantage is that it requires a lot of computation, but it gives the most accurate estimate of the classifier's error rate.
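To make these evaluation techniques concrete, here is a small Python sketch showing k-fold cross-validation and leave-one-out as its special case. This is my own example, not from the paper; the iris data and the decision tree are assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# 10-fold cross-validation: train on 9 folds, test on the held-out fold
print(cross_val_score(clf, X, y, cv=10).mean())

# leave-one-out: each "fold" is a single instance, so n models are trained,
# which is why it is computationally expensive
print(cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())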
Supervised Machine Learning techniques -
1) Logic-Based Algorithms - Symbolic Learning Methods
a) Decision Trees: These are trees that classify instances based on their feature values. Decision trees are easy to comprehend.
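A quick sketch of why decision trees are easy to comprehend: the fitted tree can be dumped as plain if/else rules. scikit-learn, the iris data, and max_depth=2 are my own assumptions, not the paper's.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# the learned tree prints as human-readable feature-threshold rules
print(export_text(tree))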
b) Rule-based classifiers: Here the author talks about the difference between decision trees and rule-based classifiers. The main difference emphasized by the author is that a decision tree evaluates the average quality of a number of disjoint sets, whereas a rule classifier evaluates only the quality of the set of instances covered by the candidate rule.
Another difference is that decision trees use a divide-and-conquer approach while rule-based classifiers use a separate-and-conquer method. For smaller datasets it is always advantageous to use the divide-and-conquer method.
c) Inductive Logic Programming: Some of the characteristics the author emphasizes for Inductive Logic Programming (ILP) are its expressive representation, its ability to make use of logically encoded background knowledge, and the extensive use of ILP in relational data mining.
2) Perceptron-Based Techniques:
The perceptron is described as follows: if x1 to xn are the input features and w1 to wn are the connection weights, the perceptron computes the weighted sum of its inputs, sum_i (xi * wi), and applies a threshold: if the sum is above the threshold, the output is 1, otherwise 0.
The author describes perceptron-based techniques as advantageous when the number of features is high.
Some of the disadvantages are that a single perceptron can only handle linearly separable problems, and that perceptron-like methods are binary, so multi-class problems must be reduced to several binary ones.
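Here is a minimal from-scratch sketch of the perceptron rule described above. The AND-gate data, learning rate, and epoch count are my own assumptions for illustration.

import numpy as np

def perceptron_train(X, y, lr=0.1, epochs=20):
    w = np.zeros(X.shape[1])  # weights w1..wn
    b = 0.0                   # threshold folded in as a bias term
    for _ in range(epochs):
        for x, target in zip(X, y):
            # output 1 if the weighted sum clears the threshold
            pred = 1 if x @ w + b > 0 else 0
            # on a mistake, nudge the weights toward the correct side
            w += lr * (target - pred) * x
            b += lr * (target - pred)
    return w, b

# linearly separable toy data: the AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)
print([1 if x @ w + b > 0 else 0 for x in X])  # expected: [0, 0, 0, 1]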
Neural Networks:
If a straight line can be drawn to separate the input instances into their correct categories, the perceptron will find a solution. If such a line cannot be drawn, a point where all instances are classified properly cannot be reached.
Artificial Neural Networks (ANNs) help solve this problem.
A neural network consists of three types of units: input units, output units, and, in between the input and output units, hidden units.
An ANN depends on three fundamental aspects: the input and activation functions of the units, the network architecture, and the weight of each input connection.
The most common method used to learn the values of the weights is the backpropagation algorithm.
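Below is a minimal NumPy sketch of backpropagation on a tiny 2-4-1 network learning XOR, the classic problem that no straight line can separate. The architecture, squared-error loss, learning rate, and iteration count are my assumptions, not the paper's.

import numpy as np

# XOR is not linearly separable, so a hidden layer is needed
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)  # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)  # hidden -> output
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: push the squared-error gradient back toward the inputs
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient step on every weight and bias
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(out.round(2))  # should approach [[0], [1], [1], [0]]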
The major disadvantage of ANNs is their lack of ability to reason about their output in a way that can be effectively communicated.
3) Statistical Learning Algorithms:
The author describes that, as compared to ANNs, statistical approaches define an explicit probability model that gives the probability that an instance belongs to a particular class. Bayesian networks and instance-based methods belong to this category.
Bayesian Network:
Bayesian Network is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG). For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.
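The disease/symptom computation reduces to Bayes' rule in the simplest two-node case. A minimal sketch with made-up numbers; the 1% prevalence, 90% sensitivity, and 10% false-positive rate are purely illustrative assumptions.

# P(disease | symptom) via Bayes' rule
p_d = 0.01              # prior: assumed 1% prevalence
p_s_given_d = 0.90      # assumed P(symptom | disease)
p_s_given_not_d = 0.10  # assumed P(symptom | no disease)

# total probability of observing the symptom
p_s = p_s_given_d * p_d + p_s_given_not_d * (1 - p_d)
posterior = p_s_given_d * p_d / p_s
print(round(posterior, 3))  # ~0.083: the symptom alone is weak evidence here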
The author explains that the computational difficulty of exploring a previously unknown network is one of the limitations of Bayesian networks.
He also emphasizes a feature of BNs as compared to decision trees or neural networks: the possibility of taking into account prior information about a given problem.
It is also important to note that BNs are not suitable for datasets with many features.
Naive Bayes Classifier:
NB classifiers belong to the family of Bayesian networks, with strong independence assumptions between the features. They are composed of DAGs with a single parent and many children, under the assumption of independence among the child nodes given the parent.
The author also emphasizes that the assumption of independence among the child nodes is almost always wrong, and that this is why NB classifiers are usually less accurate than more sophisticated algorithms like ANNs.
Some of the methods used in research to overcome the independence assumption and improve classifier accuracy are the CR method (where the network has the limitation that each feature can be related to only one other feature) and the Selective Bayesian Classifier (which adds a feature selection stage to improve classification performance by removing irrelevant features).
The author emphasizes that the major advantage of the NB classifier is its short computational time for training.
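That training speed is easy to see in code: fitting a Gaussian NB model only means estimating per-class feature means and variances. A sketch assuming scikit-learn and the iris data; the Gaussian variant is my choice, while the paper discusses NB in general.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# training is one pass to estimate per-class means/variances, hence very fast
nb = GaussianNB().fit(X_tr, y_tr)
print(nb.score(X_te, y_te))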
Instance-based learning:
These are also called lazy-learning algorithms, as they delay the generalization process until classification time. For this reason they require less computational time during the training phase and more during the classification phase.
One of the examples discussed by the author is k-nearest neighbor (kNN). Some of the characteristics of kNN are: a large storage requirement, sensitivity to the choice of similarity function, and the lack of a well-defined method for choosing k, the number of nearest neighbors.
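A minimal kNN sketch that makes the lazy-learning trade-off visible: fit() merely stores the training instances, and the distance computations happen at prediction time. scikit-learn, the iris data, and k=5 are my own assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# fit() just stores the instances (the storage cost noted above);
# all distance computations happen when predict()/score() is called
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print(knn.score(X_te, y_te))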
4) Support Vector Machines
SVMs revolve around the notion of a margin. The author describes that if we consider the hyperplane separating two data classes, maximizing the margin around it helps reduce the upper bound on the expected generalization error.
SVMs are very useful when the number of features is large compared to the number of training instances. The author describes how Sequential Minimal Optimization (SMO) can solve the SVM QP (quadratic programming) problem relatively quickly.
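A short sketch of a maximum-margin classifier. scikit-learn's SVC is backed by libsvm, which uses an SMO-style solver; the breast-cancer dataset, linear kernel, and C=1.0 are my assumptions, not the paper's.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# maximizing the margin corresponds to minimizing ||w|| subject to the
# separation constraints; C trades margin width against training errors
svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0)).fit(X_tr, y_tr)
print(svm.score(X_te, y_te))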
5) Characteristics of Each Algorithm
The author describes the characteristics of each algorithm and gives a point of view on where and why to use each one. I have summarized the highlights below:
SVM & Neural Networks - Multi-dimensions & continuous features.
Logic Based - Discrete/categorical features
SVM - Requires large sample size
KNN - very sensitive to irrelevant features, lot of storage space, require complete set with missing values eliminated.
Neural Networks - presence of irrelevant features can make training inefficient, require complete set with missing values eliminated.
Naive Bayes - High Bias, little storage, robust to missing values.
Decision Tree, Neural Network, SVM - High Variance
At the end, the author summarizes how these machine learning algorithms can be combined to get a better outcome.
Thoroughly enjoyed reading and summarizing the article here in this blog!