Overview of the most popular machine learning algorithms
Machine learning has what is known as the "No Free Lunch" theorem. In essence, it states that no single algorithm is the best choice for every task; this applies in particular to supervised learning.
For example, you cannot say that neural networks always work better than decision trees, or vice versa. An algorithm's performance depends on many factors, such as the size and structure of the data set.
For this reason, you have to try many different algorithms, check the performance of each on a test data set, and then pick the best one. Of course, you should choose among algorithms appropriate to your task. To draw an analogy: when cleaning the house, you will most likely use a vacuum cleaner, a broom, or a mop, but not a shovel.
Machine learning algorithms can be described as learning a target function f that best maps input variables X to an output variable Y: Y = f(X).
We do not know what the function f is. If we did, we would use it directly rather than trying to learn it with the help of various algorithms.
The most common problem in machine learning is predicting Y for new values of X. This is called predictive modeling, and our goal is to make the prediction as accurate as possible.
Below is a brief overview of the top 10 most popular algorithms used in machine learning.
1. Linear regression
Linear regression is perhaps one of the most well-known and understandable algorithms in statistics and machine learning.
Predictive modeling is primarily concerned with minimizing model error, or in other words, making predictions as accurate as possible. We borrow algorithms from different fields, including statistics, and use them to that end.
Linear regression can be represented as an equation describing the straight line that best captures the relationship between the input variables X and the output variable Y. To build this equation, we need to find certain coefficients B for the input variables.
For example: Y = B0 + B1 * X
Knowing X, we must find Y, and the goal of linear regression is to find the values of the coefficients B0 and B1.
Various methods are used to estimate the coefficients of a regression model, such as a linear algebra solution for ordinary least squares.
Linear regression has been around for more than 200 years, and in that time it has been studied thoroughly. So here are a couple of rules of thumb: remove similar (correlated) variables and, if possible, get rid of noise in the data. Linear regression is a fast, simple algorithm that makes a good first algorithm to learn.
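As an illustration, here is a minimal sketch that fits B0 and B1 by least squares with NumPy (the data below is made up for demonstration):

```python
import numpy as np

# Toy data (hypothetical): Y depends roughly linearly on X.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least squares fit of Y = B0 + B1 * X.
# np.polyfit returns coefficients from the highest degree down.
B1, B0 = np.polyfit(X, Y, deg=1)
print(f"Y = {B0:.2f} + {B1:.2f} * X")

# Prediction for a new value of X.
x_new = 6.0
print("prediction:", B0 + B1 * x_new)
```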
2. Logistic regression
Logistic regression is another algorithm that came to machine learning straight from statistics. It works well for binary classification problems (problems with exactly two classes at the output).
Logistic regression is similar to linear regression in that it also requires finding coefficient values for the input variables. The difference is that the output value is transformed using a nonlinear function, the logistic function.
The logistic function looks like a large letter S and converts any value into a number between 0 and 1. This is very useful, because we can apply a rule to the output of the logistic function to snap values to 0 and 1 (for example, if the output is less than 0.5, predict class 0; otherwise class 1) and obtain a class prediction.
Because of the way the model is trained, the predictions of logistic regression can also be read as the probability that a sample belongs to class 0 or 1. This is useful when a prediction needs more justification.
As with linear regression, logistic regression performs better if you remove redundant and correlated variables. Logistic regression models are fast to train and well suited to binary classification problems.
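To make the thresholding concrete, here is a minimal sketch (the coefficients are made up, as if already found by training):

```python
import numpy as np

def sigmoid(z):
    """Logistic (S-shaped) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, as if already estimated from training data.
B0, B1 = -4.0, 1.5

def predict(x, threshold=0.5):
    p = sigmoid(B0 + B1 * x)              # probability of class 1
    return (1 if p >= threshold else 0), p

label, prob = predict(3.5)
print(f"class {label}, P(class=1) = {prob:.2f}")
```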
3. Linear discriminant analysis (LDA)
Logistic regression is used when you need to assign a sample to one of two classes. If there are more than two classes, the LDA algorithm (Linear Discriminant Analysis) is a better choice.
The representation of LDA is quite simple. It consists of statistical properties of the data, calculated for each class. For each input variable, this includes:
- The mean for each class;
- The variance calculated across all classes.
Predictions are made by computing the discriminant value for each class and selecting the class with the largest value. The data is assumed to follow a normal distribution, so it is recommended to remove outliers from the data beforehand. LDA is a simple and effective algorithm for classification problems.
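A minimal sketch using scikit-learn (the library and the toy data are assumptions of this example, not part of the original text):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy three-class data (hypothetical): two input variables per sample.
X = np.array([[1.0, 2.0], [1.2, 1.8], [4.0, 4.2],
              [4.1, 3.9], [7.0, 0.5], [6.8, 0.7]])
y = np.array([0, 0, 1, 1, 2, 2])

lda = LinearDiscriminantAnalysis().fit(X, y)

# The class with the largest discriminant value wins.
print(lda.predict([[4.0, 4.0]]))        # predicted class
print(lda.predict_proba([[4.0, 4.0]]))  # per-class probabilities
```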
4. Decision trees
A decision tree can be represented as a binary tree, familiar to anyone who has studied algorithms and data structures. Each node represents an input variable and a split point for that variable (assuming the variable is numeric).
The leaf nodes contain the output variable used for prediction. Predictions are made by walking the tree down to a leaf node and outputting the class value at that node.
Trees are fast to learn and fast to make predictions. They are also accurate for a wide range of tasks and require no special data preparation.
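A minimal sketch using scikit-learn (assumed here) and its built-in iris data set:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each internal node tests one input variable against a split point;
# each leaf outputs the predicted class.
print(export_text(tree, feature_names=["sepal len", "sepal wid",
                                       "petal len", "petal wid"]))
print(tree.predict(X[:1]))  # prediction for the first sample
```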
5. Naive Bayes classifier
Naive Bayes is a simple but surprisingly effective algorithm.
The model consists of two types of probabilities, calculated from the training data:
- The probability of each class;
- The conditional probability of each input value given each class.
Once computed, the probability model can be used to make predictions on new data using Bayes' theorem. With real-valued data, assuming a normal distribution, these probabilities are not particularly difficult to calculate.
Naive Bayes is called naive because the algorithm assumes that the input variables are independent. This is a strong assumption that rarely holds for real data. Nevertheless, the algorithm is very effective on a number of complex problems such as spam classification or handwritten digit recognition.
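A minimal sketch with scikit-learn's Gaussian Naive Bayes (the library and toy data are assumptions of this example):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data (hypothetical): one input variable, two classes.
X = np.array([[1.0], [1.3], [0.9], [3.8], [4.1], [4.3]])
y = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB().fit(X, y)

# class_prior_ holds P(class); theta_ holds the per-class means that,
# with the per-class variances, define the Gaussian P(x | class).
print("P(class):", model.class_prior_)
print("class means:", model.theta_.ravel())
print(model.predict([[2.0]]), model.predict_proba([[2.0]]))
```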
6. K-nearest neighbors (KNN)
K-nearest neighbors is a very simple yet very effective algorithm. The KNN model is represented by the entire training data set. Simple, isn't it?
A prediction for a new point is made by finding the K nearest neighbors in the data set and summarizing the output variable across those K instances.
The only question is how to measure similarity between data instances. If all features are on the same scale (for example, centimeters), the simplest approach is the Euclidean distance, a number computed from the differences in each input variable.
KNN may require a lot of memory to store all the data, but it makes predictions quickly. The training data can also be updated so that predictions stay accurate over time.
The nearest-neighbors idea can work poorly on high-dimensional data (many input variables), which hurts the algorithm's performance. This is known as the curse of dimensionality. In other words, it is worth using only the variables that matter most for the prediction.
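A minimal from-scratch sketch of KNN classification with Euclidean distance and majority voting (the data is made up):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Find the k nearest neighbors by Euclidean distance, majority vote."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]       # indices of the k closest points
    return np.bincount(y_train[nearest]).argmax()  # most common class wins

# Toy data (hypothetical): two clusters, two classes.
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([6.5, 6.0])))  # -> 1
```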
7. Learning vector quantization (LVQ)
A disadvantage of KNN is that you need to store the entire training data set. If KNN has served you well, it makes sense to try the LVQ algorithm (Learning Vector Quantization), which is free of this shortcoming.
LVQ is a set of code vectors. They are chosen randomly at first and, over a number of iterations, adapted to best summarize the entire data set. Once learned, the code vectors can be used for prediction in the same way as in KNN: the algorithm finds the nearest neighbor (the best-matching code vector) by computing the distance between each code vector and the new data instance, then returns the class (or number, in the case of regression) of that best-matching vector as the prediction. The best results are achieved when all the data lies in the same range, for example from 0 to 1.
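Here is a rough sketch of the classic LVQ1 training rule (every name, parameter, and data point below is illustrative, not a reference implementation):

```python
import numpy as np

def train_lvq(X, y, lr=0.3, epochs=20, seed=0):
    """LVQ1 sketch: pull the winning code vector toward a sample of the
    same class, push it away otherwise."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    # Initialize one code vector per class from random training samples.
    idx = np.array([rng.choice(np.where(y == c)[0]) for c in classes])
    vecs, labels = X[idx].astype(float), y[idx]
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)   # decaying learning rate
        for xi, yi in zip(X, y):
            winner = np.argmin(((vecs - xi) ** 2).sum(axis=1))
            sign = 1 if labels[winner] == yi else -1
            vecs[winner] += sign * rate * (xi - vecs[winner])
    return vecs, labels

def predict_lvq(vecs, labels, x_new):
    return labels[np.argmin(((vecs - x_new) ** 2).sum(axis=1))]

X = np.array([[0.1, 0.1], [0.1, 0.2], [0.8, 0.8], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])
vecs, labels = train_lvq(X, y)
print(predict_lvq(vecs, labels, np.array([0.85, 0.85])))  # likely 1
```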
8. Support vector machines (SVM)
The support vector machine is probably one of the most popular and most discussed machine learning algorithms.
A hyperplane is a surface that divides the space of input variables; in two dimensions, it is a line. In SVM, the hyperplane is chosen to best separate the points in the input space by their class: 0 or 1. In a two-dimensional plane, this can be pictured as a line that completely separates the points of all classes. During training, the algorithm looks for the coefficients that let the hyperplane separate the classes best.
The distance between the hyperplane and the nearest data points is called the margin. The best, or optimal, hyperplane separating the two classes is the one with the largest margin. Only those nearest points matter for defining the hyperplane and building the classifier; they are called support vectors. Special optimization algorithms are used to find the coefficient values that maximize the margin.
Support vector machines are probably among the most effective classical classifiers and are definitely worth your attention.
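A minimal sketch with scikit-learn's linear SVM (the library and toy data are assumptions of this example):

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data (hypothetical) in a two-dimensional plane.
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel searches for the maximum-margin separating hyperplane.
clf = SVC(kernel="linear").fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("prediction for (3, 3):", clf.predict([[3, 3]]))
```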
9. Bagging and random forest
Random forest is a very popular and effective machine learning algorithm. It is a type of ensemble algorithm based on bagging.
Bootstrap is an effective statistical method for estimating a quantity such as the mean. You take many subsamples of your data, compute the mean of each, and then average the results to obtain a better estimate of the true mean.
Bagging uses the same approach, but for estimating entire statistical models, most often decision trees. The training data is split into many samples, and a model is built for each. When a prediction needs to be made, each model makes one, and the predictions are averaged to give a better estimate of the output value.
In the random forest algorithm, decision trees are built for samples of the training data, but when constructing each tree, a random subset of features is considered for creating each node. Individually, the resulting models are not very accurate, but when combined, the quality of the prediction improves significantly.
If a high-variance algorithm such as decision trees performs well on your data, that result can often be improved by applying bagging.
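A minimal sketch comparing plain bagging of trees with a random forest, using scikit-learn (assumed here) and the built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Bagging: many trees, each trained on a bootstrap sample of the data.
bagging = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=100, random_state=0)
# Random forest: bagging plus a random subset of features at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

for name, model in [("bagging", bagging), ("random forest", forest)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```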
10. Boosting and AdaBoost
Boosting is a family of ensemble algorithms whose essence is to build a strong classifier out of several weak ones. To do this, one model is built first, then another model that tries to correct the errors of the first. Models are added until the training data is predicted perfectly or the maximum number of models is reached.
AdaBoost was the first truly successful boosting algorithm developed for binary classification, and it is the best place to start getting acquainted with boosting. Modern methods such as stochastic gradient boosting build on AdaBoost.
AdaBoost is used with short decision trees. After the first tree is created, its performance is checked on every training instance to determine how much attention the next tree should pay to each one. Data that is hard to predict is given more weight, and data that is easy to predict, less. Models are created sequentially, one after another, and each one updates the weights for the next tree. After all the trees are built, predictions are made on new data, and each tree's vote is weighted by how accurate it was on the training data.
Since this algorithm puts so much effort into correcting model errors, it is important that the data be free of outliers.
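A minimal sketch with scikit-learn's AdaBoost (assumed here), whose default weak learners are depth-1 decision trees, i.e. the short trees described above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Weak learners (decision stumps) are added sequentially, each one
# focusing on the samples the previous ones got wrong.
model = AdaBoostClassifier(n_estimators=100, random_state=0)
print("accuracy:", cross_val_score(model, X, y, cv=5).mean())
```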
A few words in conclusion
When newcomers see the whole variety of algorithms, they ask the standard question: "Which one should I use?" The answer depends on many factors:
- The size, quality and nature of the data;
- Available computing time;
- Urgency of the task;
- What you want to do with the data.
Even an experienced data scientist cannot say which algorithm will perform best without trying several options. There are many other machine learning algorithms, but those above are the most popular. If you are just getting acquainted with machine learning, they make a good starting point.