Machine Learning is evolving rapidly these days. The field is so vast and open that everyone has their own view of it. Here is mine. This blog describes my experience with learning algorithms. In it, we will look at the basic difference between Artificial Intelligence, Machine Learning, and Deep Learning. We will also get to know the foundational Machine Learning algorithm, i.e. Univariate Linear Regression.
Intermediate knowledge of Python and its libraries (NumPy, Pandas, Matplotlib) is a good starting point. For mathematics, a little knowledge of algebra, calculus, and graphs of functions will help in understanding the trick behind the algorithm.
A Way to Artificial Intelligence, Machine Learning, and Deep Learning
These are three buzzwords of today's Internet world, and in them we are seeing the future of programming. Specifically, this is the place where science meets programming: we combine scientific concepts and mathematics with a programming language to simulate the decision-making process. Artificial Intelligence is the ability of a machine or program to make decisions more as humans do. Machine Learning is a subset of Artificial Intelligence. It helps the machine observe patterns in data and learn from them in order to make a decision. Here the programming goes into observing the patterns, not into hard-coding the decisions. Machine Learning generally needs more and more data from various sources, covering the variables of a given pattern, to make more accurate decisions. Deep Learning is in turn a subset of Machine Learning: it learns those patterns using neural networks with many layers.
What is Machine Learning
Definition: Machine Learning gives machines the ability to learn autonomously from experience, observation, and the analysis of patterns within a given dataset, without being explicitly programmed.
This is a two-part process. In the first part, we observe and analyse the patterns in the given data and make a shrewd guess at a mathematical function that will be very close to the pattern. There are various kinds of function for this, for example linear, non-linear, and logistic. We then build the error function from the guessed mathematical function and the given data. In the second part, we minimize the error function. The guessed function, with the parameters that minimize the error, is then used for prediction.
Here are the general steps to understand the process of Machine Learning:
- Plot the given dataset on the x-y axes.
- By looking at the graph, guess a mathematical function that closely follows the data.
- Derive the error function from the given dataset and the guessed mathematical function.
- Minimize the error function using an optimization algorithm.
- The parameters that minimize the error function give us a more accurate mathematical function for the given pattern.
Getting Started with the First Algorithm: Linear Regression with One Variable
Linear Regression is a very basic algorithm; we can call it the first and foundational algorithm for understanding the concepts of ML. We will work through it with an example: a dataset of plot prices for given areas.
With this data, we can easily look up the price of a plot whose area is listed. But what if we want the price of a plot with an area of 5.0 * 10 sq m? There is no direct price for this in our dataset. So how can we get the price of a plot whose area is not given in the dataset? We can do this using Linear Regression.
So first, we will plot this data on a graph.
The graph below shows the area of the plots (in units of 10 sq m) on the x-axis and their prices (in lakhs INR) on the y-axis.
Definition of Linear Regression
The objective of a linear regression model is to find a relationship between one or more features (independent variables) and a continuous target variable (dependent variable). When there is only one feature, it is called Univariate Linear Regression, and when there are multiple features, it is called Multiple Linear Regression.
Here we will try to find the relation between the price and the area of plots. As this is a univariate example, the price depends only on the area of the plot.
By observing this pattern, we can write our hypothesis function as below:
f(x) = w * x + b
where w is the weight and b is the bias.
Different values of (w, b) give different lines, but for one particular pair of values the line will be closest to this pattern.
When we generalize this function to multiple variables, there is a set of w values, one per feature. These constants, together with b, are also termed model parameters.
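As a minimal sketch of the hypothesis function in Python: the function names `predict` and `predict_multi` and all the numbers below are made up for illustration and do not come from the dataset used in this blog.

```python
import numpy as np

def predict(x, w, b):
    """Univariate hypothesis f(x) = w * x + b."""
    return w * x + b

def predict_multi(X, w, b):
    """Multivariable hypothesis: one weight per feature (column of X)."""
    return X @ w + b

# Illustrative areas (made-up values, not the blog's dataset).
areas = np.array([1.0, 2.0, 3.0])

# Two different (w, b) pairs give two different candidate lines.
line_a = predict(areas, w=2.0, b=1.0)   # [3.0, 5.0, 7.0]
line_b = predict(areas, w=0.5, b=4.0)   # [4.5, 5.0, 5.5]

# Two features per row, hence two weights.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
multi = predict_multi(X, w=np.array([1.0, 0.5]), b=1.0)  # [3.0, 6.0]
```

Each (w, b) pair is one candidate line; the rest of the process is about choosing the pair that fits the data best.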
Note: A range of mathematical functions could relate to this pattern, and the choice of function is entirely up to us. The points to take care of are that the function should neither underfit nor overfit the data, and that it should be continuous so that we can easily differentiate it and look for a global minimum.
Error for a point
As our hypothesis function is continuous, for every area point $x_i$ there is one predicted price $f(x_i)$, while $y_i$ is the actual price at that point.
So the error at any point is
$$E_i = f(x_i) - y_i$$
These errors are also called residuals. A residual can be positive (if the actual point lies below the predicted line) or negative (if the actual point lies above the predicted line). Our aim is to make these residuals as small as possible across all the points.
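The sign convention can be checked with a small sketch. The helper `residuals` and the three data points are assumptions made up for illustration; the residual is taken as predicted price minus actual price, as above.

```python
import numpy as np

def residuals(x, y, w, b):
    """Residual at each point: predicted price minus actual price."""
    return (w * x + b) - y

# Made-up points for illustration only.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.5, 5.0, 8.0])

e = residuals(x, y, w=2.0, b=1.0)
# Predictions are [3, 5, 7], so the residuals are [0.5, 0.0, -1.0]:
# positive where the actual point is below the line, negative where it is above.
```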
Note: While observing the pattern, we may find that a few points lie very far from it. For these far points the residuals will be much larger. If such points are few in number, we can ignore them on the assumption that they are errors in the dataset. Such points are termed outliers.
As there are m training points, we can calculate the average energy function as below:
$$E(w, b) = \frac{1}{m} \sum_{i=1}^{m} E_i$$
Our aim is to find the point (w, b) that minimizes the energy function:
$$\min_{w, b} E(w, b)$$
Little Calculus: For any differentiable function, the points where the first derivative is zero are the candidates for minima or maxima. If the second derivative at such a point is negative, it is a maximum, and if it is positive, it is a minimum.
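As a small worked example of this rule (the function g is made up for illustration):

$$g(x) = x^2 - 4x + 7, \qquad g'(x) = 2x - 4, \qquad g''(x) = 2$$

Setting $g'(x) = 0$ gives $x = 2$. Since $g''(2) = 2 > 0$, this point is a minimum, and the minimum value is $g(2) = 4 - 8 + 7 = 3$.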
Here is the trick: we turn our energy function into an upward-opening parabola by squaring the error at each point. Squaring also stops positive and negative residuals from cancelling each other out. It ensures that our energy function has only one global minimum (the point of our concern). This simplifies the calculation: the point where the first derivative of the energy function is zero is the point we need, and the value of (w, b) at that point is our required solution.
So our final energy function is
$$E(w, b) = \frac{1}{2m} \sum_{i=1}^{m} (E_i)^2$$
Dividing by 2 doesn't affect the location of the minimum, and it cancels out when we differentiate, since, for example, the first derivative of $x^2$ is $2x$.
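The squared-error energy function can be sketched in a few lines of NumPy. The function name `energy` and the data points are assumptions for illustration; the points are chosen to lie exactly on the line y = 2x + 1 so the result is easy to verify by hand.

```python
import numpy as np

def energy(x, y, w, b):
    """Squared-error energy E(w, b) = 1/(2m) * sum of squared residuals."""
    m = len(x)
    err = (w * x + b) - y
    return np.sum(err ** 2) / (2 * m)

# Made-up points lying exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])

perfect = energy(x, y, w=2.0, b=1.0)   # 0.0, the line passes through every point
worse = energy(x, y, w=1.0, b=1.0)     # residuals [-1, -2, -3] give 14 / 6
```

The energy is zero only for the (w, b) that fits every point, and grows as the line moves away from the data.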
Gradient Descent Method
Gradient descent is a generic optimization algorithm. It iteratively adjusts the parameters of the model, step by step, in order to minimize the energy function.
In the above picture, we can see on the right side:
- w0 and w1 are the random initialization; by following gradient descent, the point moves towards the global minimum.
- The number of turns of the black line is the number of iterations, so it should be neither too many nor too few.
- The distance between the turns is governed by alpha, i.e. the learning rate (learning parameter).
By solving the equation on the left side, we get the model parameters at the global minimum of the energy function.
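The equation from the picture is not reproduced here. Assuming the squared-error energy function defined earlier and the hypothesis f(x) = w * x + b, the gradient descent update would typically be written as follows, with $\alpha$ as the learning rate:

$$\frac{\partial E}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \bigl(f(x_i) - y_i\bigr)\, x_i, \qquad \frac{\partial E}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \bigl(f(x_i) - y_i\bigr)$$

$$w \leftarrow w - \alpha \frac{\partial E}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial E}{\partial b}$$

The factor 2 produced by differentiating the square cancels against the 1/2 in the energy function, which is why only 1/m remains.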
Points to consider at the time of Gradient Descent calculations:
- Random initialization: We start the algorithm at a random point, that is, a random (w, b) value. At each step, the algorithm decides in which direction the next trial has to be taken. As we know the energy function is an upward-opening parabola, moving in the right direction (towards the global minimum) gives a smaller value than at the previous point.
- Number of iterations: The number of iterations should be neither too high nor too low. If it is too low, we will not reach the global minimum, and if it is too high, we spend extra calculations around the global minimum.
- Alpha as the learning rate: When alpha is too small, gradient descent is slow, as it takes many small steps to reach the global minimum. If alpha is too big, it may overshoot the global minimum. In that case it may keep jumping from one side of the minimum to the other without converging, or even diverge.
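The effect of alpha can be seen on a toy problem. This sketch does not use the blog's energy function; it assumes the simplest upward-opening parabola, E(w) = w**2, whose gradient is 2 * w and whose global minimum is at w = 0. The function name `descend` and the alpha values are chosen only for illustration.

```python
def descend(alpha, steps=20, w=1.0):
    """Run gradient descent on E(w) = w**2, whose gradient is 2 * w."""
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

slow = descend(alpha=0.01)   # still far from 0 after 20 steps
good = descend(alpha=0.4)    # very close to 0
stuck = descend(alpha=1.0)   # jumps between 1 and -1, never settles
blown = descend(alpha=1.1)   # magnitude grows with every step
```

The same starting point and the same number of iterations give four very different outcomes, which is why alpha usually needs some experimentation.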
Implementation of Gradient Descent in Python
This is a basic implementation of the gradient descent algorithm using NumPy and Pandas. It reads the area-price.csv file. Here we normalize the x-axis values for better readability of the data points on the graph. We have taken (w, b) as (0.1, 0.1) for our random initialization, 100 as the number of iterations, and 0.001 as the learning rate.
In every iteration, we calculate the w and b values and watch how quickly they converge.
We can repeat this calculation of (w, b) for different initial values, numbers of iterations, and learning rates (alpha).
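A minimal sketch of such an implementation is given here; it is not the original code. The function name `gradient_descent`, the stand-in data, and the min-max normalization are assumptions. With the real file, one might load the data with `pandas.read_csv("area-price.csv")`, but its column names are not given here, so the sketch uses made-up arrays instead.

```python
import numpy as np

def gradient_descent(x, y, w=0.1, b=0.1, alpha=0.001, iterations=100):
    """Fit f(x) = w * x + b by gradient descent on the squared-error energy."""
    m = len(x)
    history = []
    for _ in range(iterations):
        err = (w * x + b) - y              # residual at every point
        grad_w = np.sum(err * x) / m       # partial derivative dE/dw
        grad_b = np.sum(err) / m           # partial derivative dE/db
        w -= alpha * grad_w
        b -= alpha * grad_b
        # Record the energy after the update to watch the convergence.
        history.append(np.sum(((w * x + b) - y) ** 2) / (2 * m))
    return w, b, history

# Made-up stand-in data (not the contents of area-price.csv).
area = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
price = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Min-max normalization of the area to [0, 1]; the exact normalization
# used in the original code is not stated, so this is an assumption.
x = (area - area.min()) / (area.max() - area.min())

w, b, history = gradient_descent(x, price)
```

With a learning rate of 0.001 and only 100 iterations, (w, b) moves just a little from (0.1, 0.1), although the energy in `history` falls steadily. Raising either value brings the result closer to the best-fit line.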
Note: Another Python library, TensorFlow, is often preferred for such calculations, and it has built-in gradient descent optimizers. But for better understanding, we have used the NumPy and Pandas libraries here.
RMSE (Root Mean Square Error)
RMSE: This is a way to check how accurate our calculated (w, b) is. The basic formula for RMSE is
$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \bigl(f(x_i) - y_i\bigr)^2}$$
where $f(x_i)$ is the predicted value and $y_i$ is the observed value.
Note: There is no absolute good or bad threshold for RMSE; we have to judge it against the range of our observed values. For observed values ranging from 0 to 1000, an RMSE of 0.7 is small, but if the range is 0 to 1, it is not that small.
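RMSE takes only a few lines with NumPy. The function name `rmse` and the sample values are made up for illustration.

```python
import numpy as np

def rmse(predicted, observed):
    """Root mean square error between predicted and observed values."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return np.sqrt(np.mean((predicted - observed) ** 2))

# Made-up values for illustration.
error = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
# Squared differences are [1, 0, 4], their mean is 5/3, so error = sqrt(5/3).
```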
In this article, we have seen a little introduction to Machine Learning and the need for it. Then, with the help of a very basic example, we learned about one of the basic learning algorithms, i.e. Linear Regression (univariate only). This can be generalized to the multivariate case as well. We then used the Gradient Descent method to calculate the model parameters in Linear Regression. We also learned the basic flow of Gradient Descent. Finally, there is one example in Python demonstrating Linear Regression via Gradient Descent.