“Machine learning” is the kind of tech buzzword that is both alluring and vague. From tech giants like Google and Facebook to just-off-the-ground startups, machine learning seems to be everywhere. But what does it actually entail? And what is the best way to get some practical experience with this powerful technology?
This post summarizes my introduction to the very basics of machine learning. It also represents my minimum viable product of learning, so to speak, and I hope it encourages others with little experience by showing that the subject can in fact be approachable.
Overview: Supervised Learning with Linear Regression
Machine learning is a tool that can give us insight into large datasets. But its real power comes from being able to process data and then make predictions and decisions based on data it has previously processed. Boiled down to its simplest form, when we talk about machine learning we are asking the computer this question, “Given this set of data, what can you tell me about a new data point that you have not yet seen?” In this blog post, I will talk about supervised learning, in which we train the machine to mimic and extend a dataset.
What does it mean to mimic and extend a dataset?
Let’s look at the simplest example, a line of best fit. If we have a dataset, say data about the relationship between an apartment’s square footage and its rent price, we can plot our data and draw a line that describes the data in the most accurate way. (For the sake of this example, let’s forget about other factors like location.)
We can say the best fit line mimics the data because it describes the data trend, but it doesn’t map exactly to the points we plotted. We can say it extends the data because it goes beyond the points we plotted. Given a new x value, say a rent of $1800, we could use the line to predict the square footage of that apartment.
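To make "extending" concrete, here is a minimal sketch of asking a line about a point it has never seen. The intercept and slope values are made up for illustration and do not come from any real rental data.

```python
def predict(intercept, slope, rent):
    """Return the square footage the line predicts for a given rent."""
    return intercept + slope * rent

# Hypothetical line: 100 sq ft baseline plus 0.5 sq ft per dollar of rent.
intercept, slope = 100.0, 0.5

# Ask the line about a rent that may not appear anywhere in the dataset.
print(predict(intercept, slope, 1800))  # prints 1000.0
```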
These ideas of mimicking and extending are fundamental to machine learning. Whether through a line or something more complicated, we want to build a model of our data that will allow us to make predictions about new, related data. And we give up some accuracy of mimicking in order to gain some accuracy in extending.
Training a Machine to Find the Best Fit Line
For linear regression, these are the steps to train the machine to find the best fit line:
- Start with a dataset and a random line. This line will be the baseline from which the computer begins adjusting.
- Compare the random line to the dataset. How far off from the dataset is our line? In other words, what is the cost of the line?
- Update the line to be a slightly better fit. Assess how bad that line is.
- Keep doing this process of updating and assessing until we have a line that pretty accurately describes the data, or at least describes it least-badly.
That’s it! We are training it to mimic the data. In more complex and powerful forms of machine learning, the mathematics and complexity of the algorithm may vary, but these steps of assessing and adjusting are still at the heart of the learning process.
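The four steps above can be sketched in a few lines of Python. This is only a rough illustration under my own assumptions: the data is made up (rent in thousands of dollars, area in hundreds of square feet, scaled so the numbers stay small), the cost is the mean squared error, and the step size of 0.1 and the 2000 repetitions were picked by hand.

```python
import random

# Made-up toy data: rent in thousands of dollars (x), area in hundreds of sq ft (y).
rents = [0.8, 1.0, 1.2, 1.5, 1.8]
areas = [6.0, 7.0, 8.5, 10.0, 12.0]

def cost(theta0, theta1):
    """Mean squared vertical distance between the line and the data."""
    errors = [theta0 + theta1 * x - y for x, y in zip(rents, areas)]
    return sum(e * e for e in errors) / len(errors)

def step(theta0, theta1, learning_rate):
    """Nudge the line slightly in the direction that lowers the cost."""
    n = len(rents)
    errors = [theta0 + theta1 * x - y for x, y in zip(rents, areas)]
    grad0 = 2 / n * sum(errors)
    grad1 = 2 / n * sum(e * x for e, x in zip(errors, rents))
    return theta0 - learning_rate * grad0, theta1 - learning_rate * grad1

random.seed(0)  # fixed seed so the run is repeatable
# Step 1: start with a random line.
theta0, theta1 = random.uniform(-1, 1), random.uniform(-1, 1)
# Step 2: assess how bad it is.
start_cost = cost(theta0, theta1)
# Steps 3 and 4: update and assess, over and over.
for _ in range(2000):
    theta0, theta1 = step(theta0, theta1, learning_rate=0.1)
final_cost = cost(theta0, theta1)

print(f"cost went from {start_cost:.2f} to {final_cost:.4f}")
```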
Now, let’s go into more detail about these steps.
1. Generate a Hypothesis Line
This is how the machine starts learning: a random guess. We need to start somewhere in order to improve. For linear regression, we can randomly generate a slope and a y-intercept, and we have our function: y = mx + b, where m is the slope and b is the y-intercept. In machine learning jargon, we’ll refer to the y-intercept as theta-0 and the slope as theta-1.
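A random starting line might be generated like this. The function names and the range of -1 to 1 are my own choices for the sketch; any reasonable starting range would do.

```python
import random

def random_line(low=-1.0, high=1.0):
    """Return a random (theta0, theta1) pair: y-intercept and slope."""
    theta0 = random.uniform(low, high)
    theta1 = random.uniform(low, high)
    return theta0, theta1

def hypothesis(theta0, theta1, x):
    """Evaluate the line y = theta1 * x + theta0 at x."""
    return theta1 * x + theta0

theta0, theta1 = random_line()
print(f"starting line: y = {theta1:.3f}x + {theta0:.3f}")
```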
2. Calculate Cost
For this random line we generated, we now want to see how far off it is from our dataset. We want the cost of the line. For every x value in our dataset, we can check the distance between the line and our datapoint. That is, we can find the difference between the y values. To get the cost of the line, we add up the sizes of all these differences in the y values, typically by squaring each one so that points above and below the line do not cancel each other out. (There is more complex math involved in this calculation, but for the sake of this post, I will skip over those details.)
This line is a pretty good fit for the data. The distance between every data point and the line is relatively small.
This hypothesis line is a pretty bad fit for the data. For an x value of $800, our data says that the square footage should be around 600 square feet, but the hypothesis line says that the square footage is closer to 800 square feet.
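Here is one way the cost could be computed. I am assuming the mean squared error, a common choice that this post does not spell out. The single data point comes from the example above ($800 rent, about 600 square feet), and both lines are hypothetical.

```python
def cost(theta0, theta1, xs, ys):
    """Mean squared error: the average squared gap between line and data."""
    total = 0.0
    for x, y in zip(xs, ys):
        predicted = theta0 + theta1 * x
        total += (predicted - y) ** 2
    return total / len(xs)

# One data point from the text: $800 rent, about 600 square feet.
xs, ys = [800.0], [600.0]

# A hypothetical bad line that predicts 800 sq ft at $800 (off by 200).
print(cost(0.0, 1.0, xs, ys))   # prints 40000.0

# A hypothetical better line that predicts exactly 600 sq ft at $800.
print(cost(0.0, 0.75, xs, ys))  # prints 0.0
```

Squaring means a line that is off by 200 square feet is punished much more than twice as hard as a line that is off by 100.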
3. Update the Line
We want to update our theta-0 and theta-1 so that they create a line that describes the data less badly. Calculus is our friend here. We can take the partial derivative of cost with respect to theta-0 and the partial derivative of cost with respect to theta-1. Together, these two calculations form a vector that describes a gradient of cost. The cost function can be represented as a three-dimensional surface with high points and low points.
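For readers who want to see the calculus written out, here is a sketch assuming the cost J is the mean squared error over n data points (x_i, y_i). The post does not name an exact cost formula, so treat this as one common choice and not the only one.

```latex
J(\theta_0, \theta_1) = \frac{1}{n} \sum_{i=1}^{n} \left(\theta_0 + \theta_1 x_i - y_i\right)^2

\frac{\partial J}{\partial \theta_0} = \frac{2}{n} \sum_{i=1}^{n} \left(\theta_0 + \theta_1 x_i - y_i\right)

\frac{\partial J}{\partial \theta_1} = \frac{2}{n} \sum_{i=1}^{n} \left(\theta_0 + \theta_1 x_i - y_i\right) x_i

\nabla J = \left( \frac{\partial J}{\partial \theta_0},\; \frac{\partial J}{\partial \theta_1} \right)
```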
We are trying to find the point of lowest cost, the minimum on the surface. The gradient vector points in the direction of steepest increase, so its negative points in the direction of maximal decrease. Finding a less bad line means incrementally moving in that downhill direction towards the point of minimum cost, updating theta-0 and theta-1 along the way.
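A single update might look like the sketch below. It assumes a mean squared error cost, and `gradient`, `update`, and `step_size` are names I made up for illustration. The two data points are also invented.

```python
def gradient(theta0, theta1, xs, ys):
    """Partial derivatives of the mean squared error cost."""
    n = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    d_theta0 = 2 / n * sum(errors)
    d_theta1 = 2 / n * sum(e * x for e, x in zip(errors, xs))
    return d_theta0, d_theta1

def update(theta0, theta1, xs, ys, step_size):
    """Move both parameters a little way against the gradient (downhill)."""
    d_theta0, d_theta1 = gradient(theta0, theta1, xs, ys)
    return theta0 - step_size * d_theta0, theta1 - step_size * d_theta1

# Invented data lying exactly on y = 2x, with a starting line of y = 0.
xs, ys = [1.0, 2.0], [2.0, 4.0]
print(gradient(0.0, 0.0, xs, ys))  # prints (-6.0, -10.0)
print(update(0.0, 0.0, xs, ys, step_size=0.1))
```

Both partial derivatives are negative here, so subtracting them raises the intercept and the slope, which is the direction the line needs to move.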
Factors that Influence Finding Minimum Cost
Each time we assess the cost of the line and update the line to have a lower cost, we move closer to the minimum point on the cost surface. There are some additional factors that influence if and how we are able to get to this minimum.
Learning rate: Learning rate determines how much we update the line every time. If we imagine ourselves walking along the cost surface, learning rate is what tells us how far to move. When we find the direction of maximal decrease, we then walk along the surface in that direction for as long as the learning rate tells us. A large learning rate may tell us to walk a mile while a small learning rate may tell us to walk five feet.
Learning rate affects the granularity of the learning; in other words, it influences how quickly we reach the minimum. Learning rate can also in some cases affect our ability to ever reach the minimum. If the learning rate is very large, we may walk in the direction of the minimum and keep walking past it. Again and again we will walk past the minimum and never hit it. In practice, it might be most effective to start out with a larger learning rate for efficiency and then gradually decrease the rate as we approach the minimum.
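The effect of the learning rate can be seen in a small experiment. Everything here is an assumption made for illustration: the toy data is invented (rent in thousands of dollars, area in hundreds of square feet), the cost is the mean squared error, and the three rates were chosen by hand to show a too-small, a reasonable, and a too-large setting for this particular data.

```python
# Made-up toy data: rent in thousands of dollars (x), area in hundreds of sq ft (y).
rents = [0.8, 1.0, 1.2, 1.5, 1.8]
areas = [6.0, 7.0, 8.5, 10.0, 12.0]

def cost(theta0, theta1):
    """Mean squared error of the line on the toy data."""
    errors = [theta0 + theta1 * x - y for x, y in zip(rents, areas)]
    return sum(e * e for e in errors) / len(errors)

def run(learning_rate, steps=50):
    """Take a fixed number of downhill steps from the line y = 0."""
    theta0, theta1 = 0.0, 0.0
    n = len(rents)
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(rents, areas)]
        grad0 = 2 / n * sum(errors)
        grad1 = 2 / n * sum(e * x for e, x in zip(errors, rents))
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return cost(theta0, theta1)

print("starting cost:", cost(0.0, 0.0))
for rate in (0.001, 0.1, 0.5):
    print(f"learning rate {rate}: cost after 50 steps = {run(rate):.4g}")
```

With this data, the smallest rate is still far from the minimum after 50 steps, the middle rate gets close, and the largest rate overshoots on every step so the cost grows instead of shrinking.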
Cost threshold: Sometimes it may be impossible to find the point of minimum cost. Whether because of our learning rate or because a straight line is a particularly bad fit for the data, it may be enough to get to a point of relatively small cost, even if it is not the true minimum. We can set a cost threshold that is our “good enough” number. Once the cost hits this threshold, we can be satisfied that our line of best fit is good enough to reasonably mimic our data.
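A cost threshold can be added to the training loop as a stopping condition. In this sketch the data is invented, the cost is assumed to be the mean squared error, and `train`, `cost_threshold`, and `max_steps` are hypothetical names. The `max_steps` cap is my own addition so the loop still ends if the threshold can never be reached.

```python
def train(xs, ys, learning_rate=0.1, cost_threshold=0.05, max_steps=10_000):
    """Update the line until its cost is good enough or we run out of steps.

    Returns (theta0, theta1, final_cost, steps_taken).
    """
    theta0, theta1 = 0.0, 0.0
    n = len(xs)
    for steps_taken in range(max_steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        current_cost = sum(e * e for e in errors) / n
        if current_cost <= cost_threshold:
            # Good enough: stop here instead of chasing the true minimum.
            return theta0, theta1, current_cost, steps_taken
        grad0 = 2 / n * sum(errors)
        grad1 = 2 / n * sum(e * x for e, x in zip(errors, xs))
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    # Ran out of steps: report the cost of the line we ended up with.
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    final_cost = sum(e * e for e in errors) / n
    return theta0, theta1, final_cost, max_steps

# Made-up toy data: rent in thousands of dollars, area in hundreds of sq ft.
rents = [0.8, 1.0, 1.2, 1.5, 1.8]
areas = [6.0, 7.0, 8.5, 10.0, 12.0]

theta0, theta1, final_cost, steps_taken = train(rents, areas)
print(f"stopped after {steps_taken} steps with cost {final_cost:.4f}")
```

If the threshold is set lower than the cost of the best possible straight line, the loop simply runs until `max_steps`, which is one sign that the threshold (or the straight-line model) needs rethinking.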
Summary and Takeaways
So after all this line-drawing and gradient descending and learning-rating, where did we end up? Remember that this process of assessing and adjusting was so that we could model our data relatively well with a line, and so that we could ask the line questions about new data.
Linear regression is a simple machine learning algorithm that is a great introduction to the technical details of what it means for a machine to learn. It is by no means a very nuanced algorithm, and it is probably overkill in most cases to use machine learning to find a line of best fit, when statistical analysis can accomplish the same thing. But as it is with learning anything technically complex, a simple jumping-off point is the perfect place to start.