The Significance of Data in Machine Learning

Machine learning is generating a lot of excitement among developers these days. It is used extensively not only to build innovative, smarter solutions but also to improve the performance of existing ones, making them more valuable. Machine learning is a subset of Artificial Intelligence, which is the overarching field. The difference between these technologies is explained here: https://alphabold.com/artificial-intelligence-machine-learning-and-neural-networks/

The information available on machine learning can be overwhelming and complex, but the idea here is to keep things simple.

Let’s start by defining machine learning as:

“The ability of a machine to learn from experience without being explicitly programmed.”

Machine learning is about making machines more independent by giving them a mechanism to learn from data, so they can improve over time.

To get a better understanding of machine learning, let’s compare it to the traditional way of programming. The following workflow depicts how a traditional logic-driven program works:

[Figure: workflow of a traditional logic-driven program]

Here we see that the program is built upon logic. We give it input data and, based on that logic, the program produces an output. The important thing to note is that the program will always produce the same output for a given input.
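As a minimal sketch of the logic-driven approach, the hypothetical function below hard-codes a rule for deciding whether to carry an umbrella. The function name, the inputs and the thresholds are illustrative assumptions; the point is only that the rule is written by the programmer and never changes.

```python
def should_carry_umbrella(humidity_percent: float, cloud_cover_percent: float) -> bool:
    """Hard-coded rule: the thresholds are chosen by the programmer, not learned."""
    return humidity_percent > 80 and cloud_cover_percent > 60


# Calling the function twice with the same input gives the same output.
first_call = should_carry_umbrella(85, 70)
second_call = should_carry_umbrella(85, 70)
print(first_call, second_call)
```

If the rule turns out to be wrong, someone has to edit the thresholds by hand; the program does not adjust itself.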

By comparison, in the machine learning approach, data is even more central to the solution than logic is to a traditional program.

[Figure: workflow of the machine learning approach]

An ML program is trained on data rather than on hard-coded logic, which allows ML algorithms to learn over time, mimicking human learning behavior. Suppose you have a dataset of weather information for the last 10 years and you train your ML program on it. After training, you give the program input data and it produces an output based on what it has learned. By comparing this output with what actually happened, you can calculate the accuracy of your algorithm and feed the results back into the training data. This enlarges the dataset, so the program can improve its output the next time it is trained.
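The train, measure, feed back loop described above can be sketched in a few lines. Everything here is an assumption made for illustration: the tiny (humidity, rained) records are invented, and the "model" is just a single learned humidity threshold rather than a real ML algorithm.

```python
def train_threshold(records):
    """Pick the humidity threshold that classifies the training records best."""
    best_threshold, best_correct = None, -1
    for candidate in sorted({humidity for humidity, _ in records}):
        correct = sum((humidity >= candidate) == rained for humidity, rained in records)
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


def accuracy(threshold, records):
    """Fraction of records for which 'humidity >= threshold' matches the outcome."""
    correct = sum((humidity >= threshold) == rained for humidity, rained in records)
    return correct / len(records)


# Hypothetical (humidity %, did it rain?) records.
training_data = [(30, False), (45, False), (60, False), (75, True), (90, True)]
feedback_data = [(55, False), (65, True), (70, True), (85, True)]
test_data = [(50, False), (68, True), (72, True), (95, True)]

# Train once and measure accuracy on data the model has not seen.
threshold_before = train_threshold(training_data)
accuracy_before = accuracy(threshold_before, test_data)

# Feed newly observed, labeled results back into the training data and retrain.
training_data.extend(feedback_data)
threshold_after = train_threshold(training_data)
accuracy_after = accuracy(threshold_after, test_data)

print(threshold_before, accuracy_before)
print(threshold_after, accuracy_after)
```

In this toy data the extra records move the learned threshold and the accuracy on the held-out test records goes up. On real data, more records usually help but are not guaranteed to.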

Machine learning can be divided into three subcategories:

  1. Supervised machine learning
  2. Unsupervised machine learning
  3. Reinforcement learning

We have separate content coming up on reinforcement learning, so here our focus is on the other two types: supervised and unsupervised machine learning.

Supervised Machine Learning

Supervised machine learning can be viewed from two perspectives, and I will touch on both to help you understand the concept.

Let’s first look at machine learning through some math. We all know that a function in math is written as:

y = f(x)

Here ‘x’ is the independent variable and ‘y’ is the dependent variable: given the input value ‘x’, the function ‘f’ produces an output ‘y’. This function is a set of pre-defined steps performed on the input value x to produce the output y. In the supervised machine learning approach, this function is replaced by a statistical model that starts out empty (untrained). We feed this model a dataset that contains both input and output values, and the model learns from this data and builds an input-output relationship internally. After this, we give it the actual data, which consists only of input values ‘x’, and the model predicts the output value ‘y’ based on the input-output relationship it developed previously.

Let’s consider that we have the following dataset:

Input (x)    Output (y)
2            4
3            9
6            36
7            49
9            81

We feed this data to our model. The model learns from these values using statistics and builds an input-output relationship. After this training, suppose we feed the following data to our model:

Input (x)
5
12

The model should then return the following output values ‘y’:

Input (x)    Output (y)
5            25
12           144

(Note: The model will produce the correct output only if it has learned well and mapped the input-output function correctly; otherwise it will produce wrong outputs, which can then be improved over time.)
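To make the example concrete, here is a small sketch that learns the relationship from the five training pairs and then predicts outputs for 5 and 12. It assumes the model has the form y = a * x**b and fits a and b by least squares on the logarithms of the data; that model form is an assumption chosen for this illustration, not something the dataset dictates.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b with a least-squares line through (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    covariance = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    variance = sum((lx - mean_x) ** 2 for lx in log_x)
    b = covariance / variance            # slope of the fitted line
    a = math.exp(mean_y - b * mean_x)    # exp of the fitted intercept
    return a, b


# Training data from the table above: inputs with their known outputs.
train_x = [2, 3, 6, 7, 9]
train_y = [4, 9, 36, 49, 81]
a, b = fit_power_law(train_x, train_y)


def predict(x):
    """Predict y for a new input using the learned parameters."""
    return a * x ** b


for new_x in [5, 12]:
    print(new_x, round(predict(new_x), 2))
```

The fitted parameters come out very close to a = 1 and b = 2, so the predictions land very close to 25 and 144. The code never contains the rule "square the input"; it recovers it from the examples.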

The input and output in math are referred to as features and label in machine learning. A feature is a column in your input dataset (there can be one or more), and the label is the output that you are trying to predict based on those features.

[Figure: example dataset with features and label]

In the above dataset, suppose we are trying to predict the car that a person is going to buy based on the person’s number of family members, age and salary. Family members, age and salary are the features, and the car, which is the final choice or result, is the label.

The model is trained on a dataset containing both the features and the label. Based on the dataset, the model tries to map the input-output function. When you provide another input containing only features, the model predicts the corresponding output, that is, the label.
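The sketch below illustrates features and labels with a very simple classifier: it predicts the label of the most similar training row (a 1-nearest-neighbor approach). The rows, the car categories and the function names are hypothetical and invented for this example; they are not the values from the figure.

```python
# Hypothetical training rows: (family members, age, salary) -> car bought.
training_rows = [
    ((2, 25, 40000), "hatchback"),
    ((1, 30, 90000), "sports car"),
    ((5, 40, 70000), "minivan"),
    ((4, 38, 65000), "minivan"),
    ((2, 27, 45000), "hatchback"),
    ((1, 35, 110000), "sports car"),
]


def column_ranges(rows):
    """Return (min, max) for every feature column so features can be rescaled."""
    columns = zip(*(features for features, _ in rows))
    return [(min(column), max(column)) for column in columns]


def rescale(features, ranges):
    """Rescale each feature using the training min and max, so salary does not dominate."""
    # Assumes every column has at least two distinct values (high != low).
    return [(value - low) / (high - low) for value, (low, high) in zip(features, ranges)]


def predict_label(features, rows):
    """Return the label of the training row closest to the given features."""
    ranges = column_ranges(rows)
    target = rescale(features, ranges)

    def squared_distance(row):
        scaled = rescale(row[0], ranges)
        return sum((a - b) ** 2 for a, b in zip(scaled, target))

    nearest_row = min(rows, key=squared_distance)
    return nearest_row[1]


# A new person described only by features; the label is what we want to predict.
print(predict_label((5, 42, 72000), training_rows))
```

Here the new person most resembles the large-family rows, so the predicted label is "minivan". A production model would use a tested library and far more data, but the roles of features and label are the same.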

Unsupervised Machine Learning

Unsupervised machine learning is used when we don’t have labeled data; in other words, we don’t know the output. In unsupervised ML, you feed a bunch of data to the model, and the model learns from the dataset, forms groups, and assigns each data point to one of the groups based on the similarity of its features. This is known as clustering, which is one type of unsupervised machine learning.

[Figure: example input data for unsupervised machine learning]

Suppose we feed the above data to the model. The model will try to form groups and put each data point into one of them. For example, it may separate these inputs into sports items and girls' stuff. Note that the model only forms the groups; naming them is our interpretation.
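Clustering can be sketched with a bare-bones version of the k-means algorithm. The two-dimensional points below are hypothetical stand-ins for item features, and the starting centroids are picked by hand to keep the example deterministic; real implementations choose them more carefully.

```python
def assign_clusters(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return [
        min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))
        for point in points
    ]


def update_centroids(points, assignments, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for index, old_centroid in enumerate(centroids):
        members = [p for p, a in zip(points, assignments) if a == index]
        if not members:
            new_centroids.append(old_centroid)  # leave an empty cluster where it was
            continue
        new_centroids.append(tuple(sum(coords) / len(members) for coords in zip(*members)))
    return new_centroids


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate assignment and centroid updates until the groups stop changing."""
    centroids = list(initial_centroids)
    assignments = assign_clusters(points, centroids)
    for _ in range(max_iterations):
        centroids = update_centroids(points, assignments, centroids)
        new_assignments = assign_clusters(points, centroids)
        if new_assignments == assignments:
            break
        assignments = new_assignments
    return assignments, centroids


# Hypothetical unlabeled data: no output column, only features.
points = [(1.0, 1.2), (0.8, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.8), (4.9, 5.3)]
assignments, centroids = k_means(points, initial_centroids=[points[0], points[-1]])
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

The output contains only group numbers. Nothing in the algorithm knows what the groups mean, which is exactly the difference from the supervised case.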

Input data is important for machine learning because your model is totally dependent on it. The only thing an ML model gets as input is the dataset, and the accuracy of the model’s predictions depends directly on the quality of the data you feed it. Imagine a little kid: a kid who grows up in a better environment will learn better and make better decisions. Similarly, an ML model matures as you feed more and more data to it, just as a kid matures with age and experience. According to some surveys, about 70% of the time on an ML project is spent gathering and preparing the dataset.

We, at AlphaBOLD, specialize in implementing innovative Machine learning solutions to help grow your business.

Contact us; we will be more than happy to help you.
