Machine Learning algorithms today heavily depend on data. In fact, if there is no data, there is no machine learning; it is as simple as that. To see why that is the case, let's think for a second about what machine learning actually is. Machine learning can be described as:

"Machine Learning is the study of computer algorithms that can improve automatically by the use of data"

If we translate this into more informal language, we can refer to computer algorithms in the context of machine learning as machine learning models, or simply models.

The part about automatic improvement through the use of data can be translated as learning from data. So another version of the sentence could look like the following:

"Machine Learning is the study of models which can learn from data"

As we can see, what makes machine learning models special, in contrast to other types of computer algorithms, is their ability to learn from data. Without data to learn from, the models are just empty shells that cannot do anything. How the models learn from data is unfortunately not within the scope of this post, but in short it involves loss functions and optimization through gradient descent.
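To give a small taste of what that means in practice, here is a minimal sketch of "learning from data": fitting a line y = w * x + b by minimizing a mean squared error loss with gradient descent. The toy data, learning rate and iteration count below are made up purely for illustration.

```python
# Toy data roughly following y = 2x + 1 (made up for this example)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.0, 8.8]

w, b = 0.0, 0.0  # model parameters start as an "empty shell"
lr = 0.01        # learning rate

for _ in range(5000):
    n = len(xs)
    # Gradients of the mean squared error loss with respect to w and b
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Step the parameters against the gradient to reduce the loss
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # w and b end up close to 2 and 1
```

The model starts out knowing nothing (w = b = 0) and, purely by repeatedly looking at the data and nudging its parameters, ends up close to the rule that generated the data.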

Examples of machine learning models are: regression models, random forests, support vector machines and deep artificial neural networks. Each of these models has different capabilities, as well as its own pros and cons.

What is data?

Since data is such an essential part of machine learning models, it is worth digging a bit further into what data actually refers to. At its very core, data is simply a collection of 0's and 1's. Luckily, we can abstract away from the 0's and 1's by instead thinking about data as different kinds of file types. Some common examples are:

  • Text data (e.g. .txt and .html files)
  • Image and video data (e.g. .png, .jpeg and .avi files)
  • Audio data (e.g. .mp3 and .wav files)
  • Tabular data (e.g. .xlsx and .csv files)
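In Python, two of these file types can be read with nothing but the standard library. The file names and contents below are made up so the sketch is self-contained:

```python
import csv
from pathlib import Path

# Create a tiny .txt and .csv file so the example is self-contained
# (file names and contents are invented for illustration)
Path("notes.txt").write_text("machine learning needs data\n")
Path("measurements.csv").write_text("height,weight\n1.80,75\n1.65,60\n")

# Text data: read the whole file as one string
text = Path("notes.txt").read_text()

# Tabular data: read the file as a list of rows
with open("measurements.csv", newline="") as f:
    rows = list(csv.reader(f))

print(text.strip())  # machine learning needs data
print(rows[0])       # ['height', 'weight']
```

Image, video and audio files follow the same idea, but in practice they are usually read with dedicated libraries rather than by hand.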

Data acquisition

For data to be useful to a machine learning model, the data needs to be clean and well-structured. But real-world data is often neither clean nor structured. What can be even worse, however, is when no data, or very little data, exists at all.

Going from no data, or very little data, to a ready-to-use data set that a machine learning model can learn from is no trivial task. One way of approaching the task is to split it into the following four sub-tasks:

  1. Data Gathering (e.g., through sensors, Internet scraping or questionnaires)
  2. Data Preparation (e.g., through removal of anomalies and noise, and transformation of features)
  3. Data Exploration (e.g., by exposing features and insights in the data)
  4. Data Representation (e.g., by representing the data in relevant formats and structuring it into datasets)
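The four sub-tasks above can be sketched end-to-end on a toy example. Everything here (the pretend sensor readings, the plausible-range threshold, the dataset layout) is invented for illustration, not a prescription:

```python
# 1. Data Gathering: pretend these readings came from a temperature sensor
raw_readings = [21.3, 21.8, 999.0, 22.1, -40.0, 21.6]

# 2. Data Preparation: drop obvious anomalies outside a plausible range
clean = [r for r in raw_readings if -10.0 < r < 50.0]

# 3. Data Exploration: simple summary statistics to expose insights
mean = sum(clean) / len(clean)
spread = max(clean) - min(clean)

# 4. Data Representation: structure the result into a small dataset
dataset = {"feature": "temperature", "values": clean, "mean": round(mean, 2)}

print(dataset)
```

Real pipelines are of course far more involved, but the shape is the same: gather raw values, clean them, understand them, and only then structure them for a model.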

We will go into the details of each sub-task in upcoming posts.

Thank you for reading! Please upvote and/or leave a comment below if you found this post interesting.

Get back to my other posts here