Machine Learning 101: Understanding Its Role in Cybersecurity

In the world of cybersecurity, Machine Learning (ML) is becoming an increasingly essential tool. To understand the impact of machine learning, it’s first important to understand the basics, which is exactly what we cover below.

What Is Machine Learning?

Machine Learning is a subset of Artificial Intelligence (AI) where systems learn from data, identify patterns, and make decisions with minimal human intervention. Unlike traditional programming, where a programmer provides a machine with a set of explicit instructions, ML allows the system to learn from experience and improve over time.

Terminology

  1. Features/Labels
    In ML, data is divided into features and labels; a short code sketch after this list shows both in practice.


    • Features are the inputs used to predict outcomes. In the case of OT cybersecurity, features could include network traffic data, device configurations, or historical vulnerability information.

    • Labels represent the target variable we want the model to predict. For example, in OT systems, labels could be whether a network packet is benign or malicious based on certain patterns.
  2. Types of Features: Continuous/Categorical


    • Continuous Features are numerical values that can take any value within a range, such as temperature or network bandwidth.

    • Categorical Features are values that fall into discrete, defined categories, such as device type (e.g., router, sensor) or protocol used (e.g., TCP, UDP). These categories help the model identify and predict based on grouping, rather than continuous measurements.

  3. Dimensionality Reduction

Dimensionality reduction refers to the process of simplifying a dataset by reducing the number of features (variables) while preserving its essential information. This is typically done in machine learning to improve performance, reduce computational complexity, and avoid the curse of dimensionality—where too many features can lead to overfitting or difficulty in analyzing the data effectively.
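
To make these terms concrete, here is a minimal sketch in Python (using pandas and scikit-learn, with invented OT-style values) that builds a feature matrix from continuous and categorical features, attaches labels, and applies PCA, one common dimensionality reduction technique:

    import pandas as pd
    from sklearn.decomposition import PCA

    # Hypothetical OT network records: two continuous features, one categorical.
    data = pd.DataFrame({
        "bandwidth_mbps": [12.4, 98.1, 3.7, 45.0],       # continuous
        "packet_size":    [512, 1460, 64, 1024],          # continuous
        "protocol":       ["TCP", "UDP", "TCP", "ICMP"],  # categorical
    })
    labels = ["benign", "malicious", "benign", "benign"]  # what we want to predict

    # Categorical features must be encoded numerically before most models can use them.
    encoded = pd.get_dummies(data["protocol"], dtype=float)

    # Combine continuous and encoded categorical columns into one feature matrix.
    X = pd.concat([data[["bandwidth_mbps", "packet_size"]], encoded], axis=1).to_numpy()

    # Dimensionality reduction: project the feature matrix down to two components.
    X_reduced = PCA(n_components=2).fit_transform(X)
    print(X_reduced.shape)  # (4, 2): same rows, fewer features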

Training Process for Supervised Learning

The training process in machine learning is iterative, and it requires several important steps to ensure that the model learns effectively. Here’s an overview of the key stages:

  1. Input Data
    The first step involves providing the model with input data. This data consists of features (input variables), and the model is asked to predict the labels (output variables) in order to learn and discover patterns. For instance, a model could be trained on rows of data from network traffic logs, each containing features like source IP, destination IP, and protocol type, and then asked to produce the label indicating whether the session was secure.

  2. Quiz the Model
    After the model has seen a set of data, it is "quizzed", or tested, on its ability to predict the labels from the features it has been given. The model makes predictions based on what it has learned during training.

  3. Repeat
    Training is a continuous process. After the model is quizzed and evaluated for accuracy, it is refined by adjusting its parameters. This step is repeated until the model’s performance meets the desired threshold (e.g., 98% accuracy). Ideally, by continuously feeding the model more data, testing its predictions, and adjusting it, the model gets better at recognizing patterns and making accurate predictions.
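
As a rough illustration of this loop, here is a sketch in Python using scikit-learn, with synthetic data standing in for labeled traffic logs; the parameter grid and the 98% threshold are arbitrary choices for the example:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for labeled traffic records (real features would come
    # from parsed logs: encoded addresses, ports, protocol flags, and so on).
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

    # Step 1: input data. Hold out a test set the model never trains on.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # Steps 2-3: train, "quiz" the model on held-out data, and repeat with
    # adjusted parameters until accuracy clears the chosen threshold.
    for c in [0.01, 0.1, 1.0, 10.0]:  # regularization strengths to try
        model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
        accuracy = accuracy_score(y_test, model.predict(X_test))
        print(f"C={c}: accuracy={accuracy:.3f}")
        if accuracy >= 0.98:  # the desired threshold from the text
            break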

Unsupervised learning looks different: algorithms analyze data without any labeled outcomes or explicit instructions. Instead of being told what to look for, the algorithm explores the data on its own to uncover whatever patterns or structures exist within it.
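
For instance, here is a minimal sketch of unsupervised clustering with scikit-learn: k-means is handed unlabeled points (synthetic in this example) and groups them entirely on its own:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Unlabeled data: the algorithm is never told which group each point belongs to.
    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # k-means uncovers structure on its own by grouping nearby points into clusters.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
    print(kmeans.labels_[:10])  # cluster assignments the algorithm inferred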

Types of Machine Learning Models

Machine learning encompasses a wide variety of algorithms, each suited to different types of problems. Here are four foundational types:

Linear Regression
One of the simplest and most interpretable models, linear regression is used to predict a continuous outcome based on one or more input variables. It assumes a linear relationship between inputs and outputs, making it ideal for problems where the data follows a clear trend, such as forecasting sales based on advertising spend or predicting temperature based on time of year.

Real-world example: Zillow uses linear regression to estimate home prices by analyzing square footage, number of bedrooms, location, and other features.
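A minimal sketch of the idea in Python with scikit-learn, using invented advertising-spend numbers rather than any real dataset:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Toy data: advertising spend (in $1000s) versus resulting sales.
    spend = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
    sales = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

    # Fit a straight line through the points and use it to forecast.
    model = LinearRegression().fit(spend, sales)
    print(model.coef_[0], model.intercept_)  # learned slope and intercept
    print(model.predict([[6.0]]))            # forecast for a new spend level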

Decision Trees
Decision trees split data into branches based on feature values, creating a flowchart-like structure that is easy to visualize and understand. They are particularly useful for classification tasks, like determining whether a transaction is fraudulent or categorizing customer segments. However, they can overfit the data unless carefully pruned or paired with ensemble methods.

Real-world example: Financial institutions use decision trees to automate loan approvals by evaluating credit score, income, and debt-to-income ratio.
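A small sketch of a decision-tree classifier in scikit-learn, with synthetic data standing in for applicant records; note how max_depth caps the tree's growth, a simple guard against the overfitting mentioned above:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for applicant records (credit score, income, debt ratio, ...).
    X, y = make_classification(n_samples=500, n_features=5, random_state=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    # Limiting depth is a simple form of pruning that reduces overfitting.
    tree = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_train, y_train)
    print(tree.score(X_test, y_test))  # accuracy on held-out data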

Neural Networks
Inspired by the human brain, neural networks are composed of layers of interconnected "neurons" that can learn complex patterns in data. They power many of today’s AI applications, including image recognition, natural language processing, and voice assistants. While powerful, they require large datasets and computational power, and are often considered black-box models due to their lack of interpretability.

Real-world example: Apple’s Face ID uses neural networks to recognize and authenticate users by analyzing the geometry of their faces.
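A small sketch of a neural network in scikit-learn, trained on the classic 8x8 handwritten-digit images, a modest stand-in for the far larger networks behind systems like Face ID:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Small image-recognition task: 8x8 grayscale digits, 10 classes.
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

    # One hidden layer of 64 "neurons"; production vision models stack many more.
    net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=2)
    net.fit(X_train, y_train)
    print(net.score(X_test, y_test))  # accuracy on held-out digits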

Reinforcement Learning
Unlike supervised learning models, reinforcement learning involves an agent learning to make decisions by interacting with an environment. It receives rewards or penalties based on its actions and uses that feedback to improve over time. 

Real-world example: Google’s DeepMind used reinforcement learning to train AlphaGo, the first AI to defeat a world champion in the game of Go.
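
To make the reward-and-penalty loop concrete, here is a self-contained sketch of tabular Q-learning on an invented five-state corridor (nothing like AlphaGo's scale, but the same feedback principle):

    import numpy as np

    # A tiny corridor: states 0..4, with a reward only at the rightmost state.
    # The agent must learn, from reward feedback alone, to keep moving right.
    n_states, n_actions = 5, 2           # actions: 0 = left, 1 = right
    Q = np.zeros((n_states, n_actions))  # the agent's learned value estimates
    alpha, gamma, epsilon = 0.5, 0.9, 0.1

    rng = np.random.default_rng(0)
    for episode in range(200):
        state = 0
        while state != n_states - 1:
            # Explore occasionally; otherwise act on the current estimates.
            action = rng.integers(2) if rng.random() < epsilon else int(Q[state].argmax())
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == n_states - 1 else 0.0
            # Q-learning update: nudge the estimate toward the observed reward
            # plus the discounted value of the best next action.
            Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
            state = next_state

    print(Q.argmax(axis=1)[:-1])  # learned policy for non-terminal states: all 1 (right)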