Not long ago, Big Data was seen as a management revolution. Enterprise IT invested heavily to acquire large volumes of information, all to drive business decision making. And it turned out well for large enterprises with enough computing resources and data engineers to extract a few meaningful insights from exploding volumes of raw data.
The technology and philosophy of big data were appealing to business decision makers because billions of connected devices and users produce several exabytes of data every day. Every data point was a continuation of a trend, pattern, or story that business organizations could, at least in theory, exploit to make profitable data-driven decisions.
But that’s not how it always turned out.
Several years ago, Gartner estimated that 85% of all data projects failed to deliver the desired outcomes. This stat suggested that organizations were jumping onto the Big Data bandwagon without aligning their objectives with the technology and data assets they were acquiring.
Certainly, customers and end users did not always find data-driven technologies appealing, either. Consider the 2018 Facebook-Cambridge Analytica story, where user information was harvested without explicit consent. Or take a peek at the countless ads on any ecommerce website that have no personal relevance.
These use cases turned out to be exploitative or annoying, perhaps both.
It turns out that you don’t always need Big Data. You don’t always need to package every source of data to make decisions unique to every user. In fact, modern AI technologies are now adopting capabilities that encapsulate knowledge-based intelligence from data and information that is:
Consider this simple example: you can train a deep learning model for a self-driving car to stop at a red traffic light. The training dataset for such a model must both:
A similar limitation is observed in modern LLMs trained on big data. GenAI tools such as ChatGPT can perform well on some tasks, but not on all of them. They can’t necessarily provide the reasoning or logic behind their outputs (the ongoing issue of “black box” outputs).
Perhaps this is why we have yet to see a universal AGI model that performs exceptionally well on all tasks.
Toward that goal, the AI research and scientific community is looking into how humans actually learn: through logic and reasoning. This is usually achieved by integrating small but highly specific data with some established logic or knowledge.
Think about the traffic light example again: a human driver simply needs to identify a red light and apply the rules of the road to any scenario at the junction.
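To make that concrete, here is a minimal Python sketch (the rule table and function names are hypothetical) of how a single small observation plus established logic can stand in for a mountain of training examples:

```python
# A minimal, hypothetical sketch: one small observation (the light color)
# plus established traffic rules is enough to decide what to do.
# No large training corpus is required.

TRAFFIC_RULES = {
    "red": "stop",
    "yellow": "prepare_to_stop",
    "green": "proceed",
}

def decide_action(light_color: str, pedestrian_in_crosswalk: bool = False) -> str:
    """Apply established rules to a single observation."""
    if pedestrian_in_crosswalk:
        return "stop"  # an established rule that overrides everything else
    return TRAFFIC_RULES.get(light_color, "stop")  # default to the safe action

print(decide_action("red"))    # stop
print(decide_action("green"))  # proceed
```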
This brings us to the definition of Small Data.
Small Data refers to a relatively small set of information that is sufficient to capture meaningful insights about a specific use case. Here are some clear examples:
As data analyst Austin Chia describes:
Small data is traditional structured data that can be easily analyzed using tools like Microsoft Excel, Google Sheets, or SQL. It is usually generated in smaller volumes and follows a specific format, making it easier to manage and analyze.
Analyzing small data doesn’t require large AI models with billions of parameters. Since the data distribution describes fewer features, it can be analyzed using traditional statistical methods, even on low-power IoT and edge-computing devices.
(Related reading: predictive modeling & predictive vs. prescriptive analytics.)
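As a rough illustration, a small sensor sample can be summarized with nothing more than Python’s built-in statistics module; the readings and the two-standard-deviation rule below are made up for the example:

```python
import statistics

# Hypothetical hourly temperature readings from a single edge sensor (°C).
readings = [21.3, 21.8, 22.1, 21.9, 35.4, 22.0, 21.7]

mean = statistics.mean(readings)
stdev = statistics.stdev(readings)

# Flag readings that sit far from the mean; no ML model or GPU required.
anomalies = [r for r in readings if abs(r - mean) > 2 * stdev]

print(f"mean={mean:.1f}, stdev={stdev:.1f}, anomalies={anomalies}")
```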
This capability can allow business organizations to build highly tailored services. For example:
These use cases are simplistic, of course: existing knowledge and logic define the relationships and model parameters, and an inference is produced when a parameter reaches a threshold value.
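A sketch of that pattern, with entirely hypothetical weights and threshold chosen from domain knowledge rather than learned from big data, might look like this:

```python
# Hypothetical threshold-style inference for a tailored service:
# two known parameters, hand-set weights, and an action once the
# score crosses a threshold.

def churn_risk(days_since_last_login: int, open_support_tickets: int) -> float:
    # Weights encode existing domain knowledge, not learned parameters.
    return 0.02 * days_since_last_login + 0.15 * open_support_tickets

RISK_THRESHOLD = 0.6

score = churn_risk(days_since_last_login=25, open_support_tickets=2)
if score >= RISK_THRESHOLD:
    print(f"High churn risk ({score:.2f}): trigger a retention offer")
else:
    print(f"Low churn risk ({score:.2f}): no action needed")
```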
But what about the more advanced and complex use cases?
Take the example of LLMs. We know that LLMs perform well on generic conversational tasks. But what about specific math problems and programming styles? Do you need to train a model on every single code snippet published on Stack Overflow for it to learn a particular programming style or paradigm?
In these cases, large models trained on big data can serve as backbone models: a base model state that is further fine-tuned and adapted to perform well on a specialized task. Fine-tuning an LLM may take more than a handful of examples, but the dataset is still small relative to what trained the backbone. It will, however, require knowledge or logic as a means of guiding the training.
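To ground this, here is a rough, hypothetical sketch of that pattern using the Hugging Face transformers library and PyTorch, with "gpt2" standing in for the backbone and a toy two-example dataset standing in for the small, domain-specific data:

```python
# Rough sketch of adapting a pretrained backbone to a small, specialized
# dataset. Assumes transformers and torch are installed; "gpt2" is just a
# stand-in for whatever backbone model you actually use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A deliberately tiny, domain-specific dataset (hypothetical style examples).
small_dataset = [
    "def fetch_user(user_id):\n    return db.get('users', user_id)",
    "def fetch_order(order_id):\n    return db.get('orders', order_id)",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()

for epoch in range(3):  # a few passes are enough for a dataset this small
    for text in small_dataset:
        batch = tokenizer(text, return_tensors="pt")
        outputs = model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("gpt2-small-data-finetuned")
```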
For example, models such as ChatGPT rely on Reinforcement Learning from Human Feedback (RLHF). In simple terms, we can say two things:
Indeed, it is this logic and established knowledge that redirects and adapts the model's learning so that it performs very well on the tasks represented by the small dataset.
Drawing from our article on Big Data vs. Small Data Analytics, we can summarize their key differences as follows:
As more organizations experiment with language models and AI, our hunch is that small data will become increasingly important. Perhaps we’ll see a time when small data itself is the star of many business experiments, and big data is reserved only for the use cases that truly require and benefit from it.
See an error or have a suggestion? Please let us know by emailing ssg-blogs@splunk.com.
This posting does not necessarily represent Splunk's position, strategies or opinion.