Not long ago, Big Data was seen as a management revolution. Enterprise IT invested heavily to acquire large volumes of information, all to drive business decision making. And it turned out well for large enterprises with enough computing resources and data engineers to extract a few meaningful insights from exploding volumes of raw data.
The technology and philosophy of big data were appealing to business decision makers because billions of connected devices and users produce several exabytes of data every day. Every data point was a continuation of a trend, pattern, or story that business organizations could, at least in theory, exploit to make profitable data-driven decisions.
But that’s not how it always turned out.
Several years ago, Gartner estimated that 85% of all data projects failed to deliver the desired outcomes. This stat suggested that organizations were jumping onto the Big Data bandwagon without aligning their objectives with the technology and data assets they were acquiring.
Certainly, customers and end users did not always find data-driven technologies appealing, either. Consider the 2018 Facebook-Cambridge Analytica story, where user information was harvested without explicit consent. Or take a peek at the countless ads on any ecommerce website that have no personal relevance.
These use cases turned out to be exploitative or annoying, perhaps both.
It turns out that you don’t always need Big Data. You don’t always need to package every source of data to make decisions unique to every user. In fact, modern AI technologies are now adopting capabilities that encapsulate knowledge-based intelligence from data and information that is:
Consider this simple example: you can train a deep learning model for a self-driving car to stop at a red traffic light. The training dataset for such a model must both:
A similar limitation is observed in modern LLMs trained on big data. GenAI tools such as ChatGPT can perform well on some tasks, but not on all of them. They can’t necessarily provide the reasoning or logic behind their outputs (the ongoing issue of “black box” outputs).
Perhaps this is why we have yet to see a universal AGI model that performs exceptionally well on all tasks.
Toward that goal, the AI research and scientific community is looking into how humans actually learn: through logic and reasoning. This is usually achieved by integrating small but highly specific data with some established logic or knowledge.
Think about the traffic light example again: a human driver simply needs to identify a red light and apply the rules of the road to any scenario at the junction.
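To make that concrete, here is a minimal Python sketch (the rule table and function names are hypothetical) of how a single small observation plus established logic can stand in for a mountain of training examples:

```python
# A minimal, hypothetical sketch: one small observation (the light color)
# plus established traffic rules is enough to decide what to do.
# No large training corpus is required.

TRAFFIC_RULES = {
    "red": "stop",
    "yellow": "prepare_to_stop",
    "green": "proceed",
}

def decide_action(light_color: str, pedestrian_in_crosswalk: bool = False) -> str:
    """Apply established rules to a single observation."""
    if pedestrian_in_crosswalk:
        return "stop"  # an established rule that overrides everything else
    return TRAFFIC_RULES.get(light_color, "stop")  # default to the safe action

print(decide_action("red"))    # stop
print(decide_action("green"))  # proceed
```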
This brings us to the definition of Small Data.
Small Data refers to a relatively small set of information that is sufficient to capture meaningful insights about a specific use case. Here are some clear examples:
As data analyst Austin Chia describes:
Small data is traditional structured data that can be easily analyzed using tools like Microsoft Excel, Google Sheets, or SQL. It is usually generated in smaller volumes and follows a specific format, making it easier to manage and analyze.
Analyzing small data doesn’t require large AI models with billions of parameters. Since the data distribution describes fewer features, it can be analyzed using traditional statistical methods, even on low-power IoT and edge-computing devices.
(Related reading: predictive modeling & predictive vs. prescriptive analytics.)
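As a rough illustration, a small sensor sample can be summarized with nothing more than Python’s built-in statistics module; the readings and the two-standard-deviation rule below are made up for the example:

```python
import statistics

# Hypothetical hourly temperature readings from a single edge sensor (°C).
readings = [21.3, 21.8, 22.1, 21.9, 35.4, 22.0, 21.7]

mean = statistics.mean(readings)
stdev = statistics.stdev(readings)

# Flag readings that sit far from the mean; no ML model or GPU required.
anomalies = [r for r in readings if abs(r - mean) > 2 * stdev]

print(f"mean={mean:.1f}, stdev={stdev:.1f}, anomalies={anomalies}")
```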
This capability can allow business organizations to build highly tailored services. For example:
These use cases are simplistic, of course: existing knowledge and logic define the relationships and model parameters, and an inference is produced when a parameter reaches a threshold value.
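A sketch of that pattern, with entirely hypothetical weights and threshold chosen from domain knowledge rather than learned from big data, might look like this:

```python
# Hypothetical threshold-style inference for a tailored service:
# two known parameters, hand-set weights, and an action once the
# score crosses a threshold.

def churn_risk(days_since_last_login: int, open_support_tickets: int) -> float:
    # Weights encode existing domain knowledge, not learned parameters.
    return 0.02 * days_since_last_login + 0.15 * open_support_tickets

RISK_THRESHOLD = 0.6

score = churn_risk(days_since_last_login=25, open_support_tickets=2)
if score >= RISK_THRESHOLD:
    print(f"High churn risk ({score:.2f}): trigger a retention offer")
else:
    print(f"Low churn risk ({score:.2f}): no action needed")
```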
But what about the more advanced and complex use cases?
Take the example of LLMs. We know that LLMs perform well on generic conversational tasks. But what about specific math problems and programming styles? Do you need to train a model on every single code snippet published on Stack Overflow for it to learn a particular programming style or paradigm?
In these cases, large models trained on big data can serve as backbone models: a base model state that is further fine-tuned and adapted to perform well on a specialized task. Fine-tuning an LLM may take more than a handful of examples, but the dataset is still small relative to what trained the backbone. It will, however, require knowledge or logic as a means of guiding the training.
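To ground this, here is a rough, hypothetical sketch of that pattern using the Hugging Face transformers library and PyTorch, with "gpt2" standing in for the backbone and a toy two-example dataset standing in for the small, domain-specific data:

```python
# Rough sketch of adapting a pretrained backbone to a small, specialized
# dataset. Assumes transformers and torch are installed; "gpt2" is just a
# stand-in for whatever backbone model you actually use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A deliberately tiny, domain-specific dataset (hypothetical style examples).
small_dataset = [
    "def fetch_user(user_id):\n    return db.get('users', user_id)",
    "def fetch_order(order_id):\n    return db.get('orders', order_id)",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()

for epoch in range(3):  # a few passes are enough for a dataset this small
    for text in small_dataset:
        batch = tokenizer(text, return_tensors="pt")
        outputs = model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("gpt2-small-data-finetuned")
```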
For example, models such as ChatGPT rely on Reinforcement Learning from Human Feedback (RLHF). In simple terms, we can say two things:
Indeed, it is this logic and established knowledge that redirects and adapts the model's learning so that it performs very well on the tasks represented by the small dataset.
Drawing from our article on Big Data vs. Small Data Analytics, we can summarize their key differences as follows:
As more organizations experiment with language models and AI, our hunch is that small data will become increasingly important. Perhaps we’ll see a time when small data itself is the star of many business experiments, and big data is reserved only for the use cases that truly require and benefit from it.
See an error or have a suggestion? Please let us know by emailing ssg-blogs@splunk.com.
This posting does not necessarily represent Splunk's position, strategies or opinion.