The amount of data organizations handle has created the need for faster data access and processing. Data denormalization is a widely used technique to improve database query performance.
This article discusses data denormalization, why it is important, how it differs from data normalization, and the common denormalization techniques. Importantly, I'll also look at the pros and cons of this approach.
Data denormalization is the process of introducing some redundancy into previously normalized databases with the aim of optimizing database query performance. It introduces some pre-computed redundancy, using different techniques to solve issues in normalized data. These techniques, covered in detail below, include pre-joining tables, table splitting, adding derived columns, mirrored tables, and materialized views.
However, data denormalization introduces a trade-off between write and read performance.
Let’s first look at a normalized database. A normalized database will have a single copy of each piece of information, and the related data will reside in separate tables. You have to use queries to combine data from multiple tables and use that data in real-world applications.
However, assume that the amount of data increases, or that there are complex queries joining multiple tables. This can degrade the performance of the database significantly or lead to crashes.
For example, pretend that you run an online shop where customers can order products. Such an e-commerce platform typically contains customer data in one table and order data in another. The order data table contains a foreign key to the customer data table to relate to customer information.
Suppose the website needs to display all orders placed by a customer. A normalized data store would require a join between the orders and customer tables to achieve that. Now imagine the online shop gets a high volume of orders and customers. In that case, this join operation could be computationally intensive and slow down the performance of the website.
In such circumstances, you can introduce data denormalization to improve performance. You can either add redundant customer columns (such as the customer name) directly to the orders table, or maintain a pre-joined, read-optimized copy of the combined data; both approaches are covered in the techniques below.
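To make the normalized starting point concrete, here is a minimal sketch in Python using SQLite. The schema, table names and sample rows are illustrative assumptions rather than details from any specific platform; the point is that every "show my orders" page has to pay for the join.

```python
import sqlite3

# In-memory database with a normalized customer/order schema (illustrative names).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        total       REAL NOT NULL
    );
""")
conn.execute("INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob')")
conn.execute("INSERT INTO orders VALUES (101, 1, 25.0), (102, 1, 40.0), (103, 2, 15.0)")

# Every "show my orders" page needs this join; at high volume it becomes costly.
rows = conn.execute("""
    SELECT o.order_id, c.name, o.total
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    WHERE c.customer_id = 1
""").fetchall()
print(rows)  # [(101, 'Alice', 25.0), (102, 'Alice', 40.0)]
```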
Data normalization is the process that removes data redundancy by keeping exactly one copy of each piece of data in tables. It maintains the relationships between data and eliminates unstructured data. There are four main normal forms used to apply data normalization: first, second and third normal form, and Boyce-Codd Normal Form (BCNF, sometimes called 3.5NF).
A normalized database helps standardize the data across the organization and ensures logical data storage. Normalization also offers organizations a clean data set for various processes, improves query response time, and reduces data anomalies.
So, we can sum up the differences between data normalization and denormalization in two key ways: normalization removes redundancy so that each piece of data is stored exactly once, while denormalization deliberately reintroduces redundancy; and normalization favors data integrity and write performance, while denormalization favors read performance at the cost of slower updates and inserts.
Database administrators use several data denormalization techniques depending on the scenario. However, remember that those techniques have their own pros and cons. Here are some examples of data denormalization techniques used by database specialists:
The first technique, pre-joining tables, can be used when there are expensive join operations and data from multiple tables is frequently used together. Here, that frequently used data is added to a single table.
For example, let's say there are two tables called customer and order. If you want to display customer orders along with their names, adding the customer name to the order table will reduce the expensive join operations. However, it will introduce massive redundancy.
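As a rough sketch of the idea (reusing the illustrative schema from the earlier example, so the names are assumptions), the denormalized orders table simply carries a redundant customer_name column, and the read path no longer needs a join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Denormalized (pre-joined) orders table: the customer name is stored redundantly.
conn.execute("""
    CREATE TABLE orders_denormalized (
        order_id      INTEGER PRIMARY KEY,
        customer_id   INTEGER NOT NULL,
        customer_name TEXT NOT NULL,  -- redundant copy of customers.name
        total         REAL NOT NULL
    )
""")
conn.execute("INSERT INTO orders_denormalized VALUES (101, 1, 'Alice', 25.0)")

# Displaying orders with the customer name is now a single-table read (no join).
print(conn.execute(
    "SELECT order_id, customer_name, total FROM orders_denormalized"
).fetchall())

# The trade-off: when a customer's name changes, every redundant copy must be
# updated too, which slows writes and risks inconsistency.
conn.execute(
    "UPDATE orders_denormalized SET customer_name = 'Alicia' WHERE customer_id = 1"
)
```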
Table splitting is the process of decomposing a table into multiple smaller tables so they can be queried and managed easily. Table splitting can be done in two ways: horizontal table splitting and vertical table splitting.
Horizontal table splitting means splitting table rows into smaller tables, where each resulting table has the same columns. This approach is useful when data can be separated based on regions, physical locations, tasks and many other criteria.
For example, imagine a table containing student information for all departments in the science faculty of a university. This table can be split according to each department, such as computer science, chemistry, maths and biology.
Here, only a smaller data set will have to be queried compared with the original table. Thus, this technique enables faster query performance for department-based queries.
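Here is a minimal sketch of that split in Python with SQLite; the per-department table names and columns are assumptions made for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One horizontal partition per department; every partition has identical columns.
for dept in ("computer_science", "chemistry", "maths", "biology"):
    conn.execute(f"""
        CREATE TABLE students_{dept} (
            student_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            year       INTEGER NOT NULL
        )
    """)

conn.execute("INSERT INTO students_computer_science VALUES (1, 'Dana', 2)")
conn.execute("INSERT INTO students_chemistry VALUES (2, 'Ravi', 3)")

# A department-scoped query only scans its own, much smaller, partition.
print(conn.execute(
    "SELECT name FROM students_computer_science WHERE year = 2"
).fetchall())  # [('Dana',)]
```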
Vertical splitting means splitting a table based on columns, repeating the primary key in each partition so the pieces can be matched back together.
For example, imagine that a hospital maintains a 'Patients' table with patient ID, name, address and medical history columns. Using vertical partitioning, we can create two new tables from it: 'Patient_details' and 'Patient_medical_history.'
This approach is best suited when some table columns are accessed far more frequently than others. It allows queries to fetch only the required attributes, eliminating unnecessary data reads.
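A minimal sketch of this vertical split, again in Python with SQLite (the column choices are assumptions based on the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Vertical split of a single Patients table: both partitions share the primary key.
conn.executescript("""
    CREATE TABLE patient_details (
        patient_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        address    TEXT
    );
    CREATE TABLE patient_medical_history (
        patient_id INTEGER PRIMARY KEY REFERENCES patient_details(patient_id),
        history    TEXT
    );
""")
conn.execute("INSERT INTO patient_details VALUES (1, 'Maria Silva', '12 High St')")
conn.execute("INSERT INTO patient_medical_history VALUES (1, 'Penicillin allergy')")

# Frequently accessed columns are read without dragging the bulky history along.
print(conn.execute("SELECT name, address FROM patient_details").fetchall())

# The rarely needed columns are joined back in only when actually required.
print(conn.execute("""
    SELECT d.name, h.history
    FROM patient_details d
    JOIN patient_medical_history h USING (patient_id)
""").fetchall())
```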
Another technique is adding derived columns. Consider the following example: let's say there are two tables, Student and Student_Grades, where Student_Grades stores the marks each student earned for each assignment.
If the application requires displaying the total marks for the students with their details, we can add a new derived column that contains the total marks for all the assignments for each student. Therefore, there is no need to calculate the total marks each time you query the database.
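Here is a rough sketch of that derived column in Python with SQLite. The schema and the manual refresh are illustrative assumptions; in practice the derived value could also be maintained by a trigger or by the application's write path.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (
        student_id  INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        total_marks REAL NOT NULL DEFAULT 0   -- derived, denormalized column
    );
    CREATE TABLE student_grades (
        student_id    INTEGER REFERENCES student(student_id),
        assignment_id INTEGER,
        marks         REAL NOT NULL
    );
""")
conn.execute("INSERT INTO student (student_id, name) VALUES (1, 'Ken')")
conn.executemany(
    "INSERT INTO student_grades VALUES (1, ?, ?)",
    [(1, 80.0), (2, 65.0), (3, 90.0)],
)

# Recompute the derived column whenever grades change.
conn.execute("""
    UPDATE student
    SET total_marks = (SELECT SUM(marks) FROM student_grades g
                       WHERE g.student_id = student.student_id)
""")

# Reads no longer need to aggregate over student_grades each time.
print(conn.execute("SELECT name, total_marks FROM student").fetchall())  # [('Ken', 235.0)]
```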
The mirrored tables technique creates a full or partial copy of an existing table, stored in a separate location and optimized for faster query performance. Generally, the mirrored table serves read-heavy workloads, such as analytics queries, and is tuned using techniques like additional indexes and data partitioning.
This approach involves creating replicas of the data and storing them in separate database instances or on separate physical servers. However, it involves complexities like maintaining multiple copies of data and keeping them in sync, which can be costly and require more resources.
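As a very simplified sketch, the following Python snippet uses two separate SQLite connections to stand in for the primary and the reporting copy; the sync job, table names and index are assumptions made for illustration:

```python
import sqlite3

# Two separate in-memory databases stand in for "primary" and "reporting" instances.
primary = sqlite3.connect(":memory:")
primary.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL, created_at TEXT)""")
primary.execute("INSERT INTO orders VALUES (101, 1, 25.0, '2024-03-01')")
primary.execute("INSERT INTO orders VALUES (102, 2, 40.0, '2024-03-02')")

reporting = sqlite3.connect(":memory:")
reporting.execute("""CREATE TABLE orders_mirror (
    order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL, created_at TEXT)""")
# Extra indexes that would slow writes on the primary are fine on the mirror.
reporting.execute("CREATE INDEX idx_mirror_created ON orders_mirror(created_at)")

# A periodic sync job copies rows to the mirror; keeping the copies in sync is
# the maintenance cost this technique carries.
rows = primary.execute("SELECT * FROM orders").fetchall()
reporting.executemany("INSERT OR REPLACE INTO orders_mirror VALUES (?, ?, ?, ?)", rows)
reporting.commit()

# Analytics queries now run against the mirror without touching the primary.
print(reporting.execute(
    "SELECT COUNT(*), SUM(total) FROM orders_mirror WHERE created_at >= '2024-01-01'"
).fetchall())  # [(2, 65.0)]
```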
Materialized views are pre-computed query results stored in a separate table. They typically hold the results of expensive 'join' and 'aggregation' queries whose output is frequently accessed. The next time that data is needed, the database can pull it from the view rather than executing the same query repeatedly.
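Engines such as PostgreSQL and Oracle provide materialized views natively; SQLite does not, so the sketch below (an emulation with illustrative table names, not the syntax of any particular engine) stores the pre-computed aggregation in an ordinary table and refreshes it on demand:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL
    );
    INSERT INTO orders VALUES (101, 1, 25.0), (102, 1, 40.0), (103, 2, 15.0);

    -- Emulated materialized view: a table holding a pre-computed aggregation.
    CREATE TABLE order_totals_mv (
        customer_id INTEGER PRIMARY KEY,
        order_count INTEGER,
        total_spent REAL
    );
""")

def refresh_materialized_view(db):
    """Re-run the expensive aggregation query and overwrite the stored result."""
    db.execute("DELETE FROM order_totals_mv")
    db.execute("""
        INSERT INTO order_totals_mv
        SELECT customer_id, COUNT(*), SUM(total)
        FROM orders
        GROUP BY customer_id
    """)
    db.commit()

refresh_materialized_view(conn)

# Readers hit the small pre-computed table instead of re-running the aggregation.
print(conn.execute("SELECT * FROM order_totals_mv").fetchall())
# [(1, 2, 65.0), (2, 1, 15.0)]
```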
Data denormalization brings several advantages for organizations.
Querying data from a normalized data store may require multiple joins between different tables, depending on the requirement. When the data grows larger, it will slow down the performance of join operations. This can negatively impact the user experience, especially when such operations are related to frequently used functionality.
Data denormalization allows us to reduce the number of joins between tables by keeping frequently accessed data in redundant tables.
Data denormalization reduces the complexity of queries by reducing the number of join operations. It enables developers and other application users to write simple and maintainable code. Even novice developers can understand the queries and perform query operations easily.
Plus, this simplicity will help reduce bugs associated with database operations significantly.
Denormalization reduces the number of database transactions when reading data. This approach is particularly helpful when a high user load results in a heavy load of database transactions. This reduced number of transactions accommodates varying user loads, improving the scalability of applications.
Organizations use data to generate endless reports, such as usage statistics and sales reports. Generating such reports can involve aggregating and summarizing data by searching the whole data set. Data denormalization techniques like mirrored tables allow organizations to optimize databases specifically for daily report generation without affecting the performance of the master tables.
As discussed in the above section, data denormalization offers several advantages. However, this technique also has some disadvantages to consider before using it: it introduces redundant data, increases complexity, raises maintenance and storage costs, and can lead to inconsistencies between the redundant copies of the data.
Data normalization removes redundant data. Denormalization, in contrast, deliberately creates redundant data on top of normalized databases to improve read performance at the expense of update and insert performance. There are several denormalization techniques, such as splitting tables, adding derived columns, pre-joining tables, and maintaining mirrored tables and materialized views. Some advantages of this approach include improved user experience, better query performance, scalability, faster report generation and reduced query complexity.
Yet, denormalization also introduces several cons, like data redundancy, increased complexity, maintenance and storage costs, and data inconsistencies.