- Eliminate redundant (useless) data
- Ensure data dependencies make sense
- Maintain data integrity
When presented with a few examples of how to apply data normalization, modern IT professionals generally adopt the notion of normalizing data rather quickly, largely because it feels intuitive and fits neatly with the “Don’t Repeat Yourself” (DRY) principle, another concept computer engineering students are familiar with. Each value is stored in one place in the database and can be accessed directly, or through joins between tables.
A database in normal form is relatively simple to query, update and reason about, and with the small data sets used in university and personal projects, performance is not a concern.
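As a minimal sketch of what this looks like in practice (using SQLite and an illustrative customers/orders schema, not a real Nintex Promapp® table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total REAL
    );
    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0);
""")

# The customer's name lives in exactly one row; queries reach it via a join.
rows = conn.execute("""
    SELECT c.name, o.total
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    ORDER BY o.id
""").fetchall()
print(rows)  # [('Ada', 25.0), ('Ada', 40.0)]
```

Updating the customer's name means touching a single row, and nothing can drift out of sync, which is exactly the appeal of normal form.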
Performance vs normalization
A production database serving thousands of users is a different beast. Tables with millions of rows are common, and multiple joins between them, often required by a complex business domain, can hinder performance, even when carefully indexed.
The perfectly normalized database is not necessarily the most performant one, and as the data set grows the latency problems compound. To achieve satisfactory performance, it is usually necessary to cache some data and store some frequently accessed data in a denormalized format.
What is denormalized data?
Denormalization is the process of adding redundant copies of data to improve the performance of read operations on large data sets. By flattening data, reducing the number of joins, and performing grouping operations in advance, the queries that fetch data become simpler and faster to execute.
While you will get a performance boost when reading from a database with denormalized tables, write operations (Create, Update, Delete) take more time and place more load on the database server. There are also increased storage requirements for the duplicated data, and increased development complexity, as denormalized data sets must be recalculated whenever users edit the data they are derived from.
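A toy sketch of this trade-off, again using SQLite and a hypothetical orders schema: reads become a single-row lookup, but every write must also refresh the denormalized summary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    CREATE TABLE customer_totals (
        customer_id INTEGER PRIMARY KEY,  -- denormalized: one precomputed row per customer
        order_count INTEGER,
        grand_total REAL
    );
    INSERT INTO orders VALUES (1, 7, 25.0), (2, 7, 40.0);
""")

def recalc_customer(conn, customer_id):
    """Writes must keep the denormalized row in sync with the source rows."""
    conn.execute("""
        INSERT OR REPLACE INTO customer_totals
        SELECT customer_id, COUNT(*), SUM(total)
        FROM orders WHERE customer_id = ?
        GROUP BY customer_id
    """, (customer_id,))

recalc_customer(conn, 7)
# Reads are now a single-row lookup: no join, no aggregation at query time.
row = conn.execute(
    "SELECT order_count, grand_total FROM customer_totals WHERE customer_id = 7"
).fetchone()
print(row)  # (2, 65.0)

# The cost: every write to orders must also refresh the summary row.
conn.execute("INSERT INTO orders VALUES (3, 7, 10.0)")
recalc_customer(conn, 7)
row2 = conn.execute(
    "SELECT order_count, grand_total FROM customer_totals WHERE customer_id = 7"
).fetchone()
print(row2)  # (3, 75.0)
```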
Denormalized data at Nintex Promapp®
Here are a few examples of how we use denormalized data at Nintex Promapp® for better performance:
Denormalized tables in our databases
- DashboardNotifications – Customer dashboard notifications are calculated ‘on the fly’ and in a nightly background job. This mechanism pulls together data from several parts of the application and stores it in one table, so the dashboard views are performant for customers.
- ProcessView and Edit Rights – The logic determining which processes an individual user can view and edit in their Nintex Promapp® site involves complex business rules, which translate into multiple tables and joins in the normalized database. To speed up process viewing and editing requests, these rights are calculated and stored in a single denormalized table, which is updated whenever an action affecting user rights is performed.
Reporting API
Our Reporting API surfaces an OData feed of client data, which can then be consumed by third-party applications (Microsoft Excel and Power BI are explicitly supported). Allowing the Reporting API to query our relational database directly would place a heavy load on that resource and hinder performance for users performing standard operations in the Nintex Promapp® application, such as viewing or editing processes. Therefore, we export data asynchronously to a separate database, with events fired whenever an update affects the surfaced data.
Azure Cognitive Search
To improve search request performance in Nintex Promapp®, we export searchable entity data to Azure Cognitive Search and make it accessible through a Search API microservice. The search indexes break inbound text into tokens and store them in inverted indexes for fast access. When a natural language data set is processed by an appropriate indexer (for example, the English or German language analyzer), lookups are effectively constant-time and remain performant even as the amount of data in the index grows.
Another advantage is that the indexes can return results for words from the same word family as the search term: for example, a search for ‘mouse’ will also return results containing ‘mice’, and a search for ‘eat’ will also return results containing ‘ate’.
Finally, rules can be provided to an index about which results to boost, depending on which searchable field a search term is found in.
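The core mechanism can be sketched in a few lines of Python. The hand-written lemma map below is a toy stand-in for a real language analyzer, which derives these word-family mappings for the whole language:

```python
from collections import defaultdict

# Toy lemma map standing in for a real language analyzer.
LEMMAS = {"mice": "mouse", "ate": "eat"}

def tokenize(text):
    """Split text into lowercase tokens, normalized to their word-family lemma."""
    return [LEMMAS.get(word, word) for word in text.lower().split()]

def build_inverted_index(docs):
    """Map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

docs = {
    1: "the cat ate the mouse",
    2: "mice avoid cats",
}
index = build_inverted_index(docs)

# Lookup is a single dictionary access regardless of corpus size,
# and matches across word families via the shared lemma.
print(index["mouse"])  # {1, 2}
print(index["eat"])    # {1}
```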
To the future
As we split our legacy application into a modern microservice-based architecture, denormalized data sets may play a part in delivering performant, scalable solutions for our clients.
The problem with Object Relational Mappers
ORMs such as Entity Framework and Hibernate, while relatively simple to develop with, are not the answer to writing performant database queries. Operations as simple as accessing a nested property can be translated into an inefficient query, and care must be taken when iterating through a collection that additional database queries are not executed in each iteration of the loop: the notorious N+1 problem.
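The N+1 problem is easy to reproduce. The sketch below simulates it with raw SQLite and a query counter, standing in for what a lazy-loading ORM does behind the scenes when you touch a navigation property inside a loop:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Le Guin'), (2, 'Herbert');
    INSERT INTO books VALUES (1, 1, 'The Dispossessed'), (2, 2, 'Dune');
""")

query_count = 0
def run(sql, args=()):
    """Execute a query while counting round-trips to the database."""
    global query_count
    query_count += 1
    return conn.execute(sql, args).fetchall()

# N + 1: one query for the list, then one more per row inside the loop.
books = run("SELECT id, author_id, title FROM books")
for _, author_id, _ in books:
    run("SELECT name FROM authors WHERE id = ?", (author_id,))
naive_queries = query_count
print(naive_queries)  # 3 queries for just 2 books

# The fix is a single join (an ORM's eager loading, e.g. Include / join fetch):
query_count = 0
rows = run("""
    SELECT b.title, a.name
    FROM books b JOIN authors a ON a.id = b.author_id
    ORDER BY b.id
""")
print(query_count)  # 1 query
```

With a million rows, the naive loop issues a million extra round-trips; the join issues one.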
Careful profiling of database access and tuning of queries is required for anything beyond simple data requests. Some say that developers should “learn their tools” and tune code to prevent unnecessary database access or inefficient queries being executed, but this approach doesn’t seem correct to me. It places the burden on developers to understand implementation details that in a strict object-oriented approach should be encapsulated within the ORM library.
In addition, the tuning mentioned above often results in database access code leaking outside the data access layer, written in a style that is difficult to understand but is translated into an efficient query by the ORM.
Please note that this is not a dig at the providers of these ORM libraries. Converting standard computer language code into performant, correct database queries is inherently complex. The Microsoft documentation on Entity Framework performance notes these difficulties and urges users to learn concepts such as deferred execution as well as measuring query performance.
Using denormalized data for read operations
So what to do about the issues detailed above? An approach gaining favour is using denormalized data for read operations. One simple option is to use indexed views over your standard normalized database tables to provide performant read operations.
A more complex option is to split read and write databases, with updates to the write database synced to the read database to provide eventual consistency. Split databases are generally used in conjunction with the Command Query Responsibility Segregation (CQRS) pattern, which cleanly separates write (command) and read (query) operations. The denormalized tables or views can be accessed with a thin database access layer in the code, allowing you to query the data sets more or less directly.
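As an illustration, here is a minimal in-memory sketch of the pattern, with hypothetical names throughout: commands mutate the normalized write model and publish an event, and a subscriber rebuilds the denormalized read model that queries then hit directly.

```python
class EventBus:
    """Tiny synchronous event bus; a real system would use a queue or broker."""
    def __init__(self):
        self.handlers = []
    def subscribe(self, handler):
        self.handlers.append(handler)
    def publish(self, event):
        for handler in self.handlers:
            handler(event)

bus = EventBus()

# Write side: normalized storage (orders reference customers by id).
customers = {1: "Ada"}
orders = {}

def place_order(order_id, customer_id, total):
    """Command: update the write model, then announce the change."""
    orders[order_id] = {"customer_id": customer_id, "total": total}
    bus.publish(("order_placed", customer_id))

# Read side: denormalized view, eventually consistent with the write side.
order_summary = {}

def on_order_placed(event):
    """Rebuild the flattened summary row for the affected customer."""
    _, customer_id = event
    totals = [o["total"] for o in orders.values() if o["customer_id"] == customer_id]
    order_summary[customer_id] = {
        "name": customers[customer_id],
        "order_count": len(totals),
        "grand_total": sum(totals),
    }

bus.subscribe(on_order_placed)

place_order(10, 1, 25.0)
place_order(11, 1, 40.0)

# Queries read the flattened summary directly: no joins at read time.
print(order_summary[1])  # {'name': 'Ada', 'order_count': 2, 'grand_total': 65.0}
```

In production the two sides would live in separate databases and the handler would run asynchronously, which is where the eventual consistency comes from.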
An ORM is still often an appropriate choice for write operations, especially when mapping a rich Domain Driven Design (DDD) Domain layer to a normalized database. Create, Update and Delete operations are generally run less frequently than the Read queries, and are likely to be composed of simpler, smaller data sets, making performance less of a concern. An ORM in this case can also speed up development of a solution.
Chill out and learn to love denormalized data
Once you get over the mental hurdle of breaking away from having your data stored in perfectly normalized databases a whole world of data options opens up. You may have multiple data solutions for your application, each one expressly chosen for the operations it will be handling. Each solution has pros and cons to weigh up and learn about as you work with them. Give it a try, and just like me, you may just end up surprising yourself by chilling out and learning to love denormalized data.
Interested in learning more about Nintex Promapp and what it can do for your organization? Click here to request a demo!