Data continuously flows through organisations, and on its way it is transformed many times to serve different purposes. Functional design emphasises data transformations (projections) rather than repeatedly mutating state in a single database. One particularly useful form of projection is the creation of read-optimised database structures tailored to well-defined use cases. This is the focus of this article.

What is a projection?


To understand projections, think of a projector transforming a small photographic slide through a lens into a new representation optimised for a wider audience.

In the same way, a projection selects, reshapes, and focuses data from an event stream into a read-optimised database model designed for a specific purpose.

Creation of Read-Optimised Data

Within the CQRS (Command Query Responsibility Segregation) pattern, the query data is stored optimised for reads, while the command data is stored as events reflecting what has happened and in which order. The data in the events gets projected from its original form into structures optimised for many different and well-defined types of queries.

A similar approach can be applied when a service consumes events from another system or service and stores that data in its own database. This kind of data should be stored in a read-optimised way to serve e.g. a query from the UI. The events can be transformed into several data structures if needed; redundancy is fine, as the data is used for queries only and gets updated via projections from the same event stream.

Lucene-based databases such as Elasticsearch support this model particularly well. Their document-oriented approach and use of indices align naturally with the idea of projecting data. For this reason, this type of database will be used as an example in this article.

When creating projections, it’s usually a good idea to build dedicated projection classes whose only purpose is to hook into the stream of incoming events and project the data into read-optimised structures. In languages that support LINQ and lambda expressions this can be done in a functional style, making it possible to plug one or even several projections into the stream of events.
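As a rough sketch of this idea (here in Python for brevity rather than a LINQ-style language, with hypothetical event types and read models), several projections can be plugged into the same event stream, each folding events into its own read-optimised structure:

```python
from collections import defaultdict

# Hypothetical events: (event_type, payload) tuples from an event stream.
events = [
    ("OrderPlaced",  {"order_id": "o1", "customer": "alice", "amount": 50}),
    ("OrderPlaced",  {"order_id": "o2", "customer": "bob",   "amount": 30}),
    ("OrderShipped", {"order_id": "o1"}),
]

# Two independent read models, each maintained by its own projection.
orders_by_customer = defaultdict(list)   # read model 1: orders per customer
order_status = {}                        # read model 2: status per order

def project_orders_by_customer(event_type, payload):
    if event_type == "OrderPlaced":
        orders_by_customer[payload["customer"]].append(payload["order_id"])

def project_order_status(event_type, payload):
    if event_type == "OrderPlaced":
        order_status[payload["order_id"]] = "placed"
    elif event_type == "OrderShipped":
        order_status[payload["order_id"]] = "shipped"

# Plug one or several projections into the same stream of events.
projections = [project_orders_by_customer, project_order_status]

for event_type, payload in events:
    for projection in projections:
        projection(event_type, payload)
```

Each projection is a plain function, so adding a new read model is just a matter of appending another function to the list.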

Changes to the Read-Optimised Data Structure

As the data projections are highly purpose-specific, they will need to change when the requirements change.

One way is to make the service handle both the new and the old model, and apply the changed structure only to new incoming data. There are several problems with that; for example, it can be harder to query the data if the same database index contains documents with different structures. Migration code can be written to move the data from one model to another, but this is risky, as it is usually code that is separate from the code that normally generates the data.

Let’s assume the events consumed by the service are stored in an event log that contains all events from the moment they were first produced.

In this scenario the service can reprocess all events from the beginning and rebuild the full data structures into new indices. This gives you the possibility to completely change and re-optimise the query data, e.g. create three indices instead of two, and also correct old data if needed.

If you want to reprocess all data while the service is still running and used by users, you need to support both the new and the old model. The index names can include a version number which allows you to dynamically switch the service over to the new version of the indices with all the reprocessed data.
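The versioned-index switch can be sketched like this (in Python, with a plain dict standing in for the database; in Elasticsearch, index aliases play the role of the alias here):

```python
# Read models live in versioned indices; an alias points the running
# service at the active version, so a rebuilt index can be swapped in
# without downtime. A dict stands in for the database.
indices = {
    "orders-v1": {"o1": {"status": "placed"}},   # old, possibly stale model
}
alias = {"orders": "orders-v1"}                  # what readers currently see

def rebuild(event_log, new_version):
    """Reprocess the full event log into a fresh, versioned index."""
    new_index = {}
    for order_id, status in event_log:           # events in original order
        new_index[order_id] = {"status": status}
    indices[new_version] = new_index

def switch(alias_name, version):
    alias[alias_name] = version                  # readers follow the alias

# Rebuild from the complete event log, then flip the alias atomically.
rebuild([("o1", "placed"), ("o1", "shipped")], "orders-v2")
switch("orders", "orders-v2")

active = indices[alias["orders"]]                # what queries now hit
```

Until the switch, queries keep hitting the old version, so users never see a half-rebuilt index.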

Reprocessing Events

If there is a defect causing the read-optimised data to be wrong, the events can be reprocessed for the sole purpose of fixing the error. In this case there might be no need for a version change of the indices, as long as they have been designed for idempotency.

Idempotency is the property of an operation where performing it multiple times produces the same result as performing it once.

When building read-optimised data, a great way to make it idempotent is to utilise parts of the data structure to create a composed key, which can even include a date (e.g. day or month). Avoid fields whose values can change; instead, use stable synthetic ids that are part of the data.

By utilising idempotent keys, the events can be reprocessed many times: if the data already exists it gets updated, if not it is added.
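A minimal sketch of such a composed key (in Python, with hypothetical event shapes, and a dict standing in for the index) shows why reprocessing is safe, since the same event always maps to the same document:

```python
# A read-optimised index keyed by a composed, idempotent key: a stable
# synthetic order id plus the event date. Reprocessing the same events
# upserts the same documents instead of creating duplicates.
index = {}

def project(event):
    key = f'{event["order_id"]}:{event["date"]}'   # composed key
    index[key] = {"order_id": event["order_id"], "amount": event["amount"]}

events = [
    {"order_id": "o1", "date": "2024-03-05", "amount": 10},
    {"order_id": "o2", "date": "2024-03-06", "amount": 5},
]

# Reprocess the full stream twice, as a rebuild or bug-fix run would.
for _ in range(2):
    for event in events:
        project(event)
```

After two full passes the index still holds exactly one document per event, which is the idempotency property the article describes.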

Be careful though: if all historic events are kept in the event log, your read-optimised data will travel back in time as the events are reprocessed, causing users to see stale state until reprocessing completes. There are different mechanisms to handle this problem, e.g. only storing the latest events in the event stream, or always using index versioning.

Join data from Different Event Sources

In some cases, it’s not enough to create read-optimised data from only one event source. There are many ways to join the data; e.g. you can store it in separate tables in a SQL database and join at read time. Document databases are in general not built for performant joins, but there are other options, like having a job running regularly to update the joined index.

For a more real-time approach, a separate service can be built with the purpose of joining, and if needed aggregating, the data into new events. This service can also serve as an anti-corruption layer that handles changes made in other systems and sub-domains. By joining the events, the approach of read-optimised data can be fully realised, and no joins need to be done at read time.
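The core of such a joining service can be sketched as follows (in Python, with hypothetical event shapes): it keeps the latest state from one event source and uses it to enrich events from another, emitting new, pre-joined events:

```python
# A joining service keeps the latest state from one event source
# (customers) and enriches events from another (orders) into new,
# pre-joined events, so no join is needed at read time.
customers = {}       # latest known name per customer id
joined_events = []   # output stream of pre-joined, enriched events

def on_customer_event(event):
    customers[event["id"]] = event["name"]

def on_order_event(event):
    joined_events.append({
        "order_id": event["order_id"],
        # Look up the latest customer state; fall back if it hasn't
        # arrived yet (late/out-of-order events are a real concern here).
        "customer_name": customers.get(event["customer_id"], "unknown"),
        "amount": event["amount"],
    })

on_customer_event({"id": "c1", "name": "Alice"})
on_order_event({"order_id": "o1", "customer_id": "c1", "amount": 42})
```

The `"unknown"` fallback hints at why real implementations need the ordering and late-arrival handling discussed next.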

Real-time data joins and aggregations quickly become complex; maintaining event order, handling late arrivals, and efficiently managing memory under heavy load are all challenging. For scenarios like this, it’s advisable to leverage stream processing platforms such as Kafka Streams or Apache Flink, which provide built-in support for event ordering, windowed aggregations, and stateful processing, enabling scalable and reliable real-time data pipelines.

These kinds of tools also make it possible to use a LINQ/fluent syntax to process the incoming data and create new streams (code example by ChatGPT):

// Assumes earlier setup: a StreamBuilder 'builder', a stream
// configuration 'config', and an input stream 'orderObjects'
// of orders keyed per customer.

// Windowed aggregation: sum amounts per minute
var orderSummary = orderObjects
    .GroupByKey()
    .WindowedBy(TimeWindows.Of(TimeSpan.FromMinutes(1)).Grace(TimeSpan.FromSeconds(30)))
    .Aggregate(
        () => 0m,                                   // initial aggregate value
        (key, order, aggregate) => aggregate + order.Amount
    );

// Output to a projection (read-optimised) topic
orderSummary
    .ToStream()
    .MapValues((windowedKey, totalAmount) =>
        $"{windowedKey.Key} -> TotalAmount={totalAmount}, WindowStart={windowedKey.Window.StartTime}")
    .To("order-summary-topic");

var topology = builder.Build();
var stream = new KafkaStream(topology, config);
await stream.StartAsync();

A few examples of how this provides a functional programming approach:

  • Immutable data transformations: Each operation produces new streams or intermediate objects, rather than modifying the original stream. Example: orderSummary.ToStream().MapValues(...)
  • First-class functions: Functions are passed as arguments to transformations (MapValues, Aggregate, GroupByKey).
  • Declarative/pipeline style: You describe what you want done, not how to mutate state step by step, and the stream is built as a pipeline of transformations: Map → GroupBy → Window → Aggregate → ToStream. This is very similar to LINQ-style functional programming.

Performance

Performance is worth mentioning, as processing streams with huge amounts of data while updating a database with read-optimised indexes or data structures can be a challenge. There is a lot to cover when it comes to throughput, memory usage vs disk usage, and aggregations, but on a very high level keep this in mind: Bulk, Batch and Buffer.

  • Bulk: Grouping multiple database updates or inserts into a single operation will significantly improve performance.
  • Batch: Process and consume events in batches whenever possible, e.g. to prepare for efficient bulk database operations.
  • Buffer: Temporarily pause the event consumer to let events accumulate, enabling more efficient batch processing before writing to the database.
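The three ideas can be combined in one small component, sketched here in Python (the class and its thresholds are hypothetical; a real implementation would write via e.g. a database bulk API instead of recording batches in memory):

```python
import time

# Buffer incoming events and flush them as one bulk write once the
# batch is full or a time limit has passed. 'flushed_batches' stands
# in for actual bulk database operations.
class BulkWriter:
    def __init__(self, batch_size=3, max_wait_seconds=5.0):
        self.batch_size = batch_size
        self.max_wait = max_wait_seconds
        self.buffer = []              # Buffer: accumulate events here
        self.flushed_batches = []     # each entry = one bulk operation
        self._last_flush = time.monotonic()

    def add(self, doc):
        self.buffer.append(doc)
        # Batch: flush when enough events have accumulated,
        # or when the buffer has waited long enough.
        if (len(self.buffer) >= self.batch_size
                or time.monotonic() - self._last_flush >= self.max_wait):
            self.flush()

    def flush(self):
        if self.buffer:
            # Bulk: one write for many documents.
            self.flushed_batches.append(list(self.buffer))
            self.buffer.clear()
        self._last_flush = time.monotonic()

writer = BulkWriter(batch_size=3)
for i in range(7):
    writer.add({"id": i})
writer.flush()   # flush the remainder, e.g. on shutdown
```

With a batch size of three, seven events result in three bulk writes instead of seven individual ones.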

AI and the Future

ChatGPT itself predicts a big mental shift in programming: most AI-generated code will not be “maintained”. Instead it will be regenerated based on the automated tests that form the source of truth. Whether this prediction is correct can be discussed, but if AI regenerates code rather than adjusting it, would it also regenerate the database structures instead of modifying them?

Thanks for reading all the way here and if you want, have a look at similar subjects in my blog posts about Functional Domain Modeling and How to work with Software Architecture.
