Part I: Lambda Architecture
Introduction:
Before going into technical details, let me summarize the purpose of this blog series and the tools and technologies required to achieve the end goal. In real-world scenarios, most businesses generate data at high throughput and high velocity, and because running analytics directly on live data is expensive, they are not always able to extract timely insights from it.
In this series, we will discuss how Lambda architecture enables real-time analytics on massive datasets with a low TCO (Total Cost of Ownership) and efficient data processing.
Lambda Architecture
Let’s discuss what Lambda Architecture is and how it helps us with real-time big data analytics.
Do not panic; we will discuss all the architecture components and decipher the logic behind them.
Regardless of the tools and platform we eventually use for data processing, let’s go through its basic building blocks.
We can categorize all components under hot and cold ingestion of data: the cold path is dedicated to batch processing and running analytics on staged data, whereas the hot path is dedicated to real-time data analysis. The architecture is made up of three layers:
- Batch Layer
- Speed Layer
- Serving Layer
To summarize how the layers work together: the incoming real-time data stream is written to the master repository in the batch layer, which keeps the master data, while the latest slice of the stream is cached in the speed layer. Data consumers then run queries against the real-time views, against the historical batch views, or against a combination of both to gain data insights.
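Here is a minimal, tool-agnostic Python sketch of that ingestion flow, assuming a simple page-view event stream; the class, attribute, and parameter names are illustrative only and not tied to any specific product.

```python
# Illustrative sketch of the ingestion flow described above: every incoming
# event is appended to the immutable master dataset (batch layer) and also
# pushed to the speed layer, which keeps only a bounded window of recent data.

from collections import deque

class LambdaIngestion:
    def __init__(self, speed_window=10_000):
        self.master_dataset = []                         # batch layer: append-only, never updated in place
        self.recent_events = deque(maxlen=speed_window)  # speed layer: bounded buffer of the latest data

    def ingest(self, event):
        self.master_dataset.append(event)   # cold path: staged for batch processing
        self.recent_events.append(event)    # hot path: available for near real-time queries
```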
1. Batch Layer
This layer is responsible for keeping the master data and pre-computed datasets. It can be considered the single source of truth for your business, where data is immutable and eternally true. If something goes wrong in the serving or speed layer, we can reconstruct both from scratch using only the batch layer’s data.
The other segment of the batch layer is the pre-computed datasets. These are derived from the data in the master repository on the batch layer, but they are compiled ahead of time for frequently requested information.
The batch layer also serves as a correction for the speed layer, because it can aggregate large volumes of historical data and provides full fault tolerance compared to streaming systems.
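Below is a minimal Python sketch of how such a pre-computed batch view might be rebuilt from the immutable master dataset; the page-view example and the "page" field name are illustrative assumptions, not part of any specific framework.

```python
# Sketch of rebuilding a pre-computed (batch) view from scratch, using only
# the immutable master dataset.

from collections import Counter

def recompute_pageviews_per_page(master_dataset):
    """Batch view: total page views per page, derived only from master data.

    Because the master data is immutable, this can be re-run at any time to
    rebuild or correct the view (e.g. after a bug in earlier view logic).
    """
    view = Counter()
    for event in master_dataset:
        view[event["page"]] += 1
    return dict(view)
```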
2. Speed Layer
This layer is responsible for providing the latest data in the business; it bridges the gap between historical information and the real-time data streaming in from live operations. Live data exposed through the speed layer’s buffer covers only a predefined window of the stream. It is fast and provides near real-time views on streaming datasets. Any service used for the speed layer must possess the following characteristics: it should be fault tolerant, support random reads and writes on real-time data, and scale with the rate of real-time ingestion.
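Continuing the illustrative page-view example, here is a small sketch of a speed-layer view that is updated incrementally per event and reset once a fresh batch view covers the same data; it stands in for a real streaming system, which this post does not prescribe.

```python
# Sketch of a speed-layer view: updated incrementally per event instead of
# being recomputed, and covering only data that arrived since the last batch run.

class RealtimePageviews:
    def __init__(self):
        self.view = {}  # page -> count since the last batch view was published

    def update(self, event):
        # Incremental, low-latency update on the hot path.
        page = event["page"]
        self.view[page] = self.view.get(page, 0) + 1

    def reset(self):
        # Called once a new batch view covering this data has been published,
        # which is how the batch layer "corrects" the speed layer over time.
        self.view = {}
```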
3. Serving Layer
The final layer is responsible for merging the output of the batch views with the speed layer’s real-time views in order to respond to user queries. It delivers the lowest latency and the highest accuracy for operations performed on the master dataset. To further optimize read performance, we can create indexes on the batch views.
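To make that merge concrete, here is a small sketch (again using the hypothetical page-view example from the previous sections) of a serving-layer query that combines a batch view with the speed layer’s real-time view.

```python
# Sketch of a serving-layer query: merge the (older, complete) batch view with
# the (recent, incremental) real-time view to answer a query over all data.

def query_pageviews(page, batch_view, realtime_view):
    """Total page views = batch view count + real-time delta for that page."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

# Illustrative usage:
# total = query_pageviews("/home", batch_view, speed_layer.view)
```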
Summary
To summarize, Lambda architecture is designed to allow ad-hoc queries against both your live and historical data with fault tolerance and scalability. As we have discussed, each layer in this architecture can scale independently based on its workload. It is also extensible on the analytics side: we can easily add new functions or pre-computed views to the batch layer for diverse analytics, which is why most data-savvy companies use this architecture with their own sets of technologies. In my next blog, I will explain how Azure services help us implement Lambda architecture for real-time big data analytics. Stay tuned! If you have questions, leave a comment below or reach out to us via our contact us page!