In the Part 1 of this article I had written about what should be the goals of a BI Modernization strategy and how grouping your existing BI use cases by data latency requirements can help clarify your BI Strategy. The three groupings by data latency are Real-Time, Operational and Analytical Reporting. In this article I will cover possible architectures for each of these use models.
The key design assumption for these architectures are the following –
1. Companies will have multiple visualization tools
2. A single architecture and set of tools might not be able to satisfy all the three data latency use cases.
The characteristics of real time reporting are that users need the data “as it is any point” in the source system when they run the report. The users need data in real-time from one data source like Oracle EBS, SAP, or Salesforce, and the volume of data required for each report is relative small — few tens of thousands at most and the UI needs of the reports are simple, which means mostly they need a table like report. Given the fact that there are multiple source systems that need real-time reporting each might have a different way of accessing the data. Some might have JDBC/ODBC interfaces where you can run SQL queries while others might have an API interface. The other consideration is that there might be multiple visualization tools which can be used for real-time reporting. So, given the fact that there are multiple back end systems and multiple visualization platforms and there is no need to join one system to another (no need for query federation) a central data modeling tool would be ideal. This will avoid creating data models in different data visualization tools and give the capability to restrict the data volume and query performance on the source systems, so the reporting platforms do not impact the source system resources. A leading data virtualization tool like Denodo will be a good solution for this use case and I have written about it here in detail. The data virtualization tool will provide the semantic layer and it will enable the various BI tools to connect to it so if a new BI tool is introduced there is no need to redo the data model.
The data needed for operational reporting are usually in multiple systems and the reporting solution should be able to join these data together. Each of these systems can have data in a different shape and format and hence it is important to bring the required data into a central database. If you have a lot of data sources and millions of rows that need to be brought in, then query federation is usually not an option as there will be significant performance differences between the different sources and hence user experience might not be optimal. So, you will need to decide on an ELT/ETL tool to move the data and a database to host the data. Once the data is brought into the central database then the data modeling exercise begins where the data is reshaped and joined appropriately based on the reporting requirements. After the data modeling is done then business friendly labeling can be applied and calculated columns like currency translations, gross margin can be added. Once this is complete, then the reporting and data visualization activity can begin.
There are two approaches companies can make –
- Choose one tool which does all the tasks — ETL/ELT, Database, Data Modeling, and Data Visualization. There is a new industry term for these kinds of tools called Unified Data and Analytics Platforms (UDAP). These UDAP tools don’t have best of breed features in any of these four domains but the advantage is that they have a single stack which leads to well-integrated set of capabilities which should lead to significant time to value. Examples of these tools include Incorta, Birst, Domo, Pentaho and others. I specifically have good experience with Incorta which has been very good for operational reporting from large complicated on-premise and cloud data sources and joining them at scale. I have written about the Incorta architecture and capabilities here.
2. Use best of breed software to create operational reporting stack
A best of breed approach can also be taken to create the operational reporting stack. In this case, you will have to choose an ETL/ELT including CDC mechanism which can get near real-time data into the target database quickly and without much transformation in the process. I have talked about how to enable real-time analytics to Snowflake here in detail. The transformation will happen after it is loaded into the target database. In this case, an ETL/ELT tool like Matillion, Stitch, FiveTran and others can be used to move the data into the target database. In today’s world the target database totally makes sense to be one of the major cloud platforms like Snowflake, Redshift, or Google Big Query. If you are using a tool like Stitch or FiveTran which are primarily data ingestion tools then using dbt to transform the data will be very useful. Matillion does both ingestion and transformation as well as orchestration.
This stack works best if all the major data sources are on the cloud. If there are large volumes of on-premise data and data is needed to be loaded several times a day, I don’t think this will be a very robust data pipeline. The biggest advantage is that you will have your BI tool of choice for operational, analytical and real-time reporting. You will have to evaluate your data sources, volume and frequency of loads to make sure this is the best architecture for your operational reporting.
Analytical reporting has the following characteristics –
1. Analyze data trends over months, quarters or years.
2. Frequency of data loads can be daily, weekly or monthly from various sources. No need for loading several times a day.
3. The data is usually stored for the long term in the form of snapshots.
4. Data is summarized to a more aggregate form.
5. Data can come from multiple sources.
6. Analysts and users prefer to use their favorite BI tools. The end users skew more towards senior management.
The proposed architecture for analytical reporting is not too different from the best of breed tools for operational reporting. This architecture should address all the six major characteristics of analytical reporting.
The important distinction between operational reporting and analytical reporting are the reduced frequency of data loads, use of snapshots and aggregations to store data for long term, and the use the best self-service data visualization tool that your organization has access to. The modern cloud data warehouses like Snowflake, Redshift and Google Big Query are well suited for this purpose as they perform fast, provide cheap storage for snapshots and data retention and can be accessed by a wide variety of BI tools for visualization.
When you look to modernizing the data architecture for your company, it might be helpful to classify your current BI landscape into real-time, operational and analytical reporting buckets. This will give clarity on how significant each of these use models are for your company. Once that is done you can design an architecture for each of these and then build your existing reports and analysis into one of these three architectures and then continue using these platforms for your future requirements.
I would love feedback on whether thinking about your BI landscape in this way using data latency requirements is useful to you and your organization. I look forward to your comments.