
How to develop and nurture a data warehouse in the age of cloud computing

02 Dec, 2020

Introduction

In the digital age, emerging technologies are rapidly increasing the amount of data generated by enterprise operations. This rapid, large-scale flow of data out of the enterprise eco-system calls for new methods of data handling. This paper describes cloud computing methodologies for developing and nurturing a data warehouse platform, so that an enterprise can make use of its data and transform itself into a ‘Data Driven Decision Making’ enterprise.

Typical view of Data Warehouse architecture

Insights from data are vital for every organization to stay relevant and competitive. To obtain them, organizations build a data warehouse to mine data of various sizes using various technologies and methodologies. A data warehouse (DW) helps to build business intelligence and is an integral part of the digital system of an enterprise.

The DWs act as:

  • A central repository of data from heterogeneous sources.
  • A central repository of a huge amount of historical data.

The DWs help an enterprise to:

  • Identify varied patterns of business transactions.
  • Create varied data reports and analytics.

On-Premise computing eco-system to build DWs

Traditionally, DWs are built in an on-premises and/or private cloud environment. The process of developing and nurturing the data warehouse involves:

  • Computing platform
    • Extracts the data
    • Transforms the data
    • Loads the data into the target storage system, aka database.
  • Database platform
    • The storage
    • Ability to handle the large amount of data in an optimal way

The standalone application that performs the ETL functionality is connected to multiple data sources and ingests the data into the target database at various stages in the process of developing the DWs.

ETL as a standalone application, the view of DW eco-system
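
As a rough illustration of the standalone approach, the sketch below runs Extract, Transform, and Load for every source inside a single process. It is only a minimal sketch under stated assumptions: the source names, the sample CSV content, the `orders` table, and the in-memory SQLite target are all hypothetical stand-ins for real data sources and a real database.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice each source would be a file, an API, or a database.
SOURCES = {
    "orders_eu": "order_id,amount\n1,10.50\n2,20.00\n",
    "orders_us": "order_id,amount\n3,5.25\n",
}

def extract(raw_csv):
    """Extract: parse one CSV source into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows, source_name):
    """Transform: cast the types and tag each row with its source."""
    return [(int(r["order_id"]), float(r["amount"]), source_name) for r in rows]

def load(conn, records):
    """Load: write the transformed records into the target table."""
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

def run_etl(conn):
    """Run all three steps for every source, one after another, in one process."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, source TEXT)"
    )
    for name, raw in SOURCES.items():
        load(conn, transform(extract(raw), name))

# An in-memory SQLite database stands in for the target storage system.
conn = sqlite3.connect(":memory:")
run_etl(conn)
row_count, total = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
print(row_count, total)
```

Because every source passes through the same process, adding sources or data volume increases the load on that one application.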

Problems with Non-Cloud computing eco-system

In the process of building DWs, non-cloud legacy technologies make the computational and/or data management platforms monolithic. Because DWs store huge amounts of data, the eco-system tends to be resource-demanding, which leads to problems such as:

  • High response-time
  • Long duration of computing
  • Difficulty in scaling resources up and down
  • Huge resource inventory

These problems occur in two places: the computation platform and the database. This paper focuses on the computational platform and the methodologies that can alleviate the problems. The advantages and comparison of cloud-based DWs are outside the scope of this paper.

The computational platform that performs ‘Extract-Transform-Loading’ is built using a custom-developed or COTS (commercial off-the-shelf) application, known as a Data-Integration/ETL application.

In this approach, as the number of data sources and/or the data size increases, the computation platform, depicted in the above image as ‘Data Integration/ETL’, comes under a resource crunch. To solve the problem, the resources of the Data-Integration/ETL application must be increased. The possible approaches are:

  • Increasing the physical resources of an instance of the Data-Integration/ETL application
  • Increasing the number of instances of the Data-Integration/ETL application

These approaches lead to:

  • Overhead of IT-operations
  • Increased CapEx
  • Increased OpEx
  • Lack of agility in resource utilization

The database platform also faces resource-related issues, because on-premises resources are always finite and scaling up leads to problems similar to those summarized for the computational platform.

Cloud computing eco-system to build DWs

In modern computing, the cloud platform resolves the challenges of the traditional computing platform and makes the process of developing and nurturing DWs more agile and more responsive to enterprise requirements.

Transitioning to the cloud and deploying technologies based on modern computing architecture helps an organization to:

  • Lower the CapEx
  • Lower the OpEx
  • Increase agility in developing enterprise DWs with heterogeneous data sources.
  • Optimize OpEx by scaling the resources up and down

A multi-tasking monolithic computing platform can be broken down into smaller containerized computing systems; such a strategy is termed a micro-computing strategy. A containerized computing platform is termed a micro-computing platform, and each such container-based platform performs one job at a time.

The advantages of using a micro-computing strategy can be summarized as:

  • Optimized OpEx
  • Ability to invoke a micro-computing platform as and when needed by a business process.
  • Releasing the resources on completing the job, and enabling the resources to be available for other micro-computing platforms
  • Optimized resource utilization, by scaling the resources up and down dynamically
  • Data streaming
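
The on-demand behaviour in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: a thread pool stands in for a container runtime, and `clean_job` and `run_on_demand` are hypothetical names, not part of any cloud provider's API.

```python
from concurrent.futures import ThreadPoolExecutor

def clean_job(record):
    """A single-purpose job: normalise one record and nothing else."""
    return record.strip().lower()

def run_on_demand(job, payloads, max_workers):
    """Start workers only for this batch and release them when it completes."""
    # The pool stands in for a container runtime (an assumption of this sketch):
    # the `with` block acquires workers on entry and frees them on exit.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(job, payloads))

# Scale down for a small batch, scale up for a larger one.
small_batch = run_on_demand(clean_job, ["  Alpha "], max_workers=1)
large_batch = run_on_demand(clean_job, [" A ", "B  ", " c"], max_workers=3)
print(small_batch, large_batch)
```

Each call sizes its worker pool to the batch at hand and holds no workers between calls, which mirrors the idea of invoking a micro-computing platform when needed and releasing its resources afterwards.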

Cloud-native technologies and careful orchestration among the jobs help complete the data Extraction-Transformation-Loading activities efficiently.

Such a scenario can be depicted as:

Typical architectural view of Data warehouse using Cloud technologies

Emerging cloud technologies have made it easier to develop and nurture workflow-based data management platforms and have helped to resolve the problems that existed in traditional data management platforms.

Key activities to consider when designing micro-computing containers:

  • Data collection
  • Data cleaning & transformation
  • Data Ingestion for storage
  • Data storage & management

To realize cloud DWs, an enterprise can use the services of leading cloud platform providers, namely Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). To summarize, the following technological options are available for establishing a cloud data warehouse:

  • Computing
    • Container- and microservices-based data Extraction-Transformation-Loading platform
  • Data platform
    • AWS: S3, RDS, Redshift and others
    • GCP: BigQuery, Cloud Storage and others
    • Azure: Blob Storage, SQL Data Warehouse and others
  • Data Streaming
    • Apache-Kafka
  • Workflow orchestration and management
    • Chronos, Azkaban, Apache-Airflow, Quartz and others

At coMakeIT, we have expertise in delivering data-flow (event-based) data management platforms across the cloud platform providers AWS, GCP, and Azure.

For an assessment and help with your cloud transition, please visit our Cloud Transformation @ CoMakeIT page.
