The client wanted their business intelligence tool to query 90 days of historical data in under two seconds with more than 80 concurrent users. Their existing system, which held data in Hadoop clusters and an on-premise Infobright DB cluster, could not meet these data analytics requirements.
Beyondsoft’s Big Data consulting team proposed a solution using Vertica, a columnar database, hosted on Amazon Web Services (AWS). The AWS cloud solution would provide the added scalability, elasticity, and performance that the customer wanted. The project covered creating Vertica clusters in a repeatable manner, adopting a pipeline-based approach for Vertica DDL, and moving large amounts of data to the Vertica cluster daily.
The project consisted of three phases:
- Phase 1 involved creating a repeatable deployment process for the Vertica cluster through infrastructure as code (Terraform). The Vertica AMI was procured from AWS Marketplace. Beyondsoft engineers added the ability to launch different Vertica nodes as part of a cluster through tags to AWS Elastic Load Balancer (ELB). The Vertica cluster consists of 16 nodes in total, running active/active across two Availability Zones (AZs), and holds 90 days of data, approximately 40 TB. AWS Systems Manager (SSM) and CloudWatch Logs are used to administer the cluster. This Vertica infrastructure as code also integrates with the customer’s self-service tool for their developers.
- Phase 2 included a pipeline-based approach for Vertica DDL. Vertica DDL is pushed through the pipeline using Liquibase, a Java framework for managing and deploying database changes. This ensures that schema changes are never applied manually to production Vertica clusters.
- Phase 3 involved setting up ETL from the Hadoop cluster to Vertica using a producer/consumer pattern. The ETL code is written in Python and runs in on-demand Fargate containers, which extract data from Hadoop and store it as zipped files in S3. From there, jobs are created to load the data from S3 into Vertica. The data volume is around 120 GB/day, with around 570M rows loaded at its peak. The customer-facing Java application has several dashboards that retrieve data from Vertica in under two seconds, even under concurrent usage.
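The producer/consumer flow in Phase 3 can be illustrated with a minimal Python sketch. This is not the project's actual code: the bucket name, table name, batch names, and helper functions are hypothetical, an in-memory dictionary stands in for S3, gzip-compressed CSV is assumed as the "zipped" format, and the `COPY` options shown are illustrative and should be checked against the Vertica documentation for the version in use. The sketch only builds the load statements; executing them against Vertica is left out.

```python
import csv
import gzip
import io
import queue
import threading

SENTINEL = None  # tells the consumer that no more files are coming


def compress_rows(rows):
    """Serialize rows as CSV and gzip them, as a producer might before upload."""
    text = io.StringIO()
    csv.writer(text, lineterminator="\n").writerows(rows)
    return gzip.compress(text.getvalue().encode("utf-8"))


def build_copy_statement(table, s3_uri):
    """Build a Vertica COPY statement for one gzipped CSV object (assumed options)."""
    return f"COPY {table} FROM '{s3_uri}' GZIP DELIMITER ',';"


def producer(batches, bucket, work_queue, object_store):
    """Extract step: compress each batch, 'upload' it, and queue its URI."""
    for name, rows in batches:
        uri = f"s3://{bucket}/{name}.csv.gz"
        object_store[uri] = compress_rows(rows)  # stand-in for an S3 upload
        work_queue.put(uri)
    work_queue.put(SENTINEL)


def consumer(table, work_queue, statements):
    """Load step: turn each queued URI into a COPY statement (not executed here)."""
    while True:
        uri = work_queue.get()
        if uri is SENTINEL:
            break
        statements.append(build_copy_statement(table, uri))


def run_pipeline(batches, bucket="example-etl-bucket", table="analytics.events"):
    """Run one producer and one consumer connected by a thread-safe queue."""
    work_queue = queue.Queue()
    object_store, statements = {}, []
    loader = threading.Thread(target=consumer, args=(table, work_queue, statements))
    loader.start()
    producer(batches, bucket, work_queue, object_store)
    loader.join()
    return object_store, statements
```

Decoupling extraction from loading through a queue lets the two sides run at different speeds. In a deployment like the one described, the queue and the in-memory store would be replaced by real services, and the consumer would execute each statement through a Vertica client connection.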
Technologies used: Vertica on AWS, AWS SSM, AWS CloudWatch Logs, AWS S3, AWS ELB, AWS Fargate, AWS Parameter Store, AWS ECR, Python, Jenkins, and others.
Beyondsoft trained the client’s data analytics team on the newly created solution and on Terraform, and provided a runbook so that the team could both manage and extend the solution in the future. Beyondsoft also provided education on the various AWS services, along with customized training sessions on other topics.
Moving from an on-premise cluster to the cloud increased the scalability, agility, and performance of the whole solution. Taking a DevOps approach through data pipelines decreased go-to-market time for code changes. Infrastructure as code provided a repeatable way to create infrastructure, increasing operational consistency and reducing bugs.