One of Qualicom’s financial clients wanted to improve the manageability and performance of their on-premises big data Hadoop environment. The existing environment required considerable effort and expertise to monitor the cluster and tune performance according to workload.
To improve processing performance and shorten time to market for new applications, the client decided to migrate the Hadoop environment to the AWS EMR big data platform in the cloud. AWS EMR is a managed cluster platform that is easy to use and manage, low cost, reliable, secure, and flexible, and it can elastically scale the number of processing instances in a cluster up or down.
What we did
Qualicom participated in the development of a transformational framework running on AWS Elastic MapReduce (EMR). We partnered with the client’s project team to deliver the following contributions to the project:
- Designed and implemented a framework on the AWS EMR clusters using Apache Spark to retrieve streaming data in Avro format from Confluent Kafka and save it to HDFS/S3 storage.
- Implemented ETL jobs to load daily CSV/TSV files into Hive tables, parse and transform the data, and save it back into Hive tables.
- Created batch jobs using Apache Spark to read, analyze and aggregate data from large source Hive tables and generate reports that were then saved to destination Hive tables.
- Developed a mini-batch processing library that accelerates Spark job development by handling failure recovery, delta selection, and overall application structure, letting project developers focus on implementing business logic.
- Implemented auto-recovery on failure, ensuring that duplicate messages were not introduced during recovery.
- Used Apache Airflow to trigger Spark jobs in AWS EMR according to schedule.
- Created utilities for various purposes, including configuring message sizes, rates, and volumes for performance testing and generating notification messages for the respective business units.
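The mini-batch, delta selection, and duplicate-free recovery ideas in the list above can be illustrated with a minimal sketch. This is not the library built for the client: it is a plain-Python stand-in (no Spark, Kafka, or Hive), and every name in it (`CheckpointStore`, `select_delta`, `run_mini_batch`, the `(offset, payload)` record shape) is hypothetical. It assumes each record carries a monotonically increasing offset and that the sink can be written idempotently by key.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List, Tuple

# Hypothetical record shape: (offset, payload). Offsets are assumed to increase.
Record = Tuple[int, str]


class CheckpointStore:
    """Persists the last fully processed offset in a small JSON file."""

    def __init__(self, path: str) -> None:
        self.path = path

    def load(self) -> int:
        if not os.path.exists(self.path):
            return -1  # nothing has been processed yet
        with open(self.path) as fh:
            return json.load(fh)["last_offset"]

    def commit(self, offset: int) -> None:
        # Write to a temp file, then rename, so a crash cannot leave
        # a half-written checkpoint behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump({"last_offset": offset}, fh)
        os.replace(tmp, self.path)


def select_delta(records: List[Record], last_offset: int) -> List[Record]:
    """Delta selection: keep only records newer than the checkpoint."""
    return [r for r in records if r[0] > last_offset]


def run_mini_batch(
    source: List[Record],
    store: CheckpointStore,
    sink: Dict[int, str],
    process: Callable[[str], str],
) -> int:
    """Process one mini-batch and return the number of records handled."""
    delta = select_delta(source, store.load())
    for offset, payload in delta:
        # Keyed write: replaying a record after a failure overwrites the
        # same key instead of appending a duplicate.
        sink[offset] = process(payload)
    if delta:
        # Commit only after the whole batch has been written.
        store.commit(delta[-1][0])
    return len(delta)


def demo() -> Dict[int, str]:
    """Runs a batch that fails once mid-way, then retries it."""
    calls = {"n": 0}

    def flaky_upper(payload: str) -> str:
        calls["n"] += 1
        if calls["n"] == 2:  # simulate a one-off failure on the second record
            raise RuntimeError("simulated task failure")
        return payload.upper()

    source = [(0, "a"), (1, "b"), (2, "c")]
    sink: Dict[int, str] = {}
    with tempfile.TemporaryDirectory() as workdir:
        store = CheckpointStore(os.path.join(workdir, "checkpoint.json"))
        try:
            run_mini_batch(source, store, sink, flaky_upper)
        except RuntimeError:
            pass  # checkpoint was not advanced, so the batch is replayed
        run_mini_batch(source, store, sink, flaky_upper)
    return sink
```

The design choice worth noting is the ordering: the checkpoint advances only after the batch is written, and the sink is keyed by offset, so a replay after a failure rewrites the same rows rather than adding new ones. In a real Spark job the keyed dictionary would be replaced by whatever idempotent write the target store supports.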
How it helped
As the bank expands its use of big data frameworks, the migration to AWS EMR in the cloud has simplified the infrastructure and increased productivity, because staff can concentrate on business logic (data mapping, transformation, filtering, manipulation, and so on) rather than on managing the environment. Productivity also increased because client staff did not need to develop expertise in Spark programming.
Reliability and cost-performance improved through the ability to auto-scale the configuration according to workload, to retry failed tasks, and to automatically replace poorly performing instances.