There were a number of challenges associated with building this data application. Primarily:
phData helped the team define their data product infrastructure as code and coherent pipelines, rather than discrete multi-tenant services. The pipelines incorporated key security recommendations from the CSO and cloud center of excellence teams. Additionally, the services were scheduled and managed dynamically to ensure workloads matched the resources needed to complete the work queues.
phData is the leader in end-to-end services for machine learning and data analytics on AWS. phData has demonstrated the ability to develop data-driven applications with a focus on automation, which inherently leads to cost savings for our customers.
The solution included a number of basic elements within AWS, including S3, IAM, and CloudFormation. Source systems publish data to S3 using a number of methods including integrations directly with S3 or utilizing batch processing scripts to push data at specific times.
EMR was chosen for a number of different reasons:
To gain visibility into the overall data-pipeline landscape, including where data was within the application and any errors that came up during processing, an end-to-end workflow solution was required. The customer chose Airflow to support this requirement.
Source systems publish data to S3 in whichever way is most efficient for each system, and Airflow takes over the higher-level orchestration, ensuring data moves appropriately from S3 through StreamSets and EMR, and eventually to Redshift for data warehousing. Airflow doesn’t do any of the real work, such as the actual S3 copies or JDBC connections. It simply tells the various services to do the work.
Now, a pipeline can start at a scheduled time, and the first step in the workflow would be an API call to StreamSets to execute the copy of data and start the format conversions. Once Airflow determines the StreamSets pipeline is complete, Airflow will execute the Spark applications on EMR via JobFlow.
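The ordering described above can be sketched in plain Python. This is a minimal illustration, not the customer's actual DAG: the three callables and the status strings "FINISHED" and "FAILED" are hypothetical stand-ins for the StreamSets API call, its status check, and the EMR step submission. In Airflow, each of them would typically be its own task.

```python
import time
from typing import Callable


def run_pipeline(
    start_ingest: Callable[[], str],
    ingest_status: Callable[[str], str],
    submit_spark_step: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
) -> str:
    """Trigger ingestion, wait for it to finish, then submit the Spark step.

    The orchestrator only issues calls and checks status. The services
    behind the callables do the actual data movement and processing.
    """
    run_id = start_ingest()  # hypothetical: API call that starts the ingest pipeline

    for _ in range(max_polls):
        status = ingest_status(run_id)
        if status == "FINISHED":  # hypothetical status value
            break
        if status == "FAILED":  # hypothetical status value
            raise RuntimeError(f"ingest run {run_id} failed")
        time.sleep(poll_seconds)
    else:
        # The loop ran out of polls without seeing a terminal status.
        raise TimeoutError(f"ingest run {run_id} did not finish in time")

    # Only reached once ingestion reported success.
    return submit_spark_step()
```

Passing the service calls in as functions keeps the ordering logic testable without touching StreamSets or EMR, which mirrors the idea that the orchestrator itself does none of the heavy lifting.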
Airflow supports a number of different integrations, including StreamSets and AWS native services. Airflow also allows data engineers to develop workflows using Python. The benefit to this is that the customer can develop workflows like any other software team within the company using a standard, well-defined CI/CD process in Jenkins.
They chose to utilize StreamSets within all their data ingestion processes, which allows them to see ingestion status across the entire enterprise. StreamSets pipelines are hosted on EC2 which provides the capability to use IAM instance profiles. We gave the StreamSets servers explicit read permissions to S3 and explicit EMR permissions to execute Spark applications. This was a key security consideration.
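As an illustration of what "explicit" permissions might look like, the sketch below builds an IAM policy document for the instance profile. The exact actions the customer granted are not stated in this write-up, so the action lists, the statement IDs, and the bucket and cluster values are assumptions chosen to show the least-privilege shape.

```python
import json


def streamsets_instance_policy(bucket: str, cluster_arn: str) -> dict:
    """Build a least-privilege policy: read one bucket, submit steps to one cluster."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadLandingBucket",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # ListBucket applies to the bucket
                    f"arn:aws:s3:::{bucket}/*",  # GetObject applies to its objects
                ],
            },
            {
                "Sid": "SubmitSparkSteps",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": [
                    "elasticmapreduce:AddJobFlowSteps",
                    "elasticmapreduce:DescribeStep",
                    "elasticmapreduce:ListSteps",
                ],
                "Resource": cluster_arn,
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder bucket name and cluster ARN, for illustration only.
    policy = streamsets_instance_policy(
        "example-landing-bucket",
        "arn:aws:elasticmapreduce:us-east-1:123456789012:cluster/j-EXAMPLE",
    )
    print(json.dumps(policy, indent=2))
```

Because the policy is attached to the instance profile, no long-lived access keys need to be stored on the StreamSets servers.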
The customer took advantage of AWS Savings Plans which helped reduce the cost of EMR compute. The cluster also utilizes EC2 Spot Instances as much as possible for even more cost savings. Since most of the processing jobs aren’t time-sensitive and jobs can execute overnight, utilizing Spot Instances was fairly easy.
For the dev environment, the cluster is a long-running cluster which allows engineers to test Spark applications on the fly. However, for the production cluster, Airflow manages the starting and stopping of EMR.
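The difference between the two environments, together with the Spot usage, can be expressed as a cluster request. The sketch below builds the kind of dictionary that boto3's EMR `run_job_flow` call accepts. The cluster name, release label, instance types and counts, and the default role names are assumptions, as is the choice to put only the task group on Spot.

```python
def build_cluster_request(env: str) -> dict:
    """Return run_job_flow-style arguments for a dev or prod Spark cluster."""
    # Dev stays up between jobs for interactive testing. Prod terminates
    # once its steps finish, so the orchestrator controls its lifetime.
    long_running = env == "dev"

    return {
        "Name": f"spark-{env}",        # hypothetical naming scheme
        "ReleaseLabel": "emr-6.15.0",  # assumed release
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "primary",
                    "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 1,
                },
                {
                    "Name": "core",
                    "InstanceRole": "CORE",
                    "Market": "ON_DEMAND",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 2,
                },
                {
                    # Task nodes hold no HDFS data, so losing a Spot
                    # instance costs recomputation rather than data.
                    "Name": "task",
                    "InstanceRole": "TASK",
                    "Market": "SPOT",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 4,
                },
            ],
            "KeepJobFlowAliveWhenNoSteps": long_running,
            "TerminationProtected": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("emr").run_job_flow(**build_cluster_request("prod"))`.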
Amazon RDS for PostgreSQL and Redshift served as the data authority and the data warehouse, respectively. Internal analysis tools ran queries against Redshift and provided reports to business decision makers. Postgres stored the data authority, in which unique identifiers were assigned to the various sets of data. Other applications wanting to use actuarial data consult this authority to ensure they are accessing the most up-to-date data, which also ensures there is no duplication of that data.
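One way to picture such an authority is a small registry table keyed by a unique identifier. The sketch below uses Python's built-in sqlite3 as a stand-in for Postgres so it runs anywhere. The table name, columns, and the `publish` and `resolve` helpers are hypothetical and are not taken from the customer's schema.

```python
import sqlite3
import uuid
from typing import Tuple


def open_authority() -> sqlite3.Connection:
    """Create an in-memory registry with one row per logical dataset."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE dataset_authority (
            dataset_id TEXT PRIMARY KEY,
            name       TEXT NOT NULL UNIQUE,
            location   TEXT NOT NULL,
            version    INTEGER NOT NULL
        )
        """
    )
    return conn


def publish(conn: sqlite3.Connection, name: str, location: str) -> str:
    """Register a dataset, or point its existing ID at a newer location."""
    row = conn.execute(
        "SELECT dataset_id FROM dataset_authority WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        dataset_id = str(uuid.uuid4())
        conn.execute(
            "INSERT INTO dataset_authority VALUES (?, ?, ?, 1)",
            (dataset_id, name, location),
        )
    else:
        # Same logical dataset: keep the ID, bump the version.
        dataset_id = row[0]
        conn.execute(
            "UPDATE dataset_authority SET location = ?, version = version + 1 "
            "WHERE dataset_id = ?",
            (location, dataset_id),
        )
    conn.commit()
    return dataset_id


def resolve(conn: sqlite3.Connection, dataset_id: str) -> Tuple[str, int]:
    """Return the current (location, version) for a dataset ID."""
    row = conn.execute(
        "SELECT location, version FROM dataset_authority WHERE dataset_id = ?",
        (dataset_id,),
    ).fetchone()
    if row is None:
        raise KeyError(dataset_id)
    return row[0], row[1]
```

Consumers that hold only the identifier always resolve to the latest location, and the uniqueness constraint on the name prevents a second copy of the same dataset from being registered.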
The customer wanted to build a truly cloud-native application for its annuity data. Because of this, we opted to utilize AWS Glue Catalog for storing metadata and schema information.
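To make the metadata and schema information concrete, the sketch below builds a table definition in the shape the Glue `create_table` API expects for its `TableInput` argument. The write-up does not say which file format the pipeline produced, so Parquet is an assumption here, and the table name, columns, and S3 location in the usage example are placeholders.

```python
from typing import Dict, List, Tuple


def parquet_table_input(
    name: str, location: str, columns: List[Tuple[str, str]]
) -> Dict:
    """Describe an external Parquet table stored under an S3 prefix."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            # Each column is a (name, Hive type) pair.
            "Columns": [{"Name": col, "Type": typ} for col, typ in columns],
            "Location": location,
            "InputFormat": (
                "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
            ),
            "OutputFormat": (
                "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"
            ),
            "SerdeInfo": {
                "SerializationLibrary": (
                    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
                )
            },
        },
    }


if __name__ == "__main__":
    # Placeholder names, for illustration only.
    table = parquet_table_input(
        "contracts",
        "s3://example-curated-bucket/contracts/",
        [("contract_id", "string"), ("issue_date", "date")],
    )
    print(table["StorageDescriptor"]["Columns"])
```

With boto3, the result would be passed as `glue.create_table(DatabaseName=..., TableInput=table)`. Because the schema lives in the catalog rather than in any single cluster, it survives when a transient EMR cluster is shut down.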
All of the infrastructure represented in the architecture diagram was managed with CloudFormation and Jenkins. This lets developers push CloudFormation changes to Git which triggers Jenkins jobs and creates, updates, or deletes CloudFormation stacks. AWS Console access for manual changes was very limited and utilized mostly for proof-of-concept work when developing the application.
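The create, update, or delete decision a Jenkins job has to make can be sketched as a comparison between the templates in Git and the stacks that are currently deployed. This is an illustrative reconciliation function, not the customer's Jenkins logic. The function name is hypothetical, and it assumes the deployed set contains only stacks owned by this pipeline.

```python
from typing import Dict, Set


def plan_stack_actions(
    repo_templates: Set[str], deployed_stacks: Set[str]
) -> Dict[str, str]:
    """Map each stack name to the CloudFormation action it needs."""
    actions: Dict[str, str] = {}
    for name in sorted(repo_templates | deployed_stacks):
        if name in repo_templates and name in deployed_stacks:
            actions[name] = "update"  # template changed or unchanged in Git
        elif name in repo_templates:
            actions[name] = "create"  # new template, no stack yet
        else:
            actions[name] = "delete"  # template was removed from Git
    return actions


if __name__ == "__main__":
    # Placeholder stack names, for illustration only.
    plan = plan_stack_actions(
        repo_templates={"network", "emr", "airflow"},
        deployed_stacks={"network", "emr", "legacy-etl"},
    )
    for stack, action in plan.items():
        print(f"{stack}: {action}")
```

Each action corresponds to a CloudFormation API call (`create_stack`, `update_stack`, or `delete_stack` in boto3), so Git remains the single source of truth for what should exist.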
The solution also included Jupyter notebook access. This provided a couple of benefits. First, data science teams utilized the data and executed ad-hoc Spark jobs to review data quality. Second, the data engineering team used the notebook to test minor changes to Spark applications.
EMR offered the customer a modern processing engine that scaled dynamically to meet resourcing requirements. Its transient nature, along with Spot Instances to provide computing needs, proved to be cost-effective. The service was familiar to the existing developers who had backgrounds in Spark, which accelerated adoption. In the end, EMR was successfully implemented within a very short timeline.