Investment Firm Improves Data Quality by Applying Data Engineering Best Practices to Its Existing Data Pipeline
Customer's Challenge
A global investment firm receives a large amount of data from third-party partners to help it manage private investment portfolios. The firm already had a dedicated data pipeline feeding into Microsoft Azure, but the incoming files often contained inconsistencies.
The customer wanted to both establish internal quality control and create uniform data contracts with external partners to ensure file consistency.
phData's Solution
The phData team both established a testing process in the data pipeline and equipped the customer with what they needed for better data contracts.
Using Great Expectations (an open-source library for testing, documentation, and profiling), the phData team implemented a data ingestion pattern that can be scaled and reused as needed. The team also provided a framework for taking on new data sources, making it easier for the client to onboard new data providers and maintain data quality.
Results
Ultimately, this solution means that what the customer sees in their PowerBI instance is already quality assured: they don’t have to deal with multiple formats or multiple criteria with each new batch.
Put another way: when IT receives a dataset for ingestion, the architecture now ensures that the data follows a certain contract. There’s no need to worry about the quality details anymore, whether it’s defining attributes with each batch or running a separate validation process.
Most importantly, the customer has a mechanism they can work with to onboard new datasets and a framework to onboard new data partners. Unlike before, they’re now working with high-quality data files, every step of the way.
The Full Story
As a global investment firm, the customer needs three things to be successful in its data management: speed, security, and accuracy.
The customer’s IT organization is responsible for managing all the critical data assets related to the private funds managed by the global investment firm. But it’s not just customer data held internally that they’re responsible for. The IT team’s work includes building data pipelines in Microsoft Azure to ingest datasets from external data providers (the institutions where customers hold their investment portfolios).
Ingesting this external data allows the client to track its customers’ assets across platforms. With the data, they can make better business decisions based on the debt profile of assets and how the portfolio changes over time.
To move forward, they needed to improve not only the speed and security of this external data but also its accuracy, building testing and validation into the pipeline itself. They also needed a solidified approach to third-party data contracts to ensure file consistency in the long term.
Why phData?
The customer needed a partner they could trust to maintain their existing data pipeline and optimize data quality.
With phData’s expertise in both Microsoft Azure (the core technology) and navigating complex data relationships, we gave the customer the confidence it needed to move forward.
Revisiting Data Quality
The phData team quickly realized that the customer had data quality issues with every data provider they worked with.
This isn’t unique to this customer, either. When IT has data coming from external systems, quality is always a concern. For the phData team, solving data quality went hand in hand with solving the customer’s larger data ingestion problems.
It became clear that it wouldn’t be enough to just tweak the pipeline or put new contracts in place: we first needed to take a step back to identify the data quality issues the customer had, where they stemmed from, and how we could resolve them.
More than anything, data types matter (string, array, Boolean, etc.). For example, things can get confusing, fast, especially if you try to filter on a Boolean field when the data actually holds free-form text such as “true” and “untrue.” The customer’s data providers didn’t understand this. Running the team through data modeling exercises helped the customer better understand their data, which in turn helped them communicate more effectively about what they needed.
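To make the point concrete, here is a minimal sketch of the kind of type checks involved, written against the classic pandas-backed Great Expectations API; the column names and sample values are hypothetical, not the customer’s actual schema:

```python
import pandas as pd
import great_expectations as ge

# Hypothetical partner file: "is_defaulted" should be Boolean,
# but arrives as free-form text such as "true" / "untrue".
df = pd.DataFrame({
    "fund_id": ["F001", "F002", "F003"],
    "market_value": [1_250_000.0, 980_000.0, 2_400_000.0],
    "is_defaulted": ["true", "untrue", "TRUE"],
})

gdf = ge.from_pandas(df)

# The kind of contract we want partners to meet: real numerics and real Booleans.
gdf.expect_column_values_to_be_of_type("market_value", "float64")
result = gdf.expect_column_values_to_be_in_set("is_defaulted", [True, False])

print(result.success)  # False: the "Boolean" column is actually text
```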
Overall, this jumpstarted an internal conversation on standardization and engineering best practices. This education layer is one of our favorite parts of working with customers on difficult data problems.
Revamping the Data Pipeline
On top of the education piece, the phData team implemented a scalable and reusable data ingestion pattern for the client.
To say we “built” the data pipeline wouldn’t quite be accurate: the team utilized what the customer already had in place, but put better information architecture and data practices around it.
In short, phData designed and recommended a process to detect data quality issues early in the pipeline. This means that bad data is now caught early instead of being discovered downstream. As a bonus, the design and implementation were completed in a very short amount of time: around eight weeks.
The tools that the phData data science team put to use included:
- Great Expectations (a library that can be used to identify and alert on data quality issues during the ingestion process)
- Azure Data Factory
- Azure Key Vault
- Data Lake Storage
- Parquet
- PowerBI
The team leveraged Azure DevOps to develop these capabilities, from ingestion to feeding data into PowerBI.
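To illustrate how these pieces fit together, here is a hedged sketch of an early validation gate of the kind described above, again assuming the classic pandas-backed Great Expectations API; the file paths, column names, and expectations are hypothetical stand-ins for the customer’s real data contract. An incoming partner file is validated at ingestion, and only a clean batch is staged as Parquet for the lake and PowerBI:

```python
import pandas as pd
import great_expectations as ge

def validate_and_stage(csv_path: str, parquet_path: str) -> bool:
    """Validate an incoming partner file; stage it as Parquet only if it passes."""
    df = pd.read_csv(csv_path)
    gdf = ge.from_pandas(df)

    # A few expectations standing in for the agreed data contract.
    gdf.expect_column_to_exist("fund_id")
    gdf.expect_column_values_to_not_be_null("fund_id")
    gdf.expect_column_values_to_be_between("market_value", min_value=0)

    results = gdf.validate()  # runs every expectation declared above
    if not results.success:
        # Fail fast: bad data never reaches the lake or PowerBI.
        print(f"Validation failed for {csv_path}: "
              f"{results.statistics['unsuccessful_expectations']} expectation(s) not met")
        return False

    df.to_parquet(parquet_path, index=False)
    return True

# Hypothetical usage inside the ingestion step of the pipeline:
# validate_and_stage("incoming/partner_a.csv", "staged/partner_a.parquet")
```

Because the expectations live in one place, the same gate can be reused for each new data provider, which is what makes the ingestion pattern scalable.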
Take the next step with phData.
Looking into better data ingestion for your organization? Learn how phData can help solve your most challenging problems.