Tracking Financial Crime with Azure Confidential Computing and Sarus Smart Privacy Solution

Authors: Maxime Agostini, Lindsey Allen, and Wolfgang M. Pauli

Combining Azure confidential computing capabilities with Sarus unlocks new ways to pool sensitive data from multiple parties. Working with multiple banks, we demonstrated how a joint solution can track financial crime on pooled transaction data. It significantly improved prediction power compared to siloed data while achieving the highest standards of data security and privacy.

Confidential computing solutions are increasingly popular for handling sensitive data. They bring unprecedented levels of security for processing personal data in a cloud environment. Bringing data from multiple parties onto a confidential computing platform protects against leakage to the cloud vendor and to other parties. But data is only useful when it can be queried, and confidential computing alone provides little protection for the output of queries. How can one guarantee that the outputs themselves do not leak confidential information?

This problem is commonly referred to as output privacy and has been a field of intense research over the past decades. In 2006, Cynthia Dwork, then a researcher at Microsoft, introduced the concept of Differential Privacy as a mathematical definition of the leakage risk of personal information in a computation's output.

The universality of differential privacy makes it the perfect tool for scaling data access while minimizing leakage risk. It can be applied to any data processing workflow without requiring bespoke compliance processes. Sarus implements it in all data manipulation so that outputs never reveal sensitive information to data consumers.
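To make the idea concrete, here is a minimal sketch of the Laplace mechanism, the textbook way to release a counting query with differential privacy. The function names are illustrative and this is not Sarus's implementation, which is considerably more sophisticated:

```python
import math
import random

def laplace_inverse_cdf(u, scale):
    """Map a uniform draw u in (0, 1) to a Laplace(0, scale) sample."""
    u = u - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon, u):
    """Release a count with epsilon-differential privacy.

    A count changes by at most 1 when one individual is added or
    removed (sensitivity 1), so Laplace noise with scale 1/epsilon
    is enough to satisfy the epsilon-DP guarantee.
    """
    return true_count + laplace_inverse_cdf(u, 1.0 / epsilon)

# Example: privatize the number of flagged transactions.
rng = random.Random(7)
noisy = dp_count(1234, epsilon=1.0, u=rng.random())
```

The key property: the noisy answer is almost as useful as the true count for analytics, yet no individual's presence or absence in the data can be inferred from it with confidence.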

In our work, data scientists built models to track and predict criminal activities in financial transactions. The flexibility of Sarus enabled them to experiment with many advanced detection approaches, both rules-based and machine learning-based, and successfully ship powerful detection models. At no point during the analysis were they able to see transaction data.

The architecture that addresses both input and output privacy

The following services were provisioned:

  • A database running in a confidential VM, with additional features such as Transparent Data Encryption, so data is encrypted both in use and at rest.

  • An Azure Key Vault Managed HSM to store encryption keys.

  • A Sarus instance running on a separate confidential VM, in an isolated subnet.

  • A standard data science VM for data practitioners.

  • An Azure Bastion set up for all administration tasks. Users logged in to the data science VM through a web interface using their local or Azure AD credentials.

The setup implemented the highest standard of security based on zero trust principles. It used Microsoft Defender for Cloud and ensured all traffic remained private. The SQL Server instance could only be queried by the Sarus instance. The Sarus API could only be reached from the data science VM and was configured so that results complied with a strict privacy policy, ensuring that outbound information could not include any confidential data. Data ingestion was encrypted end-to-end.

The data science VM was the only entry point to the architecture. It was accessible from the Internet via the Azure Bastion, and an NSG strictly restricted its access to the Sarus VM on port 80. It could not reach any other resources in the Azure-hosted virtual environment. It also had outbound access to the Internet.


Making the data science workflow fully privacy-preserving while keeping the experience unchanged

Once ingestion is finished, data scientists can start working on the remote data through the Sarus API, which can be queried either in SQL or via a Python SDK. Each data processing query is checked against the privacy policy and rewritten to comply with differential privacy objectives. The modified queries are then sent to the SQL Server database for execution. Non-aggregate queries such as SELECT * are run against high-fidelity synthetic data generated beforehand. That synthetic data was itself produced with Sarus's differentially private machine learning solution, protecting against any incidental privacy leakage through the synthetic data.
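The routing behaviour described above can be sketched in a few lines. This is purely an illustration of the decision logic, not Sarus code; real query classification is done on a parsed query plan, not on string matching:

```python
def route_query(sql):
    """Sketch of the routing described above: aggregate queries get a
    differentially-private rewrite and run on the real database, while
    row-level queries such as SELECT * are answered from pre-generated
    synthetic data instead."""
    aggregates = ("count(", "sum(", "avg(", "min(", "max(")
    if any(fn in sql.lower() for fn in aggregates):
        return "dp_rewrite"
    return "synthetic_data"

route_query("SELECT COUNT(*) FROM transactions")   # aggregate -> DP rewrite
route_query("SELECT * FROM transactions LIMIT 5")  # row-level -> synthetic
```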

The data scientists leveraged the Sarus Private Learning SDK to build a comprehensive data processing pipeline and design detection models. The SDK wraps the most common data manipulation libraries (e.g., numpy, pandas, scikit-learn). Methods from those libraries are sent by the SDK to the Sarus API for execution on the real data. Before execution, the Sarus software compiles the desired computation into a differentially-private version so that the output is safe[1].

The data scientists progressively built up their data pipeline while interacting with the remote data through Sarus. They were able to design models that achieved the same performance as if they had been allowed to download the entire dataset to their workstations. They used the same tools and wrote identical code. The crucial difference was that the entire project was carried out without granting access to a single bit of user-level information.


Data science Python code sample that is captured by the SDK and executed on remote data
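Since the original screenshot is not reproduced here, the pattern it illustrated can be sketched as follows: a local proxy object records pandas-style operations instead of executing them, so the plan can be compiled and run server-side under differential privacy. The class and method names below are invented for illustration and are not the actual Sarus API:

```python
class RemoteDataset:
    """Illustrative stand-in for an SDK proxy object: operations are
    recorded locally and would be compiled into a differentially-private
    computation executed on the server, never on local copies of data."""

    def __init__(self, name):
        self.name = name
        self._ops = []  # recorded operations, not results

    def select(self, *columns):
        self._ops.append(("select", columns))
        return self  # chainable; no real data is touched locally

    def value_counts(self, column):
        self._ops.append(("value_counts", (column,)))
        return self

    def plan(self):
        """The recorded plan a server would compile and execute with DP."""
        return list(self._ops)

# Usage mirrors ordinary pandas-style code, but runs nothing locally.
ds = RemoteDataset("transactions")
ds.select("merchant", "amount").value_counts("merchant")
```

This lazy-capture design is what lets data scientists keep their familiar workflow: the code reads like a local analysis, while execution and privacy enforcement happen remotely.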

A successful demonstration that can be scaled to many more use cases

We demonstrated that collaboration on sensitive data can be delivered at scale using modern cloud architectures provided by Azure. End-to-end data protection does not get in the way of the creativity of data scientists and analysts solving new challenges. The applications go well beyond financial crime, with an obvious fit in insurance, healthcare, smart cities, and mobility.

About Sarus: Sarus is a privacy startup that enables research and data science on confidential data. It implements the most recent research in privacy technology and brings it directly to data practitioners. It was part of the Y Combinator W22 batch and is a Microsoft Partner available on the Azure Marketplace.

[1] The reader can find more on differentially-private compilation of data pipelines in this presentation from the PEPR conference.


This article was originally published on Microsoft's AI - Customer Engineering Team Blog.