Oakland

How to create a secure Azure Data Platform

There are many ways to secure a data platform in Azure and the data it contains. The most effective security controls differ from platform to platform, depending on how the platform is used, its data sources and many other factors. Having a holistic view of the options available at each of the security layers below will help ensure a platform is both fit for purpose and secure.

[Diagram: Azure data platform structure]

Implementing and maintaining good security within a data platform can be a time-consuming and expensive endeavour, so it is often overlooked. It can require specialist knowledge to design and implement, and it can make connecting systems and resources more complex. However, this has to be balanced against the impact of a security breach, which could be financial or reputational and could have safety implications. During the design phase of a data platform, the sensitivity of the data it will contain and the potential impact of a breach should be analysed to determine the level of security controls that should be designed into the platform to mitigate this risk.

Defence in Depth Security:

  1. Data Governance and Classification
  2. Data Protection
  3. Access Control
  4. Authentication
  5. Network Security
  6. Threat Identification and Remediation
  7. Disaster Recovery

 

  1. Data Governance and Classification

In the context of security, data classification means assigning a security rating to data based on the sensitivity of the information it contains. Examples of commonly used classifications are ‘Public’, ‘Internal’ and ‘Confidential’. Data in Azure SQL Database, Azure SQL Managed Instance and Azure Synapse Analytics can be allocated a ‘Classification Label’ and an ‘Information Type’. This can be done manually using T-SQL statements or in the portal on the ‘Data Discovery & Classification’ tab, which can also automatically infer classifications and information types.

Classifying data benefits security because it makes it possible to monitor access to data by classification. Data governance tools such as Azure Purview provide further classification features, such as the ability to associate classifications with data from sources other than those listed above.

  2. Data Protection

It cannot be assumed that data is protected by default, even in Azure Platform as a Service (PaaS) services. The differences in data protection between services must be understood in order to create a fully protected data platform. Data encryption is an important part of data protection because it prevents data from being usable if it is accessed through malicious activity. Many Azure services provide a certain level of data encryption by default, but this should be verified for each service rather than assumed. Azure SQL Database and Azure SQL Managed Instance both have Transparent Data Encryption (TDE) enabled by default. TDE is also available for Azure Synapse Analytics dedicated SQL pools, but in this case it is not enabled by default. Given the rise in popularity of lakehouse-based architectures within data platforms, it is also worth noting that Azure Data Lake Storage (ADLS) uses encryption at rest and in transit to secure the underlying data.

  3. Access Control

Access to data and resources should be granted using the principle of least privilege, so users are only granted the access they need to perform their duties and no more. Access should be regularly reviewed and revoked when it is no longer required. Azure tenants should be carefully designed with Management Groups, Subscriptions and Resource Groups to reduce unnecessary resource visibility, for example preventing the Marketing Department from seeing or accessing Finance Department resources.

Typically, the fewer people who have access to a resource or data, the more secure it is, because this limits the chance of users accidentally or deliberately leaking or altering potentially sensitive or business-critical data. It is not only access to data which should be carefully considered; access to resource configuration is also important. For example, a user who is granted Contributor access to a resource could delete or alter the resource by mistake. They could also make changes which compromise the security of the platform, such as altering or removing Network Security Group (NSG) rules without understanding the consequences. Permissions like these should be limited to those who both require them and have the necessary technical understanding.

Sensitive data can be protected from unauthorised access through the use of data masking. This feature is available in Azure SQL Database, Azure SQL Managed Instance and Azure Synapse Analytics. The amount of data which can be viewed by different users or user groups can be dynamically defined using policies. For example, a database can be configured so that members of the Finance team are able to view full credit card numbers, HR personnel are able to view the last 4 digits of the card number and developers are only able to see a series of ‘X’s. The dynamic data masking feature on the supported services listed above can be configured using T-SQL statements or within the Azure portal on the ‘Dynamic Data Masking’ page.
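To make the credit card example concrete, the sketch below imitates in plain Python the result each group would see. It is not how the service is configured: in the database the masking is applied by the engine itself, and the role names used here are hypothetical stand-ins for whatever users or groups the masking policy refers to.

```python
def mask_card_number(card_number: str, role: str) -> str:
    """Return the view of a card number that a given (hypothetical) role would see."""
    # Work on digits only so spaces or dashes in the input do not matter.
    digits = "".join(ch for ch in card_number if ch.isdigit())
    if role == "finance":
        # Finance sees the full number.
        return digits
    if role == "hr":
        # HR sees only the last 4 digits; everything before them is masked.
        return "X" * max(len(digits) - 4, 0) + digits[-4:]
    # Every other role, including developers, sees only X characters.
    return "X" * len(digits)


card = "4111 1111 1111 1234"  # made-up test number
for role in ("finance", "hr", "developer"):
    print(role, mask_card_number(card, role))
```

Defaulting every unrecognised role to the fully masked view follows the least privilege principle described above: a group only sees more when it has been explicitly granted it.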

Resource locks can be applied within Azure to prevent certain operations on resources. There are two types of resource lock: ‘Read-Only’ and ‘Delete’. A ‘Read-Only’ lock allows users with access to view a resource but not to make any changes to it. A ‘Delete’ lock stops users from deleting the resource. These features prevent accidental resource deletion or reconfiguration.
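The effect of the two lock types can be summarised in a few lines of Python. This is a simplified model for illustration only: it reduces every operation to read, write or delete, which glosses over how individual Azure operations are actually categorised. The enum values mirror the lock level names used by Azure Resource Manager.

```python
from enum import Enum
from typing import Optional


class LockLevel(Enum):
    READ_ONLY = "ReadOnly"            # shown as 'Read-only' in the portal
    CAN_NOT_DELETE = "CanNotDelete"   # shown as 'Delete' in the portal


def is_operation_allowed(operation: str, lock: Optional[LockLevel]) -> bool:
    """Decide whether an operation passes the lock check (simplified model)."""
    if operation not in {"read", "write", "delete"}:
        raise ValueError(f"unknown operation: {operation!r}")
    if lock is None:
        return True                   # no lock: the lock check never blocks
    if lock is LockLevel.READ_ONLY:
        return operation == "read"    # changes and deletes are both blocked
    return operation != "delete"      # CanNotDelete: only deletes are blocked


for lock in (None, LockLevel.CAN_NOT_DELETE, LockLevel.READ_ONLY):
    allowed = [op for op in ("read", "write", "delete") if is_operation_allowed(op, lock)]
    print(lock.value if lock else "no lock", allowed)
```

Passing the lock check does not grant access on its own. The user still needs the relevant RBAC permissions for the operation.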

  4. Authentication

Authentication is the method by which users or services verify their identity when attempting to access another service. Azure services support authentication and access control with Azure Active Directory (AAD). AAD offers many features to enhance security, such as Single Sign On (SSO), which allows users to use one set of credentials to sign on to multiple services.

By default, Role Based Access Control (RBAC) permissions on resources are granted to identities authenticated with AAD, and can be assigned to individual users or to users grouped into AAD groups. Data access can also be authenticated using AAD, with support for Multi Factor Authentication (MFA), in Azure SQL Database, Azure SQL Managed Instance and Azure Synapse Analytics. This removes the need for additional passwords to access SQL and reduces the likelihood of many users signing in using the admin credentials when this is not required. Integrating SQL access with AAD also means that if employees are removed from the organisation’s AAD, for example because they have left the company, their access to SQL is removed automatically, without needing to manually delete their SQL user or rotate the password of a shared login.

Authentication is also required between services and resources. Traditionally this is done using a username and password combination, but this approach is vulnerable to the credentials being leaked. A more secure method of authenticating between systems is to use Managed Identities. With Managed Identities in Azure, the identity of a resource or service is registered within AAD and AAD tokens are used to authenticate with other services, rather than a username and password.

  5. Network Security

An appropriately designed network security framework in Azure can protect a data platform from attack and unauthorised access, whilst still allowing functional communication in and out of the network. Features offered by Azure to create a secure and functional network topology include private endpoints to secure PaaS services, Azure Firewall and virtual networks with associated NSGs.

Virtual networks create groups of connected services which can be protected from unwanted inbound and outbound communication. Virtual networks are split into subnets, and these subnets can have associated NSGs, which are lists of rules that allow or deny traffic. When assigning NSGs, all communication should be blocked by default and exceptions should only be made for specific purposes, in order to minimise communication. Reducing the number of allowed protocols and ports on your virtual network reduces the routes an attacker could take into your network.
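The deny-by-default approach can be sketched as a small rule evaluator. NSG rules are processed in priority order, lowest number first, and the first matching rule decides the outcome. The model below is deliberately reduced to port and protocol, ignoring direction and source or destination addresses, and the rule names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int             # lower number is evaluated first
    port: Optional[int]       # None matches any port
    protocol: Optional[str]   # "Tcp", "Udp", or None for any protocol
    allow: bool


def is_traffic_allowed(rules: Iterable[Rule], port: int, protocol: str) -> bool:
    """Return the decision of the first matching rule, in priority order."""
    for rule in sorted(rules, key=lambda r: r.priority):
        port_matches = rule.port is None or rule.port == port
        protocol_matches = (
            rule.protocol is None or rule.protocol.lower() == protocol.lower()
        )
        if port_matches and protocol_matches:
            return rule.allow
    return False  # nothing matched: deny, in keeping with block-by-default


# Hypothetical rule set: one specific exception, then block everything else.
rules = [
    Rule("allow-sql", priority=100, port=1433, protocol="Tcp", allow=True),
    Rule("deny-all", priority=4096, port=None, protocol=None, allow=False),
]

print(is_traffic_allowed(rules, 1433, "Tcp"))  # matches allow-sql
print(is_traffic_allowed(rules, 3389, "Tcp"))  # falls through to deny-all
```

Placing the catch-all deny at the lowest priority (the highest number) means any exception added later with a smaller number is evaluated before it.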

Private endpoints within Azure are available for a range of PaaS services such as Azure Data Lake Storage Gen2 and Azure SQL Database. A private endpoint creates a Network Interface Card (NIC) which is associated with the virtual network. Public access to the service can then be disabled so that only traffic to and from the network is allowed, creating a private connection. The use of private endpoints can reduce the risk of data leaks because data stays within the private network and is not transferred publicly over the internet.

Azure Firewall can be used to inspect and analyse traffic coming into an Azure environment from outside, and traffic between spokes within an Azure hub and spoke model. The firewall can inspect and block unwanted traffic to your Azure environment. Azure Firewall is integrated with Azure Monitor, Azure’s logging and alerting offering, so metrics from the firewall can be inspected.

Many data platforms require connectivity to on-premises systems, and these connections can pose a security risk if not properly configured and secured. There are several methods for connecting cloud and on-premises systems within Azure. When choosing a method, the cost of the connection must be balanced against the security the connection requires. Azure ExpressRoute can provide a dedicated private connection between an Azure environment and an on-premises system, so no data transferred is exposed to the internet, which makes this a very secure option. However, ExpressRoute is a costly option. Another option is a Site to Site Virtual Private Network (VPN), which transfers encrypted data between the on-premises system and Azure over the internet. A Site to Site VPN is typically cheaper than ExpressRoute, but because data is transferred over the internet it is considered to be less secure.

  6. Threat Identification and Remediation

It can often be difficult to identify security threats to a data platform. You can collect logs and query them to identify suspicious or threatening activity, but it is often hard to filter out unnecessary logs and find the useful information. ‘Microsoft Defender for Cloud’ provides detailed recommendations on how resources can be configured to be more secure, and raises active cyber security alerts such as logins from unusual locations. Defender for Cloud also recommends remediation steps which should be completed to resolve potential security issues.

Most resources within Azure can export diagnostic logs and metrics to Azure Monitor or an Azure Log Analytics Workspace, which are central repositories where logs can be queried and alerts can be set based upon these logs. For example, security logs which detail login attempts can be exported from Virtual Machines, Azure Log Analytics Workspace can then be used to query failed login attempts, and an alert can be created to email nominated users when a failed login attempt has been made.
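The failed login example boils down to a filter, a count and a threshold. The sketch below shows that logic over an in-memory list of events. In a real deployment the equivalent query and alert rule would be defined in Log Analytics rather than in Python, and the event fields, the 15 minute window and the account names used here are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class LoginEvent:
    timestamp: datetime
    account: str
    succeeded: bool


def accounts_to_alert(
    events: Iterable[LoginEvent],
    now: datetime,
    window: timedelta = timedelta(minutes=15),
    threshold: int = 1,
) -> List[str]:
    """Return accounts with at least `threshold` failed logins inside the window."""
    cutoff = now - window
    failures = Counter(
        e.account
        for e in events
        if not e.succeeded and cutoff <= e.timestamp <= now
    )
    return sorted(account for account, count in failures.items() if count >= threshold)


# Made-up sample events.
now = datetime(2024, 1, 1, 12, 0)
events = [
    LoginEvent(datetime(2024, 1, 1, 11, 50), "alice", False),
    LoginEvent(datetime(2024, 1, 1, 11, 55), "alice", False),
    LoginEvent(datetime(2024, 1, 1, 11, 58), "alice", False),
    LoginEvent(datetime(2024, 1, 1, 11, 59), "bob", False),
    LoginEvent(datetime(2024, 1, 1, 11, 59), "bob", True),
    LoginEvent(datetime(2024, 1, 1, 10, 0), "carol", False),  # outside the window
]

print(accounts_to_alert(events, now))               # any failure triggers an alert
print(accounts_to_alert(events, now, threshold=3))  # only repeated failures
```

A threshold of 1 matches the example in the text, where any failed login raises an alert. Raising the threshold is one way to reduce noise from occasional mistyped passwords.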

  7. Disaster Recovery

In a worst case scenario, a cyber attack (or accidental deletion) could result in services and data being deleted or unrecoverable. In this scenario, a robust and timely disaster recovery plan which minimises the Recovery Point Objective (RPO), the maximum acceptable amount of data loss measured in time, is imperative. This can be achieved through a variety of methods. Storing templates for infrastructure deployment, and ARM templates for orchestration services such as Azure Data Factory, in Repos means a platform can be redeployed quickly and easily.

Many Azure PaaS storage services have features which simplify recovering lost data and are enabled by default. For example, Azure Synapse Analytics creates regular restore points on dedicated SQL pools, and a deleted or corrupted database can be redeployed from these restore points.
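The relationship between restore points and the RPO can be checked with simple date arithmetic: the data lost in an incident is the time between the last usable restore point and the moment of failure. The sketch below works through one case. The restore point times and the RPO values are made up for illustration and do not reflect the actual schedule of any Azure service.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def latest_restore_point(
    restore_points: Iterable[datetime], failure_time: datetime
) -> Optional[datetime]:
    """Return the most recent restore point taken at or before the failure."""
    candidates = [p for p in restore_points if p <= failure_time]
    return max(candidates) if candidates else None


def data_loss_window(
    restore_points: Iterable[datetime], failure_time: datetime
) -> Optional[timedelta]:
    """Time between the last usable restore point and the failure, if any."""
    point = latest_restore_point(restore_points, failure_time)
    return None if point is None else failure_time - point


def meets_rpo(
    restore_points: Iterable[datetime], failure_time: datetime, rpo: timedelta
) -> bool:
    """True when the data lost by restoring would not exceed the RPO."""
    loss = data_loss_window(restore_points, failure_time)
    return loss is not None and loss <= rpo


# Hypothetical restore points taken every 8 hours.
restore_points = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 8, 0),
    datetime(2024, 1, 1, 16, 0),
]
failure_time = datetime(2024, 1, 1, 13, 30)

print(data_loss_window(restore_points, failure_time))
print(meets_rpo(restore_points, failure_time, rpo=timedelta(hours=8)))
print(meets_rpo(restore_points, failure_time, rpo=timedelta(hours=4)))
```

Running this kind of check against the agreed RPO during the design phase shows whether the default restore point frequency is sufficient or whether additional user-defined restore points are needed.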

Azure provides geo-replication features to protect resources and data against a potential disaster at one of its data centres or regions. If enabled on a resource, geo-replication can create a replica of the resource within a different availability zone or region, so if the hardware containing the primary resource is damaged the resource is still available through the replica. This feature is available on many resources in Azure, including Azure SQL Database and Azure Storage Accounts.

Abigail is a Senior Data Engineer at Oakland