Microsoft Purview
Data Governance for cloud, on-premises, multi-cloud and Office 365 workloads.
Introduction
Most organisations are overflowing with data that has been collected, transformed, and reported on, but this data is often not well tracked. As organisations become more data-driven, this worsens two pain points that have been growing for the last few decades:
- How can we audit all this data to protect against data leaks and unexpected data loss?
- How can data users discover data in an environment that changes constantly?
Data Governance Products help mitigate these problems, among others, but are often complex due to requiring:
- The ability to scan a large variety of data sources
- A highly customised user interface
- A powerful search engine to find data assets by many different types of metadata attributes
These are just some of the main requirements that create a software marketplace full of products that are often expensive and hard to implement and maintain.
These products also need to ingest large amounts of sensitive organisational data to meet user requirements, ironically creating a Data Governance concern in itself!
Purview aims to ease the pain of Data Governance by being feature-rich and easy to deploy, maintain and secure. But is it worth the cost, and can it compete with dedicated Data Governance companies that have a head start measured in years or even decades?
Purview’s Features
- Its connectors are very Microsoft-focused but cover most of its ecosystem: Azure, SQL Server, Power BI, and Office 365. If you’ve already bought heavily into Microsoft, you can scan most or all of your data assets automatically.
- It focuses less on connectors for products made by other companies but still covers many popular data products like SAP, Salesforce, Oracle, GCP BigQuery, AWS S3, and Snowflake.
- It offers a lot of flexibility in managing Data Catalog users with 9 different roles to choose from and syncs up to your Azure Active Directory groups and users.
- Can classify data with 200+ pre-built classifications, as well as custom classifications.
- Business Glossary with an extensive text editor. Ability to add contacts for roles like Data Owner and Steward to each data asset.
- Pre-built reports to quickly check insights such as what percentage of data has a Data Owner and the percentage of new data assets in the last month.
- Offers an API and Python SDK for building custom data sources where connectors don’t exist, or for mass-updating existing scanned data assets.
- It doesn’t offer much insight into the Data Quality of scanned data assets, which can be found in other Data Governance products. However, you could theoretically push Data Quality metrics to Purview via its API.
- Data Sharing allows users to give other users read-only Data Lake data access without having to copy data.
- Can integrate Data Governance with Master Data Management using Profisee.
- Purview is relatively new, having only been available to customers for a few years, but it is receiving heavy investment from Microsoft, with new features appearing monthly.
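As a rough illustration of the API points above, the sketch below builds the kind of JSON payload you might send to register a custom data asset and attach a Data Quality metric to it. It assumes Purview's catalogue API accepts Apache Atlas-style entities (a `typeName` plus `attributes` containing `qualifiedName` and `name`). The type name `custom_dq_table`, the attribute `dq_completeness_pct`, and the `custom://` qualified name are all hypothetical and would need a matching custom type definition in your Purview account. No network call is made; sending the payload, and the exact endpoint and authentication to use, are left out deliberately.

```python
import json


def build_entity_payload(type_name, qualified_name, name, extra_attributes=None):
    """Build an Apache Atlas-style entity payload as a plain dict."""
    attributes = {"qualifiedName": qualified_name, "name": name}
    # Extra attributes only work if the custom type defines them.
    attributes.update(extra_attributes or {})
    return {"entity": {"typeName": type_name, "attributes": attributes}}


payload = build_entity_payload(
    type_name="custom_dq_table",                     # hypothetical custom type
    qualified_name="custom://sales_db/orders",       # hypothetical unique identifier
    name="orders",
    extra_attributes={"dq_completeness_pct": 98.5},  # hypothetical Data Quality metric
)

# This JSON body is what would be posted to the catalogue's entity endpoint.
print(json.dumps(payload, indent=2))
```

In practice the Python SDK or a community wrapper would handle authentication and the HTTP request; the point here is only that a custom asset and its metrics reduce to a small, structured document.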
Deployment
- As someone who has designed and built many data platforms, I highly value any product that can be deployed quickly, has low maintenance, and will meet strong client IT & security requirements. I believe Purview is stronger than most Data Governance products in this area.
- It is as easy to deploy and maintain in Azure as any SaaS Data Governance product but also offers a choice of 20-plus regions to deploy into, including the UK.
- Purview can also keep all traffic in and out of its server on its private network using Private Endpoints, never touching the public internet, offering an extra layer of data security when creating a Data Catalogue.
- Can scan Azure data products via Managed Identity authentication, offering high-security data connections without worrying about managing passwords.
- It can connect directly to on-premises and other cloud data assets to scan them, though this requires some technical knowledge.
Costs
Automated Data Governance tooling is often not cheap, with costs starting in the thousands of pounds for most products. Purview is no exception: it has a base price of £250 per month and will cost more if scanning large workloads. However, extra capacity is charged on a pay-as-you-go model, so you only pay more when scanning lots of data.
There are also additional costs depending on which features are used.
Due to the pricing being highly variable in Purview, many organisations will build a Proof of Concept to road-test Purview for a month or so to accurately measure costs.
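The pricing shape described above (a fixed base price plus pay-as-you-go usage) can be sketched as a small calculation for comparing months during a Proof of Concept. The £250 base price comes from this article; the `usage_units` and `unit_rate` values are made-up placeholders, not real Purview prices, and should be replaced with figures from your own bill.

```python
def estimate_monthly_cost(base_price, usage_units, unit_rate):
    """Fixed base price plus pay-as-you-go usage, all in the same currency."""
    if usage_units < 0 or unit_rate < 0:
        raise ValueError("usage_units and unit_rate must be non-negative")
    return base_price + usage_units * unit_rate


BASE_PRICE_GBP = 250  # base price per month quoted in this article

# Placeholder usage figures: a month with no extra scanning vs. a busy month.
quiet_month = estimate_monthly_cost(BASE_PRICE_GBP, usage_units=0, unit_rate=0.5)
busy_month = estimate_monthly_cost(BASE_PRICE_GBP, usage_units=40, unit_rate=0.5)

print(f"Quiet month: £{quiet_month:.2f}, busy month: £{busy_month:.2f}")
```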
Alternatives
Note that this isn’t a comprehensive list, and this is a quickly evolving space with exciting new start-ups entering all the time.
- Build your own:
  - Excel: low cost, low maintenance if data structures don’t update regularly, and doesn’t require any specialist skills to build. While we suspect this is the most common type of data catalogue used, we feel nervous about building a data catalogue in a tool infamous for poor Data Governance. It does not scale and requires lots of manual effort to work out data lineage and classify data sensitivity.
  - Automate your own solution by extracting the schemas of databases and files. This is a quick way of generating a data catalogue with low maintenance and little extra cost. You can also build a dashboard on top of it in the Business Intelligence (BI) platform of your choice. It requires minimal effort if the number of data assets is small, though adding features like data lineage and data sensitivity classification will require a reasonable amount of engineering effort.
- Databricks Unity Catalog: ideal for Databricks-heavy data platforms, as it is a free extra, though it will only scan what Databricks can scan. It can integrate with other Data Governance products and updates schemas in real time as they change, which you don’t see much in other Data Governance products.
- Other clouds’ Data Governance solutions like AWS Glue Data Catalog and GCP Dataplex. Both are arguably less feature-rich than Purview, especially for non-technical users, though they are easier to implement if most of your data assets are in their respective clouds. Both are also used to feed metadata into other, larger Data Governance products.
- Mature products like Informatica and Talend. These tend to charge by the user and are more commonly found on-premises (though they can be configured and maintained in the cloud on Virtual Machines). They will likely cost the most, sometimes significantly so, but they are well-trusted and reliable. They add the most value if you buy into the rest of their data platform ecosystems.
- New products like Atlan and Immuta often focus on providing Data Governance for more recent data products like Databricks and Snowflake, and on making cloud deployments more accessible by offering deployment via Docker or Kubernetes. Immuta also provides a single pane of glass for fine-grained data access across many popular data products, with access controls at column and row level.
- Open Source software like DataHub and Amundsen, both built by large tech companies (LinkedIn and Lyft respectively). These are the go-to solutions if your organisation has the technical capacity to build and maintain complex workflows. They offer a lot of customisation and have no licence costs, so they can be much cheaper at scale and more closely tailored to an organisation’s Data Governance needs.
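To give a feel for the "automate your own solution" option above, here is a minimal sketch that extracts table and column metadata from a database into a flat list of rows, which could then be loaded into a spreadsheet or BI dashboard. It uses an in-memory SQLite database with two made-up tables purely so the example is self-contained; a real implementation would query your own databases' system catalogues instead.

```python
import sqlite3


def extract_catalogue(conn):
    """Return one dict per column: table, column, declared type, nullability, key flag."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    catalogue = []
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        for _cid, column, col_type, notnull, _default, pk in conn.execute(
                f'PRAGMA table_info("{table}")'):
            catalogue.append({
                "table": table,
                "column": column,
                "type": col_type,
                "nullable": not notnull,
                "primary_key": bool(pk),
            })
    return catalogue


# Demo database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

catalogue = extract_catalogue(conn)
for entry in catalogue:
    print(entry)
```

Scheduling a script like this gives a catalogue that stays in step with the schemas, but lineage and sensitivity classification would still need to be built separately.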
Summary
In an increasingly challenging Data Governance market, Microsoft Purview is a serious option to consider despite missing some features compared to more dedicated Data Governance companies.
However, Purview may not be the right choice if spending hundreds of pounds per month or more on Data Governance is not a good return on investment for you, or if you want a solution that easily integrates Data Governance and Data Quality.
If you are looking for a Data Governance product that is easy to deploy and secure, can catalogue and classify data assets, and provides some customisation through its APIs and user interface at a competitive cost, then we think Purview is a good contender.
Jake Watson is a Senior Data Engineer at The Oakland Group