Dynamic Row Level Filtering and Column Level Masking in Azure Databricks
Background
In modern data platforms and analytics architectures, data security is an essential topic. Protecting data is paramount, and the primary objective in any data architecture is to grant access to data only to those individuals who possess the appropriate authorization, while superusers retain access to all data. To achieve this, we implement row-level filtering based on a user’s permissions on the tables. If we also need to hide sensitive columns from certain users, we implement column-level masking.
In the past, Azure Databricks relied on Dynamic Views and Hive functions for row-level filtering and column masking. You can find more details on this older method in a blog post I authored a couple of years ago. Although it proved effective, it involved additional steps: the row-level filtering and column masking logic lived inside the Dynamic View, which created an extra object and required extra maintenance effort. Furthermore, it provided no security control at the base table level, which was a notable limitation.
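For context, the older pattern looked roughly like this. This is a minimal sketch, not the exact view from that earlier post; the view name and the hardcoded filter are illustrative only:

%sql
-- Minimal sketch of the older dynamic-view pattern (names here are hypothetical).
-- Both the row filter and the column mask live inside the view, not on the base table.
CREATE OR REPLACE VIEW mytestdb.vw_flight_data AS
SELECT
  origin,
  destination,
  -- Masking via CASE + is_member(): admins see the real value, everyone else a redacted string.
  CASE WHEN is_member('admin') THEN delay ELSE '****Redacted****' END AS delay
FROM mytestdb.testPermissionFlightData
-- Row filtering via a predicate on current_user(): admins see everything,
-- a named user sees only a hardcoded airport.
WHERE is_member('admin')
   OR (current_user() = 'testuser1@samarendrapandaoutlook.onmicrosoft.com' AND origin = 'ATL');

Every consumer then had to query the view rather than the table, and every rule change meant editing the view definition.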
Azure Databricks has recently released features to perform row-level filtering and column-level masking directly on the base table. This blog post discusses these features in depth.
Scenario
We have flight details data showing the origin and destination airports for each flight, along with a few more fields related to delay, date, and distance. We have a group of users with the following access requirements.
· Admin: Admin privileges grant unrestricted access to all data.
· User1: User1 is restricted to viewing data associated with the “ATL” and “JNU” airports, without access to the delay data.
· User3: User3 is limited to data pertaining to the “ABE” location, also excluding the delay data.
It’s important to note that user roles may undergo changes in the future, and our goal is to minimize the effort required for role modifications.
Prerequisites
- We need an Azure Databricks workspace with the Unity Catalog feature enabled.
- We need a cluster other than a single-user or single-node cluster.
- We need Databricks Runtime 12.2 LTS or above.
Scenario Execution
Prepare the Base Table
We are using the sample flight data from databricks-datasets here. We will refer to this as the base table.
# Read the sample flight delay data and register it as a Delta table (our base table).
df = spark.read.csv("/databricks-datasets/flights/departuredelays.csv", header=True)
df.write.format("delta").mode("overwrite").saveAsTable("myfirstcatalog.mytestdb.testPermissionFlightData")
Create a Mapping Table
The mapping table consists of two fields, namely “username” and “airports.” Each user listed in this table should have access only to the airports specified for them.
%sql
DROP TABLE IF EXISTS mytestdb.airport_to_user_mapping;

CREATE TABLE IF NOT EXISTS mytestdb.airport_to_user_mapping (username STRING, airports ARRAY<STRING>) USING delta;

INSERT OVERWRITE mytestdb.airport_to_user_mapping (username, airports)
VALUES
  ('testuser1@samarendrapandaoutlook.onmicrosoft.com', array('JNU', 'ATL')),
  ('testuser3@samarendrapandaoutlook.onmicrosoft.com', array('ABE'));

SELECT * FROM mytestdb.airport_to_user_mapping;
Row Level Filtering
Create the SQL Function
Here we create the filter function that will be associated with the base table. Our mapping table acts as a lookup table: at Spark runtime, the base table is joined with the lookup table, and results are returned based on the conditions defined in the function. We will see this in the query execution plan later in this blog post.
We have already created an account-level group named “admin” in the Azure Databricks account console, and the admin user has been added to that group.
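Before wiring the group check into the function, we can verify it directly from a notebook:

%sql
-- Returns true when the current user belongs to the account-level "admin" group.
SELECT IS_ACCOUNT_GROUP_MEMBER('admin');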
%sql
CREATE OR REPLACE FUNCTION mytestdb.airport_row_filter(airport_origin_param STRING, airport_destination_param STRING)
RETURN
  IS_ACCOUNT_GROUP_MEMBER('admin') OR
  EXISTS (
    SELECT 1 FROM mytestdb.airport_to_user_mapping
    WHERE username = CURRENT_USER()
      AND (array_contains(airports, airport_origin_param) OR
           array_contains(airports, airport_destination_param))
  );
Within this function, we return true if the user is a member of the admin group; otherwise, we filter the base table’s “origin” and “destination” columns against the values in the “airports” column of the mapping table.
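Because this is an ordinary SQL UDF, we can sanity-check it before attaching it to the table; the airport codes below are just sample inputs:

%sql
-- As an admin this returns true for any inputs; as testuser1 it returns true
-- only when one of the arguments is an airport mapped to that user.
SELECT mytestdb.airport_row_filter('ATL', 'SFO') AS row_visible;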
Attach the SQL Function to the Base Table
%sql
ALTER TABLE mytestdb.testPermissionFlightData
SET ROW FILTER mytestdb.airport_row_filter ON (origin, destination);
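If the filter ever needs to be detached (for example, while debugging), the table can be altered back; a minimal sketch:

%sql
-- Detaches the row filter from the base table; the function itself is unaffected.
ALTER TABLE mytestdb.testPermissionFlightData DROP ROW FILTER;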
Provide Permission to User1 to Read the Table
%sql
GRANT SELECT ON mytestdb.testPermissionFlightData
TO `testuser1@samarendrapandaoutlook.onmicrosoft.com`
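Depending on how the catalog is secured, the user may also need USE CATALOG and USE SCHEMA privileges to reach the table; a sketch, assuming none have been granted yet:

%sql
GRANT USE CATALOG ON CATALOG myfirstcatalog
TO `testuser1@samarendrapandaoutlook.onmicrosoft.com`;
GRANT USE SCHEMA ON SCHEMA myfirstcatalog.mytestdb
TO `testuser1@samarendrapandaoutlook.onmicrosoft.com`;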
Column Level Masking
We are going to create one more function, which will mask the “delay” column of the base table as required by the scenario.
Create the SQL Function
%sql
CREATE OR REPLACE FUNCTION mytestdb.column_delay_mask(delay STRING)
RETURN IF(IS_ACCOUNT_GROUP_MEMBER('admin'), delay, '****Redacted****');
In this function, we return the value of the “delay” column if the user is from the admin group; otherwise, we mask the data.
Link the SQL Function to the Base Table
%sql
ALTER TABLE mytestdb.testPermissionFlightData
ALTER COLUMN delay SET MASK mytestdb.column_delay_mask;
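As with the row filter, the mask can be detached later without dropping the function:

%sql
-- Removes the mask from the column; the masking function itself remains.
ALTER TABLE mytestdb.testPermissionFlightData
ALTER COLUMN delay DROP MASK;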
Testing
Admin account:
When the query is executed by the admin account, we can see records for all airport combinations, and the “delay” column is visible.
TestUser1 Account
When we execute the query from the TestUser1 account, we see only the data for the airports mapped to that user in the mapping table. The delay column is also masked.
Let’s change the role of the user.
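Because the user-to-airport assignments live in the mapping table, a role change is just a data update, with no DDL or function changes. A sketch, assuming we now want TestUser1 to see “ABE” in addition to “ATL” (the airport list here is illustrative):

%sql
-- Remap TestUser1; the row filter picks up the new values on the next query.
UPDATE mytestdb.airport_to_user_mapping
SET airports = array('ABE', 'ATL')
WHERE username = 'testuser1@samarendrapandaoutlook.onmicrosoft.com';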
We can see that TestUser1 can now view the data for the newly mapped airports.
Execution Plan
From the execution plan of the query, we can see that the base table was left-joined with the mapping table at the start, and the row filtering reduced 1.39M rows to 54.77K rows.
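You can reproduce this inspection with EXPLAIN (the exact plan text varies by runtime):

%sql
-- The plan shows the join against the mapping table that the row filter injects.
EXPLAIN FORMATTED
SELECT * FROM mytestdb.testPermissionFlightData;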
Conclusion
There are various tools and techniques that can achieve the above use cases; however, the ability to perform row-level filtering and column masking directly on the base table, and to do so dynamically, is very useful for any complex data security architecture. I hope the insights provided here prove helpful in addressing your data security challenges. If you come across alternative techniques or have further questions, please feel free to share your findings and insights.
Hope this helps!