Semantically Secure

Semantic layers should be a 'must have' for Infosec teams

May 02, 2024

brown wooden door with brass door knob — Photo by Brock Wegner on Unsplash

I’ve previously written briefly about this being an advantage for semantic layers in my semantic superiority series, but I’ve been thinking more and more on this topic recently. I really feel like it needs shouting from the rooftops, so I’ll do some of that here.

TL;DR: semantic layers make it much easier to secure access to data, and to secure it well.

Why do I say this?

Let's imagine that you didn't have a technological semantic layer. There is always a semantic layer - if it's not technological then it's human. Without one, you don't know what your data means.

If someone from Infosec were evaluating which security policies should apply to a data store, the first thing they would ask is: "What does it contain?" This is really a semantic question. They don't care that it contains X tables, Y columns or Z parquet files. They care about whether the data is about customers, which brings GDPR into play. Does it contain PII attributes of those customers? Does it contain any sensitive healthcare data, which brings in HIPAA? Does it contain any sensitive financial data? Does it contain any PCI payments data? Can the data be used to calculate company performance metrics?

All these questions are actually semantic questions. They are about the entities with attributes and measures in the world of the business the Infosec person operates in. The raw structure of the data isn't particularly interesting. Semantics have to be applied to the raw data to make it possible to be evaluated from an Infosec POV.

Where it isn't possible to do this, meaning that the data is nebulous in format and unknown in meaning, Infosec will rightly want to deny access to the data. They don't know what is in it, so it could have any of the types of sensitive data they want to secure. This data is in effect useless - it can't be given to anything else to be consumed safely. It's actually a liability and it could be argued that it shouldn't be stored at all.

Where the meaning of the data is known and the format of fields is understood, it is possible to abstract the entities with attributes and measures into a semantic layer. Where you know what the entity is, it becomes simple for the Infosec team to make policy. The customer entity should only be accessible by Customer Services. The revenue measure of the orders entity should only be accessible by Finance. The email attribute of the customer entity is classified as PII explicitly in the semantic layer. This attribute is only accessible by the role used by the service that contacts the customer.
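As a rough illustration of how such policies might be written down against entities rather than tables, here is a minimal sketch in Python. Everything in it is hypothetical: the `Entity` and `Field` classes, the `can_access` helper and the role names are invented for this example and are not the API of any particular semantic layer product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Field:
    """An attribute or measure on an entity (hypothetical model)."""
    name: str
    kind: str                               # "attribute" or "measure"
    pii: bool = False                       # explicit PII classification
    allowed_roles: frozenset = frozenset()  # empty means: inherit from the entity


@dataclass
class Entity:
    name: str
    allowed_roles: frozenset
    fields: dict = field(default_factory=dict)

    def add(self, f: Field) -> "Entity":
        self.fields[f.name] = f
        return self


def can_access(entity: Entity, field_name: str, role: str) -> bool:
    """Field-level roles override the entity-level roles when present."""
    f = entity.fields[field_name]
    roles = f.allowed_roles or entity.allowed_roles
    return role in roles


# The policies from the paragraph above, with made-up role names.
customer = (
    Entity("customer", frozenset({"customer_services"}))
    .add(Field("name", "attribute"))
    .add(Field("email", "attribute", pii=True,
               allowed_roles=frozenset({"contact_service"})))
)
orders = (
    Entity("orders", frozenset({"customer_services", "finance"}))
    .add(Field("order_count", "measure"))
    .add(Field("revenue", "measure", allowed_roles=frozenset({"finance"})))
)

print(can_access(customer, "name", "customer_services"))   # entity-level rule
print(can_access(customer, "email", "customer_services"))  # blocked: PII override
print(can_access(orders, "revenue", "finance"))            # measure-level rule
```

The point of the sketch is that each policy is one readable line about a business concept, which is what makes it reviewable by someone who has never seen the underlying tables.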

Having a technological semantic layer in place makes Infosec's job orders of magnitude easier. They can see at a glance what data is available to consumers, what it relates to and what it means. They can be less risk averse and give access more freely. They can easily create roles for appropriate levels of access to entities, down to individual attributes and measures. Deciding which people or services should have these roles is then also simple. Access is specific: rather than granting access to a data store holding a range of types of data and accepting a level of risk of inappropriate access, Infosec grants access to named entities, attributes and measures, and that risk is mostly eliminated.

Infosec can also be informed of new entities, attributes and measures defined in the semantic layer, so they can proactively make policy. They can then apply this policy to roles and create new ones if needed. They can adjust access to roles at the same time as changes to the semantic layer are made.
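One way this notification could work, assuming the semantic layer's definitions live in version control, is a simple diff between the model before and after a change. The sketch below represents a model as a plain dictionary of entity name to field names; that shape and the `new_definitions` function are assumptions made for illustration only.

```python
def new_definitions(before: dict, after: dict) -> list:
    """Return (entity, field) pairs that exist in `after` but not in `before`."""
    old = {(entity, f) for entity, fields in before.items() for f in fields}
    new = {(entity, f) for entity, fields in after.items() for f in fields}
    return sorted(new - old)


# Hypothetical model versions: main branch versus a proposed change.
main_model = {"customer": ["name", "email"]}
proposed_model = {
    "customer": ["name", "email", "phone"],
    "orders": ["revenue"],
}

for entity, field_name in new_definitions(main_model, proposed_model):
    print(f"Needs an Infosec policy decision: {entity}.{field_name}")
```

A check like this could run on every proposed change, so that policy is decided before anything new becomes available to consumers.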

The process of defining the semantic layer asserts what data is and what it means. These assertions can be challenged and tested by Infosec before changes to the semantic layer are merged and access is given. You say this URL field is not PII and can be made widely accessible... first, I'll check whether it ever contains email address parameters.

In fact, semantic layers can even abstract semi-structured data like URL/JSON format. Rather than giving access to the whole structure in one field, which is what often happens in data stores, individual fields can be parsed out of the structure, on the fly, to be exposed as attributes or measures in the semantic layer. This is much less risky from an Infosec point of view, as it's specific and explicit. It's easier to test than testing a whole structure (which may not be safe to expose in its entirety).
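To make this concrete, here is a small sketch using only the Python standard library. It exposes a single query parameter from a URL as an attribute, and it includes the kind of test Infosec might run to check whether the raw field ever carries an email address. The function names, the `utm_source` parameter and the sample URL are all made up for the example, and the email pattern is deliberately crude.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Crude pattern: something@something.something, good enough for a smoke test.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def utm_source(url: str):
    """Expose one named query parameter instead of the whole URL."""
    values = parse_qs(urlsplit(url).query).get("utm_source")
    return values[0] if values else None


def contains_email(url: str) -> bool:
    """Check every decoded query value for something that looks like an email."""
    params = parse_qs(urlsplit(url).query)
    return any(EMAIL_RE.search(v) for vals in params.values() for v in vals)


# Hypothetical raw value as it might sit in a data store.
raw_url = "https://shop.example/landing?utm_source=newsletter&email=jane%40example.com"

print(utm_source(raw_url))      # the safe, specific attribute
print(contains_email(raw_url))  # the raw field is not safe to expose whole
```

Here the raw URL fails the check while the parsed attribute carries no email at all, which is the argument for exposing the attribute and not the field.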

With universal semantic layers, not only is data secured more easily by Infosec, but access is also given more efficiently and proactively. If access is easy for users, then they are less likely to use workarounds, like using someone else's login to get access to data they need. This, in turn, reduces the risk Infosec has to manage.

In the era of universal semantic layers, which can govern and provide access to all of a company's data stores, the idea of giving direct access to any single data store for consumption seems foolish. There will certainly be a need for direct access to data stores for transformation tools like dbt and for other data pipeline jobs. However, for end consumption of data, whether by people or services, access via the semantic layer should be what Infosec demand and command.

This concept isn't really so alien, if we take a look at where software engineering has already trodden. The idea of giving a consumer access to a production database is considered outright stupid today. They are supposed to be given access to an API, or a UI which utilises this API, to access the data in the production database. This API can then enforce role-based access control and restrict what the user can access to what is appropriate... I hope this is beginning to sound very familiar.

The difference between the API as the access point in software engineering, and the semantic layer as the access point for analytical data, is largely the nature of the data available. APIs in SWE are often used with ORMs, which are the equivalent of semantic layers in an OLTP context. This kind of access doesn't allow for aggregation of data at the point of access - any aggregated data would have to be pre-aggregated and stored in the OLTP database.

Universal semantic layers today allow for a data model to be defined with entities and their attributes and measures. At the point of access, it is possible to request a measure split by many permutations of entity attributes. The grain of the data returned is dynamic, depending on the parameters in the request. This is the difference between semantic layers and ORMs. A semantic layer can allow for a great range of access from a single API. Many transformations can happen to data to enrich it before it is exposed via a semantic layer's APIs. Therefore, it is likely that a much greater range of information is available from a semantic layer than from a counterpart ORM in a production system.
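A toy sketch of what a dynamic grain means in practice: the same measure is requested three times, and only the list of attributes in the request changes. The `query` function and the sample rows are invented for illustration; a real semantic layer would compile such a request into a query against the underlying data store instead of looping over rows in memory.

```python
from collections import defaultdict

# Hypothetical rows of an "orders" entity.
ORDERS = [
    {"country": "UK", "channel": "web", "revenue": 100.0},
    {"country": "UK", "channel": "app", "revenue": 50.0},
    {"country": "DE", "channel": "web", "revenue": 70.0},
    {"country": "UK", "channel": "web", "revenue": 30.0},
]


def query(rows, measure, dimensions=()):
    """Sum `measure`, grouped by whichever attributes the caller asks for."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


print(query(ORDERS, "revenue"))                          # one total
print(query(ORDERS, "revenue", ("country",)))            # split by country
print(query(ORDERS, "revenue", ("country", "channel")))  # finer grain
```

An ORM-backed API would typically need a separate endpoint or a pre-aggregated table for each of these three shapes, whereas here the grain is just a request parameter.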

It is common in SWE to have many services or micro-services which provide access to different kinds of OLTP data. In data, you can use one universal semantic layer to provide access to a whole company's data. This is then only one point of access for an Infosec team to govern. If Infosec teams expect that SWE teams provide access to production databases with APIs and underlying ORMs, they should expect all analytical data access to be from the APIs of universal semantic layers.

Infosec teams should be demanding that universal semantic layers are in place in their companies. They should be banging the table for them at infrastructure and architecture committee meetings. If someone wants access to analytical (OLAP) data, the answer should be... "via the semantic layer" with a forceful, commanding "...please".

davidj.substack
