Data Vault: Learn Business Vault Secrets


Hello Data Vaulters! Welcome to another in our series of Data Vault video presentations. This time we want to look at the Business Vault. The Business Vault is an extremely powerful feature of the Data Vault method, but it isn’t given much publicity. We’ve seen many confused teams who’ve lost time and had to unpick mistakes – a pity, because with a little guidance those mistakes could have been avoided. Before we get started, who are we? We’re Datavault UK – the UK’s leading Data Vault consultancy. We run the UK Data Vault User Group, where those who are interested in Data Vault can meet to learn and share their experiences. And for data engineers, we’ve written and support a package you can use with the free dbt tool to build your own Data Vault solution.

This diagram shows the various layers mentioned in the Data Vault method. A commonly mentioned component is the Business Vault: the place where we store the results of business rule calculations and other types of derived data.
Dan Linstedt defines the Business Vault for us, and I think the key points are these: it contains wholly derivable data, and it’s calculated after the Raw Data Vault is loaded and before the Information Marts are populated.

So let’s start by busting the first misunderstanding. The Business Vault isn’t really a separate layer, and it’s not a separate schema: it’s held inside the Raw Vault. Yes, the Business Vault has new tables, but they’re overlaid on the Raw Vault structure. Marts then feed from both the Raw Data Vault and the Business Vault tables.

What types of data can the Business Vault hold? We might need to pre-calculate data for efficiency, so it can be consumed by a downstream dashboard. We might need helper tables that drive better performance. We might calculate new values, such as ratios. We might be using the results of data science in our business. And finally, we might be checking data quality and want to calculate quality measures.
Let’s look at the first item on the list. Some processing might be needed to get data out of the Vault: your users might not want the full granularity of data held in the Raw Vault, preferring aggregated, filtered or masked data in their reporting Marts. The point here is that the data isn’t transformed – it’s simply added up, grouped, masked or otherwise processed before it’s consumed. There are three possible places to build these consumption rules: in the Raw Vault, creating Business Vault tables; on the way out to the Mart; or in the end-user BI tool’s cache. If processing is intensive and the results are used by more than one Mart, you’re best running the rule inside the Vault to populate Business Vault tables, as in the sketch below. If the processed data is used in one Mart only, you can run those rules as you populate the Mart. And finally, you might consider implementing local processing rules in the end-user BI tool if the data is only used in one display.
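Here’s a minimal sketch of the first option – an aggregate pre-calculated inside the Vault as a Business Vault table. The table and column names (sat_order_details, order_value and so on) are illustrative assumptions, not part of any real model:

```sql
-- Hypothetical Business Vault aggregate that several Marts can share.
-- All table and column names are assumed for illustration.
CREATE TABLE bv_daily_order_summary AS
SELECT CAST(s.load_date AS DATE) AS order_date,
       COUNT(*)                  AS order_count,
       SUM(s.order_value)        AS total_order_value
FROM sat_order_details s
GROUP BY CAST(s.load_date AS DATE);
```

Because the result lives inside the Vault, every downstream Mart and dashboard can share it instead of re-aggregating the raw Satellite.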
Things to consider when deciding where to code these presentation rules are: security – it’s better to keep data in one place in the Vault and only let out what’s needed; performance and cost – the Vault is generally cheaper and faster; what your BI tool can and can’t do – it may have limits on data volumes or on the number of feeds per day; what your users actually need – aggregated data is often enough for them; who needs access to what; and what overlaps there are between groups of end-users.

Next on the list is helper tables.
These are the Data Vault PIT and Bridge tables – really powerful techniques that support virtual reporting Marts and their dimension and fact tables. We could spend a while on these, but we don’t have time, so we’ll just introduce them here; you can check Dan’s book for more information, and I’m sure we’ll cover them in a future video.

So what does a PIT, or point-in-time, table do?
Say we have a Satellite holding customer details from our CRM system. This table has an effective-from date; there are no start or end dates here. So to find which Satellite rows are valid for a given date, we have to find, for each customer primary key, the record with the greatest effective-from date on or before the reference date. This can be quite an expensive calculation, although SQL window functions help to reduce that load.
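As an illustration, such a query might use ROW_NUMBER() like this – the Satellite and column names, and the reference date, are all assumptions:

```sql
-- Minimal sketch: find the satellite row in effect on a reference date.
SELECT customer_hk, customer_name, effective_from
FROM (
    SELECT s.customer_hk, s.customer_name, s.effective_from,
           ROW_NUMBER() OVER (
               PARTITION BY s.customer_hk
               ORDER BY s.effective_from DESC
           ) AS rn
    FROM sat_customer_details s
    WHERE s.effective_from <= DATE '2024-01-31'  -- the reference date
) ranked
WHERE rn = 1;
```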
One Satellite table may be OK for that sort of processing. But what if you’ve got two, three or more Satellites off the same Hub? Perhaps one Satellite is fed throughout the day, another gets a weekly update, and others are fed daily. Which values are valid for a given date? That’s quite a calculation, and we don’t want to repeat it every time we query the data. So a PIT table holds a pointer to the record in each of the Satellites that’s valid for each customer on each day. For example – and this is common in many businesses – our users might want to process end-of-day data only. The PIT table is then calculated after loading the Raw Vault, as a Business Vault process: we find the end-of-day record for each customer in each Satellite and insert references to them into the PIT table.
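A sketch of that daily PIT build might look like the following, where pit_customer, sat_customer_details and sat_customer_address are assumed names:

```sql
-- Hypothetical daily PIT build: for each customer, record the effective-from
-- date of the row in force in each satellite as at the snapshot date.
INSERT INTO pit_customer (snapshot_date, customer_hk,
                          sat_details_efd, sat_address_efd)
SELECT CURRENT_DATE AS snapshot_date,
       h.customer_hk,
       (SELECT MAX(s1.effective_from)
          FROM sat_customer_details s1
         WHERE s1.customer_hk = h.customer_hk
           AND s1.effective_from <= CURRENT_DATE) AS sat_details_efd,
       (SELECT MAX(s2.effective_from)
          FROM sat_customer_address s2
         WHERE s2.customer_hk = h.customer_hk
           AND s2.effective_from <= CURRENT_DATE) AS sat_address_efd
FROM hub_customer h;
```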
When we want to find the relevant Satellite records, we do an equi-join with the PIT table, which is much faster than hunting for the greatest end date or greatest effective-from date. The PIT table is quite long but thin, and we can trim it to hold only a few months of reference data.
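Using the hypothetical pit_customer table from the sketch above, the equi-join looks like this (customer_name and postcode are again assumed columns):

```sql
-- Equi-join through the PIT table: no MAX() or window function needed.
SELECT p.snapshot_date,
       d.customer_name,
       a.postcode
FROM pit_customer p
JOIN sat_customer_details d
  ON d.customer_hk    = p.customer_hk
 AND d.effective_from = p.sat_details_efd
JOIN sat_customer_address a
  ON a.customer_hk    = p.customer_hk
 AND a.effective_from = p.sat_address_efd
WHERE p.snapshot_date = DATE '2024-01-31';
```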
Now we can look at Bridge tables. It’s common to have queries that navigate Hubs and Links to fetch data from Satellites across the data model. Here we have an example with three Hubs joined by two Links, but a real model could join many, many more, and possibly radiate out across multiple chains as well. The SQL to query that is quite repetitive, and you can make mistakes writing it if you’re not concentrating. Different grains can also cause the query to grow and run into performance problems. Bridge tables pre-calculate that navigation for each reference date, just as PIT tables do for Satellites. They’re also great tools for delivering aggregates, since they can store summed values as extra attributes on the table.
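As a rough sketch of that three-Hub, two-Link example – every table name below is an assumption – a Bridge build might look like this:

```sql
-- Hypothetical Bridge spanning three Hubs via two Links.
CREATE TABLE bridge_customer_order_product AS
SELECT CURRENT_DATE AS snapshot_date,
       hc.customer_hk,
       ho.order_hk,
       hp.product_hk
FROM hub_customer        hc
JOIN link_customer_order lco ON lco.customer_hk = hc.customer_hk
JOIN hub_order           ho  ON ho.order_hk     = lco.order_hk
JOIN link_order_product  lop ON lop.order_hk    = ho.order_hk
JOIN hub_product         hp  ON hp.product_hk   = lop.product_hk;
```

A summed measure from an order Satellite could be added as an extra column here, so the Bridge serves aggregates directly.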
If you take PIT and Bridge tables together, the secret is this: look closely and you’ll see that the PIT table is a star-schema dimension, and the Bridge is quite close to a fact table. PIT and Bridge tables therefore pre-calculate the work involved in loading a Mart – and you can actually build a Mart as views over these helper tables, giving you extra agility.

Business rules calculate new data. A rule could work within one table – for example, the ratio of two columns. Or it could work across tables – say we want to select a customer name from two tables: we take it from this table here unless it’s missing, in which case we look it up in that table over there. That’s an integration rule. Business rule results are stored in business Satellites, which look like other Satellites except that the source column records the business rule. Business rule Satellites hang off the same Hubs or Links that the raw Satellites use. A common use case for a business rule Satellite is to select the best view of records across a set of source systems: if sources disagree, which one should we prefer for the date of birth, or the name, or the address, and so on? We can code that decision process as a rule, run it after each raw load, and populate a clean business rule Satellite with the results.
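For instance, the name-selection rule might be coded like this; the Satellite names and the record-source value are illustrative assumptions:

```sql
-- Hypothetical integration rule feeding a business satellite: prefer the CRM
-- name and fall back to the billing system when it's missing.
INSERT INTO bsat_customer_best_name (customer_hk, customer_name, record_source)
SELECT h.customer_hk,
       COALESCE(crm.customer_name, bill.customer_name) AS customer_name,
       'BUSINESS_RULE.BEST_NAME' AS record_source  -- marks the row as rule output
FROM hub_customer h
LEFT JOIN sat_customer_crm     crm  ON crm.customer_hk  = h.customer_hk
LEFT JOIN sat_customer_billing bill ON bill.customer_hk = h.customer_hk;
```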
Another feed for the Business Vault is the results of data science. Let’s say a data scientist has run some experiments and found something interesting and useful for the business, so they’ve built a model and deployed it into production. The model takes some data from a Data Lake and from our Data Vault, and it produces some new data. As far as the Data Vault is concerned, the model is really just another source system: we feed its results into a staging layer and load them back into the Vault as normal.

Finally, you might want to measure data quality. One way of doing this is to write views on the data that expose quality problems – you could, for example, expose orphaned Satellites, malformed Hub keys, missing dates or out-of-range data. If the data is OK, the views are empty; if not, we can see the offending records in the view. These views are almost a form of Satellite that we can attach to a Hub or Link record, giving the rule its context.
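As one small sketch, a view exposing orphaned Satellite rows – rows whose Hub key has no parent – could look like this, with the names assumed as before:

```sql
-- Hypothetical data-quality view: it returns rows only when something is wrong.
CREATE VIEW dq_orphaned_customer_details AS
SELECT s.*
FROM sat_customer_details s
LEFT JOIN hub_customer h ON h.customer_hk = s.customer_hk
WHERE h.customer_hk IS NULL;
```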
So how can we implement our business rules? We could calculate our business Satellites and populate them directly from a business rule, which is one option, but perhaps there’s a better way. Say we have our regular Data Vault load, as illustrated: stage data through ETL into the Raw Vault, then apply business rules to that raw data. If we treat our rules engine as just another source system, we can feed its output back to the staging layer and then load the data just like any source data. As we’ve already built robust loading utilities, we can use them to load the business rule data properly.

Well, that’s a quick overview of the Business Vault. In summary: don’t be confused – the Business Vault is just a part of the Vault, not a separate layer. Business rules can create new Satellites, and sometimes they may create a Link or a Hub too. There are different types of rule – summaries, derivations, helper tables, data science output and quality checks – and we can drop and recalculate business rules at any time. If you need more detailed information, you can download User Group presentations from our UK User Group website. We also offer Data Vault and Information Governance related blogs and white papers on our company site. And if you’d like to experiment with the Data Vault system, we’ve developed a package for the free-to-use dbt tool that will generate the SQL to load the Vault from your metadata. Check it out – again, on our website. Thanks for listening, and we hope to see you again!
