Thursday, December 1, 2022

The hidden cost of using a metadata driven ingestion framework

 I came across a few implementations of metadata driven ELT/ETL frameworks, designed and developed by some big consulting firms, perhaps great minds think alike, they all use hashed values  to detect changed records, to be able to avoid duplication and insert changes, this indeed saves a lot of effort and simplify the design.

However one of the major drawbacks of this approach is that often this makes optimization impossible, for example in delta lake, use merge into statement based on the hashed key makes the operation extremely expensive.  this will make partioning not useful or possible, often ended up high running cost.

It will probably make more sense to also define a partition key, zorder columns plus key columns used to identity uniqueness in the metadata, so the delta tables will be created using the optional partition key and zorder columns, then update the merge into statement to use join conditions based on the key columns so it will be able to do parition proning at least, usually the partition key should be part of the key columns  and in most cases is a date column , optionally defining the zorder columns to be the business key etc will also help.

No comments:

Disable Microsoft Defender for Cloud for Visual Studio Subscription (MSDN)

I use a visual studio pro subscription which comes with $150 azure cloud credit, for some reason Microsoft Defender for Cloud was turned on ...