Thursday, December 1, 2022

The hidden cost of using a metadata driven ingestion framework

 I came across a few implementations of metadata driven ELT/ETL frameworks, designed and developed by some big consulting firms, perhaps great minds think alike, they all use hashed values  to detect changed records, to be able to avoid duplication and insert changes, this indeed saves a lot of effort and simplify the design.

However one of the major drawbacks of this approach is that often this makes optimization impossible, for example in delta lake, use merge into statement based on the hashed key makes the operation extremely expensive.  this will make partioning not useful or possible, often ended up high running cost.

It will probably make more sense to also define a partition key, zorder columns plus key columns used to identity uniqueness in the metadata, so the delta tables will be created using the optional partition key and zorder columns, then update the merge into statement to use join conditions based on the key columns so it will be able to do parition proning at least, usually the partition key should be part of the key columns  and in most cases is a date column , optionally defining the zorder columns to be the business key etc will also help.

Sunday, May 8, 2022

Configure Snowflake with External oAuth using Azure AD and device flow with MFA

Following the article at https://docs.snowflake.com/en/user-guide/oauth-azure.html and https://community.snowflake.com/s/article/How-To-Test-Azure-OAuth-Connection-To-Snowflake-End-To-End-Using-Python-User-Credentials-Flow, it is possible to implement a password flow for authenticating to snowflake using user's own credentials and assume roles they have been granted access to. however this hit an issue with MFA, unless you trusted the ip range of sagemaker this will not work.

To work around this issue, one can enable the public client feature ont he Azure AD client App and use msal to implement device flow instead. please refer to https://github.com/Azure-Samples/ms-identity-python-devicecodeflow for the sample code.

Wednesday, January 12, 2022

spark + jupyter notebook on ubuntu

 

Step 1: download spark from https://spark.apache.org/downloads.html

Step 2: unzip

           $  tar zxvf ../spark-3.x.x.tar.gz

Step 3: setup bash by adding the following t~/.bashrc

export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
Step 4: install jupyter notebook:  
        $ pip install jupyter
Step 5: Start Spark: 
                $ start-all.sh
Step 6: install findspark package: 
        $ pip install findspark
Step 7: launch jupyter notebook: 
        $jupyter notebook
Step 8: create a new notebook and add the following code for testing:
import findspark

findspark.init()

import pyspark

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.sql("select 'spark' as hello ")

df.show()

Disable Microsoft Defender for Cloud for Visual Studio Subscription (MSDN)

I use a visual studio pro subscription which comes with $150 azure cloud credit, for some reason Microsoft Defender for Cloud was turned on ...