Thursday, March 28, 2024

Elevating LLM Deployment with FastAPI and React: A Step-By-Step Guide

In a previous exploration, I delved into creating a Retrieval-Augmented Generation (RAG) demo, utilising Google’s Gemma model, Hugging Face, and Meta’s FAISS, all within a Python notebook. This demonstration showcased the potential to build a locally-run, RAG-powered application.

The conceptual flow of using RAG with LLMs. (Source)

This article aims to advance that groundwork by deploying the model and RAG functionality via FastAPI, then consuming the API through a straightforward ReactJS frontend. A notable enhancement in this iteration is the integration of the open-source Mistral 7B model and the Chroma vector database. The Mistral 7B model is acclaimed for its optimal balance between size and performance, surpassing the Llama 2 13B model across benchmarks and matching the prowess of Google’s Gemma model. Continue here
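As a flavour of the approach, here is a minimal sketch of the FastAPI side of such a service, assuming a sentence-transformers embedder, a local Chroma collection, and a Hugging Face text-generation pipeline wrapping Mistral 7B; the endpoint, model identifiers and prompt format are illustrative rather than lifted from the full article:

```python
# A rough sketch only: model names, collection names and prompt format are illustrative.
from fastapi import FastAPI
from pydantic import BaseModel
import chromadb
from sentence_transformers import SentenceTransformer
from transformers import pipeline

app = FastAPI()

# Embedding model and vector store (assumed setup; the full article may differ).
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection("rag-docs")

# Text-generation pipeline wrapping Mistral 7B (needs sufficient local resources).
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

class Query(BaseModel):
    question: str

@app.post("/query")
def query_rag(q: Query):
    # Embed the question and retrieve the closest chunks from Chroma.
    embedding = embedder.encode(q.question).tolist()
    hits = collection.query(query_embeddings=[embedding], n_results=3)
    context = "\n".join(hits["documents"][0])

    # Stuff the retrieved context into the prompt and generate an answer.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {q.question}\nAnswer:"
    answer = generator(prompt, max_new_tokens=256)[0]["generated_text"]
    return {"answer": answer, "context": context}
```

The React frontend then only needs to POST a question to the /query endpoint and render the returned answer.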


Friday, March 1, 2024

Streamlining Real-Time CDC and Data Replication with Debezium and Kafka

 In today’s fast-paced digital landscape, efficient data management and replication are more critical than ever. This article walks you through setting up a streamlined, real-time Change Data Capture (CDC) and data replication pipeline using Debezium and Kafka. We’ll leverage Docker Compose for a simplified testing environment, avoiding the complexities of server provisioning.

For those considering cloud-based solutions, options like Confluent Cloud offer a Kafka service with a free trial. Alternatively, Azure Event Hubs or AWS’s Managed Kafka services provide robust platforms for handling large-scale data streams.
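In a self-managed setup like the Docker Compose environment above, the Debezium source connector is registered against the Kafka Connect REST API. A rough illustration in Python follows; hostnames, credentials and table names are placeholders, and the exact property names vary slightly between Debezium versions:

```python
# Illustrative only: hostnames, credentials and table names are placeholders.
import json
import requests

debezium_source = {
    "name": "inventory-connector",
    "config": {
        # Debezium Postgres source connector; swap the class for MySQL, SQL Server, etc.
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "inventory",
        "topic.prefix": "dbserver1",            # CDC topic prefix (Debezium 2.x naming)
        "table.include.list": "public.orders",  # capture only the tables you need
    },
}

# Kafka Connect exposes a REST API, on port 8083 in most Docker Compose setups.
resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(debezium_source),
)
resp.raise_for_status()
print(resp.json())
```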

The full article can be read here: https://medium.com/@george.vane/streamlining-real-time-cdc-and-data-replication-with-debezium-and-kafka-b4d3bc56e2ab

Monday, June 12, 2023

Database replication using Confluent (Kafka) and Debezium

I have been playing with Confluent Cloud and Debezium for a little while and found them extremely useful for streaming data ingestion. The usual use cases I came across include the following two scenarios:

1. Use the Debezium CDC connector to generate change records into Kafka topics, then dump the change records to either cloud storage or Delta Lake; this is usually called the raw zone. You can subsequently consume these change records in your favourite data platform, such as Databricks or Snowflake, both of which have robust streaming ingestion support.
2. Alternatively, often you just want a copy of the production database for analytics usage, so a like-for-like replication is what you need; you can use the JDBC sink connector for that. The additional benefit is that you can replicate data to a different target database platform, for example MySQL to SQL Server, Postgres to SQL Server, MySQL to Postgres, etc.
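For the second scenario, the JDBC sink connector configuration might look roughly like the sketch below; the connection URL, topic name and key settings are placeholders, and the unwrap transform is there because Debezium wraps each row change in an event envelope that most sink connectors need flattened first:

```python
# Illustrative only: connection details, topic names and key settings are placeholders.
import json
import requests

jdbc_sink = {
    "name": "jdbc-sink-orders",
    "config": {
        # Confluent JDBC sink connector writing the change stream into SQL Server.
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "topics": "dbserver1.public.orders",
        "connection.url": "jdbc:sqlserver://sqlserver:1433;databaseName=replica",
        "connection.user": "sa",
        "connection.password": "change-me",
        "auto.create": "true",       # create the target table if it does not exist
        "insert.mode": "upsert",     # upserts keep the copy in step with the source
        "delete.enabled": "true",    # propagate deletes from the source
        "pk.mode": "record_key",
        # Flatten Debezium's change-event envelope into plain rows before writing.
        "transforms": "unwrap",
        "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
        "transforms.unwrap.drop.tombstones": "false",
    },
}

# Register the sink on the Kafka Connect worker (default REST port 8083).
requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(jdbc_sink),
).raise_for_status()
```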

Friday, May 5, 2023

Migrating IBM DB2 to Google Bigtable and achieving FIPS-compliant encryption using a custom Java encryption library

This is about a project I undertook recently. The purpose was to migrate a large volume of on-prem DB2 data to Google Bigtable using Dataproc, a Spark-based solution on Google Cloud. A few things were notable from the project:

1. I had to use Scala to develop the solution because the encryption library was developed in Java, and although it has interoperability with Python, it has a lot of limitations that stopped me from using Python... on the other hand, Scala and Java just work together seamlessly.

2. For FIPS compliance, I had to use the Bouncy Castle library, which introduced issues in managing dependencies ("dependency hell", as some call it); in the end I had to use Maven to manage dependencies and shading, rather than sbt, due to the complexity.

3. I used the hbase-spark connector for talking to Bigtable. Since I am using Spark 3, I had to compile the connector library manually; see https://github.com/apache/hbase-connectors/tree/master/spark (a rough sketch of the write path is shown below).

(this project was done about a year ago)
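The real solution was written in Scala for the reasons above, but purely as an illustration of the data path, here is a PySpark-flavoured sketch. It assumes the manually compiled hbase-spark connector and the Bigtable HBase client are on the Spark classpath and that hbase-site.xml points at the Bigtable instance; connection details, table and column names are made up:

```python
# Illustrative sketch only; the real project was implemented in Scala.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("db2-to-bigtable-sketch").getOrCreate()

# Read the source table from DB2 over JDBC (connection details are placeholders).
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:db2://db2-host:50000/SAMPLE")
    .option("dbtable", "SCHEMA1.CUSTOMERS")
    .option("user", "db2user")
    .option("password", "change-me")
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .load()
)

# Encryption of sensitive columns via the custom Java library would happen here (omitted).

# Write to Bigtable through the HBase Spark data source: map DataFrame columns
# onto an HBase row key and column family (names are made up).
(
    df.write.format("org.apache.hadoop.hbase.spark")
    .option("hbase.columns.mapping",
            "CUSTOMER_ID STRING :key, NAME STRING cf:name, EMAIL STRING cf:email")
    .option("hbase.table", "customers")
    .option("hbase.spark.use.hbasecontext", "false")
    .save()
)
```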

Handling Large Messages With Apache Kafka

While working on handling large messages with Kafka, I came across a few useful reference articles, bookmarking here for anyone who needs them:

https://dzone.com/articles/processing-large-messages-with-apache-kafka

https://www.morling.dev/blog/single-message-transforms-swiss-army-knife-of-kafka-connect/

https://www.kai-waehner.de/blog/2020/08/07/apache-kafka-handling-large-messages-and-files-for-image-video-audio-processing/

https://docs.confluent.io/cloud/current/connectors/single-message-transforms.html#cc-single-message-transforms-limitations

Tuesday, April 4, 2023

Object Tracking Demo

 


In a proof-of-concept project I undertook a while ago, the YOLO (You Only Look Once) object detection model was used in combination with the Deep SORT (Simple Online and Realtime Tracking) algorithm to track objects in real time. The aim was to showcase how this technology can be applied to traffic monitoring, specifically measuring the time it takes for vehicles to pass through a road junction. The project demonstrated that the combination of these technologies can accurately detect and track vehicles as they move through a monitored area, and the results showed that this method can provide accurate data on traffic flow, which could be useful for traffic management and infrastructure planning purposes.
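For anyone wanting to reproduce something similar, here is a rough sketch of how such a pipeline can be wired together, assuming the ultralytics and deep_sort_realtime packages; the original demo may have used different YOLO and Deep SORT implementations, and the video path, class filter and timing logic here are illustrative only:

```python
# Illustrative sketch; package choices and the timing logic are assumptions,
# not necessarily what the original demo used.
import cv2
from ultralytics import YOLO
from deep_sort_realtime.deepsort_tracker import DeepSort

model = YOLO("yolov8n.pt")          # pretrained YOLO detector
tracker = DeepSort(max_age=30)      # Deep SORT keeps track IDs across frames

cap = cv2.VideoCapture("junction.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 25.0

first_seen, last_seen = {}, {}      # frame index of first/last sighting per track ID
frame_idx = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break

    # Detect vehicles (COCO classes 2, 3, 5, 7 = car, motorbike, bus, truck).
    detections = []
    for box in model(frame, verbose=False)[0].boxes:
        cls = int(box.cls[0])
        if cls in (2, 3, 5, 7):
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            conf = float(box.conf[0])
            # Deep SORT expects ([left, top, width, height], confidence, class).
            detections.append(([x1, y1, x2 - x1, y2 - y1], conf, cls))

    # Update the tracker and record when each track ID is seen.
    for track in tracker.update_tracks(detections, frame=frame):
        if not track.is_confirmed():
            continue
        tid = track.track_id
        first_seen.setdefault(tid, frame_idx)
        last_seen[tid] = frame_idx

    frame_idx += 1

cap.release()

# Time each tracked vehicle spent in view (a proxy for junction transit time).
for tid in first_seen:
    seconds = (last_seen[tid] - first_seen[tid]) / fps
    print(f"vehicle {tid}: {seconds:.1f}s in view")
```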


Thursday, December 1, 2022

The hidden cost of using a metadata-driven ingestion framework

I have come across a few implementations of metadata-driven ELT/ETL frameworks, designed and developed by some big consulting firms. Perhaps great minds think alike: they all use hashed values to detect changed records, in order to avoid duplication and insert only the changes. This indeed saves a lot of effort and simplifies the design.

However, one of the major drawbacks of this approach is that it often makes optimization impossible. For example, in Delta Lake, a MERGE INTO statement whose join condition is based only on the hashed key makes the operation extremely expensive: it rules out partition pruning, so partitioning is neither useful nor possible, and it often ends up with a high running cost.

It would probably make more sense to also define in the metadata a partition key and Z-ORDER columns alongside the key columns used to identify uniqueness. The Delta tables would then be created with the optional partition key and Z-ORDER columns, and the MERGE INTO statement updated to use join conditions based on the key columns, so that it can at least do partition pruning. Usually the partition key should be part of the key columns and in most cases is a date column; optionally defining the Z-ORDER columns to be the business key, etc., will also help.
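A sketch of what such a merge could look like on Delta Lake, with made-up table and column names (event_date as the partition key, business_key as the uniqueness and Z-ORDER column, row_hash as the framework's change-detection hash):

```python
# Illustrative Spark SQL on Delta Lake; table and column names are made up.
from pyspark.sql import SparkSession

# Assumes a Delta-enabled Spark session (on Databricks, `spark` already exists).
spark = SparkSession.builder.getOrCreate()

merge_sql = """
MERGE INTO target t
USING staged_changes s
  ON  t.event_date   = s.event_date      -- partition key in the join enables pruning
  AND t.business_key = s.business_key    -- key columns identify the row
WHEN MATCHED AND t.row_hash <> s.row_hash THEN
  UPDATE SET *                           -- the hash only decides whether to update
WHEN NOT MATCHED THEN
  INSERT *
"""
spark.sql(merge_sql)

# Keep files clustered on the business key so matched-row lookups stay cheap.
spark.sql("OPTIMIZE target ZORDER BY (business_key)")
```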
