Problems: Redundancy

Different kinds of Normal Forms:

1NF, 2NF,3NF etc

Data Modelling

Star Schema

Star schemas denormalize the data, which means adding redundant columns to some dimension tables to make querying and working with the data faster and easier. The purpose is to trade some redundancy (duplication of data) in the data model for increased query speed, by avoiding computationally expensive join operations.

In this model, the fact table is normalized but the dimensions tables are not. That is, data from the fact table exists only on the fact table, but dimensional tables may hold redundant data.

Resources: https://www.databricks.com/glossary/star-schema

snowflake schema:

A snowflake schema is a multi-dimensional data model that is an extension of a star schema, where dimension tables are broken down into subdimensions. Snowflake schemas are commonly used for business intelligence and reporting in OLAP data warehouses, data marts, and relational databases.

In a snowflake schema, engineers break down individual dimension tables into logical subdimensions. This makes the data model more complex, but it can be easier for analysts to work with, especially for certain data types.

It's called a snowflake schema because its entity-relationship diagram (ERD) looks like a snowflake, as seen below.

A snowflake schema diagram with a central fact table that connects to multiple dimensional tables and subdimensional tables via foreign keys.

How to improve an existing Data Architecture?

How do you choose Datalake vs Datawarehouse for your solution.

Types of Data Stores - OLTP, ODS, OLAP, Data Mart, Cube, etc - https://www.youtube.com/watch?v=bZoO48yPi-Q

References: https://www.youtube.com/watch?v=9ToVk0Fgsz0

ETL Design Patterns

reference: https://www.youtube.com/watch?v=EmBwKA6vI14

Data Governance

Data governance is the system of internal policies that organizations use to manage, access, and secure enterprise data. While systems may vary in complexity from organization to organization, they always have some common features: internal processes, policies, defined roles, metrics, and compliance standards. The goal of the system is to help people efficiently and securely use the vast amounts of data generated by today’s enterprises.

Benefits:

Improved Data Management
Provide meaning and quality of data, Improved Data Quality
Regulatory, more consistent Compliance
Reduced costs and Increased Value.
Single source of Truth.
Establishing Data ownership & Accountability

Eg: Exchange datasets

Data Lineage

reference: https://www.youtube.com/watch?v=a4HPjtRHaHk

Data Modeling

What is an ER(Entity Relationship) Model?

An Entity Relationship Diagram is a diagram that represents relationships among entities in a database. It is commonly known as an ER Diagram. An ER Diagram in DBMS plays a crucial role in designing the database. Today’s business world previews all the requirements demanded by the users in the form of an ER Diagram. Later, it's forwarded to the database administrators to design the database.

Normalized Model?

Normalization is the method of arranging the data in the database efficiently. It involves constructing tables and setting up relationships between those tables according to some certain rules. The redundancy and inconsistent dependency can be removed using these rules in order to make it more flexible.

There are 6 defined normal forms: 1NF, 2NF, 3NF, BCNF, 4NF and 5NF. Normalization should eliminate the redundancy but not at the cost of integrity.

Normalization is the technique of dividing the data into multiple tables to reduce data redundancy and inconsistency and to achieve data integrity. On the other hand, Denormalization is the technique of combining the data into a single table to make data retrieval faster.
STAR Schema, Snowflake Schema comes into Picture in this case.

De-Normalized Model?

Deformalizing tables is basically creating a single source of truth when it comes to querying vast number of fact records that might have dimensional data spread across several tables. Having Facts and dimensions in a single table makes aggregations simpler, as the query runs faster.

Relational and Dimensional Models?

The relational model uses a collection of tables to represent both data and the relationships among those data.

Eg: Raw Layer

referencs: https://www.geeksforgeeks.org/relational-model-in-dbms/

Dimensional Model:

Dimensional models are generally used for data warehousing scenarios, and are particularly useful where super-fast query results are required for computed numbers such as “quarterly sales by region” or “by salesperson”. Data is stored in the Dimensional model after pre-calculating these numbers, and updated as per some fixed schedule.

Eg: Curated Layer

What's the Difference Between a Data Warehouse, Data Lake, and Data Mart?

A data warehouse stores data in a structured format. It is a central repository of preprocessed data for analytics and business intelligence.

A data mart is a data warehouse that serves the needs of a specific business unit, like a company’s finance, marketing, or sales department.

A data lake is a central repository for raw data and unstructured data. You can store data first and process it later on.

Source: https://aws.amazon.com/compare/the-difference-between-a-data-warehouse-data-lake-and-data-mart/#:~:text=Data%20warehouses%20typically%20store%20data,from%20another%20existing%20data%20warehouse.

How to choose between star and snowflake schemas?

Generally speaking, a star schema is recommended if you have a few dimensions with low cardinality and limited levels of hierarchy. Additionally, if you need fast and simple queries with aggregated data and have ample storage space to tolerate some data redundancy, then this schema is ideal. On the other hand, a snowflake schema is best suited when you have many dimensions with high cardinality and multiple levels of hierarchy. This type of schema is also helpful when you need complex and detailed queries with granular data and want to avoid data inconsistency. Moreover, a hybrid approach that combines both schemas can be used depending on the specific data warehouse scenarios and trade-offs. For instance, a star schema can be used for simpler and more stable dimensions while a snowflake schema can be used for more complex and dynamic ones. Finally, you can also use a constellation schema that has multiple fact tables sharing some common dimension tables to support different business processes or analytical needs.

Source: https://www.linkedin.com/advice/0/how-do-you-choose-between-star-snowflake-schema?src=go-pa&trk=sem-ga_campid.20321745998_asid.149140187205_crid.664020243553_kw._d.c_tid.dsa-2085899622689_n.g_mt._geo.9008393&mcid=7080236969011671040&cid=&gad_source=1&gclid=CjwKCAjwmrqzBhAoEiwAXVpgogHJyqTxgqb_9HB4C_nm4VdC_n491F3p2Yv3h2atQlhbdi-Bmjx0GRoCbd0QAvD_BwE&gclsrc=aw.ds

How to create a star schema?

A star schema is the Backbone of Dimensional Modeling. Below are the basic steps those I follow for this schema creation.

1. Understand and Select Business Process or Objective.

2. Define granular information to be stored in a table.

Ideally, what should a row in the fact table need to represent?

Eg: Daily trade value. etc

3. Identify Dimensions - like Calander, Location, Products etc

4. Identify Facts - measurable things like transactions, purchases etc

5. Build Star or Snowflake around a fact linking it with Dimensions.

Hadoop and Spark by Leela Prasad

Sunday, June 16, 2024

Design Patterns in Data Engineering

Tuesday, June 11, 2024

Normalization in DBMS

Data Modelling

Star Schema

Monday, June 10, 2024

Database Architecture