References: https://www.youtube.com/watch?v=9ToVk0Fgsz0
ETL Design Patterns
reference: https://www.youtube.com/watch?v=EmBwKA6vI14
Data Governance
Data governance is the system of internal policies that organizations use to manage, access, and secure enterprise data. While systems may vary in complexity from organization to organization, they always have some common features: internal processes, policies, defined roles, metrics, and compliance standards. The goal of the system is to help people efficiently and securely use the vast amounts of data generated by today’s enterprises.
Benefits:
- Improved Data Management
- Provide meaning and quality of data, Improved Data Quality
- Regulatory, more consistent Compliance
- Reduced costs and Increased Value.
- Single source of Truth.
- Establishing Data ownership & Accountability
Eg: Exchange datasets
Data Lineage
reference: https://www.youtube.com/watch?v=a4HPjtRHaHk
Data Modeling
What is an ER(Entity Relationship) Model?
An Entity Relationship Diagram is a diagram that represents relationships among entities in a database. It is commonly known as an ER Diagram. An ER Diagram in DBMS plays a crucial role in designing the database. Today’s business world previews all the requirements demanded by the users in the form of an ER Diagram. Later, it's forwarded to the database administrators to design the database.
Normalized Model?
Normalization is the method of arranging the data in the database efficiently. It involves constructing tables and setting up relationships between those tables according to some certain rules. The redundancy and inconsistent dependency can be removed using these rules in order to make it more flexible.
There are 6 defined normal forms: 1NF, 2NF, 3NF, BCNF, 4NF and 5NF. Normalization should eliminate the redundancy but not at the cost of integrity.
- Normalization is the technique of dividing the data into multiple tables to reduce data redundancy and inconsistency and to achieve data integrity. On the other hand, Denormalization is the technique of combining the data into a single table to make data retrieval faster.
- STAR Schema, Snowflake Schema comes into Picture in this case.
De-Normalized Model?
Deformalizing tables is basically creating a single source of truth when it comes to querying vast number of fact records that might have dimensional data spread across several tables. Having Facts and dimensions in a single table makes aggregations simpler, as the query runs faster.
Relational and Dimensional Models?
The relational model uses a collection of tables to represent both data and the relationships among those data.
Eg: Raw Layer
Dimensional Model:
Dimensional models are generally used for data warehousing scenarios, and are particularly useful where super-fast query results are required for computed numbers such as “quarterly sales by region” or “by salesperson”. Data is stored in the Dimensional model after pre-calculating these numbers, and updated as per some fixed schedule.
Eg: Curated Layer
What's the Difference Between a Data Warehouse, Data Lake, and Data Mart?
A data warehouse stores data in a structured format. It is a central repository of preprocessed data for analytics and business intelligence.
A data mart is a data warehouse that serves the needs of a specific business unit, like a company’s finance, marketing, or sales department.
A data lake is a central repository for raw data and unstructured data. You can store data first and process it later on.
How to choose between star and snowflake schemas?
Generally speaking, a star schema is recommended if you have a few dimensions with low cardinality and limited levels of hierarchy. Additionally, if you need fast and simple queries with aggregated data and have ample storage space to tolerate some data redundancy, then this schema is ideal. On the other hand, a snowflake schema is best suited when you have many dimensions with high cardinality and multiple levels of hierarchy. This type of schema is also helpful when you need complex and detailed queries with granular data and want to avoid data inconsistency. Moreover, a hybrid approach that combines both schemas can be used depending on the specific data warehouse scenarios and trade-offs. For instance, a star schema can be used for simpler and more stable dimensions while a snowflake schema can be used for more complex and dynamic ones. Finally, you can also use a constellation schema that has multiple fact tables sharing some common dimension tables to support different business processes or analytical needs.
How to create a star schema?
A star schema is the Backbone of Dimensional Modeling. Below are the basic steps those I follow for this schema creation.
1. Understand and Select Business Process or Objective.
2. Define granular information to be stored in a table.
Ideally, what should a row in the fact table need to represent?
Eg: Daily trade value. etc
3. Identify Dimensions - like Calander, Location, Products etc
4. Identify Facts - measurable things like transactions, purchases etc
5. Build Star or Snowflake around a fact linking it with Dimensions.
No comments:
Post a Comment