
The Star Schema is Obsolete: Reduced Cost and Better Performance


This is the second blog in the series. Read the first blog here.


In my first blog, I stated that the star schema and dimensional modeling are obsolete, and I urged you to stop using them. I explained that the benefits Ralph Kimball laid out in the first chapter of The Data Warehouse Toolkit, the original justifications for dimensional modeling and the star schema, are now archaic.

Reduced Cost


In 1996, when the star schema and dimensional modeling were introduced, storage was expensive: almost $12,000 per gigabyte. Because the star schema reduced data redundancy, it also reduced storage costs. In 2022, a gigabyte of cloud data storage costs only about $0.023 per month, so reducing data redundancy no longer produces significant savings. Over the same period, labor costs for data workers have greatly increased. Converting a normal-form model into a star schema dimensional model consumes significant data engineering time every day, and the star schema dimensional model also increases the time needed to join information from multiple dimensions into usable data sets. The result is that using a star schema dimensional model in the modern data processing environment significantly increases costs.
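To see how small the storage side of that trade-off has become, here is a minimal back-of-envelope sketch you can run on most modern warehouses. The one-billion-row table and the 200 bytes of repeated dimension attributes per row are illustrative assumptions, not measurements:

    -- Hypothetical numbers: 1 billion rows, ~200 bytes of repeated
    -- dimension attributes per row, priced at $0.023 per GB-month.
    SELECT
        1000000000.0 * 200 / (1024 * 1024 * 1024)         AS redundant_gb,   -- ~186 GB
        1000000000.0 * 200 / (1024 * 1024 * 1024) * 0.023 AS usd_per_month;  -- ~$4.28

Under these assumptions, the redundancy costs a few dollars a month, while the engineering hours spent normalizing it away cost far more.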


Better Performance

In a benchmark study published by Fivetran, Michael Kaminsky provides evidence that today's modern data management platforms like Snowflake, Redshift, and BigQuery perform significantly faster when they use wide tables rather than a star schema dimensional model. When we join data together into one big wide table (OBT), the platform no longer has to perform joins at query time, or at least performs far fewer of them. The single most expensive operation in a relational platform is still the SQL JOIN. Combine this with what we already know, that data visualization, reporting, and many data science tools perform better with wide tables, and it becomes clear that the OBT/wide table is the preferable table architecture.
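The difference is easy to see side by side. The table and column names below (fact_sales, dim_date, dim_product, dim_store, sales_obt) are hypothetical examples, not tables from the Fivetran benchmark:

    -- Star schema: the analyst must join the fact table to each dimension.
    SELECT d.calendar_month,
           p.product_category,
           s.store_region,
           SUM(f.sales_amount) AS total_sales
    FROM   fact_sales   f
    JOIN   dim_date     d ON f.date_key    = d.date_key
    JOIN   dim_product  p ON f.product_key = p.product_key
    JOIN   dim_store    s ON f.store_key   = s.store_key
    GROUP  BY 1, 2, 3;

    -- One big table (OBT): the same question with no joins at all.
    SELECT calendar_month,
           product_category,
           store_region,
           SUM(sales_amount) AS total_sales
    FROM   sales_obt
    GROUP  BY 1, 2, 3;

The second query gives the optimizer no join work to do, which is the effect the benchmark is measuring.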


Help Data Users and Knowledge Workers Better Understand the Data

Perhaps in 1996 this argument was plausible, because data workers sat primarily within Information Technology (IT) organizations and knew SQL well; even so, I am not convinced the argument was ever accurate. In 2022, knowledge workers most often sit in the business organization. They are not IT employees; they are businesspeople trying to make better business decisions based on data. For many, the only tool they know for working with data is a spreadsheet. To them, the star schema is hard to understand and perhaps even unusable, because they lack the knowledge needed to join information from multiple tables. At the very least, using a spreadsheet to join information from a star schema dimensional model is time- and effort-intensive. The OBT/wide table, by contrast, is familiar to these data workers: it looks and feels like a spreadsheet. It is much easier for the modern data worker, and I think for everyone, to understand.
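One way to deliver that familiarity, a sketch assuming the same hypothetical star schema tables as above, is to have data engineering join the star once, materialize the result, and point every report, dashboard, and spreadsheet export at the wide table:

    -- Materialize the joins once so business users never write one.
    -- All table and column names are hypothetical.
    CREATE TABLE sales_obt AS
    SELECT f.sales_amount,
           f.quantity,
           d.calendar_date,
           d.calendar_month,
           p.product_name,
           p.product_category,
           s.store_name,
           s.store_region
    FROM   fact_sales  f
    JOIN   dim_date    d ON f.date_key    = d.date_key
    JOIN   dim_product p ON f.product_key = p.product_key
    JOIN   dim_store   s ON f.store_key   = s.store_key;

After this, the business user's view of the data is a single spreadsheet-shaped table.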

When we closely examine the reasons Ralph Kimball himself gave for why a star schema dimensional model was effective, it is clear that they no longer hold true. You need to stop using star schema dimensional models. They are ludicrously expensive, they decrease performance, and they inhibit your data and knowledge workers. For an explanation of what you should be doing instead, look for my next blog.


Perficient’s Cloud Data Expertise

Our cloud, data, and analytics team can assist with your entire data and analytics lifecycle, from data strategy to implementation. We will help you make sense of your data and show you how to use it to solve complex business problems. We’ll assess your current data and analytics issues and develop a strategy to guide you to your long-term goals.


Read the next blog in the series here.


Thoughts on “The Star Schema is Obsolete: Reduced Cost and Better Performance”

  1. The star schema is denormalized, so it does have more redundancy than a single big table where everything is stored only once.
    Splitting a single big table into five dimension tables and one fact table would significantly increase the storage required for the same data. Therefore, your point about storage costs decreasing would actually mean the star schema can be more popular now that storage is cheap. I think the costs to design and maintain (and re-design when requirements inevitably shift) star tables are just too much effort. I doubt data engineering talent is much more expensive, as a lot more people are choosing this field now.
    I would add one more aspect: the amount of data companies hoard and work with is insane. In the past you could spend time redesigning and renormalizing tables; today, engineers can make a much bigger cost impact with other actions. Many of today's big cost reductions are achieved simply by turning on a new feature delivered by cloud providers like AWS, GCP, and Azure.

  2. I think the story is more complex now.

    Due to cheap storage, we collect many, many more dimension attributes – features for ML.

    And tables are no longer stored row-wise; we have moved to column-wise storage, so the performance characteristics differ.

    Personally, I gave up on the star schema in simple cases, but I am coming back to it for very high-dimensionality cases.


Chuck Brooks

Dr. Chuck is a Senior Data Strategist / Solution Architect. He is a technology leader and visionary in big data, data lakes, analytics, and data science. Over a career that spans more than 40 years, Dr. Chuck has developed many large data repositories based on advancing data technologies and has helped many companies become data-driven and develop comprehensive data strategies. The cloud is the modern ecosystem for data and data lakes. Dr. Chuck’s expertise lies in the Google Cloud Platform, Advanced Analytics, Big Data, SQL and NoSQL Databases, Cloud Data Management Engines, and Business Management Development technologies such as SQL, Python, Data Studio, Qlik, PowerBI, Talend, R, Data Robot, and more. His perspective on sales enablement and data strategy is the product of those 40 years in the data space. For more information or to engage Dr. Chuck, contact him at chuck.brooks@perficient.com.
