Comparison of Big Data and Data Warehouse
If I say that Big Data is an advanced form of the data warehouse, would I be wrong? A typical definition of the data warehouse, from Wikipedia, reads:
In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise.
Is Big Data not doing the same thing, analyzing data from different sources? Yes, it is. But as technologies advance and needs change, new terminology is introduced along with some enhanced features, and people start chasing the new baby in the market like crazy. Data was growing fast, and a lot of processing was required to extract useful insight from it and make good decisions, so the term Big Data was coined. As per Wikipedia:
Big data is a term used to refer to data sets that are too large or complex for traditional data-processing application software to adequately deal with. Data with many cases (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate.
When I was studying statistics, I used to wonder why I was studying it. Later it became clear: everything is an analysis of data, whether to predict, to forecast, or to understand the current situation of the subject under discussion. The term Big Data gained popularity in the 1990s, when people began talking about systems in which not only structured data, as in warehouses built for large data sets, but also unstructured data from all sources, rather than a single source, would be available for analysis. In 2004, Google published a paper on MapReduce, a parallel processing model. Apache Hadoop was built on that idea, and Apache Spark, which began as a UC Berkeley research project in 2009 and was donated to Apache in 2013, was introduced to tackle the limitations of MapReduce.
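To make the MapReduce idea a little more concrete, here is a minimal single-process sketch of a word count in plain Python. It is only an illustration of the map, shuffle, and reduce steps, not how Hadoop or Spark implement them; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real cluster would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict


def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one document.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_pairs):
    # Shuffle: group all emitted values by their key (the word).
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single count.
    return key, sum(values)


def word_count(documents):
    # On a cluster each document (or block) would be mapped on a
    # different node; here the documents are simply looped over.
    mapped = []
    for document in documents:
        mapped.extend(map_phase(document))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "data warehouse"]))
```

Because each map call looks at only one document and each reduce call at only one key, the work can be split across machines, which is the property that makes the model suitable for large data sets.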
Data warehouses and Big Data are both subject-oriented, and both are used for analysis. The difference is that Big Data accepts data from all sources (sensors, purpose-built applications, social media) and provides a cost-effective solution for subject analysis. A data warehouse focuses on and receives data from a single kind of source, such as products or customers, and provides analysis for an organization, but it does not provide analysis of operations the way Big Data does. A data warehouse mostly accepts relational data, while in Big Data you can store any type of data: video, text, audio, and so on.
Some people try to differentiate Big Data in terms of preferences, but in my view the data warehouse and Big Data are both used for making decisions. The difference is that the former works from predefined, curated information, while the latter works from multi-sourced big data, on which an organization can also run comparisons in many ways. Comparison over a time range, the main feature of the data warehouse, is one of them. Big Data is more convenient for handling volatile data such as streams, while both are non-volatile stores and keep old data safe.
Time is money, so Big Data, by using a distributed file system, is more cost-effective than the warehouse, which can take a day or so to process and analyze a large data set.
So Big Data is almost the same as a data warehouse, and if I say it is also a warehouse, with the exceptions discussed above, I will not be wrong. Thanks go to Google for the idea of MapReduce, and to Apache for taking that idea to the next level with Hadoop and, later, Spark.
Please do share your opinion and valuable insight.
Syed Murtaza Hussain Kazmi
Senior Software Architect
Efficiency in code is good but that must be reasonable.