Traditional EDW vs Big Data

Big data is the newest buzzword in the industry. Executives and information technology experts have all dropped the cloud computing buzz and hopped onto the big data bandwagon. Generally, excitement and buzz in the market lead to misconceptions about a new idea, and it takes a few iterations before its key concept is widely understood.

Is big data a new concept? – No. The concept has been around for four decades under the name enterprise data warehouse (EDW), although the focus of an EDW is primarily on internal structured data.

The objective of this blog post is to bring out the key concept of big data by comparing it with the enterprise data warehouse.

The simplest view of a data warehouse is that all operational data is brought to one place to serve as a single point of truth for the organization, and every combination of analytical reports is generated from it. A typical enterprise data warehouse data flow is given in the figure above. If the EDW already exists, what is big data, and why big data now?

What is it? – To go back to my last article on the Money Ball architect, big data is the collection of internal and external information that Money Ball architects require. By my definition, a Money Ball architect (otherwise called a data architect or data scientist) works to identify a set of differentiating data within a massive data set. Differentiating data is modeled and derived once the product, service, consumer, and partner trends are studied and understood. The consumer, partner, product, and economic data is unstructured and lies in uncharted territory. A massive data set in uncharted territory includes internal and external data, both structured and unstructured. That massive data set is called big data.

Why is it now? – The need for big data arose with the emergence of social media and other unstructured data that is widely used both inside and outside an organization. Unstructured data includes customer status updates on Facebook and Twitter, YouTube video uploads, picture uploads from a smartphone, and voice assistants like Siri. The behavior of consumers, the actual end-user experience, and product acceptance and adoption are viral, unstructured, and paradoxical. With the rapid adoption and growth of mobile technology, consumer interactions, purchasing habits, and product reviews go viral. The simpler it becomes for the consumer to engage in an experience, the more complex the analysis becomes from a service provider's perspective.

“The behavior of consumers, the actual end-user experience, and product acceptance and adoption are viral, unstructured, and paradoxical.”

An unsatisfied customer no longer calls a “1-800-sup-port” number to file a complaint. They tweet or update their Facebook status about their experience. Companies that try to measure customer satisfaction by analyzing only the internal customer complaint database will surely miss the reality. Traditional and trivial data analytics are not good enough anymore. The availability of technologies like Hadoop, HDFS, Avro, MapReduce, ZooKeeper, Pig, Chukwa, Hive, HBase, and the R programming language makes the big data concept practical. The emergence of massive unstructured data through social media, its use in daily activities, and the availability of these technologies are what led to big data now.

All of the core technologies for big data are open source tools. With minimal hiccups over the Easter weekend, Hadoop and MapReduce were successfully installed, configured, and made functional on Ubuntu Linux running in VirtualBox on a Windows 7 host.
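For readers who have not seen the MapReduce model before, here is a minimal sketch of the idea in plain Python. It does not use the Hadoop API; it only imitates the map, shuffle, and reduce steps in a single process. The sample posts and every function name below are invented for illustration, standing in for the kind of social media updates discussed above.

```python
from collections import defaultdict

# Hypothetical sample data: invented posts standing in for social media updates.
POSTS = [
    "support was slow and the app keeps crashing",
    "love the new app update",
    "app crashing again, support never answers",
]


def map_phase(post):
    """Emit a (word, 1) pair for every word in one post."""
    for word in post.lower().replace(",", " ").split():
        yield word, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_counts(posts):
    """Run the three phases in order over a list of posts."""
    pairs = (pair for post in posts for pair in map_phase(post))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    counts = word_counts(POSTS)
    for word in ("app", "support", "crashing"):
        print(word, counts[word])
```

On a real cluster the map and reduce functions would run in parallel on many machines and the framework would handle the grouping step; the sketch only shows how the work is divided.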

There are many commercial versions and open source tools available to run an enterprise big data infrastructure. I will write about the big data technology landscape as my next big data topic.

MySQL – Enterprise readiness

Google runs critical business systems with MySQL.

“Google runs critical business systems with MySQL and InnoDB. The systems require 24×7 operation with minimal downtime. The systems support large OLTP and reporting workloads. We are very happy with the scalability, reliability and manageability of this software.”

Chris DiBona, Open Source Programs Manager, Google Inc.

Yahoo! Finance runs on MySQL

MySQL at Yahoo!
Some Technical Details:
Operating system used: FreeBSD and Linux, synchronized using MySQL Replication
Size of database: 25 GB
Average number of concurrent connections: 60
Max number of concurrent connections: 250

Ticketmaster runs on MySQL

“We migrated the Event database from Microsoft SQL Server to MySQL for lower costs and higher scalability. Thanks to MySQL, we are able to scale 4 times better while constantly maintaining the replication latency of less than 1 second across our 250 MySQL servers.”

Ed Presz, Sr. Director of Database Engineering, Ticketmaster Entertainment

It is time for other innovative companies in the financial, retail, manufacturing, health care, and services sectors to look into MySQL as their database.
