Sunday, June 28, 2015

Data Aggregation & Data Discovery - Part II

Expanding on the context of Data Aggregation, variously called a data refinery, data factory or data lake, I would like to examine whether the concept is just a theoretical construct or whether there is a practical side to it.

My opinion is that Data Aggregation (regardless of what it is called) is just a means to an end: an enabler of, or precursor to, Data Discovery.  It is a facility for bringing together various disconnected sources of data that were previously leveraged only in very “targeted” use cases.  The idea is to discover new connections or to explore new usage patterns.  These explorations might aim at identifying proactive growth opportunities or at preemptive loss prevention.  Data scientists can employ statistical algorithms and predictive modeling techniques to see whether new patterns emerge or whether they can ferret out alternate connections.  One can also imagine using clustering and machine learning techniques to find unknown patterns that could be applied to marketing, operational processes and product placement decisions.  None of this would be possible with dissociated sources of data.  Thus, the real benefit of a Data Aggregation platform is that it brings large, disconnected data sets into one holistic platform where traditional statistical modeling techniques can be run for the purpose of Data Discovery.
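To make the idea concrete, here is a minimal sketch in Python of what "bringing disconnected sources together" can look like at toy scale.  Everything in it is an assumption made up for illustration: the two sources (`support_tickets`, `monthly_spend`), the customer ids and the numbers are hypothetical, and a simple Pearson correlation stands in for whatever statistical technique a data scientist would actually choose.

```python
from math import sqrt

# Hypothetical source 1: support tickets per customer, from a help-desk system.
support_tickets = {"c1": 0, "c2": 5, "c3": 1, "c4": 7, "c5": 2}

# Hypothetical source 2: monthly spend per customer, from a separate billing system.
monthly_spend = {"c1": 120.0, "c2": 40.0, "c3": 100.0, "c4": 20.0, "c5": 90.0, "c6": 75.0}


def aggregate(*sources):
    """Join the sources on customer id, keeping only ids present in all of them."""
    shared = set(sources[0]).intersection(*sources[1:])
    return {cid: tuple(src[cid] for src in sources) for cid in sorted(shared)}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Only after the two sources sit side by side can the relationship be measured.
combined = aggregate(support_tickets, monthly_spend)
tickets, spend = zip(*combined.values())
r = pearson(tickets, spend)
print(f"customers in both sources: {len(combined)}, correlation: {r:.2f}")
```

In this made-up data, customers with more support tickets spend less, a connection that neither source could show on its own.  Customer `c6` appears only in the billing data and drops out of the join, which is a reminder that aggregation also exposes gaps between sources.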

Now you might ask how the concept of Data Aggregation, and the need to create a platform for disparate data sources, is connected to Data Discovery and Big Data, and what options, if any, open up.  Traditional infrastructure has always been a real limiting factor for any enterprise that wanted to create a hosting platform for large data sets.  So most enterprises were more inclined to host Data Warehousing platforms that offered KPIs, which were projections of known trends, than to invest in platforms that explored the less certain potential of statistical modeling.  Data scientists worked around this by using “sample data sets”, knowing full well that the results could be skewed or that the patterns discovered could be choppy.  This is where Hadoop-based Big Data techniques come in handy.

Now your IT department can offer cost-effective Hadoop-based platforms for deploying the large data sets needed for Data Aggregation.  Linearly scalable Big Data Aggregation platforms built on commodity hardware make it possible for data scientists to execute their predictive models and algorithms against representative data sets instead of scaled-down subsets.  Most of all, Hadoop-based Data Aggregation improves the reliability of the business outcomes generated by these models and algorithms.  The efficacy of the outcomes and the applicability of the predictions ultimately increase the rate of adoption of data science and predictive modeling techniques.  The bottom line: businesses gain a competitive edge from Data Discovery deployed against Data Aggregation constructs.
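The point about scaled-down subsets can be illustrated with a small Python sketch.  The numbers here are assumptions chosen purely for illustration (a hypothetical population of 100,000 transactions, 0.5% of them losses, and samples of 200 records); it is not a model of any real workload, only a way to see how coarse a small sample is for a rare pattern.

```python
import random

rng = random.Random(42)  # fixed seed so repeated runs draw the same samples

# Hypothetical population: 100,000 transactions, 500 of them (0.5%) are losses.
POPULATION_SIZE = 100_000
LOSS_COUNT = 500
SAMPLE_SIZE = 200

population = [1] * LOSS_COUNT + [0] * (POPULATION_SIZE - LOSS_COUNT)
rng.shuffle(population)


def loss_rate(records):
    """Fraction of records flagged as a loss."""
    return sum(records) / len(records)


# Rate measured on the whole data set.
full_rate = loss_rate(population)

# Rates measured on ten independent small samples of the same data.
sample_rates = [loss_rate(rng.sample(population, SAMPLE_SIZE)) for _ in range(10)]

print(f"full data set: {full_rate:.4f}")
print(f"sample range : {min(sample_rates):.4f} to {max(sample_rates):.4f}")
```

With 200 records per sample, every estimate has to be a multiple of 1/200, so a sample can only report 0, 0.005, 0.01 and so on, while the full data set gives the rate directly.  That coarseness is the "choppy" result that sampling was forced to accept.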

I look forward to hearing what is working for you and whether you have been able to realize the benefits that these technologies tout.

Wednesday, May 27, 2015

Data Aggregation & Data Discovery - Part I

There has been a lot of talk lately about the concept of the data lake, variously known as a data refinery, data factory and so on.  I find it interesting that we now hear logical architectural terms that speak to the concepts and purpose of big data technologies such as Hadoop / HDFS and Apache distributed database technologies such as HBase / Cassandra.

This may be indicative of a shift.  What I am not sure of is what it means.  Does it mean that this suite of open source technologies has achieved a level of maturity?  Or could it point to the fact that these technologies have practical applications that solve enterprise-scale problems?  Or does it show that enterprises have realized that they can no longer deal with just "structured data", and that the vast majority of information lies in "unstructured content", leaving them no choice but to venture into the realm of big data technologies?  Not really sure!

The fact remains: when the big-name software vendors start getting into the business of marketing big data technologies and start publishing white papers with cool-sounding names, then there is something going on!!  I look at data lake, data refinery, data factory etc. as synonymous terms for what in the information science realm we call data aggregation!  I could be totally off base here and would love to have more of a conceptual / architectural debate on this topic.

I would love to hear from others who are actively leveraging these technologies about how they are applying these concepts.

surekha -