Sunday, June 28, 2015
Data Aggregation & Data Discovery - Part II
Expanding on the context of Data Aggregation, variously called a data refinery, data factory or data lake, I would like to examine whether the concept is just a theoretical construct or whether there is a practical side to it.
My opinion is that Data Aggregation (regardless of how it is referred to) is just a means to an end: an enabler of, or precursor to, Data Discovery. It is a facility for bringing together various disconnected sources of data that were previously leveraged only in very “targeted” use cases. The idea is to discover new connections or to explore new usage patterns. These explorations might belong to the realm of identifying proactive growth opportunities or to the domain of preemptive loss prevention. Data scientists can employ statistical algorithms and predictive modeling techniques to see if new patterns emerge, or to see if they are able to ferret out alternate connections. One can also imagine using clustering and machine learning techniques to find unknown patterns that could be applied to marketing, operational processes, and product placement decisions. None of this would be possible with dissociated sources of data. Thus, the truly quantifiable benefit of a Data Aggregation platform is that it brings large, disconnected data sets into one holistic platform where traditional statistical modeling techniques can be run for the purpose of Data Discovery.
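To make this a little more concrete, here is a minimal sketch in Python of what "bringing disconnected sources together" can look like at toy scale. Everything in it is an assumption made for illustration: the two sources (`web_visits` and `support_tickets`), the customer ids, the numbers, and the helper names `aggregate` and `pearson`. It uses a plain correlation rather than clustering or machine learning, simply because that is the shortest way to show a relationship that neither source can reveal on its own.

```python
from math import sqrt

# Hypothetical source 1: web-store activity, keyed by customer id.
web_visits = {"c1": 2, "c2": 9, "c3": 4, "c4": 12, "c5": 1}

# Hypothetical source 2: support-desk tickets, held in a separate system.
support_tickets = {"c1": 1, "c2": 5, "c3": 2, "c4": 6, "c6": 3}


def aggregate(*sources):
    """Join sources on customer id, keeping only ids present in every source."""
    shared = set(sources[0]).intersection(*sources[1:])
    return {key: tuple(src[key] for src in sources) for key in sorted(shared)}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Only after the join can the two measures be compared customer by customer.
combined = aggregate(web_visits, support_tickets)
visits, tickets = zip(*combined.values())
r = pearson(visits, tickets)
print(f"{len(combined)} customers in both sources, correlation = {r:.2f}")
```

In this made-up data, visits and tickets rise together for the customers that appear in both systems. That is exactly the kind of connection that stays invisible while each source is only used for its own targeted purpose.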
Now you might ask how the concept of Data Aggregation, and the need to create a platform for disparate data sources, is connected to Data Discovery and Big Data, and what options, if any, open up. Traditional infrastructure has always been a real limiting factor for any enterprise that wanted to create a hosting platform for large data sets. So most enterprises were more inclined to host Data Warehousing platforms offering KPIs that were projections of known trends, rather than to invest in platforms that explored the somewhat unreliable potential of statistical modeling. Data scientists worked around this by using “sample data sets”, knowing full well that the results could be skewed or the patterns discovered could be choppy. This is where Hadoop-based Big Data techniques come in handy.
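The risk with sample data sets can be illustrated with a small simulation. This is only a sketch under assumptions I have made up for the purpose: a population of 100,000 records, a rare pattern present in 0.5% of them, samples of 200 records each, and a helper called `rare_rate`. It compares the rate measured on the full data set with the rates measured on 50 small random samples.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 records, 0.5% of which carry a rare
# but important pattern (flagged as 1).
population = [1] * 500 + [0] * 99_500
rng.shuffle(population)


def rare_rate(records):
    """Fraction of records that carry the rare pattern."""
    return sum(records) / len(records)


# Rate measured on the complete data set.
full_rate = rare_rate(population)

# Rates measured on 50 small samples, as an analyst constrained by
# infrastructure might have to work.
sample_rates = [rare_rate(rng.sample(population, 200)) for _ in range(50)]

print(f"full data set : {full_rate:.4f}")
print(f"sample range  : {min(sample_rates):.4f} to {max(sample_rates):.4f}")
print(f"samples that miss the pattern entirely: {sample_rates.count(0.0)}")
```

With only about one matching record expected per sample, individual samples can overstate the pattern several times over or miss it altogether, while the full data set gives the same answer every time.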
Now IT departments can offer cost-effective Hadoop-based platforms for hosting the large data sets needed for Data Aggregation. Linearly scalable Data Aggregation platforms built on commodity hardware make it possible for data scientists to run their predictive models and algorithms against representative data sets instead of scaled-down subsets. Most of all, Hadoop-based Data Aggregation helps ensure the reliability of the business outcomes generated by these models and algorithms. The efficacy of the outcomes and the applicability of the predictions ultimately increase the rate of adoption of predictive modeling techniques. The bottom line: businesses gain a competitive edge from Data Discovery deployed against Data Aggregation constructs.
I look forward to hearing what is working for you and whether you have been able to realize the benefits that these technologies tout.