Data Aggregation & Data Discovery - Part II
Expanding on the context of Data Aggregation, variously called a data refinery, data factory, or data lake, I would like to examine whether the concept of Data Aggregation is just a theoretical construct or whether it has a practical side.
My opinion is that Data Aggregation (regardless of what it is called) is a means to an end; an enabler or precursor for Data Discovery. It is essentially a facility for bringing together disconnected sources of data that were previously leveraged only in very “targeted” use cases, with the idea of discovering new connections or exploring new usage patterns. These explorations might belong to the realm of identifying proactive growth opportunities or to the domain of preemptive loss prevention. Data scientists can employ statistical algorithms and predictive modeling techniques to see whether new patterns emerge, or to ferret out alternate connections. One can also imagine using clustering and machine learning techniques to find previously unknown patterns that could inform marketing, operational-process, and product-placement decisions. None of this would have been possible with dissociated sources of data. Thus, the truly quantifiable benefit of leveraging a Data Aggregation platform is that it brings large, disconnected data sets into one holistic platform where traditional statistical modeling techniques can be run for the purpose of Data Discovery.
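To make that concrete, here is a minimal sketch of the kind of clustering a data scientist might run once formerly siloed sources are joined. The file names, column names, and number of clusters are hypothetical, and pandas/scikit-learn are simply one common tooling choice, not a prescription.

```python
# A minimal sketch of pattern discovery on aggregated data.
# File names, column names, and the choice of k are hypothetical;
# the point is only to show clustering applied to previously
# disconnected sources once they are joined on a common key.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Two formerly siloed sources, joined on a shared customer key.
sales = pd.read_csv("sales_transactions.csv")        # hypothetical extract
support = pd.read_csv("support_interactions.csv")    # hypothetical extract
combined = sales.merge(support, on="customer_id", how="inner")

# Scale the numeric features and look for segments that neither
# source would reveal on its own.
features = combined[["total_spend", "order_count", "ticket_count", "avg_resolution_hours"]]
scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
combined["segment"] = kmeans.fit_predict(scaled)

# Inspect the segments for actionable patterns (e.g., high spenders
# with unusually slow support resolution).
print(combined.groupby("segment")[features.columns].mean())
```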
Now you might ask how the concept of Data Aggregation and the need to create a platform for disparate data sources are connected to Data Discovery and Big Data, and what options, if any, this opens up. Traditional infrastructure has always been a real limiting factor for any enterprise that wanted to create a hosting Platform for large data sets. Most enterprises were therefore more inclined to host Data Warehousing Platforms that offered KPIs, which were projections of known trends, than to invest in Platforms that explored the somewhat unreliable potential of statistical modeling. Data scientists worked around this by using “sample data sets”, knowing full well that the results could be skewed or the patterns discovered could be choppy. This is where Hadoop-based Big Data techniques come in handy.
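As a quick, purely illustrative example (synthetic data, hypothetical proportions), sampling can starve an analysis of the rare cases that matter most:

```python
# A small illustration (with synthetic data) of why modeling on a
# sample can miss rare patterns that the full data set would expose.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000_000
# Assume 0.2% of records belong to a rare-but-important behavior.
full = pd.DataFrame({
    "customer_id": np.arange(n),
    "rare_pattern": rng.random(n) < 0.002,
})

# The "sample data set" workaround: keep only 1% of the records.
sample = full.sample(frac=0.01, random_state=0)

print("Rare cases in full data: ", int(full["rare_pattern"].sum()))
print("Rare cases in 1% sample:", int(sample["rare_pattern"].sum()))
# The sample retains only a handful of rare cases, so any pattern
# built around them will be noisy ("choppy") at best.
```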
Now your IT departments can offer cost-effective, Hadoop-based Platforms for hosting the large data sets needed for Data Aggregation. Linearly scalable, commodity-hardware-based Data Aggregation Platforms make it possible for data scientists to execute their predictive models and algorithms against the full data sets, instead of scaled-down subsets. Most of all, Hadoop-based Data Aggregation improves the reliability of the business outcomes generated by these models and algorithms. The efficacy of the outcomes and the applicability of the predictions ultimately increase the rate of adoption of data science predictive modeling techniques. The bottom line: businesses gain a competitive edge from the process of Data Discovery deployed against the Data Aggregation constructs.
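For illustration, here is a minimal PySpark sketch of what fitting a model against the full aggregated data set (rather than a scaled-down sample) might look like. The HDFS path, column names, and the churn label are assumptions made up for this sketch, not a description of any particular deployment.

```python
# A minimal PySpark sketch, assuming the aggregated data already
# lands on a Hadoop-backed store (path and column names are hypothetical).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("data-discovery-sketch").getOrCreate()

# Read the aggregated data directly from the cluster.
df = spark.read.parquet("hdfs:///data/aggregated/customer_events")

# Assemble features and fit a simple churn-style classifier on all rows,
# not on a sample.
assembler = VectorAssembler(
    inputCols=["total_spend", "order_count", "ticket_count"],
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="churned")
model = lr.fit(assembler.transform(df))

print("Model coefficients:", model.coefficients)
spark.stop()
```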
I look forward to hearing from you to learn what is working for you and whether you have been able to realize the benefits that these technologies seem to tout.
Surekha -