Data Aggregation & Data Discovery - Part I

There has been a lot of talk lately about the concept of the data lake, variously known as the data refinery, data factory, etc. I find it interesting that we now hear logical architectural terms that speak to the concepts and the purpose of big data technologies such as Hadoop/HDFS and Apache distributed database technologies such as HBase and Cassandra.

This may be indicative of a shift. What I am not sure of is what it means. Has this suite of open source technologies reached a level of maturity? Or could this point to the fact that these technologies have practical applications that solve enterprise scale problems? Or does it show that enterprises have realized that they can no longer deal with just "structured data", and that the vast majority of information lies in the space of "unstructured content", leaving them no choice but to venture into the realm of big data technologies? Not really sure!

The fact remains, when the big name software vendors start getting into the business of marketing big data technologies and start publishing white papers with cool sounding names, then there is something going on!! I look at the concepts of data lake, data refinery, data factory etc. as synonymous terms for what in the information science realm we call data aggregation! I could be totally off base here and would love to have more of a conceptual / architectural debate on this topic.

I would love to hear from others who are actively leveraging these technologies about how they are applying these concepts and technologies.

surekha -


  1. Nice post, and I would agree that the Hadoop technology stack is coming of age to solve enterprise scale challenges. Scale was never a challenge for Hadoop in terms of volume, but the maturity to handle both batch and real-time streaming data now elevates the game. However, the concept of a data lake is much wider, and its implementation can span multiple technology platforms. I recently came across an Enterprise that chose to implement a large part of the data lake on Hadoop technologies but proposes to drive the consumption through Teradata. Here, the data lake was not only an 'analytical store' but a mission critical platform driving regular Business processes and decisions.

    Populating the data lake with all data across the Enterprise, whether structured or unstructured, in my opinion, helps drive multiple use cases as well as help Business users discover their data assets. This is very cool. However, a word of caution - data has a life cycle, and anything that is no longer useful for any Business process or analytical purpose should be duly retired from the data lake as per the organization's data retention policies, to prevent the 'data lake' from becoming a 'data swamp' where users struggle to find what is needed.

    I would also like to hear further from others.

  2. Also, adding to my previous comment, a robust metadata solution is needed to encourage and drive wide consumption across an Enterprise. The choice of a physical store for the data lake, as well as for the associated metadata, could be a critical factor for the downstream use cases.

