Wednesday, May 27, 2015
There has been a lot of talk lately about the concept of the data lake, variously known as a data refinery, data factory, etc. I find it interesting that we now hear logical architectural terms that speak to the concepts and purpose of big data technologies such as Hadoop/HDFS and Apache distributed database technologies such as HBase and Cassandra.
This may be indicative of a shift. What I am not sure of is what it means. Has this suite of open source technologies reached a level of maturity? Or does it point to the fact that these technologies have practical applications that solve enterprise-scale problems? Or does it show that enterprises have realized they can no longer deal with "structured data" alone, and that the vast majority of information lies in the space of "unstructured content", leaving them no choice but to venture into the realm of big data technologies? Not really sure!
The fact remains: when the big-name software vendors start getting into the business of marketing big data technologies and start publishing white papers with cool-sounding names, then there is something going on! I look at the concepts of data lake, data refinery, data factory, etc. as synonymous terms for what in the information science realm we call data aggregation. I could be totally off base here and would love to have more of a conceptual/architectural debate on this topic.
I would love to hear from others who are actively using these technologies about how they are applying these concepts.