Friday, March 5, 2021

Data Lake Design Architecture

What is a Data Lake?


  • A Data Lake is a large repository that holds every kind of data in its raw format until someone in the organization needs to analyze it.
  • A Data Lake is not Hadoop. It draws on a range of tools, and Hadoop implements only a subset of the required functionality.
  • A Data Lake is not a database in the traditional sense of the word. A typical implementation uses various NoSQL and in-memory databases that can coexist with their relational counterparts.
  • A Data Lake cannot be implemented in isolation. It has to be implemented alongside a data warehouse (DW), because it complements various DW functions.
  • It stores large volumes of both unstructured and structured data. It also stores fast-moving streamed data from machine sensors and logs.
  • It advocates a Store-All approach to huge volumes of data.
  • It is optimized for data crunching in a high-latency batch mode; it is not geared for transaction processing.
  • It helps create data models that are flexible and can be revised without redesigning the database.
  • It can quickly enrich data, which supports enhancement, augmentation, classification, and standardization.
  • All of the data stored in the Data Lake can be used to get an all-inclusive view. This enables near-real-time, more precise predictive models that go beyond sampling, and it also aids in generating multi-dimensional models.
  • It is a data scientist's favorite hunting ground. They can access the data in its raw form at its most granular level, run any ad-hoc query, and build advanced models iteratively at any time. The classic data warehouse approach does not support this ability to condense the time between data intake and insight generation.
  • A key attribute of a Data Lake is that data is not classified when it is stored. As a result, data preparation, cleansing, and transformation are not required at ingestion and are deferred until the data is read; in a Data Warehouse, these tasks generally take the lion's share of the time.
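
The Store-All idea and the point about not classifying data at storage time can be illustrated with a small sketch. This is not a real Data Lake implementation: the function names `ingest_raw` and `read_with_schema`, the line-per-record file layout, and the use of a local temporary directory in place of distributed storage are all assumptions made for illustration. It only shows that raw records can be stored untouched and that a structure can be chosen, and changed, at read time.

```python
import json
import tempfile
from pathlib import Path


def ingest_raw(lake_dir: Path, source: str, payload: str) -> Path:
    """Append one record exactly as received: no parsing, no schema (hypothetical helper)."""
    target = lake_dir / f"{source}.raw"
    with target.open("a", encoding="utf-8") as fh:
        fh.write(payload.rstrip("\n") + "\n")
    return target


def read_with_schema(raw_file: Path, schema: dict) -> list:
    """Apply a schema at read time: keep the requested fields and cast each one."""
    rows = []
    with raw_file.open(encoding="utf-8") as fh:
        for line in fh:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # cleansing happens here, when reading, not when storing
            if not isinstance(record, dict):
                continue
            try:
                rows.append({name: cast(record[name]) for name, cast in schema.items()})
            except (KeyError, ValueError, TypeError):
                continue  # record does not fit this particular view of the data
    return rows


with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)  # stand-in for the lake's storage layer (assumption)
    ingest_raw(lake, "sensors", '{"id": "a1", "temp": "21.5", "unit": "C"}')
    ingest_raw(lake, "sensors", "corrupted line from a device")
    raw = ingest_raw(lake, "sensors", '{"id": "b2", "temp": "19", "fw": "2.0"}')

    # Two different "models" over the same raw file, with no redesign of storage.
    temps = read_with_schema(raw, {"id": str, "temp": float})
    ids_only = read_with_schema(raw, {"id": str})

print(temps)
print(ids_only)
```

Note that the corrupted line is still stored; it is only skipped when a reader applies a schema that it does not fit. A different analysis could read the same file with a different schema without any change to what was ingested.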