What is a Data Lake?
- A Data Lake is a large repository that holds every kind of data in its raw format until someone in the organization needs it for analysis.
- A Data Lake is not Hadoop. It draws on a range of tools, and Hadoop implements only a subset of the required functionality.
- A Data Lake is not a database in the traditional sense of the word. A typical implementation uses various NoSQL and in-memory databases that can co-exist with their relational counterparts.
- A Data Lake cannot be implemented in isolation. It has to be implemented alongside a data warehouse (DW), because it complements several of the DW's functions.
- It stores large volumes of both unstructured and structured data. It also stores fast-moving streamed data from machine sensors and logs.
- It advocates a Store-All approach to huge volumes of data.
- It is optimized for data crunching in a high-latency batch mode; it is not geared toward transaction processing.
- It helps in creating data models that are flexible and can be revised without a database redesign.
- It can quickly perform data enrichment, which covers enhancement, augmentation, classification, and standardization of the data.
- All of the data stored in the Data Lake can be used to get an all-inclusive view. This enables near-real-time, more precise predictive models that go beyond sampling, and it also aids in generating multi-dimensional models.
- It is a data scientist's favorite hunting ground. Data scientists get access to the data in its raw form at its most granular level, so they can run any ad-hoc query and build advanced models iteratively, at any time. The classic data warehouse approach does not support this ability to condense the time between data intake and insight generation.
- A key attribute of a Data Lake is that data is not classified when it is stored. As a result, the data preparation, cleansing, and transformation tasks are not needed at load time and are deferred until the data is actually used; these tasks generally take the lion's share of the time in a Data Warehouse.
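
The idea of storing data unclassified and applying structure only when it is used can be illustrated with a small sketch. This is a minimal illustration and not a real Data Lake implementation: the sample records, the field names (`sensor`, `temp_c`, `user`, `action`) and the helper functions `read_temperatures` and `read_logins` are all hypothetical, and an in-memory list stands in for the raw storage layer.

```python
import json

# Hypothetical raw events, kept exactly as they arrived (no upfront schema).
RAW_EVENTS = [
    '{"sensor": "s1", "temp_c": 21.5, "ts": "2024-01-01T00:00:00"}',
    '{"sensor": "s2", "temp_c": "n/a", "ts": "2024-01-01T00:00:05"}',
    '{"user": "alice", "action": "login"}',
]


def read_temperatures(raw_lines):
    """Apply a schema at read time: keep only records with a numeric temp_c."""
    readings = []
    for line in raw_lines:
        record = json.loads(line)
        value = record.get("temp_c")
        # Cleansing happens here, at query time, not at ingestion.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            readings.append({"sensor": record.get("sensor"), "temp_c": float(value)})
    return readings


def read_logins(raw_lines):
    """A second, independent view over the same raw data."""
    records = (json.loads(line) for line in raw_lines)
    return [r["user"] for r in records if r.get("action") == "login"]


temperatures = read_temperatures(RAW_EVENTS)
logins = read_logins(RAW_EVENTS)
print(temperatures)
print(logins)
```

Both functions read the same untouched raw records, so a new question only needs a new reader function and not a redesign of the stored data.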