
Thursday, April 18, 2019

Data Modeling in Hadoop

A phrase we always hear in the context of Hadoop is Schema on Read.
This simply means that raw, unprocessed data can be loaded into Hadoop as-is; structure is imposed only when the data is read.
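
To make this concrete, here is a minimal sketch of loading raw data with the HDFS Java API, with no schema declared at load time. The local and HDFS paths are assumptions for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RawLoad {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Copy the raw log file into HDFS exactly as it is on disk;
            // no schema is declared or enforced at load time.
            fs.copyFromLocalFile(new Path("/tmp/clicks.log"),              // assumed local path
                                 new Path("/data/raw/clicks/clicks.log")); // assumed HDFS path
            // Structure is applied only when a reader (for example a Hive
            // external table or a MapReduce job) interprets the bytes.
            fs.close();
        }
    }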

Although being able to store all of our raw data is a powerful feature, there are still many factors that we should take into consideration before dumping our data into Hadoop. These considerations include:

Data storage format: the business generates many kinds of file formats, and Hadoop can support them all. Each file format has strengths that make it better suited to particular applications. Hadoop provides HDFS to store the data, but on top of HDFS there are additional data access tools available, such as HBase and Hive: HBase for additional data access functionality, and Hive for additional data management functionality.
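
As one illustration of a Hadoop-native storage format, the sketch below writes a SequenceFile, one of Hadoop's built-in binary key/value formats. The file path and the key/value types are assumptions chosen for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path file = new Path("/data/raw/events.seq"); // assumed HDFS path
            // Write Text keys and IntWritable values in SequenceFile format.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(file),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(IntWritable.class))) {
                writer.append(new Text("page-view"), new IntWritable(1));
            }
        }
    }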

Multitenancy: It’s common for clusters to host multiple users, groups, and application types.
Supporting multitenant clusters involves a number of important considerations when you are planning how data will be stored and managed.
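
One concrete way this shows up is in carving out per-tenant directories with their own owners and permissions. The tenant names and path layout below are assumptions; changing ownership also typically requires HDFS superuser privileges.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class TenantLayout {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            for (String tenant : new String[] {"marketing", "fraud", "etl"}) {
                Path home = new Path("/data/" + tenant);         // assumed layout
                fs.mkdirs(home, new FsPermission((short) 0750)); // owner + group only
                fs.setOwner(home, tenant, tenant);               // needs superuser rights
            }
            fs.close();
        }
    }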

Schema design: Despite the schema-less nature of Hadoop, there are still important considerations to take into account around the structure of the data it stores. This includes the directory structures for data loaded into HDFS as well as the output of data processing and analysis, and also the schemas of objects stored in systems such as HBase and Hive.
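
For the directory-structure side of schema design, one common convention, assumed here rather than mandated by Hadoop, is Hive-style date partitioning under a per-dataset root:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.time.LocalDate;

    public class PartitionedLayout {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            LocalDate d = LocalDate.now();
            // Hive-style partition directories: /data/<dataset>/year=/month=/day=
            Path partition = new Path(String.format(
                "/data/clicks/year=%d/month=%02d/day=%02d",
                d.getYear(), d.getMonthValue(), d.getDayOfMonth()));
            fs.mkdirs(partition);
            fs.close();
        }
    }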

Metadata management: As with any data management system, metadata related to the stored data is often as important as the data itself. Understanding and making decisions related to metadata management are critical.
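
In practice, much of this metadata lives in the Hive metastore. The sketch below, with assumed database and table names, reads a table's storage location and input format through the metastore client API.

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Table;

    public class MetadataLookup {
        public static void main(String[] args) throws Exception {
            HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
            Table t = client.getTable("analytics", "clicks"); // assumed names
            // The storage descriptor records where the data lives and how to read it.
            System.out.println("Location: " + t.getSd().getLocation());
            System.out.println("Input format: " + t.getSd().getInputFormat());
            client.close();
        }
    }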

Security: This includes decisions around authentication, fine-grained access control, and encryption, both for data on the wire and data at rest.
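
On the authentication side, Hadoop clusters commonly use Kerberos. Below is a minimal client-side login sketch, assuming a kerberized cluster; the principal and keytab path are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLogin {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);
            // Principal and keytab are placeholders for this sketch.
            UserGroupInformation.loginUserFromKeytab(
                "etl@EXAMPLE.COM", "/etc/security/keytabs/etl.keytab");
            System.out.println("Logged in as " +
                UserGroupInformation.getLoginUser().getUserName());
        }
    }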