
Thursday, April 18, 2019

Data Modeling in Hadoop

A phrase we always hear in the context of Hadoop is "schema on read."
This simply means that raw, unprocessed data can be loaded into Hadoop as-is, with structure applied only when the data is read.

Although being able to store all of our raw data is a powerful feature, there are still many factors that we should take into consideration before dumping our data into Hadoop. These considerations include:

Data storage format: Businesses generate many kinds of file formats, and Hadoop can support them all. Each file format has strengths that make it better suited to particular applications. Hadoop provides HDFS to store the data, but on top of HDFS there are additional data access tools such as HBase and Hive: HBase adds data access functionality, while Hive adds data management functionality (see the Hive sketch after this list).

Multitenancy: It’s common for clusters to host multiple users, groups, and application types.
Supporting multitenant clusters involves a number of important considerations when you are planning how data will be stored and managed.

Schema design: Despite the schema-less nature of Hadoop, there are still important considerations
to take into account around the structure of data stored in Hadoop. This includes directory structures for data loaded into HDFS as well as the output of data processing and analysis. This also includes the schemas of objects stored in systems such as HBase and Hive.

Metadata management: As with any data management system, metadata related to the stored data is often as important as the data itself. Understanding and making decisions related to metadata management are critical.

Security: This includes decisions around authentication, fine-grained access control, and encryption, both for data on the wire and data at rest.
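
To make the storage format and schema design points above more concrete, here is a minimal, hypothetical Hive sketch. The table name, columns, and partition column are illustrative only, and it assumes Hive is already configured on the cluster. Storing the table as Parquet (a columnar format) is a common choice for analytical workloads, and the partition column maps directly to the HDFS directory layout under the table path.

-- Hypothetical Hive table: names and columns are illustrative only.
-- STORED AS PARQUET picks a columnar file format suited to analytical scans;
-- PARTITIONED BY controls the HDFS directory structure under the table path.
CREATE TABLE web_events (
    md5        STRING,
    url        STRING,
    event_time TIMESTAMP,
    ip_address STRING
)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET;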

Data Extraction in Big Data (Hadoop)

Extracting data from various sources and ingesting it into the HDFS environment is a challenging task.
There are various techniques and applications available to ingest large-scale data into HDFS.
In this blog, we will see how to import data from MySQL into Hadoop.

1. Importing data from MySQL into HDFS using Sqoop

Sqoop is an Apache project that is part of the Hadoop ecosystem. It is built on top of MapReduce and takes advantage of its parallelism and fault tolerance. Rather than moving data between clusters, Sqoop was designed to move data into and out of relational databases, using a JDBC driver to connect.

Note: The MySQL JDBC driver (Connector/J) can be downloaded from http://dev.mysql.com/downloads/connector/j/
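
Before running an import, it can help to confirm that Sqoop can actually reach MySQL through the JDBC driver. The sketch below is a hypothetical connectivity check; the host, port, database name, and credentials are placeholders for your own environment.

# Hypothetical connectivity check: lists the tables Sqoop can see in the database.
sqoop list-tables \
  --connect jdbc:mysql://<HOST>:<PORT>/weblogs \
  --username bigdata --password bigdata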

Steps to follow:

Complete the following steps to move data from a MySQL table to an HDFS file.

Step 1: Create a new database in MySQL:

CREATE DATABASE weblogs;

Step 2: Create and load the weblogs table:

USE weblogs;
CREATE TABLE weblogs (
    md5        VARCHAR(32),
    url        VARCHAR(64),
    date       DATE,
    time       TIME,
    ip_address VARCHAR(15)
);
LOAD DATA INFILE '/path/weblog.txt' INTO TABLE weblogs
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\r\n';
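
If the MySQL server's secure_file_priv setting prevents server-side file loads, a client-side load is a common workaround. The variant below is a hedged sketch, assuming the local_infile option is enabled on both the client and the server; the file path is the same placeholder used above.

-- Hypothetical variant: load the file from the client machine instead of the server.
LOAD DATA LOCAL INFILE '/path/weblog.txt' INTO TABLE weblogs
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\r\n';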

Step 3: Select a count of rows from the weblogs table:

mysql> select count(*) from weblogs;
The output would be:
+----------+
| count(*) |
+----------+
|     2500 |
+----------+
1 row in set (0.01 sec)

Step 4: Import the data from MySQL to HDFS:

sqoop import -m 1 \
  --connect jdbc:mysql://<HOST>:<PORT>/weblogs \
  --username bigdata --password bigdata \
  --table weblogs \
  --target-dir /data/weblogs/import

The output would be:

INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-jon/compile/f57ad8b208643698f3d01954eedb2e4d/weblogs.jar
WARN manager.MySQLManager: It looks like you are importing from mysql.
WARN manager.MySQLManager: This transfer can be faster! Use the --direct
WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
...
INFO mapred.JobClient: Map input records=2500
INFO mapred.JobClient: Spilled Records=0
INFO mapred.JobClient: Total committed heap usage (bytes)=8500003435
INFO mapred.JobClient: Map output records=2500
INFO mapred.JobClient: SPLIT_RAW_BYTES=83
INFO mapreduce.ImportJobBase: Transferred 200.2451 KB in 10.7619 seconds (17.8206 KB/sec)
INFO mapreduce.ImportJobBase: Retrieved 2500 records.
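
Once the job finishes, the imported records land as files under the target directory. The commands below are a hypothetical way to verify the result, assuming the default text output and the /data/weblogs/import path used above.

# List the files Sqoop wrote to HDFS (part-m-* files plus a _SUCCESS marker).
hadoop fs -ls /data/weblogs/import

# Peek at the first few imported rows.
hadoop fs -cat /data/weblogs/import/part-m-00000 | head

The WARN lines in the output above also hint that adding the --direct option lets Sqoop use MySQL's mysqldump fast path for quicker transfers.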

How it works internally

Sqoop loads the JDBC driver specified in the --connect statement from $SQOOP_HOME/lib, where $SQOOP_HOME is the full path to the location where Sqoop is installed. The --username and --password options are used to authenticate the user issuing the command against the MySQL instance. The mysql.user table must have an entry for the --username value and the host of each node in the Hadoop cluster; otherwise Sqoop will throw an exception indicating that the host is not allowed to connect to the MySQL server.
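
If that exception appears, the fix is on the MySQL side. The statements below are a hypothetical sketch (user name, host pattern, and password are placeholders): they create the account Sqoop authenticates as and allow it to connect from the cluster hosts.

-- Hypothetical grant: allow the bigdata user to connect from any host ('%');
-- in practice, restrict the host pattern to the cluster's nodes or subnet.
CREATE USER 'bigdata'@'%' IDENTIFIED BY 'bigdata';
GRANT ALL PRIVILEGES ON weblogs.* TO 'bigdata'@'%';
FLUSH PRIVILEGES;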