Thursday, April 18, 2019

Data extraction in Big Data (Hadoop)

Extracting data from various sources and ingesting it into the HDFS environment is a challenging task.
There are various techniques and applications available to ingest large-scale data into HDFS.
In this blog, we will see how to import data from MySQL into Hadoop.

1- Importing data from MySQL into HDFS using Sqoop

Sqoop is an Apache project that is part of the Hadoop ecosystem. Sqoop is built on top of MapReduce and takes advantage of its parallelism and fault tolerance. Rather than moving data between
clusters, Sqoop was designed to move data into and out of relational databases, using a JDBC driver to connect.

Note: The MySQL JDBC driver (Connector/J) can be downloaded from http://dev.mysql.com/downloads/connector/j/
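
Once downloaded, the connector JAR has to be placed where Sqoop can load it. A minimal sketch, assuming Sqoop is installed under $SQOOP_HOME and the downloaded file is named as below (the exact JAR name depends on the Connector/J version you pick):

# Copy the MySQL Connector/J JAR into Sqoop's lib directory
# (file name and paths are illustrative; adjust for your environment)
cp ~/Downloads/mysql-connector-java-5.1.46-bin.jar $SQOOP_HOME/lib/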

Steps to follow:

Complete the following steps to move data from a MySQL table to an HDFS file:

Step 1: Create a new database in MySQL:
                   CREATE DATABASE weblogs;
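
Before going further, it can be worth confirming that Sqoop can reach the MySQL server over JDBC. A quick sketch using Sqoop's list-databases tool (host and port are placeholders; the bigdata credentials are the ones used in Step 4 below):

# List the databases visible to the bigdata user; weblogs should appear in the output
sqoop list-databases \
    --connect jdbc:mysql://<HOST>:<PORT>/ \
    --username bigdata --password bigdata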

Step 2: Create and load the weblogs table:

USE weblogs;
CREATE TABLE weblogs (
    md5          VARCHAR(32),
    url          VARCHAR(64),
    date         DATE,
    time         TIME,
    ip_address   VARCHAR(15)
);
LOAD DATA INFILE '/path/weblog.txt' INTO TABLE weblogs
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\r\n';
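
Note that LOAD DATA INFILE reads the file from the MySQL server's file system and may be blocked by the secure_file_priv setting. If that happens, loading the file through the client with the LOCAL keyword is a common workaround (a sketch, assuming the file sits on the machine running the mysql client and local_infile is enabled on both client and server):

LOAD DATA LOCAL INFILE '/path/weblog.txt' INTO TABLE weblogs
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\r\n';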

Step 3: Select a count of rows from the weblogs table:

mysql> select count(*) from weblogs;
The output would be:
+----------+
| count(*) |
+----------+
|     2500 |
+----------+
1 row in set (0.01 sec)

Step 4: Import the data from MySQL to HDFS:

sqoop import -m 1 \
    --connect jdbc:mysql://<HOST>:<PORT>/weblogs \
    --username bigdata --password bigdata \
    --table weblogs \
    --target-dir /data/weblogs/import

The output would be:

INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-jon/compile/f57ad8b208643698f3d01954eedb2e4d/weblogs.jar
WARN manager.MySQLManager: It looks like you are importing from mysql.
WARN manager.MySQLManager: This transfer can be faster! Use the --direct
WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
...
INFO mapred.JobClient: Map input records=2500
INFO mapred.JobClient: Spilled Records=0
INFO mapred.JobClient: Total committed heap usage (bytes)=8500003435
INFO mapred.JobClient: Map output records=2500
INFO mapred.JobClient: SPLIT_RAW_BYTES=83
INFO mapreduce.ImportJobBase: Transferred 200.2451 KB in 10.7619 seconds (17.8206 KB/sec)
INFO mapreduce.ImportJobBase: Retrieved 2500 records.
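
The warnings above point at Sqoop's MySQL-specific fast path: with --direct, Sqoop uses MySQL's native dump tooling (mysqldump) instead of plain JDBC reads, which is usually faster but requires mysqldump to be available on the cluster nodes. A sketch of the same import using it, followed by a quick check of what landed in HDFS (by default Sqoop writes files named part-m-00000, part-m-00001, and so on):

sqoop import -m 1 --direct \
    --connect jdbc:mysql://<HOST>:<PORT>/weblogs \
    --username bigdata --password bigdata \
    --table weblogs \
    --target-dir /data/weblogs/import

# Verify the imported files and peek at the first few records
hdfs dfs -ls /data/weblogs/import
hdfs dfs -cat /data/weblogs/import/part-m-00000 | head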

How it works internally

Sqoop loads the JDBC driver specified in the --connect option from $SQOOP_HOME/lib,
where $SQOOP_HOME is the full path to the location where Sqoop is installed. The --username
and --password options are used to authenticate the user issuing the command against the
MySQL instance. The mysql.user table must have an entry for the --username value and
the host of each node in the Hadoop cluster; otherwise Sqoop will throw an exception indicating
that the host is not allowed to connect to the MySQL server.
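
Because each map task opens its own JDBC connection, the MySQL account must be reachable from every cluster host. A minimal sketch of granting that access (the '%' wildcard allows connections from any host; in practice you would restrict it to the cluster's hosts or subnet):

CREATE USER 'bigdata'@'%' IDENTIFIED BY 'bigdata';
GRANT ALL PRIVILEGES ON weblogs.* TO 'bigdata'@'%';
FLUSH PRIVILEGES;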