Blog Archive

Friday, December 8, 2017

HADOOP 2.X






Apache Hadoop 2.7.0 - Components

  • The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
  • The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
  • It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

• Hadoop Common: the common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
• Hadoop YARN: a resource-management platform responsible for managing computing resources in clusters and using them to schedule users' applications.
• Hadoop MapReduce: a YARN-based system for parallel processing of large data sets (a programming model for large-scale data processing); see the sketch after this list.
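
To make HDFS and MapReduce concrete, here is a minimal sketch of copying a local file into HDFS and running the word-count example job that ships with the Hadoop 2.7.0 binary distribution (the file name words.txt and the input/output directory names are assumptions):

$ hdfs dfs -mkdir -p input                       # create an input directory in the user's HDFS home
$ hdfs dfs -put words.txt input                  # copy a local file (assumed name) into HDFS
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar \
    wordcount input output                       # run the bundled MapReduce word-count job
$ hdfs dfs -cat output/part-r-00000              # print the per-word counts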

There are five pillars to Hadoop that make it enterprise ready:

1. Data Management: Apache Hadoop YARN, HDFS
2. Data Access: Apache Hive, Apache Pig, MapReduce, Apache Spark, Apache Storm, Apache HBase, Apache Tez, Apache Kafka, Apache HCatalog, Apache Slider, Apache Solr, Apache Mahout, Apache Accumulo
3. Data Governance and Integration: Apache Falcon, Apache Flume, Apache Sqoop
4. Security: Apache Knox, Apache Ranger
5. Operations: Apache Ambari, Apache Oozie, Apache ZooKeeper

                                     Providers

Commercial Vendors:

  • Cloudera
  • Hortonworks
  • IBM InfoSphere BigInsights
  • MapR Technologies
  • Think Big Analytics
  • Amazon Web Services (Cloud based)
  • Microsoft Azure (Cloud based)
Open Source Vendors:

  • Apache
  • Apache Bigtop
  • Cascading
  • Cloudspace
  • Datameer
  • Data Mine Lab
  • Data Salt
  • DataStax
  • DataTorrent
  • Debian
  • Emblocsoft
  • HStreaming
  • Impetus
  • Pentaho
  • Talend
  • Jaspersoft
  • Karmasphere
  • Apache Mahout
  • Nutch
  • NGData
  • Pervasive Software
  • Pivotal
  • Sematext International
  • Syncsort
  • Tresata
  • WANdisco
  • Etc.



Thursday, December 7, 2017

Big Data understanding

                                               Building Blocks for Big Data Project

 -        Working knowledge of Hadoop & the Hadoop ecosystem
o   Be comfortable with basic Linux commands
o   Data warehousing knowledge and SQL commands
o   Programming concepts like Java, Python, R, Perl, etc.
 -        Understanding of data structures & business objectives
 -        Data visualization tools like Tableau, QlikView, JasperReports, etc.
 -        Be comfortable with analytics tools like R, Python, Spark, SAS, etc.
 -        Be comfortable with statistics (exploratory) and machine learning algorithms




What disrupted the Data Center?




Every industry is graced with more data…

• Richer transactional data from a portfolio of dozens or hundreds of business applications
• Usage and behavior data from web and mobile apps
• Social media data
• Sensor and event data from IoT devices
• Data economy – firms buying and selling data
• Derived data from analytics

What is the challenge?

• The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
• The main challenge lies in identifying the value, the relevant information within this data, and then transforming and extracting that data for further analysis.


What is Big Data?

• Is it a technology?
• Is it a solution?
• Is it a problem?
• Is it a platform?
• Is it a statement/phrase?

Big Data – 4 V’s
  • According to IDC (International Data Corporation), the digital universe was about 4.4 zettabytes in 2013 and is forecast to grow tenfold by 2020, to 44 zettabytes
  • A zettabyte is 10^21 bytes, or a thousand exabytes, one million petabytes, or one billion terabytes
  • The NYSE generates about 4-5 terabytes of data per day
  • Facebook hosts more than 240 billion photos, growing at 7 petabytes per month
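
A quick sanity check of that unit conversion, as a minimal sketch using bc (which handles integers of this size):

$ echo "10^21 / 10^12" | bc    # bytes in a zettabyte divided by bytes in a terabyte
1000000000                     # i.e., one billion terabytes per zettabyte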


IBM’s Definition of Big Data


Big data – Myths

·        It's big: you need to have lots of data to talk about big data
·        You need to apply it right away
·        The more granular the data, the better
·        Big Data is good data
·        Big Data means that analysts become all-important
·        Big Data gives you concrete answers
·        Big Data predicts the future
·        Big Data is a magical solution
·        Big Data can create self-learning algorithms
·        Big Data is only for big corporations
·        We have so much data, we don't need to worry about every little data flaw
·        Big Data technology will eliminate the need for data integration
·        It's pointless using a data warehouse for advanced analytics
·        Data lakes will replace the data warehouse
·        Hadoop is the holy grail of big data
·        Machine learning overcomes human bias


Big Data- Scenarios





What is Hadoop?
  •     Hadoop is an open-source data management framework with scale-out storage & distributed processing

Hadoop is not a database. Hadoop (from the Apache Software Foundation) is a Java-based software framework for scalable, decentralized applications that supports easy handling and analysis of vast data volumes.





Existing Data Architecture



Limitations of Existing Data Analytics Architecture




An Emerging Data Architecture



Emerging Data Analytics Architecture



DBMS vs. HADOOP







Why Hadoop?


·        Supports the use of inexpensive, commodity hardware
                - No RAID needed; the servers also need not be the latest and greatest hardware.
·        Provides simple, massive parallelism
·        Provides resilience by replicating data and eliminating tape backups (see the sketch after this list)
·        Provides locality of execution, as it knows where the data is
·        The software is free
·        High-quality support is available at modest cost
·        Certification is available
·        Easy to support when using a GUI such as Cloudera Manager or Ambari
·        Add-on tools are available at relatively low cost, or in some cases at no cost
·        An evolving technology with a high degree of interest around the world
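
HDFS replication is what makes RAID and tape backups unnecessary on the data nodes. A minimal sketch of inspecting and changing the replication factor of a file (the path is an assumption):

$ hdfs dfs -stat %r /user/data/sample.txt      # show the file's current replication factor
$ hdfs dfs -setrep -w 3 /user/data/sample.txt  # set replication to 3 and wait until it completes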


Hadoop Ecosystem





Analytics mapping – Hadoop 1.x



Analytics mapping – Hadoop 2.x





Typical Big Data Project – Role of Hadoop Ecosystem




Opportunity and Market Outlook



Who is using Hadoop?




Which companies have implemented Hadoop?

http://wiki.apache.org/hadoop/poweredBy



Next post will be on Hadoop 2.x...

Information used from Analytic Lab.





HBase Installation Guide

Requirements:

JRE installed on the system
Hadoop should be installed
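
Before installing HBase, these prerequisites can be verified with a quick check (assuming java and hadoop are on the PATH):

$ java -version      # confirm a JRE/JDK is available
$ hadoop version     # confirm Hadoop is installed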
Download the latest HBase release from the Apache HBase downloads page (http://www.apache.org/dyn/closer.cgi/hbase/) and unpack it:

$ cd /usr/local
$ tar -zxvf hbase-x.y.z.tar.gz

We are ready to go, but it is recommended to set up the data directory before starting HBase.
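A minimal sketch of pointing HBase at a local data directory and starting it (the install path /usr/local/hbase-x.y.z and the data path are assumptions; in a fully distributed setup hbase.rootdir would point at HDFS instead):

$ cd /usr/local/hbase-x.y.z
$ cat > conf/hbase-site.xml <<'EOF'
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <!-- assumed local data directory -->
    <value>file:///usr/local/hbase-x.y.z/data</value>
  </property>
</configuration>
EOF
$ bin/start-hbase.sh

Then use the interactive shell (bin/hbase shell) to check the status of HBase: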

hbase(main):001:0> status
1 servers, 0 dead, 2.0000 average load

We can separate the requirements into two categories: servers and networking. We will look at the server hardware first and then at the requirements for the networking setup.

In HBase and Hadoop there are two kinds of machines:
1. Master machines
2. Slave machines


As far as CPU is concerned, you should spec the master and slave machines the same.

Node type    Recommendation
Master       Dual quad-core CPUs, 2.0-2.5 GHz
Slave        Dual quad-core CPUs, 2.0-2.5 GHz

An exemplary setup could be as follows: for the master machine, running the NameNode, SecondaryNameNode, JobTracker, and HBase Master, 24 GB of memory; and for the slaves, running the DataNodes, TaskTrackers, and HBase RegionServers, 24 GB or more.

Node type    Recommendation
Master       24 GB
Slave        24 GB (and up)

The disk capacity is usually 1 TB per disk, but you can also use 2 TB drives if necessary. Using six to 12 high-density servers with 1 TB to 2 TB drives is good, as you get a lot of storage capacity and the JBOD setup with enough cores can saturate the disk bandwidth nicely.

Node type    Recommendation
Master       4 × 1 TB SATA, RAID 0+1 (2 TB usable)
Slave        6 × 1 TB SATA, JBOD

Windows

HBase running on Windows has not been tested to a great extent. Running a production
install of HBase on top of Windows is not recommended.
If you are running HBase on Windows, you must install Cygwin to have a Unix-like
environment for the shell scripts. The full details are explained in the Windows Installation
guide on the HBase website.

Once you have extracted all the files, you can make yourself familiar with what is in
the project’s directory. The content may look like this:
$ ls -lr
-rw-r--r-- 1 larsgeorge staff 192809 Feb 15 01:54 CHANGES.txt
-rw-r--r-- 1 larsgeorge staff 11358 Feb 9 01:23 LICENSE.txt
-rw-r--r-- 1 larsgeorge staff 293 Feb 9 01:23 NOTICE.txt
-rw-r--r-- 1 larsgeorge staff 1358 Feb 9 01:23 README.txt
drwxr-xr-x 23 larsgeorge staff 782 Feb 9 01:23 bin
drwxr-xr-x 7 larsgeorge staff 238 Feb 9 01:23 conf
drwxr-xr-x 64 larsgeorge staff 2176 Feb 15 01:56 docs
-rwxr-xr-x 1 larsgeorge staff 905762 Feb 15 01:56 hbase-0.90.1-tests.jar
-rwxr-xr-x 1 larsgeorge staff 2242043 Feb 15 01:56 hbase-0.90.1.jar
drwxr-xr-x 5 larsgeorge staff 170 Feb 15 01:55 hbase-webapps
drwxr-xr-x 32 larsgeorge staff 1088 Mar 3 12:07 lib
-rw-r--r-- 1 larsgeorge staff 29669 Feb 15 01:28


Once the HBase fully distributed setup is done, we need to deploy the HBase configuration to the cluster.
There are several ways to deploy the setup to a cluster:
1. Scripts (a minimal sketch follows this list)
2. Apache Whirr, a utility to quickly deploy on a cluster (reduces cost, as it runs on cloud infrastructure)
3. Puppet and Chef, similar to Whirr
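
For the script-based approach, a minimal sketch that pushes the configuration directory to every region server, assuming passwordless SSH and that the slave host names are listed in conf/regionservers (the install path is an assumption):

$ cd /usr/local/hbase-x.y.z
$ for host in $(cat conf/regionservers); do
>   rsync -a conf/ "$host":/usr/local/hbase-x.y.z/conf/   # copy the configuration to each node
> done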