Blog Archive

Friday, December 8, 2017

HADOOP 2.X






Apache Hadoop 2.7.0 - Components

  • The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
  • The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
  • It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

• Hadoop Common: the common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
• Hadoop YARN: a resource-management platform responsible for managing computing resources in clusters and using them to schedule users' applications.
• Hadoop MapReduce: a YARN-based system for parallel processing of large data sets (a programming model for large-scale data processing); see the sketch after this list.
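
To make HDFS and MapReduce concrete, here is a minimal sketch of copying a local file into HDFS and running the word-count example job that ships with the Hadoop 2.7.0 binary distribution (the file name words.txt and the input/output directory names are assumptions):

$ hdfs dfs -mkdir -p input                       # create an input directory in the user's HDFS home
$ hdfs dfs -put words.txt input                  # copy a local file (assumed name) into HDFS
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar \
    wordcount input output                       # run the bundled MapReduce word-count job
$ hdfs dfs -cat output/part-r-00000              # print the per-word counts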

There are five pillars to Hadoop that make it enterprise ready:

1. Data Management: Apache Hadoop YARN, HDFS
2. Data Access: Apache Hive, Apache Pig, MapReduce, Apache Spark, Apache Storm, Apache HBase, Apache Tez, Apache Kafka, Apache HCatalog, Apache Slider, Apache Solr, Apache Mahout, Apache Accumulo
3. Data Governance and Integration: Apache Falcon, Apache Flume, Apache Sqoop
4. Security: Apache Knox, Apache Ranger
5. Operations: Apache Ambari, Apache Oozie, Apache ZooKeeper

                                     Providers

Commercial Vendors:

  • Cloudera
  • Hortonworks
  • IBM InfoSphere BigInsights
  • MapR Technologies
  • Think Big Analytics
  • Amazon Web Services (Cloud based)
  • Microsoft Azure (Cloud based)
Open Source Vendors:

  • Apache
  • Apache Bigtop
  • Cascading
  • Cloudspace
  • Datameer
  • Data Mine Lab
  • Data Salt
  • DataStax
  • DataTorrent
  • Debian
  • Emblocsoft
  • HStreaming
  • Impetus
  • Pentaho
  • Talend
  • Jaspersoft
  • Karmasphere
  • Apache Mahout
  • Nutch
  • NGData
  • Pervasive Software
  • Pivotal
  • Sematext International
  • Syncsort
  • Tresata
  • WANdisco
  • Etc.



Thursday, December 7, 2017

Big Data understanding

                                               Building Blocks for Big Data Project

 -        Working knowledge of Hadoop & the Hadoop ecosystem
o   Be comfortable with basic Linux commands
o   Data warehousing knowledge and SQL commands
o   Programming concepts like Java, Python, R, Perl, etc.
 -        Understanding of data structures & business objectives
 -        Data visualization tools like Tableau, QlikView, JasperReports, etc.
 -        Be comfortable with analytics tools like R, Python, Spark, SAS, etc.
 -        Be comfortable with statistics (exploratory) and machine learning algorithms




What disrupted the Data Center?




Every industry is graced with more data…

• Richer transactional data from a portfolio of dozens or hundreds of business applications
• Usage and behavior data from web and mobile apps
• Social media data
• Sensor and event data from IoT devices
• Data economy – firms buying and selling data
• Derived data from analytics

What is the challenge?

• The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
• The main challenge lies in identifying the value, the relevant information within this data, and then transforming and extracting that data for further analysis.


What is Big Data?

• Is it a technology?
• Is it a solution?
• Is it a problem?
• Is it a platform?
• Is it a statement/phrase?

Big Data – 4 V’s
  • According to IDC (International Data Corporation), the digital universe was about 4.4 zettabytes in 2013 and is forecast to grow tenfold by 2020, to 44 zettabytes
  • A zettabyte is 10^21 bytes, or a thousand exabytes, one million petabytes, or one billion terabytes
  • The NYSE generates about 4-5 terabytes of data per day
  • Facebook hosts more than 240 billion photos, growing at 7 petabytes per month
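
A quick sanity check of that unit conversion, as a minimal sketch using bc (which handles integers of this size):

$ echo "10^21 / 10^12" | bc    # bytes in a zettabyte divided by bytes in a terabyte
1000000000                     # i.e., one billion terabytes per zettabyte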


IBM’s Definition of Big Data


Big data – Myths

·        It's big: you need to have lots of data to talk about big data
·        You need to apply it right away
·        The more granular the data, the better
·        Big Data is good data
·        Big Data means that analysts become all-important
·        Big Data gives you concrete answers
·        Big Data predicts the future
·        Big Data is a magical solution
·        Big Data can create self-learning algorithms
·        Big Data is only for big corporations
·        We have so much data, we don't need to worry about every little data flaw
·        Big Data technology will eliminate the need for data integration
·        It's pointless using a data warehouse for advanced analytics
·        Data lakes will replace the data warehouse
·        Hadoop is the holy grail of big data
·        Machine learning overcomes human bias


Big Data- Scenarios





What is Hadoop?
  •     Hadoop is an open-source data management framework with scale-out storage & distributed processing

Hadoop is not a database. Hadoop (from the Apache Software Foundation) is a Java-based software framework for scalable, decentralized applications that supports easy handling and analysis of vast data volumes.





Existing Data Architecture



Limitations of Existing Data Analytics Architecture




An Emerging Data Architecture



Emerging Data Analytics Architecture



DBMS vs. HADOOP







Why Hadoop?


·        Supports the use of inexpensive, commodity hardware
                - No RAID needed; the servers also need not be the latest and greatest hardware.
·        Provides simple, massive parallelism
·        Provides resilience by replicating data and eliminating tape backups (see the sketch after this list)
·        Provides locality of execution, as it knows where the data is
·        The software is free
·        High-quality support is available at modest cost
·        Certification is available
·        Easy to support when using a GUI such as Cloudera Manager or Ambari
·        Add-on tools are available at relatively low cost, or in some cases at no cost
·        An evolving technology with a high degree of interest around the world
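
HDFS replication is what makes RAID and tape backups unnecessary on the data nodes. A minimal sketch of inspecting and changing the replication factor of a file (the path is an assumption):

$ hdfs dfs -stat %r /user/data/sample.txt      # show the file's current replication factor
$ hdfs dfs -setrep -w 3 /user/data/sample.txt  # set replication to 3 and wait until it completes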


Hadoop Ecosystem





Analytics mapping – Hadoop 1.x



Analytics mapping – Hadoop 2.x





Typical Big Data Project – Role of Hadoop Ecosystem




Opportunity and Market Outlook



Who is using Hadoop?




Which companies have implemented Hadoop?

http://wiki.apache.org/hadoop/poweredBy



Next post will be on Hadoop 2.x...

Information used from Analytic Lab.





HBase Installation Guide

Requirements:

JRE installed on the system
Hadoop should be installed
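
Before installing HBase, these prerequisites can be verified with a quick check (assuming java and hadoop are on the PATH):

$ java -version      # confirm a JRE/JDK is available
$ hadoop version     # confirm Hadoop is installed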
Download the latest HBase release from the Apache HBase downloads page (http://www.apache.org/dyn/closer.cgi/hbase/) and unpack it:

$ cd /usr/local
$ tar -zxvf hbase-x.y.z.tar.gz

We are ready to go, but it is recommended to set up the data directory before starting HBase.
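A minimal sketch of pointing HBase at a local data directory and starting it (the install path /usr/local/hbase-x.y.z and the data path are assumptions; in a fully distributed setup hbase.rootdir would point at HDFS instead):

$ cd /usr/local/hbase-x.y.z
$ cat > conf/hbase-site.xml <<'EOF'
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <!-- assumed local data directory -->
    <value>file:///usr/local/hbase-x.y.z/data</value>
  </property>
</configuration>
EOF
$ bin/start-hbase.sh

Then use the interactive shell (bin/hbase shell) to check the status of HBase: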

hbase(main):001:0> status
1 servers, 0 dead, 2.0000 average load

We can separate the requirements into two categories: servers and networking. We will look at the server hardware first and then at the requirements for the networking setup.

In HBase and Hadoop there are two kinds of machines:
1. Master machines
2. Slave machines


As far as CPU is concerned, you should spec the master and slave machines the same.

Node type    Recommendation
Master       Dual quad-core CPUs, 2.0-2.5 GHz
Slave        Dual quad-core CPUs, 2.0-2.5 GHz

An exemplary setup could be as follows: for the master machine, running the NameNode, SecondaryNameNode, JobTracker, and HBase Master, 24 GB of memory; and for the slaves, running the DataNodes, TaskTrackers, and HBase RegionServers, 24 GB or more.

Node type    Recommendation
Master       24 GB
Slave        24 GB (and up)

The disk capacity is usually 1 TB per disk, but you can also use 2 TB drives if necessary. Using six to 12 high-density servers with 1 TB to 2 TB drives is good, as you get a lot of storage capacity and the JBOD setup with enough cores can saturate the disk bandwidth nicely.

Node type    Recommendation
Master       4 × 1 TB SATA, RAID 0+1 (2 TB usable)
Slave        6 × 1 TB SATA, JBOD

Windows

HBase running on Windows has not been tested to a great extent. Running a production
install of HBase on top of Windows is not recommended.
If you are running HBase on Windows, you must install Cygwin to have a Unix-like
environment for the shell scripts. The full details are explained in the Windows Installation
guide on the HBase website.

Once you have extracted all the files, you can make yourself familiar with what is in
the project’s directory. The content may look like this:
$ ls -lr
-rw-r--r-- 1 larsgeorge staff 192809 Feb 15 01:54 CHANGES.txt
-rw-r--r-- 1 larsgeorge staff 11358 Feb 9 01:23 LICENSE.txt
-rw-r--r-- 1 larsgeorge staff 293 Feb 9 01:23 NOTICE.txt
-rw-r--r-- 1 larsgeorge staff 1358 Feb 9 01:23 README.txt
drwxr-xr-x 23 larsgeorge staff 782 Feb 9 01:23 bin
drwxr-xr-x 7 larsgeorge staff 238 Feb 9 01:23 conf
drwxr-xr-x 64 larsgeorge staff 2176 Feb 15 01:56 docs
-rwxr-xr-x 1 larsgeorge staff 905762 Feb 15 01:56 hbase-0.90.1-tests.jar
-rwxr-xr-x 1 larsgeorge staff 2242043 Feb 15 01:56 hbase-0.90.1.jar
drwxr-xr-x 5 larsgeorge staff 170 Feb 15 01:55 hbase-webapps
drwxr-xr-x 32 larsgeorge staff 1088 Mar 3 12:07 lib
-rw-r--r-- 1 larsgeorge staff 29669 Feb 15 01:28


Once the HBase fully distributed setup is done, we need to deploy the HBase configuration to the cluster.
There are several ways to deploy the setup to a cluster:
1. Scripts (a minimal sketch follows this list)
2. Apache Whirr, a utility to quickly deploy on a cluster (reduces cost, as it runs on cloud infrastructure)
3. Puppet and Chef, similar to Whirr
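
For the script-based approach, a minimal sketch that pushes the configuration directory to every region server, assuming passwordless SSH and that the slave host names are listed in conf/regionservers (the install path is an assumption):

$ cd /usr/local/hbase-x.y.z
$ for host in $(cat conf/regionservers); do
>   rsync -a conf/ "$host":/usr/local/hbase-x.y.z/conf/   # copy the configuration to each node
> done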