Hadoop Essentials


INTRODUCTION TO HADOOP – Data Analyst Package

Day 1

Introduction to Hadoop and its Ecosystem, Map Reduce and HDFS

  • Big Data, Factors constituting Big Data
  • Hadoop and Hadoop Ecosystem
  • Map Reduce – Concepts of Map, Reduce, Ordering, Shuffle and Concurrency
  • Hadoop Distributed File System (HDFS) Concepts and its Importance
  • Deep Dive in Map Reduce – Execution Framework, Partitioner, Combiner, Data Types, Key pairs
  • HDFS Deep Dive – Architecture, Data Replication, Name Node, Data Node, Data Flow
  • Parallel Copying with DISTCP, Hadoop Archives

Hands on Exercises

  • Installing Hadoop in Pseudo Distributed Mode, Understanding Important configuration files, their Properties and Daemon Threads
  • Accessing HDFS from Command Line
  • Map Reduce – Basic Exercises (a short Python sketch follows this list)
  • Understanding Hadoop Eco-system
  • Introduction to Sqoop, use cases and Installation
  • Introduction to Hive, use cases and Installation
  • Introduction to Pig, use cases and Installation
  • Introduction to Oozie, use cases and Installation
  • Introduction to Flume, use cases and Installation
  • Introduction to YARN, use cases and Installation
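
The Map Reduce basic exercise above can be sketched without Java by using Hadoop Streaming, which pipes input lines to any executable over stdin/stdout. The word-count mapper and reducer below are a minimal Python sketch under that assumption; the course's own exercises use the Java MapReduce API, and the file names here are illustrative only.

    #!/usr/bin/env python3
    # mapper.py -- word-count mapper for Hadoop Streaming (illustrative sketch).
    # Streaming feeds input lines on stdin and expects tab-separated
    # key/value pairs on stdout.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- sums counts per word. Hadoop sorts mapper output by key,
    # so identical keys arrive at the reducer contiguously.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

Both scripts would be submitted with the hadoop-streaming JAR that ships with Hadoop, passing them as the mapper and reducer along with the HDFS input and output paths.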

Day 2

Deep Dive in Map Reduce and YARN

  • How to develop a Map Reduce application and write unit tests
  • Best practices for developing, writing and debugging Map Reduce applications
  • Joining Data sets in Map Reduce
  • Algorithms – Traversing Graph, etc.
  • Hadoop APIs

Deep Dive in Pig

  • Grunt, Script Mode, Data Model
  • Advanced Pig Latin, Evaluation and Filter functions, Pig and Ecosystem
  • Real-world use cases – Gaming Industry, Oil and Gas Sector

Day 3

Deep Dive in Hive

  • Understanding Hive, Architecture, Physical Model, Data Model, Data Types
  • Hive QL- DDL, DML, other Operations
  • Understanding Tables in Hive, Partitioning, Indexes, Bucketing, Sub Queries, Joining Tables, Data Load and appending data to existing Table
  • Hands on Exercises – Playing with huge data and Querying extensively.
  • User defined Functions, Optimizing Queries, Tips and Tricks for performance tuning

Introduction to HBase architecture

  • Introduction to HBase, Architecture, Map Reduce Integration, Different Client API – Features and Administration.

Day 4

Deep Dive into Oozie

  • Understanding Oozie
  • Designing and Implementing Workflow
  • Oozie Coordinator Application Implementation

Hadoop Cluster Setup and Running Map Reduce Jobs

  • Hadoop Multi Node Cluster Setup using Amazon EC2 – Creating a 4-node cluster
  • Running Map Reduce Jobs On Cluster

Major Project – Putting it all together and Connecting Dots

  • Working with Large data sets, Steps involved in analyzing large data

Advanced Map Reduce

  • Delving Deeper into The Hadoop API
  • More Advanced Map Reduce Programming, Joining Data Sets in Map Reduce
  • Graph Manipulation in Hadoop

 

BIG DATA/HADOOP WITH SPARK – Data Engineer Package 2

Introduction to Big Data

  • Overview of Big Data Technologies and its role in Analytics
  • Big Data challenges & solutions
  • Data Science vs Data Engineering
  • Job Roles, Skills & Tools

Setting up Development Environment

  • Setting up the development environment on the user's laptop to be able to develop and execute programs
  • Setting up Eclipse (basics such as importing and creating projects, adding JARs) to understand the basics of Eclipse for Map Reduce and Spark development
  • Installing Maven & Gradle to understand build tools
  • Installing Putty, FileZilla/WinSCP to get ready to access Habanero Data Training Big Data Cloud

UNIX AND PYTHON (LVC)

Case Study: XYZ Telecom needs to set up an appropriate directory structure along with permissions on various files on the Linux file system

  • Setting up, accessing and verifying Linux server access over SSH
  • Transferring files over FTP or SFTP
  • Creating directory structure and Setting up permissions
  • Understanding file name patterns and moving files using regular expressions
  • Changing file owners, permissions
  • Reviewing mock file generator utility written in Shell Script, enhancing it to be more useful

Case Study: Developing a simulator to generate mock data using Python

  • Understanding the domain requirement – the required fields, possible values, file format, etc.
  • Preparing configuration file that can be changed to fit any requirement
  • Developing a Python script to generate mock data as per the configuration file
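
A minimal sketch of such a generator is shown below; it assumes a small JSON configuration file, and the field names, value ranges and file names are hypothetical, not part of the course material.

    #!/usr/bin/env python3
    # mock_data_generator.py -- config-driven mock data generator (illustrative sketch).
    # The config file, field names and value ranges are hypothetical examples.
    import csv
    import json
    import random

    # Example config: {"rows": 1000, "delimiter": ",",
    #                  "fields": {"msisdn": {"type": "int", "min": 7000000000, "max": 7999999999},
    #                             "circle": {"type": "choice", "values": ["DEL", "MUM", "KOL"]}}}
    with open("generator_config.json") as f:
        config = json.load(f)

    def make_value(spec):
        if spec["type"] == "int":
            return random.randint(spec["min"], spec["max"])
        if spec["type"] == "choice":
            return random.choice(spec["values"])
        raise ValueError(f"unsupported field type: {spec['type']}")

    with open("mock_data.csv", "w", newline="") as out:
        writer = csv.writer(out, delimiter=config.get("delimiter", ","))
        writer.writerow(config["fields"].keys())
        for _ in range(config["rows"]):
            writer.writerow([make_value(spec) for spec in config["fields"].values()])

Changing the requirement then only means editing the configuration file, which is the point of the exercise.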

JAVA (LVC)

Case Study: Design and Develop Phone Book in Java

  • Identifying Classes and Methods for Phone Book
  • Implementing design into Java Code using Eclipse
  • Compiling and Executing Java Program
  • Enhancing the code with each new concept, such as Inheritance and Method Overloading
  • Further enhancing the code to initialize Phonebook from a Text File by using Java file reading.

HDFS (Hadoop Distributed File System)

Case Study: Handling a huge data set in HDFS to make it accessible to the right users while addressing non-functional requirements like backups, cost, high availability, etc.

  • Understanding the problem statement and the challenges of persisting such large data, to appreciate the need for a Distributed File System
  • Understanding HDFS architecture to solve problems
  • Understanding configuration and creating a directory structure to solve the given problem statement
  • Setting up appropriate permissions to secure data for the right users

Case Study: Developing automation tool for HDFS file management

  • Setting up Java Development with HDFS libraries to use HDFS Java APIs
  • Coding a menu-driven HDFS file management utility and scheduling it to run for file management in the HDFS cluster

SQOOP AND MAP REDUCE

Sqoop

Case Study: Develop automation utility to migrate huge RDBMS warehouse implemented in MySQL to Hadoop cluster

  • Creating and loading data into RDBMS table to understand RDBMS setup
  • Preparing data to experiment with Sqoop imports
  • Importing using Sqoop Command in HDFS file system to understand simple imports
  • Importing using Sqoop command in Hive table to import data into Hive partitioned table and perform ETL
  • Exporting using Sqoop from Hive/HDFS to RDBMS to store the output of Hive ETL into the RDBMS
  • Wrapping Sqoop commands into a Unix Shell Script to be able to build and use an automated utility for day-to-day use
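
The course wraps the Sqoop commands in a Unix shell script; purely as an illustration, the same idea is sketched below in Python. The JDBC URL, credentials, table names and HDFS paths are placeholders, not values from the course.

    #!/usr/bin/env python3
    # sqoop_import_wrapper.py -- illustrative wrapper around the sqoop CLI.
    # The course implements this as a shell script; host, database, credentials
    # and paths below are placeholders.
    import subprocess
    import sys

    TABLES = ["customers", "orders"]   # hypothetical warehouse tables to migrate

    def sqoop_import(table):
        cmd = [
            "sqoop", "import",
            "--connect", "jdbc:mysql://dbhost:3306/warehouse",
            "--username", "etl_user",
            "--password-file", "/user/etl/.db_password",
            "--table", table,
            "--target-dir", f"/data/staging/{table}",
            "--num-mappers", "4",
        ]
        print("Running:", " ".join(cmd))
        return subprocess.run(cmd).returncode

    if __name__ == "__main__":
        failed = [t for t in TABLES if sqoop_import(t) != 0]
        sys.exit(1 if failed else 0)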

Map-Reduce

Case Study: Processing 4G usage data of a Telecom Operator to find out potential customers for various promotional offers

  • Cleaning data, ETL and Aggregation
  • Exploring data set using known tools like Linux commands to understand the nature of data
  • Setting up Eclipse project, maven dependencies to add required Map Reduce Libraries
  • Coding, packaging and deploying project on Hadoop cluster to understand how to deploy/ run map reduce on Hadoop Cluster

HIVE

Case Study: Process a structured data set to find some insights

  • Finding out per driver total miles and hours driven
  • Creating tables, loading data and running select queries to load, query and clean the data
  • Finding which driver has driven the maximum and minimum miles
  • Joining Tables, Saving Query results to table to explore and use right type of table type, partition schema, buckets
  • Discussing the optimum file format for Hive tables
  • Using right file format, type of table, partition scheme to optimize query performance
  • Using UDFs to reuse domain specific implementations
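
The reports asked for above (per-driver totals, maximum and minimum miles) are plain HiveQL; since the code sketches on this page use Python, the example below runs those queries through Spark SQL with Hive support. The table name (driver_trips) and columns (driver_id, miles, hours) are assumptions about the exercise data set.

    # driver_report.py -- the HiveQL the exercise calls for, run via Spark SQL.
    # Table and column names are assumptions about the exercise data set.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("driver-report")
             .enableHiveSupport()       # read tables registered in the Hive metastore
             .getOrCreate())

    # Per-driver total miles and hours driven
    totals = spark.sql("""
        SELECT driver_id,
               SUM(miles) AS total_miles,
               SUM(hours) AS total_hours
        FROM driver_trips
        GROUP BY driver_id
    """)
    totals.show()

    # Drivers who have driven the maximum and minimum miles
    totals.orderBy("total_miles", ascending=False).limit(1).show()
    totals.orderBy("total_miles").limit(1).show()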

PIG

Case Study: Perform ETL processing on Data Set to find some insights

  • Loading and exploring the MovieLens 100K data set and associating a schema with it
  • Using grunt, Loading data set, defining schema
  • Finding simple statistics from the given data set to clean up the data
  • Filtering and modifying data schema
  • Finding gender distribution in users
  • Aggregating and looping
  • Finding top 25 movies by rating, joining data sets and saving to HDFS to perform aggregation
  • Dumping, Storing, joining, sorting
  • Writing filter functions for complex conditions to reuse domain-specific functionality and avoid rewriting code
  • Using UDFs
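
The ETL flow above is written in Pig Latin during the exercise; for comparison, the sketch below expresses the same load–filter–join–aggregate–store pipeline with PySpark DataFrames. The file paths and the column layout assumed for the MovieLens 100K files (tab-separated ratings, pipe-separated movie titles) should be verified against the actual data set.

    # top_movies.py -- the Pig flow (load, filter, join, group, order, store)
    # sketched with PySpark DataFrames for comparison. Paths and column layout
    # of the MovieLens 100K files are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("top-movies").getOrCreate()

    ratings = (spark.read.csv("/data/ml-100k/u.data", sep="\t")
               .toDF("user_id", "movie_id", "rating", "ts")
               .withColumn("rating", F.col("rating").cast("int")))

    movies = (spark.read.csv("/data/ml-100k/u.item", sep="|")
              .select(F.col("_c0").alias("movie_id"), F.col("_c1").alias("title")))

    top25 = (ratings.groupBy("movie_id")
             .agg(F.avg("rating").alias("avg_rating"), F.count("*").alias("num_ratings"))
             .filter("num_ratings >= 100")      # ignore rarely rated movies
             .join(movies, "movie_id")
             .orderBy(F.desc("avg_rating"))
             .limit(25))

    top25.write.mode("overwrite").csv("/data/output/top25_movies")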

SPARK

Case Study: Build a model to predict production errors/failures (across huge fleets of servers, applications and software) with good speed, using computation power efficiently while considering processor challenges

  • Loading and performing pre-processing to convert unstructured data to some structured data format
  • Cleaning data, filtering out bad records, converting data to more usable format
  • Aggregating data based on Response Code to find out servers' performance from logs
  • Filtering, joining and aggregating data to find the top 20 frequent hosts that generate errors

Spark Project

Case Study: Build a model (using Python) to predict production errors/failures (across huge fleets of servers, applications and software) with good speed, using computation power efficiently while considering processor challenges

  • Loading and performing pre-processing to convert unstructured data to some structured data format
  • Cleaning data, filtering out bad records, converting data to more usable format
  • Aggregating data based on Response Code to find out servers' performance from logs
  • Filtering, joining and aggregating data to find the top 20 frequent hosts that generate errors
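
A minimal PySpark sketch of these steps is given below. It assumes Apache-style access logs under an illustrative HDFS path; the regular expression and the 4xx/5xx error definition are assumptions, not part of the course material.

    # log_analysis.py -- parse logs, drop bad records, aggregate by response code,
    # and find the top 20 error-generating hosts. The input path and the common
    # log format assumed by the regex are illustrative.
    import re
    from pyspark.sql import SparkSession

    LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) (\S+)')

    def parse(line):
        m = LOG_RE.match(line)
        if not m:
            return None                 # bad / malformed record
        host, code, _size = m.groups()
        return (host, int(code))

    spark = SparkSession.builder.appName("log-analysis").getOrCreate()
    sc = spark.sparkContext

    records = sc.textFile("/data/access_logs/*").map(parse)
    good = records.filter(lambda r: r is not None).cache()

    # Requests per response code, to gauge server performance from the logs
    by_code = good.map(lambda r: (r[1], 1)).reduceByKey(lambda a, b: a + b)
    print(by_code.collect())

    # Top 20 hosts that generate errors (HTTP status >= 400)
    errors = good.filter(lambda r: r[1] >= 400)
    top20 = (errors.map(lambda r: (r[0], 1))
             .reduceByKey(lambda a, b: a + b)
             .takeOrdered(20, key=lambda kv: -kv[1]))
    print(top20)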

OOZIE

Case Study: Setting up a data processing pipeline to work on schedule in the Hadoop Eco System, comprising multiple components like Sqoop jobs, Hive scripts, Pig scripts, Spark jobs, etc.

  • Setting up an Oozie workflow to trigger a script, then a Sqoop job followed by a Hive job
  • Executing workflow to run complete ETL pipeline

HBASE

Case Study: Find out the top 10 customers by expenditure, the top 10 most-bought brands, and monthly sales from data stored in HBase as key-value pairs

  • Designing the HBase table schema to model the table structure and decide column families as per the data
  • Bulk Loading & Programmatically Loading data using Java APIs to populate data into the HBase table
  • Querying and Showing data on UI to integrate HBase with UI/Reporting
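
The course does the loading and querying through the HBase Java APIs; purely as an illustration of the key-value model, the sketch below uses the Python happybase client, which assumes the HBase Thrift server is running. The table name, column family and row keys are made up for the example.

    # hbase_sales.py -- illustrative load and scan with the happybase client.
    # The course uses the HBase Java APIs; this sketch assumes a running HBase
    # Thrift server, and the table/column-family/row-key names are made up.
    import happybase

    connection = happybase.Connection("hbase-thrift-host")

    # One wide table keyed by customer id, with a 'purchase' column family
    if b"customer_purchases" not in connection.tables():
        connection.create_table("customer_purchases", {"purchase": dict()})

    table = connection.table("customer_purchases")
    table.put(b"cust_1001", {
        b"purchase:brand": b"Acme",
        b"purchase:amount": b"249.99",
        b"purchase:month": b"2024-01",
    })

    # Scan the key-value pairs back, e.g. to aggregate expenditure per customer
    for row_key, data in table.scan():
        print(row_key, data)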

LIVE PROJECT 1

Project: ETL processing of retail logs

  • To find the demand for a given product
  • To find the trend and seasonality of a product
  • To understand the performance of the retail chain

LIVE PROJECT 2

Project: Creating a 360-degree view (past, present and future) of the customer for a retail company – avoiding repetition or re-keying of information, viewing customer history, establishing context and initiating desired actions

  • Exploring and checking the basic quality of data to understand the data and the need for filtering/pre-processing
  • Loading data into RDBMS table to simulate real world scenarios where data is persisted in RDBMS table
  • Developing & executing a Sqoop Job to ingest data into the Hadoop Cluster to perform further actions
  • Developing & executing Pig script to perform required ETL processing on ingested Data
  • Developing & executing Hive Queries to get reports out of processed data

 

BIG DATA/HADOOP WITH MACHINE LEARNING

Candidates will take the entire Package 2 in addition to the topics below

  • Introduction to Machine Learning
  • Linear Regression with One Variable
  • Linear Algebra Review
  • Linear Regression with Multiple Variables
  • Octave/Matlab Tutorial
  • Logistic Regression
  • Regularization
  • Neural Networks: Learning
  • Advice for Applying Machine Learning
  • Machine Learning System design
  • Support Vector Machines
  • Unsupervised Learning
  • Dimensionality Reduction
  • Anomaly Detection
  • Recommender Systems
  • Large Scale Machine Learning
  • Application Example: Photo OCR
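
As a taste of the first regression topic in this list, the snippet below fits a one-variable linear model with batch gradient descent in NumPy; the data points and learning rate are made up for the illustration.

    # linear_regression_1var.py -- batch gradient descent for y = theta0 + theta1 * x,
    # illustrating the "Linear Regression with One Variable" topic. The data and
    # learning rate are made up for the example.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

    theta0, theta1 = 0.0, 0.0
    alpha, iterations = 0.01, 2000

    for _ in range(iterations):
        pred = theta0 + theta1 * x
        err = pred - y
        # Simultaneous update of both parameters (gradient of the mean squared error)
        theta0 -= alpha * err.mean()
        theta1 -= alpha * (err * x).mean()

    print(f"theta0={theta0:.3f}, theta1={theta1:.3f}")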

LIVE PROJECT

Live projects will be picked from Machine Learning projects on Kaggle.com