Hadoop Essentials


INTRODUCTION TO HADOOP – Data Analyst Package

Day 1

Introduction to Hadoop and its Ecosystem, Map Reduce and HDFS

  • Big Data, Factors constituting Big Data
  • Hadoop and Hadoop Ecosystem
  • Map Reduce – Concepts of Map, Reduce, Ordering, Shuffle and Concurrency
  • Hadoop Distributed File System (HDFS) Concepts and its Importance
  • Deep Dive in Map Reduce – Execution Framework, Partitioner, Combiner, Data Types, Key pairs
  • HDFS Deep Dive – Architecture, Data Replication, Name Node, Data Node, Data Flow
  • Parallel Copying with DISTCP, Hadoop Archives

Hands on Exercises

  • Installing Hadoop in Pseudo Distributed Mode, Understanding Important configuration files, their Properties and Daemon Threads
  • Accessing HDFS from Command Line
  • Map Reduce – Basic Exercises (a short Python sketch follows this list)
  • Understanding Hadoop Eco-system
  • Introduction to Sqoop, use cases and Installation
  • Introduction to Hive, use cases and Installation
  • Introduction to Pig, use cases and Installation
  • Introduction to Oozie, use cases and Installation
  • Introduction to Flume, use cases and Installation
  • Introduction to YARN, use cases and Installation
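
The Map Reduce basic exercise above can be sketched without Java by using Hadoop Streaming, which pipes input lines to any executable over stdin/stdout. The word-count mapper and reducer below are a minimal Python sketch under that assumption; the course's own exercises use the Java MapReduce API, and the file names here are illustrative only.

    #!/usr/bin/env python3
    # mapper.py -- word-count mapper for Hadoop Streaming (illustrative sketch).
    # Streaming feeds input lines on stdin and expects tab-separated
    # key/value pairs on stdout.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- sums counts per word. Hadoop sorts mapper output by key,
    # so identical keys arrive at the reducer contiguously.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

Both scripts would be submitted with the hadoop-streaming JAR that ships with Hadoop, passing them as the mapper and reducer along with the HDFS input and output paths.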

Day 2

Deep Dive in Map Reduce and YARN

  • How to develop a Map Reduce application and write unit tests
  • Best practices for developing, writing and debugging Map Reduce applications
  • Joining Data sets in Map Reduce
  • Algorithms – Traversing Graph, etc.
  • Hadoop APIs

Deep Dive in Pig

  • Grunt, Script Mode, Data Model
  • Advanced Pig Latin, Evaluation and Filter functions, Pig and Ecosystem
  • Real-world use cases – Gaming Industry, Oil and Gas Sector

Day 3

Deep Dive in Hive

  • Understanding Hive, Architecture, Physical Model, Data Model, Data Types
  • Hive QL- DDL, DML, other Operations
  • Understanding Tables in Hive, Partitioning, Indexes, Bucketing, Sub Queries, Joining Tables, Data Load and appending data to existing Table
  • Hands on Exercises – Playing with huge data and Querying extensively.
  • User defined Functions, Optimizing Queries, Tips and Tricks for performance tuning

Introduction to HBase architecture

  • Introduction to HBase, Architecture, Map Reduce Integration, Different Client API – Features and Administration.

Day 4

Deep Dive into Oozie

  • Understanding Oozie
  • Designing and Implementing Workflow
  • Oozie Coordinator Application Implementation

Hadoop Cluster Setup and Running Map Reduce Jobs

  • Hadoop Multi Node Cluster Setup using Amazon EC2 – Creating a 4-node cluster
  • Running Map Reduce Jobs On Cluster

Major Project – Putting it all together and Connecting Dots

  • Working with Large data sets, Steps involved in analyzing large data

Advanced Map Reduce

  • Delving Deeper into The Hadoop API
  • More Advanced Map Reduce Programming, Joining Data Sets in Map Reduce
  • Graph Manipulation in Hadoop

 

BIG DATA/HADOOP WITH SPARK – Data Engineer Package 2

Introduction to Big Data

  • Overview of Big Data Technologies and its role in Analytics
  • Big Data challenges & solutions
  • Data Science vs Data Engineering
  • Job Roles, Skills & Tools

Setting up Development Environment

  • Setting up the development environment on the user's laptop to be able to develop and execute programs
  • Setting up Eclipse (basics such as importing and creating projects, adding JARs) to understand the basics of Eclipse for Map Reduce and Spark development
  • Installing Maven & Gradle to understand build tools
  • Installing Putty, FileZilla/WinSCP to get ready to access Habanero Data Training Big Data Cloud

UNIX AND PYTHON (LVC)

Case Study: XYZ Telecom needs to set up an appropriate directory structure along with permissions on various files on the Linux file system

  • Setting up, accessing and verifying Linux server access over SSH
  • Transferring files over FTP or SFTP
  • Creating directory structure and Setting up permissions
  • Understanding file name patterns and moving files using regular expressions
  • Changing file owners, permissions
  • Reviewing mock file generator utility written in Shell Script, enhancing it to be more useful

Case Study: Developing a simulator to generate mock data using Python

  • Understanding the domain requirement – the required fields, possible values, file format, etc.
  • Preparing configuration file that can be changed to fit any requirement
  • Developing a Python script to generate mock data as per the configuration file
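
A minimal sketch of such a generator is shown below; it assumes a small JSON configuration file, and the field names, value ranges and file names are hypothetical, not part of the course material.

    #!/usr/bin/env python3
    # mock_data_generator.py -- config-driven mock data generator (illustrative sketch).
    # The config file, field names and value ranges are hypothetical examples.
    import csv
    import json
    import random

    # Example config: {"rows": 1000, "delimiter": ",",
    #                  "fields": {"msisdn": {"type": "int", "min": 7000000000, "max": 7999999999},
    #                             "circle": {"type": "choice", "values": ["DEL", "MUM", "KOL"]}}}
    with open("generator_config.json") as f:
        config = json.load(f)

    def make_value(spec):
        if spec["type"] == "int":
            return random.randint(spec["min"], spec["max"])
        if spec["type"] == "choice":
            return random.choice(spec["values"])
        raise ValueError(f"unsupported field type: {spec['type']}")

    with open("mock_data.csv", "w", newline="") as out:
        writer = csv.writer(out, delimiter=config.get("delimiter", ","))
        writer.writerow(config["fields"].keys())
        for _ in range(config["rows"]):
            writer.writerow([make_value(spec) for spec in config["fields"].values()])

Changing the requirement then only means editing the configuration file, which is the point of the exercise.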

JAVA (LVC)

Case Study: Design and Develop Phone Book in Java

  • Identifying Classes and Methods for Phone Book
  • Implementing design into Java Code using Eclipse
  • Compiling and Executing Java Program
  • Enhancing the code with each new concept, such as Inheritance and Method Overloading
  • Further enhancing the code to initialize Phonebook from a Text File by using Java file reading.

HDFS (Hadoop Distributed File System)

Case Study: Handling a huge data set in HDFS to make it accessible to the right users while addressing non-functional requirements like backups, cost, high availability, etc.

  • Understanding the problem statement and the challenges of persisting such large data, to appreciate the need for a Distributed File System
  • Understanding HDFS architecture to solve problems
  • Understanding configuration and creating a directory structure to solve the given problem statement
  • Setting up appropriate permissions to secure data for the right users

Case Study: Developing automation tool for HDFS file management

  • Setting up Java Development with HDFS libraries to use HDFS Java APIs
  • Coding a menu-driven HDFS file management utility and scheduling it to run for file management in the HDFS cluster

SQOOP AND MAP REDUCE

Sqoop

Case Study: Develop automation utility to migrate huge RDBMS warehouse implemented in MySQL to Hadoop cluster

  • Creating and loading data into RDBMS table to understand RDBMS setup
  • Preparing data to experiment with Sqoop imports
  • Importing using Sqoop Command in HDFS file system to understand simple imports
  • Importing using Sqoop command in Hive table to import data into Hive partitioned table and perform ETL
  • Exporting using Sqoop from Hive/HDFS to RDBMS to store the output of Hive ETL into the RDBMS
  • Wrapping Sqoop commands into a Unix Shell Script to be able to build and use an automated utility for day-to-day use
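
The course wraps the Sqoop commands in a Unix shell script; purely as an illustration, the same idea is sketched below in Python. The JDBC URL, credentials, table names and HDFS paths are placeholders, not values from the course.

    #!/usr/bin/env python3
    # sqoop_import_wrapper.py -- illustrative wrapper around the sqoop CLI.
    # The course implements this as a shell script; host, database, credentials
    # and paths below are placeholders.
    import subprocess
    import sys

    TABLES = ["customers", "orders"]   # hypothetical warehouse tables to migrate

    def sqoop_import(table):
        cmd = [
            "sqoop", "import",
            "--connect", "jdbc:mysql://dbhost:3306/warehouse",
            "--username", "etl_user",
            "--password-file", "/user/etl/.db_password",
            "--table", table,
            "--target-dir", f"/data/staging/{table}",
            "--num-mappers", "4",
        ]
        print("Running:", " ".join(cmd))
        return subprocess.run(cmd).returncode

    if __name__ == "__main__":
        failed = [t for t in TABLES if sqoop_import(t) != 0]
        sys.exit(1 if failed else 0)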

Map-Reduce

Case Study: Processing 4G usage data of a Telecom Operator to find out potential customers for various promotional offers

  • Cleaning data, ETL and Aggregation
  • Exploring data set using known tools like Linux commands to understand the nature of data
  • Setting up Eclipse project, maven dependencies to add required Map Reduce Libraries
  • Coding, packaging and deploying project on Hadoop cluster to understand how to deploy/ run map reduce on Hadoop Cluster

HIVE

Case Study: Process a structured data set to find some insights

  • Finding out per driver total miles and hours driven
  • Creating tables, loading data and running select queries to load, query and clean the data
  • Finding which driver has driven the maximum and minimum miles
  • Joining Tables, Saving Query results to table to explore and use right type of table type, partition schema, buckets
  • Discussing the optimum file format for Hive tables
  • Using right file format, type of table, partition scheme to optimize query performance
  • Using UDFs to reuse domain specific implementations
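
The reports asked for above (per-driver totals, maximum and minimum miles) are plain HiveQL; since the code sketches on this page use Python, the example below runs those queries through Spark SQL with Hive support. The table name (driver_trips) and columns (driver_id, miles, hours) are assumptions about the exercise data set.

    # driver_report.py -- the HiveQL the exercise calls for, run via Spark SQL.
    # Table and column names are assumptions about the exercise data set.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("driver-report")
             .enableHiveSupport()       # read tables registered in the Hive metastore
             .getOrCreate())

    # Per-driver total miles and hours driven
    totals = spark.sql("""
        SELECT driver_id,
               SUM(miles) AS total_miles,
               SUM(hours) AS total_hours
        FROM driver_trips
        GROUP BY driver_id
    """)
    totals.show()

    # Drivers who have driven the maximum and minimum miles
    totals.orderBy("total_miles", ascending=False).limit(1).show()
    totals.orderBy("total_miles").limit(1).show()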

PIG

Case Study: Perform ETL processing on Data Set to find some insights

  • Loading and exploring the MovieLens 100K data set and associating a schema with it
  • Using grunt, Loading data set, defining schema
  • Finding simple statistics from the given data set to clean up the data
  • Filtering and modifying data schema
  • Finding gender distribution in users
  • Aggregating and looping
  • Finding top 25 movies by rating, joining data sets and saving to HDFS to perform aggregation
  • Dumping, Storing, joining, sorting
  • Writing filter functions for complex conditions to reuse domain-specific functionality and avoid rewriting code
  • Using UDFs
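
The ETL flow above is written in Pig Latin during the exercise; for comparison, the sketch below expresses the same load–filter–join–aggregate–store pipeline with PySpark DataFrames. The file paths and the column layout assumed for the MovieLens 100K files (tab-separated ratings, pipe-separated movie titles) should be verified against the actual data set.

    # top_movies.py -- the Pig flow (load, filter, join, group, order, store)
    # sketched with PySpark DataFrames for comparison. Paths and column layout
    # of the MovieLens 100K files are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("top-movies").getOrCreate()

    ratings = (spark.read.csv("/data/ml-100k/u.data", sep="\t")
               .toDF("user_id", "movie_id", "rating", "ts")
               .withColumn("rating", F.col("rating").cast("int")))

    movies = (spark.read.csv("/data/ml-100k/u.item", sep="|")
              .select(F.col("_c0").alias("movie_id"), F.col("_c1").alias("title")))

    top25 = (ratings.groupBy("movie_id")
             .agg(F.avg("rating").alias("avg_rating"), F.count("*").alias("num_ratings"))
             .filter("num_ratings >= 100")      # ignore rarely rated movies
             .join(movies, "movie_id")
             .orderBy(F.desc("avg_rating"))
             .limit(25))

    top25.write.mode("overwrite").csv("/data/output/top25_movies")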

SPARK

Case Study: Build a model to predict production errors/failures (across huge fleets of servers, applications and software) with good speed, using computation power efficiently while considering processor challenges

  • Loading and performing pre-processing to convert unstructured data to some structured data format
  • Cleaning data, filtering out bad records, converting data to more usable format
  • Aggregating data based on Response Code to find out servers' performance from logs
  • Filtering, joining and aggregating data to find the top 20 frequent hosts that generate errors

Spark Project

Case Study: Build a model (using Python) to predict production errors/failures (across huge fleets of servers, applications and software) with good speed, using computation power efficiently while considering processor challenges

  • Loading and performing pre-processing to convert unstructured data to some structured data format
  • Cleaning data, filtering out bad records, converting data to more usable format
  • Aggregating data based on Response Code to find out servers' performance from logs
  • Filtering, joining and aggregating data to find the top 20 frequent hosts that generate errors
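
A minimal PySpark sketch of these steps is given below. It assumes Apache-style access logs under an illustrative HDFS path; the regular expression and the 4xx/5xx error definition are assumptions, not part of the course material.

    # log_analysis.py -- parse logs, drop bad records, aggregate by response code,
    # and find the top 20 error-generating hosts. The input path and the common
    # log format assumed by the regex are illustrative.
    import re
    from pyspark.sql import SparkSession

    LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) (\S+)')

    def parse(line):
        m = LOG_RE.match(line)
        if not m:
            return None                 # bad / malformed record
        host, code, _size = m.groups()
        return (host, int(code))

    spark = SparkSession.builder.appName("log-analysis").getOrCreate()
    sc = spark.sparkContext

    records = sc.textFile("/data/access_logs/*").map(parse)
    good = records.filter(lambda r: r is not None).cache()

    # Requests per response code, to gauge server performance from the logs
    by_code = good.map(lambda r: (r[1], 1)).reduceByKey(lambda a, b: a + b)
    print(by_code.collect())

    # Top 20 hosts that generate errors (HTTP status >= 400)
    errors = good.filter(lambda r: r[1] >= 400)
    top20 = (errors.map(lambda r: (r[0], 1))
             .reduceByKey(lambda a, b: a + b)
             .takeOrdered(20, key=lambda kv: -kv[1]))
    print(top20)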

OOZIE

Case Study: Setting up a data processing pipeline to work on schedule in the Hadoop Eco System, comprising multiple components like Sqoop jobs, Hive scripts, Pig scripts, Spark jobs, etc.

  • Setting up an Oozie workflow to trigger a script, then a Sqoop job followed by a Hive job
  • Executing workflow to run complete ETL pipeline

HBASE

Case Study: Find out the top 10 customers by expenditure, the top 10 most-bought brands, and monthly sales from data stored in HBase as key-value pairs

  • Designing the HBase table schema to model the table structure and decide column families as per the data
  • Bulk Loading & Programmatically Loading data using Java APIs to populate data into the HBase table
  • Querying and Showing data on UI to integrate HBase with UI/Reporting
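
The course does the loading and querying through the HBase Java APIs; purely as an illustration of the key-value model, the sketch below uses the Python happybase client, which assumes the HBase Thrift server is running. The table name, column family and row keys are made up for the example.

    # hbase_sales.py -- illustrative load and scan with the happybase client.
    # The course uses the HBase Java APIs; this sketch assumes a running HBase
    # Thrift server, and the table/column-family/row-key names are made up.
    import happybase

    connection = happybase.Connection("hbase-thrift-host")

    # One wide table keyed by customer id, with a 'purchase' column family
    if b"customer_purchases" not in connection.tables():
        connection.create_table("customer_purchases", {"purchase": dict()})

    table = connection.table("customer_purchases")
    table.put(b"cust_1001", {
        b"purchase:brand": b"Acme",
        b"purchase:amount": b"249.99",
        b"purchase:month": b"2024-01",
    })

    # Scan the key-value pairs back, e.g. to aggregate expenditure per customer
    for row_key, data in table.scan():
        print(row_key, data)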

LIVE PROJECT 1

Project: ETL processing of retail logs

  • To find the demand for a given product
  • To find the trend and seasonality of a product
  • To understand the performance of the retail chain

LIVE PROJECT 2

Project: Creating a 360-degree view (past, present and future) of the customer for a retail company – avoiding repetition or re-keying of information, viewing customer history, establishing context and initiating desired actions

  • Exploring and checking the basic quality of data to understand the data and the need for filtering/pre-processing
  • Loading data into RDBMS table to simulate real world scenarios where data is persisted in RDBMS table
  • Developing & executing a Sqoop Job to ingest data into the Hadoop Cluster to perform further actions
  • Developing & executing Pig script to perform required ETL processing on ingested Data
  • Developing & executing Hive Queries to get reports out of processed data

 

BIG DATA/HADOOP WITH MACHINE LEARNING

Candidates will take the entire Package 2 in addition to the topics below

  • Introduction to Machine Learning
  • Linear Regression with One Variable
  • Linear Algebra Review
  • Linear Regression with Multiple Variables
  • Octave/Matlab Tutorial
  • Logistic Regression
  • Regularization
  • Neural Networks: Learning
  • Advice for Applying Machine Learning
  • Machine Learning System design
  • Support Vector Machines
  • Unsupervised Learning
  • Dimensionality Reduction
  • Anomaly Detection
  • Recommender Systems
  • Large Scale Machine Learning
  • Application Example: Photo OCR
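
As a taste of the first regression topic in this list, the snippet below fits a one-variable linear model with batch gradient descent in NumPy; the data points and learning rate are made up for the illustration.

    # linear_regression_1var.py -- batch gradient descent for y = theta0 + theta1 * x,
    # illustrating the "Linear Regression with One Variable" topic. The data and
    # learning rate are made up for the example.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

    theta0, theta1 = 0.0, 0.0
    alpha, iterations = 0.01, 2000

    for _ in range(iterations):
        pred = theta0 + theta1 * x
        err = pred - y
        # Simultaneous update of both parameters (gradient of the mean squared error)
        theta0 -= alpha * err.mean()
        theta1 -= alpha * (err * x).mean()

    print(f"theta0={theta0:.3f}, theta1={theta1:.3f}")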

LIVE PROJECT

Live projects will be picked from Machine Learning projects on Kaggle.com