INTRODUCTION TO HADOOP – Data Analyst Package
Day 1
Introduction to Hadoop and its Ecosystem, Map Reduce and HDFS
- Big Data, Factors constituting Big Data
- Hadoop and Hadoop Ecosystem
- Map Reduce – Concepts of Map, Reduce, Ordering, Concurrency and Shuffle
- Hadoop Distributed File System (HDFS) Concepts and its Importance
- Deep Dive in Map Reduce – Execution Framework, Partitioner, Combiner, Data Types, Key pairs
- HDFS Deep Dive – Architecture, Data Replication, Name Node, Data Node, Data Flow
- Parallel Copying with DISTCP, Hadoop Archives
Hands on Exercises
- Installing Hadoop in Pseudo-Distributed Mode, Understanding Important configuration files, their Properties and Daemon Threads
- Accessing HDFS from Command Line
- Map Reduce – Basic Exercises (a word-count sketch in Python follows this list)
- Understanding Hadoop Eco-system
- Introduction to Sqoop, use cases and Installation
- Introduction to Hive, use cases and Installation
- Introduction to Pig, use cases and Installation
- Introduction to Oozie, use cases and Installation
- Introduction to Flume, use cases and Installation
- Introduction to Yarn, use cases and installation
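For the Map Reduce basic exercises, a minimal word-count sketch using Hadoop Streaming with Python is shown below. It is only an illustration; the file names mapper.py and reducer.py and the input/output paths are assumptions, not part of the course material.

    # mapper.py – emits (word, 1) pairs on standard output for Hadoop Streaming
    import sys
    for line in sys.stdin:
        for word in line.strip().split():
            print(word + "\t1")

    # reducer.py – sums counts per word; Streaming delivers mapper output sorted by key
    import sys
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(current_word + "\t" + str(current_count))

Such scripts are typically submitted with the Hadoop Streaming JAR, e.g. hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper 'python3 mapper.py' -reducer 'python3 reducer.py' -input /data/in -output /data/out (paths illustrative).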
Day 2
Deep Dive in Map Reduce and Yarn
- How to develop a Map Reduce Application, writing unit tests
- Best Practices for Developing, Writing and Debugging Map Reduce Applications
- Joining Data sets in Map Reduce
- Algorithms – Traversing Graph, etc.
- Hadoop APIs
Deep Dive in Pig
- Grunt, Script Mode, Data Model
- Advanced Pig Latin, Evaluation and Filter functions, Pig and Ecosystem
- Real time use cases – Gaming Industry, Oil and Gas Sector
Day 3
Deep Dive in Hive
- Understanding Hive, Architecture, Physical Model, Data Model, Data Types
- Hive QL- DDL, DML, other Operations
- Understanding Tables in Hive, Partitioning, Indexes, Bucketing, Sub Queries, Joining Tables, Data Load and appending data to existing Table
- Hands on Exercises – Playing with huge data and Querying extensively.
- User defined Functions, Optimizing Queries, Tips and Tricks for performance tuning
Introduction to HBase architecture
- Introduction to HBase, Architecture, Map Reduce Integration, Different Client APIs – Features and Administration.
Day 4
Deep Dive into Oozie
- Understanding Oozie
- Designing and Implementing Workflow
- Oozie Coordinator Application Implementation
Hadoop Cluster Setup and Running Map Reduce Jobs
- Hadoop Multi-Node Cluster Setup using Amazon EC2 – Creating a 4-node cluster
- Running Map Reduce Jobs On Cluster
Major Project – Putting it all together and Connecting Dots
- Putting it all together and Connecting Dots
- Working with Large data sets, Steps involved in analyzing large data
Advanced Map Reduce
- Delving Deeper into The Hadoop API
- More Advanced Map Reduce Programming, Joining Data Sets in Map Reduce
- Graph Manipulation in Hadoop
BIG DATA/HADOOP WITH SPARK – Data Engineer Package 2
Introduction to Big Data
- Overview of Big Data Technologies and its role in Analytics
- Big Data challenges & solutions
- Data Science vs Data Engineering
- Job Roles, Skills & Tools
Setting up Development Environment
- Setting up the development environment on the user's laptop to develop and execute programs
- Setting up Eclipse (basics such as importing and creating projects, adding JARs) for Map Reduce and Spark development
- Installing Maven & Gradle to understand build tools
- Installing PuTTY and FileZilla/WinSCP to get ready to access the Habanero Data Training Big Data Cloud
UNIX AND PYTHON (LVC)
Case Study: XYZ Telecom needs to set up an appropriate directory structure, along with permissions on various files, on a Linux file system
- Setting up, accessing and verifying Linux server access over SSH
- Transferring files over FTP or SFTP
- Creating directory structure and Setting up permissions
- Understanding file name patterns and moving files using regular expressions
- Changing file owners, permissions
- Reviewing mock file generator utility written in Shell Script, enhancing it to be more useful
Case Study: Developing a simulator to generate mock data using Python
- Understanding the domain requirement and describing the required fields, their possible values, the file format, etc.
- Preparing a configuration file that can be changed to fit any requirement
- Developing a Python script to generate mock data as described in the configuration file
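A minimal sketch of such a generator is shown below. The configuration file name fields.json, the field names and the output format are hypothetical examples, not the course's actual specification.

    # generate_mock.py – emits delimited mock records driven by a small JSON config
    # Example fields.json (hypothetical):
    # {"records": 5, "delimiter": "|",
    #  "fields": {"msisdn": ["2348030000001", "2348030000002"], "usage_mb": [10, 2048]}}
    import json
    import random

    with open("fields.json") as f:                 # the config decides count, delimiter and fields
        cfg = json.load(f)

    for _ in range(cfg["records"]):
        values = []
        for field_spec in cfg["fields"].values():
            if all(isinstance(v, int) for v in field_spec):    # two integers mean a numeric range [lo, hi]
                values.append(str(random.randint(field_spec[0], field_spec[1])))
            else:                                              # otherwise pick one of the listed values
                values.append(random.choice(field_spec))
        print(cfg["delimiter"].join(values))

Changing fields.json is enough to reshape the output for a different domain, which is the point of keeping the generator configuration-driven.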
JAVA (LVC)
Case Study: Design and Develop Phone Book in Java
- Identifying Classes and Methods for Phone Book
- Implementing design into Java Code using Eclipse
- Compiling and Executing Java Program
- Enhancing the code with each new concept learned, like Inheritance and Method Overloading
- Further enhancing the code to initialize the Phone Book from a text file using Java file reading.
HDFS (Hadoop Distributed File System)
Case Study: Handling a huge data set in HDFS to make it accessible to the right users and address non-functional requirements like backups, cost, high availability, etc.
- Understanding the problem statement and the challenges pertaining to such large data, to appreciate the need for a Distributed File System
- Understanding HDFS architecture to solve problems
- Understanding the configuration and creating a directory structure that solves the given problem statement
- Setting up appropriate permissions to secure data for the appropriate users
Case Study: Developing automation tool for HDFS file management
- Setting up Java Development with HDFS libraries to use HDFS Java APIs
- Coding a menu-driven HDFS file management utility and scheduling it to run for file management in the HDFS cluster
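The course builds this utility with the HDFS Java APIs; purely as an illustration, the sketch below drives the same kind of operations from Python through the hdfs dfs command line. The menu options and paths are hypothetical.

    # hdfs_menu.py – tiny menu-driven wrapper around the "hdfs dfs" command line
    import subprocess

    ACTIONS = {
        "1": ["hdfs", "dfs", "-ls", "/"],                          # list the HDFS root
        "2": ["hdfs", "dfs", "-mkdir", "-p", "/data/raw"],         # create a directory (path illustrative)
        "3": ["hdfs", "dfs", "-put", "local.txt", "/data/raw/"],   # upload a local file
        "4": ["hdfs", "dfs", "-rm", "-r", "/data/tmp"],            # remove a directory tree
    }

    choice = input("1) list  2) mkdir  3) upload  4) delete : ").strip()
    if choice in ACTIONS:
        subprocess.run(ACTIONS[choice], check=True)                # fails loudly if HDFS is unreachable
    else:
        print("Unknown option")

Scheduling the same script from cron (or Oozie, covered later) gives the unattended file-management behaviour described above.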
SQOOP AND MAP REDUCE
Sqoop
Case Study: Developing an automation utility to migrate a huge RDBMS warehouse implemented in MySQL to a Hadoop cluster
- Creating and loading data into RDBMS table to understand RDBMS setup
- Preparing data to experiment with Sqoop imports
- Importing using the Sqoop command into the HDFS file system to understand simple imports
- Importing using the Sqoop command into a Hive partitioned table and performing ETL
- Exporting using Sqoop from Hive/HDFS to RDBMS to store the output of Hive ETL in the RDBMS
- Wrapping Sqoop commands into a Unix Shell Script to build an automated utility for day-to-day use
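A minimal sketch of such an automation step is shown below. The course wraps the Sqoop command in a Unix shell script; here it is wrapped in Python instead, and the JDBC URL, credentials, table and target directory are placeholders.

    # sqoop_import.py – builds and runs a basic "sqoop import" from MySQL into HDFS
    import subprocess

    cmd = [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/warehouse",   # placeholder JDBC URL
        "--username", "etl_user",                       # placeholder credentials
        "--password", "secret",
        "--table", "orders",                            # placeholder source table
        "--target-dir", "/data/warehouse/orders",       # HDFS destination directory
        "--num-mappers", "4",                           # degree of parallelism for the import
    ]
    subprocess.run(cmd, check=True)

The same pattern, building the command from a small configuration and looping over tables, is what turns a one-off import into a reusable day-to-day utility.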
Map-Reduce
Case Study: Processing 4G usage data of a Telecom Operator to find out potential customers for various promotional offers
- Cleaning data, ETL and Aggregation
- Exploring data set using known tools like Linux commands to understand the nature of data
- Setting up Eclipse project, maven dependencies to add required Map Reduce Libraries
- Coding, packaging and deploying the project on the Hadoop cluster to understand how to deploy/run Map Reduce jobs on the cluster
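As an illustration of the cleaning and aggregation step, a mapper sketch in Python (runnable via Hadoop Streaming) is shown below. The course implements this in Java with Eclipse and Maven, and the record layout assumed here (subscriber id in the first field, MB used in the fourth field of a comma-separated log) is hypothetical.

    # usage_mapper.py – cleans 4G usage records and emits (subscriber_id, mb_used) pairs
    import sys

    for line in sys.stdin:
        parts = line.strip().split(",")          # hypothetical comma-separated log layout
        if len(parts) < 4:                       # drop malformed records during cleaning
            continue
        subscriber, mb_used = parts[0], parts[3]
        try:
            print(subscriber + "\t" + str(float(mb_used)))   # tab-separated key/value for the reducer
        except ValueError:                       # skip records with non-numeric usage
            continue

A reducer then sums the values per subscriber, and customers above a chosen usage threshold can be shortlisted for promotional offers.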
HIVE
Case Study: Process a structured data set to find some insights
- Finding out per driver total miles and hours driven
- Creating Table, Loading Data, Selecting Query to load, query and cleaning of data
- Finding which driver has driven the maximum & minimum miles
- Joining Tables, Saving Query results to table to explore and use right type of table type, partition schema, buckets
- Discussing optimum file format for hive table
- Using right file format, type of table, partition scheme to optimize query performance
- Using UDFs to reuse domain specific implementations
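A minimal HiveQL sketch for the per-driver aggregation is shown below, wrapped in Python so it can be scheduled; the table name, column names and file path are assumptions rather than the course data set.

    # run_hive.py – runs an illustrative HiveQL aggregation through the Hive CLI
    import subprocess

    hql = """
    CREATE TABLE IF NOT EXISTS trips (driver_id STRING, miles DOUBLE, hours DOUBLE)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    LOAD DATA INPATH '/data/trips.csv' INTO TABLE trips;
    SELECT driver_id, SUM(miles) AS total_miles, SUM(hours) AS total_hours
    FROM trips
    GROUP BY driver_id
    ORDER BY total_miles DESC;
    """
    subprocess.run(["hive", "-e", hql], check=True)   # "hive -e" executes the quoted statements

The first and last rows of the ordered result answer the maximum/minimum miles question; partitioning and file-format choices (e.g. ORC) then become query-performance tuning on top of the same statements.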
PIG
Case Study: Perform ETL processing on Data Set to find some insights
- Loading and exploring the MovieLens 100K data set: loading it, exploring it and associating a schema with it
- Using grunt, Loading data set, defining schema
- Finding simple statistics from the given data set to clean up the data
- Filtering and modifying data schema
- Finding gender distribution in users
- Aggregating and looping
- Finding the top 25 movies by rating, joining data sets and saving to HDFS to perform aggregation
- Dumping, Storing, joining, sorting
- Writing filter functions for complex conditions to reuse domain-specific functionality & avoid rewriting code
- Using UDFs
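A minimal Pig Latin sketch for the gender-distribution step is shown below, written out from Python so it can be submitted with pig -f. The u.user layout (pipe-delimited user_id|age|gender|occupation|zip) follows the MovieLens 100K convention, and the HDFS paths are illustrative.

    # run_pig.py – writes a small Pig Latin script and submits it with the Pig client
    import subprocess

    pig_script = """
    users = LOAD '/data/ml-100k/u.user' USING PigStorage('|')
            AS (user_id:int, age:int, gender:chararray, occupation:chararray, zip:chararray);
    by_gender = GROUP users BY gender;
    gender_counts = FOREACH by_gender GENERATE group AS gender, COUNT(users) AS cnt;
    STORE gender_counts INTO '/output/gender_distribution';
    """
    with open("gender.pig", "w") as f:
        f.write(pig_script)
    subprocess.run(["pig", "-f", "gender.pig"], check=True)

The top-25-movies exercise follows the same shape: LOAD the ratings file, GROUP BY movie id, aggregate with AVG or COUNT, ORDER and LIMIT, then JOIN against the movie titles before STORE.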
SPARK
Case Study: Building a model to predict production errors/failures (across huge numbers of servers – applications/software) with good speed, using computation power efficiently while considering processing challenges
- Loading and performing pre-processing to convert unstructured data to some structured data format
- Cleaning data, filtering out bad records, converting data to more usable format
- Aggregating data based on Response Code to find out servers' performance from logs
- Filtering, joining and aggregating data to find the top 20 most frequent hosts that generate errors
Spark Project
Case Study: Building a model (using Python) to predict production errors/failures (across huge numbers of servers – applications/software) with good speed, using computation power efficiently while considering processing challenges
- Loading and performing pre-processing to convert unstructured data to some structured data format
- Cleaning data, filtering out bad records, converting data to more usable format
- Aggregating data based on Response Code to find out servers' performance from logs
- Filtering, joining and aggregating data to find the top 20 most frequent hosts that generate errors
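A minimal PySpark sketch of these log-processing steps is shown below; the log path and the regular expressions assume a common web-server access-log layout and are not the course's actual data set.

    # log_analysis.py – parses server logs, aggregates by response code, finds error-prone hosts
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_extract, col

    spark = SparkSession.builder.appName("log-analysis").getOrCreate()

    raw = spark.read.text("/data/server_logs/")               # one log line per row, path illustrative
    logs = raw.select(
        regexp_extract("value", r"^(\S+)", 1).alias("host"),               # client host
        regexp_extract("value", r"\s(\d{3})\s", 1).alias("response_code")  # HTTP status code
    ).filter(col("response_code") != "")                       # drop lines that failed to parse

    logs.groupBy("response_code").count().show()               # servers' performance by response code

    (logs.filter(col("response_code").startswith("5"))         # server-side errors only
         .groupBy("host").count()
         .orderBy(col("count").desc())
         .limit(20)
         .show())                                              # top 20 hosts generating errors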
OOZIE
Case Study: Setting up a data processing pipeline that runs on a schedule in the Hadoop ecosystem, comprising multiple components like Sqoop jobs, Hive scripts, Pig scripts, Spark jobs, etc.
- Setting up an Oozie workflow to trigger a script, then a Sqoop job, followed by a Hive job
- Executing the workflow to run the complete ETL pipeline
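Purely as an illustration, a skeleton of such a workflow definition is shown below, written out from Python and submitted with the Oozie command line; the action names, JDBC details, script names and properties are placeholders, and only the Sqoop and Hive steps of the pipeline are sketched.

    # submit_workflow.py – writes a skeleton Oozie workflow.xml and submits it with the Oozie CLI
    import subprocess

    workflow_xml = """<workflow-app name="etl-pipeline" xmlns="uri:oozie:workflow:0.5">
      <start to="ingest"/>
      <action name="ingest">
        <sqoop xmlns="uri:oozie:sqoop-action:0.4">
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <command>import --connect jdbc:mysql://dbhost/warehouse --table orders --target-dir /data/orders</command>
        </sqoop>
        <ok to="transform"/><error to="fail"/>
      </action>
      <action name="transform">
        <hive xmlns="uri:oozie:hive-action:0.5">
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <script>etl.hql</script>
        </hive>
        <ok to="end"/><error to="fail"/>
      </action>
      <kill name="fail"><message>Pipeline failed</message></kill>
      <end name="end"/>
    </workflow-app>"""

    with open("workflow.xml", "w") as f:
        f.write(workflow_xml)
    # The workflow directory (workflow.xml plus etl.hql) is first copied to HDFS; job.properties
    # points oozie.wf.application.path at it before the job is launched.
    subprocess.run(["oozie", "job", "-config", "job.properties", "-run"], check=True)

Adding an Oozie coordinator on top of this workflow is what turns the one-shot pipeline into the scheduled run described in the case study.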
HBASE
Case Study: Finding the top 10 customers by expenditure, the top 10 best-selling brands, and monthly sales from data stored in HBase as key-value pairs
- Designing the HBase table schema to model the table structure and deciding column families as per the data
- Bulk loading & programmatically loading data using Java APIs to populate data into the HBase table
- Querying and showing data on a UI to integrate HBase with UI/Reporting
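The course populates the table through the Java client APIs; only as a parallel illustration, a minimal Python sketch using the third-party happybase library (an assumption, not part of the course stack) is shown below, against a hypothetical transactions table that is assumed to already exist.

    # hbase_sketch.py – writes and scans a key-value transactions table via happybase (HBase Thrift)
    import happybase

    conn = happybase.Connection("hbase-host")             # requires the HBase Thrift server to be running
    table = conn.table("transactions")                    # table name and column family are hypothetical

    # Row key design: customerId-yyyymmdd; one "txn" column family holding brand and amount
    table.put(b"cust001-20240101", {b"txn:brand": b"BrandA", b"txn:amount": b"2500"})

    total = 0
    for key, data in table.scan(row_prefix=b"cust001-"):  # scan all rows for one customer by key prefix
        total += int(data[b"txn:amount"])
    print("cust001 spent", total)

The same prefix-scan idea, run over all customers or driven from the UI layer, is what feeds the top-10-customers and monthly-sales reports.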
LIVE PROJECT 1
Project: ETL processing of retail logs
- Finding the demand for a given product
- Finding the trend and seasonality of a product
- Understanding the performance of the retail chain
LIVE PROJECT 2
Project: Creating a 360-degree view (past, present and future) of the customer for a retail company – avoiding repetition or re-keying of information, viewing customer history, establishing context and initiating desired actions
- Exploring and checking the basic quality of the data to understand the data and the need for filtering/pre-processing
- Loading data into RDBMS table to simulate real world scenarios where data is persisted in RDBMS table
- Developing & executing Sqoop Job to ingest Data in Hadoop Cluster to perfrom further actions
- Developing & executing Pig script to perform required ETL processing on ingested Data
- Developing & executing Hive Queries to get reports out of processed data
BIG DATA/HADOOP WITH MACHINE LEARNING
Candidates will take the entire Package Two in addition to the topics below
- Introduction to Machine Learning
- Linear Regression with One Variable (a gradient-descent sketch follows this list)
- Linear Algebra Review
- Linear Regression with Multiple Variables
- Octave/Matlab Tutorial
- Logistic Regression
- Regularization
- Neural Networks: Learning
- Advice for Applying Machine Learning
- Machine Learning System design
- Support Vector Machines
- Unsupervised Learning
- Dimensionality Reduction
- Anomaly Detection
- Recommender Systems
- Large Scale Machine Learning
- Application Example: Photo OCR
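For the linear-regression topics above, a tiny single-variable gradient-descent sketch on made-up data is shown below (in Python with NumPy, rather than the Octave/Matlab used in the lectures).

    # linear_regression_gd.py – batch gradient descent for y ≈ theta0 + theta1 * x on toy data
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])            # made-up feature values
    y = np.array([2.1, 4.2, 6.1, 8.3])            # made-up targets (roughly y = 2x)
    theta0, theta1, lr = 0.0, 0.0, 0.01           # parameters and learning rate

    for _ in range(5000):
        err = (theta0 + theta1 * x) - y           # prediction error for every sample
        theta0 -= lr * err.mean()                 # gradient of the squared-error cost w.r.t. theta0
        theta1 -= lr * (err * x).mean()           # gradient of the squared-error cost w.r.t. theta1

    print(theta0, theta1)                         # should end up near 0 and 2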
LIVE PROJECT
Live projects will be picked from Kaggle.com Machine Learning Projects
Data Analyst Pkg – ₦200,000 per candidate
Data Engineer Pkg – ₦200,000 per candidate
Big Data & Machine Learning – ₦200,000 per candidate