Hadoop Internals

By attending the Hadoop Internals workshop, participants will learn:

  • The internals of MapReduce and HDFS, and how the Hadoop architecture fits together
  • Proper cluster configuration and deployment to integrate with systems and hardware in the data center
  • How to load data into the cluster from dynamically generated files using Flume and from RDBMS using Sqoop
  • Configuring the Fair Scheduler to provide service-level agreements for multiple users of a cluster
  • Installing and implementing Kerberos-based security for your cluster
  • Best practices for preparing and maintaining Apache Hadoop in production
  • Troubleshooting, diagnosing, tuning, and solving Hadoop issues

In the Hadoop Internals training course, participants will gain a comprehensive understanding of all the steps necessary to operate and maintain a Hadoop cluster. Covering topics from installation and configuration through load balancing and tuning, the course prepares participants for the real-world challenges faced by Hadoop administrators.

The Hadoop Internals course covers concepts addressed on the Cloudera Certified Administrator for Apache Hadoop (CCAH) exam.

The Hadoop Internals class is designed for system administrators and IT managers who have basic Linux systems administration experience. Prior knowledge of Hadoop is not required.

It is intended for system administrators and others responsible for managing Apache Hadoop clusters in production or development environments.

COURSE AGENDA

  • Analyzing the Data with Hadoop
  • Map and Reduce
  • Java MapReduce
  • Scaling Out
  • Data Flow
  • Combiner Functions
  • Running a Distributed MapReduce Job
  • Hadoop Streaming
    • Ruby
    • Python
  • Hadoop Pipes
  • Constructing the basic template of a MapReduce program (sketch after this group)
  • Counting things
  • Adapting for Hadoop’s API changes
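
As a concrete illustration of the basic template topic above, here is a minimal word-count sketch against the org.apache.hadoop.mapreduce API; the class names and argument handling are illustrative only, not part of the course materials.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every token in the input line.
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              ctx.write(word, ONE);
            }
          }
        }
      }

      // Reduce phase: sum the counts collected for each word.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable c : counts) {
            sum += c.get();
          }
          ctx.write(word, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class); // pre-aggregate map output
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Reusing the reducer as the combiner, as in the driver above, is the simplest form of the combiner optimization revisited under "Improving performance with combiners" below.
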
  • Streaming in Hadoop
    • Streaming with Unix commands
    • Streaming with scripts
    • Streaming with key/value pairs
    • Streaming with the Aggregate package
  • Improving performance with combiners
  • Moving computation, not data
  • Hadoop performance and data scale facts
  • Hadoop in the context of other data stores
  • The Apache Hadoop Project
  • Hadoop - an inside view: MapReduce and HDFS
  • The Hadoop Ecosystem
  • What about NoSQL?
  • Comparison with Other Systems
  • RDBMS
  • Grid Computing
  • Volunteer Computing
  • A Brief History of Hadoop
  • Apache Hadoop and the Hadoop Ecosystem
  • Hadoop Releases
  • The Design of HDFS
  • HDFS Concepts
    • Blocks
    • Namenodes and Datanodes
    • HDFS Federation
    • HDFS High-Availability
  • The Command-Line Interface
    • Basic Filesystem Operations
  • Hadoop Filesystems
  • Interfaces
  • The Java Interface (sketch after this group)
    • Reading Data from a Hadoop URL
    • Reading Data Using the FileSystem API
    • Writing Data
    • Directories
    • Querying the Filesystem
    • Deleting Data
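
The Java Interface topics above can be summarized in one short sketch, assuming a default Configuration that picks up the cluster settings from the classpath; the /user/demo paths are illustrative.

    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class FileSystemTour {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        Path src = new Path("/user/demo/input.txt"); // illustrative paths
        Path dst = new Path("/user/demo/copy.txt");

        // Reading data using the FileSystem API, and writing it back out.
        try (InputStream in = fs.open(src);
             OutputStream out = fs.create(dst, true)) { // overwrite if present
          IOUtils.copyBytes(in, out, 4096, false);
        }

        // Querying the filesystem.
        for (FileStatus stat : fs.listStatus(new Path("/user/demo"))) {
          System.out.println(stat.getPath() + "\t" + stat.getLen());
        }

        // Deleting data (second argument: recursive delete).
        fs.delete(dst, false);
      }
    }
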
  • Data Flow
    • Anatomy of a File Read
    • Anatomy of a File Write
    • Coherency Model
  • Parallel Copying with distcp
    • Keeping an HDFS Cluster Balanced
    • Hadoop Archives
  • Using Hadoop Archives
    • Limitations
  • Data Integrity 
    • Data Integrity in HDFS
    • Local FileSystem
    • Checksum FileSystem
  • Compression
    • Codecs
    • Compression and Input Splits
    • Using Compression in MapReduce
  • Serialization
    • The Writable Interface
    • Writable Classes
    • Implementing a Custom Writable (sketch after this group)
    • Serialization Frameworks
    • Avro
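
A minimal sketch of a custom Writable for the topic above; the pair type and its fields are hypothetical.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    // Hypothetical pair type: a string tag plus a long value.
    public class TextLongPair implements Writable {
      private final Text tag = new Text();
      private long value;

      public void set(String t, long v) {
        tag.set(t);
        value = v;
      }

      @Override
      public void write(DataOutput out) throws IOException {
        tag.write(out);        // delegate to the embedded Writable
        out.writeLong(value);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        tag.readFields(in);    // fields must be read in the same order they were written
        value = in.readLong();
      }

      @Override
      public String toString() {
        return tag + "\t" + value;
      }
    }

A type used as a MapReduce key would implement WritableComparable instead, adding a compareTo() method so the framework can sort it during the shuffle.
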
  • File-Based Data Structures
    • SequenceFile (writer sketch after this group)
    • MapFile
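
A short SequenceFile writer sketch for the topic above, using the classic createWriter(FileSystem, Configuration, Path, keyClass, valueClass) form; the path and record contents are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/data.seq"); // illustrative path

        SequenceFile.Writer writer = null;
        try {
          writer = SequenceFile.createWriter(fs, conf, path,
              IntWritable.class, Text.class);
          IntWritable key = new IntWritable();
          Text value = new Text();
          for (int i = 0; i < 100; i++) {
            key.set(i);
            value.set("record-" + i);
            writer.append(key, value); // keys and values are any Writable types
          }
        } finally {
          IOUtils.closeStream(writer);
        }
      }
    }
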
  • Chaining MapReduce jobs
    • Chaining MapReduce jobs in a sequence (driver sketch after this group)
    • Chaining MapReduce jobs with complex dependency
    • Chaining preprocessing and postprocessing steps
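
Chaining jobs in a sequence usually comes down to running one Job after another and feeding the first job's output to the second, as in this sketch; paths and job names are illustrative, and the mapper/reducer settings are elided.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwoStepDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]); // output of step 1, input of step 2
        Path output = new Path(args[2]);

        Job first = Job.getInstance(conf, "step 1");
        // ... set mapper/reducer and key/value classes for step 1 here ...
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        if (!first.waitForCompletion(true)) {
          System.exit(1); // do not start step 2 if step 1 failed
        }

        Job second = Job.getInstance(conf, "step 2");
        // ... set mapper/reducer and key/value classes for step 2 here ...
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        System.exit(second.waitForCompletion(true) ? 0 : 1);
      }
    }

For jobs with complex dependencies, the JobControl and ControlledJob classes let several jobs declare prerequisites and run as a managed group.
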
  • Joining data from different sources
    • Reduce-side joining
    • Replicated joins using DistributedCache (mapper sketch after this group)
    • Semijoin: reduce-side join with map-side filtering
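
A sketch of a replicated (map-side) join mapper: the small table is shipped to every task through the distributed cache and loaded into memory in setup(), so no reduce phase is needed. The cache file name, its tab-separated layout, and the driver-side job.addCacheFile(...) call are all assumptions for illustration.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ReplicatedJoinMapper
        extends Mapper<LongWritable, Text, Text, Text> {

      private final Map<String, String> smallTable = new HashMap<>();

      @Override
      protected void setup(Context ctx) throws IOException {
        // Assumes the driver called job.addCacheFile(new URI(".../small_table.txt")).
        // Cached files are localized into the task's working directory.
        URI[] cacheFiles = ctx.getCacheFiles();
        String localName = new Path(cacheFiles[0].getPath()).getName();
        try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
          String line;
          while ((line = reader.readLine()) != null) {
            String[] kv = line.split("\t", 2); // assumed "key<TAB>value" layout
            smallTable.put(kv[0], kv[1]);
          }
        }
      }

      @Override
      protected void map(LongWritable offset, Text record, Context ctx)
          throws IOException, InterruptedException {
        String[] kv = record.toString().split("\t", 2);
        String joined = smallTable.get(kv[0]);
        if (joined != null) { // inner-join semantics: drop non-matching records
          ctx.write(new Text(kv[0]), new Text(kv[1] + "\t" + joined));
        }
      }
    }
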
  • Creating a Bloom filter
    • What does a Bloom filter do?
    • Implementing a Bloom filter (sketch after this group)
    • Bloom filter in Hadoop version 0.20+
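
Hadoop ships a Bloom filter implementation in org.apache.hadoop.util.bloom; the sketch below shows the basic add/membership-test cycle. The vector size and hash count are illustrative and should be derived from the expected key count and the tolerable false-positive rate.

    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;
    import org.apache.hadoop.util.hash.Hash;

    public class BloomDemo {
      public static void main(String[] args) {
        // 10,000-bit vector with 5 hash functions (illustrative sizing).
        BloomFilter filter = new BloomFilter(10000, 5, Hash.MURMUR_HASH);

        filter.add(new Key("alice".getBytes()));
        filter.add(new Key("bob".getBytes()));

        // A Bloom filter can return false positives, never false negatives.
        System.out.println(filter.membershipTest(new Key("alice".getBytes()))); // true
        System.out.println(filter.membershipTest(new Key("carol".getBytes()))); // almost certainly false
      }
    }

In a semijoin, a filter built from one dataset's keys can be distributed to the mappers so that non-matching records are dropped before the shuffle.
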
  • The Configuration API (sketch after this group)
  • Configuring the Development Environment
  • Running Locally on Test Data
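
The Configuration API in one small sketch; the resource file name and property names are illustrative.

    import org.apache.hadoop.conf.Configuration;

    public class ConfDemo {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("configuration-1.xml"); // illustrative resource on the classpath

        // Properties can also be set and read programmatically.
        conf.set("color", "yellow");
        System.out.println(conf.get("color"));      // "yellow"
        System.out.println(conf.getInt("size", 0)); // 0 (the supplied default)
      }
    }
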
  • Cluster Specs
  • Cluster Setup and Installation
  • Hadoop Configuration
  • YARN Configuration
  • Benchmarking a Hadoop Cluster
  • Hadoop in the Cloud
  • Tuning
  • MapReduce Workflows
  • Monitoring and debugging on a production cluster
  • Tuning for performance
  • Anatomy of a MapReduce Job Run
    • Classic MapReduce (MapReduce 1)
    • YARN (MapReduce 2)
  • Failures
    • Failures in Classic MapReduce
    • Failures in YARN
  • Job Scheduling
    • The Fair Scheduler
    • The Capacity Scheduler
  • Shuffle and Sort
    • The Map Side
    • The Reduce Side
    • Configuration Tuning
  • Task Execution
    • The Task Execution Environment
    • Speculative Execution
    • Output Committers
    • Task JVM Reuse
    • Skipping Bad Records
  • Setting up parameter values for practical use
  • Checking the system’s health
  • Setting permissions
  • Managing quotas
  • Enabling trash
  • Removing DataNodes
  • Adding DataNodes
  • Managing NameNode and Secondary NameNode
  • Recovering from a failed NameNode
  • Designing network layout and rack awareness
  • MapReduce Features
    • Counters (counter sketch after this group)
    • Sorting
    • Joins
    • Side Data Distribution
    • MapReduce Library
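
User-defined counters, the first feature in the group above, are typically declared as an enum and incremented from tasks, as in this illustrative mapper; the counter names and the assumed two-field record format are not from the course materials.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ValidatingMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

      // User-defined counters are declared as an enum (names are illustrative).
      enum RecordQuality { GOOD, MALFORMED }

      @Override
      protected void map(LongWritable offset, Text record, Context ctx)
          throws IOException, InterruptedException {
        if (record.toString().split("\t").length < 2) { // assumed two-field format
          ctx.getCounter(RecordQuality.MALFORMED).increment(1);
          return; // skip the bad record instead of failing the task
        }
        ctx.getCounter(RecordQuality.GOOD).increment(1);
        ctx.write(record, NullWritable.get());
      }
    }

Counter totals are aggregated across all tasks and reported with the job's final status, which makes them a cheap monitoring tool.
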
  • Pig
    • Thinking like a Pig
      • Data flow language
      • Data types
      • User-defined functions
  • Installing Pig
    • Managing the Grunt shell
    • Learning Pig Latin through Grunt
  • Speaking Pig Latin
    • Data types and schemas
    • Expressions and functions
    • Relational operators
    • Execution optimization
  • Hive
    • Installing and configuring Hive
    • Example queries
    • HiveQL in detail
    • Hive summary
  • HBase
    • Introduction
    • Concepts
    • Clients
    • HBase vs. RDBMS