Introduction to Big Data

Data sets that can grow rapidly must remain manageable. This Introduction to Big Data course provides the knowledge and training to use new Big Data tools and techniques, and presents ways of storing information that allow for efficient processing and analysis in support of informed business decision-making. Participants also learn to store, manage, process and analyze massive amounts of unstructured data.

By attending the Introduction to Big Data workshop, participants will learn to:

  • Integrate Big Data components to create an appropriate Data Lake
  • Select the correct Big Data stores for disparate data sets
  • Process large data sets using Hadoop to extract value
  • Query large data sets in near real time with Pig and Hive
  • Plan and implement a Big Data strategy for the organization

COURSE AGENDA

  • Data models: key-value, graph, document, column-family (sketched in Python after this agenda)
  • Hadoop Distributed File System
  • HBase
  • Hive
  • Cassandra
  • Hypertable
  • Amazon S3
  • BigTable
  • MongoDB
  • Redis
  • Riak
  • Neo4J
  • Selecting data sources for analysis
  • Eliminating redundant data
  • Establishing the role of NoSQL
  • Establishing the business importance of Big Data
  • Addressing the challenge of extracting useful data
  • Integrating Big Data with traditional data
  • The four dimensions of Big Data: volume, velocity, variety, veracity
  • Introducing the Storage, MapReduce and Query Stack
  • Choosing the correct data stores based on your data characteristics
  • Moving code to data
  • Implementing polyglot data store solutions
  • Aligning business goals to the appropriate data store
  • Mapping data to the programming framework
  • Connecting and extracting data from storage
  • Transforming data for processing
  • Subdividing data in preparation for Hadoop MapReduce
  • Creating the components of Hadoop MapReduce jobs (a word-count sketch follows this agenda)
  • Distributing data processing across server farms
  • Executing Hadoop MapReduce jobs
  • Monitoring the progress of job flows
  • Distinguishing Hadoop daemons
  • Investigating the Hadoop Distributed File System
  • Selecting appropriate execution modes: local, pseudo-distributed and fully distributed
  • Comparing real-time processing models
  • Leveraging Storm to extract live events
  • Lightning-fast processing with Spark and Shark
  • Communicating with Hadoop in Pig Latin
  • Executing commands using the Grunt Shell
  • Streamlining high-level processing
  • Persisting data in the Hive Metastore
  • Performing queries with HiveQL (an example query follows this agenda)
  • Investigating Hive file formats
  • Mining data with Mahout
  • Visualizing processed results with reporting tools
  • Querying in real time with Impala
  • Establishing your Big Data needs
  • Meeting business goals with timely data
  • Evaluating commercial Big Data tools
  • Managing organizational expectations
  • Focusing on business importance
  • Framing the problem
  • Selecting the correct tools
  • Achieving timely results
  • Selecting suitable vendors and hosting options
  • Balancing costs against business value
  • Keeping ahead of the curve
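
The agenda's data-model topics can be previewed with nothing more than Python dictionaries. The sketch below is illustrative only; the record and its field names are invented, and it simply shows how one customer record might be shaped under each of the key-value, document, column-family and graph models.

    # Plain Python, no database required: the same record under four data models.

    # Key-value: one opaque value per key; the store does not interpret the value.
    key_value = {"customer:42": '{"name": "Ada", "city": "London"}'}

    # Document: a nested, self-describing record whose fields can be queried.
    document = {"_id": 42, "name": "Ada", "city": "London",
                "orders": [{"sku": "B-17", "qty": 2}]}

    # Column-family: columns grouped into families under a single row key.
    column_family = {"row:42": {"profile": {"name": "Ada", "city": "London"},
                                "orders": {"B-17": 2}}}

    # Graph: entities become nodes and relationships become first-class edges.
    graph = {"nodes": {42: {"label": "Customer", "name": "Ada"},
                       7: {"label": "Product", "sku": "B-17"}},
             "edges": [(42, "PURCHASED", 7)]}

Choosing among these models, and among the stores that implement them (Redis, MongoDB, HBase, Cassandra, Neo4J and the others listed above), comes down to how the data will be queried, which is the focus of the data-store selection topics in the agenda.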
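
The Hadoop MapReduce topics lend themselves to a small worked example. Below is a minimal word-count sketch using Hadoop Streaming, which lets the mapper and reducer be two ordinary Python scripts that read standard input and write standard output; the script names and any input/output paths are assumptions for illustration.

    # mapper.py -- emit "word<TAB>1" for every word seen on standard input.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py -- sum the counts for each word. Hadoop sorts mapper output by key,
    # so all lines for a given word arrive consecutively.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

Submitted through the hadoop-streaming jar with -mapper, -reducer, -input and -output arguments, the same pair of scripts runs unchanged in local, pseudo-distributed or fully distributed mode, which is exactly the execution-mode distinction drawn in the agenda.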
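
For the Hive topics, the short sketch below issues a HiveQL aggregation from Python through the third-party PyHive package. The host, port, username, table name (web_logs) and column (city) are all assumptions, and a reachable HiveServer2 endpoint is required for the code to run.

    from pyhive import hive  # third-party package, not part of the standard library

    # Connect to a HiveServer2 instance (hypothetical endpoint and credentials).
    conn = hive.Connection(host="localhost", port=10000, username="analyst")
    cursor = conn.cursor()

    # HiveQL looks like SQL but is compiled into distributed jobs over files in HDFS.
    cursor.execute(
        "SELECT city, COUNT(*) AS visits "
        "FROM web_logs "
        "GROUP BY city "
        "ORDER BY visits DESC "
        "LIMIT 10"
    )
    for city, visits in cursor.fetchall():
        print(city, visits)

    cursor.close()
    conn.close()

Impala, also listed in the agenda, accepts a largely compatible SQL dialect and shares the same metastore, but answers interactively, which is the near-real-time querying contrast the course draws.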