Hadoop :
Apache Hadoop is a framework that allows distributed processing of large datasets across clusters of commodity computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each providing local computation and storage. Rather than relying on hardware to deliver high availability, the framework itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure. In short, Hadoop is an open-source software framework for storing and processing big data in a distributed way on large clusters of commodity hardware. Basically, it accomplishes two tasks: 1. Massive data storage. 2. Faster processing.
Core Hadoop Components :
- Hadoop Common: This package provides file-system and OS-level abstractions. It contains the libraries and utilities required by other Hadoop modules.
- Hadoop Distributed File System (HDFS): HDFS is a distributed file system that stores large files across the machines of the cluster and provides a limited interface for managing them.
- Hadoop MapReduce: MapReduce is the programming model the Hadoop MapReduce engine uses to distribute work around a cluster: a map phase transforms input records into intermediate key-value pairs, and a reduce phase aggregates all values that share the same key (a minimal word-count sketch follows this list).
- Hadoop Yet Another Resource Negotiator (YARN) (MapReduce 2.0): YARN is a resource-management platform responsible for managing compute resources in clusters and scheduling users’ applications on them.
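To make the map and reduce phases concrete, here is a minimal word-count sketch written as two Python scripts for use with Hadoop Streaming; the file names, sample paths and any job-submission details are illustrative assumptions, not part of any particular distribution.

```python
#!/usr/bin/env python3
# mapper.py (hypothetical file name): reads raw text lines from standard
# input and emits one tab-separated "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py (hypothetical file name): Hadoop Streaming sorts the mapper
# output by key, so all counts for a word arrive consecutively; sum them
# and emit "word<TAB>total".
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A job like this would typically be submitted with the hadoop-streaming jar, passing the two scripts via -files, -mapper and -reducer and pointing -input and -output at HDFS paths.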
Hadoop Ecosystem :
HBase
HBase “is an open-source, distributed, versioned, column-oriented store” that sits on top of HDFS. HBase is based on Google’s Bigtable. Because HBase is organized by columns rather than rows, operations are faster when they need to be performed on similar values across massive datasets; for example, read/write operations that involve all rows but only a small subset of all columns. HBase does not provide its own query or scripting language, but is accessible through Java, Thrift and REST APIs.
Hive
Hive provides a warehouse structure for other Hadoop input sources and SQL-like access to data in HDFS. Hive’s query language, HiveQL, compiles to MapReduce and also allows user-defined functions (UDFs). Hive’s data model is based primarily on three related data structures: tables, partitions and buckets. Tables correspond to HDFS directories that are divided into partitions, which in turn can be divided into buckets.
(Figure 2.3 Hadoop ecosystem: HDFS at the base; MapReduce for distributed processing; HBase as a columnar database; Hive for querying; Pig for scripting; Mahout for machine learning; Sqoop for data transfer between Hadoop and RDBMSs; Oozie for workflow and scheduling; ZooKeeper for coordination.)
HCatalog
HCatalog is a metadata and table storage management service for HDFS. HCatalog’s goal is to simplify the user’s interaction with HDFS data and enable data sharing between tools and execution platforms.
Pig
Pig is a run-time environment that allows users to execute MapReduce on a Hadoop cluster. Pig Latin is the high-level scripting language of the Pig platform. Like HiveQL in Hive, Pig Latin is a higher-level language that compiles to MapReduce. Pig is more flexible than Hive with respect to possible data formats, owing to its data model. Pig’s data model is similar to the relational data model, but here tuples can be nested; for example, a table of tuples can have a table in the third field of each tuple. In Pig, tables are called bags. Pig also has a “map” data type, which is useful for representing semi-structured data such as JSON or XML.
Sqoop
Sqoop (“SQL-to-Hadoop”) is a tool that transfers data in both directions between relational systems and HDFS or other Hadoop data stores such as Hive or HBase. Sqoop can be used to import data from external structured databases into HDFS or related systems such as Hive and HBase. Conversely, Sqoop can also extract data from Hadoop and export it to external structured databases such as relational databases and enterprise data warehouses.
Oozie
Oozie is a job coordinator and workflow manager for jobs executed in Hadoop. It is integrated with the rest of the Apache Hadoop stack. It supports several types of Hadoop jobs, such as Java MapReduce, Streaming MapReduce, Pig, Hive and Sqoop, as well as system-specific jobs such as Java programs and shell scripts. An Oozie workflow is a collection of actions and Hadoop jobs arranged in a Directed Acyclic Graph (DAG), since tasks are executed in a sequence and are also subject to certain constraints.
Mahout
Mahout is a scalable machine-learning and data-mining library. There are currently four main groups of algorithms in Mahout:
1. Recommendations/Collaborative filtering.
2. Classification/Categorization.
3. Clustering.
4. Frequent item-set mining/Parallel frequent pattern mining.
Mahout is not simply a collection of pre-existing data mining algorithms. Many machine learning algorithms are non-scalable; that is, given the types of operations they perform, they cannot be executed as a set of parallel processes. Algorithms in the Mahout library, by contrast, have been written for MapReduce and can be executed in a distributed fashion.
ZooKeeper
ZooKeeper is a distributed service with master and slave nodes for storing and maintaining configuration information, naming, providing distributed synchronization and providing group services in memory on ZooKeeper servers. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers called znodes. HBase depends on ZooKeeper and runs a ZooKeeper instance by default.
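To illustrate the znode namespace, below is a brief sketch using kazoo, a third-party Python client for ZooKeeper; the server address, znode paths and configuration value are placeholder assumptions.

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (placeholder address).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# znodes form a hierarchical namespace, much like file-system paths.
zk.ensure_path("/myapp/config")
if not zk.exists("/myapp/config/db_url"):
    zk.create("/myapp/config/db_url", b"jdbc:mysql://db:3306/app")

# Any process in the cluster can read (and watch) the same register.
data, stat = zk.get("/myapp/config/db_url")
print(data.decode(), stat.version)

zk.stop()
```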
Apache Spark :
Apache Spark is an open-source, general-purpose distributed computing engine for processing and analyzing large amounts of data. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Like Hadoop MapReduce, it distributes data across the cluster and processes that data in parallel.
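As a brief sketch of how this looks in practice, the PySpark example below counts words in a text file; the application name and input path are illustrative assumptions.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; Spark takes care of distributing the work.
spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

# Each transformation runs in parallel across the cluster's partitions;
# nothing executes until an action (collect) is called.
counts = (
    sc.textFile("hdfs:///data/input.txt")      # placeholder input path
      .flatMap(lambda line: line.split())
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)

for word, n in counts.collect():
    print(word, n)

spark.stop()
```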
NoSQL :
What is NoSQL?
NoSQL encompasses a wide variety of different database technologies that were developed in response to the demands presented in building modern applications:
- Developers are working with applications that create massive volumes of new, rapidly changing data types: structured, semi-structured, unstructured and polymorphic data.
- Long gone is the twelve-to-eighteen-month waterfall development cycle. Now small teams work in agile sprints, iterating quickly and pushing code every week or two, some even multiple times a day.
- Applications that once served a finite audience are now delivered as services that must be always on, accessible from many different devices and scaled globally to millions of users.
- Organizations are now turning to scale-out architectures built on open-source software, commodity servers and cloud computing instead of large monolithic servers and storage infrastructure.
Relational databases were not designed to cope with the scale and agility challenges that face modern applications, nor were they built to take advantage of the commodity storage and processing power available today.
NoSQL Database Types
- Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
- Graph stores are used to store information about networks of data, such as social connections. Graph stores include Neo4J and Giraph.
- Key-value stores are the simplest NoSQL databases. Every item in the database is stored as an attribute name (or ‘key’) together with its value. Examples of key-value stores are Riak and Berkeley DB. Some key-value stores, such as Redis, allow each value to have a type, such as ‘integer’, which adds functionality (see the sketch after this list).
- Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together instead of rows.
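As a small illustration of the key-value model (and of Redis’s typed values), here is a sketch using the redis-py client; the host, port and key names are placeholder assumptions.

```python
import redis

# Connect to a Redis server (placeholder host and port).
r = redis.Redis(host="localhost", port=6379)

# Every item is simply a key paired with a value.
r.set("user:42:name", "Alice")

# Redis understands integer values, so counters can be incremented atomically.
r.set("user:42:visits", 0)
r.incr("user:42:visits")

print(r.get("user:42:name"))    # b'Alice'
print(r.get("user:42:visits"))  # b'1'
```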
MongoDB
Hosted Database as a Service
Try out the easiest way to start learning and prototyping applications on MongoDB, the leading non-relational database.
Launching an application on any database typically requires careful planning to ensure performance, high availability, security, and disaster recovery, and these obligations continue for as long as you run the application. With MongoDB Atlas, you receive all of the features of MongoDB without any of the operational heavy lifting, allowing you to focus instead on learning and building your apps. Features include the following (a minimal connection sketch follows the list):
- On-demand, pay as you go model
- Seamless upgrades and auto-healing
- Fully elastic. Scale up and down with ease
- Deep monitoring & customizable alerts
- Highly secure by default
- Continuous backups with point-in-time recovery
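As a usage example, the sketch below connects to an Atlas-style cluster with the PyMongo driver, inserts a document and reads it back; the connection string, database and collection names are placeholder assumptions.

```python
from pymongo import MongoClient

# Atlas supplies a mongodb+srv connection string; this one is a placeholder.
client = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")

db = client["appdb"]        # databases and collections are created lazily
users = db["users"]

# Documents are flexible, nested key-value structures stored as BSON.
users.insert_one({"name": "Alice", "roles": ["admin"], "profile": {"plan": "free"}})

# Query by any field; indexes can be added later as the data model evolves.
print(users.find_one({"name": "Alice"}))
```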