Hadoop

In: Computers and Technology

Submitted By mayanist88
Words 8590
Pages 35

A Hands-On Guide to Getting Started with Hadoop

For more Hadoop-related information, see the Hadoop topic page: http://www.linuxidc.com/topicnews.aspx?tid=13

Technical Research Department, Beijing Kuanlian Shifang Digital Technology Co., Ltd. (July 2011)

Linux Community (LinuxIDC.com) is a professional Linux website covering Ubuntu, Fedora and SUSE technology as well as the latest IT news.


Table of Contents

1 Overview
1.1 What is Hadoop?
1.2 Why choose Hadoop?
1.2.1 System characteristics
1.2.2 Usage scenarios
2 Terminology
3 Single-node deployment of Hadoop
3.1 Purpose
3.2 Prerequisites
3.2.1 Supported platforms
3.2.2 Required software
3.2.3 Installing the software
3.3 Downloading
3.4 Preparing to run a Hadoop cluster…

Similar Documents

Hadoop Setup

...Hadoop Cluster Setup

Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop's HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, is designed to be deployed on low-cost hardware. This document describes how to install, configure and manage non-trivial Hadoop clusters ranging from a few nodes to extremely large clusters with thousands of nodes.

Required Software
Required software for Linux and Windows includes:
1. Java 1.6.x, preferably from Sun, must be installed.
2. ssh must be installed and sshd must be running in order to use the Hadoop scripts that manage remote Hadoop daemons.

Installation
Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster. Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively. These are the masters. The rest of the machines in the cluster act as both DataNode and TaskTracker. These are the slaves. The root of the distribution is referred to as HADOOP_HOME. All machines in the cluster usually have the same HADOOP_HOME path.

Steps for Installation
1. Install Java 1.6. Check the Java version:
$ java -version
2. Add a dedicated user and group:
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
3. Install ssh:
$ su - hduser......
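As a quick sanity check after the steps above, a client program can try to talk to the new NameNode. The following Java sketch is only an illustration: it assumes the Hadoop client libraries are on the classpath and that the NameNode answers at hdfs://localhost:9000, which should be replaced with the fs.default.name (or fs.defaultFS on newer releases) value from your own core-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; replace with the value from your core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);
        // Listing "/" succeeds only if the NameNode is up and reachable.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
        fs.close();
    }
}

If the NameNode is down or the address is wrong, the listStatus call fails immediately, which makes this a convenient first smoke test before submitting any jobs.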

Words: 1213 - Pages: 5

Hadoop Jitter

... Our contributions are: 1. the quantification and assessment of performance variation of data-intensive scientific workloads on a small set of homogeneous nodes running Hadoop, and 2. the development of an improved Hadoop scheduler that can improve the performance (and potentially the scalability) of these applications by leveraging the intrinsic performance variation of the system. Using our enhanced scheduler for data-intensive scientific workloads, we are able to obtain more than a 21% performance gain over the default Hadoop scheduler.

I. INTRODUCTION
Certain high-performance applications, such as weather prediction or algorithmic trading, require the analysis and aggregation of large amounts of data geo-spatially distributed across the world in a very short amount of time (i.e., on demand). A traditional supercomputer may be neither a practical nor an economical solution because it is not suitable for handling data that is distributed across the world. For such application domains, the ease and low cost of getting access to a cloud has been shown to be an advantage over high-performance clusters. The strength of cloud computing infrastructures has been the reliability and fault tolerance of an application at very large scale. Google's MapReduce [2] programming model and Yahoo's subsequent implementation of Hadoop [3] have allowed one to harness the power of such cloud infrastructures. However, for certain applications (particularly data-intensive scientific workloads) the small...

Words: 7930 - Pages: 32

Hadoop

...Cluster.

Hadoop Architecture: Two Components
* Distributed File System
* MapReduce Engine

HDFS Nodes
* Name Node
  * Only one per cluster
  * Manages the file system, namespace and metadata
  * A single point of failure, but mitigated by writing state to multiple file systems
* Data Node
  * Many per cluster
  * Manages blocks of data and serves them to nodes
  * Periodically reports to the Name Node the list of blocks it stores

MapReduce Nodes
* Job Tracker
* Task Tracker

PIG – a high-level Hadoop programming language that provides a data-flow language and execution framework for parallel computation. Created by Yahoo. Works like a built-in function layer for MapReduce: we write queries in Pig, and the queries get translated into MapReduce programs during execution.

HIVE – provides ad-hoc SQL-like queries for data aggregation and summarization. Written by Jeff at Facebook. A database on top of Hadoop; HiveQL is the query language. Runs like SQL, with fewer features than SQL.

HBASE – a database on top of Hadoop: a real-time distributed database on top of HDFS. It is based on Google's BigTable, a distributed non-RDBMS that can store billions of rows and columns in a single table spread across multiple servers. Handy for writing output from MapReduce to HBase (see the sketch below).

ZOOKEEPER – maintains the order of all the animals in Hadoop. Created by Yahoo. Helps to run distributed applications and maintain them in Hadoop.

SQOOP – Sqoops (imports) the data from an RDBMS to......
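As a small illustration of that last point, the following Java sketch writes a single value into HBase using the HBase client API (1.x-style). The table name wordcounts, the column family cf and the row shown are hypothetical; the table is assumed to exist already and ZooKeeper is assumed to be reachable with the default configuration.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteCountToHBase {
    public static void main(String[] args) throws Exception {
        // Assumes an existing table 'wordcounts' with column family 'cf'.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("wordcounts"))) {
            // Row key = the word, column cf:count = its count (stored as a string here).
            Put put = new Put(Bytes.toBytes("hadoop"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes("42"));
            table.put(put);
        }
    }
}

In practice a MapReduce job would usually emit such Puts from its reducer (for example via TableOutputFormat), but the standalone sketch keeps the idea visible.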

Words: 276 - Pages: 2

Yoyo

...Hello everyone and welcome to Hadoop Fundamentals – What is Hadoop. My name is Warren Pettit. In this video we will explain what Hadoop is and what Big Data is. We will define some Hadoop-related open source projects and give some examples of Hadoop in action. Imagine this scenario: you have 1 GB of data that you need to process. The data is stored in a relational database on your desktop computer, and this desktop computer has no problem handling the load. Then your company starts growing very quickly, and that data grows to 10 GB. And then 100 GB. And you start to reach the limits of your current desktop computer. So you scale up by investing in a larger computer, and you are then OK for a few more months. When your data grows to 10 TB, and then 100 TB, you are quickly approaching the limits of that computer. Moreover, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your management wants to derive information from both the relational data and the unstructured data, and wants this information as soon as possible. What should you do? Hadoop may be the answer! What is Hadoop? Hadoop is an open source project of the Apache Foundation. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google's MapReduce and Google File System technologies as its foundation. It is optimized to handle massive quantities of data......

Words: 577 - Pages: 3

Bigdata Etl

...White Paper: Big Data Analytics
Extract, Transform, and Load Big Data with Apache Hadoop*

ABSTRACT
Over the last few years, organizations across public and private sectors have made a strategic decision to turn big data into competitive advantage. The challenge of extracting value from big data is similar in many ways to the age-old problem of distilling business intelligence from transactional data. At the heart of this challenge is the process used to extract data from multiple sources, transform it to fit your analytical needs, and load it into a data warehouse for subsequent analysis, a process known as "Extract, Transform & Load" (ETL). The nature of big data requires that the infrastructure for this process scale cost-effectively. Apache Hadoop* has emerged as the de facto standard for managing big data. This whitepaper examines some of the platform hardware and software considerations in using Hadoop for ETL. We plan to publish other white papers that show how a platform based on Apache Hadoop can be extended to support interactive queries and real-time predictive analytics. When complete, these white papers will be available at http://hadoop.intel.com.

The ETL Bottleneck in Big Data Analytics
Big Data refers to the large amounts, at least terabytes, of poly-structured...

Words: 6174 - Pages: 25

Hadoop

...Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are commonplace and thus should be automatically handled in software by the framework.[3] The core of Apache Hadoop consists of a storage part (Hadoop Distributed File System (HDFS)) and a processing part (MapReduce). Hadoop splits files into large blocks and distributes them amongst the nodes in the cluster. To process the data, Hadoop MapReduce transfers packaged code for nodes to process in parallel, based on the data each node needs to process. This approach takes advantage of data locality[4]—nodes manipulating the data that they have on hand—to allow the data to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are connected via high-speed networking.[5] The base Apache Hadoop framework is composed of the following modules: Hadoop Common – contains libraries and utilities needed by other Hadoop modules; Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster; Hadoop YARN – a resource-management platform......
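To make the division of labor between HDFS and MapReduce described above concrete, here is a minimal sketch of the classic word-count job against the Hadoop MapReduce Java API. It assumes the newer org.apache.hadoop.mapreduce API (Hadoop 2.x or later); the input and output paths are placeholders supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs on the node holding each input block (data locality)
    // and emits (word, 1) for every token in its split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: receives all counts for one word and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar and submitted with something like hadoop jar wordcount.jar WordCount /input /output (paths hypothetical), the framework ships this code to the DataNodes that hold the input blocks rather than moving the data to the code.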

Words: 456 - Pages: 2

Big Analytics

...REVOLUTION ANALYTICS WHITE PAPER
Advanced 'Big Data' Analytics with R and Hadoop

'Big Data' Analytics as a Competitive Advantage
Big Analytics delivers competitive advantage in two ways compared to the traditional analytical model. First, Big Analytics describes the efficient use of a simple model applied to volumes of data that would be too large for the traditional analytical environment. Research suggests that a simple algorithm with a large volume of data is more accurate than a sophisticated algorithm with little data. The algorithm is not the competitive advantage; the ability to apply it to huge amounts of data, without compromising performance, generates the competitive edge. Second, Big Analytics refers to the sophistication of the model itself. Increasingly, analysis algorithms are provided directly by database management system (DBMS) vendors. To pull away from the pack, companies must go well beyond what is provided and innovate by using newer, more sophisticated statistical analysis. Revolution Analytics addresses both of these opportunities in Big Analytics while supporting the following objectives for working with Big Data Analytics:
1. Avoid sampling/aggregation;
2. Reduce data movement and replication;
3. Bring the analytics as close as possible to the data; and
4. Optimize computation speed.
First, Revolution Analytics delivers optimized statistical algorithms for the three primary data management paradigms being employed to......

Words: 1996 - Pages: 8

Big Data

...are going to be made.

Hadoop and Big Data
Hadoop is a relatively new system that makes it easy to store huge amounts of data and to process that data. With Hadoop, no data set is too big to handle; it can even store more data than a single node or server could hold on its own. Big data comes into connection with Hadoop in the sense that companies are now able to find value in data they might once have considered useless. By looking at what Hadoop really is, it becomes easier to see how big data relates to it. Hadoop is an open source system that can handle all types of data, from structured to unstructured, even emails and pictures; basically anything you can think of, in whatever format it comes. It can process large data sets from terabytes to petabytes and beyond. It also lets you see which decisions are better to make based on the hard data it gives out; it is better to use that than to make assumptions, and it is easier to look at whole data sets, not just samples. Hadoop does not work alone in handling big data; it is integrated with a system called MapReduce, which has constrained support for graphing, machine learning, and other memory-intensive algorithms. Many companies use MapReduce; to name a few, Yahoo uses it for web mapping, and social media companies like Facebook use it for data mining. Hadoop also includes HDFS,......

Words: 1883 - Pages: 8

Big Data

...percent of enterprises appear to have deployed a big data project to date. At the center of the big data movement is an open source software framework created by Doug Cutting, formerly of Yahoo!, called Hadoop. Hadoop has become the technology of choice to support applications that in turn support petabyte-sized analytics utilizing large numbers of computing nodes. The Hadoop system consists of three projects: Hadoop Common, a utility layer that provides access to the Hadoop Distributed File System and the Hadoop subprojects; HDFS, which acts as the data storage platform for the Hadoop framework and can scale to massive size when distributed over numerous computing nodes; and Hadoop MapReduce, a powerful framework for processing data sets across clusters of Hadoop nodes. The Map and Reduce process splits the work by first mapping the input across the control nodes of the cluster, then splitting the workload into even smaller data sets and distributing them further throughout the computing cluster. This allows it to leverage massively parallel processing (MPP), a computing advantage that technology has introduced to modern system architectures. With MPP, Hadoop can run on inexpensive commodity servers, dramatically reducing the upfront capital costs traditionally required to build out a massive system. As the nodes "return" their answers, the Reduce......
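As a small illustration of HDFS as the storage platform described above, the following Java sketch copies a local file into the distributed file system. The local and HDFS paths are made up for the example, and the client is assumed to read the cluster address from a core-site.xml on its classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        // The cluster location (fs.defaultFS) is read from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path local = new Path("/tmp/transactions.csv");       // hypothetical local file
        Path remote = new Path("/data/raw/transactions.csv"); // hypothetical HDFS destination

        // HDFS splits the file into large blocks and replicates them across DataNodes.
        fs.copyFromLocalFile(local, remote);
        System.out.println("Stored " + fs.getFileStatus(remote).getLen() + " bytes in HDFS");
        fs.close();
    }
}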

Words: 2481 - Pages: 10

Cisco Case Study

...(SLAs) for internal customers using big data analytics services
● Support multiple internal users on the same platform

SOLUTION
● Implemented an enterprise Hadoop platform on Cisco UCS CPA for Big Data, a complete infrastructure solution including compute, storage, connectivity and unified management
● Automated job scheduling and process orchestration using Cisco Tidal Enterprise Scheduler as an alternative to Oozie

RESULTS
● Analyzed service sales opportunities in one-tenth the time, at one-tenth the cost
● $40 million in incremental service bookings in the current fiscal year as a result of this initiative
● Implemented a multi-tenant enterprise platform while delivering immediate business value

LESSONS LEARNED
● Cisco UCS reduces complexity, improves agility, and radically improves cost of ownership for Hadoop-based applications
● A library of Hive and Pig user-defined functions (UDFs) increases developer productivity
● Cisco TES simplifies job scheduling and process orchestration
● Build internal Hadoop skills
● Educate internal users about opportunities to use big data analytics to improve data processing and decision making

NEXT STEPS
● Enable NoSQL database and advanced analytics capabilities on the same platform
● Adopt the platform across different business functions

Enterprise Hadoop architecture, built on Cisco UCS Common Platform Architecture (CPA) for Big Data, unlocks hidden business intelligence.

Challenge
Cisco is the......

Words: 3053 - Pages: 13

Real Time Analytics

...fourth enabler is a new set of analytics tools designed specifically to analyze large amounts of data, both structured and unstructured.

* Discuss Hadoop/MapReduce and its relevance for Big Data.
* Hadoop is an open-source framework for creating and executing distributed applications that process large amounts of data. It provides the infrastructure that distributes data across a large number of machines in a cluster and pushes the analysis code to the nodes closest to the data being analyzed.
* Hadoop is a "scalable and available framework for large-scale computation and data processing on a network of commodity hardware."
* MapReduce: MapReduce is a functional programming paradigm suited to the parallel processing of huge data sets distributed across a large number of computers; in other words, MapReduce is the application paradigm supported by Hadoop and the infrastructure presented in this article. MapReduce, as its name implies, works in two stages:
* Map: The map step deals with a small piece of the problem: Hadoop's partitioner splits the problem into small, workable subsets and assigns them to map tasks to solve.
* Reduce: The reducer combines the results of the map tasks and forms the output of the MapReduce operation.
* Hadoop is an open-source framework for creating and executing distributed applications that process a lot of data. It provides the infrastructure that distributes......

Words: 1729 - Pages: 7

Integration of Technology

...competitive advantage.

Hadoop
Hadoop is open source software designed to provide massive storage and large-scale data processing power. It has the ability to handle many tasks running at the same time. Hadoop has a storage part and a processing part. It works by dividing files into large blocks and distributing them among the nodes (Kozielski & Wrembel, 2014). In processing, it works with MapReduce to ensure that code is transferred to the nodes and the nodes process the data in parallel. By using the nodes this way, Hadoop makes data manipulation faster and more efficient. It has four main components: Hadoop Common, which contains the required utilities; the Hadoop Distributed File System, which is the storage part; Hadoop YARN, which manages cluster compute resources; and Hadoop MapReduce, a programming model responsible for processing large-scale data. It can process large amounts of data quickly by using multiple computers (Kozielski & Wrembel, 2014). Large organizations are turning Hadoop into a data processing operating system, because it supports numerous data manipulations and analytical processes. Other data analysis tools, such as SQL engines, run on Hadoop and perform well on this system. Hadoop's ability to run many programs lowers the cost of data analysis and allows businesses to analyze different amounts of data on products and consumers. Hadoop not only provides an organization with more data to work with, but it can also process data of different varieties. This is because Hadoop......

Words: 948 - Pages: 4

Case Stydu of Hive Using Hadoop

...CASE STUDY OF HIVE USING HADOOP
1 Sai Prasad Potharaju, 2 Shanmuk Srinivas A, 3 Ravi Kumar Tirandasu
1,2,3 SRES COE, Department of Computer Engineering, Kopargaon, Maharashtra, India
1 psaiprasadcse@gmail.com

Abstract
Hadoop is a framework of tools, not a single piece of software that you download onto your computer. These tools are used to run applications on big data, which is huge in volume, needs to be processed quickly, and can come in a variety of forms. To manage big data, Hive is used as a data warehouse system for Hadoop that facilitates ad-hoc queries and the analysis of large datasets stored in Hadoop. Hive provides a SQL-like language called HiveQL. In this paper we explain how to use Hive on Hadoop with a simple real-time example: how to create a table, load data into the table from an external file, retrieve data from the table, and examine statistics such as the CPU time for each stage of query execution, the cumulative CPU time, and the time taken to fetch records.
Key Words: Hadoop, Hive, MapReduce, HDFS, HiveQL

1. INTRODUCTION
1.1 Hadoop
Hadoop is open source and is distributed under the Apache license. It is a framework of tools, not a piece of software that you download. These tools are used to run applications on big data. Big data means data characterized by its volume, speed, and variety of forms (unstructured). In the traditional approach big data is processed on a single powerful computer, but such a computer does a good job only until some...
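A minimal Java sketch of the workflow the abstract describes (create a table, load data from an external file, retrieve it) might look like the following. It assumes a HiveServer2 instance at localhost:10000 and the Hive JDBC driver on the classpath; the pageviews table and the /tmp/pageviews.csv file are hypothetical examples, not part of the paper.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        // Explicit driver load; harmless on JDBC 4+ where the driver is auto-discovered.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumes HiveServer2 runs locally; adjust host, port and credentials as needed.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();

        // Create a table for a hypothetical comma-separated data set.
        stmt.execute("CREATE TABLE IF NOT EXISTS pageviews (userid STRING, url STRING) "
                + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

        // Load data into the table from an external (local) file.
        stmt.execute("LOAD DATA LOCAL INPATH '/tmp/pageviews.csv' INTO TABLE pageviews");

        // Retrieve data; Hive compiles this HiveQL query into MapReduce jobs.
        ResultSet rs = stmt.executeQuery(
                "SELECT url, COUNT(*) AS hits FROM pageviews GROUP BY url");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
    }
}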

Words: 1954 - Pages: 8

Hadoop Distribution Comparison

...Hadoop Distribution Comparison
Tiange Chen

The three Hadoop distributions that will be discussed today are Apache Hadoop, MapR, and Cloudera. All of them have the same goals of performance, scalability, reliability, and availability. Furthermore, all of them offer the same basic advantages: massive storage, great computing power, flexibility (store and process data whenever you want, instead of preprocessing it before storage as traditional relational databases require, and easy access to new data sources such as social media and email conversations), fault tolerance (if one node fails, jobs still run on other nodes because data is replicated to other nodes up front, so the computation does not fail), low cost (commodity hardware is used to store data), and scalability (more nodes mean more storage, with little extra administration).

Apache Hadoop is the standard Hadoop distribution. It is an open source project, created and maintained by developers from all around the world. Public access allows many people to test it, and problems can be noticed and fixed quickly, so its quality is reliable (Moccio, Grim, 2012). The core components are the Hadoop Distributed File System (HDFS) as the storage part and MapReduce as the processing part. HDFS has a simple and robust coherency model. It is able to store large amounts of information and provides streaming read performance. However, it is not as strong in the areas of easy management and seamless......

Words: 540 - Pages: 3

Hadoop Installation

...In this tutorial, the required steps are described for setting up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed File System, running on Ubuntu Linux.

Installing Python
$ sudo apt-get install python-software-properties
$ sudo add-apt-repository ppa:ferramroberto/java

Update the source list
$ sudo apt-get update

Install Sun Java 6 JDK
$ sudo apt-get install sun-java6-jdk

Select Sun's Java as the default on your machine. (See 'sudo update-alternatives --config java' for more information.)
$ sudo update-java-alternatives -s java-6-sun

The full JDK will be placed in /usr/lib/jvm/java-6-sun (this directory is actually a symlink on Ubuntu). After installation, make a quick check whether Sun's JDK is correctly set up:
$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)

Adding a dedicated Hadoop system user
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
This will add the user hduser and the group hadoop to your local machine.

Configuring SSH
user@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa...

Words: 2067 - Pages: 9