
On September 4, 1998, Google was founded in Silicon Valley, USA. As everyone knows, it is a company that started out as a search engine.

Coincidentally, an American engineer named Doug Cutting was also hooked on search engines. He wrote a function library for text search (think of it as a functional component of software) and named it Lucene.

Left: Doug Cutting. Right: the Lucene logo.

Lucene is written in Java and aims to add full-text search capabilities to a variety of small and medium-sized applications. Because it is easy to use and open source (its code is public), it is very popular among programmers.

In the early days, the project was hosted on Doug Cutting's personal website and on SourceForge (an open source software site). Later, at the end of 2001, Lucene became a subproject of the Apache Software Foundation's Jakarta project.

Anyone who works in IT should know the Apache Software Foundation.

In 2004, Doug Cutting kept up the effort. Building on Lucene, and working with fellow Apache open source contributor Mike Cafarella, he developed an open source search engine that could stand in for the mainstream search engines of the time, and named it Nutch.

Nutch is a web search application built on top of Lucene that can be downloaded and used directly. It adds a web crawler and some web-related features to Lucene, so that it can scale from a simple site search up to a search of the whole web, just like Google.

Nutch's influence in the industry was greater than Lucene's.

A large number of websites adopted the Nutch platform, which greatly lowered the technical threshold and made it possible to replace high-priced Web servers with low-cost ordinary computers. For a while, there was even a trend in Silicon Valley of starting low-cost businesses with Nutch.

Over time, both Google and Nutch faced the problem of an ever-growing volume of content to search.

Google in particular, as an Internet search engine, needed to store a huge number of web pages and to keep optimizing its search algorithms to improve search efficiency.

Google search bar

In the process, Google did come up with a lot of good ideas, and it shared them unselfishly.

In 2003, Google published a technical paper that publicly introduced its own file system, GFS (Google File System). This is a proprietary file system that Google designed to store massive amounts of search data.

The following year, 2004, Doug Cutting implemented a distributed file storage system based on Google's GFS paper and named it NDFS (Nutch Distributed File System).

Also in 2004, Google published another technical paper, introducing its MapReduce programming model. This programming model is used for parallel analysis of large data sets (greater than 1TB).

The following year (2005), Doug Cutting implemented MapReduce in the Nutch search engine, again based on Google's paper.

In 2006, Yahoo, still very powerful at the time, recruited Doug Cutting.

Some background on why Yahoo recruited Doug: before 2004, Yahoo, an Internet pioneer, used Google's search engine as its own search service. At the beginning of 2004, Yahoo dropped Google and started developing its own search engine. And so...

After joining Yahoo, Doug Cutting upgraded NDFS and MapReduce and renamed them Hadoop (NDFS was likewise renamed HDFS, the Hadoop Distributed File System).

This is the origin of Hadoop, the famous big data framework. And Doug Cutting is known as the father of Hadoop.

The name Hadoop is actually the name of a yellow toy elephant belonging to Doug Cutting's son. That is why the Hadoop logo is a running yellow elephant.

Let's continue.

Still in 2006, Google published yet another paper.

This time, it introduced BigTable, a distributed data storage system: a non-relational database for handling large amounts of data.

Naturally, Doug Cutting did not let this one pass either. He brought BigTable into his own Hadoop system and named it HBase.

In short, it was a matter of keeping pace with Google: whatever Google did, Doug Cutting learned from it.

So Google's shadow can be seen in basically every core part of Hadoop.

In January 2008, Hadoop came into its own and officially became a top-level project of the Apache Foundation.

In February of the same year, Yahoo announced that it had built a Hadoop cluster with 10,000 cores and deployed its own search engine product on it.

In July, Hadoop broke the world record and became the fastest system for sorting 1TB of data, taking 209 seconds.

Since then, Hadoop has been in a period of rapid development that continues to this day.

Hadoop's core architecture    

The core of Hadoop, to put it bluntly, is HDFS and MapReduce. HDFS provides storage for massive data, and MapReduce provides a computing framework for massive data.

Hadoop core architecture

Let's take a closer look at how each of them works.

First, HDFS.

HDFS as a whole has three important roles: NameNode (name node), DataNode (data node), and Client.

Typical master-slave architecture with TCP/IP communication

NameNode: This is the master node, which can be regarded as the manager of the distributed file system. It is mainly responsible for managing the file system namespace, cluster configuration information, and the replication of storage blocks. The NameNode keeps the file system's metadata in memory. This mainly includes file information, the list of blocks that make up each file, and the DataNodes on which each block is stored.

DataNode: This is a slave node and the basic unit of file storage. It stores blocks in its local file system, keeps each block's metadata, and periodically reports all the blocks it holds to the NameNode.

Client: Splits files; accesses HDFS; interacts with the NameNode to obtain file location information; interacts with DataNodes to read and write data.

There is one more concept, the Block. The block is the basic read-write unit in HDFS. Files in HDFS are all cut into blocks for storage, and these blocks are copied to multiple DataNodes. The block size (usually 64MB) and the number of copies of each block are determined by the Client when the file is created.
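
To make the block arithmetic concrete, here is a minimal sketch in Python. The function name `split_into_blocks` and the 200MB file size are assumptions chosen for illustration; they are not part of any Hadoop API. It only shows how a file size maps onto fixed-size blocks, with a smaller final block when the size is not an exact multiple.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file would be cut into.

    Hypothetical helper for illustration only; real HDFS works on bytes
    and the block size is a per-file setting chosen by the client.
    """
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    # Every block is full-sized except possibly the last one.
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


blocks = split_into_blocks(200)   # [64, 64, 64, 8]
copies = len(blocks) * 3          # 12 block copies with three replicas each
```

Note that the last block only occupies as much space as the data it holds, so the 200MB file above needs four blocks rather than 4 x 64MB of storage per copy.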

Let's take a brief look at how HDFS reads and writes data.

First, the write process:

  1. The user makes a request to the Client. For example, 200MB of data needs to be written.
  2. The Client makes a plan: the data will be cut into 64MB blocks, and every block will be saved in three copies.
  3. The Client cuts the large file into blocks.
  4. For the first block, the Client asks the NameNode (master node) to help store three copies of this 64MB block.
  5. The NameNode gives the Client the addresses of three DataNodes, sorted by their distance to the Client.
  6. The Client sends the data and the list of DataNodes to the first DataNode.
  7. The first DataNode copies the data to the second DataNode.
  8. The second DataNode copies the data to the third DataNode.
  9. Once all the data for a block has been written, completion is reported back to the NameNode.
  10. The same operation is performed for the second block, and so on.
  11. After all the blocks are done, the file is closed. The NameNode then persists the file's metadata to disk.
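
The write steps can be sketched as a small in-memory simulation in Python. Everything here is a toy under stated assumptions: the classes `DataNode` and `NameNode`, the function `write_file`, and the numeric `distance` field are hypothetical names invented for this sketch, and picking the three nearest nodes is a simplification of how a real NameNode places replicas.

```python
class DataNode:
    """Toy data node: stores blocks in a dict and forwards them down the pipeline."""

    def __init__(self, name, distance):
        self.name = name
        self.distance = distance  # assumed "network distance" to the client
        self.blocks = {}

    def write(self, block_id, data, rest_of_pipeline):
        self.blocks[block_id] = data  # store locally first
        if rest_of_pipeline:          # then copy to the next node in the list
            rest_of_pipeline[0].write(block_id, data, rest_of_pipeline[1:])


class NameNode:
    """Toy name node: picks target nodes and records where each block lives."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}  # filename -> [(block_id, [datanode names])]

    def allocate(self, filename, index):
        # Simplification: choose the nearest nodes, sorted by distance.
        targets = sorted(self.datanodes, key=lambda d: d.distance)[: self.replication]
        block_id = f"{filename}#blk{index}"
        self.block_map.setdefault(filename, []).append(
            (block_id, [d.name for d in targets])
        )
        return block_id, targets


def write_file(namenode, filename, data, block_size):
    """Client side: cut data into blocks and push each one into a pipeline."""
    blocks = [data[i : i + block_size] for i in range(0, len(data), block_size)]
    for index, block in enumerate(blocks):
        block_id, targets = namenode.allocate(filename, index)
        # The client only talks to the first node; the nodes relay the rest.
        targets[0].write(block_id, block, targets[1:])
    return len(blocks)
```

With four toy nodes and 200 bytes cut into 64-byte blocks, `write_file` produces four blocks, each stored on the three nearest nodes, while the farthest node stays empty.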

Next, the read process:

  1. The user makes a read request to the Client.
  2. The Client asks the NameNode for all the information about the file.
  3. The NameNode gives the Client the list of blocks for the file, along with the list of DataNodes that store each block (sorted by distance from the Client).
  4. The Client downloads each required block from the nearest DataNode.
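
The read steps can be sketched in the same spirit. This is a toy in Python, not the HDFS client API: the function `read_file` and the shapes of `block_locations` and `datanode_storage` are assumptions made for illustration. `block_locations` stands in for the NameNode's answer, and `datanode_storage` stands in for the data nodes themselves.

```python
def read_file(block_locations, datanode_storage):
    """Reassemble a file by fetching each block from its nearest replica.

    block_locations: ordered list of (block_id, [(datanode_name, distance), ...]),
                     a stand-in for what the NameNode returns.
    datanode_storage: {datanode_name: {block_id: bytes}}, a stand-in for the
                      blocks each data node holds.
    """
    parts = []
    for block_id, replicas in block_locations:
        # Pick the replica with the smallest distance to the client.
        nearest_name, _ = min(replicas, key=lambda replica: replica[1])
        parts.append(datanode_storage[nearest_name][block_id])
    return b"".join(parts)
```

Because the NameNode only hands out locations, the data itself never passes through it; the client fetches every block directly from a data node.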

(Note: The above is a simplified description and the actual process will be more complicated.)

Now let's look at MapReduce.

MapReduce is actually a programming model. The core of this model is divided into two steps: Map (mapping) and Reduce (reduction).

When you submit a computing job to the MapReduce framework, it first splits the job into several Map tasks and assigns them to different nodes for execution. Each Map task processes one part of the input data. When a Map task finishes, it produces some intermediate files, and these intermediate files serve as the input data for the Reduce tasks. The main goal of a Reduce task is to aggregate the output of the preceding Map tasks and write out the result.

Feeling a little dizzy? Let's look at an example.

The picture above shows a word-frequency counting task.

  1. Hadoop cuts the input data into several splits and hands each split to a map task.
  2. After mapping, each map task has produced the words in its split together with the number of times each one appears.
  3. Shuffle groups identical words together, sorts them, and divides them into several partitions.
  4. The reduce tasks perform the reduction on these partitions.
  5. The results of the reduce tasks are collected and output to a file.
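
The word-frequency steps above can be sketched in a few lines of plain Python. This runs in a single process and does not use Hadoop; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count`, as well as the sample sentences, are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def map_phase(split):
    """Map: count how often each word appears within one input split."""
    return list(Counter(split.split()).items())


def shuffle(mapped_outputs):
    """Shuffle: gather the counts for the same word together, sorted by word."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for word, count in pairs:
            groups[word].append(count)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: add up the partial counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(splits):
    mapped = [map_phase(split) for split in splits]  # one map task per split
    return reduce_phase(shuffle(mapped))


result = word_count(["deer bear river", "car car river", "deer car bear"])
# result maps each word to its total: bear 2, car 3, deer 2, river 2
```

In a real cluster the map tasks would run on different nodes and the shuffle would move data across the network; the sketch only preserves the order of the phases.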

If that is still not clear, here is another example.

A teacher has 100 exam papers to grade. He finds 5 helpers and hands each of them 20 papers. The helpers grade their papers at the same time. Finally, the helpers report their results to the teacher, who puts them together. Very simple, right?

The MapReduce model makes it much easier for programmers to run their own programs on a distributed system without having to know distributed parallel programming.

Oh, I almost forgot: to carry out the process above, MapReduce needs two roles, JobTracker and TaskTracker.

The JobTracker schedules and manages the TaskTrackers, and it can run on any computer in the cluster. A TaskTracker is responsible for executing tasks and must run on a DataNode.

Version 1.0 and version 2.0

In November 2011, Hadoop 1.0.0 was officially released, meaning that it was ready for commercial use.

However, version 1.0 had some problems:

1. Poor scalability: the JobTracker was heavily loaded and became a performance bottleneck.

2. Poor reliability: there was only one NameNode, and if it went down, the entire system went down with it.

3. It only worked with MapReduce.

4. Resource management was inefficient.

So, in May 2012, Hadoop launched version 2.0.

In version 2.0, a YARN (resource management framework) layer was added on top of HDFS. It is a resource management module that provides resource management and scheduling for a variety of applications.

In addition, version 2.0 also improved the security and stability of the system.

As a result, version 2.0 became the version generally used in the industry. Hadoop has since advanced to version 3.x.

Hadoop's ecosystem    

Over time, Hadoop has evolved from its first two or three components into an ecosystem with more than 20 components.

In the overall Hadoop architecture, the computing framework serves as the link between the layers above and below it. On the one hand, it operates on the data in HDFS; on the other hand, it is wrapped up and called by upper-layer components such as Hive and Pig.

Let's take a brief look at some of the more important components.

HBase: Derived from Google's BigTable; a highly reliable, high-performance, column-oriented, scalable distributed database.

Hive: A data warehouse tool that can map structured data files onto database tables. It lets simple MapReduce statistics be implemented quickly through SQL-like statements, without developing a dedicated MapReduce application, which makes it well suited to statistical analysis in data warehouses.

Pig: A large-scale data analysis tool based on Hadoop. It provides a SQL-like language called Pig Latin, whose compiler converts SQL-like data analysis requests into a series of optimized MapReduce operations.

ZooKeeper: Derived from Google's Chubby. It is mainly used to solve data management problems commonly encountered in distributed applications, making such applications easier to coordinate and manage.

Ambari: A Hadoop management tool for quickly monitoring, deploying, and managing clusters.

Sqoop: Used to transfer data between Hadoop and traditional databases.

Mahout: An extensible machine learning and data mining library.

Looking back at the earlier picture may make this more intuitive.

Hadoop's advantages and applications    

In general, Hadoop has the following advantages:

High reliability: This is in its genes, and its genes come from Google. What Google is best at is making good use of "junk" hardware. When Google started out, it was poor and could not afford high-end servers, so it especially liked to deploy large systems like this on ordinary computers. Although the hardware is unreliable, the system is very reliable.

High scalability: Hadoop distributes data and computing tasks across the available computers in a cluster, and the cluster can easily be extended. To put it bluntly, it is easy to make it bigger.

High efficiency: Hadoop can dynamically move data between nodes and keep the nodes dynamically balanced, so processing is very fast.

High fault tolerance: Hadoop automatically saves multiple copies of data and automatically reassigns failed tasks. This is really another facet of high reliability.

Low cost: Hadoop is open source and relies on community support, so it is less expensive to use.

Given these advantages, Hadoop is well suited to big data storage and big data analysis. It suits clusters ranging from thousands to tens of thousands of servers and supports petabyte-scale storage.

Hadoop can be deployed in a wide variety of applications, including search, log processing, recommendation systems, data analysis, video and image analysis, and data archiving.

At present, Yahoo, IBM, Facebook, Amazon, Alibaba, Huawei, Baidu, Tencent and other companies all use Hadoop to build their own big data systems.

Besides the large enterprises above that use Hadoop in their own services, a number of commercial companies that provide Hadoop solutions have emerged. They optimize, improve, and build on Hadoop, and then offer the result as a commercial service.

One of the better known is Cloudera.

Founded in 2008, it specializes in selling and servicing Hadoop-based data management software. It also provides Hadoop-related support, consulting, and training, playing a role somewhat similar to Red Hat's in the Linux world. Doug Cutting, the father of Hadoop mentioned earlier, was hired by the company as chief architect.

Hadoop and Spark    

Finally, let me introduce Spark, which everyone cares about.

Spark is also a top-level project of the Apache Software Foundation. It can be understood as an improvement on the basis of Hadoop.

It is a general-purpose parallel framework, similar to Hadoop MapReduce, open-sourced by UC Berkeley's AMP Lab. Compared with Hadoop, it can be said that the student has surpassed the master.

MapReduce is disk oriented. Constrained by disk read and write performance, MapReduce is therefore inefficient at iterative computation, real-time computation, interactive data queries, and the like. Yet such computations are very common in graph computation, data mining, and machine learning applications.

Spark, by contrast, is memory oriented. This lets Spark deliver near-real-time processing performance on data from multiple different data sources, and makes it well suited to applications that need to operate on a particular data set many times.
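
The difference between the two approaches for iterative work can be illustrated with a toy in plain Python. It uses neither Hadoop nor Spark: the class `CountingStore` and both functions are hypothetical, and the "disk" is just an object that counts how often the data set is loaded. The point is only that a disk-oriented loop pays the loading cost on every pass, while a memory-oriented loop pays it once.

```python
class CountingStore:
    """Toy stand-in for disk-backed storage that counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.loads = 0

    def load(self):
        self.loads += 1
        return list(self._values)


def iterate_disk_oriented(store, iterations):
    """Each pass re-reads its input from storage before computing."""
    estimate = 0.0
    for _ in range(iterations):
        data = store.load()
        estimate = (estimate + sum(data) / len(data)) / 2
    return estimate


def iterate_memory_oriented(store, iterations):
    """The data set is loaded once and kept in memory across passes."""
    data = store.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate = (estimate + sum(data) / len(data)) / 2
    return estimate
```

Both functions compute the same result; for ten passes the first touches storage ten times and the second only once, which is where the gap opens up when storage is a slow disk.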

Processing the same data in the same experimental environment, Spark can be up to 100 times faster than MapReduce when running in memory. In other respects too, such as iterative operations, computing data analysis reports, and sorting, Spark is much faster than MapReduce.

In addition, Spark is also stronger than Hadoop in ease of use and versatility.

As a result, Spark has stolen the limelight from Hadoop.