HADOOP FOUNDATION
Hadoop Introduction:-
=======================
What is Hadoop?
=================
--> Hadoop is a Bigdata framework.
--> It provides both storage and processing.
--> For storage we use the HDFS filesystem.
--> For processing we have "MapReduce, Pig, Hive, etc..."
--> Hadoop is implemented in the "Java language".
--> Hadoop was created by "Doug Cutting".
--> The vendor of Hadoop is "Apache" (the Apache Software Foundation).
=========================================================
What is Bigdata?
================
--> Bigdata means huge volumes of data which cannot be stored and processed within a short time using traditional technologies
like RDBMS and conventional application programming.
RDBMS:
--> RDBMS has limitations. It does not support unlimited storage.
--> Most RDBMS filesystems are not distributed filesystems.
--> Most RDBMS do not support parallel processing.
Classification of Bigdata:-
-----------------------------
--> Bigdata can be classified into three categories. They are:-
1. Structured Data (Hadoop and RDBMS)
2. Semi-Structured Data (Hadoop; most RDBMS do not support it)
3. Un-Structured Data (Hadoop; most RDBMS do not support it)
Structured Data:-
----------------
--> Data has a proper format associated with it.
E.g:- Tables (rows and columns format)
--> RDBMS and Hadoop support structured data.
Semi-Structured Data:-
------------------
--> Data does not have a proper format associated with it.
E.g:- XML, JSON, CSV etc..
--> Most RDBMS do not support semi-structured data, but Hadoop does.
Un-Structured Data:-
------------------
--> Data does not have any format associated with it.
E.g:- Log files, chat data, reviews, videos, audio, images etc..
--> Most RDBMS do not support un-structured data, but Hadoop does.
==================================================================================
Characteristics of Bigdata:-
--> There are mainly three characteristics. They are:-
1. Volume
2. Velocity
3. Variety
Volume:-
--------
--> Huge volumes of data.
--> Hadoop supports unlimited data.
--> There is no limit in a Hadoop cluster.
--> RDBMS does not support unlimited storage because there is a limit in an RDBMS cluster.
Velocity:-
-----------
--> The speed at which data grows is called velocity.
E.g:-
--> Facebook generates around 200 TB of data every day.
--> On YouTube, 48 hours of video are uploaded every minute.
--> Twitter processes around 400 million tweets per day.
--> etc....
Variety:-
-----------
--> Structured, Semi-Structured and Un-Structured
--> Hadoop supports all kinds of data.
--> RDBMS supports only structured data.
Challenges Associated with Bigdata?
===================================
--> Storage: Before Hadoop, there was no technology which could store unlimited data.
--> Processing: Before Hadoop, there was no application programming model which could process huge volumes of data
in a very short time.
--> Hadoop solved the Bigdata problems using HDFS and MapReduce.
--> HDFS supports unlimited storage; there is no limit in the HDFS filesystem.
It is a distributed filesystem.
--> MapReduce is an algorithm (a programming model); we can implement MapReduce jobs in Java, Python,
etc...
--> MapReduce is a parallel processing architecture.
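To make the parallel processing idea concrete, here is the classic WordCount job, adapted from the Apache Hadoop MapReduce tutorial (a sketch; the input and output HDFS paths are passed as arguments):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: runs in parallel, one map task per input split/block,
      // and emits (word, 1) for every word it sees.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: sums the counts for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The map tasks run on many machines at once (one per input split), and the reduce tasks aggregate the partial results; this is exactly the parallelism described above.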
===========================================================================================
Different Vendors of Hadoop:
===============================
--> Hadoop is implemented using Java.
--> Apache Hadoop is open source and free.
--> Open source -- they provide the complete source code.
--> Based on this open source code, many vendors have implemented their own Hadoop distributions:
1. Cloudera -- CDH (Cloudera Distribution of Hadoop)
2. Hortonworks -- HDP (Hortonworks Data Platform)
3. MapR -- MapR Hadoop
4. Amazon -- EMR (Elastic MapReduce)
5. IBM -- InfoSphere BigInsights
6. Microsoft -- HDInsight
etc...
--> Recently Cloudera and Hortonworks merged.
===========================================================================================
Features of Hadoop:-
====================
1. Cost Effective:
===================
--> The Hadoop software is free.
--> Hadoop works on commodity hardware:
PC < Commodity < Server Hardware (high-end or certified hardware)
2. Large Cluster of Nodes:
===========================
--> RDBMS has a limitation on cluster size.
--> But in Hadoop, there is no limitation.
--> We can take 'n' no. of machines in a Hadoop cluster.
3. Distributed Data:
=====================
--> Hadoop uses the "HDFS Filesystem".
--> It is a distributed filesystem.
--> Data is divided into multiple blocks and each block is placed on a different
machine.
4. Parallel Processing:
=========================
--> MapReduce, Pig, Hive etc. are processing frameworks.
--> Hadoop processes data in parallel.
5. Automatic Failover Management:-
==================================
--> To provide automatic failover (fault tolerance), Hadoop maintains three
replicas of each block (by default).
--> In a cluster, if a machine fails, we can still access the HDFS filesystem.
6. Data Locality Optimization:
==============================
--> Application instances are created where the data is located.
--> Data does not move to the application's location; application instances are created at the
data's location.
--> Map instances are created on multiple machines.
7. Heterogeneous Cluster:-
==========================
--> We can take any hardware and any OS for the machines in the cluster.
--> It is not required to use the same operating system and the same hardware for all
machines.
8. Scalability:-
=====================
--> We can add or remove any number of machines to/from the cluster without putting the
cluster in maintenance mode or stopping it.
--> There is no limit on a Hadoop cluster.
--> Hadoop supports unlimited storage.
===========================================================================================
Hadoop Architecture:-
====================
Master/Slave Architecture
--> Hadoop follows a Master/Slave architecture.
--> In the Hadoop architecture, there are 5 daemons involved. They are:
1. NameNode
2. SecondaryNameNode
3. DataNode
4. Until Hadoop 1.x "JobTracker", from Hadoop 2.x "ResourceManager"
5. Until Hadoop 1.x "TaskTracker", from Hadoop 2.x "NodeManager"
What is a Daemon?
-----------------
--> A daemon is a process that always runs in the background.
--> In Hadoop, a daemon is a Java program.
Master Daemons:
----------------
1. NameNode
2. SecondaryNameNode
3. ResourceManager
Slave Daemons:
-----------------------
1. DataNode
2. NodeManager
--> It is highly recommended to run the master daemons on separate machines.
--> It is highly recommended to run both slave daemons on the same machines;
we will take multiple slaves.
===========================================================================================
HDFS Architecture:-
------------------
--> It is responsible for storage.
HDFS Daemons:
--------------
1. NameNode
2. SecondaryNameNode
3. DataNode
Master Daemons:-
-----------------
1. NameNode
2. SecondaryNameNode
Slave Daemons:
--------------------
1. DataNode
===========================================================================================
MapReduce Architecture:-
--> It is responsible for processing.
--> In Hadoop we have two MapReduce architectures. They are:
Until Hadoop 1.x --- MR1 Architecture
1. JobTracker
2. TaskTracker
From Hadoop 2.x -- MR2 Architecture (YARN Architecture)
-------------------------------------------------------
1. ResourceManager
2. NodeManager
Master Daemon:
--------------
1. ResourceManager
Slave Daemons:
-------------------
1. NodeManager
====================================================================================================
Hadoop Components:-
---------------------
There are 6 important components. They are:
1. NameNode
2. SecondaryNameNode
3. DataNode
4. Until Hadoop 1.x "JobTracker", from Hadoop 2.x "ResourceManager"
5. Until Hadoop 1.x "TaskTracker", from Hadoop 2.x "NodeManager"
6. Client
--> The 5 daemons (the client is not a daemon) play a very important role in Hadoop.
--> Daemon -- it is a background process.
NameNode:-
---------
--> NameNode is an HDFS daemon.
--> It is a Master daemon in HDFS.
--> The client communicates with the "NameNode" to store data.
--> When the NameNode gets a storage request, it reads some configuration files:
hdfs-site.xml
block size -- 128 MB (default)
replication factor -- 3 (default)
slaves file -- tells the NameNode where the DataNodes are;
here we provide the complete information about the slaves.
proximity algorithm
It contains rack information and the bandwidth of each machine;
we call this the "rack awareness" algorithm.
This algorithm is typically provided as a Python script.
The whole metadata file is called the fsimage.
--> Then it prepares the metadata.
--> Then it returns the metadata to the client.
What is metadata?
-----------------
--> Data about data.
--> It has the file-to-block and block-to-DataNode mapping information.
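As a sketch of how a client can actually see this mapping, the Hadoop FileSystem API exposes it through getFileBlockLocations(). The NameNode address and the file path below are hypothetical placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowMetadata {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/sample.txt"); // hypothetical file
        FileStatus status = fs.getFileStatus(file);

        // Ask the NameNode for the file -> block -> DataNode mapping.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.println("offset=" + block.getOffset()
              + " length=" + block.getLength()
              + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
      }
    }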
======================================================================================
SecondaryNameNode:-
-------------------
--> It is not a backup NameNode.
--> It is an HDFS daemon.
--> It is a Master daemon in the HDFS architecture.
--> It is also called the "Checkpointing Node".
--> It performs the checkpointing operation.
--> Merging the fsimage and the edit log is called "checkpointing".
--> Once a checkpoint is completed, it erases the edit log files.
When does checkpointing happen?
-------------------------------
There are two scenarios. They are:
1. checkpoint period -- 1 hour by default
2. threshold no. of uncheckpointed transactions -- 1 million by default
The start of the checkpoint process on the SecondaryNameNode is controlled by two configuration parameters:
dfs.namenode.checkpoint.period, set to 1 hour by default, specifies the maximum delay between two consecutive checkpoints, and
dfs.namenode.checkpoint.txns, set to 1 million by default, defines the number of uncheckpointed transactions on the NameNode which will force an urgent checkpoint, even if the checkpoint period has not been reached.
Without a SecondaryNameNode, the NameNode can perform checkpointing itself, but it is
an overhead on the NameNode.
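A small sketch of reading these two checkpoint parameters (with their documented defaults) through the Hadoop Configuration API; it assumes an hdfs-site.xml is available on the classpath:

    import org.apache.hadoop.conf.Configuration;

    public class CheckpointSettings {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml"); // picked up from the classpath if present

        // Defaults: 3600 seconds (1 hour) and 1,000,000 transactions.
        long periodSeconds = conf.getLong("dfs.namenode.checkpoint.period", 3600);
        long txns = conf.getLong("dfs.namenode.checkpoint.txns", 1000000);

        System.out.println("checkpoint period (s): " + periodSeconds);
        System.out.println("uncheckpointed txns threshold: " + txns);
      }
    }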
=====================================================================================================
DataNode:-
----------
--> It is an HDFS daemon.
--> It is a slave/worker daemon.
--> We can take 'n' no. of DataNodes.
--> The main purpose of a DataNode is to store/hold the data.
--> The client writes data into the DataNodes.
--> Once the client has got the metadata information from the NameNode, the client performs the
HDFS write flow.
--> Each DataNode updates its master node saying "I am active and I have such-and-such blocks";
this concept is called the "heartbeat signal".
=====================================================================================================
ResourceManager:-
-----------------
--> It is a MapReduce daemon.
--> It is a YARN daemon (MR2 architecture; it was introduced in Hadoop 2.x).
--> The ResourceManager receives processing requests, which may be
MapReduce
Pig
Hive
Sqoop
Spark
etc..
--> Once the ResourceManager receives a processing request, it communicates with the
NameNode for the details of the files required by the client application.
--> Once the ResourceManager receives the response from the NameNode, then based on the
data and its locations it assigns tasks to its slave/worker machines.
=====================================================================================================
NodeManager:-
-------------
--> It is a slave/worker daemon in the MapReduce architecture.
--> The responsibility of the NodeManager is to process client requests.
--> A client request may be
MapReduce
Pig
Hive
Sqoop
Spark
etc..
--> Once the NodeManager receives a processing request assigned by its master,
it starts processing.
--> The NodeManager updates its status to its master saying "I am active, I am executing
such-and-such request/task, and this is the percentage of completion".
This process is called the heartbeat signal in MapReduce.
=====================================================================================================
Client:-
--------
--> There are two types of clients. They are:-
1. HDFS Clients
2. Processing Clients/Applications
=====================================================================================================
HDFS Filesystem:-
-----------------
--> HDFS means "Hadoop Distributed Filesystem".
--> It supports unlimited storage and
--> it supports parallel processing.
--> It is a userspace filesystem.
What is the difference between an Operating System filesystem (local filesystem) and the
HDFS filesystem?
--> An Operating System filesystem starts as part of the operating system's own processes.
--> We cannot easily increase or decrease the size of an Operating System filesystem.
--> But HDFS is a userspace filesystem; it starts after the operating system has started.
It is a user process, so we can increase or decrease the HDFS filesystem size. It is
under user control.
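A minimal sketch of connecting to HDFS from a Java client, which illustrates the "userspace" point: HDFS is reached through a client library rather than a kernel mount. The NameNode address is a hypothetical placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ConnectToHdfs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS points at the NameNode; the address is hypothetical.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        // The HDFS client runs entirely in user space -- no kernel mount is needed.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
        fs.close();
      }
    }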
What is Block?
--------------
--> A block is a chunk or piece of data.
--> When a client loads data into the HDFS filesystem, the data is divided into multiple blocks.
--> Each block is placed on a different machine.
--> The default block size is: until Hadoop 1.x -- 64 MB;
from Hadoop 2.x -- 128 MB.
--> This is configurable; we can increase or decrease the block size in
hdfs-site.xml -- (https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml)
name: dfs.blocksize
value: 134217728 (128 MB in bytes, the default)
--> A file is stored as multiples of the block size, e.g.:
a 320 MB file with 64 MB blocks -> 64+64+64+64+64
a 512 MB file with 128 MB blocks -> 128+128+128+128
--> In an Operating System the block size is 4 KB
(some newer filesystems use 8 KB).
What is the advantage of a large block size?
--------------------------------------------
--> Reading/seeking time is much faster, because each file has far fewer blocks to locate and open.
What is Replication or What is Replication-Factor?
---------------------------------------------------
--> Replication means a duplicate copy.
--> Replication factor -- how many copies of each block are kept.
--> By default Hadoop creates "3" replicas.
--> We can increase or decrease the replication factor
in hdfs-site.xml:
name: dfs.replication
value: 3
--> The main advantage of replication is fault tolerance:
we can still access the HDFS filesystem even if a machine goes down, fails, or crashes.
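A short sketch of inspecting (and changing) the block size and replication factor of a file through the FileSystem API; the NameNode address and file path are hypothetical placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockAndReplicationInfo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/sample.txt"); // hypothetical file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size (bytes): " + status.getBlockSize());
        System.out.println("replication factor: " + status.getReplication());

        // The replication factor can also be changed per file after writing.
        fs.setReplication(file, (short) 2);
        fs.close();
      }
    }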
Under Replication:-
-------------------
--> Blocks with fewer copies than the replication factor are called "under-replicated blocks".
--> Assume DataNode 3 fails; the blocks on the failed node become
"under-replicated" blocks.
--> The NameNode maintains the under-replicated block information on the NameNode machine.
--> The NameNode then asks the DataNode machines to replicate the "under-replicated blocks".
--> Assume, for example, one DataNode accepts to replicate "b3" and DataNode 2 accepts to replicate
block 1.
--> At any point of time the NameNode tries to maintain the replication factor.
Over Replication:-
-------------------
--> Blocks with more copies than the replication factor are called "over-replicated" blocks.
--> The NameNode asks, for example:
For Block 1 -- DataNodes 1, 2, 3, 5 to remove block 1.
For Block 3 -- DataNodes 1, 3, 4 and 5 to remove block 3.
Assume DataNode 3 accepts to remove Blocks 1 and 3.
--> At any point of time the NameNode tries to maintain the replication factor.
==================================================================================
HDFS Write Flow:-
-----------------
--> When a client wants to write a file into the HDFS filesystem, the client performs the
HDFS write flow.
--> First the client communicates with the NameNode, gets the block and DataNode
information, and then writes into the HDFS filesystem.
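A minimal sketch of the write flow from a Java client: create() contacts the NameNode for block/DataNode assignments, and the returned stream writes the data to the DataNodes. The NameNode address and path are hypothetical placeholders:

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);

        // create() asks the NameNode for block/DataNode assignments;
        // the returned stream then writes the data to the DataNodes.
        FSDataOutputStream out = fs.create(new Path("/data/hello.txt")); // hypothetical path
        out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));
        out.close();
        fs.close();
      }
    }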
HDFS Read Flow:-
-----------------
--> A client may want to read from the HDFS filesystem via
an HDFS CLI command
MapReduce
Pig
Hive
etc...
--> First the client communicates with the NameNode and gets the filesystem metadata.
--> The NameNode returns metadata like:
f1 -- b1 -- DataNodes 1,3,5
f1 -- b2 -- DataNodes 2,4,6
f1 -- b3 -- DataNodes 3,4,5
--> The client then reads
b1 from DataNode 1
b2 from DataNode 2
b3 from DataNode 3
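A minimal sketch of the read flow from a Java client: open() fetches the block locations (metadata) from the NameNode, and the stream then reads each block directly from a DataNode that holds it. The NameNode address and path are hypothetical placeholders:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode for the block locations (metadata),
        // then the stream reads each block directly from a DataNode holding it.
        FSDataInputStream in = fs.open(new Path("/data/hello.txt")); // hypothetical path
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
          System.out.println(line);
        }
        reader.close();
        fs.close();
      }
    }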