---
layout: default
title: Next Generation Big Data Analytics Platform
description: Stratosphere is an Open Source platform for massively parallel big data analytics. It features a rich set of operators, advanced, iterative data flows, an efficient runtime, and automatic program optimization.
keywords: stratosphere, big data, data analytics, open source, hadoop, apache, massively-parallel data processing, map reduce
---
<script src="{{ site.baseurl }}/js/jquery.scrollTo.min.js"></script>
<script src="{{ site.baseurl }}/js/jquery.localScroll.min.js"></script>
<script type="text/javascript">
$( document ).ready(function() {
// hide element (for js users)
$( "#see-how-hadoop" ).hide();
$( "#see-how-hadoop-iterations" ).hide();
$.localScroll({ offset: { top:-150 }});
$( "#see-how-hadoop-iterations-link" ).click(function() {
$( "#see-how-hadoop-iterations" ).slideToggle( "slow", function() {
// Animation complete.
});
});
$( "#see-how-hadoop-link" ).click(function() {
$( "#see-how-hadoop" ).slideToggle( "slow", function() {
// Animation complete.
});
});
});
</script>
<div class="jumbotron masthead" id="top">
<!-- START RIBBON -->
<div class="ribbon ribbon-large ribbon-red" style="margin-top:52px;margin-right:-2px;">
<div class="banner">
<div class="text" style="text-align: center; font-size: 0.6em; position:relative;">
<div style="margin-top: -2px;line-height:120%">
<a href="{{ site.baseurl }}/blog/apache/2014/04/16/stratosphere-goes-apache-incubator.html">Soon in <br/> <span style="font-size: 1.1em;">Apache Incubator</span></a></div>
</div>
</div>
</div>
<!-- END RIBBON -->
<h1>Stratosphere</h1>
<p style="padding-bottom:2em">Big Data looks tiny from here.</p>
<p>
<a href="{{site.current_stable_dl}}" onclick="_gaq.push(['_trackEvent','Action','download-startpage',this.href]);" class="btn btn-info btn-lg" style="width:10em">
<i class="icon-download"> </i>Download {{site.current_stable}}
</a>
<a href="https://github.com/stratosphere/stratosphere" target="_blank" onclick="_gaq.push(['_trackEvent','Action','github',this.href]);" class="btn btn-primary btn-lg">
<i class="icon-github"> </i>View on GitHub
</a><br>
</p>
<p style="font-size:60%">Stable: {{site.current_stable}}, Beta: {{site.current_snapshot}}</p>
</div>
<div class="container features">
<div class="row" style="padding-top:40px;padding-bottom:40px;">
<div class="col-md-12">
<h2><strong>Stratosphere is the next-generation Big Data Analytics Platform</strong>.</h2>
<p class="lead">It features powerful programming abstractions in Java and Scala, a high-performance runtime, and automatic program optimization. Stratosphere has native support for iterations, incremental iterations, and programs consisting of large DAGs of operations.</p>
</div>
</div>
<div class="row feature-highlights">
<div class="col-md-4">
<h3><i class="icon-file-alt"></i> <a href="{{ site.baseurl }}/quickstart/setup.html">Easy to Install</a> </h3>
<p>Download and run Stratosphere programs in less than 5 minutes.</p>
</div>
<div class="col-md-4">
<h3><i class="icon-code"></i> <a href="{{ site.baseurl }}/quickstart">Easy to Use</a></h3>
<p>Beauty of Scala programming: specify what you want out of the data, not how the job is executed.</p>
</div>
<div class="col-md-4">
<h3><i class="icon-rocket"></i> <a href="#feature_operators" class="smsc">Advanced Analytics</a></h3>
<p>Iterative, arbitrarily large programs with multiple inputs and outputs.</p>
</div>
</div>
<div class="row feature-highlights">
<div class="col-md-4">
<h3><i class="icon-cloud"></i> <a href="#feature_stack">Run in the Cloud</a></h3>
<p>Instantly deploy Stratosphere on Amazon's EC2 and run your data analysis in the cloud.</p>
</div>
<div class="col-md-4">
<h3><i class="icon-rotate-left"></i> Performance </h3>
<p>Scale out to large clusters, exploit multi-core processors and in-memory processing.</p>
</div>
<div class="col-md-4">
<h3><i class="icon-puzzle-piece"></i> <a href="#feature_optimizer">Empowering Data Scientists</a></h3>
<p>Our optimizer automatically parallelizes and optimizes your programs.</p>
</div>
</div>
</div>
<hr style="border-top: 10px solid #eeeeee;margin-top:60px">
<div class="container features-detail" style="padding-top:40px;padding-bottom:40px;">
<div class="row">
<div class="col-md-12">
<h1 class="text-center"><strong>Features</strong></h1>
</div>
</div>
<div class="container features-detail">
<div class="row" id="feature_operators">
<div class="col-md-12">
<h3><strong>More Operators</strong></h3>
</div>
</div>
<div class="row">
<div class="col-md-6">
<p>Stratosphere extends the well-known MapReduce model with new operators. These operators represent many common data analysis tasks more naturally and efficiently. All operators start working in memory and gracefully go <a href="http://en.wikipedia.org/wiki/Out-of-core_algorithm" target="_blank">out of core</a> under memory pressure.</p>
</div>
<div class="col-md-6">
<img src="{{ site.baseurl }}/img/Stratosphere_Operators_extended.svg" onerror="this.onerror=null; this.src='{{ site.baseurl }}/img/Stratosphere_Operators_extended.svg'" class="img-responsive">
</div>
</div>
<hr>
<div class="row">
<div class="col-md-12">
<h3><strong>Advanced Data Flow Graphs</strong></h3>
</div>
</div>
<div class="row">
<div class="col-md-6">
<img src="{{ site.baseurl }}/img/Stratosphere_DAG.svg" onerror="this.onerror=null; this.src='{{ site.baseurl }}/img/Stratosphere_DAG.svg'" class="img-responsive">
</div>
<div class="col-md-6">
<p>Stratosphere lets you model analysis programs as advanced data flow graphs. For many applications, this is a more natural fit than the constrained MapReduce interface (strictly Map followed by Reduce). Furthermore, data pipelining and in-memory data transfers increase performance by drastically reducing disk and network I/O.</p>
<a class="text-small" id="see-how-hadoop-link">See how Hadoop does complex data flows</a>
</div>
</div>
<div class="row" id="see-how-hadoop">
<div class="col-md-12 bs-callout-info bs-callout">
<h4>Complex Data Flows in Hadoop</h4>
<img src="{{ site.baseurl }}/img/Stratosphere_Dataflow_Cmp.svg" onerror="this.onerror=null; this.src='{{ site.baseurl }}/img/Stratosphere_Dataflow_Cmp.svg'" class="img-responsive">
<p>Executing the plan shown on the left using MapReduce leads to a composition of multiple MapReduce jobs. Intermediate results are stored in HDFS after each job. This causes <b>a lot of network and disk I/O</b>. Remember also that just setting up a MapReduce job takes some time.
This example shows that many real-world applications do not fit the MapReduce model. Implementing complex data flows using MapReduce is also very time-consuming.</p>
<p>Stratosphere can execute the job natively. Everything is processed in memory. Only when the data no longer fits into memory does Stratosphere start using the local hard disks.</p>
</div>
</div>
<hr>
<div class="row">
<div class="col-md-12">
<h3><strong>Powerful Programming Interfaces</strong></h3>
</div>
</div>
<div class="row">
<div class="col-md-6">
<p>You can write data analysis programs for Stratosphere in Java or <a href="http://scala-lang.org" target="_blank">Scala</a>. Both APIs provide a powerful yet easy-to-use abstraction to compose data analysis programs by applying customizable transformations such as map, filter, reduce, and join on data sets. Stratosphere's high-level APIs hide the complexities of parallel programming and efficient data processing from the user. Behind the scenes, the Stratosphere optimizer compiles such programs into efficient, parallel data flows that are executed on a cluster or a local machine.
</p>
<i class="icon-arrow-right"></i> <a href="{{ site.baseurl }}/quickstart">See our Quickstart guides for Scala and Java</a>
</div>
<div class="col-md-6">
<b>Word count</b> in Stratosphere using Scala: <br><br>
{% highlight scala %}
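// Read the input text line by line (textInput is assumed to be the input path).
val input = TextFile(textInput)
// Split each line into words.
val words = input.flatMap { line => line.split(" ") }
// Group identical words and count the occurrences in each group.
val counts = words
  .groupBy { word => word }
  .count()
// Write the counts as CSV (wordsOutput is assumed to be the output path).
val output = counts.write(wordsOutput, CsvOutputFormat())
// Wrap the data sink in a plan that can be submitted for execution.
val plan = new ScalaPlan(Seq(output))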
{% endhighlight %}
</div>
</div>
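<div class="row">
<div class="col-md-12">
<p>To make the transformation vocabulary above more concrete, the following is a minimal local sketch of map, filter, reduce, and join semantics. It uses plain Scala collections on a single machine and is <b>not</b> the Stratosphere API; the object name <code>TransformationsSketch</code> and the sample records are hypothetical and chosen only for illustration.</p>
{% highlight scala %}
// Plain Scala collections, not the Stratosphere API: a local, single-machine
// illustration of what map, filter, reduce, and join compute.
object TransformationsSketch {
  // Hypothetical sample records: (userId, name) and (userId, amount).
  val users  = Seq((1, "ada"), (2, "bob"), (3, "eve"))
  val orders = Seq((1, 20.0), (1, 5.0), (2, 7.5), (4, 99.0))

  def main(args: Array[String]): Unit = {
    // map: transform every record (capitalize the user name)
    val named = users.map { case (userId, name) => (userId, name.capitalize) }

    // filter: keep only orders of at least 6.0
    val large = orders.filter { case (_, amount) => amount >= 6.0 }

    // reduce per key: sum the remaining amounts for each user
    val totals = large
      .groupBy { case (userId, _) => userId }
      .map { case (userId, group) => (userId, group.map(_._2).sum) }

    // join: inner join of users and totals on userId
    val joined = for {
      (userId, name) <- named
      total <- totals.get(userId)
    } yield (name, total)

    // Prints (Ada,20.0) and (Bob,7.5); user 3 has no orders and
    // order user 4 has no matching user, so both are dropped.
    joined.foreach(println)
  }
}
{% endhighlight %}
<p>In a parallel system the same logical steps would run on partitioned data across many machines; this sketch only shows what each transformation computes, not how it is executed.</p>
</div>
</div>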
<hr>
<div class="row">
<div class="col-md-12">
<h3><strong>Support for Iterative Algorithms</strong></h3>
</div>
</div>
<div class="row">
<div class="col-md-6">
<img src="{{ site.baseurl }}/img/Stratosphere_Iteration.svg" onerror="this.onerror=null; this.src='{{ site.baseurl }}/img/Stratosphere_Iteration.svg'" class="img-responsive">
</div>
<div class="col-md-6">
<p>Data mining, machine learning, and graph processing algorithms often need to loop over the working data multiple times.
Stratosphere supports iterative algorithms in its core. (The runtime allows for very fast iteration times and the optimizer deals with caching loop-invariant data.) The advanced <i>incremental iterations</i> support algorithms that focus on the "hot part" of the evolving solution and may converge even faster.
</p>
<a class="text-small" id="see-how-hadoop-iterations-link">Iterative algorithms with Hadoop</a>
</div>
</div>
<div class="row" id="see-how-hadoop-iterations">
<div class="col-md-12 bs-callout-info bs-callout">
<h4>Iterations with Hadoop</h4>
<img src="{{ site.baseurl }}/img/Hadoop_Iterations.svg" onerror="this.onerror=null; this.src='{{ site.baseurl }}/img/Hadoop_Iterations.svg'" class="img-responsive">
<p>Iterative algorithms are implemented in Hadoop MapReduce using a central driver that spawns MapReduce jobs until the result has been computed. This approach has many disadvantages:</p>
<ul>
<li>Each MapReduce job needs at least 30 seconds for setup</li>
<li>Data is transferred between jobs only via HDFS (in-memory would be much faster)</li>
<li>All data has to be read over and over again, even the parts that have already converged</li>
<li>Very time-consuming to implement</li>
</ul>
<p>Stratosphere <b>natively executes iterative algorithms</b>. The result of the last operator is fed back to the input of the first operator in memory, so there is no need to start a new job for each iteration. Stratosphere detects which parts of the data need processing in further iterations and loads only those.</p>
</div>
</div>
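<div class="row">
<div class="col-md-12">
<p>The difference between recomputing everything in each pass and focusing on the "hot part" of the solution can be sketched locally. The following example labels connected components in a small graph, once with a bulk iteration and once with an incremental (workset) iteration. It is plain Scala and <b>not</b> the Stratosphere iteration API; the object name, the sample graph, and the evaluation counters are hypothetical and exist only for illustration.</p>
{% highlight scala %}
// Local sketch of bulk vs. incremental (workset) iteration, using connected
// components as the example. Plain Scala; not the Stratosphere API.
object IncrementalIterationSketch {
  // Hypothetical undirected graph as adjacency lists.
  // It has two components: {1, 2, 3} and {4, 5}.
  val neighbors: Map[Int, Seq[Int]] = Map(
    1 -> Seq(2), 2 -> Seq(1, 3), 3 -> Seq(2),
    4 -> Seq(5), 5 -> Seq(4))

  // Bulk iteration: every vertex is re-evaluated in every pass
  // until a pass changes nothing.
  def bulk(): (Map[Int, Int], Int) = {
    var labels = neighbors.keys.map(v => v -> v).toMap
    var evaluations = 0
    var changed = true
    while (changed) {
      val next = labels.map { case (v, label) =>
        evaluations += 1
        v -> (label +: neighbors(v).map(labels)).min
      }
      changed = next != labels
      labels = next
    }
    (labels, evaluations)
  }

  // Incremental iteration: only vertices with a neighbor that changed in the
  // previous pass (the workset) are re-evaluated.
  def incremental(): (Map[Int, Int], Int) = {
    var labels = neighbors.keys.map(v => v -> v).toMap
    var workset: Set[Int] = neighbors.keySet
    var evaluations = 0
    while (workset.nonEmpty) {
      val updated = workset.toSeq.flatMap { v =>
        evaluations += 1
        val best = (labels(v) +: neighbors(v).map(labels)).min
        if (best < labels(v)) Some(v -> best) else None
      }
      labels = labels ++ updated
      workset = updated.flatMap { case (v, _) => neighbors(v) }.toSet
    }
    (labels, evaluations)
  }

  def main(args: Array[String]): Unit = {
    val (bulkLabels, bulkWork) = bulk()
    val (incLabels, incWork) = incremental()
    println(s"same result: ${bulkLabels == incLabels}")
    println(s"vertex evaluations: bulk=$bulkWork, incremental=$incWork")
  }
}
{% endhighlight %}
<p>Both variants should arrive at the same component labels, while the incremental variant evaluates fewer vertices because converged parts of the graph drop out of the workset.</p>
</div>
</div>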
<hr>
<div class="row" id="feature_runtime">
<div class="col-md-12">
<h3><strong>High-Performance Execution Runtime</strong></h3>
</div>
</div>
<div class="row">
<div class="col-md-6">
<p>Stratosphere features its own high-performance, massively parallel execution runtime, built from the ground up using processing techniques from parallel database systems. The engine supports low-latency processing concepts such as pipelined execution, in-memory processing, and push-based data shipping, as well as sort- and hash-based processing algorithms that gracefully go out of core when main memory is not sufficient.</p>
</div>
<div class="col-md-6">
<img src="{{ site.baseurl }}/img/Stratosphere_Execution.svg" onerror="this.onerror=null; this.src='{{ site.baseurl }}/img/Stratosphere_Execution.svg'" class="img-responsive">
</div>
</div>
<hr>
<div class="row" id="feature_optimizer">
<div class="col-md-12">
<h3><strong>Built-In Optimizer</strong></h3>
</div>
</div>
<div class="row">
<div class="col-md-6">
<p>Stratosphere comes with an optimizer that is independent of the actual programming interface. It chooses a fitting execution strategy depending on the inputs and operations. For example, for a "Join" operator it chooses between partitioning and broadcasting the data, as well as between a sort-merge join and a hybrid hash join algorithm.</p>
</div>
<div class="col-md-6">
<ul style="padding-top:0;margin-top:0">
<li>Cost-based choice of operators and shipping strategies</li>
<li>In-memory pipelining of operators</li>
<li>Reduction of shipped and written data volume</li>
<li>Global memory distribution</li>
<li>Input sampling to determine cardinalities</li>
</ul>
<p><i class="icon-arrow-right"></i> Focus on your application logic rather than parallel execution.</p>
</div>
</div>
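<div class="row">
<div class="col-md-12">
<p>As a rough illustration of how a cost-based choice between broadcasting and partitioning might look, here is a deliberately simplified sketch. The cost formulas, the names <code>ShipStrategy</code>, <code>networkCost</code>, and <code>choose</code>, and the example sizes are all assumptions made for this example; Stratosphere's actual optimizer and cost model are not shown here and are certainly more elaborate.</p>
{% highlight scala %}
// Hypothetical, simplified cost model for choosing a join shipping strategy.
// This is not Stratosphere's optimizer; it only illustrates the idea.
object JoinStrategySketch {
  sealed trait ShipStrategy
  case object BroadcastLeft extends ShipStrategy
  case object BroadcastRight extends ShipStrategy
  case object RepartitionBoth extends ShipStrategy

  val GiB: Long = 1024L * 1024 * 1024

  // Estimated number of bytes sent over the network for each strategy.
  def networkCost(strategy: ShipStrategy,
                  leftBytes: Long,
                  rightBytes: Long,
                  parallelism: Int): Long =
    strategy match {
      // Broadcasting sends a full copy of one input to every parallel worker.
      case BroadcastLeft   => leftBytes * parallelism
      case BroadcastRight  => rightBytes * parallelism
      // Repartitioning ships each record of both inputs at most once.
      case RepartitionBoth => leftBytes + rightBytes
    }

  // Pick the strategy with the lowest estimated network cost.
  def choose(leftBytes: Long, rightBytes: Long, parallelism: Int): ShipStrategy =
    Seq[ShipStrategy](BroadcastLeft, BroadcastRight, RepartitionBoth)
      .minBy(s => networkCost(s, leftBytes, rightBytes, parallelism))

  def main(args: Array[String]): Unit = {
    // A tiny left input joined with a large right input: broadcast the left.
    println(choose(leftBytes = 10L * 1024, rightBytes = 50 * GiB, parallelism = 32))
    // Two large inputs of similar size: repartition both.
    println(choose(leftBytes = 40 * GiB, rightBytes = 50 * GiB, parallelism = 32))
  }
}
{% endhighlight %}
<p>With these example sizes the sketch prints <code>BroadcastLeft</code> for the first join and <code>RepartitionBoth</code> for the second. A real optimizer would also need size estimates for its inputs and would weigh local strategies such as sort-merge versus hash join.</p>
</div>
</div>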
<hr>
<div class="row" id="feature_stack">
<div class="col-md-12">
<h3><strong>System Stack</strong></h3>
</div>
</div>
<div class="row">
<div class="col-md-6">
<img src="{{ site.baseurl }}/img/Stratosphere_Stack.svg" onerror="this.onerror=null; this.src='{{ site.baseurl }}/img/Stratosphere_Stack.svg'" class="img-responsive">
</div>
<div class="col-md-6">
<p>Stratosphere seamlessly integrates into existing Hadoop setups and runs side-by-side with Hadoop's TaskTrackers and DataNodes. Stratosphere can read data from Hadoop sources, but comes with its own efficient runtime. Similar to Hadoop, Stratosphere scales by adding more machines to the cluster. <br>
Stratosphere also runs on Hadoop 2.2 (YARN), so you do not need to change your infrastructure. <br>
The local execution mode lets you debug and analyze your application right from your favorite IDE, without having Stratosphere installed.</p>
</div>
</div>
<hr>
<div class="row">
<div class="col-md-12">
<h3><strong>Open Source Community and Support</strong></h3>
</div>
</div>
<div class="row">
<div class="col-md-6">
<p>Stratosphere is an active, community-driven open-source project. It is licensed under the Apache License. <br>
Our friendly community is always open to new users and developers. Join us and shape the future of Big Data.
</p>
</div>
<div class="col-md-3">
<!-- <p><a href="https://groups.google.com/forum/#!forum/stratosphere-users" target="_blanc" onclick="_gaq.push(['_trackEvent','Action','index_mailinglist_users',this.href]);" class="btn btn-default btn-lg" style="width: 100%;"><i class="icon-plus-sign-alt"> </i>Mailing List (Users)</a></p> -->
<p><a href="https://groups.google.com/forum/#!forum/stratosphere-dev" target="_blank" onclick="_gaq.push(['_trackEvent','Action','index_mailinglist_dev',this.href]);" class="btn btn-default btn-lg" style="width: 100%;"><i class="icon-plus-sign-alt"> </i>Mailing List (Dev)</a></p>
</div>
<div class="col-md-3">
<a href="https://github.com/stratosphere/stratosphere/issues" target="_blank" onclick="_gaq.push(['_trackEvent','Action','github-issue',this.href]);" class="btn btn-default btn-lg" style="width: 100%;"><i class="icon-github"> </i>Open an Issue on GitHub</a>
</div>
</div>
<hr>
<div class="row">
<div class="col-md-12">
<h3><strong>Next Steps</strong></h3>
<p><a href="{{ site.baseurl }}/downloads">Download</a> and try Stratosphere. Our Quickstart scripts make it easy for developers to create an empty Stratosphere program skeleton to start from. Dependencies are handled by Maven without any separate installation. Testing and debugging are possible directly inside the IDE with Stratosphere's embedded mode.
Ready-made binaries for cluster setups are available as well.
</p>
<p>Visit <a href="http://bigdataclass.org" target="_blank">bigdataclass.org</a> for Stratosphere programming exercises.</p>
</div>
</div>
<div class="row">
<div class="col-md-12">
<p class="text-center" style="padding-top:60px"><a href="#top"><i class="icon-collapse-top"></i> Back to top</a></p>
</div>
</div>
</div>
</div>