################################################################
# Introduction
################################################################
This bibliography is a resource for people writing papers that refer
to the Google cluster traces. It covers papers that analyze the
traces, as well as ones that use them as inputs to other studies.
* I recommend using \usepackage{url}.
* Entries are in publication-date order, with the most recent at the top.
* Bibtex ignores stuff that is outside the entries, so text like this is safe.
The following are the RECOMMENDED CITATIONS if you just need the basics:
* Borg:
* \cite{clusterdata:Verma2015, clusterdata:Tirmazi2020} for Borg itself
* 2019 traces:
  * \cite{clusterdata:Wilkes2020, clusterdata:Wilkes2020a, clusterdata:Tirmazi2020} for
    the complete set of information about the trace itself.
* \cite{clusterdata:Wilkes2020} for the 2019 trace announcement
* \cite{clusterdata:Wilkes2020a} for the details about the 2019 trace contents
* \cite{clusterdata:Tirmazi2020} for the EuroSys paper about the 2019 and 2011 traces
* 2011 trace:
* \cite{clusterdata:Wilkes2011, clusterdata:Reiss2011} for the trace itself
* \cite{clusterdata:Reiss2012b} for the first thorough analysis of it.
If you use the traces, please send a bibtex entry that looks *exactly* like one
of these to [email protected], so your paper can be added - and cited! A
Github pull request is the best format.
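As a sketch of how these recommended citations fit into a paper (document class, bibliography style, and the .bib file name are assumptions; adjust to your venue's template):

```latex
% Minimal example of citing the 2019 traces from this bibliography.
\documentclass{article}
\usepackage{url}  % needed: many entries here use \url{...} in their note fields
\begin{document}
We use the 2019 Google cluster
traces~\cite{clusterdata:Wilkes2020, clusterdata:Wilkes2020a, clusterdata:Tirmazi2020}.
\bibliographystyle{plain}
\bibliography{bibliography}  % assumes this file is saved as bibliography.bib
\end{document}
```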
################################################################
# Trace-announcements
################################################################
These entries can be used to cite the traces themselves.
# The May 2019 traces.
# Use clusterdata:Tirmazi2020 for the first paper to analyze them.
# This is the formal announcement of the trace:
@Misc{clusterdata:Wilkes2020,
author = {John Wilkes},
title = {Yet more {Google} compute cluster trace data},
howpublished = {Google research blog},
month = Apr,
year = 2020,
address = {Mountain View, CA, USA},
note = {Posted at \url{https://ai.googleblog.com/2020/04/yet-more-google-compute-cluster-trace.html}.},
}
# If you want to cite details about the trace itself:
@TechReport{clusterdata:Wilkes2020a,
author = {John Wilkes},
title = {{Google} cluster-usage traces v3},
institution = {Google Inc.},
year = 2020,
month = Apr,
type = {Technical Report},
address = {Mountain View, CA, USA},
note = {Posted at \url{https://github.com/google/cluster-data/blob/master/ClusterData2019.md}},
abstract = {
This document describes the semantics, data format, and
schema of usage traces of a few Google compute cells.
This document describes version 3 of the trace format.},
}
#----------------
The next couple are for the May 2011 "full" trace.
@Misc{clusterdata:Wilkes2011,
author = {John Wilkes},
title = {More {Google} cluster data},
howpublished = {Google research blog},
month = Nov,
year = 2011,
address = {Mountain View, CA, USA},
note = {Posted at \url{http://googleresearch.blogspot.com/2011/11/more-google-cluster-data.html}.},
}
@TechReport{clusterdata:Reiss2011,
author = {Charles Reiss and John Wilkes and Joseph L. Hellerstein},
title = {{Google} cluster-usage traces: format + schema},
institution = {Google Inc.},
year = 2011,
month = Nov,
type = {Technical Report},
address = {Mountain View, CA, USA},
note = {Revised 2014-11-17 for version 2.1. Posted at
\url{https://github.com/google/cluster-data}},
}
#----------------
# The next one is for the earlier "small" 7-hour trace.
# (Most people should not be using this.)
@Misc{clusterdata:Hellersetein2010,
author = {Joseph L. Hellerstein},
title = {{Google} cluster data},
howpublished = {Google research blog},
month = Jan,
year = 2010,
note = {Posted at \url{http://googleresearch.blogspot.com/2010/01/google-cluster-data.html}.},
}
#----------------
The canonical Borg paper.
@inproceedings{clusterdata:Verma2015,
title = {Large-scale cluster management at {Google} with {Borg}},
author = {Abhishek Verma and Luis Pedrosa and Madhukar R. Korupolu and David Oppenheimer and Eric Tune and John Wilkes},
year = {2015},
booktitle = {Proceedings of the European Conference on Computer Systems (EuroSys'15)},
address = {Bordeaux, France},
articleno = {18},
numpages = {17},
abstract = {
Google's Borg system is a cluster manager that runs hundreds of thousands of jobs,
from many thousands of different applications, across a number of clusters each with
up to tens of thousands of machines.
It achieves high utilization by combining admission control, efficient task-packing,
over-commitment, and machine sharing with process-level performance isolation.
It supports high-availability applications with runtime features that minimize
fault-recovery time, and scheduling policies that reduce the probability of correlated
failures. Borg simplifies life for its users by offering a declarative job specification
language, name service integration, real-time job monitoring, and tools to analyze and
simulate system behavior.
We present a summary of the Borg system architecture and features, important design
decisions, a quantitative analysis of some of its policy decisions, and a qualitative
examination of lessons learned from a decade of operational experience with it.},
url = {https://dl.acm.org/doi/10.1145/2741948.2741964},
doi = {10.1145/2741948.2741964},
}
#----------------
The next paper describes the policy choices and technologies used to
make the traces safe to release.
@InProceedings{clusterdata:Reiss2012,
author = {Charles Reiss and John Wilkes and Joseph L. Hellerstein},
title = {Obfuscatory obscanturism: making workload traces of
commercially-sensitive systems safe to release},
year = 2012,
booktitle = {3rd International Workshop on Cloud Management (CLOUDMAN)},
month = Apr,
publisher = {IEEE},
pages = {1279--1286},
address = {Maui, HI, USA},
abstract = {Cloud providers such as Google are interested in fostering
research on the daunting technical challenges they face in
supporting planetary-scale distributed systems, but no
academic organizations have similar scale systems on which to
experiment. Fortunately, good research can still be done using
traces of real-life production workloads, but there are risks
in releasing such data, including inadvertently disclosing
confidential or proprietary information, as happened with the
Netflix Prize data. This paper discusses these risks, and our
approach to them, which we call systematic obfuscation. It
protects proprietary and personal data while leaving it
possible to answer interesting research questions. We explain
and motivate some of the risks and concerns and propose how
they can best be mitigated, using as an example our recent
publication of a month-long trace of a production system
workload on a 11k-machine cluster.},
url = {http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6212064},
}
################################################################
# Trace-analysis papers
################################################################
These papers are primarily about analyzing the traces.
Order: most recent first.
If you just want one citation about the Cluster2011 trace, then
use \cite{clusterdata:Reiss2012b}.
################ 2022
@inproceedings {clusterdata:jajooSLearn2022,
author = {Akshay Jajoo and Y. Charlie Hu and Xiaojun Lin and Nan Deng},
title = {A Case for Task Sampling based Learning for Cluster Job Scheduling},
booktitle = {19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)},
year = {2022},
address = {Renton, WA, USA},
url = {https://www.usenix.org/conference/nsdi22/presentation/jajoo},
publisher = {USENIX Association},
keywords = {data centers, big data, job scheduling, learning, online learning},
abstract = {The ability to accurately estimate job runtime properties allows a
scheduler to effectively schedule jobs. State-of-the-art online cluster job
schedulers use history-based learning, which uses past job execution information
to estimate the runtime properties of newly arrived jobs. However, with fast-paced
development in cluster technology (in both hardware and software) and changing user
inputs, job runtime properties can change over time, which lead to inaccurate predictions.
In this paper, we explore the potential and limitation of real-time learning of job
runtime properties, by proactively sampling and scheduling a small fraction of the
tasks of each job. Such a task-sampling-based approach exploits the similarity among
runtime properties of the tasks of the same job and is inherently immune to changing
job behavior. Our study focuses on two key questions in comparing task-sampling-based
learning (learning in space) and history-based learning (learning in time): (1) Can
learning in space be more accurate than learning in time? (2) If so, can delaying
scheduling the remaining tasks of a job till the completion of sampled tasks be more
than compensated by the improved accuracy and result in improved job performance? Our
analytical and experimental analysis of 3 production traces with different skew and job
distribution shows that learning in space can be substantially more accurate. Our
simulation and testbed evaluation on Azure of the two learning approaches anchored in a
generic job scheduler using 3 production cluster job traces shows that despite its online
overhead, learning in space reduces the average Job Completion Time (JCT) by 1.28x, 1.56x,
and 1.32x compared to the prior-art history-based predictor.},
}
################ 2021
@article{clusterdata:jajooSLearnTechReport2021,
author = {Akshay Jajoo and Y. Charlie Hu and Xiaojun Lin and Nan Deng},
title = {The Case for Task Sampling based Learning for Cluster Job Scheduling},
journal = {Computing Research Repository},
volume = {abs/2108.10464},
year = {2021},
url = {https://arxiv.org/abs/2108.10464},
eprinttype = {arXiv},
eprint = {2108.10464},
timestamp = {Fri, 27 Aug 2021 15:02:29 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2108-10464.bib},
bibsource = {dblp computer science bibliography, https://dblp.org},
keywords = {data centers, big data, job scheduling, learning, online learning},
abstract = {The ability to accurately estimate job runtime properties allows a
scheduler to effectively schedule jobs. State-of-the-art online cluster job
schedulers use history-based learning, which uses past job execution information
to estimate the runtime properties of newly arrived jobs. However, with fast-paced
development in cluster technology (in both hardware and software) and changing user
inputs, job runtime properties can change over time, which lead to inaccurate predictions.
In this paper, we explore the potential and limitation of real-time learning of job
runtime properties, by proactively sampling and scheduling a small fraction of the
tasks of each job. Such a task-sampling-based approach exploits the similarity among
runtime properties of the tasks of the same job and is inherently immune to changing
job behavior. Our study focuses on two key questions in comparing task-sampling-based
learning (learning in space) and history-based learning (learning in time): (1) Can
learning in space be more accurate than learning in time? (2) If so, can delaying
scheduling the remaining tasks of a job till the completion of sampled tasks be more
than compensated by the improved accuracy and result in improved job performance? Our
analytical and experimental analysis of 3 production traces with different skew and job
distribution shows that learning in space can be substantially more accurate. Our
simulation and testbed evaluation on Azure of the two learning approaches anchored in a
generic job scheduler using 3 production cluster job traces shows that despite its online
overhead, learning in space reduces the average Job Completion Time (JCT) by 1.28x, 1.56x,
and 1.32x compared to the prior-art history-based predictor.},
}
################ 2020
@inproceedings{clusterdata:Tirmazi2020,
author = {Tirmazi, Muhammad and Barker, Adam and Deng, Nan and Haque, Md E. and Qin, Zhijing Gene and Hand, Steven and Harchol-Balter, Mor and Wilkes, John},
title = {{Borg: the Next Generation}},
year = {2020},
isbn = {9781450368827},
publisher = {ACM},
address = {Heraklion, Greece},
url = {https://doi.org/10.1145/3342195.3387517},
doi = {10.1145/3342195.3387517},
booktitle = {Proceedings of the Fifteenth European Conference on Computer Systems (EuroSys'20)},
articleno = {30},
numpages = {14},
keywords = {data centers, cloud computing},
abstract = {
This paper analyzes a newly-published trace that covers 8
different Borg clusters for the month of May 2019. The
trace enables researchers to explore how scheduling works in
large-scale production compute clusters. We highlight how
Borg has evolved and perform a longitudinal comparison of
the newly-published 2019 trace against the 2011 trace, which
has been highly cited within the research community.
Our findings show that Borg features such as alloc sets
are used for resource-heavy workloads; automatic vertical
scaling is effective; job-dependencies account for much of
the high failure rates reported by prior studies; the workload
arrival rate has increased, as has the use of resource
over-commitment; the workload mix has changed, jobs have
migrated from the free tier into the best-effort batch tier;
the workload exhibits an extremely heavy-tailed distribution
where the top 1\% of jobs consume over 99\% of resources; and
there is a great deal of variation between different clusters.},
}
################ 2018
@article{clusterdata:Sebastio2018,
title = {Characterizing machines lifecycle in Google data centers},
journal = {Performance Evaluation},
volume = 126,
pages = {39 -- 63},
year = 2018,
issn = {0166-5316},
doi = {https://doi.org/10.1016/j.peva.2018.08.001},
url = {http://www.sciencedirect.com/science/article/pii/S016653161830004X},
author = {Stefano Sebastio and Kishor S. Trivedi and Javier Alonso},
keywords = {Statistical analysis, Distributed architectures, Cloud computing, System reliability, Large-scale systems, Empirical studies},
abstract = {Due to the increasing need for computational power, the market has
shifted towards big centralized data centers. Understanding the nature
of the dynamics of these data centers from machine and job/task
perspective is critical to design efficient data center management
policies like optimal resource/power utilization, capacity planning and
optimal (reactive and proactive) maintenance scheduling. Whereas
jobs/tasks dynamics have received a lot of attention, the study of the
dynamics of the underlying machines supporting the jobs/tasks execution
has received much less attention, even when these dynamics would
substantially affect the performance of the jobs/tasks execution. Given
the limited data available from large computing installations, only a
few previous studies have inspected data centers and only concerning
failures and their root causes. In this paper, we study the 2011 Google
data center traces from the machine dynamics perspective. First, we
characterize the machine events and their underlying distributions in
order to have a better understanding of the entire machine lifecycle.
Second, we propose a data-driven model to enable the estimate of the
expected number of available machines at any instant of time. The model
is parameterized and validated using the empirical data collected by
Google during a one month period.}
}
################ 2017
@Inbook{clusterdata:Ray2017,
  author = {Ray, Biplob R. and Chowdhury, Morshed and Atif, Usman},
  editor = {Doss, Robin and Piramuthu, Selwyn and Zhou, Wei},
title = {Is {High Performance Computing (HPC)} Ready to Handle Big Data?},
bookTitle = {Future Network Systems and Security},
year = 2017,
month = Aug,
publisher = {Springer},
address = {Cham, Switzerland},
pages = {97--112},
abstract={In recent years big data has emerged as a universal term and its
management has become a crucial research topic. The phrase `big data'
refers to data sets so large and complex that the processing of them
requires collaborative High Performance Computing (HPC). How to
effectively allocate resources is one of the prime challenges in
HPC. This leads us to the question: are the existing HPC resource
allocation techniques effective enough to support future big data
challenges? In this context, we have investigated the effectiveness of
HPC resource allocation using the Google cluster dataset and a number of
data mining tools to determine the correlational coefficient between
resource allocation, resource usages and priority. Our analysis
initially focused on correlation between resource allocation and
resource uses. The finding shows that a high volume of resources that
are allocated by the system for a job are not being used by that same
job. To investigate further, we analyzed the correlation between
resource allocation, resource usages and priority. Our clustering,
classification and prediction techniques identified that the allocation
and uses of resources are very loosely correlated with priority of the
jobs. This research shows that our current HPC scheduling needs
improvement in order to accommodate the big data challenge
efficiently.},
keywords = {Big data; HPC; Data mining; QoS; Correlation },
isbn = {978-3-319-65548-2},
doi = {10.1007/978-3-319-65548-2_8},
url = {https://doi.org/10.1007/978-3-319-65548-2_8},
}
@INPROCEEDINGS{clusterdata:Elsayed2017,
author = {Nosayba El-Sayed and Hongyu Zhu and Bianca Schroeder},
title = {Learning from Failure Across Multiple Clusters: A Trace-Driven Approach to Understanding, Predicting, and Mitigating Job Terminations},
booktitle={International Conference on Distributed Computing Systems (ICDCS)},
year=2017,
month=Jun,
pages={1333--1344},
abstract={In large-scale computing platforms, jobs are prone to interruptions
and premature terminations, limiting their usability and leading to
significant waste in cluster resources. In this paper, we tackle this
problem in three steps. First, we provide a comprehensive study based on
log data from multiple large-scale production systems to identify
patterns in the behaviour of unsuccessful jobs across different clusters
and investigate possible root causes behind job termination. Our results
reveal several interesting properties that distinguish unsuccessful jobs
from others, particularly w.r.t. resource consumption patterns and job
configuration settings. Secondly, we design a machine learning-based
framework for predicting job and task terminations. We show that job
failures can be predicted relatively early with high precision and
recall, and also identify attributes that have strong predictive power
of job failure. Finally, we demonstrate in a concrete use case how our
prediction framework can be used to mitigate the effect of unsuccessful
execution using an effective task-cloning policy that we propose.},
keywords={learning (artificial intelligence);parallel
processing;resource allocation;software fault tolerance; job
configuration settings;job failures prediction;job
terminations mitigation;job terminations prediction;
large-scale computing platforms;machine learning-based
framework;resource consumption patterns; task-cloning
policy;trace-driven approach;Computer crashes;Electric
breakdown;Google;Large-scale systems; Linear systems;Parallel
processing;Program processors;Failure Mitigation;Failure
Prediction;Job Failure; Large-Scale Systems;Reliability;Trace
	Analysis},
  doi = {10.1109/ICDCS.2017.317},
  issn = {1063-6927},
}
################ 2014
@INPROCEEDINGS{clusterdata:Abdul-Rahman2014,
author = {Abdul-Rahman, Omar Arif and Aida, Kento},
title = {Towards understanding the usage behavior of {Google} cloud
users: the mice and elephants phenomenon},
booktitle = {IEEE International Conference on Cloud Computing
Technology and Science (CloudCom)},
year = 2014,
month = dec,
address = {Singapore},
pages = {272--277},
keywords = {Google trace; Workload trace analysis; User session view;
Application composition; Mass-Count disparity; Exploratory statistical
analysis; Visual analysis; Color-schemed graphs; Coarse grain
classification; Heavy-tailed distributions; Long-tailed lognormal
distributions; Exponential distribution; Normal distribution; Discrete
modes; Large web services; Batch processing; MapReduce computation;
Human users; },
abstract = {In the era of cloud computing, users encounter the challenging
task of effectively composing and running their applications on the
cloud. In an attempt to understand user behavior in constructing
applications and interacting with typical cloud infrastructures, we
analyzed a large utilization dataset of Google cluster. In the present
paper, we consider user behavior in composing applications from the
perspective of topology, maximum requested computational resources, and
workload type. We model user dynamic behavior around the user's session
view. Mass-Count disparity metrics are used to investigate the
characteristics of underlying statistical models and to characterize
users into distinct groups according to their composition and behavioral
classes and patterns. The present study reveals interesting insight into
the heterogeneous structure of the Google cloud workload.},
doi = {10.1109/CloudCom.2014.75},
}
################ 2013
@inproceedings{clusterdata:Di2013,
title = {Characterizing cloud applications on a {Google} data center},
author = {Di, Sheng and Kondo, Derrick and Franck, Cappello},
booktitle = {42nd International Conference on Parallel Processing (ICPP)},
year = 2013,
month = Oct,
address = {Lyon, France},
abstract = {In this paper, we characterize Google applications,
based on a one-month Google trace with over 650k jobs running
across over 12000 heterogeneous hosts from a Google data
center. On one hand, we carefully compute the valuable
statistics about task events and resource utilization for
Google applications, based on various types of resources (such
as CPU, memory) and execution types (e.g., whether they can
run batch tasks or not). Resource utilization per application
is observed with an extremely typical Pareto principle. On the
other hand, we classify applications via a K-means clustering
algorithm with optimized number of sets, based on task events
and resource usage. The number of applications in the Kmeans
clustering sets follows a Pareto-similar distribution. We
believe our work is very interesting and valuable for the
further investigation of Cloud environment.},
}
################ 2012
@INPROCEEDINGS{clusterdata:Reiss2012b,
title = {Heterogeneity and dynamicity of clouds at scale: {Google}
trace analysis},
author = {Charles Reiss and Alexey Tumanov and Gregory R. Ganger and
Randy H. Katz and Michael A. Kozuch},
booktitle = {ACM Symposium on Cloud Computing (SoCC)},
year = 2012,
month = Oct,
address = {San Jose, CA, USA},
abstract = {To better understand the challenges in developing effective
cloud-based resource schedulers, we analyze the first publicly available
trace data from a sizable multi-purpose cluster. The most notable
workload characteristic is heterogeneity: in resource types (e.g.,
cores:RAM per machine) and their usage (e.g., duration and resources
needed). Such heterogeneity reduces the effectiveness of traditional
slot- and core-based scheduling. Furthermore, some tasks are
constrained as to the kind of machine types they can use, increasing the
complexity of resource assignment and complicating task migration. The
workload is also highly dynamic, varying over time and most workload
features, and is driven by many short jobs that demand quick scheduling
decisions. While few simplifying assumptions apply, we find that many
longer-running jobs have relatively stable resource utilizations, which
can help adaptive resource schedulers.},
url = {http://www.pdl.cmu.edu/PDL-FTP/CloudComputing/googletrace-socc2012.pdf},
privatenote = {An earlier version of this was posted at
\url{http://www.istc-cc.cmu.edu/publications/papers/2012/ISTC-CC-TR-12-101.pdf},
and included here as clusterdata:Reiss2012a. Please use this
version instead of that.},
}
@INPROCEEDINGS{clusterdata:Liu2012,
author = {Zitao Liu and Sangyeun Cho},
title = {Characterizing machines and workloads on a {Google} cluster},
booktitle = {8th International Workshop on Scheduling and Resource
Management for Parallel and Distributed Systems (SRMPDS)},
year = 2012,
month = Sep,
address = {Pittsburgh, PA, USA},
abstract = {Cloud computing offers high scalability, flexibility and
cost-effectiveness to meet emerging computing
requirements. Understanding the characteristics of real workloads on a
large production cloud cluster benefits not only cloud service providers
but also researchers and daily users. This paper studies a large-scale
Google cluster usage trace dataset and characterizes how the machines in
the cluster are managed and the workloads submitted during a 29-day
period behave. We focus on the frequency and pattern of machine
maintenance events, job- and task-level workload behavior, and how the
overall cluster resources are utilized.},
url = {http://www.cs.pitt.edu/cast/abstract/liu-srmpds12.html},
}
@INPROCEEDINGS{clusterdata:Di2012a,
author = {Sheng Di and Derrick Kondo and Walfredo Cirne},
title = {Characterization and comparison of cloud versus {Grid} workloads},
booktitle = {International Conference on Cluster Computing (IEEE CLUSTER)},
year = 2012,
month = Sep,
pages = {230--238},
address = {Beijing, China},
abstract = {A new era of Cloud Computing has emerged, but the characteristics
of Cloud load in data centers is not perfectly clear. Yet this
characterization is critical for the design of novel Cloud job and
resource management systems. In this paper, we comprehensively
characterize the job/task load and host load in a real-world production
data center at Google Inc. We use a detailed trace of over 25 million
tasks across over 12,500 hosts. We study the differences between a
Google data center and other Grid/HPC systems, from the perspective of
both work load (w.r.t. jobs and tasks) and host load
(w.r.t. machines). In particular, we study the job length, job
submission frequency, and the resource utilization of jobs in the
different systems, and also investigate valuable statistics of machine's
maximum load, queue state and relative usage levels, with different job
priorities and resource attributes. We find that the Google data center
exhibits finer resource allocation with respect to CPU and memory than
that of Grid/HPC systems. Google jobs are always submitted with much
higher frequency and they are much shorter than Grid jobs. As such,
Google host load exhibits higher variance and noise.},
keywords = {cloud computing;computer centres;grid computing;queueing
theory;resource allocation;search engines;CPU;Google data
center;cloud computing;cloud job;cloud load;data centers;grid
workloads;grid-HPC systems;host load;job length;job submission
frequency;jobs resource utilization;machine maximum load;queue
state;real-world production data center;relative usage
levels;resource allocation;resource attributes;resource
management systems;task load;Capacity
planning;Google;Joints;Load modeling;Measurement;Memory
management;Resource management;Cloud Computing;Grid
Computing;Load Characterization},
doi = {10.1109/CLUSTER.2012.35},
privatenote = {An earlier version is available at
\url{http://hal.archives-ouvertes.fr/hal-00705858}. It used
to be included here as clusterdata:Di2012.},
}
################ 2010
@Article{clusterdata:Mishra2010,
author = {Mishra, Asit K. and Hellerstein, Joseph L. and Cirne,
Walfredo and Das, Chita R.},
title = {Towards characterizing cloud backend workloads: insights
from {Google} compute clusters},
journal = {SIGMETRICS Perform. Eval. Rev.},
volume = {37},
number = {4},
month = Mar,
year = 2010,
issn = {0163-5999},
pages = {34--41},
numpages = {8},
url = {http://doi.acm.org/10.1145/1773394.1773400},
doi = {10.1145/1773394.1773400},
publisher = {ACM},
abstract = {The advent of cloud computing promises highly available,
efficient, and flexible computing services for applications such as web
search, email, voice over IP, and web search alerts. Our experience at
Google is that realizing the promises of cloud computing requires an
extremely scalable backend consisting of many large compute clusters
that are shared by application tasks with diverse service level
requirements for throughput, latency, and jitter. These considerations
impact (a) capacity planning to determine which machine resources must
grow and by how much and (b) task scheduling to achieve high machine
utilization and to meet service level objectives.
Both capacity planning and task scheduling require a good understanding
of task resource consumption (e.g., CPU and memory usage). This in turn
demands simple and accurate approaches to workload
classification---determining how to form groups of tasks (workloads) with
similar resource demands. One approach to workload classification is to
make each task its own workload. However, this approach scales poorly
since tens of thousands of tasks execute daily on Google compute
clusters. Another approach to workload classification is to view all
tasks as belonging to a single workload. Unfortunately, applying such a
coarse-grain workload classification to the diversity of tasks running
on Google compute clusters results in large variances in predicted
resource consumptions.
This paper describes an approach to workload classification and its
application to the Google Cloud Backend, arguably the largest cloud
backend on the planet. Our methodology for workload classification
consists of: (1) identifying the workload dimensions; (2) constructing
task classes using an off-the-shelf algorithm such as k-means; (3)
determining the break points for qualitative coordinates within the
workload dimensions; and (4) merging adjacent task classes to reduce the
number of workloads. We use the foregoing, especially the notion of
qualitative coordinates, to glean several insights about the Google
Cloud Backend: (a) the duration of task executions is bimodal in that
tasks either have a short duration or a long duration; (b) most tasks
have short durations; and (c) most resources are consumed by a few tasks
with long duration that have large demands for CPU and memory.},
}
################################################################
# Trace-usage papers
################################################################
These entries are for papers that primarily focus on some other topic, but
use the traces as inputs, e.g., in simulations or load predictions.
Order: most recent first.
################ 2020
@INPROCEEDINGS{clusterdata:Lin2020,
title = {Using {GANs} for Sharing Networked Time Series Data: Challenges,
Initial Promise, and Open Questions},
author = {Lin, Zinan and Jain, Alankar and Wang, Chen and Fanti,
Giulia and Sekar, Vyas},
year = {2020},
isbn = {9781450381383},
publisher = {Association for Computing Machinery},
url = {https://doi.org/10.1145/3419394.3423643},
doi = {10.1145/3419394.3423643},
abstract = {Limited data access is a longstanding barrier to data-driven
research and development in the networked systems community. In this work,
we explore if and how generative adversarial networks (GANs) can be used to
incentivize data sharing by enabling a generic framework for sharing
synthetic datasets with minimal expert knowledge. As a specific target, our
focus in this paper is on time series datasets with metadata (e.g., packet
loss rate measurements with corresponding ISPs). We identify key challenges
of existing GAN approaches for such workloads with respect to fidelity
(e.g., long-term dependencies, complex multidimensional relationships, mode
collapse) and privacy (i.e., existing guarantees are poorly understood and
can sacrifice fidelity). To improve fidelity, we design a custom workflow
called DoppelGANger (DG) and demonstrate that across diverse real-world
datasets (e.g., bandwidth measurements, cluster requests, web sessions) and
use cases (e.g., structural characterization, predictive modeling, algorithm
comparison), DG achieves up to 43\% better fidelity than baseline models.
Although we do not resolve the privacy problem in this work, we identify
fundamental challenges with both classical notions of privacy and recent
advances to improve the privacy properties of GANs, and suggest a potential
roadmap for addressing these challenges. By shedding light on the promise
and challenges, we hope our work can rekindle the conversation on workflows
for data sharing.},
booktitle = {Proceedings of the ACM Internet Measurement Conference (IMC
2020)},
pages = {464--483},
numpages = {20},
keywords = {privacy, synthetic data generation, time series,
generative adversarial networks},
}
@article{clusterdata:Aydin2020,
title = {Multi-objective temporal bin packing problem: an application in cloud computing},
journal = {Computers \& Operations Research},
volume = 121,
pages = {104959},
year = 2020,
month = Sep,
issn = {0305-0548},
doi = {10.1016/j.cor.2020.104959},
url = {http://www.sciencedirect.com/science/article/pii/S0305054820300769},
author = {Nurşen Aydin and Ibrahim Muter and Ş. Ilker Birbil},
keywords = {Bin packing, Cloud computing, Heuristics, Exact methods, Column generation},
abstract = {Improving energy efficiency and lowering operational
costs are the main challenges faced in systems with multiple
servers. One prevalent objective in such systems is to
minimize the number of servers required to process a given set
of tasks under server capacity constraints. This objective
leads to the well-known bin packing problem. In this study, we
consider a generalization of this problem with a time
dimension, where the tasks are to be performed with predefined
start and end times. This new dimension brings about new
performance considerations, one of which is the uninterrupted
utilization of servers. This study is motivated by the problem
of energy efficient assignment of virtual machines to physical
servers in a cloud computing service. We address the virtual
machine placement problem and present a binary integer
programming model to develop different assignment policies. By
analyzing the structural properties of the problem, we propose
an efficient heuristic method based on solving smaller
versions of the original problem iteratively. Moreover, we
design a column generation algorithm that yields a lower bound
on the objective value, which can be utilized to evaluate the
performance of the heuristic algorithm. Our numerical study
indicates that the proposed heuristic is capable of solving
large-scale instances in a short time with small optimality
gaps.},
}
@article{clusterdata:Milocco2020,
title = {Evaluating the Upper Bound of Energy Cost Saving by Proactive Data Center Management},
journal = {IEEE Transactions on Network and Service Management},
year = 2020,
issn = {1932-4537},
doi = {10.1109/TNSM.2020.2988346},
url = {https://ieeexplore.ieee.org/abstract/document/9069318},
author = {Ruben Milocco and Pascale Minet and Éric Renault and Selma Boumerdassi},
keywords = {Data center management, Proactive management, Machine Learning, Prediction, Energy cost},
abstract = {
Data Centers (DCs) need to periodically configure their servers in order to meet user demands.
Since appropriate proactive management to meet demands reduces the cost, either by improving Quality of
Service (QoS) or saving energy, there is a great interest in studying different proactive strategies
based on predictions of the energy used to serve CPU and memory requests. The amount of savings that can
be achieved depends not only on the selected proactive strategy but also on user-demand statistics and the
predictors used. Despite its importance, it is difficult to find theoretical studies that quantify the
savings that can be made, due to the problem complexity. A proactive DC management strategy is presented
together with its upper bound of energy cost savings obtained with respect to a purely reactive management.
Using this method together with records of the recent past, it is possible to quantify the efficiency of
different predictors. Both linear and nonlinear predictors are studied, using a Google data set collected
over 29 days, to evaluate the benefits that can be obtained with these two predictors.},
}
################ 2018
@article{clusterdata:Sliwko2018,
author = {Sliwko, Leszek},
title = {A Scalable Service Allocation Negotiation For Cloud Computing},
journal = {Journal of Theoretical and Applied Information Technology},
volume = 96,
number = 20,
month = Oct,
year = 2018,
issn = {1817-3195},
pages = {6751--6782},
numpages = {32},
keywords = {distributed scheduling, agents, load balancing, MASB},
abstract={This paper presents a detailed design of a decentralised agent-based
scheduler, which can be used to manage workloads within the computing cells
of a Cloud system. This scheme is based on the concept of service allocation
negotiation, whereby all system nodes communicate between themselves and
scheduling logic is decentralised. The architecture presented has been
implemented, with multiple simulations run using real-world workload traces from
the Google Cluster Data project. The results were then compared to the
scheduling patterns of Google's Borg system.}
}
@INPROCEEDINGS{clusterdata:Liu2018gh,
author = {Liu, Jinwei and Shen, Haiying and Sarker, Ankur and Chung, Wingyan},
title = {Leveraging Dependency in Scheduling and Preemption for High Throughput in Data-Parallel Clusters},
booktitle = {2018 IEEE International Conference on Cluster Computing (CLUSTER)},
year = {2018},
month = Sep,
pages = {359--369},
publisher = {IEEE},
abstract = {Task scheduling and preemption are two important functions in
data-parallel clusters. Though directed acyclic graph task dependencies
are common in data-parallel clusters, previous task scheduling and
preemption methods do not fully utilize such task dependency to increase
throughput since they simply schedule precedent tasks prior to their
dependent tasks or neglect the dependency. We notice that in both
scheduling and preemption, choosing a task with more dependent tasks to
run allows more tasks to be runnable next, which facilitates selecting a
task that can further increase throughput. Accordingly, in this paper, we
propose a Dependency-aware Scheduling and Preemption system (DSP) to
achieve high throughput. First, we build an integer linear programming
model to minimize the makespan (i.e., the time when all jobs finish
execution) with the consideration of task dependency and deadline, and
derive the target server and start time for each task, which can
minimize the makespan. Second, we utilize task dependency to determine
tasks' priorities for preemption. Finally, we propose a method to reduce
the number of unnecessary preemptions that cause more overhead than the
throughput gain. Extensive experimental results based on a real cluster
and Amazon EC2 cloud service show that DSP achieves much higher
throughput compared to existing strategies.},
doi = {10.1109/CLUSTER.2018.00054},
}
@inproceedings{clusterdata:Minet2018j,
author = {Pascale Minet and Éric Renault and Ines Khoufi and Selma Boumerdassi},
title = {Analyzing Traces from a {Google} Data Center},
booktitle = {14th International Wireless Communications and Mobile Computing Conference (IWCMC 2018)},
year = 2018,
month = Jun,
publisher = {IEEE},
address = {Limassol, Cyprus},
pages = {1167--1172},
url = {https://doi.org/10.1109/IWCMC.2018.8450304},
doi = {10.1109/IWCMC.2018.8450304},
abstract = {
Traces collected from an operational Google data center over 29 days represent a very rich
and useful source of information for understanding the main features of a data center. In this
paper, we characterize the strong heterogeneity of jobs and the medium heterogeneity of machine
configurations. We analyze the off-periods of machines. We study the distribution of jobs per
category, per scheduling class, per priority and per number of tasks. The distribution of job
execution durations shows a high disparity, as does the job waiting time before being scheduled.
The resource requests in terms of CPU and memory are also analyzed. The distribution of these
parameter values is very useful to develop accurate models and algorithms for resource allocation
in data centers.},
keywords = {Data analysis, data center, big data application, resource allocation, scheduling},
}
@inproceedings{clusterdata:Minet2018m,
author = {Pascale Minet and Éric Renault and Ines Khoufi and Selma Boumerdassi},
title = {Data Analysis of a {Google} Data Center},
booktitle = {18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2018)},
year = 2018,
month = May,
publisher = {IEEE},
address = {Washington DC, USA},
pages = {342--343},
url = {https://doi.org/10.1109/CCGRID.2018.00049},
doi = {10.1109/CCGRID.2018.00049},
abstract = {
Data collected from an operational Google data center during 29 days represent a very rich
and very useful source of information for understanding the main features of a data center.
In this paper, we highlight the strong heterogeneity of jobs. The distribution of job execution
duration shows a high disparity, as well as the job waiting time before being scheduled. The
resource requests in terms of CPU and memory are also analyzed. The knowledge of all these features
is needed to design models of jobs, machines and resource requests that are representative of a
real data center.},
}
@ARTICLE{clusterdata:Sebastio2018b,
author = {Stefano Sebastio and Rahul Ghosh and Tridib Mukherjee},
journal = {IEEE Transactions on Services Computing},
title = {An availability analysis approach for deployment configurations of containers},
year = {2018},
month = Jan,
abstract = {Operating system (OS) containers enabling the microservice-oriented
architecture are becoming popular in the context of Cloud
services. Containers provide the ability to create lightweight and
portable runtime environments decoupling the application requirements
from the characteristics of the underlying system. Services built on
containers have a small resource footprint in terms of processing,
storage, memory and network, allowing a denser deployment
environment. While the performance of such containers is addressed in
few previous studies, understanding the failure-repair behavior of the
containers remains unexplored. In this paper, from an availability point
of view, we propose and compare different configuration models for
deploying a containerized software system. Inspired by Google
Kubernetes, a container management system, these configurations are
characterized with a failure response and migration service. We develop
novel non-state-space and state-space analytic models for container
availability analysis. Analytical as well as simulative solutions are
obtained for the developed models. Our analysis provides insights on
k-out-of-N availability and sensitivity of system availability for key
system parameters. Finally, we build an open-source software tool
powered by these models. The tool helps Cloud administrators to assess
the availability of containerized systems and to conduct a what-if
analysis based on user-provided parameters and configurations.},
keywords = {Containers;Analytical models;Cloud computing;Stochastic
processes;Tools;Computer architecture;Google;container;system
availability;virtual machine;cloud computing;analytic model;stochastic
reward net},
doi = {10.1109/TSC.2017.2788442},
ISSN = {1939-1374},
}
@article{clusterdata:Sebastio2018c,
author = {Sebastio, Stefano and Amoretti, Michele and Lafuente, Alberto Lluch and Scala, Antonio},
title = {A Holistic Approach for Collaborative Workload Execution in Volunteer Clouds},
journal = {ACM Transactions on Modeling and Computer Simulation (TOMACS)},
volume = 28,
number = 2,
month = Mar,
year = 2018,
issn = {1049-3301},
pages = {14:1--14:27},
articleno = {14},
numpages = {27},
url = {http://doi.acm.org/10.1145/3155336},
doi = {10.1145/3155336},
acmid = {3155336},
publisher = {ACM},
keywords = {Collective adaptive systems, ant colony optimization (ACO),
autonomic computing, cloud computing, collaborative computing,
computational fields, multiagent optimization, peer-to-peer (P2P), task
scheduling},
abstract={The demand for provisioning, using, and maintaining distributed
computational resources is growing hand in hand with the quest for
ubiquitous services. Centralized infrastructures such as cloud computing
systems provide suitable solutions for many applications, but their
scalability could be limited in some scenarios, such as in the case of
latency-dependent applications. The volunteer cloud paradigm aims at
overcoming this limitation by encouraging clients to offer their own
spare, perhaps unused, computational resources. Volunteer clouds are
thus complex, large-scale, dynamic systems that demand for self-adaptive
capabilities to offer effective services, as well as modeling and
analysis techniques to predict their behavior. In this article, we
propose a novel holistic approach for volunteer clouds supporting
collaborative task execution services able to improve the quality of
service of compute-intensive workloads. We instantiate our approach by
extending a recently proposed ant colony optimization algorithm for
distributed task execution with a workload-based partitioning of the
overlay network of the volunteer cloud. Finally, we evaluate our
approach using simulation-based statistical analysis techniques on a
workload benchmark provided by Google. Our results show that the
proposed approach outperforms some traditional distributed task
scheduling algorithms in the presence of compute-intensive workloads.}
}
@Article{clusterdata:Sebastio2018d,
author = {Stefano Sebastio and Giorgio Gnecco},
title = {A green policy to schedule tasks in a distributed cloud},
journal = {Optimization Letters},
year = 2018,
month = Oct,
day = 01,
volume = 12,
number = 7,
pages = {1535--1551},
abstract = {In the last years, demand and availability of computational
capabilities experienced radical changes. Desktops and laptops increased
their processing resources, exceeding users' demand for large part of
the day. On the other hand, computational methods are more and more
frequently adopted by scientific communities, which often experience
difficulties in obtaining access to the required
resources. Consequently, data centers for outsourcing use, relying on
the cloud computing paradigm, are proliferating. Notwithstanding the
effort to build energy-efficient data centers, their energy footprint is
still considerable, since cooling a large number of machines situated in
the same room or container requires a significant amount of power. The
volunteer cloud, exploiting the users' willingness to share a quote of
their underused machine resources, can constitute an effective solution
to have the required computational resources when needed. In this paper,
we foster the adoption of the volunteer cloud computing as a green
(i.e., energy efficient) solution even able to outperform existing data
centers in specific tasks. To manage the complexity of such a large
scale heterogeneous system, we propose a distributed optimization policy
to task scheduling with the aim of reducing the overall energy
consumption executing a given workload. To this end, we consider an
integer programming problem relying on the Alternating Direction Method
of Multipliers (ADMM) for its solution. Our approach is compared with a
centralized one and other non-green targeting solutions. Results show
that the distributed solution found by the ADMM constitutes a good
suboptimal solution, worth to be applied in a real environment.},
issn = {1862-4480},
doi = {10.1007/s11590-017-1208-8},
url = {https://doi.org/10.1007/s11590-017-1208-8}
}
################ 2017
@Article{clusterdata:Carvalho2017b,
author = {Marcus Carvalho and Daniel A. Menasc\'{e} and Francisco Brasileiro},
title = {Capacity planning for {IaaS} cloud providers offering multiple
service classes},
journal = {Future Generation Computer Systems},
volume = {77},
pages = {97--111},
month = Dec,
year = 2017,
abstract = {Infrastructure as a Service (IaaS) cloud providers typically offer
multiple service classes to satisfy users with different requirements
and budgets. Cloud providers are faced with the challenge of estimating
the minimum resource capacity required to meet Service Level Objectives
(SLOs) defined for all service classes. This paper proposes a capacity
planning method that is combined with an admission control mechanism to
address this challenge. The capacity planning method uses analytical
models to estimate the output of a quota-based admission control
mechanism and find the minimum capacity required to meet availability
SLOs and admission rate targets for all classes. An evaluation using
trace-driven simulations shows that our method estimates the best cloud
capacity with a mean relative error of 2.5\% with respect to the
simulation, compared to a 36\% relative error achieved by a single-class
baseline method that does not consider admission control
mechanisms. Moreover, our method exhibited a high SLO fulfillment for
both availability and admission rates, and obtained mean CPU utilization
over 91\%, while the single-class baseline method had values not greater
than 78\%.},
url = {http://www.sciencedirect.com/science/article/pii/S0167739X16308561},
doi = {10.1016/j.future.2017.07.019},
issn = {0167-739X},
}
@inproceedings{clusterdata:Janus2017,
author = {Pawel Janus and Krzysztof Rzadca},
title = {{SLO}-aware Colocation of Data Center Tasks Based on Instantaneous Processor Requirements},
booktitle = {ACM Symposium on Cloud Computing (SoCC)},
year = 2017,
month = Sep,
pages = {256--268},
address = {Santa Clara, CA, USA},
publisher = {ACM},
abstract = {In a cloud data center, a single physical machine simultaneously
executes dozens of highly heterogeneous tasks. Such colocation results
in more efficient utilization of machines, but, when tasks' requirements
exceed available resources, some of the tasks might be throttled down or
preempted. We analyze version 2.1 of the Google cluster trace that
shows short-term (1 second) task CPU usage. Contrary to the assumptions
taken by many theoretical studies, we demonstrate that the empirical
distributions do not follow any single distribution. However, high
percentiles of the total processor usage (summed over at least 10 tasks)
can be reasonably estimated by the Gaussian distribution. We use this
result for a probabilistic fit test, called the Gaussian Percentile
Approximation (GPA), for standard bin-packing algorithms. To check
whether a new task will fit into a machine, GPA checks whether the
resulting distribution's percentile corresponding to the requested
service level objective (SLO) is still below the machine's capacity. In
our simulation experiments, GPA resulted in colocations exceeding the