1.3.3 - 2015-11-12
1. Added support for Apache Tajo via pull-request:
https://github.com/GoogleCloudPlatform/bdutil/pull/67
2. The YARN capacity scheduler can now be configured when deploying a
cluster: https://github.com/GoogleCloudPlatform/bdutil/pull/64
1.3.2 - 2015-09-15
1. Updated Spark configurations to make Cloud Bigtable work with Spark.
2. Added wrappers bigtable-spark-shell and bigtable-spark-submit for use
   with the bigtable plugin; they are only installed if bigtable_env.sh is used.
3. Updated default Hadoop 2 version to 2.7.1.
4. Added support for Apache Hama.
5. Added support for setting up a standalone NFS cache server for GCS
   consistency using standalone_nfs_cache_env.sh, along with configurable
   GCS_CACHE_MASTER_HOSTNAME to point subsequent clusters at the shared
   NFS cache server. See standalone_nfs_cache_env.sh for usage and the
   example command after these notes.
6. Added an explicit check for the import ordering of spark_env.sh relative
   to bigtable_env.sh; spark_env.sh must come before bigtable_env.sh.
7. Fixed spelling of "amount" in some documentation.
8. Fixed resolution of the bdutil directory when bdutil is invoked via symlinks.
9. Added Dockerfile for bdutil.
10. Updated default Spark version to 1.5.0; for Spark 1.5.0+, core-site.xml
will also set 'fs.gs.reported.permissions' to 733; otherwise Hive 1.2.1
will error out when using SparkSQL. Hadoop MapReduce prints a harmless
warning in this case but otherwise works fine. Additionally, the Spark
auto-restart configuration now contains logic to use the correct syntax
for start-slave.sh depending on whether the version is Spark 1.4+, and
Spark auto-restarted daemons now correctly run under user 'hadoop'.
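An illustrative example of the standalone NFS cache server described above
(bucket and project names are placeholders): the shared cache server might be
deployed once with
./bdutil -e standalone_nfs_cache_env.sh -b my-bucket -p my-project deploy
and subsequent clusters pointed at it by setting GCS_CACHE_MASTER_HOSTNAME in
their env files before deploying.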
1.3.1 - 2015-07-09
1. Added plugin for deploying MapR under platforms/mapr/mapr_env.sh; see
   platforms/mapr/README.md for details and the example command after
   these notes.
2. Changed mapreduce.fileoutputcommitter.algorithm.version to "2"; this should
only have an effect when running with Hadoop 2.7+, where it significantly
speeds up job-commit time when using the GCS connector.
See https://issues.apache.org/jira/browse/MAPREDUCE-4815 for more details.
3. Added an option ENABLE_STORM_BIGTABLE to extensions/storm/storm_env.sh to
set up using Google Cloud Bigtable from Apache Storm.
4. Updated Flink version to 0.9.0.
5. Switched from using SPARK_CLASSPATH to using SPARK_DIST_CLASSPATH pointed
at the Hadoop classpath to inherit gcs-connector and other Hadoop libraries
on the default Spark classpath. This gets rid of a warning message about
SPARK_CLASSPATH deprecation when running Spark, and improves access to
related Hadoop libraries from Spark jobs.
6. Fixed reboot recovery for single-node clusters; this includes the ability
for single-node clusters to recover from issuing "Stop" and then "Start"
commands via the GCE API.
7. Added explicit value for mapreduce.job.working.dir in Ambari config; this
works around a bug in PigInputFormat where an exception is thrown with
"Wrong FS scheme" when the default filesystem doesn't have the same scheme
as the filesystem of the input file(s) (e.g. when reading GCS files and
the default FS is HDFS). Pig reading from GCS should now work in ambari
bdutil deployments.
8. Fixed a bug where Hive deployed under ambari_env.sh is unable to
LOAD DATA INPATH 'gs://<...>' due to Hive server needing to be restarted
after GCS connector installation to pick it up on its classpath.
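An illustrative example of deploying the MapR plugin mentioned above (bucket
and project names are placeholders):
./bdutil -e platforms/mapr/mapr_env.sh -b my-bucket -p my-project deploy
See platforms/mapr/README.md for the options it actually supports.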
1.3.0 - 2015-05-27
1. Upgraded default Hadoop 2 version to 2.6.0.
2. Added support for making a portion of worker VMs run as "preemptible vms"
   by setting --preemptible or PREEMPTIBLE_FRACTION to a value between
   0.0 and 1.0 to specify the fraction of workers to run as preemptible.
   See the example command after these notes.
3. Added support for deploying onto ubuntu-12-04 or ubuntu-14-04 images.
4. Added support for specifying boot disk sizes via --master_boot_disk_size_gb
and --worker_boot_disk_size_gb or MASTER_BOOT_DISK_SIZE_GB and
WORKER_BOOT_DISK_SIZE_GB; uses default derived from base image if unset.
5. Upgraded default Spark version to 1.3.1.
6. Removed datastore-connector installation options and samples; the connector
has been deprecated since February 17th, 2015. For alternatives see:
https://groups.google.com/forum/#!topic/gcp-hadoop-announce/D3_OZuqn4_o
7. Added workaround for a bug where ambari_env.sh and ambari_manual_env.sh
would fail to copy mapreduce.tar.gz, pig.tar.gz, etc., into
hdfs:///hdp/apps/... during setup. Ambari should now work out-of-the-box.
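An illustrative sketch combining the new flags above (all values are
placeholders, not recommendations):
./bdutil -b my-bucket -p my-project --preemptible 0.5 \
    --master_boot_disk_size_gb 100 --worker_boot_disk_size_gb 100 deploy
The same settings can instead be made in an env file via PREEMPTIBLE_FRACTION,
MASTER_BOOT_DISK_SIZE_GB, and WORKER_BOOT_DISK_SIZE_GB.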
1.2.1 - 2015-05-05
1. Installed the Java JDK with Spark; this allows spark-sql to run correctly
   out-of-the-box.
2. Added a new --master_machine_type/-M flag for setting a master machine
   type different from the worker machine type; see the example command
   after these notes.
3. Updated default Spark version to 1.3.0; SparkSQL scripts may need
modifications to use the new DataFrames; see Spark's migration guide:
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#migration-guide
4. Fixed CentOS 7 support.
5. Added basic support for using local SSDs via --master_local_ssd_count
and --worker_local_ssd_count.
6. Removed the default zone; bdutil now tries to get the default zone from
   gcloud, otherwise an explicit zone setting is required.
7. Fixed JobHistory permissions on HDFS.
8. Added new extension: extensions/bigtable/bigtable_env.sh for deploying
a cluster with the HBase-compatible connector for Google Cloud Bigtable
installed.
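An illustrative sketch of the new flags above (the machine type and SSD count
are placeholder values, not recommendations):
./bdutil -b my-bucket -p my-project -M n1-highmem-4 \
    --worker_local_ssd_count 1 deploy
i.e. a master machine type different from the workers, plus one local SSD
per worker.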
1.2.0 - 2015-02-26
1. Fixed reboots on CentOS 7.
2. Fixed Ambari-plugin support of reusing persistent disks across deployments.
3. Added support for Apache Flink.
4. Made all UPLOAD_FILES be relative to the bdutil directory.
5. Added basic HBase, CDH, and Storm plugins for bdutil.
./bdutil -e hbase
./bdutil -e cdh
./bdutil -e storm
6. Only symlink the GCS connector for Client nodes.
7. Fixed memory allocation for Spark executors running on YARN; created
   extension extensions/spark/spark_on_yarn_env.sh to support Spark
   on YARN without Spark daemons. The combination of spark_env.sh and
   hadoop2_env.sh allows the user to submit Spark jobs to either the
   Spark master or YARN.
8. Enabled restarting Spark processes on reboot.
9. Added support for the GCS connector with ambari_manual_env.sh. See
the "Can I deploy HDP manually using Ambari" section in
platforms/hdp/README.md
10. Added an experimental env file to enable cluster resizing.
See extensions/google/experimental/resize_env.sh.
11. Updated default Spark version to 1.2.1.
12. Updated Spark driver memory settings to scale with VM size.
13. Added import_env support to generate_config. For example:
"./bdutil -e spark -b my-bucket -p my-project generate_config my_env.sh"
makes my_env.sh contain "import_env /path/to/spark_env.sh".
14. Use default mount options to avoid SELinux issues on reboot.
1.1.0 - 2015-01-22
1. Added plugin for deploying Ambari/HDP with:
./bdutil -e platforms/hdp/ambari_env.sh deploy
2. Set dfs.replication to 2 under conf/hadoop*/hdfs-template.xml; this suits
PD deployments better than r=3, but if deploying with HDFS residing on
non-PD storage, the value should be reverted to 3.
3. Enabled Spark EventLog for Spark deployments, logging to
gs://${CONFIGBUCKET}/spark-eventlog-base/${MASTER_HOSTNAME}
4. Migrated off of misc deprecated fields in favor of using
spark-defaults.conf for Spark 1.0+; cleans up warnings on spark-submit.
5. Moved SPARK_LOG_DIR from default of ${SPARK_HOME}/logs into
/hadoop/spark/logs so that they reside on the large PD if it exists.
6. Upgraded default Spark version to 1.2.0.
7. Added bdutil_env option INSTALL_JDK_DEVEL to optionally install full JDK
with compiler/tools instead of just the minimal JRE; set to 'true' in
single_node_env.sh and ambari_env.sh.
8. Added python script to allocate memory more intelligently in Hadoop 2.
9. Upgraded default Hadoop 2 version to 2.5.2.
1.0.1 - 2014-12-16
1. Replaced usage of deprecated gcutil with gcloud compute.
2. Changed GCE_SERVICE_ACCOUNT_SCOPES from a comma-separated list to a bash
   array.
3. Fixed cleanup of pig-validate-setup.sh, hive-validate-setup.sh and
spark-validate-setup.sh.
4. Upgraded default Spark version to 1.1.1.
5. The default zone for instances is now us-central1-a.
0.36.4 - 2014-10-17
1. Added bdutil flags --worker_attached_pds_size_gb and
--master_attached_pd_size_gb corresponding to the bdutil_env variables of
the same names.
2. Added bdutil_env.sh variables and corresponding flags:
   --worker_attached_pds_type and --master_attached_pd_type to specify
   the type of PD to create, 'pd-standard' or 'pd-ssd'. Default: pd-standard.
   See the example command after these notes.
3. Fixed a bug where we forgot to actually add
extensions/querytools/setup_profiles.sh to the COMMAND_GROUPS under
extensions/querytools/querytools_env.sh; now it's actually possible to
run 'pig' or 'hive' directly with querytools_env.sh installed.
4. Fixed a bug affecting Hadoop 1.2.1 HDFS persistence across deployments
where dfs.data.dir directories inadvertently had their permissions
modified to 775 from the correct 755, and thus caused datanodes to fail to
recover the data. Only applies in the use case of setting:
CREATE_ATTACHED_PDS_ON_DEPLOY=false
DELETE_ATTACHED_PDS_ON_DELETE=false
after an initial deployment to persist HDFS across a delete/deploy command.
The explicit directory configuration is now set in bdutil_env.sh with
the variable HDFS_DATA_DIRS_PERM, which is in turn wired into
hdfs-site.xml.
5. Added mounted disks to /etc/fstab to re-mount them on boot.
6. bdutil now uses a search path mechanism to look for env files to reduce
the amount of typing necessary to specify env files. For each argument
to the -e (or --env_var_files) command line option, if the argument
specifies just a base filename without a directory, bdutil will use
the first file of that name that it finds in the following directories:
1. The current working directory (.).
2. Directories specified as a colon-separated list of directories in
the environment variable BDUTIL_EXTENSIONS_PATH.
3. The bdutil directory (where the bdutil script is located).
4. Each of the extensions directories within the bdutil directory.
If the base filename is not found, it will try appending "_env.sh" to
the filename and look again in the same set of directories.
This change allows the following:
1. You can specify standard extensions succinctly, such as
"-e spark" for the spark extension, or "-e hadoop2" to use Hadoop 2.
2. You can put the bdutil directory in your PATH and run bdutil
from anywhere, and it will still find all its own files.
3. You can run bdutil from a directory containing your custom env
files and use filename completion to add them to a bdutil command.
4. You can collect your custom env files into one directory, set
BDUTIL_EXTENSIONS_PATH to point to that directory, run bdutil
from anywhere, and specify your custom env files by name only.
7. Added new boolean setting to bdutil_env.sh, ENABLE_NFS_GCS_FILE_CACHE,
which defaults to 'true'. When true, the GCS connector will be configured
to use its new "FILESYSTEM_BACKED" DirectoryListCache for immediate
cluster-wide list consistency, allowing multi-stage pipelines in e.g. Pig
and Hive to safely operate with DEFAULT_FS=gs. With this setting, bdutil
will install and configure an NFS export point on the master node, to
be mounted as the shared metadata cache directory for all cluster nodes.
8. Fixed a bug where the datastore-to-bigquery sample neglected to set a
'filter' in its query based on its ancestor entities.
9. YARN local directories are now set to spread IO across all directories
under /mnt.
10. YARN container logs will be written to /hadoop/logs/.
11. The Hadoop 2 MR Job History Server will now be started on the master node.
12. Added /etc/init.d entries for Hadoop daemons to restart them after
VM restarts.
13. Moved "hadoop fs -test" of gcs-connector to end of Hadoop setup, after
starting Hadoop daemons.
14. The spark_env.sh extension will now install numpy.
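An illustrative example combining the attached-PD flags above with the
-d/--use_attached_pds flag described under 0.34.1 (sizes are placeholder
values):
./bdutil -b my-bucket -p my-project -d \
    --worker_attached_pds_size_gb 500 --master_attached_pd_size_gb 500 \
    --worker_attached_pds_type pd-ssd --master_attached_pd_type pd-ssd deploy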
0.35.2 - 2014-09-18
1. When installing Hadoop 1 and 2, snappy will now be installed and symbolic
links will be created from the /usr/lib or /usr/lib64 tree to the Hadoop
native library directory.
2. When installing Hadoop 2, bdutil will attempt to download and install
precompiled native libraries for the installed version of Hadoop.
3. Modified default hadoop-validate-setup.sh to use 10MB of random data
instead of the old 1MB, otherwise it doesn't work for larger clusters.
4. Added a health check script in Hadoop 1 to check if Jetty failed to load
for the TaskTracker as in [MAPREDUCE-4668].
5. Added ServerAliveInterval and ServerAliveCountMax SSH options to SSH
invocations to detect dropped connections.
6. Pig and Hive installation (extensions/querytools/querytools_env.sh) now
sets DEFAULT_FS='hdfs'; reading from GCS using explicit gs:// URIs will
still work normally, but intermediate data for multi-stage pipelines will
now reside on HDFS. This is because Pig and Hive more commonly rely on
immediate "list consistency" across clients, and thus are more susceptible
to GCS "eventual list consistency" semantics even if the majority case
works fine.
7. Changed occurrences of 'hdpuser' to 'hadoop' in querytools_env.sh, such
that Pig and Hive will be installed under /home/hadoop instead of
/home/hdpuser, and the files will be owned by 'hadoop' instead of
'hdpuser'; this is more consistent with how other extensions have been
handled.
8. Modified extensions/querytools/querytools_env.sh to additionally insert
the Pig and Hive 'bin' directories into the PATH environment variable
for all users, such that SSH'ing into the master provides immediate
access to launching 'pig' or 'hive' without requiring
"sudo sudo -i -u hdpuser"; removed 'chmod 600 hive-site.xml' so that any
user can successfully run 'hive' directly.
9. Added extensions/querytools/{hive, pig}-validate-setup.sh which can be
used as a quick test of Pig/Hive functionality:
./bdutil shell < extensions/querytools/pig-validate-setup.sh
10. Updated extensions/spark/spark_env.sh to now use spark-1.1.0 by default.
11. Added new BigQuery connector sample under bdutil-0.35.2/samples as file
streaming_word_count.sh which demonstrates using the new support for
the older "hadoop.mapred.*" interfaces via hadoop-streaming.jar.
0.35.1 - 2014-08-07
1. Added a boolean bdutil option DEBUG_MODE with corresponding flags
-D/--debug which turns on high-verbosity modes for gcutil and gsutil
calls during the deployment, including on the remote VMs.
2. Added the ability for the Google connectors, bdconfig, and Hadoop
distributions to be stored and fetched from gs:// style URLs in addition
to http:// URLs.
3. In VERBOSE_MODE, on failure the detailed debuginfo.txt is now also printed
to the console in addition to being available in the /tmp directory.
4. Moved all configuration code into conf/.
5. Changed the default PREFIX to 'hadoop' instead of 'hs-ghfs', and the
naming convention for masters/workers to follow $PREFIX-m and $PREFIX-w-$i
instead of $PREFIX-nn and $PREFIX-dn-$i. IMPORTANT: This breaks
compatibility with existing clusters deployed with bdutil 0.34.x and older,
but there is a new flag "--old_hostname_suffixes" to continue using the old
-nn/-dn naming convention. For example, to turn
down an old cluster if you've been using the default prefix:
./bdutil --prefix=hs-ghfs --old_hostname_suffixes delete
6. Fixed a bug in VM environments where run_command could not find commands
such as 'hadoop' in their PATH.
7. Updated BigQuery / Datastore sample scripts to be run with
   "./bdutil run_command" rather than locally.
8. Added a test to guarantee VMs had no more than 64 characters in their fully
qualified domain names.
9. Added the import_env helper to allow "_env.sh" files to depend on each
other.
10. Renamed spark1_env.sh to spark_env.sh.
11. Added a gsutil update check upon first entering a VM.
0.34.4 - 2014-06-23
1. Switched default gcs-connector version to 1.2.7 for patch fixing a bug
where globs wrongly reported "not found" in some cases in Hadoop 2.2.0.
0.34.3 - 2014-06-13
1. Jobtracker / Resource manager recovery has been enabled by default to
preserve job queues if the daemon dies.
2. Fixed single_node_env.sh to work with hadoop2_env.sh.
3. Two new commands were added to bdutil: socksproxy and shell; socksproxy
   will establish a SOCKS proxy to the cluster and shell will start an SSH
   session to the namenode. See the example commands after these notes.
4. A new variable, GCE_NETWORK, was added to bdutil_env.sh and can be set
from the command line via the --network flag when deploying a cluster or
generating a configuration file. The network specified by GCE_NETWORK
must exist and must allow SSH connections from the host running bdutil
and must allow intra-cluster communication.
5. Increased configured heap sizes of the master daemons (JobTracker,
NameNode, SecondaryNameNode, and ResourceManager).
6. The HADOOP_LOG_DIR is now /hadoop/logs instead of the default
/home/hadoop/hadoop-install/logs; if using attached PDs for larger disk
storage, this directory resides on that attached PD rather than the
boot volume, so that Hadoop logs will no longer fill up the boot disk.
7. Added new extensions under bdutil-<version>/extensions/spark; includes
spark_shark_env.sh and spark1_env.sh, both compatible for mixing with
Hadoop 2 as well. For now, neither uses Mesos or YARN, but both are
suitable for single-user or Spark-only setups. The spark_shark_env.sh
extension installs Spark + Shark 0.9.1, while spark1_env.sh only installs
Spark 1.0.0, in which case Spark SQL serves as the alternative to Shark.
8. Cleaned up the updating of login scripts and $PATHs. Added a safety check
   around sourcing of hadoop-config.sh, because it can kill shells by calling
   exit in Hadoop 2.
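Illustrative usage of the new commands and the --network flag above (bucket,
project, and network names are placeholders):
./bdutil -b my-bucket -p my-project --network my-network deploy
./bdutil socksproxy     # establish a SOCKS proxy to the cluster
./bdutil shell          # start an SSH session to the namenode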
0.34.2 - 2014-06-05
1. When using Hadoop 2 / YARN, and the default filesystem is set to 'gs', YARN
log aggregation will be enabled and YARN application logs, including
map-reduce task logs will be persisted to gs://<CONFIGBUCKET>/yarn-logs/.
0.34.1 - 2014-05-12
1. Fixed a bug in the USE_ATTACHED_PDS feature (also enabled with
-d/--use_attached_pds) where disks didn't get attached properly.
0.34.0 - 2014-05-08
1. Changed sample applications and tools to use GenericOptionsParser instead
of creating a new Configuration object directly.
2. Added printout of bdutil version number alongside "usage" message.
3. Added sleeps between async invocations of GCE API calls during deployment,
configurable with: GCUTIL_SLEEP_TIME_BETWEEN_ASYNC_CALLS_SECONDS
4. Added tee'ing of client-side console output into debuginfo.txt with better
delineation of where the error is likely to have occurred.
5. Just for extensions/querytools/querytools_env.sh, added an explicit
mapred.working.dir to fix a bug where PigInputFormat crashes whenever the
default FileSystem is different from the input FileSystem. This fix allows
using GCS input paths in Pig with DEFAULT_FS='hdfs'.
6. Added a retry-loop around "apt-get -y -qq update" since it may flake under
high load.
7. Significantly refactored bdutil into better-isolated helper functions, and
   added basic support for command-line flags and several new commands. The
   old command "./bdutil env1.sh env2.sh" is now "./bdutil -e env1.sh,env2.sh".
   Type ./bdutil --help for an overview of all the new functionality.
8. Added better checking of env and upload files before starting deployment.
9. Reorganized bdutil_env.sh into logical sections with better descriptions.
10. Significantly reduced amount of console output; printed dots indicate
progress of async subprocesses. Controllable with VERBOSE_MODE or '-v'.
11. Script and file dependencies are now staged through GCS rather than using
gcutil push; drastically decreases bandwidth and improves scalability.
12. Added MAX_CONCURRENT_ASYNC_PROCESSES to split the async loops into
    multiple smaller batches, to avoid OOMing.
13. Made delete_cluster continue on error, still reporting a warning at the
end if errors were encountered. This way, previously-failed cluster
creations or deletions with partial resources still present can be
cleaned up by retrying the "delete" command.
0.33.1 - 2014-04-09
1. Added deployment scripts for the BigQuery and Datastore connectors.
2. Added sample jarfiles for the BigQuery and Datastore connectors under
a new /samples/ subdirectory along with scripts for running the samples.
3. Set the default image type to backports-debian-7 for improved networking.
0.33.0 - 2014-03-21
1. Renamed 'ghadoop' to 'bdutil' and ghadoop_env.sh to bdutil_env.sh.
2. Bundled a collection of *-site.xml.template files in conf/ subdirectory
which are integrated into the hadoop conf/ files in the remote scripts.
3. Switched core-site template to new 'fs.gs.auth.*' syntax for
enabling service-account auth.
0.32.0 - 2014-02-12
1. ghadoop now always includes ghadoop_env.sh; only the overrides file needs
to be specified, e.g. ghadoop deploy single_node_env.sh.
2. Files in COMMAND_GROUPS are now relative to the directory in which ghadoop
resides, rather than having to be inside libexec/. Absolute paths are
also supported now.
3. Added UPLOAD_FILES to ghadoop_env.sh which ghadoop will use to upload
a list of relative or absolute file paths to every VM before starting
execution of COMMAND_STEPS.
4. Include full Hive and Pig sampleapp from Cloud Solutions with ghadoop;
added extensions/querytools/querytools_env.sh to auto-install Hive and
Pig as part of deployment. Usage:
./ghadoop deploy extensions/querytools/querytools_env.sh
0.31.1 - 2014-01-23
1. Added CHANGES.txt for release notes.
2. Switched from /hadoop/temp to /hadoop/tmp.
3. Added support for running ghadoop as root; will display additional
confirmation prompt before starting.
4. run_gcutil_cmd() now displays the full copy/paste-able gcutil command.
5. Now, only the public key of the master-generated ssh keypair is copied
into GCS during setup and onto the datanodes. This fixes the occasional
failed deployment due to GCS list-consistency, and is cleaner anyways.
The ssh keypair is now more descriptively named: 'hadoop_master_id_rsa'.
6. Added check for sshability from master to workers in start_hadoop.sh.
7. Cleaned up internal gcutil commands, added printout of full command
to ssh into the namenode at the end of the deployment.
8. Removed indirect config references from *-site.xml.
9. Stopped explicitly setting mapred.*.dir.
0.31.0 - 2014-01-14
1. Preview release of ghadoop.