---
title: Troubleshooting Problems in Pivotal Platform
owner: Ops Manager
---
<% current_page.data.title = "Troubleshooting Problems in " + vars.product_name %>
This guide provides help with resolving issues encountered
during a [<%= vars.product_name %>](https://network.pivotal.io/products/pivotal-cf) installation.
## <a id='retry'></a> Retrying the Deployment
Although an install or update can fail for many reasons, the system is self-healing and can often automatically correct or work around hardware or network faults.
To retry, click **Install** or **Review Pending Changes**, then **Apply Changes** again. The system may resolve the problem on its own.
Some failures only produce generic errors such as `Exited with 1`.
When a failure is not accompanied by useful information,
retrying in this way is a reasonable first step.
When the system does provide informative evidence, review the [Common Issues](#common_probs) section below to see if your
problem is covered there.
## <a id='common_probs'></a>Common Issues ##
Compare evidence that you have gathered to the descriptions below.
If your issue is covered, try the recommended remediation procedures.
### <a id='reinstall_bosh'></a>BOSH Does Not Reinstall ###
You might want to reinstall BOSH for troubleshooting purposes.
However, if <%= vars.product_name %> does not detect any changes, BOSH does not reinstall.
To force a reinstall of BOSH, select **BOSH Director > Resource Sizes**
and change a resource value.
For example, you could increase the amount of RAM by 4 MB.
### <a id='time_out'></a>Creating Bound Missing VMs Times Out ###
This task happens immediately following package compilation, but before job
assignment to agents.
For example:
<pre class="terminal">
cloud_controller/0: Timed out pinging to f690db09-876c-475e-865f-2cece06aba79 after 600 seconds (00:10:24)
</pre>
This is most likely a NATS issue with the VM in question.
To identify a NATS issue, inspect the agent log for the VM.
Since the BOSH director is unable to reach the BOSH agent, you must access the
VM using another method.
You will likely also be unable to access the VM using TCP.
In this case, access the VM using your virtualization console.
To diagnose:
1. Navigate to the **Credentials** tab of the Pivotal Application Service (PAS) tile and
locate the VM in question to find the **VM credentials**.
1. Access the VM using your virtualization console and log in with those credentials.
1. Become root.
1. Run `cd /var/vcap/bosh/log`.
1. Open the file `current`.
1. First, determine whether the BOSH agent and director have successfully
completed a handshake, represented in the logs as a "ping-pong":
<pre class="terminal">
2013-10-03_14:35:48.58456 #[608] INFO: Message: {"method"=>"ping", "arguments"=>[],
"reply_to"=>"director.f4b7df14-cb8f.19719508-e0dd-4f53-b755-58b6336058ab"}
2013-10-03_14:35:48.60182 #[608] INFO: reply_to: director.f4b7df14-cb8f.19719508-e0dd-4f53-b755-58b6336058ab:
payload: {:value=>"pong"}
</pre>
This handshake must complete for the agent to receive instructions from the
director.
1. If you do not see the handshake, look for another line near the beginning of
the file, prefixed `INFO: loaded new infrastructure settings`.
For example:
<pre class="terminal">
2013-10-03_14:35:21.83222 #[608] INFO: loaded new infrastructure settings:
{"vm"=>{"name"=>"vm-4d80ede4-b0a5-4992-aea6a0386e18e", "id"=>"vm-360"},
"agent_id"=>"56aea4ef-6aa9-4c39-8019-7024ccfdde4",
"networks"=>{"default"=>{"ip"=>"192.0.2.19",
"netmask"=>"255.255.255.0", "cloud_properties"=>{"name"=>"VMNetwork"},
"default"=>["dns", "gateway"],
"dns"=>["192.0.2.2", "192.0.2.17"], "gateway"=>"192.0.2.2",
"dns_record_name"=>"0.nats.default.cf-d729343071061.microbosh",
"mac"=>"00:50:56:9b:71:67"}}, "disks"=>{"system"=>0, "ephemeral"=>1,
"persistent"=>{}}, "ntp"=>[], "blobstore"=>{"provider"=>"dav",
"options"=>{"endpoint"=>"http://192.0.2.17:25250",
"user"=>"agent", "password"=>"agent"}},
"mbus"=>"nats://nats:[email protected]:4222",
"env"=>{"bosh"=>{"password"=>"$6$40ftQ9K4rvvC/8ADZHW0"}}}
</pre>
This blob of key/value pairs, printed in Ruby hash syntax, represents the infrastructure
settings that the BOSH agent expects.
For this issue, the following section is the most important:
`"mbus"=>"nats://nats:[email protected]:4222"`
This key/value pair represents where the agent expects the NATS server to be.
One diagnostic tactic is to try pinging this NATS IP address from the VM to
determine whether you are experiencing routing issues.
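
The log checks described above can be sketched in Python. This is a minimal illustration, not part of BOSH: the function names `handshake_seen` and `mbus_endpoint` are hypothetical, the patterns are inferred only from the sample log lines shown in this section, and the sample `mbus` URL uses placeholder credentials.

```python
import re
from urllib.parse import urlsplit

# Patterns inferred from the sample agent log lines in this topic.
# Assumption: other agent versions may format these messages differently.
PING_RE = re.compile(r'"method"=>"ping"')
PONG_RE = re.compile(r':value=>"pong"')
MBUS_RE = re.compile(r'"mbus"=>"([^"]+)"')


def handshake_seen(log_text):
    """Return True if the log contains both a director ping and an agent pong."""
    return bool(PING_RE.search(log_text)) and bool(PONG_RE.search(log_text))


def mbus_endpoint(log_text):
    """Return (host, port) of the NATS server the agent expects, or None."""
    match = MBUS_RE.search(log_text)
    if match is None:
        return None
    url = urlsplit(match.group(1))
    return url.hostname, url.port


# Hypothetical log excerpt with placeholder credentials, for illustration only.
sample = (
    'INFO: Message: {"method"=>"ping", "arguments"=>[]}\n'
    'payload: {:value=>"pong"}\n'
    '"mbus"=>"nats://user:secret@192.0.2.17:4222"\n'
)
print(handshake_seen(sample))
print(mbus_endpoint(sample))
```

To use the sketch on a real VM, read the contents of `/var/vcap/bosh/log/current` into a string and pass it to both functions. The host and port that `mbus_endpoint` returns are the address to test for routing issues.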
### <a id='RSA-cert'></a>Install Exits With a Creates/Updates/Deletes App Failure or With a 403 Error ###
**Scenario 1:**
Your <%= vars.product_name %> install exits with the following 403 error when you attempt to
log in to the Apps Manager:
<pre>
{"type": "step_finished", "id": "apps-manager.deploy"}
/home/tempest-web/tempest/web/vendor/bundle/ruby/1.9.1/gems/mechanize-2.7.2/lib/mechanize/http/agent.rb:306:in
`fetch': 403 => Net::HTTPForbidden for https://login.api.example.net/oauth/authorizeresponse_type=code&client_id=portal&redirect_uri=https%3...
-- unhandled response (Mechanize::ResponseCodeError)
</pre>
**Scenario 2:**
Your <%= vars.product_name %> install exits with a `creates/updates/deletes an app (FAILED - 1)` error message with the following stack trace:
<pre>
1) App CRUD creates/updates/deletes an app
Failure/Error: Unable to find matching line from backtrace
CFoundry::TargetRefused:
Connection refused - connect(2)
</pre>
In either of the above scenarios, ensure that you have correctly entered your
domains in wildcard format:
1. Browse to the Operations Manager fully qualified domain name (FQDN).
1. Click the PAS tile.
1. Select **HAProxy** and click **Generate Self-Signed RSA Certificate**.
1. Enter your system and app domains in wildcard format, as well as optionally
any custom domains, and click **Save**.
Refer to the **Cloud Controller** pane of the PAS tile for explanations of these
domain values.
<%= image_tag('rsa_cert.png') %>
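
Before saving, it can help to sanity-check each domain string. The following Python sketch assumes that "wildcard format" means a leading `*.` followed by at least two DNS labels, as in `*.system.example.com`. The helper name `is_wildcard_domain` is hypothetical and is not part of Ops Manager.

```python
import re

# One DNS label: letters, digits, and hyphens, with no leading or trailing hyphen.
_LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"

# Assumption: a wildcard domain is "*." followed by two or more labels.
_WILDCARD_RE = re.compile(r"^\*\.(?:%s\.)+%s$" % (_LABEL, _LABEL))


def is_wildcard_domain(value):
    """Return True if value looks like a wildcard domain such as *.example.com."""
    return _WILDCARD_RE.match(value.strip()) is not None


for candidate in ("*.system.example.com", "system.example.com", "*example.com"):
    print(candidate, is_wildcard_domain(candidate))
```

The check is purely syntactic. It does not confirm that the domain resolves or that it matches the system and app domains configured in PAS.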
### <a id='runtime_depend'></a>Install Fails When Gateway Instances Exceed Zero ###
If you configure the number of Gateway instances to be greater than zero for a
given product, you create a dependency on PAS for that product
installation.
If you attempt to install a product tile with a PAS dependency
before installing PAS, the install fails.
To change the number of Gateway instances, click the product
tile, then select **Settings > Resource sizes > INSTANCES** and change the
value next to the product Gateway job.
To remove the PAS dependency, change the value of this field to `0`.
### <a id='out_of_disk_space'></a>Out of Disk Space Error ###
<%= vars.product_name %> displays an `Out of Disk Space` error if log files expand to fill
all available disk space.
If this happens, rebooting the <%= vars.product_name %> installation VM clears the tmp
directory of these log files and resolves the error.
If users receive `Out of Disk Space` errors when trying to push apps, Diego cells
might be running out of disk capacity.
To perform a detailed analysis of disk usage by containers and host VMs in your PAS deployment,
see [Examining GrootFS Disk Usage](../adminguide/examining_grootfs_disk.html).
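
As a quick first check on any VM where Python is available, a short script can report how full a filesystem is. This is a minimal sketch: the helper names and the 90 percent threshold are assumptions chosen for illustration, not values defined by the product.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def is_nearly_full(path="/", threshold_percent=90.0):
    """Return True if usage is at or above the threshold (assumed default: 90)."""
    return disk_usage_percent(path) >= threshold_percent


print("root filesystem in use: %.1f%%" % disk_usage_percent("/"))
print("nearly full:", is_nearly_full("/"))
```

Pass a different path, such as the temporary directory, to check another filesystem.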
### <a id='vsphere_fails'></a>Installing BOSH Director Fails ###
If the DNS information for the <%= vars.product_name %> VM is incorrectly specified when
deploying the <%= vars.product_name %> .ova file, installing BOSH Director fails at the "Installing Micro BOSH" step.
To resolve this issue, correct the DNS settings in the <%= vars.product_name %> Virtual
Machine properties.
### <a id='delete_om_fails'></a>Deleting Ops Manager Fails ###
Ops Manager displays an error message when it cannot delete your installation. This scenario might happen if the BOSH Director cannot access the VMs or is experiencing other issues. To manually delete your installation and all VMs, you must do the following:
1. Use your IaaS dashboard to manually delete the VMs for all installed products, with the exception of the Ops Manager VM.
1. SSH into your Ops Manager VM and remove the `installation.yml` file from `/var/tempest/workspaces/default/`.
<p class="note"><strong>Note</strong>: Deleting the <code>installation.yml</code> file does not prevent you from reinstalling Ops Manager. For future deploys, Ops Manager regenerates this file when you click <strong>Save</strong> on any page in the BOSH Director.</p>
Your installation is now deleted.
### <a id='elastic_runtime_fails'></a>Installing PAS Fails ###
The following sections describe errors that may occur when installing PAS and how to resolve them.
#### Ops Manager Cannot Verify App Push
If the DNS information for the <%= vars.product_name %> VM becomes incorrect after BOSH Director has been installed, installing PAS with Pivotal Operations
Manager fails at the "Verifying app push" step.
To resolve this issue, correct the DNS settings in the <%= vars.product_name %> Virtual
Machine properties.
#### MySQL Monitor Not Running After Update
When MySQL cannot communicate with UAA, Ops Manager shows the following error:
> **Error: 'mysql_monitor/12a3b456-cd7e-8fgh-9012-345678b90ijk (0)' is not running after update. Review logs for failed jobs: replication-canary.**
If you see this error, create firewall rules that allow MySQL to reach UAA, using the [MySQL Network Communications](../security/networking/mysql-network-paths.html) topic as a reference.
### <a id='ip_address_taken'></a>Ops Manager Hangs During MicroBOSH Install or HAProxy States "IP Address Already Taken" ###
During an Ops Manager installation, you might receive the following errors:
* The Ops Manager GUI shows that the installation stops at the "Setting MicroBOSH deployment manifest" task.
* When you set the IP address for the HAProxy, the "IP Address Already Taken" message appears.
When you install Ops Manager, you assign it an IP address. Ops Manager then takes the next two consecutive IP addresses, assigns the first to MicroBOSH, and reserves the second. For example:
```
203.0.113.1 - Ops Manager (User assigned)
203.0.113.2 - MicroBOSH (Ops Manager assigned)
203.0.113.3 - Reserved (Ops Manager reserved)
```
To resolve this issue, ensure that the two IP addresses immediately following the manually assigned address are unassigned.
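
The address arithmetic can be sketched with Python's standard `ipaddress` module. The helper names `reserved_addresses` and `conflicts` are hypothetical, and the list of addresses already in use is something you would gather from your own network inventory.

```python
import ipaddress


def reserved_addresses(ops_manager_ip):
    """Return the three consecutive addresses described above, keyed by role."""
    base = ipaddress.ip_address(ops_manager_ip)
    return {
        "Ops Manager": str(base),
        "MicroBOSH": str(base + 1),
        "Reserved": str(base + 2),
    }


def conflicts(ops_manager_ip, addresses_in_use):
    """Return the addresses Ops Manager needs that are already assigned."""
    plan = reserved_addresses(ops_manager_ip)
    needed = {plan["MicroBOSH"], plan["Reserved"]}
    return sorted(needed & set(addresses_in_use))


print(reserved_addresses("203.0.113.1"))
print(conflicts("203.0.113.1", ["203.0.113.3", "203.0.113.9"]))
```

An empty list from `conflicts` means that both follow-on addresses are free in the inventory you supplied.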
### <a id='pcf_slow'></a>Poor <%= vars.product_name %> Performance ###
If you notice poor network performance in your <%= vars.product_name %> deployment and your deployment uses a Network Address Translation (NAT) gateway, your NAT gateway may be under-resourced.
#### <a id='troubleshoot-slow'></a>Troubleshoot
To troubleshoot the issue, set a custom firewall rule in your IaaS console to route traffic originating from your private network directly to an S3-compatible object store. If you see decreased average latency and improved network performance, follow the procedure below to scale up your NAT gateway.
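
One way to compare average latency before and after applying the firewall rule is to time repeated TCP connections to the object store endpoint. The sketch below is only an illustration: `connect_latency_ms` is a hypothetical helper, the host and port you pass in depend on your object store, and TCP connection time is just one rough proxy for network latency.

```python
import socket
import statistics
import time


def connect_latency_ms(host, port, attempts=5, timeout=3.0):
    """Return the mean time, in milliseconds, to open a TCP connection."""
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        # Record the elapsed time as soon as the connection is established.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.perf_counter() - start
        samples.append(elapsed * 1000.0)
    return statistics.mean(samples)
```

Run the function from a VM in the private network once before and once after adding the rule, using the same endpoint and number of attempts, and compare the two means.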
#### <a id='solution-slow'></a>Scale Up Your NAT Gateway
Perform the following steps to scale up your NAT gateway:
1. Navigate to your IaaS console.
1. Spin up a new NAT gateway of a larger VM size than your previous NAT gateway.
1. Change the routes to direct traffic through the new NAT gateway.
1. Spin down the old NAT gateway.
The specific procedures will vary depending on your IaaS. Consult your IaaS documentation for more information.
## <a id='common_probs_firewalls'></a>Common Issues Caused by Firewalls ##
This section describes various issues you might encounter when installing
PAS in an environment that uses a strong firewall.
### <a id='DNS_res_fails'></a>DNS Resolution Fails ###
When you install <%= vars.product_name %> in an environment that uses a strong firewall, the firewall might block DNS resolution. To resolve this issue, refer to the [Troubleshooting DNS Resolution Issues](./config_firewall.html#DNS_fails) section of the Preparing Your Firewall for Deploying <%= vars.product_name %> topic.
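
To confirm whether a hostname resolves from a given VM, a small Python check along the following lines may help. The helper name `resolves` is hypothetical. The `resolver` parameter exists only so the function can be exercised without network access, and by default it uses the standard `socket.getaddrinfo` call.

```python
import socket


def resolves(hostname, resolver=socket.getaddrinfo):
    """Return the sorted IP addresses for hostname, or an empty list on failure."""
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        # Resolution failed, for example because a firewall blocks DNS traffic.
        return []
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in results})
```

An empty list for a hostname that should exist, such as your system domain, suggests that DNS traffic is being blocked or that the DNS settings are incorrect.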