AC/FC integration #2995

Merged
merged 32 commits into from
Jan 23, 2025
Changes from 15 commits
671edf2
Add agent control health checks
kaylareopelle Dec 12, 2024
a510eb2
Indent HEREDOC contents
kaylareopelle Jan 7, 2025
e5ad084
Set frequency default to 5 on HealthCheck init
kaylareopelle Jan 7, 2025
f333e68
Write to a single file per agent instance
kaylareopelle Jan 7, 2025
aa4409d
Update array destructure for update_message
kaylareopelle Jan 7, 2025
95808e8
Rubocop
kaylareopelle Jan 7, 2025
0d55f7d
Update test expectation
kaylareopelle Jan 7, 2025
20cdad8
Freeze and dup HealthCheck constants
kaylareopelle Jan 7, 2025
9538c38
Update lib/new_relic/agent/configuration/default_source.rb
kaylareopelle Jan 8, 2025
9d01044
Start health check loop before license key checked
kaylareopelle Jan 9, 2025
aab917f
Merge branch 'sa-health-check' of github.com:newrelic/newrelic-ruby-a…
kaylareopelle Jan 9, 2025
5499054
Add tests for update_status
kaylareopelle Jan 14, 2025
8f257f2
Remove remaining `::` before NewRelic::Agent
kaylareopelle Jan 14, 2025
3b45b9d
Enable health checks in monitoring? test
kaylareopelle Jan 14, 2025
eb1e8fb
get => set
kaylareopelle Jan 14, 2025
267a58f
Undo :: change
kaylareopelle Jan 14, 2025
ac1bc31
Add debugging for CI failure
kaylareopelle Jan 14, 2025
53d1b1d
rubocop
kaylareopelle Jan 14, 2025
21e65c9
Add check for else
kaylareopelle Jan 14, 2025
29067ea
Remove puts, use correct config name in test
kaylareopelle Jan 14, 2025
9e943df
Merge branch 'dev' into sa-health-check
kaylareopelle Jan 15, 2025
05ef5e4
Apply suggestions from code review
kaylareopelle Jan 15, 2025
b3b6766
Add healthy? to health check
kaylareopelle Jan 15, 2025
ac40640
Merge branch 'sa-health-check' of github.com:newrelic/newrelic-ruby-a…
kaylareopelle Jan 15, 2025
a7e87cb
Set @continue = false on init when checks disabled
kaylareopelle Jan 15, 2025
9dc3cd4
agent_control.fleet_id => agent_control.enabled
kaylareopelle Jan 17, 2025
5ab814b
Set default value for delivery location
kaylareopelle Jan 17, 2025
a1bcd2a
Update documentation
kaylareopelle Jan 17, 2025
d85904b
Merge branch 'dev' into sa-health-check
kaylareopelle Jan 17, 2025
0c06889
Rubocop - Minitest/EmptyLineBeforeAssertionMethods
kaylareopelle Jan 17, 2025
c38bff3
Make agent_control settings private
kaylareopelle Jan 21, 2025
87a0909
Do not create the health directory
kaylareopelle Jan 21, 2025
4 changes: 4 additions & 0 deletions CHANGELOG.md
Expand Up @@ -6,6 +6,10 @@

The agent now supports Ruby 3.4.0. We've made incremental changes throughout the preview stage to reach compatibility. This release includes an update to the Thread Profiler for compatibility with Ruby 3.4.0's new backtrace format. [Issue#2992](https://github.com/newrelic/newrelic-ruby-agent/issues/2992) [PR#2997](https://github.com/newrelic/newrelic-ruby-agent/pull/2997)

- **Feature: Add health checks when the agent runs within Agent Control**

When the agent is started within an Agent Control environment, health check files are automatically created at the configured file destination and updated at the configured frequency. [PR#2995](https://github.com/newrelic/newrelic-ruby-agent/pull/2995)
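
As a rough illustration of what such a health file might contain, the sketch below renders the text for one status. It follows the field order used by `HealthCheck#contents` in this PR; `render_health_file` is a hypothetical helper and not part of the agent.

```ruby
# Hypothetical helper, not part of the agent: renders the text of a health
# file from a status hash shaped like the HealthCheck constants in this PR.
def render_health_file(status, start_time_ns, status_time_ns)
  lines = [
    "healthy: #{status[:healthy]}",
    "status: #{status[:message]}"
  ]
  # last_error is only written when the agent is unhealthy
  lines << "last_error: #{status[:last_error]}" unless status[:healthy]
  lines << "start_time_unix_nano: #{start_time_ns}"
  lines << "status_time_unix_nano: #{status_time_ns}"
  lines.join("\n") + "\n"
end

missing_key = {healthy: false, last_error: 'NR-APM-002', message: 'License key missing in configuration'}
now = Process.clock_gettime(Process::CLOCK_REALTIME, :nanosecond)
puts render_health_file(missing_key, now, now)
```

A healthy status produces the same file without the `last_error` line.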

- **Bugfix: Stop emitting inaccurate debug-level log about deprecated configuration options**

In the previous major release, we dropped support for `disable_<library_name>` configuration options in favor of `instrumentation.<library_name>`. Previously, a DEBUG level log warning appeared whenever `disable_*` options were set to `true`, even for libraries (e.g. Action Dispatch) without equivalent `instrumentation.*` options:
4 changes: 4 additions & 0 deletions lib/new_relic/agent/agent.rb
Expand Up @@ -13,6 +13,7 @@
require 'new_relic/coerce'
require 'new_relic/agent/autostart'
require 'new_relic/agent/harvester'
require 'new_relic/agent/health_check'
require 'new_relic/agent/hostname'
require 'new_relic/agent/new_relic_service'
require 'new_relic/agent/pipe_service'
Expand Down Expand Up @@ -88,6 +89,7 @@ def init_basics
end

def init_components
@health_check = HealthCheck.new
@service = NewRelicService.new
@events = EventListener.new
@stats_engine = StatsEngine.new
Expand Down Expand Up @@ -139,6 +141,8 @@ def instance
# Holds all the methods defined on NewRelic::Agent::Agent
# instances
module InstanceMethods
# the agent control health check file generator
attr_reader :health_check
# the statistics engine that holds all the timeslice data
attr_reader :stats_engine
# the transaction sampler that handles recording transactions
27 changes: 15 additions & 12 deletions lib/new_relic/agent/agent_helpers/connect.rb
Expand Up @@ -54,7 +54,7 @@ def note_connect_failure
# to tell the user what happened, since this is not an error
# we can handle gracefully.
def log_error(error)
::NewRelic::Agent.logger.error("Error establishing connection with New Relic Service at #{control.server}:",
NewRelic::Agent.logger.error("Error establishing connection with New Relic Service at #{control.server}:",
error)
end

Expand All @@ -66,13 +66,14 @@ def log_error(error)
# no longer try to connect to the server, saving the
# application and the server load
def handle_license_error(error)
::NewRelic::Agent.logger.error(error.message,
NewRelic::Agent.logger.error(error.message,
'Visit newrelic.com to obtain a valid license key, or to upgrade your account.')
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::INVALID_LICENSE_KEY)
disconnect
end

def handle_unrecoverable_agent_error(error)
::NewRelic::Agent.logger.error(error.message)
NewRelic::Agent.logger.error(error.message)
disconnect
shutdown
end
Expand All @@ -96,15 +97,15 @@ def event_harvest_config
# connects, then configures the agent using the response from
# the connect service
def connect_to_server
request_builder = ::NewRelic::Agent::Connect::RequestBuilder.new(
request_builder = NewRelic::Agent::Connect::RequestBuilder.new(
@service,
Agent.config,
event_harvest_config,
environment_for_connect
)
connect_response = @service.connect(request_builder.connect_payload)

response_handler = ::NewRelic::Agent::Connect::ResponseHandler.new(self, Agent.config)
response_handler = NewRelic::Agent::Connect::ResponseHandler.new(self, Agent.config)
response_handler.configure_agent(connect_response)

log_connection(connect_response) if connect_response
Expand All @@ -114,17 +115,17 @@ def connect_to_server
# Logs when we connect to the server, for debugging purposes
# - makes sure we know if an agent has not connected
def log_connection(config_data)
::NewRelic::Agent.logger.debug("Connected to NewRelic Service at #{@service.collector.name}")
::NewRelic::Agent.logger.debug("Agent Run = #{@service.agent_id}.")
::NewRelic::Agent.logger.debug("Connection data = #{config_data.inspect}")
NewRelic::Agent.logger.debug("Connected to NewRelic Service at #{@service.collector.name}")
NewRelic::Agent.logger.debug("Agent Run = #{@service.agent_id}.")
NewRelic::Agent.logger.debug("Connection data = #{config_data.inspect}")
if config_data['messages']&.any?
log_collector_messages(config_data['messages'])
end
end

def log_collector_messages(messages)
messages.each do |message|
::NewRelic::Agent.logger.send(message['level'].downcase, message['message'])
NewRelic::Agent.logger.send(message['level'].downcase, message['message'])
end
end

Expand Down Expand Up @@ -186,21 +187,23 @@ def connect(options = {})
opts = connect_options(options)
return unless should_connect?(opts[:force_reconnect])

::NewRelic::Agent.logger.debug("Connecting Process to New Relic: #$0")
NewRelic::Agent.logger.debug("Connecting Process to New Relic: #$0")
connect_to_server
@connected_pid = $$
@connect_state = :connected
signal_connected
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::HEALTHY)
rescue NewRelic::Agent::ForceDisconnectException => e
handle_force_disconnect(e)
rescue NewRelic::Agent::LicenseException => e
handle_license_error(e)
rescue NewRelic::Agent::UnrecoverableAgentException => e
handle_unrecoverable_agent_error(e)
rescue StandardError, Timeout::Error, NewRelic::Agent::ServerConnectionException => e
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_CONNECT)
retry if retry_from_error?(e, opts)
rescue Exception => e
::NewRelic::Agent.logger.error('Exception of unexpected type during Agent#connect():', e)
NewRelic::Agent.logger.error('Exception of unexpected type during Agent#connect():', e)

raise
end
Expand All @@ -214,7 +217,7 @@ def retry_from_error?(e, opts)
return false unless opts[:keep_retrying]

note_connect_failure
::NewRelic::Agent.logger.info("Will re-attempt in #{connect_retry_period} seconds")
NewRelic::Agent.logger.info("Will re-attempt in #{connect_retry_period} seconds")
sleep(connect_retry_period)
true
end
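
The `connect` changes above attach a health status to each outcome of a connection attempt. A minimal sketch of that mapping follows, assuming stand-in exception classes and a hypothetical `status_for_connect_attempt` method; the real exceptions live under `NewRelic::Agent`, and the real path also retries and logs.

```ruby
# Stand-in exception classes for illustration only; the agent's own classes
# are NewRelic::Agent::ForceDisconnectException and LicenseException.
class ForceDisconnectError < StandardError; end
class LicenseError < StandardError; end

module ConnectStatusSketch
  # last_error codes copied from the HealthCheck constants in this PR
  CODES = {
    healthy: 'NR-APM-000',
    invalid_license_key: 'NR-APM-001',
    forced_disconnect: 'NR-APM-003',
    failed_to_connect: 'NR-APM-009'
  }.freeze
end

# Runs the block and returns the code the connect path would report.
# Specific exceptions are rescued before the StandardError catch-all.
def status_for_connect_attempt
  yield
  ConnectStatusSketch::CODES[:healthy]
rescue ForceDisconnectError
  ConnectStatusSketch::CODES[:forced_disconnect]
rescue LicenseError
  ConnectStatusSketch::CODES[:invalid_license_key]
rescue StandardError
  ConnectStatusSketch::CODES[:failed_to_connect]
end

puts status_for_connect_attempt { :connected }
puts status_for_connect_attempt { raise LicenseError, 'bad key' }
puts status_for_connect_attempt { raise IOError, 'network down' }
```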
3 changes: 3 additions & 0 deletions lib/new_relic/agent/agent_helpers/harvest.rb
Expand Up @@ -119,6 +119,7 @@ def send_data_to_endpoint(endpoint, payload, container)
rescue UnrecoverableServerException => e
NewRelic::Agent.logger.warn("#{endpoint} data was rejected by remote service, discarding. Error: ", e)
rescue ServerConnectionException => e
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_CONNECT)
log_remote_unavailable(endpoint, e)
container.merge!(payload)
rescue => e
Expand All @@ -133,9 +134,11 @@ def check_for_and_handle_agent_commands
rescue ForceRestartException, ForceDisconnectException
raise
rescue UnrecoverableServerException => e
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_CONNECT)
NewRelic::Agent.logger.warn('get_agent_commands message was rejected by remote service, discarding. ' \
'Error: ', e)
rescue ServerConnectionException => e
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_CONNECT)
log_remote_unavailable(:get_agent_commands, e)
rescue => e
NewRelic::Agent.logger.info('Error during check_for_and_handle_agent_commands, will retry later: ', e)
1 change: 1 addition & 0 deletions lib/new_relic/agent/agent_helpers/shutdown.rb
Expand Up @@ -19,6 +19,7 @@ def shutdown
revert_to_default_configuration

@started = nil
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::SHUTDOWN)
Control.reset
end
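
Setting `SHUTDOWN` here lets the health check loop write one final file and then stop, because the loop in `health_check.rb` clears `@continue` once the status equals `SHUTDOWN`. The sketch below reproduces that behavior in isolation. `HealthLoopSketch` is a hypothetical, trimmed-down class that writes to a temporary directory, not the agent's implementation.

```ruby
require 'tmpdir'

# Hypothetical, simplified stand-in for NewRelic::Agent::HealthCheck
class HealthLoopSketch
  HEALTHY = {healthy: true, message: 'Healthy'}.freeze
  SHUTDOWN = {healthy: true, message: 'Agent has shutdown'}.freeze

  attr_reader :writes

  def initialize(path, frequency)
    @path = path
    @frequency = frequency
    @status = HEALTHY
    @continue = true
    @writes = 0
  end

  # Status updates are ignored once the loop has been stopped
  def update_status(status)
    return unless @continue

    @status = status.dup
  end

  # Sleep, write, then stop after the shutdown status has been written
  def run
    Thread.new do
      while @continue
        sleep @frequency
        File.write(@path, "healthy: #{@status[:healthy]}\nstatus: #{@status[:message]}\n")
        @writes += 1
        @continue = false if @status == SHUTDOWN
      end
    end
  end
end

Dir.mktmpdir do |dir|
  file = File.join(dir, 'health-example.yml')
  check = HealthLoopSketch.new(file, 0.01)
  check.update_status(HealthLoopSketch::SHUTDOWN)
  check.run.join
  puts File.read(file)
end
```

The comparison works on a duplicated hash because `Hash#==` compares contents rather than object identity.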

1 change: 1 addition & 0 deletions lib/new_relic/agent/agent_helpers/start_worker_thread.rb
Expand Up @@ -86,6 +86,7 @@ def handle_force_restart(error)
# is the worker thread that gathers data and talks to the
# server.
def handle_force_disconnect(error)
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FORCED_DISCONNECT)
::NewRelic::Agent.logger.warn('Agent received a ForceDisconnectException from the server, disconnecting. ' \
"(#{error.message})")
disconnect
7 changes: 7 additions & 0 deletions lib/new_relic/agent/agent_helpers/startup.rb
Expand Up @@ -36,6 +36,9 @@ def start
# setting up the worker thread and the exit handler to shut
# down the agent
def check_config_and_start_agent
# some health statuses, such as invalid license key, are set before
# the agent officially starts
@health_check.create_and_run_health_check_loop
return unless monitoring? && has_correct_license_key?
return if using_forking_dispatcher?

Expand Down Expand Up @@ -129,6 +132,7 @@ def monitoring?
if Agent.config[:monitor_mode]
true
else
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::AGENT_DISABLED)
::NewRelic::Agent.logger.warn('Agent configured not to send data in this environment.')
false
end
Expand All @@ -140,6 +144,7 @@ def has_license_key?
if Agent.config[:license_key] && Agent.config[:license_key].length > 0
true
else
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::MISSING_LICENSE_KEY)
::NewRelic::Agent.logger.warn('No license key found. ' +
'This often means your newrelic.yml file was not found, or it lacks a section for the running ' \
"environment, '#{NewRelic::Control.instance.env}'. You may also want to try linting your newrelic.yml " \
Expand All @@ -160,6 +165,7 @@ def correct_license_length
if key.length == 40
true
else
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::INVALID_LICENSE_KEY)
::NewRelic::Agent.logger.error("Invalid license key: #{key}")
false
end
Expand All @@ -180,6 +186,7 @@ def agent_should_start?
end

unless app_name_configured?
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::MISSING_APP_NAME)
NewRelic::Agent.logger.error('No application name configured.',
'The agent cannot start without at least one. Please check your ',
'newrelic.yml and ensure that it is valid and has at least one ',
24 changes: 24 additions & 0 deletions lib/new_relic/agent/configuration/default_source.rb
Expand Up @@ -2188,6 +2188,30 @@ def self.notify
:transform => DefaultSource.method(:convert_to_constant_list),
:description => 'Specify a list of exceptions you do not want the agent to strip when [strip_exception_messages](#strip_exception_messages-enabled) is `true`. Separate exceptions with a comma. For example, `"ImportantException,PreserveMessageException"`.'
},
# Agent Control
:'agent_control.fleet_id' => {
:default => nil,
:allow_nil => true,
:public => true,
:type => String,
:allowed_from_server => false,
:description => 'This assigns a fleet ID to the language agent. This ID is generated by agent control. If this setting is present, it indicates the agent is running in an agent control/fleet environment and health file(s) will be generated. This configuration will be set by agent control, or one of its components, prior to agent startup.'
},
:'agent_control.health.delivery_location' => {
:default => nil,
:allow_nil => true,
:public => true,
:type => String,
:allowed_from_server => false,
:description => 'A `file:` URI that specifies the fully qualified directory path for health file(s) to be written to. For example: `file:///var/lib/newrelic-super-agent/fleet/agents.d/<fleet_id>`. This configuration will be set by agent control, or one of its components, prior to agent startup.'
},
:'agent_control.health.frequency' => {
:default => 5,
:public => true,
:type => Integer,
:allowed_from_server => false,
:description => 'The interval, in seconds, of how often the health file(s) will be written to. This configuration will be set by agent control, or one of its components, prior to agent startup.'
},
# Thread profiler
:'thread_profiler.enabled' => {
:default => DefaultSource.thread_profiler_enabled,
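
In this revision, `HealthCheck#initialize` reads the environment-variable equivalents of the three `agent_control.*` settings directly. The sketch below shows one way those values might be normalized. `health_settings` and `health_checks_enabled?` are hypothetical helpers, and `delete_prefix` is used where the PR itself calls `gsub('file://', '')`.

```ruby
# Hypothetical helper: `env` is any Hash-like object, so the sketch can be
# exercised without touching the real ENV.
def health_settings(env)
  location = env['NEW_RELIC_AGENT_CONTROL_HEALTH_DELIVERY_LOCATION']
  frequency = env['NEW_RELIC_AGENT_CONTROL_HEALTH_FREQUENCY']
  {
    fleet_id: env['NEW_RELIC_AGENT_CONTROL_FLEET_ID'],
    # The delivery location arrives as a file: URI; strip the scheme so the
    # remainder can be used as a plain directory path
    delivery_location: location&.delete_prefix('file://'),
    # Fall back to the documented default of 5 seconds
    frequency: frequency ? frequency.to_i : 5
  }
end

# Health checks run only when all three conditions hold
def health_checks_enabled?(settings)
  !!(settings[:fleet_id] && settings[:delivery_location] && settings[:frequency] > 0)
end

settings = health_settings(
  'NEW_RELIC_AGENT_CONTROL_FLEET_ID' => 'example-fleet',
  'NEW_RELIC_AGENT_CONTROL_HEALTH_DELIVERY_LOCATION' => 'file:///tmp/health'
)
puts settings.inspect
puts health_checks_enabled?(settings)
```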
2 changes: 2 additions & 0 deletions lib/new_relic/agent/configuration/yaml_source.rb
Expand Up @@ -36,6 +36,7 @@ def initialize(path, env)
erb_file = process_erb(raw_file)
config = process_yaml(erb_file, env, config, @file_path)
rescue ScriptError, StandardError => e
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_PARSE_CONFIG)
log_failure("Failed to read or parse configuration file at #{path}", e)
end

Expand Down Expand Up @@ -99,6 +100,7 @@ def process_erb(file)
file.gsub!(/^\s*#.*$/, '#')
ERB.new(file).result(binding)
rescue ScriptError, StandardError => e
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_PARSE_CONFIG)
message = 'Failed ERB processing configuration file. This is typically caused by a Ruby error in <% %> templating blocks in your newrelic.yml file.'
failure_array = [message, e]
failure_array << e.backtrace[0] if Gem::Version.new(RUBY_VERSION) >= Gem::Version.new('3.4.0')
125 changes: 125 additions & 0 deletions lib/new_relic/agent/health_check.rb
@@ -0,0 +1,125 @@
# This file is distributed under New Relic's license terms.
# See https://github.com/newrelic/newrelic-ruby-agent/blob/main/LICENSE for complete details.
# frozen_string_literal: true

module NewRelic
module Agent
class HealthCheck
def initialize
@start_time = nano_time
@fleet_id = ENV['NEW_RELIC_AGENT_CONTROL_FLEET_ID']
# The spec states file paths for the delivery location will begin with file://
# This does not create a valid path in Ruby, so remove the prefix when present
@delivery_location = ENV['NEW_RELIC_AGENT_CONTROL_HEALTH_DELIVERY_LOCATION']&.gsub('file://', '')
@frequency = ENV['NEW_RELIC_AGENT_CONTROL_HEALTH_FREQUENCY'] ? ENV['NEW_RELIC_AGENT_CONTROL_HEALTH_FREQUENCY'].to_i : 5
@continue = true
@status = HEALTHY
end

HEALTHY = {healthy: true, last_error: 'NR-APM-000', message: 'Healthy'}.freeze
INVALID_LICENSE_KEY = {healthy: false, last_error: 'NR-APM-001', message: 'Invalid license key (HTTP status code 401)'}.freeze
MISSING_LICENSE_KEY = {healthy: false, last_error: 'NR-APM-002', message: 'License key missing in configuration'}.freeze
FORCED_DISCONNECT = {healthy: false, last_error: 'NR-APM-003', message: 'Forced disconnect received from New Relic (HTTP status code 410)'}.freeze
HTTP_ERROR = {healthy: false, last_error: 'NR-APM-004', message: 'HTTP error response code [%s] received from New Relic while sending data type [%s]'}.freeze
MISSING_APP_NAME = {healthy: false, last_error: 'NR-APM-005', message: 'Missing application name in agent configuration'}.freeze
APP_NAME_EXCEEDED = {healthy: false, last_error: 'NR-APM-006', message: 'The maximum number of configured app names (3) exceeded'}.freeze
PROXY_CONFIG_ERROR = {healthy: false, last_error: 'NR-APM-007', message: 'HTTP Proxy configuration error; response code [%s]'}.freeze
AGENT_DISABLED = {healthy: false, last_error: 'NR-APM-008', message: 'Agent is disabled via configuration'}.freeze
FAILED_TO_CONNECT = {healthy: false, last_error: 'NR-APM-009', message: 'Failed to connect to New Relic data collector'}.freeze
FAILED_TO_PARSE_CONFIG = {healthy: false, last_error: 'NR-APM-010', message: 'Agent config file is not able to be parsed'}.freeze
SHUTDOWN = {healthy: true, last_error: 'NR-APM-099', message: 'Agent has shutdown'}.freeze

def create_and_run_health_check_loop
unless health_check_enabled?
@continue = false
end

return NewRelic::Agent.logger.debug('NEW_RELIC_AGENT_CONTROL_FLEET_ID not found, skipping health checks') unless @fleet_id
return NewRelic::Agent.logger.debug('NEW_RELIC_AGENT_CONTROL_HEALTH_DELIVERY_LOCATION not found, skipping health checks') unless @delivery_location
return NewRelic::Agent.logger.debug('NEW_RELIC_AGENT_CONTROL_HEALTH_FREQUENCY zero or less, skipping health checks') unless @frequency > 0

NewRelic::Agent.logger.debug('Agent control health check conditions met. Starting health checks.')
NewRelic::Agent.record_metric('Supportability/AgentControl/Health/enabled', 1)

Thread.new do
while @continue
begin
sleep @frequency
write_file
@continue = false if @status == SHUTDOWN
rescue StandardError => e
NewRelic::Agent.logger.error("Aborting agent control health check. Error raised: #{e}")
@continue = false
end
end
end
end

def update_status(status, options = [])
return unless @continue

@status = status.dup
update_message(options) unless options.empty?
end

private

def contents
<<~CONTENTS
healthy: #{@status[:healthy]}
status: #{@status[:message]}#{last_error}
start_time_unix_nano: #{@start_time}
status_time_unix_nano: #{nano_time}
CONTENTS
end

def last_error
@status[:healthy] ? '' : "\nlast_error: #{@status[:last_error]}"
end

def nano_time
Process.clock_gettime(Process::CLOCK_REALTIME, :nanosecond)
end

def file_name
"health-#{NewRelic::Agent::GuidGenerator.generate_guid(32)}.yml"
end

def write_file
@file ||= "#{create_file_path}/#{file_name}"

File.write(@file, contents)
rescue StandardError => e
NewRelic::Agent.logger.error("Agent control health check raised an error while writing a file: #{e}")
@continue = false
end

def create_file_path
for abs_path in [File.expand_path(@delivery_location),
File.expand_path(File.join('', @delivery_location))] do
if File.directory?(abs_path) || (Dir.mkdir(abs_path) rescue nil)
return abs_path[%r{^(.*?)/?$}]
end
end
nil
rescue StandardError => e
NewRelic::Agent.logger.error(
'Agent control health check raised an error while finding or creating the file path defined in ' \
"NEW_RELIC_AGENT_CONTROL_HEALTH_DELIVERY_LOCATION: #{e}"
)
@continue = false
end

def health_check_enabled?
@fleet_id && @delivery_location && (@frequency > 0)
end

def update_message(options)
@status[:message] = sprintf(@status[:message], *options)
rescue StandardError => e
NewRelic::Agent.logger.debug("Error raised while updating agent control health check message: #{e}. " \
"Reverting to original message. options = #{options}, @status[:message] = #{@status[:message]}")
end
end
end
end
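
Some statuses, such as `HTTP_ERROR`, carry `%s` placeholders that `update_status` fills in from its `options` array. The sketch below shows how that might behave outside the agent. `StatusTemplateSketch` and `build_status` are hypothetical names, and the response code and data type are made-up inputs.

```ruby
module StatusTemplateSketch
  # Same shape as the HTTP_ERROR constant in this PR
  HTTP_ERROR = {
    healthy: false,
    last_error: 'NR-APM-004',
    message: 'HTTP error response code [%s] received from New Relic while sending data type [%s]'
  }.freeze
end

# Hypothetical stand-alone version of update_status plus update_message
def build_status(template, options = [])
  # dup returns a shallow, unfrozen copy, so the frozen constant is untouched
  status = template.dup
  status[:message] = format(status[:message], *options) unless options.empty?
  status
end

status = build_status(StatusTemplateSketch::HTTP_ERROR, [503, :metric_data])
puts status[:message]
```

Passing fewer values than there are placeholders makes `format` raise an `ArgumentError`, which is the kind of failure the `rescue` in `update_message` is there to absorb.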