mon: add NVMe-oF gateway monitor and HA
- gateway submodule

Fixes: https://tracker.ceph.com/issues/64777

This PR adds high availability (HA) support for the nvmeof Ceph service. High availability means that even if a given GW is down, another path remains available so that the initiator can continue IO through another GW. HA is achieved by running an nvmeof service consisting of at least 2 nvmeof GWs in the Ceph cluster. The host (initiator) sees every GW as a separate path to the nvme namespaces (volumes).

The implementation consists of the following main modules:

- NVMeofGWMon - a PaxosService. It is a monitor that tracks the status of the running nvmeof services and takes action when services fail and when they are restored.
- NVMeofGwMonitorClient – an agent that runs as part of each nvmeof GW. It sends beacons to the monitor to signal that the GW is alive. Each beacon also carries information about the service, which the monitor uses to make decisions and perform operations.
- MNVMeofGwBeacon – the structure used by the client and the monitor to send and receive beacons.
- MNVMeofGwMap – the map that tracks the status of the nvmeof GWs. It also defines the new role of every GW, so when GWs go down or are restored, the map reflects the role each GW takes on as a result of these events. The map is distributed to the NVMeofGwMonitorClient on each GW, which updates the GW with the required changes.
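
As a rough illustration of the beacon mechanism described above, the following C++ sketch shows how a monitor-side tracker might mark a GW as unavailable once no beacon has arrived within a grace period. It is not the code from this commit: GwTracker, GwState, on_beacon, tick and is_available are hypothetical names chosen for this example, and the real NVMeofGWMon keeps its state in a Paxos-replicated map rather than in a local std::map.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified model. None of these names come from the commit.
using Clock = std::chrono::steady_clock;

struct GwState {
  Clock::time_point last_beacon;  // when the tracker last heard from this GW
  bool available = true;          // false once the grace period has expired
};

class GwTracker {
public:
  explicit GwTracker(std::chrono::seconds grace) : grace_(grace) {}

  // Record a beacon from a gateway; a GW that was marked unavailable
  // becomes available again as soon as it is heard from.
  void on_beacon(const std::string& gw_id, Clock::time_point now) {
    GwState& st = gws_[gw_id];
    st.last_beacon = now;
    st.available = true;
  }

  // Periodic check: returns the ids of gateways that are newly marked
  // unavailable because their last beacon is older than the grace period.
  std::vector<std::string> tick(Clock::time_point now) {
    std::vector<std::string> failed;
    for (auto& [id, st] : gws_) {
      if (st.available && now - st.last_beacon > grace_) {
        st.available = false;
        failed.push_back(id);
      }
    }
    return failed;
  }

  bool is_available(const std::string& gw_id) const {
    auto it = gws_.find(gw_id);
    return it != gws_.end() && it->second.available;
  }

private:
  std::chrono::seconds grace_;
  std::map<std::string, GwState> gws_;
};
```

With a grace of 10 seconds, a GW whose last beacon is 11 seconds old would be returned by tick(), while a GW heard from 3 seconds ago would stay available. In the design described by this commit, the equivalent event would lead to a new map being distributed so that the remaining GWs take over the failed GW's role.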

It also adds 3 new mon commands:
- nvme-gw create
- nvme-gw delete
- nvme-gw show

cephadm uses these commands to notify the monitor that a new GW has been deployed. The monitor updates the map accordingly and tracks the GW until it is deleted.

Signed-off-by: Leonid Chernin <[email protected]>
Signed-off-by: Alexander Indenbaum <[email protected]>
leonidc authored and Alexander Indenbaum committed Jul 31, 2024
1 parent 3f4aee2 commit 5843c6b
Showing 37 changed files with 3,885 additions and 10 deletions.
9 changes: 8 additions & 1 deletion .gitmodules
@@ -78,4 +78,11 @@
[submodule "src/BLAKE3"]
path = src/BLAKE3
url = https://github.com/BLAKE3-team/BLAKE3.git

[submodule "src/boost_redis"]
path = src/boost_redis
url = https://github.com/boostorg/redis.git
[submodule "src/nvmeof/gateway"]
path = src/nvmeof/gateway
url = https://github.com/ceph/ceph-nvmeof.git
fetchRecurseSubmodules = false
shallow = true
8 changes: 8 additions & 0 deletions PendingReleaseNotes
@@ -506,3 +506,11 @@ Relevant tracker: https://tracker.ceph.com/issues/57090
set using the `fs set` command. This flag prevents using a standby for another
file system (join_fs = X) when standby for the current filesystem is not available.
Relevant tracker: https://tracker.ceph.com/issues/61599
* mon: add NVMe-oF gateway monitor and HA
This PR adds high availability support for the nvmeof Ceph service. High availability
means that even in the case that a certain GW is down, there will be another available
path for the initiator to be able to continue the IO through another GW.
It is also adding 2 new mon commands, to notify monitor about the gateway creation/deletion:
- nvme-gw create
- nvme-gw delete
Relevant tracker: https://tracker.ceph.com/issues/64777
15 changes: 15 additions & 0 deletions ceph.spec.in
@@ -250,6 +250,7 @@ BuildRequires: gperf
BuildRequires: cmake > 3.5
BuildRequires: fuse-devel
BuildRequires: git
BuildRequires: grpc-devel
%if 0%{?fedora} || 0%{?suse_version} > 1500 || 0%{?rhel} == 9 || 0%{?openEuler}
BuildRequires: gcc-c++ >= 11
%endif
@@ -642,6 +643,17 @@ system. One or more instances of ceph-mon form a Paxos part-time
parliament cluster that provides extremely reliable and durable storage
of cluster membership, configuration, and state.

%package mon-client-nvmeof
Summary: Ceph NVMeoF Gateway Monitor Client
%if 0%{?suse_version}
Group: System/Filesystems
%endif
Provides: ceph-test:/usr/bin/ceph-nvmeof-monitor-client
Requires: librados2 = %{_epoch_prefix}%{version}-%{release}
%description mon-client-nvmeof
Ceph NVMeoF Gateway Monitor Client distributes Paxos ANA info
to NVMeoF Gateway and provides beacons to the monitor daemon

%package mgr
Summary: Ceph Manager Daemon
%if 0%{?suse_version}
@@ -2077,6 +2089,9 @@ if [ $1 -ge 1 ] ; then
fi
fi

%files mon-client-nvmeof
%{_bindir}/ceph-nvmeof-monitor-client

%files fuse
%{_bindir}/ceph-fuse
%{_mandir}/man8/ceph-fuse.8*
112 changes: 112 additions & 0 deletions src/CMakeLists.txt
@@ -305,6 +305,12 @@ endif(WITH_BLKIN)

if(WITH_JAEGER)
find_package(thrift 0.13.0 REQUIRED)

if(EXISTS "/etc/redhat-release" OR EXISTS "/etc/fedora-release")
# absl is installed as grpc build dependency on RPM based systems
add_definitions(-DHAVE_ABSEIL)
endif()

include(BuildOpentelemetry)
build_opentelemetry()
add_library(jaeger_base INTERFACE)
@@ -875,6 +881,112 @@ if(WITH_FUSE)
install(PROGRAMS mount.fuse.ceph DESTINATION ${CMAKE_INSTALL_SBINDIR})
endif(WITH_FUSE)

# NVMEOF GATEWAY MONITOR CLIENT
# Supported on RPM-based platforms only, depends on grpc devel libraries/tools
if(EXISTS "/etc/redhat-release" OR EXISTS "/etc/fedora-release")
option(WITH_NVMEOF_GATEWAY_MONITOR_CLIENT "build nvmeof gateway monitor client" ON)
else()
option(WITH_NVMEOF_GATEWAY_MONITOR_CLIENT "build nvmeof gateway monitor client" OFF)
endif()

if(WITH_NVMEOF_GATEWAY_MONITOR_CLIENT)

# Find Protobuf installation
# Looks for protobuf-config.cmake file installed by Protobuf's cmake installation.
option(protobuf_MODULE_COMPATIBLE TRUE)
find_package(Protobuf REQUIRED)

set(_REFLECTION grpc++_reflection)
if(CMAKE_CROSSCOMPILING)
find_program(_PROTOBUF_PROTOC protoc)
else()
set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
endif()

# Find gRPC installation
# Looks for gRPCConfig.cmake file installed by gRPC's cmake installation.
find_package(gRPC CONFIG REQUIRED)
message(STATUS "Using gRPC ${gRPC_VERSION}")
set(_GRPC_GRPCPP gRPC::grpc++)
if(CMAKE_CROSSCOMPILING)
find_program(_GRPC_CPP_PLUGIN_EXECUTABLE grpc_cpp_plugin)
else()
set(_GRPC_CPP_PLUGIN_EXECUTABLE $<TARGET_FILE:gRPC::grpc_cpp_plugin>)
endif()

# Gateway Proto file
get_filename_component(nvmeof_gateway_proto "nvmeof/gateway/control/proto/gateway.proto" ABSOLUTE)
get_filename_component(nvmeof_gateway_proto_path "${nvmeof_gateway_proto}" PATH)

# Generated sources
set(nvmeof_gateway_proto_srcs "${CMAKE_CURRENT_BINARY_DIR}/gateway.pb.cc")
set(nvmeof_gateway_proto_hdrs "${CMAKE_CURRENT_BINARY_DIR}/gateway.pb.h")
set(nvmeof_gateway_grpc_srcs "${CMAKE_CURRENT_BINARY_DIR}/gateway.grpc.pb.cc")
set(nvmeof_gateway_grpc_hdrs "${CMAKE_CURRENT_BINARY_DIR}/gateway.grpc.pb.h")

add_custom_command(
OUTPUT "${nvmeof_gateway_proto_srcs}" "${nvmeof_gateway_proto_hdrs}" "${nvmeof_gateway_grpc_srcs}" "${nvmeof_gateway_grpc_hdrs}"
COMMAND ${_PROTOBUF_PROTOC}
ARGS --grpc_out "${CMAKE_CURRENT_BINARY_DIR}"
--cpp_out "${CMAKE_CURRENT_BINARY_DIR}"
-I "${nvmeof_gateway_proto_path}"
--experimental_allow_proto3_optional
--plugin=protoc-gen-grpc="${_GRPC_CPP_PLUGIN_EXECUTABLE}"
"${nvmeof_gateway_proto}"
DEPENDS "${nvmeof_gateway_proto}")


# Monitor Proto file
get_filename_component(nvmeof_monitor_proto "nvmeof/gateway/control/proto/monitor.proto" ABSOLUTE)
get_filename_component(nvmeof_monitor_proto_path "${nvmeof_monitor_proto}" PATH)

# Generated sources
set(nvmeof_monitor_proto_srcs "${CMAKE_CURRENT_BINARY_DIR}/monitor.pb.cc")
set(nvmeof_monitor_proto_hdrs "${CMAKE_CURRENT_BINARY_DIR}/monitor.pb.h")
set(nvmeof_monitor_grpc_srcs "${CMAKE_CURRENT_BINARY_DIR}/monitor.grpc.pb.cc")
set(nvmeof_monitor_grpc_hdrs "${CMAKE_CURRENT_BINARY_DIR}/monitor.grpc.pb.h")

add_custom_command(
OUTPUT "${nvmeof_monitor_proto_srcs}" "${nvmeof_monitor_proto_hdrs}" "${nvmeof_monitor_grpc_srcs}" "${nvmeof_monitor_grpc_hdrs}"
COMMAND ${_PROTOBUF_PROTOC}
ARGS --grpc_out "${CMAKE_CURRENT_BINARY_DIR}"
--cpp_out "${CMAKE_CURRENT_BINARY_DIR}"
-I "${nvmeof_monitor_proto_path}"
--experimental_allow_proto3_optional
--plugin=protoc-gen-grpc="${_GRPC_CPP_PLUGIN_EXECUTABLE}"
"${nvmeof_monitor_proto}"
DEPENDS "${nvmeof_monitor_proto}")

# Include generated *.pb.h files
include_directories("${CMAKE_CURRENT_BINARY_DIR}")

set(ceph_nvmeof_monitor_client_srcs
${nvmeof_gateway_proto_srcs}
${nvmeof_gateway_proto_hdrs}
${nvmeof_gateway_grpc_srcs}
${nvmeof_gateway_grpc_hdrs}
${nvmeof_monitor_proto_srcs}
${nvmeof_monitor_proto_hdrs}
${nvmeof_monitor_grpc_srcs}
${nvmeof_monitor_grpc_hdrs}
ceph_nvmeof_monitor_client.cc
nvmeof/NVMeofGwClient.cc
nvmeof/NVMeofGwMonitorGroupClient.cc
nvmeof/NVMeofGwMonitorClient.cc)
add_executable(ceph-nvmeof-monitor-client ${ceph_nvmeof_monitor_client_srcs})
add_dependencies(ceph-nvmeof-monitor-client ceph-common)
target_link_libraries(ceph-nvmeof-monitor-client
client
mon
global-static
ceph-common
${_REFLECTION}
${_GRPC_GRPCPP}
)
install(TARGETS ceph-nvmeof-monitor-client DESTINATION bin)
endif()
# END OF NVMEOF GATEWAY MONITOR CLIENT

if(WITH_DOKAN)
add_subdirectory(dokan)
endif(WITH_DOKAN)
79 changes: 79 additions & 0 deletions src/ceph_nvmeof_monitor_client.cc
@@ -0,0 +1,79 @@
// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-
// vim: ts=8 sw=2 smarttab
/*
* Ceph - scalable distributed file system
*
* Copyright (C) 2023 IBM Inc
*
* Author: Alexander Indenbaum <[email protected]>
*
* This is free software; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public
* License version 2.1, as published by the Free Software
* Foundation. See file COPYING.
*
*/

#include <pthread.h>

#include "include/types.h"
#include "include/compat.h"
#include "common/config.h"
#include "common/ceph_argparse.h"
#include "common/errno.h"
#include "common/pick_address.h"
#include "global/global_init.h"

#include "nvmeof/NVMeofGwMonitorClient.h"

static void usage()
{
std::cout << "usage: ceph-nvmeof-monitor-client\n"
" --gateway-name <GW_NAME>\n"
" --gateway-address <GW_ADDRESS>\n"
" --gateway-pool <CEPH_POOL>\n"
" --gateway-group <GW_GROUP>\n"
" --monitor-group-address <MONITOR_GROUP_ADDRESS>\n"
" [flags]\n"
<< std::endl;
generic_server_usage();
}

/**
* A short main() which just instantiates a Nvme and
* hands over control to that.
*/
int main(int argc, const char **argv)
{
ceph_pthread_setname(pthread_self(), "ceph-nvmeof-monitor-client");

auto args = argv_to_vec(argc, argv);
if (args.empty()) {
std::cerr << argv[0] << ": -h or --help for usage" << std::endl;
exit(1);
}
if (ceph_argparse_need_usage(args)) {
usage();
exit(0);
}

auto cct = global_init(nullptr, args, CEPH_ENTITY_TYPE_CLIENT,
CODE_ENVIRONMENT_UTILITY, // maybe later use CODE_ENVIRONMENT_DAEMON,
CINIT_FLAG_NO_DEFAULT_CONFIG_FILE);

pick_addresses(g_ceph_context, CEPH_PICK_ADDRESS_PUBLIC);

global_init_daemonize(g_ceph_context);
global_init_chdir(g_ceph_context);
common_init_finish(g_ceph_context);

NVMeofGwMonitorClient gw_monitor_client(argc, argv);
int rc = gw_monitor_client.init();
if (rc != 0) {
std::cerr << "Error in initialization: " << cpp_strerror(rc) << std::endl;
return rc;
}

return gw_monitor_client.main(args);
}

7 changes: 7 additions & 0 deletions src/common/options/global.yaml.in
@@ -1755,6 +1755,13 @@ options:
default: 500
services:
- mon
- name: mon_max_nvmeof_epochs
type: int
level: advanced
desc: max number of nvmeof gateway maps to store
default: 500
services:
- mon
- name: mon_max_osd
type: int
level: advanced
34 changes: 34 additions & 0 deletions src/common/options/mon.yaml.in
@@ -72,6 +72,25 @@ options:
default: 30
services:
- mon
- name: mon_nvmeofgw_beacon_grace
type: secs
level: advanced
desc: Period in seconds from last beacon to monitor marking a NVMeoF gateway as
failed
default: 10
services:
- mon
- name: mon_nvmeofgw_set_group_id_retry
type: uint
level: advanced
desc: Retry wait time in microsecond for set group id between the monitor client
and gateway
long_desc: The monitor server determines the gateway's group ID. If the monitor client
receives a monitor group ID assignment before the gateway is fully up during
initialization, a retry is required.
default: 1000
services:
- mon
- name: mon_mgr_inactive_grace
type: int
level: advanced
@@ -1341,3 +1360,18 @@ options:
with_legacy: true
see_also:
- osd_heartbeat_use_min_delay_socket
- name: nvmeof_mon_client_disconnect_panic
type: secs
level: advanced
desc: The duration, expressed in seconds, after which the nvmeof gateway
should trigger a panic if it loses connection to the monitor
default: 100
services:
- mon
- name: nvmeof_mon_client_tick_period
type: secs
level: advanced
desc: Period in seconds of nvmeof gateway beacon messages to monitor
default: 2
services:
- mon
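
To make the relationship between the timing options above concrete, here is a small C++ sketch of the arithmetic. It only restates the defaults from this diff (nvmeof_mon_client_tick_period = 2, mon_nvmeofgw_beacon_grace = 10, nvmeof_mon_client_disconnect_panic = 100); the helper name beacons_within is hypothetical and does not appear in the commit, and the sketch assumes beacons are sent exactly once per tick period.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: how many whole beacon periods fit into a time window.
// Assumes tick is greater than zero.
constexpr std::chrono::seconds::rep beacons_within(std::chrono::seconds window,
                                                   std::chrono::seconds tick) {
  return window.count() / tick.count();
}

// Defaults copied from the option definitions in this diff.
constexpr std::chrono::seconds kTickPeriod{2};        // nvmeof_mon_client_tick_period
constexpr std::chrono::seconds kBeaconGrace{10};      // mon_nvmeofgw_beacon_grace
constexpr std::chrono::seconds kDisconnectPanic{100}; // nvmeof_mon_client_disconnect_panic
```

Under these assumptions, beacons_within(kBeaconGrace, kTickPeriod) is 5, so a gateway would have to miss about five consecutive beacons before the monitor marks it as failed, and beacons_within(kDisconnectPanic, kTickPeriod) is 50 tick periods before the gateway side gives up on the monitor connection.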