add full import to ha

TuGraph-family · Jan 31, 2024 · 6aa89f0 · 6aa89f0
1 parent 57cdb3c
commit 6aa89f0
Showing 14 changed files with 428 additions and 70 deletions.
diff --git a/docs/en-US/source/5.developer-manual/3.server-tools/1.data-import.md b/docs/en-US/source/5.developer-manual/3.server-tools/1.data-import.md
@@ -380,21 +380,27 @@ If the user and password are valid, and the specified graph exists, the import t
 
 ## 6. Online full import
 
-Online full import can be used to import a batch of files into an already running TuGraph instance. The execution method is to send an import request to the running TuGraph instance. After receiving the request, the instance first uses offline import (V3) to import the data into a temporary db, and then creates a new subgraph in the instance and stores the data in the temporary db. The files are migrated to the new subgraph, and finally the instance's metadata is refreshed. Compared with online incremental import, online full import has higher performance and is suitable for processing large-scale data. The `lgraph_import --online true --full true` option enables the import tool to fully import online. 
-
-Note: Online full import is not supported in HA (High Availability) mode for now.
-
+Online full import can be used to import a batch of files into a running TuGraph instance. TuGraph supports importing two different types of data into the instance online:
+1. Original data files of the same type as the offline import (csv, etc.)
+2. TuGraph's underlying storage file, i.e. the data.mdb file. This file can be generated by offline import, or it can be taken from another TuGraph db.
+The two methods suit different scenarios. The first imports the data directly into TuGraph: it runs automatically in a single step and is simple to operate, but it starts an offline import thread on the server side, so it is only suitable for small-scale imports on a stand-alone instance. The second imports a prepared underlying storage file into TuGraph: the mdb file must be prepared in advance, but the import places no heavy demand on system resources and also supports downloading the file from a remote location before importing it, which makes it suitable for online import in high-availability mode or of large-scale data.
+
+### 6.1 Import from original data
+Importing from the original data works by sending an import request to the running TuGraph instance. After receiving the request, the instance first uses offline import (V3) to import the data into a temporary db, then creates a new subgraph in the instance and migrates the data files of the temporary db to the new subgraph, and finally refreshes the instance's metadata. Compared with online incremental import, online full import has higher performance. The `lgraph_import --online true --online_type 1` option enables the import tool to perform a full online import. Like `offline mode`, online mode has its own set of command line options, which can be printed using the `-h, --help` option:
+Online full import can be used to import a batch of files into an already running TuGraph instance. The execution method is to send an import request to the running TuGraph instance. After receiving the request, the instance first uses offline import (V3) to import the data into a temporary db, and then creates a new subgraph in the instance and stores the data in the temporary db. The files are migrated to the new subgraph, and finally the instance's metadata is refreshed. Compared with online incremental import, online full import has higher performance and is suitable for processing large-scale data. The `lgraph_import --online true --full true` option enables the import tool to fully import online.
 Like `offline mode`, online mode has its own set of command line options, which can be printed using the `-h, --help` option:
 
 ```shell
-$ lgraph_import --online true --full true -h
+$ lgraph_import --online true --online_type 1 -h
 Available command line options:
-    --online            Whether to import online. Default=0.
+    --online_type       The type of import online, 0 for increment, 1 for full
+                        import data,2 for full import file. Default=0.
     --v3                Whether to use lgraph import V3. Default=1.
     -h, --help          Print this help message. Default=0.
 
 Available command line options:
-    --full              Whether to full import online. Default=0.
+    --online_type       The type of import online, 0 for increment, 1 for full
+                        import data,2 for full import file. Default=0.
     -h, --help          Print this help message. Default=0.
 
 Available command line options:
@@ -438,4 +444,31 @@ Available command line options:
 
 The relevant configuration of the file is specified in the configuration file, and its format is exactly the same as in `offline mode`. However, instead of importing the data into a local database, we now import the data into a running TuGraph instance, which is typically running on a different machine than the client machine running the import tool. Therefore, we need to specify the URL, DB user and password of the remote computer's HTTP address. Moreover, the configuration file (config_file parameter) is required to be the uri path on the TuGraph instance machine, and its file configuration is also required to be the absolute path of the resource on the TuGraph instance machine.
 
-If the user and password are valid, the import tool will perform an online full import on the server side. If the graph you want to import already exists, you can use the `--overwrite true` option to force overwriting of the subgraph.
+If the user and password are valid, the import tool will perform an online full import on the server side. If the graph you want to import already exists, you can use the `--overwrite true` option to force overwriting of the subgraph.
+
+### 6.2 Import from database file
+Although online full import from original data is simple to operate and performs well, it demands substantial server resources and takes a long time. A more general approach is to first use offline import to import the subgraph into an empty db, obtain the data.mdb file, and then import that file into the TuGraph service online. Usage is as follows:
+
+```shell
+$ ./lgraph_import --online true --online_type 2 -h
+Available command line options:
+    --online            Whether to import online. Default=0.
+    --v3                Whether to use lgraph import V3. Default=1.
+    -h, --help          Print this help message. Default=0.
+
+Available command line options:
+    --online_type       The type of import online, 0 for increment, 1 for full
+                        import data,2 for full import file. Default=0.
+    -h, --help          Print this help message. Default=0.
+
+Available command line options:
+    -r, --url           DB REST API address.
+    -u, --user          DB username.
+    -p, --password      DB password.
+    -g, --graph         The name of the graph to import into. Default=default.
+    --path              The path of data file.
+    --remote            Whether to download file from remote server. Default=0.
+    -h, --help          Print this help message. Default=0.
+```
+
+In addition to the url, user and password parameters used in ordinary online import, online full import from a database file uses the graph parameter to specify the name of the imported subgraph, the path parameter to specify the file path, and the remote parameter to specify whether the file is remote or local. If it is a local file, every node in the HA cluster must have the file at that path. If it is a remote file, it is downloaded first and then imported. Note that because there is only one copy of data.mdb, the environment of each HA node must be identical to that of the machine where data.mdb was generated offline, so that no environment-related problems occur.
diff --git a/docs/zh-CN/source/5.developer-manual/3.server-tools/1.data-import.md b/docs/zh-CN/source/5.developer-manual/3.server-tools/1.data-import.md
@@ -377,19 +377,25 @@ Available command line options:
 
 ## 6.在线全量导入
 
-在线全量导入可用于将一批文件导入已在运行中的 TuGraph 实例。其执行方式是向运行中的TuGraph实例发送导入请求，
+在线全量导入可用于将一批文件导入运行中的TuGraph实例，TuGraph支持将两种不同类型的数据在线导入到实例中：
+1. 和离线导入类型相同的原数据文件（csv等）
+2. TuGraph的底层存储文件，即data.mdb文件，该文件可以由离线导入生成，也可以是其他TuGraph db的文件。
+这两种方式的适用场景不同，第一种是直接将数据导入到TuGraph中，优点是一次性自动导入，操作步骤简单。但是会在server端启动一个离线导入线程，
+只适合单机情况下的小规模数据导入。第二种是将已经准备好的底层存储文件导入到TuGraph中，虽然需要提前准备好mdb文件，但是对系统资源却没有很高的要求，
+而且支持远程下载文件导入，非常方便，适用于高可用模式或者大规模数据的在线导入。
+
+### 6.1 从原数据导入
+从原数据导入的执行方式是向运行中的TuGraph实例发送导入请求，
 实例接到请求后先使用离线导入（V3）的方式将数据导入一个临时的db中，然后在实例中新建子图并将临时db的数据文件迁移到新子图中，最后刷新实例的元数据。
-相比在线增量导入，在线全量导入的性能更高，适合处理大规模数据。
-`lgraph_import --online true --full true`选项使导入工具能够在线全量导入。
-
-注：HA模式下，暂时不支持在线全量导入。
-
+相比在线增量导入，在线全量导入的性能更高。
+`lgraph_import --online true --online_type 1`选项使导入工具能够在线全量导入。
 与`离线模式`一样，在线模式有自己的命令行选项集，可以使用`-h，--help`选项进行打印输出：
 
 ```shell
-$ lgraph_import --online true --full true -h
+$ lgraph_import --online true --online_type 1 -h
 Available command line options:
-    --online            Whether to import online. Default=0.
+    --online_type       The type of import online, 0 for increment, 1 for full
+                        import data,2 for full import file. Default=0.
     --v3                Whether to use lgraph import V3. Default=1.
     -h, --help          Print this help message. Default=0.
 
@@ -440,4 +446,37 @@ Available command line options:
 该实例通常运行在与运行导入工具的客户端计算机不同的计算机上。因此，我们需要指定远程计算机的 HTTP 地址的URL、DB用户和密码。
 并且，配置文件（config_file参数）要求是 TuGraph 实例机器上的uri路径，其file配置也要求是 TuGraph 实例机器上资源的绝对路径。
 
-如果用户和密码有效，导入工具将在服务器端执行在线全量导入。如果想导入的图已存在，可以使用`--overwrite true` 选项强制覆盖子图。
+如果用户和密码有效，导入工具将在服务器端执行在线全量导入。如果想导入的图已存在，可以使用`--overwrite true` 选项强制覆盖子图。
+
+### 6.2 从数据库文件导入
+从原数据在线全量导入尽管操作简单、性能较高，但是对服务器资源要求较高，且耗时较长。
+一种更加通用的方式是先使用离线导入在一个空db中导入子图，得到data.mdb文件，然后把该文件在线导入到
+TuGraph服务中。其使用方式如下所示：
+
+```shell
+$ ./lgraph_import --online true --online_type 2 -h
+Available command line options:
+    --online            Whether to import online. Default=0.
+    --v3                Whether to use lgraph import V3. Default=1.
+    -h, --help          Print this help message. Default=0.
+
+Available command line options:
+    --online_type       The type of import online, 0 for increment, 1 for full
+                        import data,2 for full import file. Default=0.
+    -h, --help          Print this help message. Default=0.
+
+Available command line options:
+    -r, --url           DB REST API address.
+    -u, --user          DB username.
+    -p, --password      DB password.
+    -g, --graph         The name of the graph to import into. Default=default.
+    --path              The path of data file.
+    --remote            Whether to download file from remote server. Default=0.
+    -h, --help          Print this help message. Default=0.
+```
+
+除普通在线导入用到的url, user和password参数之外，从数据库文件导入的在线全量导入方式
+使用graph参数指定导入的子图名称，path参数指定文件路径，remote指定文件存在在远程或者本地。
+如果是本地文件，则需要保证HA集群中所有的节点在path路径下都有该文件。如果是远程文件，则会先下载再导入。
+需要注意的是，由于data.mdb只有一份，需要保证HA的各个节点和离线导入生成data.mdb的机器的环境完全一致，
+以保证不会出现环境问题。
diff --git a/src/cypher/procedure/procedure.cpp b/src/cypher/procedure/procedure.cpp
@@ -1970,6 +1970,55 @@ void BuiltinProcedure::DbImportorFullImportor(RTContext *ctx, const Record *reco
     }
 }
 
+void BuiltinProcedure::DbImportorFullFileImportor(RTContext *ctx, const Record *record,
+                                              const VEC_EXPR &args, const VEC_STR &yield_items,
+                                              std::vector<Record> *records) {
+    ctx->txn_.reset();
+    ctx->ac_db_.reset();
+    CYPHER_ARG_CHECK((args.size() == 2 || args.size() == 3),
+                     "need no more than three parameters and no less than two parameters. "
+                     "e.g. db.importor.fullFileImportor({\"ha\",\"data.mdb\",[false]})")
+    CYPHER_ARG_CHECK(args[0].type == parser::Expression::STRING, "graph_name type should be string")
+    CYPHER_ARG_CHECK(args[1].type == parser::Expression::STRING, "path type should be string")
+    bool remote = false;
+    if (args.size() == 3) {
+        CYPHER_ARG_CHECK(args[2].type == parser::Expression::BOOL, "remote type should be bool")
+        remote = args[2].Bool();
+    }
+    if (!ctx->galaxy_->IsAdmin(ctx->user_))
+        throw lgraph::AuthError("Admin access right required.");
+    std::string graph_name = args[0].String(), path = args[1].String();
+    if (remote) {
+        std::string outputFilename = ctx->galaxy_->GetConfig().dir +
+                                     "/.import.file.data.mdb";
+        std::string command = "wget -O " + outputFilename + " " + path;
+        int result = std::system(command.c_str());
+        if (result == 0) {
+            LOG_INFO() << "Downloaded file saved as: " << outputFilename;
+        } else {
+            LOG_ERROR() << "Failed to download file";
+            throw lgraph::CypherException("Failed to download file");
+        }
+        path = ctx->galaxy_->GetConfig().dir + "/.import.file.data.mdb";
+    }
+
+    auto& fs = fma_common::FileSystem::GetFileSystem(ctx->galaxy_->GetConfig().dir +
+                                                     "/.import_tmp");
+    lgraph::DBConfig dbConfig;
+    dbConfig.dir = ctx->galaxy_->GetConfig().dir;
+    dbConfig.name = graph_name;
+    dbConfig.create_if_not_exist = true;
+    ctx->galaxy_->CreateGraph(ctx->user_,
+                              graph_name, dbConfig,
+                              path);
+    fs.RemoveDir(ctx->galaxy_->GetConfig().dir + "/.import_tmp");
+    if (remote) {
+        auto& fs_download = fma_common::FileSystem::GetFileSystem(
+            ctx->galaxy_->GetConfig().dir + "/.import.file.data.mdb");
+        fs_download.Remove(ctx->galaxy_->GetConfig().dir + "/.import.file.data.mdb");
+    }
+}
+
 void BuiltinProcedure::DbImportorSchemaImportor(RTContext *ctx, const Record *record,
                                                 const VEC_EXPR &args, const VEC_STR &yield_items,
                                                 std::vector<Record> *records) {
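
Review note on the remote branch of `DbImportorFullFileImportor`: the `wget` command is assembled by plain string concatenation and passed to `std::system`, so a `path` containing spaces or shell metacharacters would be interpreted by the shell. The following is a minimal sketch, assuming a POSIX `sh`, of quoting both operands before building the same command. `ShellQuote` and `BuildDownloadCommand` are hypothetical helpers for illustration and are not part of TuGraph.

```cpp
#include <string>

// Hypothetical helper (not a TuGraph API): wrap a value in single quotes for a
// POSIX shell, rewriting each embedded ' as '\'' so it cannot end the quoted word.
std::string ShellQuote(const std::string& value) {
    std::string quoted = "'";
    for (char c : value) {
        if (c == '\'') {
            quoted += "'\\''";  // close quote, escaped quote, reopen quote
        } else {
            quoted += c;
        }
    }
    quoted += "'";
    return quoted;
}

// Builds the same `wget -O <output> <url>` command line as the diff above,
// but with both operands quoted. Assumption: the command is run by a POSIX shell.
std::string BuildDownloadCommand(const std::string& output_file, const std::string& url) {
    return "wget -O " + ShellQuote(output_file) + " " + ShellQuote(url);
}
```

Quoting only prevents the shell from splitting or interpreting the value. It does not validate the URL, and a value beginning with `-` could still be read by `wget` as an option, so a stricter check on `path` may be worth considering as well.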

diff --git a/src/cypher/procedure/procedure.h b/src/cypher/procedure/procedure.h
@@ -315,6 +315,10 @@ class BuiltinProcedure {
     static void DbImportorFullImportor(RTContext *ctx, const Record *record, const VEC_EXPR &args,
                                        const VEC_STR &yield_items, std::vector<Record> *records);
 
+    static void DbImportorFullFileImportor(RTContext *ctx, const Record *record,
+                                           const VEC_EXPR &args, const VEC_STR &yield_items,
+                                           std::vector<Record> *records);
+
     static void DbImportorSchemaImportor(RTContext *ctx, const Record *record, const VEC_EXPR &args,
                                          const VEC_STR &yield_items, std::vector<Record> *records);
 
@@ -926,6 +930,11 @@ static std::vector<Procedure> global_procedures = {
               Procedure::SIG_SPEC{{"conf", {0, lgraph_api::LGraphType::MAP}}},
               Procedure::SIG_SPEC{{"result", {0, lgraph_api::LGraphType::STRING}}}, false, true),
 
+    Procedure("db.importor.fullFileImportor", BuiltinProcedure::DbImportorFullFileImportor,
+              Procedure::SIG_SPEC{{"graph_name", {0, lgraph_api::LGraphType::STRING}},
+                                  {"path", {1, lgraph_api::LGraphType::STRING}}},
+              Procedure::SIG_SPEC{{"", {0, lgraph_api::LGraphType::NUL}}}, false, true),
+
     Procedure("db.importor.schemaImportor", BuiltinProcedure::DbImportorSchemaImportor,
               Procedure::SIG_SPEC{{"description", {0, lgraph_api::LGraphType::STRING}}},
               Procedure::SIG_SPEC{{"", {0, lgraph_api::LGraphType::NUL}}}, false, true),

diff --git a/src/db/graph_manager.cpp b/src/db/graph_manager.cpp
@@ -134,7 +134,8 @@ bool lgraph::GraphManager::CreateGraphWithData(KvTransaction& txn, const std::st
 
         std::unique_ptr<LightningGraph> graph(new LightningGraph(real_config));
         std::string new_file_path = GetGraphActualDir(real_config.dir, "data.mdb");
-        std::rename(data_file_path.c_str(), new_file_path.c_str());
+        std::filesystem::copy_file(data_file_path, new_file_path,
+                                   std::filesystem::copy_options::overwrite_existing);
         graph = std::make_unique<LightningGraph>(real_config);
         graph->FlushDbSecret(secret);
         graphs_.emplace_hint(it, name, GcDb(graph.release()));
@@ -143,7 +144,8 @@ bool lgraph::GraphManager::CreateGraphWithData(KvTransaction& txn, const std::st
         real_config.dir = GetGraphActualDir(parent_dir_, origin_graph->GetSecret());
         std::string new_file_path = GetGraphActualDir(GetGraphActualDir(
                                     parent_dir_, origin_graph->GetSecret()), "data.mdb");
-        std::rename(data_file_path.c_str(), new_file_path.c_str());
+        std::filesystem::copy_file(data_file_path, new_file_path,
+                                   std::filesystem::copy_options::overwrite_existing);
         std::unique_ptr<LightningGraph> new_graph(new LightningGraph(real_config));
         new_graph->FlushDbSecret(origin_graph->GetSecret());
         graphs_[name] = GcDb(new_graph.release());

diff --git a/src/import/import_client.cpp b/src/import/import_client.cpp
@@ -173,6 +173,18 @@ void lgraph::import_v2::OnlineImportClient::DoFullImport() const {
     LOG_INFO() << "Full online import finished in " << t2 - t1 << " seconds.";
 }
 
+void lgraph::import_v2::OnlineImportClient::DoFullImportFile() const {
+    double t1 = fma_common::GetTime();
+    RestClient client(config_.url);
+    client.Login(config_.username, config_.password);
+    std::string cypher = FMA_FMT(R"(CALL db.importor.fullFileImportor("{}","{}",{}))",
+                                 config_.graph_name, config_.path,
+                                 config_.remote ? "true" : "false");
+    client.EvalCypher("", cypher);
+    double t2 = fma_common::GetTime();
+    LOG_INFO() << "Full online import file finished in " << t2 - t1 << " seconds.";
+}
+
 void lgraph::import_v2::OnlineImportClient::SignalHandler(int signum) {
     exit_flag_ = true;
     LOG_WARN() << "signal received, exiting......";
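
Review note on `DoFullImportFile`: the Cypher text is produced by substituting `graph_name` and `path` directly between double quotes, so a value containing `"` or `\` would break out of the string literal. A minimal sketch of building the same call with those two characters escaped is shown below. `EscapeCypherString` and `BuildFullFileImportCypher` are hypothetical helpers, and the sketch assumes that TuGraph's Cypher parser accepts backslash escapes inside double-quoted strings, which this diff does not confirm.

```cpp
#include <string>

// Hypothetical helper: prefix backslashes and double quotes with a backslash
// so the value stays inside a double-quoted Cypher string literal.
// Assumption: the parser honours \" and \\ escapes.
std::string EscapeCypherString(const std::string& value) {
    std::string out;
    out.reserve(value.size());
    for (char c : value) {
        if (c == '\\' || c == '"') out += '\\';
        out += c;
    }
    return out;
}

// Produces the same text as the FMA_FMT call in the diff above for values
// that contain no quotes or backslashes.
std::string BuildFullFileImportCypher(const std::string& graph_name,
                                      const std::string& path, bool remote) {
    return "CALL db.importor.fullFileImportor(\"" + EscapeCypherString(graph_name) +
           "\",\"" + EscapeCypherString(path) + "\"," + (remote ? "true" : "false") + ")";
}
```

For example, `BuildFullFileImportCypher("ha", "/data/data.mdb", false)` yields `CALL db.importor.fullFileImportor("ha","/data/data.mdb",false)`.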

diff --git a/src/import/import_client.h b/src/import/import_client.h
@@ -50,10 +50,13 @@ class OnlineImportClient {
         bool keep_vid_in_memory = true;
         bool enable_fulltext_index = false;
         std::string fulltext_index_analyzer = "StandardAnalyzer";
+        std::string path;
+        bool remote = false;
     };
     explicit OnlineImportClient(const Config& config);
     void DoImport();
     void DoFullImport() const;
+    void DoFullImportFile() const;
 
  protected:
     Config config_;