get template config (#2317)
* [WIP]get template config

* Update ex-ug-parameter.md
cooper-lzy authored Dec 7, 2023
1 parent dabe525 commit 23554fa
# Parameters in the configuration file

This topic describes how to automatically generate a template configuration file when users use NebulaGraph Exchange, and introduces the configuration file [`application.conf`](https://github.com/vesoft-inc/nebula-exchange/blob/master/nebula-exchange_spark_2.4/src/main/resources/application.conf).

## Generate template configuration file automatically

Run the following command, specifying the data source to be imported, to generate the template configuration file for that data source.

```bash
java -cp <exchange_jar_package> com.vesoft.exchange.common.GenerateConfigTemplate -s <source_type> -p <config_file_save_path>
```

Example:

```bash
java -cp nebula-exchange_spark_2.4-3.0-SNAPSHOT.jar com.vesoft.exchange.common.GenerateConfigTemplate -s csv -p /home/nebula/csv_application.conf
```

## Configuration instructions

Before configuring the `application.conf` file, it is recommended to copy `application.conf` and rename the copy according to the data source type. For example, change the file name to `csv_application.conf` if the data source is a CSV file.

The `application.conf` file contains the following content types:

- Spark configurations

- Hive configurations

- NebulaGraph configurations

- Vertex configurations

- Edge configurations

### Spark configurations

This topic lists only some Spark parameters. For more information, see [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html#application-properties).

|`spark.executor.memory`|string|`1G`|No|The amount of memory used by a Spark executor, which can be specified in units such as 512M or 1G.|
|`spark.cores.max`|int|`16`|No|The maximum number of CPU cores of applications requested across clusters (rather than from each node) when a driver runs in a coarse-grained sharing mode on a standalone cluster or a Mesos cluster. The default value is `spark.deploy.defaultCores` on a Spark standalone cluster manager or the value of the `infinite` parameter (all available cores) on Mesos.|
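As an illustration, the Spark parameters above map onto the `spark` block of `application.conf`. The sketch below assumes the usual Exchange template layout; the `app.name` value is a placeholder:

```conf
spark: {
  app: {
    name: "NebulaGraph Exchange"   # placeholder application name
  }
  executor: {
    memory: 1G                     # spark.executor.memory
  }
  cores: {
    max: 16                        # spark.cores.max
  }
}
```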

### Hive configurations (optional)

Users only need to configure parameters for connecting to Hive if Spark and Hive are deployed in different clusters. Otherwise, please ignore the following configurations.

|`hive.connectionUserName`|list\[string\]|-|Yes|The username for connections.|
|`hive.connectionPassword`|list\[string\]|-|Yes|The account password.|
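For illustration, a `hive` block might be sketched as follows. Only the username and password rows appear in the table excerpt above; the warehouse path, connection URL, and driver name are assumptions based on a typical Hive-over-MySQL metastore setup:

```conf
hive: {
  warehouse: "hdfs://NAMENODE_IP:9000/apps/svr/hive/warehouse/"                  # assumed
  connectionURL: "jdbc:mysql://your_ip:3306/hive_spark?characterEncoding=UTF-8"  # assumed
  connectionDriverName: "com.mysql.jdbc.Driver"                                  # assumed
  connectionUserName: user
  connectionPassword: password
}
```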

### NebulaGraph configurations

|Parameter|Type|Default value|Required|Description|
|:---|:---|:---|:---|:---|
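The full `nebula` configuration example is not reproduced in this excerpt; a minimal sketch of its commonly used fields might look like the following, where all addresses, credentials, and the space name are placeholders:

```conf
nebula: {
  address: {
    graph: ["127.0.0.1:9669"]   # Graph service address(es), placeholder
    meta: ["127.0.0.1:9559"]    # Meta service address(es), placeholder
  }
  user: root                    # placeholder account
  pswd: nebula                  # placeholder password
  space: basketballplayer       # placeholder graph space name
}
```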

### Vertex configurations

Vertex configurations differ by data source. There are general parameters shared by all data sources and specific parameters for each one; both need to be set when users configure vertices.

#### General parameters

|Parameter|Type|Default value|Required|Description|
|:---|:---|:---|:---|:---|
|`tags.batch`|int|`256`|Yes|The maximum number of vertices written into NebulaGraph in a single batch.|
|`tags.partition`|int|`32`|Yes|The number of partitions to be created when the data is written to {{nebula.name}}. If `tags.partition ≤ 1`, the number of partitions to be created in {{nebula.name}} is the same as that in the data source.|
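Putting the general parameters together, a single `tags` entry might be sketched as below. Only `batch` and `partition` appear in the table excerpt above; the remaining keys (`name`, `type`, `fields`, `nebula.fields`, `vertex.field`) follow the usual Exchange template and their values are placeholders:

```conf
tags: [
  {
    name: player                 # tag name in NebulaGraph, placeholder
    type: {
      source: csv                # data source type
      sink: client               # write via client, or sst to generate SST files
    }
    fields: [_c0, _c1]           # columns read from the data source, placeholders
    nebula.fields: [name, age]   # corresponding tag properties, placeholders
    vertex: {
      field: _c2                 # column used as the vertex ID, placeholder
    }
    batch: 256                   # tags.batch
    partition: 32                # tags.partition
  }
]
```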

#### Specific parameters of Parquet/JSON/ORC data sources

|Parameter|Type|Default value|Required|Description|
|:---|:---|:---|:---|:---|
|`tags.path`|string|-|Yes|The path of vertex data files in HDFS. Enclose the path in double quotes and start with `hdfs://`.|

#### Specific parameters of CSV data sources

|Parameter|Type|Default value|Required|Description|
|:---|:---|:---|:---|:---|
|`tags.path`|string|-|Yes|The path of vertex data files in HDFS. Enclose the path in double quotes and start with `hdfs://`.|
|`tags.separator`|string|`,`|Yes|The separator. The default value is a comma (,). For special control characters, use the ASCII octal or Unicode hexadecimal encoding: for `^A`, use `\001` or `\u0001`; for `^B`, use `\002` or `\u0002`; for `^C`, use `\003` or `\u0003`.|
|`tags.header`|bool|`true`|Yes|Whether the file has a header.|
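For example, the CSV-specific keys slot into a tag entry alongside the general parameters; the path below is a placeholder:

```conf
{
  # ... general tag parameters ...
  path: "hdfs://namenode:9000/path/vertex/player.csv"   # placeholder path
  separator: ","
  header: true
}
```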

#### Specific parameters of Hive data sources

|Parameter|Type|Default value|Required|Description|
|:---|:---|:---|:---|:---|
|`tags.exec`|string|-|Yes|The statement to query data sources. For example, `select name,age from mooc.users`.|

#### Specific parameters of MaxCompute data sources

|Parameter|Type|Default value|Required|Description|
|:---|:---|:---|:---|:---|
|`tags.partitionSpec`|string|-|No|Partition descriptions of MaxCompute tables.|
|`tags.sentence`|string|-|No|Statements to query data sources. The table name in the SQL statement is the same as the value of the table above.|

#### Specific parameters of Neo4j data sources

|Parameter|Type|Default value|Required|Description|
|:---|:---|:---|:---|:---|
|`tags.database`|string|-|Yes|The name of the database where source data is saved in Neo4j.|
|`tags.check_point_path`|string|`/tmp/test`|No|The directory set to import progress information, which is used for resuming transfers. If not set, the resuming transfer is disabled.|

#### Specific parameters of MySQL/PostgreSQL data sources

|Parameter|Type|Default value|Required|Description|
|:---|:---|:---|:---|:---|
|`tags.password`|string|-|Yes|The account password.|
|`tags.sentence`|string|-|Yes|Statements to query data sources. For example: `"select teamid, name from team order by teamid"`.|
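A sketch of the MySQL/PostgreSQL-specific part of a tag entry follows. Only `password` and `sentence` appear in the table excerpt above; the connection keys (`host`, `port`, `database`, `table`, `user`) are assumptions and all values are placeholders:

```conf
{
  # ... general tag parameters ...
  host: "192.168.0.1"    # assumed key, placeholder value
  port: 3306             # assumed key
  database: basketball   # assumed key
  table: team            # assumed key
  user: root             # assumed key
  password: nebula
  sentence: "select teamid, name from team order by teamid"
}
```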

#### Specific parameters of Oracle data sources

|Parameter|Type|Default value|Required|Description|
|:---|:---|:---|:---|:---|
|`tags.table`|string|-|Yes|The name of a table used as a data source.|
|`tags.sentence`|string|-|Yes|Statements to query data sources. For example: `"select playerid, name, age from player"`.|

#### Specific parameters of ClickHouse data sources

|Parameter|Type|Default value|Required|Description|
|:---|:---|:---|:---|:---|
|`tags.numPartition`|string|-|Yes|The number of ClickHouse partitions.|
|`tags.sentence`|string|-|Yes|Statements to query data sources.|

#### Specific parameters of HBase data sources

|Parameter|Type|Default value|Required|Description|
|:---|:---|:---|:---|:---|
|`tags.table`|string|-|Yes|The name of a table used as a data source.|
|`tags.columnFamily`|string|-|Yes|The column family to which a table belongs.|

#### Specific parameters of Pulsar data sources

|Parameter|Type|Default value|Required|Description|
|:---|:---|:---|:---|:---|
|`tags.options.<topic|topics|topicsPattern>`|string|-|Yes|Options offered by Pulsar, which can be configured by choosing one from `topic`, `topics`, and `topicsPattern`.|
|`tags.interval.seconds`|int|`10`|Yes|The interval for reading messages. Unit: seconds.|

#### Specific parameters of Kafka data sources

|Parameter|Type|Default value|Required|Description|
|:---|:---|:---|:---|:---|
|`tags.service`|string|-|Yes|The Kafka server address.|
|`tags.topic`|string|-|Yes|The Kafka topic to read messages from.|
|`tags.interval.seconds`|int|`10`|Yes|The interval for reading messages. Unit: seconds.|
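For a Kafka source, the three keys above might be combined as follows; the server address and topic name are placeholders:

```conf
{
  # ... general tag parameters ...
  service: "127.0.0.1:9092"   # Kafka server address, placeholder
  topic: "vertex-topic"       # placeholder topic name
  interval.seconds: 10
}
```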

#### Specific parameters for generating SST files

|Parameter|Type|Default value|Required|Description|
|:---|:---|:---|:---|:---|
|`tags.path`|string|-|Yes|The path of the source file specified to generate SST files.|
|`tags.repartitionWithNebula`|bool|`true`|No|Whether to repartition data based on the number of partitions of graph spaces in NebulaGraph when generating the SST file. Enabling this function can reduce the time required to DOWNLOAD and INGEST SST files.|


### Edge configurations

Edge configurations also differ by data source. There are general parameters shared by all data sources and specific parameters for each one; both need to be set when users configure edges.

For the data-source-specific parameters of edge configurations, refer to the specific parameters of the corresponding data sources above, noting that the parameter prefix is `edges` instead of `tags`.

#### General parameters

|Parameter|Type|Default value|Required|Description|
|:---|:---|:---|:---|:---|
|`edges.batch`|int|`256`|Yes|The maximum number of edges written into NebulaGraph in a single batch.|
|`edges.partition`|int|`32`|Yes|The number of partitions to be created when the data is written to {{nebula.name}}. If `edges.partition ≤ 1`, the number of partitions to be created in {{nebula.name}} is the same as that in the data source.|
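Analogously to the `tags` sketch, an `edges` entry might look like the following. Only `batch` and `partition` appear in the table excerpt above; the remaining keys (`name`, `type`, `fields`, `nebula.fields`, `source.field`, `target.field`) follow the usual Exchange template and their values are placeholders:

```conf
edges: [
  {
    name: follow                # edge type name in NebulaGraph, placeholder
    type: {
      source: csv
      sink: client
    }
    fields: [_c2]               # columns read from the data source, placeholders
    nebula.fields: [degree]     # corresponding edge properties, placeholders
    source: {
      field: _c0                # column used as the source vertex ID, placeholder
    }
    target: {
      field: _c1                # column used as the destination vertex ID, placeholder
    }
    batch: 256                  # edges.batch
    partition: 32               # edges.partition
  }
]
```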

#### Specific parameters for generating SST files

|Parameter|Type|Default value|Required|Description|
|:---|:---|:---|:---|:---|
|`edges.path`|string|-|Yes|The path of the source file specified to generate SST files.|
|`edges.repartitionWithNebula`|bool|`true`|No|Whether to repartition data based on the number of partitions of graph spaces in NebulaGraph when generating the SST file. Enabling this function can reduce the time required to DOWNLOAD and INGEST SST files.|

#### Specific parameters of NebulaGraph

|Parameter|Type|Default value|Required|Description|
|:---|:---|:---|:---|:---|
|`edges.path`|string|`"hdfs://namenode:9000/path/edge"`|Yes|Specifies the storage path of the CSV file. You need to set a new path and Exchange will automatically create the path you set. If you store the data to the HDFS server, the path format is the same as the default value, such as `"hdfs://192.168.8.177:9000/edge/follow"`. If you store the data to the local, the path format is `"file:///path/edge"`, such as `"file:///home/nebula/edge/follow"`. If there are multiple Edges, different directories must be set for each Edge.|
|`edges.noField`|bool|`false`|Yes|If the value is `true`, only source vertex IDs, destination vertex IDs, and ranks are exported, not the property data. If the value is `false`, source vertex IDs, destination vertex IDs, ranks, and the property data are exported.|
|`edges.return.fields`|list|`[]`|Yes|Specifies the properties to be exported. For example, to export `start_year` and `end_year`, you need to set the parameter value to `["start_year","end_year"]`. This parameter only takes effect when the value of `edges.noField` is `false`.|