GITBOOK-25: Surya's Jul 31 changes
suryap-zingg authored and gitbook-bot committed Jul 31, 2024
1 parent 2522a94 commit b071e1c
Showing 53 changed files with 307 additions and 311 deletions.
4 changes: 2 additions & 2 deletions docs/README.md
@@ -2,7 +2,7 @@
description: Hope you find us useful :-)
---

# Welcome to Zingg
# Welcome To Zingg

![](https://static.scarf.sh/a.png?x-pxid=d6dda06e-06c7-4e4a-99c9-ed9f6364dfeb)

@@ -24,4 +24,4 @@ Zingg is a quick and scalable way to build a single source of truth for core bus

## Book Office Hours

If you want to schedule a 30-min call with our team to help you get set up, please select some time directly [here](https://calendly.com/sonalgoyal/30min).
If you want to schedule a 30-min call with our team to help you get set up, please select a slot directly [here](https://calendly.com/sonalgoyal/30min).
34 changes: 17 additions & 17 deletions docs/SUMMARY.md
@@ -1,13 +1,13 @@
# Table of contents

* [Welcome to Zingg](README.md)
* [Welcome To Zingg](README.md)
* [Step-By-Step Guide](stepByStep.md)
* [Installation](setup/installation.md)
* [Docker](stepbystep/installation/docker/README.md)
* [Sharing custom data and config files](stepbystep/installation/docker/sharing-custom-data-and-config-files.md)
* [Shared locations](stepbystep/installation/docker/shared-locations.md)
* [File read/write permissions](stepbystep/installation/docker/file-read-write-permissions.md)
* [Copying Files To and From the Container](stepbystep/installation/docker/copying-files-to-and-from-the-container.md)
* [Sharing Custom Data And Config Files](stepbystep/installation/docker/sharing-custom-data-and-config-files.md)
* [Shared Locations](stepbystep/installation/docker/shared-locations.md)
* [File Read/Write Permissions](stepbystep/installation/docker/file-read-write-permissions.md)
* [Copying Files To And From The Container](stepbystep/installation/docker/copying-files-to-and-from-the-container.md)
* [Installing From Release](stepbystep/installation/installing-from-release/README.md)
* [Single Machine Setup](stepbystep/installation/installing-from-release/single-machine-setup.md)
* [Spark Cluster Checklist](stepbystep/installation/installing-from-release/spark-cluster-checklist.md)
@@ -19,7 +19,7 @@
* [Zingg Command Line](stepbystep/zingg-command-line.md)
* [Configuration](stepbystep/configuration/README.md)
* [Configuring Through Environment Variables](stepbystep/configuration/configuring-through-environment-variables.md)
* [Data Input and Output](stepbystep/configuration/data-input-and-output/README.md)
* [Data Input And Output](stepbystep/configuration/data-input-and-output/README.md)
* [Input Data](stepbystep/configuration/data-input-and-output/data.md)
* [Output](stepbystep/configuration/data-input-and-output/output.md)
* [Field Definitions](stepbystep/configuration/field-definitions.md)
@@ -31,13 +31,13 @@
* [Finding Records For Training Set Creation](setup/training/findTrainingData.md)
* [Labeling Records](setup/training/label.md)
* [Find And Label](setup/training/findAndLabel.md)
* [Using pre-existing training data](setup/training/addOwnTrainingData.md)
* [Using Pre-existing Training Data](setup/training/addOwnTrainingData.md)
* [Updating Labeled Pairs](updatingLabels.md)
* [Exporting Labeled Data](setup/training/exportLabeledData.md)
* [Add Incremental Data](runIncremental.md)
* [Building and saving the model](setup/train.md)
* [Finding the matches](setup/match.md)
* [Linking across datasets](setup/link.md)
* [Building And Saving The Model](setup/train.md)
* [Finding The Matches](setup/match.md)
* [Linking Across Datasets](setup/link.md)
* [Data Sources and Sinks](dataSourcesAndSinks/connectors.md)
* [Zingg Pipes](dataSourcesAndSinks/pipes.md)
* [Databricks](dataSourcesAndSinks/databricks.md)
@@ -54,19 +54,19 @@
* [Exasol](dataSourcesAndSinks/exasol.md)
* [Working With Python](working-with-python.md)
* [Python API](python/markdown/index.md)
* [Running Zingg on Cloud](running/running.md)
* [Running on AWS](running/aws.md)
* [Running on Azure](running/azure.md)
* [Running on Databricks](running/databricks.md)
* [Running Zingg On Cloud](running/running.md)
* [Running On AWS](running/aws.md)
* [Running On Azure](running/azure.md)
* [Running On Databricks](running/databricks.md)
* [Zingg Models](zModels.md)
* [Pre-trained models](pretrainedModels.md)
* [Pre-Trained Models](pretrainedModels.md)
* [Improving Accuracy](improving-accuracy/README.md)
* [Ignoring Commonly Occuring Words While Matching](accuracy/stopWordsRemoval.md)
* [Defining Domain Specific Blocking And Similarity Functions](accuracy/definingOwn.md)
* [Documenting The Model](generatingdocumentation.md)
* [Interpreting Output Scores](scoring.md)
* [Reporting bugs and contributing](contributing.md)
* [Setting Zingg Development Environment](settingUpZingg.md)
* [Reporting Bugs And Contributing](contributing.md)
* [Setting Up Zingg Development Environment](settingUpZingg.md)
* [Community](community.md)
* [Frequently Asked Questions](faq.md)
* [Reading Material](reading.md)
22 changes: 10 additions & 12 deletions docs/accuracy/definingOwn.md
@@ -3,27 +3,27 @@ nav_order: 6
description: To add blocking functions and how they work
---

# Defining Own Functions
# Defining Domain Specific Blocking And Similarity Functions

You can add your own [blocking functions](https://github.com/zinggAI/zingg/tree/main/common/core/src/main/java/zingg/common/core/hash) which will be evaluated by Zingg to build the [blocking tree.](../zModels.md)

The blocking tree works on the matched records provided by the user as part of the training. At every node, it selects the hash function and the field on which it should be applied so that there is the least elimination of the matching pairs. Say we have data like this:
The blocking tree works on the matched records provided by the user as part of the training. At every node, it selects the hash function and the field on which it should be applied so that there is the least elimination of the matching pairs. \
\
Say we have data like this:

| Pair 1 | firstname | lastname |
| :------: | :-------: | :------: |
| Record A | john | doe |
| Record B | johnh | d oe |

****
***

| Pair 2 | firstname | lastname |
| :-------: | :-------: | :------: |
| Record A | mary | ann |
| Record B | marry | |



Let us assume we have hash function first1char and we want to check if it is a good function to apply to firstname:
Let us assume we have hash function **first1char** and we want to check if it is a good function to apply to **firstname**:

| Pair | Record | Output |
| :--: | :------: | ------ |
@@ -34,9 +34,7 @@ Let us assume we have hash function first1char and we want to check if it is a g

There is no elimination in the pairs above, hence it is a good function.



Now let us try last1char on firstname
Now let us try **last1char** on **firstname:**

| Pair | Record | Output |
| :--: | :------: | ------ |
@@ -45,12 +43,12 @@ Now let us try last1char on firstname
| 2 | Record A | y |
| 2 | Record B | y |

Pair 1 is getting eliminated above, hence last1char is not a good function. 
Pair 1 is getting eliminated above, hence **last1char** is not a good function.

So, first1char(firstname) will be chosen. This brings near similar records together - in a way, clusters them to break the cartesian join.
So, **first1char**(**firstname**) will be chosen. This brings near similar records together - in a way, clusters them to break the cartesian join.

These business-specific blocking functions go into [Hash Functions](https://github.com/zinggAI/zingg/tree/main/common/core/src/main/java/zingg/common/core/hash) and must be added to [HashFunctionRegistry](../../common/core/src/main/java/zingg/common/core/hash/HashFunctionRegistry.java) and [hash functions config](../../common/core/src/main/resources/hashFunctions.json).

Also, for similarity, you can define your own measures. Each dataType has predefined features, for example, [String](../../common/core/src/main/java/zingg/common/core/feature/StringFeature.java) fuzzy type is configured for Affine and Jaro.
Also, for similarity, you can define your own measures. Each **dataType** has predefined features, for example, [String](../../common/core/src/main/java/zingg/common/core/feature/StringFeature.java) fuzzy type is configured for Affine and Jaro.

You can define your own [comparisons](https://github.com/zinggAI/zingg/tree/main/common/core/src/main/java/zingg/common/core/similarity/function) and use them.
11 changes: 4 additions & 7 deletions docs/accuracy/stopWordsRemoval.md
@@ -1,20 +1,18 @@
# Ignoring Commonly Occuring Words While Matching

Common words like Mr, Pvt, Av, St, Street etc do not add differential signal and confuse matching. These words are called stopwords and matching is more accurate when stopwrods are ignored.
Common words like Mr, Pvt, Av, St, Street etc. do not add differential signals and confuse matching. These words are called **stopwords** and matching is more accurate when stopwords are ignored.

In order to remove stopwords from a field, configure 
The stopwords can be recommended by Zingg by invoking:

The stopwords can be recommended by Zingg by invoking

`./scripts/zingg.sh --phase recommend --conf <conf.json> --columns <list of columns to generate stop word recommendations>`&#x20;
`./scripts/zingg.sh --phase recommend --conf <conf.json> --columns <list of columns to generate stop word recommendations>`

By default, Zingg extracts 10% of the high-frequency unique words from a dataset. If the user wants a different selection, they should set up the following property in the config file:

```
stopWordsCutoff: <a value between 0 and 1>
```

Once you have verified the above stop words, you can configure them in the JSON variable **stopWords** with the path to the CSV file containing them. Please ensure while editing the CSV or building it manually that it should contain one word per row.
Once you have verified the above stop words, you can configure them in the JSON variable **stopWords** with the path to the CSV file containing them. When editing the CSV or building it manually, please ensure that it contains _one word per row_.

```
"fieldDefinition":[
@@ -26,4 +24,3 @@ Once you have verified the above stop words, you can configure them in the JSON
"stopWords": "models/100/stopWords/fname.csv"
},
```
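For reference, a hypothetical field definition carrying a `stopWords` path might look like the sketch below — the field name, match type, and data type are placeholders rather than values taken from this commit, and the full attribute set should follow your existing configuration:

```json
"fieldDefinition":[
   {
      "fieldName" : "fname",
      "matchType" : "fuzzy",
      "fields" : "fname",
      "dataType" : "string",
      "stopWords" : "models/100/stopWords/fname.csv"
   }
],
```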

4 changes: 2 additions & 2 deletions docs/connectors/jdbc/mysql.md
@@ -1,6 +1,6 @@
# MySQL

## Reading from MySQL database:
## Reading From MySQL Database:

```json
"data" : [{
@@ -16,4 +16,4 @@
}],
```

Please replace \<db\_name> with the name of the database in addition to other props. For more details, refer to the [spark documentation](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html).
Please replace `<db_name>` with the _name_ of the database in addition to other props. For more details, refer to the [Spark documentation](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html).
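Purely as an illustration, a complete MySQL read pipe might look roughly like the sketch below — it assumes the standard Spark JDBC options (`url`, `dbtable`, `driver`, `user`, `password`) with placeholder values; verify the exact property names against the collapsed snippet above and the Spark JDBC documentation:

```json
"data" : [{
    "name" : "customers",
    "format" : "jdbc",
    "props" : {
        "url" : "jdbc:mysql://localhost:3306/<db_name>",
        "dbtable" : "customers",
        "driver" : "com.mysql.cj.jdbc.Driver",
        "user" : "dbuser",
        "password" : "dbpassword"
    }
}],
```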
4 changes: 2 additions & 2 deletions docs/connectors/jdbc/postgres.md
@@ -1,6 +1,6 @@
# Postgres

## JSON Settings for reading data from Postgres database:
## JSON Settings For Reading Data From Postgres Database:

```json
"data" : [{
@@ -16,4 +16,4 @@
}],
```

Please replace \<db\_name> with the name of the database in addition to other props. For more details, refer to the [spark documentation](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html).
Replace `<db_name>` with the _name_ of the database in addition to other props. For more details, refer to the [Spark documentation](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html).
10 changes: 5 additions & 5 deletions docs/dataSourcesAndSinks/bigquery.md
@@ -14,7 +14,7 @@ In addition, the following property needs to be set
spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
```

If Zingg is run from outside the Google cloud, it requires authentication, please set the following env variable to the location of the file containing the service account key. A service account key can be created and downloaded in JSON format from the [Google Cloud console](https://cloud.google.com/docs/authentication/getting-started).
If Zingg is run from outside the Google cloud, it requires authentication, please set the following _environment variable_ to the location of the file containing the _service account key_. A service account key can be created and downloaded in JSON format from the [Google Cloud console](https://cloud.google.com/docs/authentication/getting-started).

```bash
export GOOGLE_APPLICATION_CREDENTIALS=path to google service account key file
@@ -24,7 +24,7 @@ Connection properties for BigQuery as a data source and data sink are given belo

## Properties for reading data from BigQuery:

The property **"credentialsFile"** should point to the google service account key file location. This is the same path that is used to set variable **GOOGLE\_APPLICATION\_CREDENTIALS**. The **"table"** property should point to a BigQuery table that contains source data. The property **"viewsEnabled"** must be set to true only.
The property `credentialsFile` should point to the Google service account key file location. This is the same path that is used to set variable `GOOGLE_APPLICATION_CREDENTIALS`. The `table` property should point to a BigQuery table that contains source data. The property `viewsEnabled` must be set to **true** only.

```json
"data" : [{
@@ -38,9 +38,9 @@ The property **"credentialsFile"** should point to the google service account ke
}],
```
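Purely as a sketch, a read pipe using the three properties described above could be shaped like this — the format string, project, dataset, and key file path are assumptions, not values taken from the collapsed snippet:

```json
"data" : [{
    "name" : "source",
    "format" : "bigquery",
    "props" : {
        "credentialsFile" : "/path/to/service-account-key.json",
        "table" : "my-project.my_dataset.customers",
        "viewsEnabled" : "true"
    }
}],
```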

## Properties for writing data to BigQuery:
## Properties For Writing Data To BigQuery:

To write to BigQuery, a bucket needs to be created and assigned to the **"temporaryGcsBucket"** property.
To write to BigQuery, a bucket needs to be created and assigned to the `temporaryGcsBucket` property.

```json
"output" : [{
```
@@ -57,7 +57,7 @@ To write to BigQuery, a bucket needs to be created and assigned to the **"tempor
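A hypothetical output pipe combining the bucket and credentials properties could look like the following sketch — the bucket, table, format string, and mode values are illustrative assumptions only:

```json
"output" : [{
    "name" : "bq_output",
    "format" : "bigquery",
    "props" : {
        "credentialsFile" : "/path/to/service-account-key.json",
        "temporaryGcsBucket" : "my-temp-bucket",
        "table" : "my-project.my_dataset.unified_customers"
    },
    "mode" : "Overwrite"
}],
```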
## Notes:

* The library **"gcs-connector-hadoop2-latest.jar"** can be downloaded from [Google](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar) and the library **"spark-bigquery-with-dependencies\_2.12-0.24.2"** from the [maven repo](https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies\_2.12/0.24.2/spark-bigquery-with-dependencies\_2.12-0.24.2.jar).
* A typical service account key file looks like the below. The format of the file is JSON.
* A typical service account key file looks like below (JSON).

```json
{
```
7 changes: 4 additions & 3 deletions docs/dataSourcesAndSinks/connectors.md
@@ -6,9 +6,10 @@ has_children: true

# Data Sources and Sinks

Zingg connects, reads, and writes to most on-premise and cloud data sources.

Zingg can read and write to Databricks, Snowflake, Cassandra, S3, Azure, Elastic, Exasol, major RDBMS, and any Spark-supported data sources. Zingg also works with all major file formats like Parquet, Avro, JSON, XLSX, CSV, TSV, etc. This is done through the Zingg [pipe](pipes.md) abstraction.
Zingg _connects, reads,_ and _writes_ to most on-premise and cloud data sources.

Zingg can read and write to **Databricks, Snowflake, Cassandra, S3, Azure, Elastic, Exasol**, major **RDBMS**, and any **Spark**-supported data sources. \
\
Zingg also works with all major file formats like Parquet, Avro, JSON, XLSX, CSV, TSV, etc. This is done through the Zingg [Pipe](pipes.md) abstraction.

![](../../assets/zinggOSS.png)
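To make the Pipe idea concrete, a hypothetical CSV pipe of the kind described in the linked page might look like this — the location, delimiter, and header values are illustrative placeholders:

```json
"data" : [{
    "name" : "customers",
    "format" : "csv",
    "props" : {
        "location" : "examples/customers.csv",
        "delimiter" : ",",
        "header" : "true"
    }
}],
```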
4 changes: 2 additions & 2 deletions docs/dataSourcesAndSinks/databricks.md
@@ -1,5 +1,5 @@
# Databricks

As a Spark based application, Zingg Open Source works seamlessly on Databricks. Zingg leverages Databricks Spark environment, and can access all the supported data sources like parquet and the delta file format.
As a Spark-based application, Zingg Open Source works seamlessly on Databricks. Zingg leverages Databricks' Spark environment, and can access all the supported data sources like parquet and the delta file format.

Please check the various ways in which you can run Zingg On Databricks [here](../running/databricks.md)
Please check the various ways in which you can run Zingg on Databricks [here](../running/databricks.md)
38 changes: 19 additions & 19 deletions docs/dataSourcesAndSinks/exasol.md
@@ -26,7 +26,7 @@
spark.jars=spark-connector_2.12-1.3.0-spark-3.3.2-assembly.jar
```

If there are more than one jar files, please use comma as separator. Additionally, please change the version accordingly so that it matches your Zingg and Spark versions.
If there is more than one jar file, please use a _comma_ as the separator. Additionally, please change the version accordingly so that it matches your Zingg and Spark versions.

## Connector Settings

@@ -52,28 +52,28 @@ For example:
...
```

Similarly, for output:
Similarly, for output:

```json
...
```json
...
"output": [
{
"name": "output",
"format": "com.exasol.spark",
"props": {
"host": "10.11.0.2",
"port": "8563",
"username": "sys",
"password": "exasol",
"create_table": "true",
"table": "DB_SCHEMA.ENTITY_RESOLUTION",
},
"mode": "Append"
}
{
"name": "output",
"format": "com.exasol.spark",
"props": {
"host": "10.11.0.2",
"port": "8563",
"username": "sys",
"password": "exasol",
"create_table": "true",
"table": "DB_SCHEMA.ENTITY_RESOLUTION",
},
"mode": "Append"
}
],
...
```

Please note that, the `host` parameter should be the first internal node's IPv4 address.
Please note that the `host` parameter should be the first internal node's **IPv4 address**.

As Zingg uses [Exasol Spark connector](https://github.com/exasol/spark-connector) underneath, please also check out the [user guide](https://github.com/exasol/spark-connector/blob/main/doc/user_guide/user_guide.md) and [configuration options](https://github.com/exasol/spark-connector/blob/main/doc/user_guide/user_guide.md#configuration-options) for more information.
As Zingg uses [Exasol Spark connector](https://github.com/exasol/spark-connector) underneath, please also check out the [user guide](https://github.com/exasol/spark-connector/blob/main/doc/user\_guide/user\_guide.md) and [configuration options](https://github.com/exasol/spark-connector/blob/main/doc/user\_guide/user\_guide.md#configuration-options) for more information.
7 changes: 3 additions & 4 deletions docs/dataSourcesAndSinks/jdbc.md
@@ -1,12 +1,11 @@
# Jdbc
# JDBC

Zingg can connect to various databases such as Mysql, DB2, MariaDB, MS SQL, Oracle, PostgreSQL, etc. using JDBC. One just needs to download the appropriate driver and made it accessible to the application.
Zingg can connect to various databases such as MySQL, DB2, MariaDB, MS SQL, Oracle, PostgreSQL, etc. using JDBC. One just needs to download the appropriate driver and make it accessible to the application.

To include the JDBC driver for your particular database on the Spark classpath, please add the property **spark.jars** in [Zingg's runtime properties.](../stepbystep/zingg-runtime-properties.md)

```
spark.jars=<location of jdbc driver jar>
```

Connection details are given in the following sections for a few common JDBC sources.&#x20;

Connection details are given in the following sections for a few common JDBC sources.