Sifter scanning commands [WIP] #62

Merged
merged 14 commits on Jan 30, 2024
Expanding introduction text
kellrott committed Jan 10, 2024
commit 0b7a778e12f2c65945f676692efed60ea734a134
110 changes: 108 additions & 2 deletions website/content/docs.md
@@ -11,12 +11,12 @@ menu:

Sifter pipelines process streams of nested JSON messages. Sifter comes with a number of
file extractors that operate as inputs to these pipelines. The pipeline engine
connects together arrays of transform steps into a directed acyclic graph that is processed
in parallel.

Example Message:

```json
{
"firstName" : "bob",
"age" : "25"
@@ -37,3 +37,109 @@ be done in a transform pipeline these include:
- Table based field translation
- Outputting the message as a JSON Schema checked object


# Script structure

## Header
Each sifter file starts with a set of fields that let the software know this is a Sifter script, and not some random YAML file. There is also a `name` field for the script; this name is used for output file creation and logging. Finally, there is an `outdir` field that defines the directory where all output files will be placed. All paths are relative to the script file, so setting `outdir` to `my-results` will create the directory `my-results` in the same directory as the script file, regardless of where the `sifter` command is invoked.
```yaml
class : sifter
name: <name of script>
outdir: <where files should be stored>
```
# Config and templating
The `config` section is a set of defined keys that are used throughout the rest of the script.

Example config:
```yaml
config:
sqlite: ../../source/chembl/chembl_33/chembl_33_sqlite/chembl_33.db
uniprot2ensembl: ../../tables/uniprot2ensembl.tsv
schema: ../../schema/
```

Various fields in the script file are parsed using a [Mustache](https://mustache.github.io/) template engine. For example, to access a value from the config block, use the template `{{config.sqlite}}`.
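As a sketch of how templating threads config values into a pipeline, the fragment below uses the confirmed `{{config.*}}` and `{{row.*}}` template forms inside a `project` step (the field names `dbPath` and `termId` are illustrative, not part of Sifter itself):

```yaml
pipelines:
  transform:
    - from: oboData
    - project:
        mapping:
          # hypothetical output fields, shown only to illustrate templating
          dbPath: "{{config.sqlite}}"
          termId: "{{row.id[0]}}"
```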


# Inputs
The `inputs` block defines the various data extractors that will be used to open resources and create streams of JSON messages for processing. The possible input engines include:
- AVRO
- JSON
- XML
- SQL-dump
- SQLite
- TSV/CSV
- GLOB

For any other file types, there is also a plugin option to allow the user to call their own code for opening files.
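For instance, a plugin input launches a user-supplied command and reads its stdout as a stream of JSON messages. This sketch mirrors the plugin block from the example script at the end of this page; `obo_reader.py` is that script's helper, shown here only as an illustration:

```yaml
inputs:
  oboData:
    plugin:
      # any executable that writes JSON messages to stdout
      commandLine: ../../util/obo_reader.py {{config.oboFile}}
```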

# Pipeline
The `pipelines` section defines a set of named processing pipelines that can be used to transform data. Each pipeline starts with a `from` statement that defines where its data comes from, followed by a linear set of transforms that are chained together to do the processing. Pipelines may use `emit` steps to output messages to disk. The possible data transform steps include:
- Accumulate
- Clean
- Distinct
- DropNull
- Field Parse
- Field Process
- Field Type
- Filter
- FlatMap
- GraphBuild
- Hash
- JSON Parse
- Lookup
- Value Mapping
- Object Validation
- Project
- Reduce
- Regex
- Split
- UUID Generation

Additionally, users are able to define their own transform step types using the `plugin` step.

# Example script
```yaml
class: sifter
name: go
outdir: ../../output/go/
config:
oboFile: ../../source/go/go.obo
schema: ../../schema
inputs:
oboData:
plugin:
commandLine: ../../util/obo_reader.py {{config.oboFile}}
pipelines:
transform:
- from: oboData
- project:
mapping:
submitter_id: "{{row.id[0]}}"
case_id: "{{row.id[0]}}"
id: "{{row.id[0]}}"
go_id: "{{row.id[0]}}"
        project_id: "gene_ontology"
namespace: "{{row.namespace[0]}}"
name: "{{row.name[0]}}"
- map:
method: fix
gpython: |
def fix(row):
row['definition'] = row['def'][0].strip('"')
if 'xref' not in row:
row['xref'] = []
if 'synonym' not in row:
row['synonym'] = []
return row
- objectValidate:
title: GeneOntologyTerm
schema: "{{config.schema}}"
- emit:
name: term
```
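The gpython `fix` method in the script above is ordinary Python, so its behavior can be checked standalone on a hand-built message (the sample `msg` below is illustrative, not real OBO output):

```python
def fix(row):
    # Strip the surrounding quotes from the OBO "def" field and make sure
    # the optional list fields exist, mirroring the map step in the script.
    row['definition'] = row['def'][0].strip('"')
    if 'xref' not in row:
        row['xref'] = []
    if 'synonym' not in row:
        row['synonym'] = []
    return row

msg = {"id": ["GO:0008150"], "def": ['"Any biological process."']}
out = fix(msg)
print(out["definition"])  # Any biological process.
```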

Unchanged files with check annotations

`extractors/sqldump_step.go`, line 62 — GitHub Actions / lint: `tableColumns` declared but not used (typecheck)

```go
go func() {
	defer fhd.Close()
	defer close(out)
	tableColumns := map[string][]string{}
	data, _ := io.ReadAll(hd)
	tokens := sqlparser.NewStringTokenizer(string(data))
	for {
```

`task/task.go`, line 110 — GitHub Actions / lint: don't use underscores in Go names; var `new_name` should be `newName` (golint)

```go
func (m *Task) Emit(n string, e map[string]interface{}, useName bool) error {
	new_name := m.GetName() + "." + n
	if useName {
		temp := strings.Split(n, ".")
		new_name = temp[len(temp)-1]
```

`transform/object_validate.go`, line 13 — GitHub Actions / lint: a blank import should be only in a main or test package, or have a comment justifying it (golint)

```go
	"github.com/bmeg/sifter/evaluate"
	"github.com/bmeg/sifter/task"
	"github.com/santhosh-tekuri/jsonschema/v5"
	_ "github.com/santhosh-tekuri/jsonschema/v5/httploader"
)

type ObjectValidateStep struct {
```