Sifter scanning commands [WIP] #62

Merged
merged 14 commits on Jan 30, 2024
Expanding introduction text
kellrott committed Jan 10, 2024
commit 0b7a778e12f2c65945f676692efed60ea734a134
110 changes: 108 additions & 2 deletions website/content/docs.md
@@ -11,12 +11,12 @@ menu:

Sifter pipelines process streams of nested JSON messages. Sifter comes with a number of
file extractors that operate as inputs to these pipelines. The pipeline engine
connects together arrays of transform steps into a directed acyclic graph that is processed
in parallel.

Example Message:

```json
{
"firstName" : "bob",
"age" : "25"
@@ -37,3 +37,109 @@ be done in a transform pipeline these include:
- Table based field translation
- Outputting the message as a JSON Schema checked object


# Script structure

## Header
Each sifter file starts with a set of fields that let the software know this is a Sifter script, and not some random YAML file. There is also a `name` field for the script; this name is used for output file creation and logging. Finally, there is an `outdir` field that defines the directory where all output files will be placed. All paths are relative to the script file, so setting `outdir` to `my-results` will create the directory `my-results` in the same directory as the script file, regardless of where the `sifter` command is invoked.
```yaml
class : sifter
name: <name of script>
outdir: <where files should be stored>
```
# Config and templating
The `config` section is a set of defined keys that are used throughout the rest of the script.

Example config:
```yaml
config:
sqlite: ../../source/chembl/chembl_33/chembl_33_sqlite/chembl_33.db
uniprot2ensembl: ../../tables/uniprot2ensembl.tsv
schema: ../../schema/
```

Various fields in the script file are parsed using a [Mustache](https://mustache.github.io/) template engine. For example, to access a value from the config block, use the template `{{config.sqlite}}`.
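As a sketch of how templating threads config values into a pipeline, the fragment below uses the confirmed `{{config.*}}` and `{{row.*}}` template forms inside a `project` step (the field names `dbPath` and `termId` are illustrative, not part of Sifter itself):

```yaml
pipelines:
  transform:
    - from: oboData
    - project:
        mapping:
          # hypothetical output fields, shown only to illustrate templating
          dbPath: "{{config.sqlite}}"
          termId: "{{row.id[0]}}"
```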


# Inputs
The `inputs` block defines the various data extractors that will be used to open resources and create streams of JSON messages for processing. The possible input engines include:
- AVRO
- JSON
- XML
- SQL-dump
- SQLite
- TSV/CSV
- GLOB

For any other file types, there is also a plugin option to allow the user to call their own code for opening files.
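For instance, a plugin input launches a user-supplied command and reads its stdout as a stream of JSON messages. This sketch mirrors the plugin block from the example script at the end of this page; `obo_reader.py` is that script's helper, shown here only as an illustration:

```yaml
inputs:
  oboData:
    plugin:
      # any executable that writes JSON messages to stdout
      commandLine: ../../util/obo_reader.py {{config.oboFile}}
```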

# Pipeline
The `pipelines` section defines a set of named processing pipelines that can be used to transform data. Each pipeline starts with a `from` statement that defines where its data comes from, followed by a linear set of transforms that are chained together to do the processing. Pipelines may use `emit` steps to output messages to disk. The possible data transform steps include:
- Accumulate
- Clean
- Distinct
- DropNull
- Field Parse
- Field Process
- Field Type
- Filter
- FlatMap
- GraphBuild
- Hash
- JSON Parse
- Lookup
- Value Mapping
- Object Validation
- Project
- Reduce
- Regex
- Split
- UUID Generation

Additionally, users are able to define their own transform step types using the `plugin` step.

# Example script
```yaml
class: sifter
name: go
outdir: ../../output/go/
config:
oboFile: ../../source/go/go.obo
schema: ../../schema
inputs:
oboData:
plugin:
commandLine: ../../util/obo_reader.py {{config.oboFile}}
pipelines:
transform:
- from: oboData
- project:
mapping:
submitter_id: "{{row.id[0]}}"
case_id: "{{row.id[0]}}"
id: "{{row.id[0]}}"
go_id: "{{row.id[0]}}"
        project_id: "gene_ontology"
namespace: "{{row.namespace[0]}}"
name: "{{row.name[0]}}"
- map:
method: fix
gpython: |
def fix(row):
row['definition'] = row['def'][0].strip('"')
if 'xref' not in row:
row['xref'] = []
if 'synonym' not in row:
row['synonym'] = []
return row
- objectValidate:
title: GeneOntologyTerm
schema: "{{config.schema}}"
- emit:
name: term
```
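The gpython `fix` method in the script above is ordinary Python, so its behavior can be checked standalone on a hand-built message (the sample `msg` below is illustrative, not real OBO output):

```python
def fix(row):
    # Strip the surrounding quotes from the OBO "def" field and make sure
    # the optional list fields exist, mirroring the map step in the script.
    row['definition'] = row['def'][0].strip('"')
    if 'xref' not in row:
        row['xref'] = []
    if 'synonym' not in row:
        row['synonym'] = []
    return row

msg = {"id": ["GO:0008150"], "def": ['"Any biological process."']}
out = fix(msg)
print(out["definition"])  # Any biological process.
```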

Unchanged files with check annotations

`extractors/sqldump_step.go`, line 62 — GitHub Actions / lint: `tableColumns` declared but not used (typecheck)

```go
go func() {
	defer fhd.Close()
	defer close(out)
	tableColumns := map[string][]string{}
	data, _ := io.ReadAll(hd)
	tokens := sqlparser.NewStringTokenizer(string(data))
	for {
```

`task/task.go`, line 110 — GitHub Actions / lint: don't use underscores in Go names; var `new_name` should be `newName` (golint)

```go
func (m *Task) Emit(n string, e map[string]interface{}, useName bool) error {
	new_name := m.GetName() + "." + n
	if useName {
		temp := strings.Split(n, ".")
		new_name = temp[len(temp)-1]
```

`transform/object_validate.go`, line 13 — GitHub Actions / lint: a blank import should be only in a main or test package, or have a comment justifying it (golint)

```go
	"github.com/bmeg/sifter/evaluate"
	"github.com/bmeg/sifter/task"
	"github.com/santhosh-tekuri/jsonschema/v5"
	_ "github.com/santhosh-tekuri/jsonschema/v5/httploader"
)

type ObjectValidateStep struct {
```