Recognize ENV variables in Docker image
deric committed May 4, 2018
1 parent 922a309 commit 83a6a64
Showing 2 changed files with 27 additions and 3 deletions.
5 changes: 3 additions & 2 deletions README.md
@@ -3,14 +3,15 @@
A tool for removing duplicated documents that are grouped by some unique field (e.g. `--field Uuid`). Removal process consists of two phases:

1. An aggregate query finds documents that have the same `field` value and at least 2 occurrences. One copy of each such document is left in ES; all others are deleted via the Bulk API (almost all, usually - there's always some catch). We wait for an index update after each `DELETE` operation. Processed documents are logged into `/tmp/es_dedupe.log`.
-2. Unfortunately aggregate queries are not necessarily exact. Based on `/tmp/es_dedupe.log` logfile we query for each `field` value and DELETE document copies on other shards. Depending on number of nodes and shards in cluster there might be still document that aggregate query didn't return. In order to disable 2nd step use `--no-chck` flag.
+2. Unfortunately aggregate queries are not necessarily exact. Based on `/tmp/es_dedupe.log` logfile we query for each `field` value and DELETE document copies on other shards. Depending on number of nodes and shards in cluster there might be still document that aggregate query didn't return. In order to disable 2nd step use `--no-check` flag.

## Docker

Running from Docker:
```
-docker run deric/es-dedupe -H localhost -P 9200 -i exact-index-name -f Uuid
+docker run -it -e ES=localhost -e INDEX=my-index -e FIELD=id deric/es-dedupe
```
+You can either override the Docker command or use ENV variables to pass arguments.

## Usage
```
25 changes: 24 additions & 1 deletion entrypoint.sh
@@ -1,2 +1,25 @@
#!/bin/bash
-python3 dedupe.py $@
+ARGS=""
+if [ ! -z "$ES" ]; then ARGS+=" --host $ES"; fi
+if [ ! -z "$PORT" ]; then ARGS+=" --port $PORT"; fi
+if [ ! -z "$BATCH" ]; then ARGS+=" -b $BATCH"; fi
+if [ ! -z "$FIELD" ]; then ARGS+=" -f $FIELD"; fi
+if [ ! -z "$FLUSH" ]; then ARGS+=" --flush $FLUSH"; fi
+if [ ! -z "$INDEX" ]; then ARGS+=" -i $INDEX"; fi
+if [ ! -z "$DOC_TYPE" ]; then ARGS+=" --type $DOC_TYPE"; fi
+if [ ! -z "$PREFIX" ]; then ARGS+=" --prefix $PREFIX"; fi
+if [ ! -z "$PREFIX_SEP" ]; then ARGS+=" -s $PREFIX_SEP"; fi
+if [ ! -z "$DUPES" ]; then ARGS+=" -m $DUPES"; fi
+if [ ! -z "$INC" ]; then ARGS+=" -I $INC"; fi
+if [ ! -z "$SLEEP" ]; then ARGS+=" --sleep $SLEEP"; fi
+if [ ! -z "$LOG_AGG" ]; then ARGS+=" --log_agg $LOG_AGG"; fi
+if [ ! -z "$LOG_DONE" ]; then ARGS+=" --log_done $LOG_DONE"; fi
+if [ "$VERBOSE" == true ]; then ARGS+=" --verbose"; fi
+if [ "$DEBUG" == true ]; then ARGS+=" --debug"; fi
+if [ "$NOOP" == true ]; then ARGS+=" --noop"; fi
+if [ "$ALL" == true ]; then ARGS+=" --all"; fi
+if [ "$NO_CHECK" == true ]; then ARGS+=" --no-check"; fi
+
+cmd="python3 dedupe.py $ARGS $@"
+echo "Running: ${cmd}"
+exec ${cmd}
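
Note on the entrypoint: it builds a single string and expands it unquoted in `exec ${cmd}`, so any value containing whitespace (an index name or a log path, for example) would be split into several arguments. A minimal sketch of one possible alternative follows, assuming bash is the interpreter as the shebang indicates. The function names `build_args` and `run_dedupe` are hypothetical and not part of this commit, and only a subset of the variables is shown.

```shell
#!/bin/bash
# Sketch only: collect options in a bash array instead of a flat string,
# so each value stays a single argument even if it contains spaces.

build_args() {
  ARGS=()
  if [ -n "${ES:-}" ]; then ARGS+=(--host "$ES"); fi
  if [ -n "${PORT:-}" ]; then ARGS+=(--port "$PORT"); fi
  if [ -n "${INDEX:-}" ]; then ARGS+=(-i "$INDEX"); fi
  if [ -n "${FIELD:-}" ]; then ARGS+=(-f "$FIELD"); fi
  if [ "${NO_CHECK:-}" = true ]; then ARGS+=(--no-check); fi
}

# Hypothetical wrapper: would be called as `run_dedupe "$@"` at the end
# of the script, replacing the cmd/echo/exec lines.
run_dedupe() {
  build_args
  echo "Running: python3 dedupe.py ${ARGS[*]} $*"
  exec python3 dedupe.py "${ARGS[@]}" "$@"
}
```

With this shape, `"${ARGS[@]}"` and `"$@"` each expand to one word per element, so arguments passed by overriding the Docker command are forwarded unchanged alongside those derived from ENV variables.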
