Replace server with Go version. Change build process.
mlinhard committed Oct 2, 2019
1 parent eef5e15 commit 07cd3cf
Showing 101 changed files with 3,355 additions and 5,567 deletions.
192 changes: 19 additions & 173 deletions README.md
@@ -1,6 +1,6 @@
# Intro

**exactly** is an exact substring search tool. It is ablet to find positions of arbitrary substrings, not just the whole words or similar words
**exactly** is an exact substring search tool. It is able to find positions of arbitrary substrings, not just the whole words or similar words
as it is the case with the full-text search tools such as [Apache Lucene](http://lucene.apache.org/). **exactly** builds an index on
a set of files (binary or text, doesn't matter) and once the index is computed, it can be queried for occurrence of a pattern, i.e.

@@ -10,212 +10,58 @@ Let *T* be concatenation of the documents, *T = D1 + sep + D2 + sep + ... + Dn*

a query for pattern *P* will find all of the tuples *(p, j)* where it holds that *P* is substring of *Dj* starting at position *p*.

The query can be answered pretty fast, in *O(P+Q)* time (*Q* is the number of the returned *(p,j)* tuples)
The query can be answered pretty fast, in *O(len(P)+Q)* time (len(P) is length of *P* and *Q* is the number of the returned *(p,j)* tuples)
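
To make the query semantics concrete, here is a naive reference sketch of what a query returns. It is not the enhanced suffix array that **exactly** uses and it does not reach the *O(len(P)+Q)* bound; it simply scans every document. The function name `find_all` is hypothetical, and document indices are 0-based here, whereas the text above numbers documents from *D1*.

```python
def find_all(documents, pattern):
    """Return every (p, j) such that pattern occurs in documents[j] at offset p."""
    hits = []
    for j, doc in enumerate(documents):
        p = doc.find(pattern)
        while p != -1:
            hits.append((p, j))
            # Advance by one byte so that overlapping occurrences are reported too.
            p = doc.find(pattern, p + 1)
    return hits


docs = [b"abracadabra", b"cadabra"]
print(find_all(docs, b"abra"))
```

A brute-force scan like this is mostly useful as a correctness oracle when testing an index-based implementation against small inputs.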

Since **exactly** can find position of any pattern anywhere in the set of documents, this comes at the cost of memory.

**exactly** uses Enhanced suffix arrays as described in *Replacing suffix trees with enhanced suffix arrays* by *Abouelhoda, Kurtz, Ohlebusch* [article](https://www.sciencedirect.com/science/article/pii/S1570866703000650)
So far this is a basic straightforward implementation without any optimizations. The complete data structure takes *~ 25N* bytes of memory (*N* is length of total text *T*)
in the peak and *21N* afterwards. Currently only total text up to 2 GB is supported due to java array indexing limitation.
in the peak and *21N* afterwards. Currently only total text length up to 2 GB is supported.

To compute suffix array we use [SA-IS algorithm implementation by Yuta Mori](https://sites.google.com/site/yuta256/sais).

# Demo
## Installation

[![asciicast](https://asciinema.org/a/Pj6xP9ZP0DRz7OFtBJoLaiYXj.png)](https://asciinema.org/a/Pj6xP9ZP0DRz7OFtBJoLaiYXj)
Currently, the installer is only available for Fedora 64-bit system.

# Building and installation
## Requirements

So far I've only created installer for Fedora 64-bit. I thought that would automatically give us CentOS/RHEL 7 but there's no python3
on those so it's Fedora only. I haven't figured out a way to build and securely distribute RPMs yet,
so it needs to be built from sources, which requires git and docker. Everything else will be installed only inside of the docker container.

## Build

```
git clone https://github.com/mlinhard/exactly
pushd exactly/installer/rpm-builder
./build.sh
popd
```

After this you should find your installer in exactly/installer/rpm-builder/rpm/x86_64

## Install

Just install the RPM with yum/dnf tool. After this you should be able to use the **exactly** tool from command-line. The main dependencies are

- Java 1.8 JRE - the REST server is currently written as Spring boot REST service
- Python - the exactly command-line console tool is written in Python

## Non-RPM installation

I haven't yet produced a convenient installer for other linux distros, but it shouldn't be that hard to make **exactly** running without
the installer.

### Server
The server is a standard mavenized Java 1.8 project. It needs to be built by standard `mvn clean install` command (in server folder).
This will produce `server/target/exactly-server-<version>.jar` file. This is runnable by

`java -jar exactly-server-<version>.jar --dir=<root>`

where `<root>` is the folder to be indexed

### Client

You need to be root to perform some of this. Create file `/opt/exactly/lib/python/VERSION` with version string
(it should reflect the current version, e.g. same as in the `exactly-server-<version>.jar` filename). Then the
python client should be installable with `python3 setup.py install` inside of the client folder. After this you could place
client/bin/exactly into your /usr/bin/exactly and you should be fine.


# Usage

## Command line
## Usage

`exactly index <root>`

Will start the REST server at http://localhost:9201 and index for all of the files (recursively) under given `<root>` directory.
Start the indexing server on given root folder.

`exactly search`

Will start exactly console client where you'll be able to enter search queries.

## Build

## API

**WARNING: The API is not yet fixed and is subject to change even with bugfix releases.**

### Server statistics
*GET http://localhost:9201/version*

will return simple one-line version string (no JSON)

*GET http://localhost:9201/stats*

will return server stats, that look like this:

```json
{
"indexed_bytes": 110505,
"indexed_files": 39,
"done_crawling": true,
"done_loading": true,
"done_indexing": true
}
```

### Search query

*POST http://localhost:9201/search* (Content-Type: application/json)

With request data:
```json
{
"pattern": "cGF0dGVybg==",
"max_hits": 3,
"max_context": 20,
"offset": 0
}
```

Will output something like:
```json
{
"hits": [
{
"pos": 286,
"doc_id": "/home/mlinhard/dev/projects/exactly/workspace/exactly/server/src/main/java/sk/linhard/search/Search.java",
"ctx_before": "eHQuCgkgKiAKCSAqIEBwYXJhbSA=",
"ctx_after": "CgkgKiBAcmV0dXJuCgkgKi8KCVM="
},
{
"pos": 521,
"doc_id": "/home/mlinhard/dev/projects/exactly/workspace/exactly/server/src/main/java/sk/linhard/search/HitContext.java",
"ctx_before": "IHN0cmluZyArIGxlbmd0aCBvZiA=",
"ctx_after": "CgkgKi8KCWludCBoaWdobGlnaHQ="
},
{
"pos": 189,
"doc_id": "/home/mlinhard/dev/projects/exactly/workspace/exactly/server/src/main/java/sk/linhard/search/HitContext.java",
"ctx_before": "IHN0cmluZyBiZWZvcmUgKwogKiA=",
"ctx_after": "ICsgYWZ0ZXIgd2l0aCBoaWdobGk="
}
]
}
```

#### Request params:

- **pattern** - Base64 encoded binary string to search for
- **max_hits** - Maximum number of hits to return. Since the pattern can be even a single letter, the search query result size can be potentially quite huge, thus we need to limit the number of hits.
- **max_context** - Max number of bytes before and after the pattern that will be included in each hit to give the context to the position of the found pattern.
- **offset** - Optional parameter, if there are more pattern hits than max_hits, return segment starting at offset in complete hit list
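
The request parameters above could be assembled on the client side roughly as follows. This is a minimal sketch: the helper name `build_search_request` is hypothetical, and the default values simply mirror the example request shown earlier. The only step that is easy to get wrong is that **pattern** is raw bytes encoded as Base64 text.

```python
import base64


def build_search_request(pattern, max_hits=3, max_context=20, offset=0):
    """Build the JSON body for POST /search from a raw byte pattern."""
    return {
        # The pattern is binary, so it travels as Base64 text inside the JSON.
        "pattern": base64.b64encode(pattern).decode("ascii"),
        "max_hits": max_hits,
        "max_context": max_context,
        "offset": offset,
    }


print(build_search_request(b"pattern"))
```

The resulting dictionary can be serialized with any JSON library and sent with `Content-Type: application/json`.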

#### Response format:

- **hits** - JSON array of hit JSON objects
- **cursor** - JSON Object representing the hit array cursor. If this object is not present this means
that the returned **hits** array is complete. If present this means that the array is only a portion of bigger array
that wasn't returned complete due to **max_hits** limitation

#### Hit format:

- **pos** - position of the hit in the document
- **doc_id** - string ID of the document, currently this is a file name
- **ctx_before** - Base64 encoded context *before* the pattern occurence
- **ctx_after** - Base64 encoded context *after* the pattern occurence

#### Cursor format:
The following snippets assume you checked out this git repository

- **complete_size** - size of the complete search result (number of hits)
- **offset** - offset of this result's segment in the complete array
### Installer

Usually what you want to do if you receive incomplete response (with cursor element present) is to POST /search
again with offset increased by max_hits.
See [Exactly installers](installer) section

### Document retrieval
### Server

*GET http://localhost:9201/document/{document_idx}*
See [Exactly indexing server](server) section

Will retrieve document by its index (order in which it was indexed)
### Client

*POST http://localhost:9201/document* (Content-Type: application/json)
Go into the `client` directory and run `setup.sh` script. This will locally build `exactly-index` golang binary and then include it in virtualenv folder `.venv`. You can then use exactly in a familiar fashion:

With request data:
```json
{
"document_id": "/home/mlinhard/Documents/textfile1.txt",
"document_index": 3
}
```
Can be used to retrieve the documents both by index and their string ID (usually path).

#### Response format:

Example
```json
{
"document_id": "/home/mlinhard/Documents/textfile1.txt",
"document_index": 3,
"content": "cGF0dGVybg=="
}
source .venv/bin/activate
exactly
```

- **document_id** - Document string ID, usually path
- **document_index** - Document index (order in which it was indexed)
- **content** - Base64 encoded binary document content

# TODO

- [ ] Improve console - display document paths and more context and more hits on demand
- [ ] Improve console - display server connection / indexing status in status-bar
- [ ] Add JVM stats (mainly memory usage)
- [ ] Enhanced suffix array memory optimization
- [ ] Hexadecimal console mode (allow search for binary strings)
- [ ] Add memory performance test (comparison with Lucene)
- [ ] Change the included Swing GUI code to REST client mode
- [ ] Allow UTF-8 String searches in GUI mode
- [ ] Allow UTF-8 String searches and UTF-8 context display
- [ ] Better test coverage

2 changes: 2 additions & 0 deletions client/.gitignore
@@ -3,3 +3,5 @@ build
dist/
exactly.egg-info
test.py
.venv
.vscode
44 changes: 24 additions & 20 deletions client/bin/exactly
@@ -13,36 +13,40 @@ Options:
--debug <debug-string> Start PyDev debug server. Debug string format: host:port:pydev_src
"""
from docopt import docopt
import pkg_resources
import subprocess
import sys
from exactly import console
from exactly import set_debug, set_debug_logging
import json
from exactly import console, set_debug, set_debug_logging
from tempfile import NamedTemporaryFile


def get_version():
    try:
        with open("/opt/exactly/lib/python/VERSION", "r") as f:
            return f.read()
    except:
        return "UNKNOWN"
    return pkg_resources.require("exactly")[0].version


def index(root_folder, debug):
    cmds = ['java', '-jar' ]
    if debug:
        cmds.append('-Dlogging.level.sk.linhard=DEBUG')
    cmds.append('/opt/exactly/lib/java/exactly-server.jar')
    cmds.append('--dir=' + root_folder)
    p = subprocess.Popen(cmds)
    try:
        return p.wait()
    except KeyboardInterrupt:
        pass


    with NamedTemporaryFile(mode="w", prefix="exactly-index-", suffix="-config.json") as tmpfile:
        config = {
            "listen_address": "localhost:9201",
            "num_file_loaders": 4,
            "num_file_staters": 4,
            "roots": [root_folder],
            "ignored_directories": []
        }
        json.dump(config, tmpfile)
        tmpfile.flush()
        p = subprocess.Popen(['exactly-index', '-config=' + tmpfile.name])
        try:
            return p.wait()
        except KeyboardInterrupt:
            pass


def search():
    return console.ExactlyConsole.main()


if __name__ == '__main__':
    args = docopt(__doc__, version=get_version())
    set_debug_logging(args['--debug-log'])
4 changes: 2 additions & 2 deletions client/exactly/exactly.py
@@ -132,7 +132,7 @@ def get(self, relpath):

    def post(self, relpath, json_req):
        try:
            return requests.post(self.uri + relpath, json=json.dumps(json_req))
            return requests.post(self.uri + relpath, json=json_req)
        except ConnectionError:
            raise Exception("Can't connect to index at " + self.uri + ". Make sure that the index is running")
        except Exception as e:
@@ -157,7 +157,7 @@ def search(self, query):
        if r.status_code == 200:
            return SearchResult.from_json(r.json())
        else:
            return None
            raise Exception(f"Unexpected code: {r.status_code}: {r.text}")

    def stats(self):
        r = self.get("/stats")
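
The change in `post` from `json=json.dumps(json_req)` to `json=json_req` matters because `requests` serializes the value passed as `json=` itself, so pre-serializing it encodes the payload twice. The sketch below reproduces that effect with the standard `json` module only; the sample payload is an illustrative assumption, not taken from the client code.

```python
import json

json_req = {"pattern": "cGF0dGVybg==", "max_hits": 3}

# Old behaviour: the dict is dumped to a string, and that string is dumped again.
double_encoded = json.dumps(json.dumps(json_req))
# New behaviour: the dict is dumped exactly once.
single_encoded = json.dumps(json_req)

# A server decoding the double-encoded body gets a JSON string, not an object.
print(type(json.loads(double_encoded)).__name__)
print(type(json.loads(single_encoded)).__name__)
```

A server that expects a JSON object in the request body would most likely reject the double-encoded form, which is consistent with the fix in this commit.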
2 changes: 2 additions & 0 deletions client/rpm-prep
@@ -0,0 +1,2 @@
%setup -n %{name}-%{unmangled_version} -n %{name}-%{unmangled_version}
%global debug_package %{nil}
18 changes: 4 additions & 14 deletions client/setup.py
@@ -1,26 +1,16 @@
from setuptools import setup


def get_version():
    try:
        with open("/opt/exactly/lib/python/VERSION", "r") as f:
            return f.read()
    except Exception as e:
        print("You need to supply VERSION file")
        raise e


unified_version = get_version()
print("Using version: " + unified_version)
from os import getenv

setup(name='exactly',
      version=unified_version,
      version=getenv('EXACTLY_VERSION'),
      description='Binary exact search',
      url='http://github.com/mlinhard/exactly',
      author='Michal Linhard',
      author_email='[email protected]',
      license='Apache 2.0',
      packages=['exactly'],
      scripts=['bin/exactly'],
      data_files=[('bin', ['bin/exactly-index'])],
      zip_safe=False,
      install_requires=[
          'docopt', 'requests'
29 changes: 29 additions & 0 deletions client/setup.sh
@@ -0,0 +1,29 @@
#!/bin/bash
if [ "$1" != "update" ]; then
    rm -rf .venv
    python3 -m virtualenv .venv
fi

source .venv/bin/activate

export EXACTLY_VERSION=`git describe --tags`

if [ ! -f bin/exactly-index ]; then
    if [ ! -f ../server/exactly-index ]; then
        pushd ../server
        go build -o exactly-index "-ldflags=-s -w -X main.Version=${EXACTLY_VERSION}"
        popd
    fi
    mv ../server/exactly-index bin
fi

python3 setup.py install

# for some reason the setup in virtualenv doesn't copy bin/exactly-index to .venv/bin
find .venv -name exactly-index -exec rm {} \;
mv bin/exactly-index .venv/bin

rm -rf build dist exactly.egg-info

deactivate
