Skip to content

zot/microfts

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

73 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Microfts

A small full text indexing and search tool focusing on speed and space. Initial tests seem to indicate that the database takes about twice as much space as the files it indexes.

Microfts implements a trigram GIN (generalized inverted index), relying on LMDB for storage, an open source, embedded, NOSQL, key-value store library (so it’s linked into microfts, not an external service). It uses AskAlexSharov’s fork of bmatsuo’s lmdb-go package to connect to it.

LICENSE

Microfts is MIT licensed, (c) 2020 Bill Burdick. All rights reserved.

Building

Note that building may generate warning messages from lmdb-go’s compilation of the LMDB C code.

go build -o microfts

Examples

Creating a database

./microfts create /tmp/bubba

Adding Text

This adds /tmp/tst to the database in /tmp/bubba

rm -rf /tmp/bubba
./microfts create /tmp/bubba
cat > /tmp/tst <<here
one
two three
four
four five
one two three
one three two
here
./microfts input -file /tmp/bubba /tmp/tst

Getting Info

./microfts info /tmp/bubba

Searching

./microfts search /tmp/bubba "one two"

Deleting a file’s information

./microfts delete /tmp/bubba /tmp/tst

Reclaiming space in the database (only really matters after deleting a large file)

./microfts compact /tmp/bubba

Finding grams for a string

./microfts grams "this is a test"
./microfts grams -gx "this is a test"

Finding candidates for grams

./microfts search -candidates -grams /tmp/bubba thi tes est

Usage

Exit Codes

  1. misc error
  2. file is missing
  3. file has changed
  4. file is unreadable
  5. no entry for file in database
  6. database missing

Help text

Usage:
   microfts info -groups DB
                   print information about each group in the database,
                   whether it is missing or changed
                   whether it is an org-mode entry
   microfts info [-chunks] DB GROUP
                   print info for a GROUP
                   -chunks also prints the chunks in GROUP if it has a corresponding file
   microfts info [-grams] DB
                   print info for database
                   displays any groups which do not exist as files
                   displays any groups which refer to files that have changed
                   -grams displays distribution information about the trigram index
   microfts create [-s GRAMSIZE] DB
                   create DATABASE if it does not exist
   microfts chunk [-nx | -data D | -dx] -d DELIM DB GROUP GRAMS
   microfts chunk [-nx | -data D | -dx] -gx DB GROUP GRAMS
                   ADD a chunk to GROUP with GRAMS.
                   -d means use DELIM to split GRAMS.
                   -gx means GRAMS is hex encoded with two bytes for each gram using base 37.
   microfts grams [-gx] CHUNK
                   output grams for CHUNK
   microfts input [-nx | -dx | -org] DB FILE...
                   For each FILE, create a group with its name and add a CHUNK for each chunk of input.
                   Chunk data is the line number, offset, and length for each chunk (starting at 1).
                   -org means chunks are org elements, otherwise chunks are lines
   microfts delete [-nx] DB GROUP
                   delete GROUP, its chunks, and tag entries.
                   NOTE: THIS DOES NOT RECLAIM SPACE! USE COMPACT FOR THAT
   microfts compact DB
                   Reclaim space for deleted groups
   microfts search [-n | -partial | -f | - limit N | -filter REGEXP | -u] DB TEXT
                   query with TEXT for objects
                   -f force search to skip changed and missing files instead of exiting
                   -filter makes search only return chunks that match the REGEXP
                   REGEXP syntax is here: https://golang.org/pkg/regexp/syntax/
   microfts search -candidates [-grams | -gx | -gd | -n | -f | -limit N | -dx | -u] DB TERM1 ...
                   dispay all candidates with the grams for TERMS without filtering
                   -grams indicates TERMS are grams, otherwise extract grams from TERMS
                   -gx: grams are in hex, -gd: grams are in decimal, otherwise they are 3-char strings
   microfts data [-nx | -dx] DB GROUP
                   get data for each doc in GROUP
   microfts update [-t] DB
                   reinput files that have changed
                   delete files that have been removed
                   -t means do a test run, printing what would have happened
   microfts empty DB GROUP...
                   Create empty GROUPs, ignoring existing ones

   microfts is targeted for groups of small documents, like lines in a file.

  -candidates
        return docs with grams for search
  -chunks
        info DB GROUP: display all of a group's chunks
  -comp string
        compression type to use when creating a database
  -d string
        delimiter for unicode tags (default ",")
  -data string
        data to define for object
  -dx
        use hex instead of unicode for object data
  -end-format string
        search: Go format string for the end of a group
        Arg to printf is the FILE
        The default value is ""
        if -sexp is provided and -end-format is not, the default is "\n"
        Not used with search -fuzzy -sort
  -f    search: skip changed and missing files instead of exiting
  -file
        search: display files rather than chunks
  -filter string
        search: filter results that match REGEXP
  -format string
        search: Go format string for each result
        Args to printf are FILE POSITION LINE OFFSET PERCENTAGE CHUNK
        FILE (string) is the name of the file
        POSITION (int) is the 1-based character position of the chunk in the file
        LINE (int) is the 1-based line of the chunk in the file
        OFFSET (int) is the 0-based offset of the first match in the chunk
        PERCENTAGE (float) is the percentage of a fuzzy match
        Note that you can place [ARGNUM] after the % to pick a particular arg to format
        The default format is %s:%[2]s:%[5]s\n
        -sexp sets format to (:filename "%s" :line %[3]d :offset %[4]d :text "%[6]s" :percent %[5]f)
          Note that this will cause all matches to be on one (potentially large) line of output (default "%[6]s:%[2]d:%[5]s\n")
  -fuzzy float
        search: specify a percentage fuzzy match
  -gd
        use decimal instead of unicode for grams
  -grams
        get: specify tags for intead of text
        info: print gram coverage
        search: specify grams instead of search terms
  -groups
        info: display information for each group
  -gx
        use hex instead of unicode for grams
  -limit int
        search: limit the number of results (default 9223372036854775807)
  -n    only print line numbers for search
  -org
        index org-mode chunks instead of lines
  -partial
        search: allow partial matches in search
  -prof
        profile cpu
  -s int
        gram size
  -sep
        print candidates on separate lines
  -sexp
        search: output matches as an s-expression ((FILE (POS LINE OFFSET chunk) ... ) ... )
        POS is the 1-based character position of the chunk in the file
        LINE is the 1-based line of the chunk in the file
        OFFSET is the 0-based offset of the first match in the chunk
  -sort
        search -fuzzy: sort all matches
        This ignores start-format and end-format because it sorts all matches, regardless of
        which file they come from.
  -start-format string
        search: Go format string for the start of a group
        Arg to printf is the FILE
        The default value is ""
        Not used with search -fuzzy -sort
  -t    update: do a test run, printing what would have happened
  -u    search: update the database before searching
  -v    verbose

Notes

Grams

Only alphanumeric characters are represented faithfully in grams, other characters are considered whitespace and display as ‘.’. This makes a base-37 triple (0-9 and A-Z), which just fits into 2 bytes. Which is a big deal, spacewise. Grams for starts of words begin with two whitespaces and ends of words end with one whitespace. There are no grams that end with two whitespaces.

Groups and chunks

The index consists of grams for chunks that belong to groups. Groups have names and the default is to use file names as group names.

Supported groups and chunks

Microfts supports using file names as groups and splitting files into chunks either by line or by org-mode element, with the chunk data being a triple of line, offset, chunk-length. Searching finds candidate chunks by intersecting gram entries and then consults the files named by the groups for the actual content.

Custom groups and chunks

If this is not sufficient, the command also supports custom usage: you can add chunks to a group, specifying data and grams. Searching can return candidate chunks for a set of grams.

Compressed representation for unsigned integers (lexicographically orderable)

7 bits0 - 1270xxxxxxx
12 bits128 - 40951000xxxx X
20 bits4096 - 10485751001xxxx X X
28 bits1048576 - 2684354551010xxxx X X X
36 bits268435456 - 687194767351011xxxx X X X X
44 bits68719476736 - 175921860444151100xxxx X X X X X
52 bits17592186044416 - 45035996273704951101xxxx X X X X X X
60 bits4503599627370496 - 11529215046068469751110xxxx X X X X X X X
64 bits1152921504606846976 - 184467440737095516151111—- X X X X X X X X

LMDB Trees

Grams: GRAM-> BLOCK

GRAM is a 2-byte value

OID LIST

OID LISTS

9 lists of oids: [9][]byte.

Note – this is probably too ornate and a simple byte array and a count might have the same performance and space.

# 1-byte OIDS
# 2-byte OIDS
# 3-byte OIDS
# 4-byte OIDS
# 5-byte OIDS
# 6-byte OIDS
# 7-byte OIDS
# 8-byte OIDS
# 9-byte OIDS
OIDS

Gram 0 holds the info since 0 is not a legal gram

next unused oid
next unused gid
free oids
free gids

Chunks: OID -> BLOCK

OIDS are compressed integers

GID
data (e.g. line number)
gram count

Groups: GID -> BLOCK

GIDS are compressed integers

NAME
oid count
last changed timestamp
validity (valid = 0, deleted = 1)
org flag (whether -org was used)

Group Names: NAME->GID