feat: restrict default file pattern to plain text
brunoarine committed Jun 29, 2023
1 parent 9cfe8de commit 6a838d8
Showing 4 changed files with 487 additions and 44 deletions.
147 changes: 119 additions & 28 deletions README.md
@@ -85,33 +85,117 @@ findlike [OPTIONS] [REFERENCE_FILE]

## Options

Here's the breakdown of the available options in Findlike:

```
--version Show the version and exit.
-q, --query TEXT query option if no reference file is provided
-d, --directory PATH directory to scan for similar files [default:
(current directory)]
-f, --filename-pattern TEXT filename pattern matching [default: *.*]
-R, --recursive recursive search
-a, --algorithm [bm25|tfidf] text similarity algorithm [default: tfidf]
-l, --language TEXT stemmer and stopwords language [default:
english]
-c, --min-chars INTEGER minimum document size (in number of
characters) to be considered [default: 1]
-A, --absolute-paths show absolute rather than relative paths
-m, --max-results INTEGER maximum number of results [default: 10]
-p, --prefix TEXT result lines prefix
-s, --show-scores show similarity scores
-h, --hide-reference remove REFERENCE_FILE from results
-H, --heading TEXT results list heading
-F, --format [plain|json] output format [default: plain]
-t, --threshold FLOAT minimum score for a result to be shown
[default: 0.0]
--help Show this message and exit.
```

## Examples
Here's the breakdown of the available options in findlike:

#### `--help`

Displays a short summary of the available options.

#### `-d, --directory PATH`

Specifies the directory to scan. Default is the current working directory. Example:

```sh
findlike -d /path/to/another/directory
```

#### `-q, --query TEXT`

Passes an ad-hoc query to the program, so that no reference file is required. Useful when you want to quickly find documents by an overall theme. Example:

```sh
findlike -q "earthquakes"
```

#### `-f, --filename-pattern TEXT`

Specifies the file pattern to use when scanning the directories for similar files. The pattern follows the [glob](https://en.wikipedia.org/wiki/Glob_(programming)) convention and should be wrapped in single or double quotes; otherwise your shell will likely try to expand it. The default is a list of common plain-text file extensions (the full list can be seen [here](./findlike/constants.py)). Example:

```sh
findlike -f "*.md" reference_file.txt
```
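
For intuition about how a glob pattern such as `*.md` selects files, here is a small Python sketch using the standard-library `fnmatch` module. It is only an illustration: the file names are made up, and this is not necessarily how `findlike` matches files internally.

```python
from fnmatch import fnmatch

# Hypothetical file names, for illustration only.
filenames = ["notes.md", "todo.txt", "README.md", "photo.png"]

# Keep only the names that match the glob pattern "*.md".
markdown_files = [name for name in filenames if fnmatch(name, "*.md")]
print(markdown_files)
```

An unquoted `*.md` on the command line would be expanded by the shell into the matching file names before `findlike` ever sees it, which is why the quotes matter.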

#### `-R, --recursive`

If used, this option makes `findlike` scan the directory and all of its sub-directories. Example:

```sh
findlike reference_file.txt -R
```

#### `-l, --language TEXT`

Sets the language used for stop-word filtering and word stemming. The following languages are supported: Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish and Swedish. Default is English. Example:

```sh
findlike reference_file.txt -l "portuguese"
```

#### `-c, --min-chars INTEGER`

Minimum document size (in number of characters) to be included in the corpus. Default is 1. Example:

```sh
findlike reference_file.txt -c 50
```

#### `-A, --absolute-paths`

Show the absolute path of each result instead of relative paths. Example:

```sh
findlike reference_file.txt -A
```

#### `-m, --max-results INTEGER`

Number of items to show in the final results. Default is 10.

```sh
findlike reference_file.txt -m 5
```

#### `-p, --prefix TEXT`

String to prepend to each entry in the final results. You can set it to "* " or "- " to turn the results into a Markdown or Org-mode list. Default is "", so that no prefix is shown. Example:

```sh
findlike reference_file.txt -p "- "
```

#### `-h, --hide-reference`

Removes the first result from the scores list. Useful if the reference file is in the scanned directory and you don't want to see it at the top of the results. This option has no effect if the `--query` option is used. Example:

```sh
findlike reference_file.txt -h
```

#### `-H, --heading TEXT`

Text to show as the list heading. Default is "", so no heading title is shown. Example:

```sh
findlike reference_file.txt -H "## Similar files"
```

#### `-F, --format [plain|json]`

This option sets the output format. `plain` will print the results as a simple list, one entry per line. `json` will print the results as a valid JSON list with `score` and `target` as keys for each entry. Default is "plain". Example:

```sh
findlike reference_file.txt -F json
```
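
Because the `json` format is machine-readable, the results can be consumed by other programs. The following Python sketch parses such a list and keeps the stronger matches. The sample string is invented for illustration; only the `score` and `target` keys come from the description above, and the value types (a number and a path string) are assumptions.

```python
import json

# Hypothetical output of `findlike reference_file.txt -F json`.
# The scores and paths below are invented for illustration.
raw_output = (
    '[{"score": 0.42, "target": "notes/a.md"},'
    ' {"score": 0.17, "target": "notes/b.md"}]'
)

results = json.loads(raw_output)

# Keep only the targets whose score is at least 0.2.
strong_matches = [entry["target"] for entry in results if entry["score"] >= 0.2]
print(strong_matches)
```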

#### `-t, --threshold FLOAT`

Similarity score threshold. All results whose scores are below the given threshold are omitted. Default is 0.05. Set it to 0 if you wish to show all results. Example:

```sh
findlike reference_file.txt -t 0
```
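
The filtering described for `--threshold` can be pictured with the Python sketch below. The `apply_threshold` helper and the scores are hypothetical and only mirror the behaviour described above (results below the threshold are dropped); this is not the actual `findlike` implementation.

```python
# Hypothetical (score, path) pairs; the values are invented for illustration.
scored = [(0.61, "a.md"), (0.04, "b.md"), (0.23, "c.md"), (0.0, "d.md")]


def apply_threshold(results, threshold):
    """Drop every result whose score is below `threshold`."""
    return [(score, path) for score, path in results if score >= threshold]


print(apply_threshold(scored, 0.05))  # low-scoring entries are omitted
print(apply_threshold(scored, 0))     # a threshold of 0 keeps everything
```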

## More Examples

To find similar documents in a directory (recursively):

@@ -149,7 +233,9 @@ source venv/bin/activate

Now install the development dependencies:

```sh
pip install -e '.[dev]'
```

To run the tests:

@@ -159,4 +245,9 @@ pytest

## License

This project is licensed under the terms of the MIT license. See [LICENSE](LICENSE) for more details.

## Acknowledgements

- [Simon Willison](https://simonwillison.net/) for being an inspiration to release small but useful tools more often.
- [Sindre Sorhus](https://raw.githubusercontent.com/sindresorhus/text-extensions) for the comprehensive list of plain-text file extensions.
25 changes: 9 additions & 16 deletions findlike/cli.py
@@ -6,16 +6,12 @@
from nltk.stem import SnowballStemmer
from stop_words import get_stop_words

from .format import BaseFormatter, JsonFormatter
from .preprocessing import (
Corpus,
Processor,
)
from .utils import try_read_file
from .wrappers import BM25, Tfidf

FORMATTER_CLASSES = {"plain": BaseFormatter, "json": JsonFormatter}
ALGORITHM_CLASSES = {"bm25": BM25, "tfidf": Tfidf}
from .utils import try_read_file, collect_paths
from .constants import FORMATTER_CLASSES, ALGORITHM_CLASSES, TEXT_FILE_EXT


@click.command()
@@ -43,7 +39,7 @@
type=str,
default="*.*",
help="filename pattern matching",
show_default=True,
show_default="plain-text file extensions",
required=False,
)
@click.option(
@@ -162,7 +158,7 @@ def cli(
query,
format,
threshold,
absolute_paths
absolute_paths,
):
"""'findlike' is a program that scans a given directory and returns the most
similar documents in relation to REFERENCE_FILE or --query QUERY.
@@ -188,15 +184,12 @@ def cli(

# Put together the list of documents to be analyzed.
directory_path = Path(directory)
glob_func = (
directory_path.rglob
if recursive
else directory_path.glob
document_paths = collect_paths(
directory=directory_path, extensions=TEXT_FILE_EXT, recursive=recursive
)
documents_paths = [x for x in glob_func(filename_pattern) if x.is_file()]

# Create a corpus with the collected documents.
corpus = Corpus(paths=documents_paths, min_chars=min_chars)
corpus = Corpus(paths=document_paths, min_chars=min_chars)

# Set up the documents pre-processor.
stemmer = SnowballStemmer(language).stem
@@ -223,7 +216,7 @@ def cli(
heading=heading,
threshold=threshold,
absolute_paths=absolute_paths,
is_query=bool(query)
)
is_query=bool(query),
)
formatted_results = formatter.format()
print(formatted_results)
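
The diff above replaces the inline `glob`/`rglob` logic with a call to `collect_paths(directory=..., extensions=..., recursive=...)`, whose body is not part of this view. As a rough, hypothetical sketch of what such a helper might look like (only the name and keyword arguments are taken from the call site; the extension format, return type and sorting are assumptions):

```python
from pathlib import Path
from typing import Iterable, List


def collect_paths(
    directory: Path, extensions: Iterable[str], recursive: bool = False
) -> List[Path]:
    """Return the files under `directory` whose extension is in `extensions`.

    Hypothetical sketch, not the actual findlike implementation.
    """
    # Assumption: extensions may be given with or without a leading dot.
    wanted = {ext.lower().lstrip(".") for ext in extensions}
    # rglob descends into sub-directories; glob stays at the top level.
    walker = directory.rglob("*") if recursive else directory.glob("*")
    return sorted(
        path
        for path in walker
        if path.is_file() and path.suffix.lower().lstrip(".") in wanted
    )
```

Filtering by a fixed extension list like this keeps binary files out of the corpus by default, which matches the commit title. Note that in such a design a user-supplied `--filename-pattern` would need to be handled separately.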