Skip to content

Commit

Permalink
cEP-0026.md: Adds optimize caching cEP
Browse files Browse the repository at this point in the history
  • Loading branch information
palash25 committed May 24, 2018
1 parent 87ae1ac commit 7272ff6
Showing 1 changed file with 196 additions and 0 deletions.
196 changes: 196 additions & 0 deletions cEP-0026.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,196 @@
# Optimize caching for the NextGen-Core

| Metadata | |
| -------- | ----------------------------------------- |
| cEP | 26 |
| Version | 1.0 |
| Title | Optimize caching for the NextGen-Core |
| Authors | Palash Nigam <mailto:[email protected]> |
| Status | Implementation Due |
| Type | Feature |

## Abstract

This document describes how a caching mechanism (an enhanced version of the one
being used by the old core) will be implemented and integrated with the
NextGen-Core as a part of this
[GSoC project](https://summerofcode.withgoogle.com/projects/#6434190552203264).

## Introduction

Since the NextGen-Core aims to provide faster speed and better performance for
coala-runs on large codebases it needs caching to achieve this kind of
performance.

Structure wise the NextGen cache is a dictionary with bear types and cache
tables as key-value pairs. The cache tables themselves are
dictionary-like-objects that map the `persistent_hash`s of the task objects to
the bear results. This current mechanism will be further enhanced by adding
features like `FileFactory` class to load files, `Directory` class to interface
with the directories, ignoring unmodified directories during successive runs and
adding cache control flags for better usability and providing the user various
options to perform caching according to a project's needs.

## Implementation

The implementation will be divided into four parts

- Prototype for `FileFactory` class implementation.
- Prototype for `Directory` class implementation.
- Ignore directories functionality.
- Cache control flags.

### Prototype of FileFactory class implementation

This class will follow the Factory design pattern and will act as a proxy to
dealing with files. This class will have various methods to provide the file
contents in different forms like string, binary format, etc. With the help of
this class we can replace the actual file contents in our file-dict with
these objects which will itself be a performance improvement.

The following is the prototype for `FileFactory` class.

```python
import linecache
import os

from coala_utils.decorators import generate_eq
from coalib.core.PersistentHash import persistent_hash


@generate_eq('_filename', '_stamp')
class FileFactory:

def __init__(self, filename):
self._filename = filename
self._parent = os.path.abspath(os.path.join(self._filename, os.pardir))
self._stamp = os.path.getmtime(self._filename)

def get_line(self, line):
return linecache.getline(self._filename, line)

def file_as_list(self):
self.lines = linecache.getlines(self._filename)

def file_as_bytes(self):
with open(self._filename, 'rb') as fp:
return fp.read()

def file_as_string(self):
return self.file_as_bytes().decode()

@property
def name(self):
return self._filename

@property
def parent(self):
return self._parent

@property
def timestamp(self):
return self._stamp

def hash(self):
# hash the absolute file path to get a unique hash
return persistent_hash(self._filename)
```

### Prototype of Directory class implementation

The objects of this class will also reside in the file-set (formerly known as
the file-dict). The objects of this class can also be used by `DirectoryBears`
as discussed on this thread [#4999](https://github.com/coala/coala/issues/4999)
by performing a simple type check on the entries in the file-set, whether the
objects are instances of `FileFactory` or `Directory`.

Following is the prototype for the `Directory` class.

```python
import os

from coalib.parsing.Globbing import glob


class Directory:
def __init__(self, path):
self.path = path
self.parent = os.path.abspath(os.path.join(path, os.pardir))
self.files = 0 # number of files contained in a directory
self.subdirs = 0 # number of sub-directories contained in a directory

def count_children(self):
children = os.listdir(self.path)

for child in children:
if os.path.isfile(child):
self.files += 1

self.subdirs = len(children) - self.files

def get_dir(pattern):
return glob(pattern) # get sub directory matching the given glob pattern
```

### Ignore directories functionality

This will probably be the biggest performance boost for coala. Instead of
ignoring only unmodified files on subsequent runs, coala shall also ignore
directories with no unchanged files and subdirectories. It can be implemented
accordingly:

- `FileFactory` will have an additional attribute parent
(the parent directory path) in its `__init__` method.
- `Directory` class will also have additional attributes `files`
(number of files) `subdirs` (number of subdirs) and `parent`
(the parent directory's path; in case of project's root it will be `None`)
- There will be a separate dict (called directory-tracker) containing directory
paths as keys and number of files and subdirs as a list of values for e.g.
```
{ Directory.path: [Directory.files, Directory.subdirs] }
```
- First all the changed files will be deleted from the cache. If a file
is not unmodified then it can be untracked, we will look up its parent
directory using `FileFactory.parent` and decrease the count of the number of
files (`Directory.files` in the directory-tracker dict) corresponding to that
path by one.
- Once all the modified files have been deleted from the cache coala will move
on to the directories in our dict. coala will look for directories with
`Directory.subdirs` not equal to 0 (starting from the bottommost level of the
directory tree) if the `Directory.files` associated with such paths is 0 (i.e.
all the files contained inside that directory have been uncached) then the
`Directory` object corresponding to this path will be deleted from the cache.
- As soon as a Directory object is uncached the value of the
`Directory.subdirs` of their parent directory will also be decreased by 1.
- This process will continue for all the directories and the paths having both
the values (`Directory.files` and `Directory.subdirs`) as zero will have
their corresponding `Directory` objects deleted from the cache.

### Cache control flags

This feature enhancement focuses on both performance and usability. The aim
is to provide different caching modes through cache control flags
to provide the user the flexibility to tweak the caching mechanism according to
their project's needs. The following cache flags can be implemented to
provide different caching modes:

- `--cache-protocol`: This will accept three different arguments:
- `none`: if the user wants to run coala without caching enabled
- `primitive` to store all the cache entries without removing the
unchanged ones on subsequent runs (memory wise quite inefficient for
large projects but this will prove to be useful in projects with many
incremental builds that need fast linting).
- `untrack` the opposite of the `primitive` mode which will stop tracking
unchanged files and directories on subsequent coala runs
- `lru` (last recently used) in which cached items only persist until the
next run (a count parameter can also be used to know when to discard an
unused cached item). This should be the default argument.
- `--clear-cache`: Similar to `--flush-cache`.
- `--cache-compression`: This will accept the following arguments
- `none`: for compression to be disabled.
- `lzma` or `gzip`: to compress the cache using these modules.
- `--cache-optimize`: For faster cache loading we can make use of this flag
which will utilize `pickletools.optimize` for faster cache loading.
- `--export-cache` or `--import-cache`: An API (yet to be designed) to transfer
cache across different machines (like the user's local machine and the CI
servers) and speed up builds.

0 comments on commit 7272ff6

Please sign in to comment.