-
Notifications
You must be signed in to change notification settings - Fork 53
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
cEP-0026.md: Adds optimize caching cEP
- Loading branch information
Showing
1 changed file
with
196 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,196 @@ | ||
# Optimize caching for the NextGen-Core | ||
|
||
| Metadata | | | ||
| -------- | ----------------------------------------- | | ||
| cEP | 26 | | ||
| Version | 1.0 | | ||
| Title | Optimize caching for the NextGen-Core | | ||
| Authors | Palash Nigam <mailto:[email protected]> | | ||
| Status | Implementation Due | | ||
| Type | Feature | | ||
|
||
## Abstract | ||
|
||
This document describes how a caching mechanism (an enhanced version of the one | ||
being used by the old core) will be implemented and integrated with the | ||
NextGen-Core as a part of this | ||
[GSoC project](https://summerofcode.withgoogle.com/projects/#6434190552203264). | ||
|
||
## Introduction | ||
|
||
Since the NextGen-Core aims to provide faster speed and better performance for | ||
coala-runs on large codebases it needs caching to achieve this kind of | ||
performance. | ||
|
||
Structure wise the NextGen cache is a dictionary with bear types and cache | ||
tables as key-value pairs. The cache tables themselves are | ||
dictionary-like-objects that map the `persistent_hash`s of the task objects to | ||
the bear results. This current mechanism will be further enhanced by adding | ||
features like `FileFactory` class to load files, `Directory` class to interface | ||
with the directories, ignoring unmodified directories during successive runs and | ||
adding cache control flags for better usability and providing the user various | ||
options to perform caching according to a project's needs. | ||
|
||
## Implementation | ||
|
||
The implementation will be divided into four parts | ||
|
||
- Prototype for `FileFactory` class implementation. | ||
- Prototype for `Directory` class implementation. | ||
- Ignore directories functionality. | ||
- Cache control flags. | ||
|
||
### Prototype of FileFactory class implementation | ||
|
||
This class will follow the Factory design pattern and will act as a proxy to | ||
dealing with files. This class will have various methods to provide the file | ||
contents in different forms like string, binary format, etc. With the help of | ||
this class we can replace the actual file contents in our file-dict with | ||
these objects which will itself be a performance improvement. | ||
|
||
The following is the prototype for `FileFactory` class. | ||
|
||
```python | ||
import linecache | ||
import os | ||
|
||
from coala_utils.decorators import generate_eq | ||
from coalib.core.PersistentHash import persistent_hash | ||
|
||
|
||
@generate_eq('_filename', '_stamp') | ||
class FileFactory: | ||
|
||
def __init__(self, filename): | ||
self._filename = filename | ||
self._parent = os.path.abspath(os.path.join(self._filename, os.pardir)) | ||
self._stamp = os.path.getmtime(self._filename) | ||
|
||
def get_line(self, line): | ||
return linecache.getline(self._filename, line) | ||
|
||
def file_as_list(self): | ||
self.lines = linecache.getlines(self._filename) | ||
|
||
def file_as_bytes(self): | ||
with open(self._filename, 'rb') as fp: | ||
return fp.read() | ||
|
||
def file_as_string(self): | ||
return self.file_as_bytes().decode() | ||
|
||
@property | ||
def name(self): | ||
return self._filename | ||
|
||
@property | ||
def parent(self): | ||
return self._parent | ||
|
||
@property | ||
def timestamp(self): | ||
return self._stamp | ||
|
||
def hash(self): | ||
# hash the absolute file path to get a unique hash | ||
return persistent_hash(self._filename) | ||
``` | ||
|
||
### Prototype of Directory class implementation | ||
|
||
The objects of this class will also reside in the file-set (formerly known as | ||
the file-dict). The objects of this class can also be used by `DirectoryBears` | ||
as discussed on this thread [#4999](https://github.com/coala/coala/issues/4999) | ||
by performing a simple type check on the entries in the file-set, whether the | ||
objects are instances of `FileFactory` or `Directory`. | ||
|
||
Following is the prototype for the `Directory` class. | ||
|
||
```python | ||
import os | ||
|
||
from coalib.parsing.Globbing import glob | ||
|
||
|
||
class Directory: | ||
def __init__(self, path): | ||
self.path = path | ||
self.parent = os.path.abspath(os.path.join(path, os.pardir)) | ||
self.files = 0 # number of files contained in a directory | ||
self.subdirs = 0 # number of sub-directories contained in a directory | ||
|
||
def count_children(self): | ||
children = os.listdir(self.path) | ||
|
||
for child in children: | ||
if os.path.isfile(child): | ||
self.files += 1 | ||
|
||
self.subdirs = len(children) - self.files | ||
|
||
def get_dir(pattern): | ||
return glob(pattern) # get sub directory matching the given glob pattern | ||
``` | ||
|
||
### Ignore directories functionality | ||
|
||
This will probably be the biggest performance boost for coala. Instead of | ||
ignoring only unmodified files on subsequent runs, coala shall also ignore | ||
directories with no unchanged files and subdirectories. It can be implemented | ||
accordingly: | ||
|
||
- `FileFactory` will have an additional attribute parent | ||
(the parent directory path) in its `__init__` method. | ||
- `Directory` class will also have additional attributes `files` | ||
(number of files) `subdirs` (number of subdirs) and `parent` | ||
(the parent directory's path; in case of project's root it will be `None`) | ||
- There will be a separate dict (called directory-tracker) containing directory | ||
paths as keys and number of files and subdirs as a list of values for e.g. | ||
``` | ||
{ Directory.path: [Directory.files, Directory.subdirs] } | ||
``` | ||
- First all the changed files will be deleted from the cache. If a file | ||
is not unmodified then it can be untracked, we will look up its parent | ||
directory using `FileFactory.parent` and decrease the count of the number of | ||
files (`Directory.files` in the directory-tracker dict) corresponding to that | ||
path by one. | ||
- Once all the modified files have been deleted from the cache coala will move | ||
on to the directories in our dict. coala will look for directories with | ||
`Directory.subdirs` not equal to 0 (starting from the bottommost level of the | ||
directory tree) if the `Directory.files` associated with such paths is 0 (i.e. | ||
all the files contained inside that directory have been uncached) then the | ||
`Directory` object corresponding to this path will be deleted from the cache. | ||
- As soon as a Directory object is uncached the value of the | ||
`Directory.subdirs` of their parent directory will also be decreased by 1. | ||
- This process will continue for all the directories and the paths having both | ||
the values (`Directory.files` and `Directory.subdirs`) as zero will have | ||
their corresponding `Directory` objects deleted from the cache. | ||
|
||
### Cache control flags | ||
|
||
This feature enhancement focuses on both performance and usability. The aim | ||
is to provide different caching modes through cache control flags | ||
to provide the user the flexibility to tweak the caching mechanism according to | ||
their project's needs. The following cache flags can be implemented to | ||
provide different caching modes: | ||
|
||
- `--cache-protocol`: This will accept three different arguments: | ||
- `none`: if the user wants to run coala without caching enabled | ||
- `primitive` to store all the cache entries without removing the | ||
unchanged ones on subsequent runs (memory wise quite inefficient for | ||
large projects but this will prove to be useful in projects with many | ||
incremental builds that need fast linting). | ||
- `untrack` the opposite of the `primitive` mode which will stop tracking | ||
unchanged files and directories on subsequent coala runs | ||
- `lru` (last recently used) in which cached items only persist until the | ||
next run (a count parameter can also be used to know when to discard an | ||
unused cached item). This should be the default argument. | ||
- `--clear-cache`: Similar to `--flush-cache`. | ||
- `--cache-compression`: This will accept the following arguments | ||
- `none`: for compression to be disabled. | ||
- `lzma` or `gzip`: to compress the cache using these modules. | ||
- `--cache-optimize`: For faster cache loading we can make use of this flag | ||
which will utilize `pickletools.optimize` for faster cache loading. | ||
- `--export-cache` or `--import-cache`: An API (yet to be designed) to transfer | ||
cache across different machines (like the user's local machine and the CI | ||
servers) and speed up builds. |