Project Ideas Improve Copyright Detection Accuracy and Speed

Improve Copyright detection accuracy and speed in ScanCode

Copyright detection is reasonably accurate, but it is the slowest scanner in ScanCode. It is based on NLTK part-of-speech (PoS) tagging and a copyright grammar. The start and end lines reported for a detected copyright are only approximate.
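To make the tag-then-parse idea concrete, here is a minimal sketch in plain Python. It is not ScanCode's actual tagger or grammar: the tag names, the patterns, and the functions `tag` and `detect` are all hypothetical, and a regular expression over one-letter tag codes stands in for the NLTK grammar.

```python
import re

# Hypothetical, heavily simplified tag patterns; the first match wins.
TAG_PATTERNS = [
    (re.compile(r"^(?:copyright|\(c\)|©)$", re.IGNORECASE), "COPY"),
    (re.compile(r"^(?:19|20)\d\d,?$"), "YR"),
    (re.compile(r"^[A-Z][\w.&-]*,?$"), "NAME"),
]

# Each tag is encoded as one letter so that a regex can play the role
# of the grammar. Toy rule: one or more COPY, then YR, then NAME tokens.
TAG_CODES = {"COPY": "C", "YR": "Y", "NAME": "N", "JUNK": "J"}
GRAMMAR = re.compile(r"C+Y+N+")


def tag(tokens):
    """Assign one coarse tag per token; unknown tokens get 'JUNK'."""
    tagged = []
    for token in tokens:
        for pattern, label in TAG_PATTERNS:
            if pattern.match(token):
                tagged.append((token, label))
                break
        else:
            tagged.append((token, "JUNK"))
    return tagged


def detect(text):
    """Return the token runs whose tag sequence matches the toy grammar."""
    tagged = tag(text.split())
    codes = "".join(TAG_CODES[label] for _, label in tagged)
    return [
        " ".join(token for token, _ in tagged[m.start():m.end()])
        for m in GRAMMAR.finditer(codes)
    ]


print(detect("This file is Copyright (c) 2019 Example Inc. and others"))
```

Because the text is split on whitespace before tagging, the original positions are lost at that step, which is one reason reported line numbers can end up approximate in a pipeline of this shape.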

The goal of this project is to refactor copyright detection for speed and simplicity. Possible approaches include implementing a new parser (PEG?, etc.), re-implementing core elements in Rust with a Python binding, using a fork of NLTK, or using any other tool that makes detection faster and more accurate.

This would also include keeping track of the line numbers and offsets where copyrights are found.
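One possible way to report exact positions is to match against the whole text and map character offsets back to line numbers afterwards. The sketch below assumes this approach; `COPYRIGHT_RE` is a hypothetical stand-in for the real detector, and `Detection`, `line_starts` and `locate` are illustrative names, not existing ScanCode APIs.

```python
import bisect
import re
from typing import NamedTuple


class Detection(NamedTuple):
    text: str
    start_line: int    # 1-based
    end_line: int      # 1-based, inclusive
    start_offset: int  # 0-based character offset
    end_offset: int    # exclusive


# Hypothetical stand-in for the real detector; it may span a line break.
COPYRIGHT_RE = re.compile(
    r"copyright\s+(?:\(c\)\s+)?\d{4}\s+[^\n]+", re.IGNORECASE
)


def line_starts(text):
    """Return the character offset at which each line begins."""
    starts = [0]
    for newline in re.finditer(r"\n", text):
        starts.append(newline.end())
    return starts


def locate(text):
    """Find copyright statements with exact lines and offsets."""
    starts = line_starts(text)

    def line_of(offset):
        # Number of line starts at or before the offset is its line number.
        return bisect.bisect_right(starts, offset)

    return [
        Detection(
            text=m.group(0),
            start_line=line_of(m.start()),
            end_line=line_of(m.end() - 1),
            start_offset=m.start(),
            end_offset=m.end(),
        )
        for m in COPYRIGHT_RE.finditer(text)
    ]


sample = "line one\nCopyright (c)\n2020 Example Inc.\nline four\n"
for detection in locate(sample):
    print(detection)
```

The line start table is built once per file, so each offset lookup is a binary search rather than a rescan of the text.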

We also detect copyrights that are part of a standard license text (e.g. the FSF copyright in a GPL text), and we should be able to filter these out.
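A minimal sketch of such a filter follows, assuming that license detection can supply the line ranges of matched full license texts and that copyright detections carry their own line ranges. The types `Span` and `Copyright` and the function `drop_license_copyrights` are hypothetical and only illustrate the idea.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start_line: int  # 1-based
    end_line: int    # 1-based, inclusive


class Copyright(NamedTuple):
    statement: str
    span: Span


def is_within(inner, outer):
    """Return True if the inner span lies entirely inside the outer span."""
    return (
        outer.start_line <= inner.start_line
        and inner.end_line <= outer.end_line
    )


def drop_license_copyrights(copyrights, license_text_spans):
    """Keep only copyrights that are not inside a full license text."""
    return [
        c
        for c in copyrights
        if not any(is_within(c.span, s) for s in license_text_spans)
    ]


# Assumed inputs: a project copyright on line 1, and a license text on
# lines 10 to 300 that carries its own copyright line.
copyrights = [
    Copyright("Copyright (c) 2020 Example Inc.", Span(1, 1)),
    Copyright("Copyright (C) 1989, 1991 Free Software Foundation, Inc.", Span(12, 12)),
]
license_text_spans = [Span(10, 300)]

for kept in drop_license_copyrights(copyrights, license_text_spans):
    print(kept.statement)
```

Only spans from full license texts should be passed in. A short license notice or header often sits next to the project's own copyright statement, and that one should be kept.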

Level
- Advanced
Tech
- Python, Rust, Go?
URLs
- https://github.com/nexB/scancode-toolkit/tree/develop/src/cluecode
Mentors
- @JonoYang https://github.com/JonoYang

http://aboutcode.org/
