Keywords4CV is a Python-based tool designed to help job seekers optimize their resumes and LinkedIn profiles for Applicant Tracking Systems (ATS) and human recruiters. It analyzes a collection of job descriptions and extracts the most important and relevant keywords, enabling users to tailor their application materials to specific job roles. By incorporating these keywords, users can significantly increase their chances of getting noticed by both ATS and recruiters, leading to more interview opportunities.
UNDER ACTIVE DEVELOPMENT - NOT CURRENTLY FUNCTIONAL
Currently at version 0.24 (Alpha). This is a pre-release version undergoing significant architectural changes. The script does not work correctly at this time. This README reflects the intended functionality of version 0.24, but critical issues remain. See the Known Issues section below.
## Features
- Enhanced Keyword Extraction: Identifies key skills, qualifications, and terminology from job descriptions using advanced NLP techniques (spaCy, NLTK, scikit-learn, rapidfuzz).
- Preserves multi-word skills (e.g., "machine learning") through entity recognition.
- Fuzzy Matching: Integrates `rapidfuzz` for flexible keyword matching, handling variations in spelling and phrasing.
- Phrase-Level Synonyms: Supports synonyms for multi-word phrases, loaded from a static file or a REST API.
- Configurable Processing Order: Option to apply fuzzy matching before or after semantic validation.
- Trigram Optimization: Uses a trigram cache to improve n-gram generation efficiency.
- Dynamic N-gram Generation: Improved n-gram handling with robust error handling.
- TF-IDF Analysis with Advanced Weighting: Computes Term Frequency-Inverse Document Frequency (TF-IDF) scores, combined with whitelist boosting, keyword frequency, and section weighting. Configurable weights allow for fine-tuning keyword importance.
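A minimal sketch of how such a combined score might be computed (the weight names and values here are illustrative placeholders, not the script's actual defaults):

```python
# Illustrative weights; the real values come from the weighting section of config.yaml.
WEIGHTS = {"tfidf": 0.7, "frequency": 0.3, "whitelist_boost": 1.5}

def combined_score(tfidf: float, frequency: float, in_whitelist: bool,
                   section_weight: float = 1.0) -> float:
    """Weighted sum of TF-IDF and frequency, boosted for whitelisted
    terms and scaled by the weight of the section it appeared in."""
    base = WEIGHTS["tfidf"] * tfidf + WEIGHTS["frequency"] * frequency
    if in_whitelist:
        base *= WEIGHTS["whitelist_boost"]
    return base * section_weight
```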
- Synonym Expansion: Leverages WordNet and, optionally, a REST API or static file to suggest synonyms, expanding keyword coverage and improving semantic matching.
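For the static-file case, the phrase-synonym table might be built along these lines (the function names and the symmetric-expansion behavior are assumptions for illustration, not the script's actual API):

```python
import json
from typing import Dict, Set

def build_synonym_table(raw: Dict[str, list]) -> Dict[str, Set[str]]:
    """Make the phrase -> synonyms mapping symmetric, so a match on any
    variant maps back to the rest of its synonym group."""
    table: Dict[str, Set[str]] = {}
    for phrase, alts in raw.items():
        group = {phrase.lower(), *(a.lower() for a in alts)}
        for term in group:
            table.setdefault(term, set()).update(group - {term})
    return table

def load_phrase_synonyms(path: str) -> Dict[str, Set[str]]:
    """Load a synonyms.json-style file and expand it symmetrically."""
    with open(path, encoding="utf-8") as f:
        return build_synonym_table(json.load(f))
```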
- Semantic Keyword Categorization: Assigns keywords to categories using a hybrid approach:
- Direct matching against pre-defined category terms.
- Semantic similarity (cosine similarity between word vectors) for terms not directly matched.
- Configurable `default_category` for uncategorized terms.
- Caching for improved performance.
- Contextual Validation: Uses a configurable context window to determine if a keyword is used in a relevant context within the job description, reducing false positives. Improved sentence splitting handles bullet points and numbered lists.
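A simplified sketch of contextual validation; the sentence splitter and window logic here are naive stand-ins for the script's implementation:

```python
import re
from typing import List

def sentences(text: str) -> List[str]:
    # Naive splitter that also treats bullet points and numbered list
    # markers as sentence boundaries.
    parts = re.split(r"[.!?\n]+|\s*[-*\u2022]\s+|\s*\d+\.\s+", text)
    return [s.strip() for s in parts if s.strip()]

def in_context(keyword: str, text: str, context_terms: List[str],
               window: int = 5) -> bool:
    """True if any context term occurs within `window` tokens of the
    keyword inside the same sentence."""
    for sent in sentences(text):
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens):
            if tok == keyword.lower():
                nearby = tokens[max(0, i - window):i + window + 1]
                if any(term in nearby for term in context_terms):
                    return True
    return False
```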
- Highly Configurable: Uses a `config.yaml` file for extensive customization:
- Validation: Settings for input validation (e.g., allowing numeric titles, handling empty descriptions).
- Dataset: Parameters for controlling dataset processing (e.g., minimum/maximum job descriptions, short description threshold).
- Text Processing:
- spaCy model selection and pipeline component configuration.
- N-gram ranges (for general keywords and the whitelist).
- POS tag filtering.
- Semantic validation settings (enabled/disabled, similarity threshold).
- Synonym source (static file or API) and related settings.
- Context window size.
- Fuzzy matching order (before/after semantic validation).
- Categorization: Default category, categorization cache size, direct match threshold.
- Whitelist: Whitelist recall threshold, caching options, fuzzy matching settings (algorithm, minimum similarity, allowed POS tags).
- Weighting: Weights for TF-IDF, frequency, whitelist boost, and section-specific weights.
- Hardware Limits: GPU usage, batch size, memory thresholds, maximum workers, memory scaling factor.
- Optimization: Settings for reinforcement learning (Q-learning) based parameter tuning, including reward weights, learning rate, and complexity factors.
- Caching: Cache sizes for various components (general cache, TF-IDF max features, trigram cache). Includes a `cache_salt` for cache invalidation.
- Intermediate Save: Options for saving intermediate results (enabled/disabled, save interval, format, working directory, cleanup).
- Advanced: Dask integration (currently disabled), success rate threshold, checksum relative tolerance, negative keywords, section headings.
- Stop words (with options to add and exclude specific words).
- Extensive and customizable whitelist of technical and soft skills.
- Keyword categories.
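The options above map to sections of `config.yaml`. A minimal illustrative excerpt (the exact key names, nesting, and values here are assumptions based on the option list above; consult the shipped `config.yaml` for the authoritative structure):

```yaml
# Illustrative excerpt only; see the shipped config.yaml for the real structure.
text_processing:
  spacy_model: en_core_web_lg
  ngram_range: [1, 3]
  semantic_validation: true
  similarity_threshold: 0.65
  context_window_size: 5
weighting:
  tfidf_weight: 0.7
  frequency_weight: 0.3
  whitelist_boost: 1.5
categorization:
  default_category: "Other Skills"
caching:
  cache_salt: "v0.24"
intermediate_save:
  enabled: true
  save_interval: 50
  format: jsonl
  working_dir: working_dir
```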
- Detailed Output Reports: Generates comprehensive Excel reports:
- Keyword Summary: Aggregated keyword scores, job counts, average scores, and assigned categories for a high-level overview.
- Job Specific Details: Detailed table showing keyword scores, TF-IDF, frequency, category, whitelist status, and other relevant information per job title.
- Robust Input Validation: Rigorous validation of job descriptions and configuration parameters, handling empty titles, descriptions, incorrect data types, encoding issues, and invalid configuration values. Clear error messages and logging.
- User-Friendly Command-Line Interface: `argparse` provides a clear and easy-to-use interface.
- Comprehensive Error Handling and Logging: Detailed logging to `Keywords4CV.log` with improved error handling for configuration, input, memory issues, API calls, data integrity, and other potential problems. Custom exceptions are used for specific error types.
- Multiprocessing for Parallel Processing: Core analysis uses `concurrent.futures.ProcessPoolExecutor` for parallel processing, significantly improving performance.
- Efficient Caching: Uses `functools.lru_cache`, `cachetools.LRUCache`, and custom caching mechanisms to optimize performance, with cache invalidation on configuration changes.
- SpaCy Pipeline Optimization: Dynamically enables and disables spaCy pipeline components based on configuration, improving efficiency.
- Automatic NLTK Resource Management: Ensures WordNet and other NLTK resources are downloaded if missing.
- Memory Management and Adaptive Chunking:
- Smart Chunker: Uses a Q-learning algorithm to dynamically adjust the chunk size based on dataset statistics and system resource usage.
- Auto Tuner: Automatically adjusts parameters (e.g., `chunk_size`, `pos_processing`) based on metrics and trigram cache hit rate.
- Memory Monitoring: Monitors memory usage and clears caches if necessary.
- Explicit Garbage Collection: Releases memory proactively.
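The Q-learning idea behind the Smart Chunker can be illustrated with a toy tuner; the states, actions, and reward signal here are simplified assumptions, not the script's actual scheme:

```python
import random
from collections import defaultdict

class ChunkSizeTuner:
    """Toy Q-learning tuner: states are coarse memory-pressure labels,
    actions shrink, keep, or grow the chunk size."""

    ACTIONS = ("shrink", "keep", "grow")

    def __init__(self, chunk_size=100, lr=0.1, discount=0.9, epsilon=0.1):
        self.q = defaultdict(float)  # (state, action) -> estimated value
        self.chunk_size = chunk_size
        self.lr, self.discount, self.epsilon = lr, discount, epsilon

    def choose(self, state):
        if random.random() < self.epsilon:      # explore occasionally
            return random.choice(self.ACTIONS)
        return max(self.ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Standard one-step Q-learning update.
        best_next = max(self.q[(next_state, a)] for a in self.ACTIONS)
        key = (state, action)
        self.q[key] += self.lr * (reward + self.discount * best_next - self.q[key])

    def apply(self, action):
        factor = {"shrink": 0.5, "keep": 1.0, "grow": 2.0}[action]
        self.chunk_size = max(1, int(self.chunk_size * factor))
        return self.chunk_size
```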
- Intermediate Result Saving and Checkpointing: Saves intermediate results to disk in configurable formats (feather, jsonl, json) with checksum verification to ensure data integrity. This allows for resuming processing and prevents data loss.
- Streaming Data Aggregation: Uses a generator-based approach to aggregate results from intermediate files, enabling processing of very large datasets.
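A generator-based aggregation along these lines keeps only one record in memory at a time (a dependency-free sketch; the file layout, field names, and JSON Lines format are assumptions):

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterator

def stream_records(working_dir: str) -> Iterator[dict]:
    """Yield one record at a time from every intermediate .jsonl file."""
    for path in sorted(Path(working_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

def aggregate(records) -> Dict[str, dict]:
    """Fold streamed records into per-keyword totals and averages."""
    totals = defaultdict(lambda: {"total_score": 0.0, "job_count": 0})
    for rec in records:
        entry = totals[rec["keyword"]]
        entry["total_score"] += rec["score"]
        entry["job_count"] += 1
    for entry in totals.values():
        entry["avg_score"] = entry["total_score"] / entry["job_count"]
    return dict(totals)
```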
## How It Works

- Input: Accepts a JSON file (e.g., `job_descriptions.json`) with job titles as keys and descriptions as values. Also uses a `config.yaml` file for configuration.
- Configuration Validation: Validates the structure and content of the `config.yaml` file using `schema` and Pydantic.
- Preprocessing: Cleans text (lowercasing, URL/email removal), tokenizes, lemmatizes, and caches results.
- Keyword Extraction:
- Matches `keyword_categories` phrases as `SKILL` entities using spaCy's `entity_ruler`.
- Generates n-grams (configurable range).
- Filters out noisy n-grams (containing stop words or single characters).
- Expands keywords with synonyms (from WordNet, a static file, or a REST API).
- Applies fuzzy matching (using `rapidfuzz`) against the expanded set of skills.
- Performs semantic validation, checking if keywords are used in context.
- Keyword Weighting and Scoring: Combines TF-IDF, frequency, whitelist boost, and section weighting (configurable weights).
- Keyword Categorization: Assigns categories using a hybrid approach (direct match + semantic similarity).
- Intermediate Saving (Optional): Saves intermediate results (summary and detailed scores) to disk in a configurable format (feather, jsonl, or json) with checksum verification.
- Adaptive Parameter Tuning: Uses reinforcement learning (Q-learning) to dynamically adjust parameters (e.g., chunk size, POS processing strategy) based on performance metrics.
- Result Aggregation: Combines intermediate results (if any) into final summary and detailed DataFrames.
- Output Generation: Produces Excel reports:
- Summary: Ranked keywords with `Total_Score`, `Avg_Score`, and `Job_Count`.
- Detailed Scores: Per-job details including `Score`, `TF-IDF`, `Frequency`, `Category`, `In Whitelist`, and other information.
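The n-gram generation and fuzzy-matching steps can be sketched as follows. This is a dependency-free illustration: `difflib` stands in for `rapidfuzz`, and the stop-word list, threshold, and function names are placeholders rather than the script's actual API.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Set

STOP_WORDS = {"a", "an", "and", "the", "in", "of", "with"}  # illustrative subset

def ngrams(tokens: List[str], lo: int = 1, hi: int = 3) -> Set[str]:
    """All n-grams in [lo, hi], skipping noisy ones that contain stop
    words or single-character tokens."""
    out = set()
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if any(t in STOP_WORDS or len(t) < 2 for t in gram):
                continue
            out.add(" ".join(gram))
    return out

def fuzzy_match(candidate: str, skills: Set[str],
                min_ratio: float = 0.85) -> Optional[str]:
    """Return the best-matching known skill at or above min_ratio, else None."""
    best, best_ratio = None, min_ratio
    for skill in skills:
        ratio = SequenceMatcher(None, candidate, skill).ratio()
        if ratio >= best_ratio:
            best, best_ratio = skill, ratio
    return best
```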
## Prerequisites

- Python 3.8+
- Required libraries:

```shell
pip install pandas nltk spacy scikit-learn pyyaml psutil requests rapidfuzz srsly xxhash cachetools pydantic schema pyarrow numpy
```

- SpaCy model (or another suitable spaCy model, specified in `config.yaml`):

```shell
python -m spacy download en_core_web_lg
```
## Installation

- Clone the repository:

```shell
git clone https://github.com/DavidOsipov/Keywords4Cv.git
cd Keywords4Cv
```

- (Recommended) Create a virtual environment:

```shell
python3 -m venv venv
source venv/bin/activate   # Linux/macOS
venv\Scripts\activate      # Windows
```

- Install dependencies (see Prerequisites).
## Usage

- Prepare Input: Create `job_descriptions.json`:

```json
{
  "Data Scientist": "Experience in Python, machine learning, and SQL...",
  "Product Manager": "Agile methodologies, product roadmapping...",
  "Software Engineer": "Proficient in Java, Spring, REST APIs..."
}
```
- Customize `config.yaml`: Thoroughly review and adjust the `config.yaml` file. This is crucial for the script's behavior. Pay particular attention to:
  - `keyword_categories`: Define your skill categories and list relevant keywords for each.
  - `text_processing`: Configure the spaCy model, n-gram ranges, synonym settings, fuzzy matching, and contextual validation.
  - `whitelist`: Adjust fuzzy matching parameters.
  - `weighting`: Set weights for TF-IDF, frequency, whitelist boost, and section weights.
  - `hardware_limits`: Configure memory usage and processing limits.
  - `optimization`: Tune reinforcement learning parameters (if desired).
  - `intermediate_save`: Enable/disable intermediate saving and configure related options.
- (Optional) Create `synonyms.json`: If using static phrase synonyms, create a `synonyms.json` file:

```json
{
  "product management": ["product leadership", "product ownership"],
  "machine learning": ["ml", "ai"]
}
```
- Run the Script:

```shell
python keywords4cv.py -i job_descriptions.json -c config.yaml -o results.xlsx
```
- Review Output: Check `results.xlsx` and `Keywords4CV.log`.
## Known Issues

NOTE: The script does not currently work. This alpha release introduces critical architectural changes and is not yet functional.
## Repository Contents

- `keywords4cv_0.24.py.txt`: Main script.
- `config.yaml.truncated.txt`: Configuration file (example; you'll need to adapt this).
- `README.md`: This documentation.
- `exceptions.py.txt`: Custom exception definitions.
- `config_validation.py.txt`: Configuration validation logic.
- `requirements.txt`: Dependency list (update with `pip freeze > requirements.txt`).
- `job_descriptions.json`: Sample input.
- `synonyms.json`: (Optional) Static phrase synonyms.
- `working_dir`: (Created by the script) Stores intermediate files.
## Contributing
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Make your changes.
- Add or update tests as needed.
- Run the tests and ensure they pass.
- Commit your changes with clear commit messages.
- Submit a pull request.
## License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Important Note about Software Licensing:
While the CC BY-NC-SA 4.0 license is generally used for creative works, it is being used here to specifically restrict the commercial use of this software by others while allowing non-commercial use, modification, and sharing. The author chose this license to foster a community of non-commercial users and contributors while retaining the right to commercialize the software themselves.
Please be aware that this license is not a standard software license and may not address all software-specific legal concerns. Specifically:
- No Patent Grant: This license does not grant any patent rights related to the software.
- Disclaimer of Warranties: THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- Consult Legal Counsel: If you have any concerns about the legal implications of using this software under the CC BY-NC-SA 4.0 license, especially regarding patent rights or commercial use, you should consult with legal counsel.
By using, modifying, or distributing this software, you agree to the terms of the CC BY-NC-SA 4.0 license, including the limitations and disclaimers stated above.