Evo-CoT: Evolutionary Optimization of Chain-of-Thoughts
This repository contains code for the Evo-CoT framework, which uses staged evolutionary algorithms to generate, align, and correct chain-of-thought (CoT) reasoning exemplars. The framework is designed to explore novel reasoning patterns, refine them for problem alignment, and select top-quality CoTs using LLM-based correction.
- Installation
1.1 Python Version
Python >= 3.10 recommended.
1.2 Dependencies
Install required packages using pip:
pip install -r requirements.txt
Key dependencies include:
numpy – numerical computations
scipy – scientific operations (optional)
matplotlib – plotting results
tqdm – progress bars
transformers – LLM alignment and evaluation (for Stage 2/3)
torch – PyTorch backend for LLM inference
(Optional: jupyter or ipython for interactive experiments)
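A minimal requirements.txt matching the dependency list above might look like the following; it lists package names only, since exact version pins are an assumption best left to your environment:

```
numpy
scipy
matplotlib
tqdm
transformers
torch
```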
- Dataset / Population Initialization
The framework requires an initial CoT population stored as JSON (population.json).
Each entry must include:
problem : problem statement
cot : initial chain-of-thought
answer : ground truth answer (optional for exploration)
Download Instructions:
If using a benchmark dataset (e.g., GSM8K, MATH, or custom problems), preprocess into the above JSON format.
Example JSON snippet:
[
  {
    "problem": "In a class of 40 students, 80% have puppies. 25% of those also have parrots. How many have both?",
    "cot": "First calculate the number of students with puppies. Then compute the subset with parrots.",
    "answer": "8"
  },
  ...
]
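A small loader along these lines can sanity-check population.json before running Stage 1. The field names follow the schema above; the function name itself is illustrative, not part of the repository's API:

```python
import json

REQUIRED_FIELDS = ("problem", "cot")  # "answer" is optional for exploration

def load_population(path="population.json"):
    """Load the initial CoT population and verify required fields."""
    with open(path) as f:
        population = json.load(f)
    for i, entry in enumerate(population):
        missing = [k for k in REQUIRED_FIELDS if k not in entry]
        if missing:
            raise ValueError(f"entry {i} is missing fields: {missing}")
    return population
```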
- Running Experiments
3.1 Stage 1: Exploration
python stage1_exploration.py
Generates diverse CoTs using meta-heuristics, semantic-preserving mutations, and crossovers.
Logs fitness, diversity, and generation statistics.
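The Stage 1 loop can be sketched as a standard generational EA with elitism. The mutation, crossover, and fitness callables below are stand-ins, assumed for illustration, for the framework's semantic-preserving operators and fitness scoring:

```python
import random

def evolve(population, fitness_fn, mutate_fn, crossover_fn,
           generations=10, elite_frac=0.1, seed=0):
    """Generic generational loop: score, keep elites, refill via variation."""
    rng = random.Random(seed)
    for _ in range(generations):
        scored = sorted(population, key=fitness_fn, reverse=True)
        n_elite = max(1, int(elite_frac * len(scored)))
        next_gen = scored[:n_elite]  # elitism: carry the best CoTs forward
        while len(next_gen) < len(population):
            # select parents from the top half, then vary
            a, b = rng.sample(scored[:len(scored) // 2], 2)
            next_gen.append(mutate_fn(crossover_fn(a, b), rng))
        population = next_gen
    return sorted(population, key=fitness_fn, reverse=True)
```

The real scripts additionally log fitness, diversity, and per-generation statistics; this sketch only shows the control flow.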
3.2 Stage 2: Alignment
python stage2_alignment.py
Aligns top Stage 1 CoTs to their respective problems using LLM guidance.
No evolution occurs here; purely alignment and structural refinement.
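Because Stage 2 is pure refinement, it reduces to prompting an LLM with each problem and its candidate CoT and keeping the rewritten reasoning. The prompt template below is an assumed illustration, not the repository's actual prompt:

```python
def build_alignment_prompt(problem, cot):
    """Assemble an LLM prompt asking for a problem-aligned rewrite of a CoT."""
    return (
        "Rewrite the reasoning so that every step refers to the problem below.\n"
        f"Problem: {problem}\n"
        f"Candidate chain-of-thought: {cot}\n"
        "Aligned chain-of-thought:"
    )
```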
3.3 Stage 3: Correction & Ranking
python stage3_correction.py
Uses LLM-based scoring to assign correctness fitness.
Ranks and selects Top-K CoTs.
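Given per-CoT correctness scores from the LLM judge, Top-K selection is a simple sort-and-slice. The score field name here is an assumption for illustration:

```python
def select_top_k(population, k=10, score_key="correctness"):
    """Rank CoT entries by LLM-assigned correctness and keep the best k."""
    ranked = sorted(population, key=lambda e: e[score_key], reverse=True)
    return ranked[:k]
```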
- Reproducibility Notes
Random seeds are set in all stages, but LLM-based alignment may introduce non-determinism.
Stage 1 results can vary slightly depending on mutation and crossover operations.
Save population snapshots (population_stage1_genX.json) to resume experiments or compare intermediate results.
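Snapshots can be plain JSON dumps keyed by generation, following the filename convention above; the helper names are illustrative:

```python
import json

def save_snapshot(population, generation, prefix="population_stage1_gen"):
    """Write the current population to e.g. population_stage1_gen5.json."""
    path = f"{prefix}{generation}.json"
    with open(path, "w") as f:
        json.dump(population, f, indent=2)
    return path

def load_snapshot(path):
    """Resume from a previously saved population snapshot."""
    with open(path) as f:
        return json.load(f)
```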
- Cost & Computational Considerations
Stage 1 with a population of 2,000 over 80 generations is computationally intensive (2,000 × 80 ≈ 160,000 total fitness evaluations).
LLM-based alignment and correction (Stage 2/3) can be GPU-accelerated for efficiency.
Suggested workflow for budgeted experiments:
- Run smaller populations or fewer generations for prototype testing.
- Run full-scale experiments on high-memory GPU nodes for final results.
Track elapsed time, mutation/crossover counts, and diversity to monitor experiment efficiency.
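As a quick budgeting aid, total fitness evaluations scale as population size × generations; a tiny helper (illustrative, not part of the repo) makes the prototype-vs-full-scale trade-off explicit:

```python
def fitness_eval_budget(pop_size, generations):
    """Upper bound on fitness evaluations for a generational EA run."""
    return pop_size * generations

# Full-scale run from the notes above: 2,000 population × 80 generations
full = fitness_eval_budget(2000, 80)      # 160000 evaluations
# Prototype run: 200 population × 10 generations
proto = fitness_eval_budget(200, 10)      # 2000 evaluations
```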
- Plotting Results
Use matplotlib to visualize fitness trends across generations:
import matplotlib.pyplot as plt

plt.plot(generations, avg_fitness, label='Average Fitness', color='blue')
plt.plot(generations, best_fitness, label='Best Fitness', color='red')
plt.xlabel('Generation')
plt.ylabel('Fitness')
plt.title('Stage 1 Fitness Evolution')
plt.legend()
plt.savefig('stage1_fitness_plot.png')
plt.show()
Upload .png or .pdf images to Overleaf for paper figures.