Skip to content

The-FinAI/Fino1

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

44 Commits
 
 
 
 

Repository files navigation

🚀 Fino1: On the Transferability of Reasoning-Enhanced LLMs to Finance

🤗 Training Data | 📄 Arxiv | 🤖 Model 🏆 Leaderboard


📈 Overview

📂 Datasets Used

Here, we utilized three evaluation datasets to assess the performance of our Fino1 model.

Dataset Description
FinQA FinQA is a large-scale dataset for numerical reasoning in finance, featuring expert-annotated QA pairs that require integrating structured and unstructured data from financial reports while handling complex domain-specific terminology.
DocMath DocMath-Eval is a benchmark for evaluating LLMs' numerical reasoning over long specialized documents and tables, with the simpllong subset focusing on reasoning across multi-tiered financial or specialized tables within extended contexts.
XBRL-Math XBRL-Math dataset evaluates LLMs' numerical reasoning in XBRL filings, requiring models to interpret structured financial data, US GAAP XBRL tags, equations, and hierarchical numerical relationships for accurate financial analysis.

🏆 Models Evaluated

We compared our Fino1 model against 16 state-of-the-art large language models (LLMs).

Model Description
GPT-4o GPT-4o is OpenAI's versatile, high-intelligence flagship model. It accepts text and image inputs and produces text outputs (including Structured Outputs).
GPT-o1 The o1 series of models are trained with reinforcement learning to perform complex reasoning. o1 models think before they answer, producing a long internal chain of thought before responding to the user.
GPT-o3-mini o3-mini is OpenAI's most recent small reasoning model, providing high intelligence at the same cost and latency targets of o1-mini. o3-mini also supports key developer features, like Structured Outputs, function calling, Batch API, and more.
DeepSeek-V3 DeepSeek-V3 is a 671B Mixture-of-Experts (MoE) model with 37B active parameters per token, leveraging Multi-head Latent Attention (MLA) and DeepSeekMoE for efficient training and inference, achieving state-of-the-art performance comparable to closed-source models with stable and cost-effective training.
DeepSeek-R1 DeepSeek-R1-Zero and DeepSeek-R1 are first-generation reasoning models, with DeepSeek-R1 incorporating cold-start data before RL to improve readability and performance, achieving results comparable to OpenAI-o1 across reasoning tasks, while open-sourced distilled models set new benchmarks for dense models.
Qwen2.5-72B-Instruct Qwen2.5 is the latest series of Qwen LLMs, offering models from 0.5B to 72B parameters with improved knowledge, coding, math, instruction following, structured data handling, long-context support (up to 128K tokens), and multilingual capabilities across 29+ languages.
Qwen2.5-Math-72B-Instruct Qwen2.5-Math-72B-Instruct is an upgraded open-source mathematical LLM supporting both Chain-of-Thought (CoT) and Tool-integrated Reasoning (TIR) for solving math problems in Chinese and English, offering significant performance improvements over Qwen2-Math.
DeepSeek-R1-Distill-Llama-70B DeepSeek-R1-Zero and DeepSeek-R1 are first-generation reasoning models, with DeepSeek-R1 incorporating cold-start data before RL to improve readability and performance, achieving results comparable to OpenAI-o1 across reasoning tasks, while open-sourced distilled models set new benchmarks for dense models.
Llama3-70B-Instruct Meta released the Llama 3 family of 8B and 70B LLMs, optimized for dialogue, outperforming many open-source chat models while prioritizing helpfulness and safety.
Llama3.1-70B-Instruct The Meta Llama 3.1 collection includes multilingual LLMs (8B, 70B, 405B) optimized for multilingual dialogue, outperforming many open-source and closed chat models on industry benchmarks.
Llama3.3-70B-Instruct The Meta Llama 3.3 is a 70B instruction-tuned multilingual LLM optimized for dialogue, outperforming many open-source and closed chat models on industry benchmarks.
DeepSeek-R1-Distill-Qwen-32B DeepSeek-R1-Zero and DeepSeek-R1 are first-generation reasoning models, with DeepSeek-R1 incorporating cold-start data before RL to improve readability and performance, achieving results comparable to OpenAI-o1 across reasoning tasks, while open-sourced distilled models set new benchmarks for dense models.
DeepSeek-R1-Distill-Qwen-14B DeepSeek-R1-Zero and DeepSeek-R1 are first-generation reasoning models, with DeepSeek-R1 incorporating cold-start data before RL to improve readability and performance, achieving results comparable to OpenAI-o1 across reasoning tasks, while open-sourced distilled models set new benchmarks for dense models.
DeepSeek-R1-Distill-Llama-8B DeepSeek-R1-Zero and DeepSeek-R1 are first-generation reasoning models, with DeepSeek-R1 incorporating cold-start data before RL to improve readability and performance, achieving results comparable to OpenAI-o1 across reasoning tasks, while open-sourced distilled models set new benchmarks for dense models.
Llama3-8B-Instruct Meta released the Llama 3 family of 8B and 70B LLMs, optimized for dialogue, outperforming many open-source chat models while prioritizing helpfulness and safety.
Llama3.1-8B-Instruct The Meta Llama 3.1 collection includes multilingual LLMs (8B, 70B, 405B) optimized for multilingual dialogue, outperforming many open-source and closed chat models on industry benchmarks.

🧩 Reasoning Path Building

For the reasoning path building and training part, we were inspired by HuatuoGPT-o1

We release the reasoning path here: https://huggingface.co/datasets/TheFinAI/Fino1_Reasoning_Path_FinQA

🏗️ How to Train Fino1

Refer to HuatuoGPT-o1, we applied two-stage way to train our Fino1 model

  • Stage 1: Supervised Fine-Tuning (SFT)

  • Stage 2: Reinforcement Learning (RL)

We provide a simple PPO script using the trl library. Below is an example for training an 8B model with PPO on an 8-GPU A100 machine. Ensure you first download medical verifier as the reward model.

Please check HuatuoGPT-o1 for more training details.

🎯 Evaluation of all models

Inference: Local Models

Model inference for local models is conducted using FinBen with the VLLM framework.

Inference: API Models

For API-based models, evaluation is performed using the query_llm.py script.

Evaluation

For the final evaluation, we used DocMath-Eval to first use GPT to extract final answers from the result and then evaluate the correctness of the answer.


Key Results

📊 Performance of Different LLMs on Financial Datasets

Models FinQA DocMath XBRL-Math Average
GPT-4o 72.49 60.00 72.22 68.24
GPT-o1 49.07 56.00 74.44 59.84
GPT-o3-mini 60.87 59.00 76.67 65.51
DeepSeek-V3 73.20 53.00 76.67 67.62
DeepSeek-R1 65.13 53.00 86.67 68.93
Qwen2.5-72B-Instruct 73.38 59.00 67.78 66.72
Qwen2.5-72B-Instruct-Math 69.74 42.00 83.33 65.69
DeepSeek-R1-Distill-Llama-70B 66.73 53.00 86.67 68.80
Llama3-70B-Instruct 58.92 41.00 56.67 52.20
Llama3.1-70B-Instruct 63.18 48.00 63.33 58.17
Llama3.3-70B-Instruct 68.15 54.00 70.00 64.05
DeepSeek-R1-Distill-Qwen-32B 65.48 55.00 84.44 68.97
DeepSeek-R1-Distill-Qwen-14B 63.27 44.00 84.44 63.90
DeepSeek-R1-Distill-Llama-8B 45.96 33.00 81.11 53.36
Llama3-8B-Instruct 41.97 29.00 48.89 39.95
Llama3.1-8B-Instruct 54.13 34.00 62.22 50.12
Fino1-8B 60.87 40.00 82.22 61.03

🛠️ Updates

  • [2025/02/12] 🎉 We've trained Fino1 model and evaluated its performance recently

📄 Citation

If you find our work useful, please cite our paper:

BibTeX:

@misc{qian2025fino1transferabilityreasoningenhanced,
      title={Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance}, 
      author={Lingfei Qian and Weipeng Zhou and Yan Wang and Xueqing Peng and Jimin Huang and Qianqian Xie},
      year={2025},
      eprint={2502.08127},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.08127}, 
}

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages