🤗 Training Data | 📄 Arxiv | 🤖 Model | 🏆 Leaderboard
We used three evaluation datasets to assess the performance of our Fino1 model.
Dataset | Description |
---|---|
FinQA | FinQA is a large-scale dataset for numerical reasoning in finance, featuring expert-annotated QA pairs that require integrating structured and unstructured data from financial reports while handling complex domain-specific terminology. |
DocMath | DocMath-Eval is a benchmark for evaluating LLMs' numerical reasoning over long specialized documents and tables, with the simplong subset focusing on reasoning across multi-tiered financial or specialized tables within extended contexts. |
XBRL-Math | The XBRL-Math dataset evaluates LLMs' numerical reasoning in XBRL filings, requiring models to interpret structured financial data, US GAAP XBRL tags, equations, and hierarchical numerical relationships for accurate financial analysis. |
We compared our Fino1 model against 16 state-of-the-art large language models (LLMs).
Model | Description |
---|---|
GPT-4o | GPT-4o is OpenAI's versatile, high-intelligence flagship model. It accepts text and image inputs and produces text outputs (including Structured Outputs). |
GPT-o1 | The o1 series of models are trained with reinforcement learning to perform complex reasoning. o1 models think before they answer, producing a long internal chain of thought before responding to the user. |
GPT-o3-mini | o3-mini is OpenAI's most recent small reasoning model, providing high intelligence at the same cost and latency targets of o1-mini. o3-mini also supports key developer features, like Structured Outputs, function calling, Batch API, and more. |
DeepSeek-V3 | DeepSeek-V3 is a 671B Mixture-of-Experts (MoE) model with 37B active parameters per token, leveraging Multi-head Latent Attention (MLA) and DeepSeekMoE for efficient training and inference, achieving state-of-the-art performance comparable to closed-source models with stable and cost-effective training. |
DeepSeek-R1 | DeepSeek-R1-Zero and DeepSeek-R1 are first-generation reasoning models, with DeepSeek-R1 incorporating cold-start data before RL to improve readability and performance, achieving results comparable to OpenAI-o1 across reasoning tasks, while open-sourced distilled models set new benchmarks for dense models. |
Qwen2.5-72B-Instruct | Qwen2.5 is the latest series of Qwen LLMs, offering models from 0.5B to 72B parameters with improved knowledge, coding, math, instruction following, structured data handling, long-context support (up to 128K tokens), and multilingual capabilities across 29+ languages. |
Qwen2.5-Math-72B-Instruct | Qwen2.5-Math-72B-Instruct is an upgraded open-source mathematical LLM supporting both Chain-of-Thought (CoT) and Tool-integrated Reasoning (TIR) for solving math problems in Chinese and English, offering significant performance improvements over Qwen2-Math. |
DeepSeek-R1-Distill-Llama-70B | DeepSeek-R1-Zero and DeepSeek-R1 are first-generation reasoning models, with DeepSeek-R1 incorporating cold-start data before RL to improve readability and performance, achieving results comparable to OpenAI-o1 across reasoning tasks, while open-sourced distilled models set new benchmarks for dense models. |
Llama3-70B-Instruct | Meta released the Llama 3 family of 8B and 70B LLMs, optimized for dialogue, outperforming many open-source chat models while prioritizing helpfulness and safety. |
Llama3.1-70B-Instruct | The Meta Llama 3.1 collection includes multilingual LLMs (8B, 70B, 405B) optimized for multilingual dialogue, outperforming many open-source and closed chat models on industry benchmarks. |
Llama3.3-70B-Instruct | The Meta Llama 3.3 is a 70B instruction-tuned multilingual LLM optimized for dialogue, outperforming many open-source and closed chat models on industry benchmarks. |
DeepSeek-R1-Distill-Qwen-32B | DeepSeek-R1-Zero and DeepSeek-R1 are first-generation reasoning models, with DeepSeek-R1 incorporating cold-start data before RL to improve readability and performance, achieving results comparable to OpenAI-o1 across reasoning tasks, while open-sourced distilled models set new benchmarks for dense models. |
DeepSeek-R1-Distill-Qwen-14B | DeepSeek-R1-Zero and DeepSeek-R1 are first-generation reasoning models, with DeepSeek-R1 incorporating cold-start data before RL to improve readability and performance, achieving results comparable to OpenAI-o1 across reasoning tasks, while open-sourced distilled models set new benchmarks for dense models. |
DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Zero and DeepSeek-R1 are first-generation reasoning models, with DeepSeek-R1 incorporating cold-start data before RL to improve readability and performance, achieving results comparable to OpenAI-o1 across reasoning tasks, while open-sourced distilled models set new benchmarks for dense models. |
Llama3-8B-Instruct | Meta released the Llama 3 family of 8B and 70B LLMs, optimized for dialogue, outperforming many open-source chat models while prioritizing helpfulness and safety. |
Llama3.1-8B-Instruct | The Meta Llama 3.1 collection includes multilingual LLMs (8B, 70B, 405B) optimized for multilingual dialogue, outperforming many open-source and closed chat models on industry benchmarks. |
For reasoning-path construction and training, we took inspiration from HuatuoGPT-o1.
We release the reasoning paths here: https://huggingface.co/datasets/TheFinAI/Fino1_Reasoning_Path_FinQA
Following HuatuoGPT-o1, we applied a two-stage approach to train our Fino1 model:
- Stage 1: Supervised Fine-Tuning (SFT); a minimal sketch follows this list.
- Stage 2: Reinforcement Learning (RL)
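For Stage 1, a minimal SFT sketch using trl's SFTTrainer is shown below, assuming a recent trl version with SFTConfig. The base model id and the dataset column names are assumptions for illustration, not the exact Fino1 training setup.

```python
# Minimal SFT sketch (assumptions: base model id, dataset column names).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Load the released reasoning-path data; the column names below are hypothetical.
dataset = load_dataset("TheFinAI/Fino1_Reasoning_Path_FinQA", split="train")

def to_text(example):
    # Concatenate the question and its reasoning path / answer into one training string.
    return {"text": example["question"] + "\n" + example["response"]}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed 8B base model
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="fino1-sft",
        per_device_train_batch_size=1,
        num_train_epochs=3,
    ),
)
trainer.train()
```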
For Stage 2, we provide a simple PPO script using the trl library. Below is an example of training an 8B model with PPO on an 8-GPU A100 machine. Ensure you first download the medical verifier used as the reward model.
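As a reference, here is a minimal single-step sketch of that PPO loop with trl's classic PPOTrainer API (older trl versions). The base model, prompt, and constant reward are placeholders; in practice the reward comes from the downloaded verifier, so this is not the exact Fino1 script.

```python
# Minimal PPO sketch (classic trl PPOTrainer API); placeholders are marked.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed 8B base model
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_config = PPOConfig(batch_size=1, mini_batch_size=1, learning_rate=1e-6)
ppo_trainer = PPOTrainer(ppo_config, model, ref_model, tokenizer)

# One PPO step: generate a response to a financial question and assign a reward.
query = tokenizer.encode(
    "What was the year-over-year change in operating income?",
    return_tensors="pt",
)
response = ppo_trainer.generate(query[0], return_prompt=False, max_new_tokens=128)

# Placeholder reward; in practice, score the response with the verifier model.
reward = [torch.tensor(1.0)]
stats = ppo_trainer.step([query[0]], [response[0]], reward)
```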
Please check HuatuoGPT-o1 for more training details.
Inference for local models is conducted using FinBen with the vLLM framework.
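As an illustration, a local model can be queried with vLLM along the following lines; the model id, prompt, and sampling settings are placeholders rather than FinBen's actual harness.

```python
# Hedged sketch of local inference with vLLM (not the FinBen harness itself).
from vllm import LLM, SamplingParams

llm = LLM(model="TheFinAI/Fino1-8B")  # assumed Hugging Face model id
params = SamplingParams(temperature=0.0, max_tokens=1024)

prompts = [
    "Pre-text, table, and post-text from a financial report...\n"
    "Question: What was the percentage change in revenue?\nAnswer:"
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```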
For API-based models, evaluation is performed using the query_llm.py script.
For the final evaluation, we followed DocMath-Eval: GPT is first used to extract the final answer from each model's output, and the extracted answer is then checked for correctness against the gold answer.
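A hedged sketch of that two-step evaluation is shown below, assuming an OpenAI-style client; the extraction prompt and the numeric tolerance are illustrative and not the exact DocMath-Eval code.

```python
# Illustrative answer extraction + scoring (not the exact DocMath-Eval code).
from openai import OpenAI

client = OpenAI()

def extract_final_answer(model_output: str) -> float:
    """Use GPT to pull the final numeric answer out of a model's free-form output."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": ("Extract only the final numeric answer from the text below. "
                        "Reply with the number and nothing else.\n\n" + model_output),
        }],
    )
    text = resp.choices[0].message.content.strip().replace(",", "").rstrip("%")
    return float(text)

def is_correct(pred: float, gold: float, rel_tol: float = 1e-2) -> bool:
    """Count an answer as correct if it matches the gold value within a small relative tolerance."""
    return abs(pred - gold) <= rel_tol * max(abs(gold), 1.0)
```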
Models | FinQA | DocMath | XBRL-Math | Average |
---|---|---|---|---|
GPT-4o | 72.49 | 60.00 | 72.22 | 68.24 |
GPT-o1 | 49.07 | 56.00 | 74.44 | 59.84 |
GPT-o3-mini | 60.87 | 59.00 | 76.67 | 65.51 |
DeepSeek-V3 | 73.20 | 53.00 | 76.67 | 67.62 |
DeepSeek-R1 | 65.13 | 53.00 | 86.67 | 68.93 |
Qwen2.5-72B-Instruct | 73.38 | 59.00 | 67.78 | 66.72 |
Qwen2.5-Math-72B-Instruct | 69.74 | 42.00 | 83.33 | 65.69 |
DeepSeek-R1-Distill-Llama-70B | 66.73 | 53.00 | 86.67 | 68.80 |
Llama3-70B-Instruct | 58.92 | 41.00 | 56.67 | 52.20 |
Llama3.1-70B-Instruct | 63.18 | 48.00 | 63.33 | 58.17 |
Llama3.3-70B-Instruct | 68.15 | 54.00 | 70.00 | 64.05 |
DeepSeek-R1-Distill-Qwen-32B | 65.48 | 55.00 | 84.44 | 68.97 |
DeepSeek-R1-Distill-Qwen-14B | 63.27 | 44.00 | 84.44 | 63.90 |
DeepSeek-R1-Distill-Llama-8B | 45.96 | 33.00 | 81.11 | 53.36 |
Llama3-8B-Instruct | 41.97 | 29.00 | 48.89 | 39.95 |
Llama3.1-8B-Instruct | 54.13 | 34.00 | 62.22 | 50.12 |
Fino1-8B | 60.87 | 40.00 | 82.22 | 61.03 |
- [2025/02/12] 🎉 We trained the Fino1 model and evaluated its performance.
If you find our work useful, please cite our paper:
BibTeX:
@misc{qian2025fino1transferabilityreasoningenhanced,
title={Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance},
author={Lingfei Qian and Weipeng Zhou and Yan Wang and Xueqing Peng and Jimin Huang and Qianqian Xie},
year={2025},
eprint={2502.08127},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.08127},
}