Domain-specific large model benchmarks: the edge perspective #177

Open
MooreZheng opened this issue Jan 14, 2025 · 14 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@MooreZheng
Collaborator

MooreZheng commented Jan 14, 2025

What would you like to be added/modified:
This issue aims to build an advanced benchmark for edge-oriented domain-specific large models on KubeEdge-Ianvs. The benchmark is intended to help Edge AI application developers validate and select the best-matched domain-specific large models, and to help Edge AI service providers identify which scenarios, edge nodes, or even locations offer the best performance or the most room for improvement for their models. This issue includes:

  • Domain-specific Large Model Benchmark for the edge, including test datasets, testing toolkits, and usage guidelines.
  • (Advanced) Design and implementation of specific evaluation metrics.
  • (Advanced) Survey and research reports.

Why is this needed:
Common large-model benchmarks in the industry tend to focus on the cloud. As large models enter the era of scaled applications, the cloud already provides the infrastructure and services they need. Customers have further raised targeted application requirements on the edge side, including personalization, data compliance, and real-time capabilities, making AI services with cloud-edge collaboration a major trend. Different institutions at the edge often build their own large models or knowledge bases. However, benchmarks for domain-specific large models with edge data have not yet been well developed. Because data distributions differ across edges, the performance of general large models is expected to vary significantly from edge to edge. This work aims to pinpoint those performance fluctuations for Edge AI services and applications.

Recommended Skills:
KubeEdge-Ianvs, Python, LLMs

Useful links:
Introduction to Ianvs
Quick Start
How to test algorithms with Ianvs
Testing incremental learning in industrial defect detection
Benchmarking for embodied AI
KubeEdge-Ianvs
Example LLMs Benchmark List
Ianvs v0.1 documentation

=====
Those who wish to apply for the LFX mentorship for this project might want to take a look at the pre-test.

@MooreZheng added the kind/feature label on Jan 14, 2025
@phoeenniixx

Hi @MooreZheng, I was looking through the open issues and found this one interesting!
A few thoughts:

Domain-specific Large Model Benchmark for the edge, including test datasets, testing toolkits, and usage guidelines.

Selecting Test Datasets

A few datasets that, in my opinion, we could add:

Domain                 Dataset
Healthcare             MIMIC-III, PhysioNet
Finance                BloombergGPT dataset*
Legal                  CaseLawQA
Autonomous Vehicles    Waymo Open Dataset
...                    and so on

(*) I have only found the paper for this so far; it was a superficial search, so we can likely find better options if we dig deeper (after your approval, of course).

I have a question about these datasets:

  • Are you planning to add datasets for all domains,
    or
  • Are you thinking of working on a more "general" architecture that can handle different domains or datasets (although this would be a more difficult task to achieve, imo)?

Benchmarking

Each domain has its own challenges, so we need to adapt the benchmarking accordingly.

We might need to create domain-specific strategies:

  • Healthcare: Benchmark clinical LLMs on MIMIC-III.
  • Finance: Evaluate text-based LLMs (e.g., BloombergGPT) for fraud analysis.

Adjust Ianvs test scripts to inject real-world constraints (e.g., throughput and memory-consumption checks); a rough measurement sketch follows the list below.

Every domain will have its own constraints, for example:

  • in healthcare, accuracy and efficiency (time to reach a decision) are the main focus, while for Autonomous Vehicles, throughput and power consumption also matter.
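
As a rough illustration of such constraint checks, here is a minimal sketch that times a generic generate(prompt) callable and records mean latency, throughput, and peak Python-heap memory; generate and profile_generation are hypothetical names, and wiring these numbers into Ianvs test scripts would follow the project's own conventions rather than anything shown here.

```python
"""Sketch: per-sample constraint checks (latency, throughput, peak Python-heap
memory) around a generic `generate(prompt)` callable standing in for the
model under test."""
import time
import tracemalloc
from statistics import mean


def profile_generation(generate, prompts):
    latencies, outputs = [], []
    tracemalloc.start()
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        outputs.append(generate(prompt))             # model under test
        latencies.append(time.perf_counter() - t0)
    wall = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()  # Python heap only, not GPU memory
    tracemalloc.stop()
    return {
        "mean_latency_s": mean(latencies),
        "throughput_samples_per_s": len(prompts) / wall,
        "peak_python_mem_mb": peak_bytes / 1e6,
        "outputs": outputs,
    }


if __name__ == "__main__":
    # Toy stand-in for an edge-deployed model.
    stats = profile_generation(lambda p: p.upper(), ["triage note", "lab report"])
    print({k: v for k, v in stats.items() if k != "outputs"})
```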

So, in my view, we first need to decide which domains we would like to work on and then move on to finding relevant datasets and benchmarking techniques.

This is what I was able to make of the issue. Please review it and see whether I am thinking in the right direction.

Thank You

@MooreZheng
Collaborator Author

MooreZheng commented Feb 6, 2025

Hey @phoeenniixx, welcome! Thank you so much for your attention and input on this issue.

Are you planning to add datasets for all domains
...
So, in my view, we first need to decide which domains we would like to work on and then move on to finding relevant datasets and benchmarking techniques.

The domain selection is up to LFX mentee candidates to propose. From the KubeEdge mentor side, we would like to see One Selected Domain related to edge computing with high value when an LLM is applied. Besides, a brand-new dataset is appreciated even more than integrating an existing one.

@ggold7046

ggold7046 commented Feb 6, 2025

Hi @MooreZheng, could you tell me the last date to apply for this mentorship? Could you please let me know the selection criteria and selection date as well?

@MooreZheng
Collaborator Author

Hi @MooreZheng, could you tell me the last date to apply for this mentorship? Could you please let me know the selection criteria and selection date as well?

https://docs.linuxfoundation.org/lfx/mentorship/mentorship-program-timelines

@AryanPrakhar

AryanPrakhar commented Feb 6, 2025

Hey @MooreZheng ! I'm Aryan Prakhar.

I’m a third-year B.Tech student at IIT (BHU) Varanasi, and this LLM benchmarking project immediately caught my attention. It resonated with my experience in benchmarking and evaluation, and I loved the idea of proposing a domain and even building a dataset from scratch. I'd love to highlight how I can contribute to the community.

After getting clarity about project requirements in today's international meeting, I went on to research an appropriate domain to build an LLM benchmark for maximum immediate utility on the edge computing front.

What I’ve Been Up To

LLMs & Benchmarking

  • Worked on LLM-based multi-agent systems for data-driven scientific discovery, including DataVoyager (accepted at ICML’24). reference
  • Interned at OpenLocus.ai, collaborating with Allen Institute for AI researchers to build a benchmark for evaluating LLMs’ data-driven discovery capabilities. This was accepted at ICLR’25, so I’ve got hands-on experience in benchmarking and evaluation metrics. reference
  • Contributed to CodeForGovTech DMP’24 open source program, focused on improving Digital Public Infrastructure. reference

Skills & Tools

  • Strong in Python, PyTorch, and benchmarking methodologies.
  • Hands-on experience with ETL processes and inferential statistics.
  • Built evaluation metrics while working on DiscoveryBench, where I gained deep insights into incorporating evaluation metrics, thanks to the great team I worked with.

Why I’m Excited About This Project

LLM benchmarking is still an evolving space, and every benchmark project requires deep industry understanding. That side learning keeps me engaged. Plus, this project’s practical impact—helping Edge AI developers select the right models—makes it even more exciting.

Domains for Edge AI Benchmarks

Here are the domains I think should be the pick. Why? These are the domains where I feel most LLM-based Edge AI applications will happen, so providing benchmarks here would enable better decisions in these areas.

1. Autonomous Vehicles (Top Pick)

LLMs + Edge AI can improve autonomous driving:

  • Edge-Cloud Motion Planning: Combining edge-side models (like LLaMA-Adapter) with cloud-based LLMs (GPT-4) for real-time decision-making. reference
  • Multimodal Perception & Decision-Making: Using vision-language models (VLMs) to process LiDAR, camera, and radar data. DriveLM refined trajectories in real-time using RL, achieving 5–13 decisions per second with GPT-3.5 Turbo. reference
  • Real-Time Safety & Adaptability: Processing sensor data locally to make driving decisions in <100ms.

Why It’s Important: Autonomous driving is a fast-growing field for Edge AI. A benchmark here would have immediate real-world value.

📌 Key Challenges to Test:

  • Inference Latency: Can LLMs meet sub-100ms response times for real-time decisions? (a latency-budget sketch follows this list)
  • Model Efficiency vs. Performance: Which LLM balances efficiency and accuracy best?
  • Multimodal Integration: How well do LLMs combine LiDAR, camera, and radar data?
  • Generalization Across Scenarios: How do models handle unseen conditions like fog or roadblocks?
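
Picking up the inference-latency challenge above, a minimal sketch of a latency-budget check could look like the following; decide and latency_profile are hypothetical placeholders, and meaningful numbers would have to come from the target edge hardware with real sensor inputs.

```python
"""Sketch: check whether a model's per-decision latency stays within a 100 ms
budget, using p50/p95/p99 percentiles over repeated runs."""
import time
import statistics


def latency_profile(decide, inputs, budget_ms=100.0):
    samples_ms = []
    for x in inputs:
        t0 = time.perf_counter()
        decide(x)                                    # model under test
        samples_ms.append((time.perf_counter() - t0) * 1000.0)
    samples_ms.sort()
    p95 = samples_ms[int(0.95 * (len(samples_ms) - 1))]
    p99 = samples_ms[int(0.99 * (len(samples_ms) - 1))]
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": p95,
        "p99_ms": p99,
        "within_budget": p99 <= budget_ms,
    }


if __name__ == "__main__":
    # Toy stand-in: a "model" that sleeps ~5 ms per decision.
    print(latency_profile(lambda x: time.sleep(0.005), range(200)))
```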

2. Logistics

I saw firsthand how tough route planning can be during a Himalayan trip where landslides disrupted navigation.

What Edge AI Can Do:

  • Predict traffic bottlenecks and suggest real-time alternate routes.
  • Optimize delivery routes (like UPS’s ORION system, which saves 10M gallons of fuel annually).
  • Manage autonomous fleets, adjusting vehicle behavior based on road conditions.

📌 Why I’d Be a Good Fit:
I’ve studied Advanced Traffic Engineering and its data analytics aspects, so I understand traffic systems at both macro and micro levels.

3. Energy

Why Edge AI is Useful:

  • Self-Healing Grids: Reroute power to reduce blackouts.
  • Transformer Health Monitoring: Predict failures using vibration, temperature, and load data, cutting downtime by 30–50%.

Would love to hear your thoughts! Do any of these domains align with what you're looking for? Open to feedback and excited to contribute!

@ggold7046

ggold7046 commented Feb 6, 2025

https://docs.linuxfoundation.org/lfx/mentorship/mentorship-program-timelines

Thanks for the reply @MooreZheng .

Mentee applications open on LFX: approximately 4 weeks

Mentee application review and acceptance: approximately 2 weeks before the term begins.

So can we assume that 20 February is the last date before you start reviewing the applications?
Could you please let us know what your expectations are for selecting the mentees?

@MooreZheng
Collaborator Author

MooreZheng commented Feb 7, 2025

Hey @MooreZheng ! I'm Aryan Prakhar.
...
Would love to hear your thoughts! Do any of these domains align with what you're looking for? Open to feedback and excited to contribute!

Hey @AryanPrakhar, welcome! Thank you so much for your attention and thoughts on this issue. As discussed, a domain that fits edge computing, has high market value, and comes with brand-new datasets is particularly welcome. The domain selection is up to LFX mentee candidates to propose. Further studies are also welcome.

@MooreZheng
Collaborator Author

MooreZheng commented Feb 7, 2025

Mentee applications open on LFX: approximately 4 weeks

Mentee application review and acceptance: approximately 2 weeks before the term begins.

So can we assume that 20 February is the last date before you start reviewing the applications? Could you please let us know what your expectations are for selecting the mentees?

That date looks good to me.

The current expectations are as discussed above in this issue.

We are also considering launching a pre-test around 20 February, depending on the number of candidates who apply for this issue. If so, the pre-test will be announced in this issue, and candidates might want to keep an eye on it. @ggold7046

@Abioye-Bolaji

Hi @MooreZheng
I came across this project under the LFX Mentorship Program and found it really interesting. I have experience with machine learning, model evaluation, and deployment, and I’m eager to learn more about benchmarking large models in edge environments. I would really love to learn and contribute to this project. I've already registered on the LFX mentee page.
I look forward to a positive response. Thank you!

@AtalGupta

Hi @MooreZheng,

I'm excited about this project! I would like to propose adding a new domain for evaluation: Camera-Reidentification Surveillance System. Given the growing need for real-time, edge-based surveillance solutions—especially with challenges like occlusion, varying lighting, and diverse environmental conditions—I believe this domain offers a rich testbed for domain-specific large models.

In this scenario, we could:

  • Develop or curate a dataset tailored to camera re-identification challenges on the edge with augmentation (e.g., Market1501, DukeMTMC-reID).
  • Define specific evaluation metrics that capture both the accuracy and the latency aspects critical for real-time surveillance (a rank-1/mAP sketch follows this list).
  • Integrate with KubeEdge-Ianvs, leveraging its capabilities to simulate edge conditions and performance variations across different nodes and locations.
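
To make the evaluation-metrics bullet concrete, here is a minimal sketch of the standard re-ID accuracy metrics (rank-1 and mAP) that could be computed on such a dataset; rank1_and_map is a hypothetical helper, and the embedding model, latency measurement, and Ianvs integration are deliberately left out.

```python
"""Sketch: person re-ID accuracy metrics (rank-1 and mAP) computed from cosine
similarity between query and gallery embeddings of shape (num_items, dim)."""
import numpy as np


def rank1_and_map(query_feats, query_ids, gallery_feats, gallery_ids):
    # L2-normalise so the dot product is cosine similarity.
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = q @ g.T                                   # (num_query, num_gallery)

    rank1_hits, average_precisions = [], []
    for i, qid in enumerate(query_ids):
        order = np.argsort(-sims[i])                 # best match first
        matches = (np.asarray(gallery_ids)[order] == qid).astype(np.float32)
        rank1_hits.append(matches[0])
        if matches.sum() > 0:
            precision_at_k = np.cumsum(matches) / (np.arange(len(matches)) + 1)
            average_precisions.append((precision_at_k * matches).sum() / matches.sum())
    return float(np.mean(rank1_hits)), float(np.mean(average_precisions))


if __name__ == "__main__":
    # Toy random features just to exercise the function.
    rng = np.random.default_rng(0)
    qf, gf = rng.normal(size=(4, 128)), rng.normal(size=(20, 128))
    print(rank1_and_map(qf, [0, 1, 2, 3], gf, list(range(10)) * 2))
```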

I'm enthusiastic about collaborating on this and can contribute using my experience with PyTorch, KubeEdge-Ianvs, and relevant models. Looking forward to discussing further how we can integrate this domain into the benchmark framework. Thank You!

@MooreZheng
Collaborator Author

MooreZheng commented Feb 17, 2025

Pre-test

Domain-specific large model benchmarks: the edge perspective (2025 Term 1)

Brief introduction

Thank you all for your attention to this issue! Those who wish to apply for the LFX mentorship for this project may try out this pre-test. The pre-test results will be used to help select the final mentee.

Tasks

The pre-test mainly contains two tasks:

  1. Test dataset
  • 1.1. (Basic) [Domain-specific LLM]
    A test dataset prepared for a domain-specific LLM, which could be existing or brand-new.
  • 1.2. (Advanced) [Brand-new dataset]
    A brand-new test set containing 20+ samples for each of 20+ edge nodes (a minimal layout-check sketch follows this list).
  2. Research report
  • 2.1. (Basic) [Domain justification]
    Describe why we need an LLM on the EDGE for the selected domain.
  • 2.2. (Basic) [EDGE characteristics]
    A basic and brief data justification of the prepared test set showing its EDGE characteristics, as opposed to cloud-only or general datasets (e.g., MNIST).
  • 2.3. (Advanced) [Ianvs tutorial]
    A tutorial showing how to use the test set to evaluate an algorithm with KubeEdge-Ianvs (see Example).
  • 2.4. (Advanced) [Related work]
    Related work on the proposed benchmark, covering over 10 other works with comments.
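
For the brand-new dataset item (Task 1.2), the following minimal sketch shows one way a candidate could sanity-check the "20+ samples for each of 20+ edge nodes" requirement; the JSONL index format and the edge_node field name are illustrative assumptions, not an Ianvs requirement.

```python
"""Sketch: sanity-check a per-edge-node test set stored as one JSONL file,
where each line is a sample such as
    {"edge_node": "hospital-03", "input": "...", "label": "..."}
The field names and file layout are illustrative only."""
import json
from collections import Counter


def check_edge_coverage(index_path, min_nodes=20, min_samples_per_node=20):
    counts = Counter()
    with open(index_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                counts[json.loads(line)["edge_node"]] += 1
    ok_nodes = [n for n, c in counts.items() if c >= min_samples_per_node]
    print(f"{len(counts)} edge nodes, {sum(counts.values())} samples, "
          f"{len(ok_nodes)} nodes with >= {min_samples_per_node} samples")
    return len(ok_nodes) >= min_nodes


if __name__ == "__main__":
    # Placeholder path to the submitted test-set index.
    print("requirement met:", check_edge_coverage("test_set_index.jsonl"))
```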

Submission method

After completing these tasks, submit the work as follows:

  1. Use Google Drive for your test set, and put a link to the dataset at the beginning of the research report.
  2. Use Google Docs for your research report and create a shared link for the document. Please send the Google Docs link via email to [email protected] and [email protected].

We will publish all received report links under this issue after the submission deadline.

Rating

  • Each item will be scored based on completion and quality, and the sum of the scores will be the total score for this task.
  • For links that cannot be opened successfully, the score will be ZERO without any notification. So be careful with the links and make sure they can be opened before submission!

Task 1 Test dataset

Item                               Score
Task 1.1. Domain-specific LLM      10
Task 1.2. Brand-new dataset        20

Task 2 Research report

Item                               Score
Task 2.1. Domain justification     10
Task 2.2. EDGE characteristics     20
Task 2.3. Ianvs tutorial           20
Task 2.4. Related work             20

Pre-test deadline

According to the official schedule of the LFX Mentorship, candidates need to complete registration and project applications between February 5 and February 18. The mentors will confirm candidates between February 19 and February 25. To ensure sufficient time for review, please complete this pre-test and send the report email by 11 AM (PST) on February 23, 2025.

@MooreZheng
Collaborator Author

Hi @MooreZheng I came across this project under the LFX Mentorship Program and found it really interesting. I have experience with machine learning, model evaluation, and deployment, and I’m eager to learn more about benchmarking large models in edge environments. I would really love to learn and contribute to this project. I've already registered on the LFX mentee page. I look forward to a positive response. Thank you!

Hi @MooreZheng,

I'm excited about this project! I would like to propose adding a new domain for evaluation: Camera-Reidentification Surveillance System. Given the growing need for real-time, edge-based surveillance solutions—especially with challenges like occlusion, varying lighting, and diverse environmental conditions—I believe this domain offers a rich testbed for domain-specific large models.
...

I'm enthusiastic about collaborating on this and can contribute using my experience with PyTorch, KubeEdge-Ianvs, and relevant models. Looking forward to discussing further how we can integrate this domain into the benchmark framework. Thank You!

Welcome @Abioye-Bolaji and @AtalGupta! Thank you for your attention; you might want to take a look at the pre-test.

@ParamThakkar123

Pre-test

Domain-specific large model benchmarks: the edge perspective (2025 Term 1)
...
To ensure sufficient time for review, please complete this pre-test and send the report email by 11 AM (PST) on February 23, 2025.

Hello @MooreZheng, I have submitted the tasks and the report to the email mentioned above. I hope you received them.

@MooreZheng
Collaborator Author

MooreZheng commented Feb 27, 2025

Thanks for your participation in the pre-test of the LFX mentorship project "CNCF - KubeEdge: Domain-specific large model benchmarks: the edge perspective (2025 Term 1)".

We received 48 applications. The following 7 outstanding candidates did a great job on the research report, ranking in the top 14.58% of all 48 candidates! As promised in the pre-test, now that the submission deadline has passed, we hereby publish all received pre-test report links:

Robin Chen: https://docs.google.com/document/d/1UpCy70VnbvvOKCiwyluLm9SimeLzPrDgfzVIPennyug/edit?usp=sharing
Jingyang Li: https://docs.google.com/document/d/1Qa142NL5y8BI8myEfTEmQNCLWMrqDHna4yX1BmHbJU0/edit?usp=sharing
Param Thakkar: https://docs.google.com/document/d/1ILuxEWwo9WH_kYh9IcERWeJkuFJ0XEyf2fQLKwymcM0/edit?usp=sharing
Atal Gupta: https://docs.google.com/document/d/1GLMwvZPREzz6QoXA3HWSmHndztQHfmfidzCPB0H_Z0A/edit?tab=t.0#heading=h.aczyuw2yex2w
Prachi Agrawal: https://docs.google.com/document/d/1CTO5ev8UoyY6k8MHNAypqnwgPvxeEV5zHmwk-swm91I/edit?usp=sharing
Yash Dilip Phalke: https://docs.google.com/document/d/1GR0y8k1FdJU4xEpCBDYenLrb6MW8s1rxSizN1Kh0v74/edit?usp=sharing
Dikshant jha: https://docs.google.com/document/d/1FuFf-q3Xc8UpnZjQEUJUzIPbyMlXnW0SG5fpxzoDiRE/edit?usp=sharing

We would also like to take this opportunity to acknowledge your outstanding performance in the pre-test and to invite all the above candidates to join KubeEdge SIG AI. More events are coming, and we look forward to seeing your contributions in the future~
