Welcome to my categorised resources from my reading of academic papers!
- Language Models & Architecture
- Multimodal Models
- Training & Optimization
- Memory & Attention Mechanisms
- Cache & Memory Optimization
- Long Context Models
- Spatial Reasoning
- World Modeling Platforms
- Performance Optimization
- Retrieval & Augmentation
- Hallucination Reduction
- Response Filtering
- Coding & Development Tools
- Agentic Frameworks & Task Management
- Document Processing
- Computer Vision
- Robotics & Automation
- Text Analysis & Social Media Tools
- Security Frameworks
- Cybersecurity
- Detection & Monitoring
- Deepfake & Anomaly Detection
- Poisoning Attacks
- Bias & Fairness
- Privacy & Ethics
- Testing & Benchmarking
- User Interface & Interaction
- Infrastructure & Scaling
- Image Generation
- Speech & Audio
- Video Generation
- Financial Technology
- Gaming & Simulation
- Scientific Applications
- Tutorials & Guides
- Best Practices
- Energy Efficiency
-
DeepSeek-V3 Technical Report Paper: Link
Summary: DeepSeek-V3 is a Mixture-of-Experts (MoE) language model comprising 671 billion parameters, with 37 billion activated per token. It is pre-trained on 14.8 trillion diverse, high-quality tokens, followed by supervised fine-tuning and reinforcement learning to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models, despite requiring only 2.788 million H800 GPU hours for full training. The training process is noted for its remarkable stability.
Tags: #language-model #Mixture-of-Experts #AI -
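To make the routing idea concrete, here is a minimal top-k mixture-of-experts layer in PyTorch. This is an illustrative sketch only, not DeepSeek-V3's implementation (which also uses shared experts, an auxiliary-loss-free load-balancing strategy, and multi-token prediction); the layer sizes and `top_k=2` are arbitrary choices.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Minimal top-k mixture-of-experts layer: only k experts run for each token."""

    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # token -> expert affinity scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x)                          # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # keep only the k best experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                 # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(ToyMoELayer()(tokens).shape)  # torch.Size([10, 64])
```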
Large Concept Models: Language Modeling in a Sentence Representation Space Paper: Link
Summary: This paper introduces the concept of Large Concept Models (LCMs), which operate on higher-level semantic representations beyond individual tokens. By treating entire sentences as single concepts, LCMs utilize the SONAR sentence embedding space, supporting up to 200 languages in both text and speech modalities. The model is trained to perform autoregressive sentence prediction within this embedding space, exploring approaches such as MSE regression, diffusion-based generation variants, and models operating in a quantized SONAR space. This methodology aims to align AI processing more closely with human cognitive abstraction levels, enhancing the generation and analysis of complex information.
Tags: #language-modeling #sentence-embedding #multilingual #AI -
Byte Latent Transformer: Patches Scale Better Than Tokens Paper: Link
Summary: The Byte Latent Transformer (BLT) is a novel byte-level large language model (LLM) architecture that matches the performance of traditional tokenization-based LLMs while enhancing inference efficiency and robustness. BLT encodes raw bytes into dynamically sized patches, which serve as the primary computational units. These patches are segmented based on the entropy of the subsequent byte, allocating more computational resources to complex data segments. A FLOP-controlled scaling study of byte-level models up to 8 billion parameters and 4 trillion training bytes demonstrates that BLT can scale models trained on raw bytes without a fixed vocabulary. This approach improves both training and inference efficiency by dynamically selecting longer patches for predictable data, leading to qualitative enhancements in reasoning and generalization.
Tags: #LLM #byte-level-modeling #transformer #AI -
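As a toy illustration of entropy-based patching, the sketch below segments a byte stream into patches whenever the estimated next-byte entropy crosses a threshold. A byte-bigram frequency model stands in for BLT's small learned entropy model, and the threshold and example text are arbitrary.
```python
import math
from collections import Counter, defaultdict

def bigram_entropy_model(corpus: bytes):
    """Estimate H(next byte | current byte) from bigram counts (stand-in for a learned entropy model)."""
    pair_counts, ctx_counts = defaultdict(Counter), Counter()
    for cur, nxt in zip(corpus, corpus[1:]):
        pair_counts[cur][nxt] += 1
        ctx_counts[cur] += 1

    def entropy(ctx: int) -> float:
        total = ctx_counts[ctx]
        if total == 0:
            return 8.0  # unseen context: assume maximal uncertainty over 256 byte values
        return -sum((c / total) * math.log2(c / total) for c in pair_counts[ctx].values())

    return entropy

def entropy_patches(data: bytes, entropy, threshold=2.0):
    """Start a new patch whenever the next byte is hard to predict (high entropy)."""
    patches, current = [], bytearray([data[0]])
    for prev, nxt in zip(data, data[1:]):
        if entropy(prev) > threshold:   # unpredictable region: close the patch, spend more compute here
            patches.append(bytes(current))
            current = bytearray()
        current.append(nxt)
    patches.append(bytes(current))
    return patches

text = b"the quick brown fox jumps over the lazy dog. the quick brown fox."
model = bigram_entropy_model(text)
print([p.decode(errors="replace") for p in entropy_patches(text, model)])
```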
Kolmogorov-Arnold Transformer Paper: Link
Summary: This paper presents the Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces the MLP blocks of standard transformers with Kolmogorov-Arnold Network (KAN) layers, aiming to improve the efficiency and expressiveness of transformer models.
Tags: #architecture #transformer #AI -
Hunyuan-Large Paper: Link
Summary: This paper introduces Hunyuan-Large, the largest open-source Transformer-based mixture of experts model, featuring a total of 389 billion parameters with 52 billion activated parameters and the ability to handle up to 256K tokens. Hunyuan-Large demonstrates superior performance in various benchmarks, including language understanding and generation, logical reasoning, and coding tasks, surpassing smaller models while maintaining competitive performance against much larger counterparts like Llama 3.1-405B.
Tags: #MoE #transformer #open-source #AI
-
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Paper: Link
Summary: This paper introduces VITA-1.5, a Multimodal Large Language Model (MLLM) designed to integrate visual and speech modalities for real-time interaction. The authors propose a multi-stage training methodology that enables the model to understand both visual and speech information, facilitating efficient speech-to-speech dialogues without relying on separate Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) modules. This integration accelerates multimodal end-to-end response speed and enhances the model's capabilities in both vision and speech tasks. Comparative evaluations across benchmarks for image, video, and speech tasks demonstrate that VITA-1.5 achieves performance comparable to state-of-the-art models, enabling near real-time vision and speech interaction.
Tags: #multimodal #vision-and-speech #real-time-interaction #AI #MLLM -
Aria Paper: Link
Summary: Aria is an open multimodal-native mixture of experts model designed to handle diverse data types, enhancing AI's multimodal processing capabilities.
Tags: #architecture #multimodal #AI -
TableGPT2 Paper: Link
Summary: TableGPT2 is a large-scale multimodal model specifically designed to integrate tabular data into AI applications. Trained on over 593,800 tables and 2.36 million high-quality query-table-output tuples, it excels in table-centric tasks while maintaining robust general language and coding capabilities.
Tags: #multimodal-model #tabular-data #AI
-
Huggingface - AutoTrain Paper: Link
Summary: AutoTrain by Huggingface introduces a no-code solution for training machine learning models, simplifying model customization and deployment.
Tags: #UI #GUI #AI -
Initialization Using Update Approximation Is Effective for Training Transformers Paper: Link
Summary: This paper introduces a novel initialization method that approximates full fine-tuning within low-rank subspaces for training Transformer models. By employing a carefully designed initialization strategy, the approach achieves optimal scaling for high-rank gradient updates without the need for extensive hyperparameter tuning. The method demonstrates significant efficiency gains, using 27-90 times fewer parameters than standard low-rank adaptation techniques, while surpassing their performance across various tasks, including mathematical reasoning and language understanding.
Tags: #transformers #initialization #low-rank-fine-tuning #AI -
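For context, the low-rank adaptation setup this work builds on looks roughly like the sketch below: a frozen weight is augmented with a trainable rank-r product BA. This is a generic LoRA-style sketch, not the paper's initialization scheme (which instead initializes the factors so that the update approximates full fine-tuning); all dimensions are invented.
```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """y = x W^T + x (B A)^T : the base weight W is frozen, only the rank-r factors A and B train."""

    def __init__(self, d_in=512, d_out=512, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)            # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))    # zero init: adapter starts as a no-op
        # (The paper instead chooses the initialization of A and B to approximate full fine-tuning.)

    def forward(self, x):
        return self.base(x) + x @ (self.B @ self.A).T

layer = LowRankAdapter()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable} of {total}")
```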
TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training Paper: Link
Summary: This paper introduces TorchTitan, an open-source, PyTorch-native distributed training system designed to streamline the training of large language models (LLMs). TorchTitan unifies state-of-the-art techniques, enabling modular 3D parallelism with elastic scaling, and provides comprehensive tools for logging, checkpointing, and debugging. It incorporates hardware-software co-designed solutions, such as Float8 training and SymmetricMemory, to enhance efficiency. The system has been thoroughly assessed on the Llama 3.1 family of LLMs, demonstrating exceptional performance, modular composability, and elastic scalability.
Tags: #distributed-training #PyTorch #large-language-models #AI -
Hot-DoG Paper: Link
Summary: Hot-DoG provides a tuning-free stochastic optimization scheme for coreset Markov chain Monte Carlo, removing the need for manual learning-rate tuning when constructing Bayesian coresets.
Tags: #optimization #bayesian-optimization #frameworks
-
Titans: Learning to Memorize at Test Time
Paper: Link
Summary: This paper introduces Titans, a novel neural architecture that combines attention mechanisms with a neural long-term memory module to address the limitations of traditional Transformers in handling long sequences. While attention mechanisms function effectively as short-term memory by capturing direct dependencies within a fixed context window, they struggle with scalability due to quadratic time and memory complexity. Titans overcome this by incorporating a neural long-term memory module capable of memorizing historical context, thus enabling the model to utilize information from both recent and distant past efficiently. The proposed architecture supports fast, parallelizable training and inference. Experimental results demonstrate that Titans outperform traditional Transformers and modern linear recurrent models across various tasks, including language modeling, common-sense reasoning, genomics, and time series analysis. Notably, Titans can effectively scale to context window sizes exceeding 2 million tokens, maintaining high accuracy in scenarios requiring retrieval of specific information from vast data sequences.
Tags: #transformers #long-term-memory #sequence-modeling #AI #Titans -
Lifelong Learning of Large Language Model based Agents: A Roadmap
Paper: Link
Summary: This survey explores the integration of lifelong learning—also known as continual or incremental learning—into large language model (LLM)-based agents. It identifies three core components essential for enabling continuous adaptation: the perception module for multimodal input integration, the memory module for storing and retrieving evolving knowledge, and the action module for interacting with dynamic environments. The paper discusses strategies to mitigate catastrophic forgetting and enhance long-term performance, providing a comprehensive roadmap for developing LLM agents capable of adapting over time. Additional resources and literature are available at https://github.com/qianlima-lab/awesome-lifelong-llm-agent.
Tags: #lifelong-learning #LLM #continual-learning #AI #survey -
Superposition in Transformers: A Novel Way of Building Mixture of Experts Paper: Link
Summary: This paper addresses the challenge of catastrophic forgetting in large language models (LLMs) when adapting to new tasks or domains. The authors introduce a novel architecture that employs autoencoders to superimpose hidden representations of a base model and a fine-tuned model within a shared parameter space. By utilizing B-spline-based blending coefficients and adaptive autoencoders, this method effectively mitigates catastrophic forgetting, allowing the model to retain original capabilities while incorporating domain-specific expertise. Additionally, it supports dynamic switching between model states during inference, enhancing flexibility and performance.
Tags: #transformers #superposition #mixture-of-experts #catastrophic-forgetting #AI -
An Evolved Universal Transformer Memory Paper: Link
Summary: This paper introduces Neural Attention Memory Models (NAMMs), a learned network for memory management that enhances both the performance and efficiency of transformers. NAMMs provide distinct latent contexts, focusing on the most relevant information for individual layers and attention heads. They condition exclusively on the values in the produced attention matrices, making them universally applicable to any model utilizing self-attention. Training NAMMs on a small set of problems yields significant performance improvements across multiple long-context benchmarks while reducing the model's input contexts to a fraction of their original sizes.
Tags: #transformers #memory-management #self-attention #AI -
Star Attention: Efficient LLM Inference over Long Sequences Paper: Link
Summary: Star Attention introduces a two-phase block-sparse approximation to enhance the efficiency of Transformer-based Large Language Models (LLMs) during inference on long sequences. The first phase employs blockwise-local attention processed in parallel across multiple hosts, while the second phase allows query and response tokens to attend to all prior cached tokens through sequence-global attention. This method reduces memory requirements and inference time by up to 11x, maintaining 95-100% accuracy.
Tags: #LLM #attention-mechanism #inference-optimization #block-sparse-attention -
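A shape-level sketch of the two phases, written with PyTorch 2's scaled_dot_product_attention. It ignores attention heads and the paper's anchor-block construction, and the block size is arbitrary.
```python
import torch
import torch.nn.functional as F

def star_attention_sketch(q_ctx, k_ctx, v_ctx, q_query, block=256):
    """Phase 1: each context block attends only within itself (parallelizable across hosts).
    Phase 2: query tokens attend globally to all cached context tokens."""
    outs, k_cache, v_cache = [], [], []
    for start in range(0, q_ctx.size(1), block):
        sl = slice(start, start + block)
        outs.append(F.scaled_dot_product_attention(q_ctx[:, sl], k_ctx[:, sl], v_ctx[:, sl]))
        k_cache.append(k_ctx[:, sl])
        v_cache.append(v_ctx[:, sl])
    ctx_out = torch.cat(outs, dim=1)
    k_all, v_all = torch.cat(k_cache, dim=1), torch.cat(v_cache, dim=1)
    query_out = F.scaled_dot_product_attention(q_query, k_all, v_all)  # sequence-global attention
    return ctx_out, query_out

b, n_ctx, n_q, d = 1, 1024, 8, 64
q_ctx, k_ctx, v_ctx = (torch.randn(b, n_ctx, d) for _ in range(3))
q_query = torch.randn(b, n_q, d)
ctx_out, query_out = star_attention_sketch(q_ctx, k_ctx, v_ctx, q_query)
print(ctx_out.shape, query_out.shape)  # (1, 1024, 64) (1, 8, 64)
```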
Memory Layers at Scale Paper: Link
Summary: This paper introduces an enhanced memory layer for language models, demonstrating significant performance improvements, particularly in factual tasks. The proposed memory layer implementation is fully parallelizable and showcases scaling laws with up to 128 billion memory parameters, pretrained on 1 trillion tokens. The augmented models outperform both dense models with more than double the compute budget and mixture-of-expert models when matched for both compute and parameters.
Tags: #memory-augmentation #language-models #scaling-laws #AI -
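The underlying mechanism is a trainable key-value memory with sparse top-k lookup, roughly as in the sketch below. The paper scales this idea much further (product-key lookup, parallelized embedding shards); the slot count and dimensions here are tiny, arbitrary stand-ins.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMemoryLayer(nn.Module):
    """Sparse key-value memory: each token reads a top-k mixture of learned value slots."""

    def __init__(self, d_model=64, n_slots=1024, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.query_proj = nn.Linear(d_model, d_model)
        self.keys = nn.Parameter(torch.randn(n_slots, d_model) / d_model ** 0.5)
        self.values = nn.Embedding(n_slots, d_model)   # the large, sparsely accessed parameter store

    def forward(self, x):                              # x: (tokens, d_model)
        q = self.query_proj(x)
        scores = q @ self.keys.T                       # (tokens, n_slots)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)        # mix only the k selected slots
        selected = self.values(top_idx)                # (tokens, k, d_model)
        return (weights.unsqueeze(-1) * selected).sum(dim=1)

x = torch.randn(10, 64)
print(ToyMemoryLayer()(x).shape)  # torch.Size([10, 64])
```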
RESOLVE: Relational Reasoning with Symbolic and Object-Level Features Using Vector Symbolic Processing
Paper: Link
Summary: RESOLVE introduces a neuro-vector symbolic architecture that integrates object-level features with relational representations in high-dimensional spaces. Utilizing operations like bundling (summation) and binding (Hadamard product), it enables the coexistence of object-level and relational information without interference. The model features a novel attention mechanism operating in a bipolar high-dimensional space, facilitating efficient attention score computation. RESOLVE demonstrates improved generalizability and accuracy in tasks requiring both pure and partial relational reasoning, such as sorting and mathematical problem-solving, compared to existing methods.
Tags: #relational-reasoning #vector-symbolic-processing #neuro-symbolic-AI #AI
- Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks
Paper: Link
Summary: This paper introduces Cache-Augmented Generation (CAG) as an alternative to Retrieval-Augmented Generation (RAG) for enhancing language models. CAG involves preloading all relevant resources into a language model's extended context, eliminating the need for real-time retrieval during inference.
Tags: #cache-augmented-generation #language-models #memory-augmentation #AI
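A rough sketch of the cache-augmented idea, assuming a Hugging Face-style causal LM: the knowledge text is encoded once, its key-value cache is kept, and each query reuses that cache instead of retrieving at inference time. The model name and documents are placeholders, and cache-reuse details vary across transformers versions, so treat this as illustrative rather than the paper's code.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder: any causal LM with a long context window
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

documents = ["<document 1 text>", "<document 2 text>"]   # placeholder knowledge corpus
prefix_ids = tok("\n\n".join(documents), return_tensors="pt").input_ids

# One-time cost: run the knowledge prefix through the model and keep its key-value cache.
with torch.no_grad():
    kv_cache = model(prefix_ids, use_cache=True).past_key_values

# Per query: append the question and decode, reusing the cached prefix states (no retrieval step).
question_ids = tok("\n\nQuestion: ...\nAnswer:", return_tensors="pt").input_ids
full_ids = torch.cat([prefix_ids, question_ids], dim=-1)
output = model.generate(full_ids, past_key_values=kv_cache, max_new_tokens=128)
print(tok.decode(output[0, full_ids.shape[-1]:]))
```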
- LongWriter
Paper: Link
Summary: This paper addresses the limitation of current long-context large language models (LLMs) that struggle to generate outputs exceeding 2,000 words. The authors introduce AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, enabling LLMs to produce coherent outputs exceeding 20,000 words.
Tags: #long-context #LLM #text-generation #AI
-
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces Paper: Link
Summary: This paper investigates whether Multimodal Large Language Models (MLLMs), trained on extensive video datasets, can develop visual-spatial intelligence akin to humans. The authors introduce VSI-Bench, a novel benchmark comprising over 5,000 question-answer pairs designed to assess video-based visual-spatial intelligence.
Tags: #MLLM #visual-spatial-intelligence #VSI-Bench #cognitive-maps -
R-CoT Paper: Link
Summary: This paper introduces R-CoT, a two-stage Reverse Chain-of-Thought geometry problem generation pipeline designed to enhance the geometric reasoning capabilities of Large Multimodal Models (LMMs).
Tags: #R-CoT #geometric-reasoning #multimodal-models #AI
- Cosmos World Foundation Model Platform for Physical AI
Paper: Link
Summary: This paper introduces the Cosmos World Foundation Model Platform, designed to assist developers in creating customized world models for Physical AI systems. The platform offers a general-purpose world foundation model that can be fine-tuned for specific applications, facilitating the development of digital twins for both the AI agent and its environment. Key components include a video curation pipeline, pre-trained world foundation models, post-training examples, and video tokenizers. The platform is open-source, with models available under permissive licenses at https://github.com/NVIDIA/Cosmos.
Tags: #world-modeling #physical-ai #digital-twin #open-source #AI
- O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?
Paper: Link
Summary: This study critically examines current methods for replicating OpenAI's O1 model capabilities, focusing on the often undisclosed use of knowledge distillation techniques. The authors demonstrate that simple distillation from O1's API, combined with supervised fine-tuning, can achieve superior performance on complex mathematical reasoning tasks. Experiments reveal that a base model fine-tuned on tens of thousands of O1-distilled samples outperforms O1-preview on the American Invitational Mathematics Examination (AIME) with minimal technical complexity. Additionally, the study explores the generalization capabilities of O1-distilled models across diverse tasks, including hallucination, safety, and open-domain question answering. Notably, models trained solely on mathematical problem-solving data exhibited strong generalization to open-ended QA tasks and showed reduced susceptibility to sycophancy after fine-tuning. The authors advocate for transparency in AI research and caution against over-reliance on distillation approaches, emphasizing the importance of first-principles thinking in developing capable AI systems.
Tags: #O1-replication #knowledge-distillation #AI-research #transparency
-
MiniRAG: Towards Extremely Simple Retrieval-Augmented Generation
Paper: Link
Summary: This paper introduces MiniRAG, a novel Retrieval-Augmented Generation (RAG) system designed for efficiency and simplicity, particularly when deploying Small Language Models (SLMs). MiniRAG addresses performance degradation issues in existing RAG frameworks by implementing two key innovations: (1) a semantic-aware heterogeneous graph indexing mechanism that unifies text chunks and named entities, reducing dependence on complex semantic understanding; and (2) a lightweight topology-enhanced retrieval approach that leverages graph structures for efficient knowledge discovery without requiring advanced language capabilities. Experiments demonstrate that MiniRAG achieves performance comparable to Large Language Model (LLM)-based methods while using only 25% of the storage space. The implementation and datasets are open-sourced at https://github.com/HKUDS/MiniRAG.
Tags: #RAG #retrieval-augmented-generation #small-language-models #efficiency #AI -
Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models Paper: Link
Summary: Auto-RAG introduces an autonomous iterative retrieval model that leverages the decision-making capabilities of Large Language Models (LLMs) to enhance Retrieval-Augmented Generation (RAG). Unlike traditional methods that rely on few-shot prompting or manually constructed rules, Auto-RAG enables LLMs to engage in multi-turn dialogues with retrievers, systematically planning retrievals and refining queries to acquire valuable knowledge.
Tags: #Retrieval-Augmented-Generation #LLM #iterative-retrieval #AI -
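A schematic of the iterative retrieve-and-refine loop that Auto-RAG automates, with NumPy cosine similarity for retrieval. The `embed` and `llm` arguments are hypothetical callables standing in for an embedding model and a language model, and the prompt format and stopping rule are invented for illustration.
```python
import numpy as np

def top_k(query_vec, doc_vecs, docs, k=3):
    """Dense retrieval: rank documents by cosine similarity to the query embedding."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return [docs[i] for i in np.argsort(-sims)[:k]]

def auto_rag_loop(question, docs, doc_vecs, embed, llm, max_rounds=3):
    """Let the model itself decide whether to keep retrieving and how to refine the query."""
    query, evidence = question, []
    for _ in range(max_rounds):
        evidence += top_k(embed(query), doc_vecs, docs)
        decision = llm(
            f"Question: {question}\nEvidence so far: {evidence}\n"
            "Reply 'ANSWER: <answer>' if the evidence suffices, "
            "or 'QUERY: <refined query>' to retrieve more."
        )
        if decision.startswith("ANSWER:"):
            return decision[len("ANSWER:"):].strip()
        query = decision[len("QUERY:"):].strip()   # refined query for the next retrieval round
    return llm(f"Question: {question}\nEvidence: {evidence}\nAnswer:")
```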
MemoRAG Paper: Link
Summary: MemoRAG explores advanced techniques in retrieval-augmented generation for more effective information synthesis in AI systems.
Tags: #RAG #retrieval #AI -
KAG Paper: Link
Summary: This paper explores Knowledge Augmented Generation (KAG) as a method to enhance large language models (LLMs) for specific professional domains, focusing on the integration of domain-specific knowledge to improve accuracy and relevance in outputs.
Tags: #KAG #knowledge-augmentation #professional-domains #AI
-
Prereq-Tune Paper: Link
Summary: Prereq-Tune utilizes synthetic data to reduce hallucination in large language models, enhancing factual accuracy.
Tags: #hallucination #factuality #AI -
DecoPrompt: Decoding Prompts Reduces Hallucinations when Large Language Models Meet False Premises Paper: Link
Summary: DecoPrompt addresses the issue of hallucinations in large language models (LLMs) when confronted with false premises. The method leverages LLMs to "decode" misleading prompts without generating hallucinated outputs, effectively reducing inaccuracies.
Tags: #hallucination-reduction #LLM #false-premises #AI
-
Breaking the Low-Rank Dilemma of Linear Attention Paper: Link
Summary: This paper addresses the computational inefficiencies of the Softmax attention mechanism in Transformer models, particularly its quadratic complexity in vision applications. It introduces Rank-Augmented Linear Attention (RALA), which enhances the rank of linear attention's feature map, enabling it to model complex spatial information more effectively.
Tags: #linear-attention #transformer #vision-models #RALA #RAVLT -
LoLCATs: On Low-Rank Linearizing of Large Language Models Paper: Link
Summary: This paper introduces LoLCATs (Low-rank Linear Conversion via Attention Transfer), a method designed to enhance the efficiency of large language models (LLMs) by replacing their quadratic attention mechanisms with subquadratic linear attentions. The approach involves two key steps: training linear attentions to approximate the outputs of softmax attentions through attention transfer, and then applying low-rank adaptation (LoRA) to adjust for any approximation errors.
Tags: #linear-attention #LLM #LoRA #model-efficiency
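To show why linearization helps, here is a minimal single-head, non-causal comparison of softmax attention with kernel (linear) attention using the common ELU+1 feature map; this is a generic sketch, not the specific feature maps used by RALA or LoLCATs.
```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    """O(N^2) in sequence length: materializes the full N x N attention matrix."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    """O(N) in sequence length: phi(Q) (phi(K)^T V) never forms the N x N matrix."""
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1                 # positive feature map
    kv = phi_k.transpose(-2, -1) @ v                          # (d, d_v), independent of N
    normalizer = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps
    return (phi_q @ kv) / normalizer

n, d = 1024, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```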
-
Multi-Programming Language Sandbox for LLMs Paper: Link
Summary: This paper introduces MPLSandbox, an out-of-the-box framework designed to provide unified and comprehensive feedback from compiler and analysis tools for Large Language Models (LLMs). MPLSandbox automatically identifies the programming language of the code, compiling and executing it within an isolated sub-sandbox to ensure safety and stability. It integrates both traditional and LLM-based code analysis tools, offering a thorough examination of generated code. The framework can be seamlessly incorporated into the training and deployment of LLMs to enhance the quality and correctness of their code outputs, thereby streamlining workflows for various LLM-based code-related tasks and reducing development costs.
Tags: #LLM #code-analysis #sandbox #programming-languages -
Qwen2.5-Coder Technical Report Paper: Link
Summary: The Qwen2.5-Coder series is an advancement over its predecessor, CodeQwen1.5, featuring six models ranging from 0.5B to 32B parameters. Built upon the Qwen2.5 architecture, these code-specific models are pretrained on a vast corpus exceeding 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general and mathematical skills. Evaluations across more than ten benchmarks, including code generation, completion, reasoning, and repair, show that Qwen2.5-Coder consistently outperforms larger models, achieving state-of-the-art performance at each model size.
Tags: #code-generation #LLM #AI #Qwen2.5-Coder
-
Agent S - Uses computers like a human Paper: Link
Summary: Agent S introduces an open agentic framework that enables AI systems to utilize computers similarly to humans, enhancing task management and execution capabilities.
Tags: #agent #task-management #AI -
TaskGen Paper: Link
Summary: TaskGen presents a task-based, memory-infused agentic framework designed to improve AI's ability to manage and execute complex tasks efficiently.
Tags: #agent #task-management #AI
- Docling Technical Report
Paper: Link
Summary: This technical report introduces Docling, an open-source, MIT-licensed package designed for efficient PDF document conversion. Leveraging advanced AI models like DocLayNet for layout analysis and TableFormer for table structure recognition, Docling operates effectively on standard hardware with minimal resource requirements.
Tags: #document-conversion #PDF #open-source #AI
- Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
Paper: Link
Summary: Gaze-LLE addresses the challenge of gaze target estimation by leveraging features from a frozen DINOv2 encoder within a novel transformer framework. By extracting a unified feature representation for the scene and applying a person-specific positional prompt, the model streamlines gaze prediction without relying on complex, hand-crafted pipelines. This approach achieves state-of-the-art performance across multiple gaze benchmarks, demonstrating the effectiveness of utilizing large-scale learned encoders for this task.
Tags: #gaze-estimation #transformer #DINOv2 #AI
-
ROLO-SLAM: Rotation-Optimized LiDAR-Only SLAM in Uneven Terrain with Ground Vehicle
Paper: Link
Summary: This paper introduces ROLO-SLAM, a LiDAR-based SLAM method designed to enhance pose estimation accuracy for ground vehicles navigating rough terrains. Traditional LiDAR-based SLAM systems often experience significant pose estimation drifts, especially in the vertical direction, leading to distorted global maps. ROLO-SLAM addresses this issue by implementing forward location prediction to reduce discrepancies between consecutive scans, enabling precise determination of position and orientation. The method employs parallel spatial voxelization for correspondence matching and develops a spherical alignment-guided rotation registration within each voxel to estimate vehicle rotation. By incorporating geometric alignment and motion constraints into the optimization process, ROLO-SLAM achieves rapid and effective translation estimation. Keyframes are extracted to construct submaps, facilitating accurate pose estimation through scan-to-submap alignment. A global-scale factor graph is also established to mitigate cumulative errors. Experimental results demonstrate that ROLO-SLAM outperforms existing state-of-the-art LiDAR SLAM frameworks in pose estimation for ground vehicles.
Tags: #LiDAR #SLAM #pose-estimation #robotics #AI -
OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving Paper: Link
Summary: OpenEMMA is an open-source end-to-end framework that leverages Multimodal Large Language Models (MLLMs) to enhance autonomous driving systems. By incorporating Chain-of-Thought reasoning processes, OpenEMMA improves performance across various challenging driving scenarios, offering a more efficient and effective approach to autonomous driving.
Tags: #autonomous-driving #MLLM #open-source #AI -
RoboMatrix: A Skill-centric Hierarchical Framework for Scalable Robot Task Planning and Execution in Open-World Paper: Link
Summary: RoboMatrix introduces a skill-centric and hierarchical framework aimed at enhancing robot task planning and execution in open-world environments. By extracting common meta-skills from various complex tasks, the framework enables robots to dynamically combine these learned skills to address novel tasks. RoboMatrix is structured into three interconnected layers:
- Scheduling Layer: Utilizes a general Large Language Model (LLM) to decompose tasks and select appropriate skill models.
- Skill Layer: Houses a matrix of meta-skills learned through a unified Vision-Language-Action (VLA) model and hybrid models.
- Hardware Layer: Executes the selected skills on the physical robot.
This design facilitates dynamic task decomposition and skill arrangement, allowing robots to effectively perform new tasks by routing through the skill matrix. Experimental results demonstrate that RoboMatrix achieves remarkable generalization across novel objects, scenes, tasks, and embodiments.
Tags: #robotics #task-planning #hierarchical-framework #meta-skills
- SocialED: A Python Library for Social Event Detection
Paper: Link
Summary: SocialED is an open-source Python library designed to support social event detection (SED) tasks. It integrates 19 detection algorithms and 14 diverse datasets, providing a unified API with detailed documentation. The library offers modular components for preprocessing techniques like graph construction and tokenization, and includes standardized interfaces for training models and making predictions. By integrating popular deep learning frameworks, SocialED ensures high efficiency and scalability across both CPU and GPU environments. It adheres to high code quality standards, including unit testing and continuous integration, ensuring robust and maintainable software. SocialED is publicly available at https://github.com/RingBDStack/SocialED and can be installed via PyPI.
Tags: #python #social-event-detection #open-source #AI #library
-
Granite Guardian Paper: Link
Summary: Granite Guardian introduces a suite of models designed to detect risks in prompts and responses, facilitating the safe and responsible use of large language models (LLMs). These models cover multiple risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related risks such as context relevance, groundedness, and answer relevance for retrieval-augmented generation (RAG). Trained on a unique dataset that combines human annotations from diverse sources with synthetic data, Granite Guardian addresses risks often overlooked by traditional detection models. It achieves AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks, respectively, making it a generalizable and competitive model in this domain. The models are open-sourced to promote responsible AI development within the community. The code is available at https://github.com/ibm-granite/granite-guardian.
Tags: #risk-detection #LLM #AI-safety #open-source -
Best-of-N Jailbreaking Paper: Link
Summary: Best-of-N (BoN) Jailbreaking is a black-box algorithm designed to bypass safety measures in advanced AI systems across various modalities. It operates by generating multiple variations of a prompt through augmentations—such as random shuffling or capitalization for text prompts—until a harmful response is elicited. BoN Jailbreaking has demonstrated high attack success rates (ASRs) on closed-source language models, achieving 89% on GPT-4o and 78% on Claude 3.5 Sonnet with 10,000 augmented prompts. It is also effective against state-of-the-art open-source defenses like circuit breakers. The algorithm extends seamlessly to other modalities, successfully jailbreaking vision language models (VLMs) like GPT-4o and audio language models (ALMs) such as Gemini 1.5 Pro using modality-specific augmentations. Notably, BoN's effectiveness improves with an increased number of sampled prompts, following a power-law-like behavior across modalities. Combining BoN with other black-box algorithms, such as optimized prefix attacks, can further enhance its efficacy, achieving up to a 35% increase in ASR. This work highlights the susceptibility of language models to seemingly innocuous input modifications, which can be exploited across different modalities.
Tags: #jailbreaking #AI-safety #black-box-attack #multi-modal-AI -
garak: A Framework for Security Probing Large Language Models Paper: Link
Summary: garak (Generative AI Red-teaming and Assessment Kit) is an open-source framework designed to identify vulnerabilities in large language models (LLMs) and dialogue systems. It systematically probes LLMs to uncover potential security weaknesses, providing detailed reports that inform discussions on model alignment and deployment policies.
Tags: #LLM #security #vulnerability-assessment #AI -
AdvSecureNet Paper: Link
Summary: AdvSecureNet provides a Python toolkit for adversarial attack simulation in PyTorch, designed to test and improve AI robustness.
Tags: #security #pytorch #AI -
Neural Fingerprints for Adversarial Attack Detection Paper: Link
Summary: This paper introduces a method to detect adversarial attacks on AI models by leveraging neural fingerprints. The approach involves creating unique identifiers for neural networks, enabling the identification of unauthorized modifications or attacks.
Tags: #adversarial-attack #detection #neural-fingerprints #AI -
Trustworthy Federated Learning: Privacy, Security, and Beyond Paper: Link
Summary: This comprehensive survey examines the security and privacy challenges inherent in Federated Learning (FL). It highlights vulnerabilities in communication channels and potential cyber threats within decentralized networks. The authors discuss various defensive strategies to mitigate these risks, explore FL applications across different sectors, and propose future research directions to enhance the security and efficiency of FL systems.
Tags: #federated-learning #privacy #security #decentralized-networks
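For reference, the aggregation step that most of the surveyed attacks and defenses target is federated averaging, sketched below with plain PyTorch state dicts and clients weighted by local dataset size; this is a generic FedAvg sketch, not tied to any system in the survey.
```python
import torch

def fedavg(client_states, client_sizes):
    """Weighted average of client model parameters; the server never sees raw client data."""
    total = sum(client_sizes)
    return {
        name: sum(
            (size / total) * state[name].float()
            for state, size in zip(client_states, client_sizes)
        )
        for name in client_states[0]
    }

# Toy example: three clients with differently sized local datasets.
clients = [{"w": torch.randn(4, 4)} for _ in range(3)]
global_state = fedavg(clients, client_sizes=[100, 50, 10])
print(global_state["w"].shape)  # torch.Size([4, 4])
```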
-
IntellBot: Retrieval Augmented LLM Chatbot for Cyber Threat Knowledge Delivery Paper: Link
Summary: This paper introduces IntellBot, an advanced cybersecurity chatbot leveraging Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) techniques. IntellBot aggregates information from diverse sources to create a comprehensive knowledge base encompassing known vulnerabilities, recent cyberattacks, and emerging threats.
Tags: #cybersecurity #chatbot #LLM #RAG #threat-intelligence -
Analysing the Cultural Dimensions of Cybercriminal Groups: A Case Study on the Conti Ransomware Group Paper: Link
Summary: This study explores the cultural aspects of cybercriminal organizations, focusing on the Conti ransomware group. By analyzing leaked internal communications, the authors apply Hofstede's and Meyer's cultural dimension frameworks to understand the group's operational behaviors.
Tags: #cybersecurity #cultural-analysis #threat-intelligence #ransomware
-
Capture Artifacts for Deepfake Detection Paper: Link
Summary: This research introduces a method for detecting deepfakes by progressively disentangling and purifying blended identities, improving detection accuracy.
Tags: #deepfake #detection #AI -
AnomalyNCD Paper: Link
Summary: AnomalyNCD proposes a novel approach to anomaly detection, focusing on identifying new anomaly classes in AI systems.
Tags: #anomaly-detection #AI -
FairAdapter Paper: Link
Summary: FairAdapter addresses the challenge of detecting AI-generated images, which often exhibit inconsistencies across different content types due to overfitting in existing detection models. This framework enhances detection fairness by mitigating performance disparities across various image contents.
Tags: #AI-generated-images #detection-fairness #deep-learning #image-forensics
- NoiseAttack
Paper: Link
Summary: NoiseAttack discusses an evasive, sample-specific multi-domain attack strategy, highlighting vulnerabilities in AI systems.
Tags: #poisoning #security #AI
- CodeBreaker
Paper: Link
Summary: CodeBreaker presents an LLM-assisted, easy-to-trigger backdoor attack, emphasizing the need for robust security measures in AI.
Tags: #poisoning #security #AI
-
Human intelligence can safeguard against artificial intelligence Paper: Link
Summary: This study explores individual differences in discerning human-generated texts from AI-generated ones, highlighting the role of human intelligence in monitoring AI outputs.
Tags: #detection #monitoring #human-AI -
LLMSCAN Paper: Link
Summary: LLMSCAN introduces a causal scanning method to detect misbehavior in large language models, enhancing AI system reliability.
Tags: #detection #monitoring #AI -
From Imitation to Introspection Paper: Link
Summary: This paper explores the emergence of self-consciousness in language models, transitioning from mere imitation to introspective capabilities. It examines the extent to which these models can exhibit self-referential behaviors, delving into the philosophical implications of AI's evolving cognitive functions.
Tags: #self-consciousness #AI-philosophy #introspection #detection -
PARIS: A Practical, Adaptive Trace-Fetching and Real-Time Malicious Behavior Detection System Paper: Link
Summary: This paper introduces PARIS, a system designed for real-time detection of malicious behaviors on Windows platforms. Utilizing Event Tracing for Windows (ETW), PARIS adaptively monitors and collects data related to potentially malicious API calls and call stacks, significantly reducing data collection overhead. This efficiency enables the system to monitor a broader range of APIs, enhancing its ability to detect complex attack behaviors.
Tags: #malware-detection #real-time-monitoring #ETW #cybersecurity
-
Fair Summarization: Bridging Quality and Diversity in Extractive Summaries Paper: Link
Summary: This paper addresses the challenge of ensuring fairness in multi-document summarization, particularly when dealing with user-generated content from diverse social groups. The authors introduce two novel methods for fair extractive summarization: FairExtract, a clustering-based approach, and FairGPT, which leverages GPT-3.5-turbo with fairness constraints.
Tags: #fairness #summarization #extractive-summarization #GPT-3.5 -
Mitigating Bias in Queer Representation within Large Language Models: A Collaborative Agent Approach Paper: Link
Summary: This paper addresses biases in pronoun usage by large language models (LLMs), particularly the inappropriate application of gendered pronouns when inclusive language is necessary. The authors introduce a collaborative agent framework designed to detect and correct such biases, enhancing inclusivity in AI-generated content.
Tags: #bias-mitigation #queer-representation #LLM #inclusive-language
- From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Paper: Link
Summary: This paper provides a comprehensive survey of the emerging "LLM-as-a-judge" paradigm, where Large Language Models (LLMs) are utilized for scoring, ranking, or selection across various tasks and applications. It introduces a detailed taxonomy exploring three dimensions: what to judge, how to judge, and where to judge.
Tags: #LLM #AI-evaluation #survey #AI
-
OSWorld Summary: OSWorld provides a comprehensive benchmarking environment for evaluating AI performance across various operating systems.
Tags: #benchmark #testing #OS -
AndroidWorld Summary: AndroidWorld offers a specialized platform for testing and benchmarking AI applications within the Android ecosystem.
Tags: #benchmark #testing #OS -
WindowsAgentArena Summary: WindowsAgentArena serves as a testing ground for assessing AI agents' performance in Windows operating environments.
Tags: #benchmark #testing #OS
-
ShowUI: One Vision-Language-Action Model for GUI Visual Agent Paper: Link
Summary: ShowUI is a vision-language-action model designed to enhance human productivity by enabling AI agents to interact with graphical user interfaces (GUIs) in a manner similar to human perception. Unlike traditional language-based agents that depend on text-rich meta-information, ShowUI incorporates advanced UI-guided visual token selection and interleaved vision-language-action streaming.
Tags: #GUI-assistants #vision-language-action #AI #productivity -
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use Paper: Link
Summary: This study explores Claude 3.5 Computer Use, an AI model designed to function as a graphical user interface (GUI) agent. The authors curated a set of tasks across various domains and software to assess its capabilities.
Tags: #GUI-agent #AI #automation #Claude3.5
-
LLaMA-Factory Summary: LLaMA-Factory is an open-source framework for unified, efficient fine-tuning of a wide range of large language models, streamlining model customization and deployment.
Tags: #open-source #infrastructure #AI -
SWIFT Paper: Link
Summary: SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning) provides scalable, lightweight infrastructure for efficient fine-tuning and deployment of large models, designed to minimize resource consumption.
Tags: #open-source #infrastructure #AI
-
TangoFlux: Super Fast and Faithful Text-to-Audio Generation with Flow Matching and CLAP-Ranked Preference Optimization Paper: Link
Summary: TangoFlux is an efficient text-to-audio (TTA) generative model with 515 million parameters, capable of producing up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. It introduces CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. TangoFlux outperforms existing TTA models in both objective and subjective evaluations, providing superior audio quality with reduced inference times.
Tags: #text-to-audio #audio-generation #open-source #AI -
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching Paper: Link
Summary: F5-TTS introduces a fully non-autoregressive text-to-speech (TTS) system utilizing flow matching with a Diffusion Transformer (DiT). By padding text inputs to match the length of speech inputs and performing denoising for speech generation, it eliminates the need for complex components like duration models, text encoders, and phoneme alignment. Enhancements include the integration of ConvNeXt for refined text representation and the implementation of a Sway Sampling strategy during inference, which significantly boosts performance and efficiency.
Tags: #text-to-speech #flow-matching #Diffusion-Transformer #AI -
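Both TangoFlux and F5-TTS train with flow matching. A minimal, generic conditional flow-matching training step looks like the sketch below (straight-line interpolation path, velocity regression target); the real models operate on audio/mel latents with text conditioning, and this tiny network and its sizes are arbitrary stand-ins.
```python
import torch
import torch.nn as nn

dim = 32
velocity_net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
opt = torch.optim.Adam(velocity_net.parameters(), lr=1e-3)

def flow_matching_step(x1):
    """One step: regress the velocity that transports noise x0 to a data sample x1."""
    x0 = torch.randn_like(x1)                        # noise endpoint
    t = torch.rand(x1.size(0), 1)                    # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                       # point on the straight-line path
    target_velocity = x1 - x0                        # constant velocity of that path
    pred = velocity_net(torch.cat([xt, t], dim=-1))  # network conditioned on (x_t, t)
    loss = ((pred - target_velocity) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

batch = torch.randn(64, dim)                         # stand-in for audio latents
print(flow_matching_step(batch))
```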
Qwen2-Audio Summary: Qwen2-Audio is an advanced speech model designed to improve audio processing capabilities in AI systems.
Tags: #speech #audio #AI-models -
LLaMA-Omni Paper: Link
Summary: LLaMA-Omni offers seamless speech interaction with large language models, enhancing real-time communication capabilities.
Tags: #speech #audio #AI-models -
Moonshine Paper: Link
Summary: This paper introduces Moonshine, a family of speech recognition models optimized for live transcription and voice command processing. Moonshine is based on an encoder-decoder transformer architecture and employs Rotary Position Embedding (RoPE) instead of traditional absolute position embeddings.
Tags: #speech-recognition #live-transcription #voice-commands #AI
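Since the summary highlights Rotary Position Embedding, here is a compact generic RoPE sketch that rotates consecutive channel pairs by position-dependent angles. It is not Moonshine's code; the base of 10000 is just the commonly used default.
```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate consecutive channel pairs by position-dependent angles. x: (seq_len, dim), dim even."""
    seq_len, dim = x.shape
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (seq, 1)
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)   # (dim/2,)
    angles = positions * inv_freq                                              # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x_even * cos - x_odd * sin
    rotated[:, 1::2] = x_even * sin + x_odd * cos
    return rotated

q = torch.randn(16, 64)
print(apply_rope(q).shape)  # torch.Size([16, 64])
```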
-
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints Paper: Link
Summary: SynCamMaster introduces a plug-and-play module that enhances pre-trained text-to-video models for generating synchronized multi-camera videos from diverse viewpoints. It incorporates a multi-view synchronization module to maintain appearance and geometry consistency across different camera angles.
Tags: #video-generation #multi-camera #synchronization #AI -
HunyuanVideo: A Systematic Framework For Large Video Generative Models Paper: Link
Summary: HunyuanVideo is an open-source video foundation model that achieves performance in video generation comparable to, or surpassing, leading closed-source models. The framework integrates key components such as data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure for large-scale model training and inference.
Tags: #video-generation #open-source #AI #large-scale-models -
Open-Sora Plan: Open-Source Large Video Generation Model Paper: Link
Summary: The Open-Sora Plan introduces an open-source initiative aimed at developing a large-scale model capable of generating high-resolution, extended-duration videos based on diverse user inputs. The project encompasses several components integral to the video generation process, including Wavelet-Flow Variational Autoencoder and Joint Image-Video Skiparse Denoiser.
Tags: #video-generation #open-source #AI #Wavelet-Flow-VAE #Skiparse-Denoiser -
StableAnimator: High-Quality Identity-Preserving Human Image Animation Paper: Link
Summary: StableAnimator is an end-to-end video diffusion framework designed to synthesize high-quality human image animations while preserving identity (ID) consistency. It utilizes off-the-shelf extractors to compute image and face embeddings, refining them through a global content-aware Face Encoder.
Tags: #image-animation #identity-preservation #video-diffusion #AI -
Hallo2 Paper: Link
Summary: This paper introduces Hallo2, an advanced method for generating long-duration, high-resolution portrait image animations driven by audio inputs. The approach addresses challenges like appearance drift and temporal artifacts by implementing augmentation strategies within the image space of conditional motion frames.
Tags: #image-animation #audio-driven #high-resolution #long-duration
- FinRobot: AI Agent for Equity Research and Valuation with Large Language Models
Paper: Link
Summary: FinRobot introduces an AI agent framework tailored for equity research, integrating both quantitative and qualitative analyses to emulate human analyst reasoning. It comprises three specialized agents: Data-CoT Agent, Concept-CoT Agent, and Thesis-CoT Agent.
Tags: #AI-agent #equity-research #valuation #LLM
-
Reinforcement Learning for Quantum Tiq-Taq-Toe Paper: Link
Summary: This paper explores the application of reinforcement learning (RL) to Quantum Tiq-Taq-Toe, a quantum variant of the classic Tic-Tac-Toe game. The study highlights the challenges in representing quantum games classically due to partial observability and exponential state complexity.
Tags: #reinforcement-learning #quantum-computing #game-theory #AI -
Kinetix: Investigating the Training of General Agents through Open-Ended Physics-Based Control Tasks Paper: Link
Summary: Kinetix introduces an open-ended framework for training general reinforcement learning (RL) agents across a vast array of 2D physics-based tasks. By procedurally generating tens of millions of environments—ranging from robotic locomotion and grasping to video games—the framework enables agents to develop robust physical reasoning capabilities.
Tags: #reinforcement-learning #physics-based-tasks #generalization #AI
-
The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning Paper: Link
Summary: The Well is a comprehensive collection of datasets comprising numerical simulations across various spatiotemporal physical systems. It offers 15TB of data spanning 16 datasets, covering domains such as biological systems, fluid dynamics, acoustic scattering, and magneto-hydrodynamic simulations of extra-galactic fluids and supernova explosions. This resource aims to facilitate the development and evaluation of machine learning-based surrogate models in physics by providing a diverse range of physical behaviors.
Tags: #physics-simulations #machine-learning #datasets #open-source -
A Library for Learning Neural Operators Paper: Link
Summary: This paper introduces NeuralOperator, an open-source Python library designed to facilitate operator learning. Unlike traditional neural networks that map between finite-dimensional Euclidean spaces, neural operators are capable of learning mappings between function spaces, enabling them to handle inputs and outputs at various discretizations. Built on PyTorch, NeuralOperator offers tools for training and deploying neural operator models, as well as developing new ones, providing a user-friendly interface for both newcomers and experienced practitioners.
Tags: #operator-learning #neural-operators #open-source #Python-library -
Automating the Search for Artificial Life with Foundation Models Paper: Link
Summary: This paper introduces the Automated Search for Artificial Life (ASAL), a novel approach that leverages vision-language foundation models (FMs) to explore and discover lifelike simulations across various Artificial Life (ALife) substrates, including Boids, Particle Life, Game of Life, Lenia, and Neural Cellular Automata. ASAL automates the identification of simulations that produce target phenomena, generate temporally open-ended novelty, and illuminate a diverse range of interesting simulations.
Tags: #ArtificialLife #FoundationModels #SimulationDiscovery #AI
-
Quantum Computers Explained – Limits of Human Technology Video: Link
Summary: This educational video provides an accessible overview of quantum computing, exploring its fundamental principles and potential to revolutionize technology. It delves into the limitations of classical computers and illustrates how quantum mechanics can overcome these barriers, enabling unprecedented computational capabilities.
Tags: #quantum-computing #technology #education #computing -
Every attention head explained Source: YouTube
Summary: This educational video provides an in-depth explanation of attention heads within transformer models, a fundamental component in modern machine learning architectures. It covers the mechanics of attention mechanisms, their role in processing input data, and how they contribute to the performance of models in tasks like natural language processing and computer vision.
Tags: #attention-heads #transformer-models #machine-learning #AI -
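As a companion to the video, here is a compact multi-head self-attention layer in PyTorch showing where the per-head attention patterns come from; the head count and sizes are arbitrary.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Each head attends over the sequence with its own learned Q/K/V projections."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split channels into heads: (batch, heads, seq, d_head)
        q, k, v = (t.view(b, s, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = F.softmax(scores, dim=-1)               # one attention pattern per head
        ctx = (attn @ v).transpose(1, 2).reshape(b, s, -1)
        return self.out(ctx)

x = torch.randn(2, 10, 64)
print(MultiHeadSelfAttention()(x).shape)  # torch.Size([2, 10, 64])
```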
How CUDA Programming Works Source: YouTube
Summary: This YouTube video provides an overview of CUDA programming, explaining its functionality and applications in parallel computing.
Tags: #CUDA #computing
- Quantum Computing: Hype vs. Reality
Video: Link
Summary: This video critically examines the current state of quantum computing, distinguishing between the technology's realistic capabilities and the surrounding hype. It provides insights into the progress made in the field, the challenges that remain, and the practical implications of quantum computing advancements.
Tags: #quantum-computing #technology #analysis #reality-check
- The carbon emissions of writing and illustrating
Paper: Link
Summary: This study compares the carbon emissions associated with writing and illustrating tasks performed by AI systems versus humans. The findings indicate that AI systems produce significantly lower carbon emissions for these creative tasks.
Tags: #carbon-emissions #AI-efficiency #environment #sustainability