Abstract
Large Language Models (LLMs) have undergone remarkable growth in capability over recent years,
evolving from text-generation systems into autonomous agents capable of reasoning, planning, and executing complex tasks across diverse domains through the use of external tools, memory,
and multi-step interactions with their environment. However, this progress and widespread adoption
have also expanded the associated risks and opportunities for misuse. As LLMs become increasingly integrated
into real-world systems, ensuring their safety, robustness, and trustworthiness has emerged as a critical
challenge.
This thesis investigates the trustworthiness of LLM systems, with a particular focus on their safety,
reliability, and robustness across multilingual, resource-constrained, and multi-agent settings. First, we examine the fragility of AI-generated text detection
methods in low-resource non-English settings, specifically Hindi. While existing detection
methods have largely focused on English, the rapid adoption of LLMs across multilingual settings raises
important concerns regarding the reliability of current detection frameworks beyond their original training distributions. To address this gap, we present a comprehensive study on Hindi AI-generated news
detection, introducing AGhi, a large-scale benchmark dataset consisting of human-written and LLM-generated Hindi news articles, and proposing the AI-Detectability Index for Hindi (ADIhi). Our findings
show that current methods fail to generalize effectively to non-English settings, revealing significant
limitations in the multilingual robustness of AI-generated text detection systems.
Second, we evaluate the ability of Small Language Models (SLMs) to perform function selection and generation, a critical capability for agentic systems. While large frontier models achieve strong
function-calling performance, their deployment is often impractical in resource-constrained or privacy-sensitive
environments. This work presents a systematic evaluation of multiple SLMs on large-scale function generation tasks under zero-shot, few-shot, and fine-tuned settings, including adversarial prompt injection
scenarios. The findings show that many smaller models continue to struggle with consistent function
generation despite improvements from fine-tuning.
Third, we investigate the safety and robustness of multi-agent LLM systems, where multiple agents
collaborate to solve complex tasks through communication and coordination. Existing safety evaluations primarily assume single-agent settings and therefore fail to capture the emergent risks introduced
by inter-agent interactions. To address this limitation, we introduce TAMAS (Threats and Attacks in
Multi-Agent Systems), a benchmark specifically designed to evaluate adversarial threats in multi-agent
LLM systems across diverse domains and attack scenarios. The benchmark consists of six attack types spanning five high-stakes domains, and reveals the failure modes of multi-agent systems under adversarial settings. The findings show that multi-agent systems inherit many vulnerabilities from single-agent
setups while also introducing new risks that emerge specifically from inter-agent communication and
coordination.
Collectively, this thesis highlights that the trustworthiness of modern LLM systems cannot be measured solely through improvements in capability or benchmark performance. As these systems become
increasingly integrated into real-world applications, they must be evaluated not only for what they can
achieve under ideal conditions, but also for how reliably and safely they behave under challenging,
adversarial, and diverse deployment settings.
By examining multiple dimensions of trustworthiness across different levels of the LLM ecosystem,
this thesis emphasizes the importance of rigorous evaluation methodologies for identifying hidden failure modes and advancing the development of safer, more robust, and dependable AI systems. In support
of this goal, the thesis contributes novel datasets, benchmarks, metrics, and evaluation frameworks for
systematically studying reliability and safety across diverse real-world deployment scenarios.