Abstract
Efficient traffic signal control is critical for reducing urban congestion, yet traditional rule-based and machine learning approaches fail to adapt to dynamic conditions. Reinforcement Learning (RL) provides a data-driven alternative. This study evaluates four classical exploration strategies, namely Epsilon-Greedy, Thompson Sampling, Upper Confidence Bound (UCB), and Softmax, alongside hybrid variants including Softmax with Temperature Annealing (SMX-TA), Softmax with Entropy Regularization (SMX-ER), and their combinations with UCB and Thompson Sampling.
Our framework builds on value-based RL (Double Deep Q-Networks), chosen for its scalability and stability compared to on-policy methods such as PPO and A2C, and is implemented with parallel simulation in SUMO and CityFlow across synthetic grids (1 × 1, 4 × 4, 6 × 6) and real-world datasets (Hyderabad, New York). Benchmarking against classical baselines (FixedTime, MaxPressure) and recent RL models (FRAP, CoLight, GCN) shows that the proposed SMX-ER+UCB consistently yields the lowest average travel times and adapts robustly across traffic conditions. An ablation study confirms the complementary benefits of entropy regularization and UCB, while integration into scalable controllers such as CoLight demonstrates network-level generalization. These results highlight hybrid exploration as an effective and practical strategy for real-time, city-wide traffic optimization. Such large-scale optimization requires high-performance computing (HPC) to support parallel simulations, GPU-accelerated training, and multi-agent coordination, underscoring the necessity of HPC frameworks for intelligent transportation systems.