Mahmoud El Chamie, Dylan Janak, Behçet Açikmese
Markov decision processes (MDPs) have been used to formulate many decision-making problems in science and engineering. The objective is to synthesize the best decision (action selection) policies to maximize expected rewards (or minimize costs) for a stochastic dynamical system. In this paper, we introduce a new type of sensor measurement to the MDP model that provides additional information about the stochastic process; this information can be incorporated into the decision policy to improve performance. The new model is tailored for environments with high uncertainty. The additional measurements provide more refined information on the possible state transitions in real time, before an action is taken. This new MDP model with sequential measurements is referred to as the sequentially-observed MDP (SO-MDP). We show that the SO-MDP shares some properties with a standard MDP: among randomized history-dependent policies, deterministic Markovian policies are still optimal. Because of the additional measurements, optimal SO-MDP policies produce better total rewards than standard MDP policies; however, computing these policies is more complex. We present two algorithms for solving the finite-horizon SO-MDP problem: the first is based on linear programming, and the second is based on dynamic programming. We show that the complexity of computing optimal policies for the SO-MDP model with perfect sensors is the same as for a standard MDP. Simulations demonstrate that the SO-MDP model outperforms the standard MDP model in the presence of high environmental uncertainty.
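For orientation, the standard finite-horizon MDP baseline that the abstract compares against can be solved by backward induction. The following is a minimal sketch of that textbook procedure only; it is not the SO-MDP algorithm from the paper, which the abstract does not specify. The function name `backward_induction`, the array layouts `P[a][s][s2]` and `R[s][a]`, the zero terminal value, and the two-state example are all assumptions made for illustration.

```python
def backward_induction(P, R, horizon):
    """Textbook finite-horizon dynamic programming for a standard MDP.

    P[a][s][s2]: probability of moving from state s to s2 under action a.
    R[s][a]:     immediate reward for taking action a in state s.
    Returns (V, policy): V[s] is the optimal expected total reward from s at
    the first stage, and policy[t][s] is the action chosen at stage t.
    """
    n_states = len(R)
    n_actions = len(R[0])
    V = [0.0] * n_states  # assumption: zero terminal reward
    policy = []
    for _ in range(horizon):  # sweep from the last stage back to the first
        # Q[s][a] = immediate reward + expected value of the next state
        Q = [[R[s][a] + sum(P[a][s][s2] * V[s2] for s2 in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
        # Deterministic Markovian choice: the best action for each state
        step_policy = [max(range(n_actions), key=lambda a, s=s: Q[s][a])
                       for s in range(n_states)]
        V = [Q[s][step_policy[s]] for s in range(n_states)]
        policy.append(step_policy)
    policy.reverse()  # stages were computed last to first
    return V, policy


# Hypothetical two-state example: action 0 stays put, action 1 switches state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: switch
]
R = [
    [0, 0],  # state 0 earns nothing
    [1, 0],  # staying in state 1 earns 1 per stage
]
V, policy = backward_induction(P, R, horizon=3)
print(V, policy[0])
```

In this toy problem the first-stage policy switches out of state 0 and stays in state 1, which illustrates the deterministic Markovian structure the abstract refers to.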