When moving to Grid Computing, parallel applications face several performance problems, The system characteristics are different in each execution and sometimes within the same execution. Remote resources share network links and in some cases, the processes share machines using per-core allocation. In such scenarios we propose to use automatic performance tuning techniques to help an application adapt itself: thus a system changes in order to overcome performance bottlenecks.
This thesis analyzes such problems of parallel application execution in Computational Grids, available tools for performance analysis and models to suit automatic dynamic tuning in such environments. From such an analysis, we propose system architecture for automatic dynamic tuning of parallel applications on computational Grids named GMATE. Its architecture includes several contributions. In cases where a Grid meta-scheduler decides application mapping, we propose two process tracking approaches that enable GMATE to locate where a Grid middleware maps application processes. One approach consists of the integration of GMATE components as Grid middleware. The other involves the need to embed a GMATE component inside application binaries. The first requires site administration privileges while the other increases the application binary which slows down application startup.
To obey organizational policies, all communications use the same application security certificates for authentication. The same communications are performed using Grid middleware API. That approach enables the monitoring and tuning process to adapt dynamically to organizational firewall restrictions and network usage policies.
To lower the communication needs of GMATE, we encapsulate part of the logic required to collect metrics and change application parameters in components that run inside the processing space. For metric collection, we create sensor components that reduce the communication by event inside the process space. Different from traditional instrumentation, sensors can postpone the metric communication and perform basic operations such as summarizations, timers, averages or threshold based metric generation. That reduces the communication requirements in cases where network bandwidth is expensive. We also encapsulate the modifications used to tune the application in components called actuators. Actuators may be installed at some point in the program flow execution and provide synchronization and low overhead control of application variables and function executions. As sensors and actuators can communicate with each other, we can perform simple tuning within process executions without the need for communication.
As the dynamic tuning is performance model-centric, we need a performance model that can be used on heterogeneous processors and network such Grid Systems. We propose a heuristic performance model to find the maximum number of workers and best grain size of a Master-Worker execution in such systems. We assume that some classes of application may be built capable of changing grain size at runtime and that change action can modify an application's compute-communication ratio. When users request a set of resources for a parallel execution, they may receive a multi-cluster configuration. The heuristic model allows for shrinking the set of resources without decreasing the application execution time. The idea is to reach the maximum number of workers the master can use, giving high priority to the faster ones.
When we change the number of workers in a Grid environment, we should perform changes in application parameters and in the system configuration. That is an example of multi-layer tuning. To grow or shrink the number of processors, we need to interact with Grid middleware in synchronization with application process reconfiguration. To accomplish that, we have actuators that interact with the Grid services.
We presented the results of the dynamic tuning of grain size and the number of workers in Master-Worker applications on Grid systems, lowering the total application execution time while raising system efficiency. We used the implementation of Matrix-Multiplication, N-Body and synthetic workloads to try out different compute-communication ratio changes in different grain size selections.
© 2001-2024 Fundación Dialnet · Todos los derechos reservados