Improving the Performance of Service-based Applications by Dynamic
Service Execution
                         Hong Liu, Xiaoning Wang, Tian Luo, Xiaosong Li, Wei Li
        Institute of Computing Technology of Chinese Academy of Sciences, Beijing, China
             Graduate School of the Chinese Academy of Sciences, Beijing, P.R. China
                      {liuh, wangxiaoning, lixiaosong, luotian, liwei}@ict.ac.cn
                       Abstract

   Distributed applications are increasingly being built from Web services. This paper explores a dynamic service execution technique and, based on this technique, designs and implements a virtual machine-based runtime environment called the Abacus Virtual Machine (AVM). With the help of the stateless property of Web services and a runtime dependence analysis technique, AVM determines the data dependences of a service invocation and executes independent service invocations concurrently. In this way, it is able to exploit the parallelism of service-based applications written in a sequential programming model, which is the conventional model for developers. At the same time, AVM must balance the goal of exploiting parallelism against the cost brought by dynamic service execution. Our experiments show that the cost is no more than 9% of CPU time and 10% of memory. We believe this penalty is acceptable for the benefits that AVM provides.

1. Introduction

   Web Services [12], designed to support interoperable machine-to-machine interaction over a network, have become an important enabling technology for distributed systems. Service-based applications are built from services that have a network-addressable interface and communicate via standard protocols and data formats.
   Web services in a service-based application are distributed across different hosts, and the processes of these services are inherently able to execute in parallel. In current Web service programming models, a service often needs to interact with one or more other services to complete its functionality. Therefore, the response time of a service can be reduced by concurrently executing the invocations inside this service. Furthermore, overlapping invocations can also tolerate the latency brought by message passing between sites.
   Early service-based development and hosting environments, such as Apache Axis [2], offered only synchronous interactions, primarily over HTTP/SOAP. Current service-based environments, such as Apache Axis2 [3], support asynchronous Web services and asynchronous invocations using non-blocking clients and transports, which greatly relieves developers from writing complicated concurrent code to implement simultaneous service invocations. However, developers still have to explicitly manage the control flow of concurrent invocations. As an application's complexity and scale increase, it becomes hard to optimize its performance by manually managing control flows. In fact, if the hosting environment could automatically execute service invocations concurrently, writing service-based applications would become easier and, at the same time, application performance could be improved. However, little effort has been made to offer implicit simultaneous service invocations in most current service-based environments.
   Furthermore, by the W3C's standard definition [19], a Web service is a stateless entity that can perform specific functions. This feature makes service computing different from other distributed systems, such as distributed object systems, whose entities are stateful by default. Although in some Web services architectures [6][8] state is a matter of concern, the hosting environments of service-based applications are capable of knowing whether a service is stateful or stateless. In other distributed systems, for example distributed object systems, it is hard to analyze whether an invocation of a method will change the global state of the application.
   This paper presents a dynamic service execution technique that leverages the stateless property of Web services to improve the performance of service-based applications implicitly. Based on this approach, we design and implement a distributed virtual machine called the Abacus Virtual Machine (AVM). Using AVM, developers can write service-based programs in the conventional sequential programming model. At the same time, AVM dynamically schedules service executions concurrently to improve performance while maintaining the correct data flow of the application.
   The rest of the paper is organized as follows. Section 2 further introduces the motivation of our work. Details of the design and implementation are presented in Section 3. Section 4 evaluates our system. Sections 5 and 6 provide related work and concluding remarks, respectively.
2. Motivation

   Taking advantage of parallelism is one of the most important methods for improving the performance of distributed applications. In current Web service-based applications, a service is a software agent distributed in a network [19]. Therefore, a service's operations are inherently capable of being processed in parallel, and the key is how to make service invocations execute concurrently at the caller side. Currently, the most common approach is writing concurrent programs using techniques such as multithreaded or event-driven programming. That is, developers must manually analyze which services can be invoked simultaneously and explicitly write code to invoke services in parallel. Some applications, especially the so-called "embarrassingly parallel" applications, fit the concurrent programming model nicely. However, for applications that are not well structured and regular, concurrent programming is usually a burdensome and error-prone job. As Sutter and Larus observe [7], "Humans are quickly overwhelmed by concurrency and find it much more difficult to reason about concurrent than sequential code. Even careful people miss possible interleavings among even simple collections of partially ordered operations". Consequently, a natural conclusion follows: "Concurrency is fundamentally hard; avoid whenever possible" [10].
   How can we avoid concurrent programming while still making the modules of a program run concurrently? One possible solution [13][16] is to automatically exploit the parallelism in sequential programs with compilers. However, these automatic techniques mainly focus on "embarrassingly parallel" programs rather than general-purpose programs.

         Figure 1. The Operation Sequence of the Travel Agent Web Service

   Another approach is exploiting parallelism via the run-time system. For example, in computer hardware design, a processor can execute several instructions in a pipeline. The success of the run-time approach depends on techniques that solve two problems: first, how to determine the parts of a program that can be executed concurrently at runtime; second, how to determine the dependences between concurrent units. For example, in a CPU, the concurrent part is an instruction, and the CPU is able to check the dependences between instructions.
   Therefore, in order to support automatic parallelism, we must solve the above two problems for service-based applications. We now introduce a use case, the travel agent application provided by the W3C [23], to explain the possibility of a successful run-time approach.
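To make the burden of the manual approach concrete, the following sketch shows what developers must write by hand today: they decide which invocations are independent, issue them on a thread pool, and code the joins explicitly. This is an illustrative Java sketch with made-up service names and a stand-in for remote latency, not code from the paper's system.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ManualConcurrency {
    // Stand-in for a remote service invocation with network latency.
    static int callService(String name) {
        return name.length();
    }

    public static int bookTrip() {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            // The developer, not the runtime, decides these three are independent.
            Future<Integer> flight = pool.submit(() -> callService("flight"));
            Future<Integer> hotel  = pool.submit(() -> callService("hotel"));
            Future<Integer> car    = pool.submit(() -> callService("car"));
            // Explicit synchronization: the dependent step must wait here.
            return flight.get() + hotel.get() + car.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(bookTrip()); // 6 + 5 + 3 = 14
    }
}
```

As the application grows, this hand-written dependence analysis and join logic is exactly what a run-time approach would take over.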
   Figure 1 shows the operation sequence of the travel agent Web service. In this example, OP1, OP2 and OP3 can be executed concurrently. Because of data dependences, OP4 and OP5 cannot be processed until OP1, OP2 and OP3 have finished. But OP6, OP7 and OP8 can continue without the return of OP1, OP2 and OP3. Similarly, OP9 and OP10 cannot be executed until OP6, OP7 and OP8 have returned. OP11, OP12 and OP13 also have operations on which they depend.
   We can see that in a service-based application, the first problem is rather easy: a parallel part is exactly a service invocation. The difficulty lies in determining whether an invocation is able to be executed. However, with a run-time data dependence analysis technique, the run-time environment is able to make this decision. For example, OP12 depends on the reserved ID and the authorized ID. Therefore, the runtime environment must check whether OP10 and OP11 have finished and decide whether OP12 can be executed.
   In this use case, in order to automatically parallelize the execution of service-based applications, a run-time environment should first be able to determine the parallel granularity for a service-based application. Next, it should be able to analyze the dependences of the parallel part, which in this case is a service operation. Finally, when the program is running, it should be able to dynamically execute independent services simultaneously. These issues are the most important considerations in the design and implementation of the Abacus Virtual Machine.

3. Design and Implementation of AVM

   In this section, we introduce the design and implementation of the Abacus Virtual Machine, a run-time environment for service-based applications that supports implicit parallelism via a dynamic service execution engine.

3.1 Design Objectives of AVM

   We have distilled the following design objectives, central to the success of our work:
   • Sequential programming model for developers: Developers just write sequential programs and need not explicitly write concurrent programs. The conventional sequential programming model is maintained.
   • Implicit parallelism at the service level: For a sequential service-based application, at runtime, AVM automatically detects the data dependences between service invocations and executes independent services concurrently.
   • Reducing cost: Run-time parallelism detection and dynamic service execution bring extra processing and memory costs. So another important design goal is to reduce the costs brought by AVM while delivering performance benefits to service-based applications.

3.2 Design Principles

   In Section 2, we summarized three key issues for an automatically concurrent run-time environment. First, it must automatically determine the concurrent granularity; second, it must be capable of detecting whether a concurrent unit can be executed; finally, it must execute independent service invocations simultaneously.
   For the first issue, AVM adopts the service invocation as the basic unit of parallelism. As mentioned before, the stateless property of Web services is the key to implementing implicit parallelism for service-based applications. Therefore, AVM needs mechanisms to recognize which entities are services. In AVM, the service is a fundamental abstraction, and operations on services, such as creation and invocation, are virtualized into normal instructions of the virtual machine. The virtual machine simply regards a service invocation as an ordinary operation that has a long and uncertain latency.
   For the second issue, the dependence analysis module of AVM can detect the dependences between instructions. If instruction i depends on a long-running instruction j, which is usually a service invocation instruction, AVM will delay the execution of i until j has finished. Using an appropriate dependence analysis policy, the virtual machine can guarantee the correctness of the data flow.
   For the third issue, as mentioned in Section 2, the dynamic execution engine of AVM can execute two or more service invocations simultaneously. Whether a service invocation depends on another instruction is determined by the dependence analysis module.

3.3 Overview of AVM's Implementation

   AVM is a language-level virtual machine providing an execution environment for service-based applications. In AVM, the service is a fundamental machine abstraction, conforming to the Web services definition provided by the W3C [19], and the operations on services are abstracted into instructions of AVM.
   On top of AVM, we provide a service-oriented programming language called Abacus, which is designed as a language-level development tool. Details of the language can be found in [20]. The core concept of Abacus is the service construct; in fact, a service is a first-class object in the Abacus language.
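The scheduling principle of Section 3.2 — issue invocations asynchronously and delay a dependent instruction until every invocation it reads from has completed — can be sketched as follows. This is a hypothetical Java illustration using CompletableFuture; AVM itself applies the idea at the level of its own virtual machine instructions, not through a library.

```java
import java.util.concurrent.CompletableFuture;

public class DependenceDelay {
    // Stand-in for an asynchronous service invocation: returns at once,
    // the result materializes later on another thread.
    static CompletableFuture<Integer> invoke(int value) {
        return CompletableFuture.supplyAsync(() -> value * 2);
    }

    public static int run() {
        CompletableFuture<Integer> opA = invoke(5); // independent: overlaps
        CompletableFuture<Integer> opB = invoke(8); // independent: overlaps
        // Dependent "instruction": scheduled only after opA and opB finish,
        // while the two invocations themselves execute concurrently.
        return opA.thenCombine(opB, Integer::sum).join(); // (5*2)+(8*2)
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 26
    }
}
```

The point of AVM is that the programmer writes only the sequential equivalent; the runtime derives this dataflow structure automatically.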
Due to the intrinsic loosely-coupled nature of a service, Abacus does not support the inheritance and polymorphism features of object-oriented programming languages. Developers write Abacus source code and compile it to bytecode composed of AVM instructions.
   AVM is an innovative hosting environment for service-based applications, but that does not mean we had to build it from scratch. Since it is also a process-level virtual machine, the design of AVM's instructions refers to a more popular process-level virtual machine, the JVM. To support service operations, AVM provides a set of service-related instructions, each of which indicates a general operation on services. In the AVM prototype, we mainly take three operations into account: deploying, invoking and destroying a service. In addition, AVM is built on top of the freely available Kaffe virtual machine [14]; further information on AVM can be found on our website [1].

         Figure 2. The AVM System Overview

   As shown in Figure 2, the AVM system is mainly composed of three parts: the dynamic execution engine, the dependence analysis module and the embedded service container. The embedded container has two major functions. One is to host the services of AVM: when a request arrives, it creates an execution thread to process the request. The other is to provide an asynchronous service invocation mechanism: when an invocation instruction is executed, AVM sends the message to the destination service and returns immediately; when the response comes back, AVM receives the response message and notifies the corresponding thread. Asynchronous service invocation is the foundation of the dynamic execution engine. Due to their importance in this paper, the details of the dynamic execution engine and the dependence analysis module are introduced separately in the following sections.

3.4 Dynamic Execution Engine

   The dynamic execution engine is the core of the AVM system. Like other process-level virtual machines, AVM executes the instruction flow of an application and maintains run-time data structures such as the PC register, heap and stack. AVM has a stack-based architecture, and every instruction operates on data on the stack.
   What distinguishes AVM's execution engine from those of other virtual machines is that the dynamic execution engine is able to rearrange instruction execution. Since a service invocation brings a long and uncertain latency, the dynamic execution engine takes advantage of an asynchronous invocation mechanism to make the invocation instruction return at once. If an instruction depends on data returned by a former invocation instruction that has not yet completed, the execution engine issues the next instruction instead. In this way, AVM allows the overlapping execution of two or more invocation instructions.
   The idea behind the dynamic execution engine is similar to that behind dynamically scheduled processors. It likewise implies out-of-order instruction execution, which introduces the possibility of WAR and WAW hazards. Like the sophisticated dynamic scheduling techniques used in processor design, for example Tomasulo's approach [25], AVM uses software reservation stations to avoid these hazards.

3.5 Dependence Analysis Module

   There are two main goals of the dependence analysis module. First, it must ensure that the application's data flow will not be destroyed by instruction scheduling; second, with the help of the compiler, it tries its best to determine when an invocation instruction is permitted to execute.
   To fulfill the first goal, AVM must analyze the data dependence and the control dependence of every instruction and decide whether it can be executed. This technique has been widely used in processor architecture, and we implement a software version. However, the question is: whereas dependence analysis in hardware executes in the pipeline, can dependence analysis in software be efficient enough not to impair performance? The answers are as follows. First, unlike hardware dependence analysis, in AVM the dependence analysis happens only when the reservation stations are not empty, which means that some invocations' results have not returned yet; indeed, the dependence analysis overlaps the execution of those invocation instructions. Second, for most instructions, the dependence analysis costs only a few CPU cycles and does not seem to be a problem in practice. The results of our experiments validate this analysis.
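A minimal sketch of the software reservation-station idea described above: each pending long-latency invocation registers the location it will write, an instruction that reads such a location is delayed until its producer completes, and finished entries are retired so that the dependence check only runs while the station is non-empty. All names and data structures here are illustrative assumptions, not AVM's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

public class ReservationSketch {
    // Reservation station: maps each slot to the pending producer that
    // will eventually write it.
    private final Map<String, CompletableFuture<Integer>> pending = new HashMap<>();

    // Issue an invocation that will eventually write `slot`; returns at once.
    void issueInvoke(String slot, int value) {
        pending.put(slot, CompletableFuture.supplyAsync(() -> value));
    }

    // A dependent instruction reads `slot`: delayed until the producer is done.
    int read(String slot) {
        CompletableFuture<Integer> producer = pending.get(slot);
        return producer == null ? 0 : producer.join();
    }

    // Retire finished entries; analysis cost is paid only while non-empty,
    // mirroring the paper's observation about dependence-check overhead.
    boolean stationEmpty() {
        pending.values().removeIf(CompletableFuture::isDone);
        return pending.isEmpty();
    }

    public static int demo() {
        ReservationSketch rs = new ReservationSketch();
        rs.issueInvoke("r1", 21);  // long-latency service invocation
        rs.issueInvoke("r2", 21);  // independent: overlaps with r1
        int sum = rs.read("r1") + rs.read("r2"); // dependent step waits here
        return rs.stationEmpty() ? sum : -1;     // both producers retired
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 42
    }
}
```

A real implementation would also rename write targets so that WAR and WAW hazards, as in Tomasulo's scheme, cannot arise.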
   For the invoke instruction, AVM not only checks its data dependences but also checks whether it will change the state of the application. In AVM, it is assumed that the virtual machine is able to know whether a service will use or change data not contained in its input message. A service that does not use data outside its input message is called a stateless service; a contrary service is called a stateful service.
   Because AVM does not know which state a called service operation will change, the default dependence analysis policy is that if an invocation of a stateful service has not completed, AVM will not execute any other invocation instructions. With the help of the Abacus compiler, AVM is able to know whether an operation will change other services' state. Therefore, it is safe to execute two instructions, or two invoke instructions, simultaneously if they will not change each other's state and will not invoke an identical service.

4. Evaluations

   We listed three design objectives in Section 3.1. The first one is a functional objective. In this section, we use a set of experiments to measure whether our system achieves the last two objectives.

4.1 Evaluation Programs

   NGB (NAS Grid Benchmarks) [15] is a benchmark suite for grid computing. It evolved from NPB (NAS Parallel Benchmarks) [5], which is widely used on parallel computing systems. In NGB, there are five basic tasks (i.e. BT, FT, LU, MG, SP) representing various types of scientific computation, and these tasks construct four problems representing four typical classes of grid applications, named respectively Embarrassingly Distributed (ED), Helical Chain (HC), Visualization Pipe (VP), and Mixed Bag (MB). ED requires no communication, and all tasks are executed independently; it tests the basic functionality of grids and does not tax their communication performance. HC is totally sequential; hence, any time spent on communicating data between two tasks is fully exposed. VP requires specifying concurrent execution of tasks with nontrivial dependences; it allows pipelining and overlapping communication times with computational work. MB is similar to VP, but its tasks have different computational work and control flow.
   In our implementation of the NGB programs, every task is a service written in Abacus. We also wrote an NGBServer service; clients can invoke a method of the NGBServer service to run a specified problem. Appendix 1 gives the Abacus code of the MB problem.

4.2 Test Environments

   All measurements reported in this section were performed on a cluster consisting of 9 PC servers with dual AMD Opteron 248 processors and 2GB of RAM. The servers are connected by 1000Mb Ethernet. The operating system on these servers is Redhat Linux with kernel 2.4.21.
   The four types of problems have different control flow and data dependences between tasks. Only problem ED is able to make use of all machines, since its tasks have no dependence on one another. HC represents a sequential problem and cannot be executed in parallel. The degree of parallelism of VP and MB is in the middle; they run at most three tasks simultaneously, so we assigned three servers to run these two problems.

4.3 NGB Test and Result Analysis

   In order to test AVM's ability to automatically exploit parallelism in programs and the overhead brought by our implementation, we developed two versions of the execution engine for AVM: one supports sequential service execution and the other supports dynamic service execution.
   We also wrote a sequential NGB program and a multi-threaded NGB program. For the multi-threaded program, we carefully wrote the service invocation and synchronization code to exploit the maximal parallelism. Therefore, the multi-threaded NGB program can be regarded as the ideal result.

4.3.1 Performance Comparison

   In our experiments, we ran the sequential NGB program both on the sequential execution engine and on the dynamic execution engine, and the multi-threaded NGB program on the sequential execution engine, to evaluate our approach. The tests were run repeatedly, and the absolute deviation of each result never exceeds 1 second. The average results are shown in Figure 3, in which S-NGB stands for the sequential version of NGB, T-NGB for the threaded version of NGB, SEE for the sequential execution engine and DEE for the dynamic execution engine.
     results of S-NGB on SEE VM. Due to the dynamic                     by dynamic execution engine is acceptable(less than
     execution engine, the DEE VM has great performance                 3% of CPU time). We still need to test the extra
     improvement and exploits nearly the same degree of                 resources consumed such as CPU time and memory.
     parallelism as the ideal result in our evaluation                     First, we used UNIX command “time”to measure the
     environment.                                                       CPU time of the NGB program. The goal of this test is to
         Considering the overhead of thread, it is not ama zing         reveal the extra CPU cycle used in dynamic execution
     that the execution time of ED and HC on DEE VM is a                engine than the sequential execution engine.
     little shorter than that of T-NGB on SEE VM. The data              Theoretically, the CPU time of thread-version NGB
     dependences between tasks of those two problems are                program is the CPU time of sequential-version NGB
     relative simple and just cost negligible CPU time to               program adds the CPU time of thread-related operation.
     detect them in dynamic execution engine.                           So we only compare the results of sequential-version of
                                                                        NGB program on DEE VM and those of SEE VM.
      800                                                                25
      700
                                                                         20
      600
                                                                         15                                                                             User time
      500                                                                                                                                               Sys time
                                                      S-NGB on SEE VM    10                                                                             Total time
      400                                             T-NGB on SEE VM
                                                      S-NGB on DEE VM
      300                                                                 5
      200
                                                                          0
                                                                              SEE        DEE   SEE        DEE   SEE        DEE        SEE        DEE
      100
                                                                                    ED               HC               VP                    MB
        0
             ED       HC         VP          MB                                   Figure 4. CPU Time Comparison
            Figure 3. Performance Comparison
                                                                          Second, we measure the maximal memory cost of the
                                                                        NGB program in both DEE VM and SEE VM.
          Table 1. Speedup of NGB Programs
       S-NGB on T-NGB on SEE VM    S-NGB on DEE                          250000
       SEE VM                          VM                                200000
        Execution    Execution        Speed       Execution     Spee     150000                                                                  DEE VM Mem(k)
        Time         Time             up          Time          d up     100000                                                                  SEE VM Mem(k)
                                                                          50000
ED      670.576      79.34            8.45        79.045        8.48
                                                                              0
HC      103.49       106.941          0.97        104.924       0.99
VP      540.033      353.161          1.53        356.385       1.52
MB      348.92       133.735          2.61        137.111       2.54

   For programs that have complex internal control flow, such as VP and MB, the execution time of the DEE VM is about 1 to 2 percent more than T-NGB's execution time. The thread approach benefits from the manual configuration of concurrency, while our approach must exploit parallelism at run time instead of executing a pre-given configuration.

4.3.2 Overhead Analysis
   The previous test shows that the dynamic execution engine is able to exploit a satisfactory degree of parallelism; we now examine the overhead this brings.

Figure 5. Memory Cost Comparison

   Compared with the SEE VM, the DEE VM uses extra CPU time to dynamically analyze instruction dependences and to manage the reservation stations. In addition, the DEE VM allocates memory to hold the dependence information of every unit of the runtime stack and the data structures of the reservation stations. The data above reveal that our approach consumes 4% to 9% more total CPU time than the sequential execution engine, while the extra memory cost of the dynamic execution engine is less than 10% of the total memory cost. We believe this overhead is reasonably acceptable.
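The issue discipline that this overhead pays for can be pictured with a small sketch. The following Java fragment is illustrative only, not AVM's implementation: the class name, the string-keyed dependence map, and the thread-pool dispatch are all assumptions. The idea it demonstrates is the one described above: an invocation is dispatched as soon as the invocations producing its inputs complete, so invocations with no mutual dependence overlap, in the spirit of Tomasulo-style reservation stations [25].

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of dependence-driven issue (hypothetical, not AVM's code):
// an invocation runs as soon as the invocations producing its inputs
// complete, so independent invocations overlap automatically.
public class DependenceScheduler {
    private final ExecutorService pool = Executors.newCachedThreadPool();
    // Maps each named result to the future that will produce it.
    private final Map<String, CompletableFuture<Object>> produced = new HashMap<>();

    // Register a "service invocation" that consumes the named inputs and
    // publishes one named output. Inputs must have been registered earlier,
    // mirroring the sequential program order AVM analyzes.
    public void invoke(String output, List<String> inputs,
                       java.util.function.Function<List<Object>, Object> body) {
        List<CompletableFuture<Object>> deps = new ArrayList<>();
        for (String in : inputs) deps.add(produced.get(in));
        CompletableFuture<Object> result =
            CompletableFuture.allOf(deps.toArray(new CompletableFuture[0]))
                .thenApplyAsync(v -> {
                    List<Object> args = new ArrayList<>();
                    for (CompletableFuture<Object> d : deps) args.add(d.join());
                    return body.apply(args);
                }, pool);
        produced.put(output, result);
    }

    // Blocks until the named result is available.
    public Object await(String name) { return produced.get(name).join(); }

    public void shutdown() { pool.shutdown(); }
}
```

Here the two inputs of "c" are produced by independent invocations "a" and "b", which therefore execute concurrently, exactly the pattern the DEE VM's reservation stations track for service invocations.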
5. Related Work
   To reduce the response time of service-based applications, invoking services in an overlapping way is a feasible approach. In current Web services environments, there are two classes of approaches that can invoke services in parallel.
   The first class is the concurrent programming approach. WS-Eventing [17] and WS-Notification [18] provide specifications for asynchronous, message-based interaction between services, by which developers can invoke services using the event-driven model. I. Silva-Lepe [9] provides an approach to integrate the message-based model into the Web services programming model. AXIS2 [3] also provides asynchronous invocation APIs for the development of service-based applications. Though the above work reduces the complexity of writing concurrent service-based programs to some extent, the development of event-driven programs is still far from easy. Furthermore, event-based programming tends to obfuscate the control flow of the application [21].
   The second class uses a business process execution engine, for example BPEL [4], by which developers can have several service invocations executed in parallel. M. Nanda [11] introduces a decentralized approach to increase parallelism and reduce the amount of network traffic required by a service-based application. However, programming in BPEL still requires developers to deal with parallelism manually. Moreover, BPEL is a programming-in-the-large language: it is mainly used to describe the interactions between services, not to implement a service itself.
   Besides the service-based environment, there are also considerable projects aimed at automatic parallelization of distributed applications. U. Banerjee [16] presents an overview of automatic program parallelization techniques. These techniques work comparatively well for regular applications, but they handle only a few patterns of irregularity. The zJava system [24] tries to exploit parallelism among methods by creating an asynchronous thread of execution for each method invocation in a program; methods are analyzed at compile time to determine the data they access. Unlike in our system, in which the basic entity, the service, is inherently stateless and distributed, in method-level-parallel systems the potential for exploiting parallelism is limited, and developers often need to choose the granularity of methods carefully.
   The promise pipelining technique [22] is used in actor-based systems to reduce latency in distributed systems. Using the abstraction of a future, such a system can execute RPCs in a pipeline. With the help of the compiler, it can hide a portion of the concurrency from developers, but developers are still required to analyze the data dependences within a program manually and to write the concurrent program explicitly.

6. Conclusions and Future Work
   In this article, we introduce a dynamic service execution technique to improve the performance of service-based applications. Using this approach, developers benefit from the overlapping of service invocations while avoiding the trouble of writing tedious, error-prone concurrent programs. We design and implement a distributed virtual machine runtime environment to bring this technique into effect. According to our evaluation, the goals of our technique are achieved and its cost is reasonable.
   We have presented a first cut at a dynamic service execution environment. In the future, with the help of the compiler, we will improve the dependence analysis policy to exploit parallelism further. In addition, how to manage services in our environment so as to reduce the real execution time of service-based applications is also part of our future work.

Acknowledgement
   This work is supported in part by the National Natural Science Foundation of China (Grant No. 60573102, 90412010), the 973 Program (Grant No. 2003CB317000, 2005CB321800), and the Chinese Academy of Sciences Hundred-Talents Project (Grant No. 20044040). The authors are very grateful for this support. We would like to thank Hao Wang, Peixu Li, and Rubao Li for their suggestions, ideas, and enthusiastic support.

References
[1] AVM, http://abacus.ict.ac.cn/.
[2] AXIS, http://ws.apache.org/axis/.
[3] AXIS2, http://ws.apache.org/axis2/.
[4] BPEL, http://www.ibm.com/developerworks/library/ws-bpel/.
[5] D. Bailey, E. Barszcz, The NAS Parallel Benchmarks. Technical Report NAS-94-007, NASA Advanced Supercomputing (NAS) Division, NASA Ames Research Center, 1994.
[6] Globus Toolkit 4.0, http://www.globus.org/toolkit/docs/4.0/.
[7] H. Sutter and J. Larus, Software and the Concurrency Revolution. ACM Queue, 3(7), 2005.
[8] I. Foster, J. Frey, S. Graham, S. Tuecke, K. Czajkowski, D. Ferguson, F. Leymann, M. Nally, T. Storey and S. Weerawarana, Modeling Stateful Resources with Web Services. Globus Alliance, 2004.
[9] I. Silva-Lepe, M. Ward and F. Curbera, Integrating Web Services and Messaging. IEEE International Conference on Web Services, 2006.
[10] J. K. Ousterhout, Why Threads Are A Bad Idea (for most purposes). Presentation given at the 1996 USENIX Annual Technical Conference, January 1996.
[11] M. Nanda, S. Chandra, V. Sarkar, Decentralizing Execution of Composite Web Services. Proceedings of the 19th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, 2004.
[12] O. Zimmermann, M. Tomlinson and S. Peuser, Perspectives on Web Services. Springer, 2003.
[13] P. Wu and D. Padua, Containers on the Parallelization of General-Purpose Java Programs. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, October 1999.
[14] Kaffe, http://www.kaffe.org/.
[15] R. Wijngaart and M. Frumkin, NAS Grid Benchmarks Version 1.0. Technical Report NAS-02-005, NASA Advanced Supercomputing (NAS) Division, NASA Ames Research Center, 2002.
[16] U. Banerjee, R. Eigenmann, A. Nicolau, and D. Padua, Automatic Program Parallelization. Proceedings of the IEEE, 81(2):211-243, 1993.
[17] Web Services Eventing (WS-Eventing), http://www.w3.org/Submission/WS-Eventing/.
[18] Web Services Notification, http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wsn.
[19] Web Services Architecture, http://www.w3.org/TR/ws-arch/.
[20] X. Wang, L. Xiao, W. Li, H. Yu, Z. Xu, Abacus: A Service-Oriented Programming Language for Grid Applications. IEEE International Conference on Services Computing, 2005.
[21] R. von Behren, J. Condit, and E. Brewer, Why Events Are a Bad Idea (for High-Concurrency Servers). Proceedings of the 9th Workshop on Hot Topics in Operating Systems (HotOS IX), 2003.
[22] B. Liskov and L. Shrira, Promises: Linguistic Support for Efficient Asynchronous Procedure Calls in Distributed Systems. Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, 1988.
[23] Web Services Architecture Usage Scenarios, http://www.w3.org/TR/2002/WD-ws-arch-scenarios-20020730/.
[24] B. Chan, T. Abdelrahman, Run-Time Support for the Automatic Parallelization of Java Programs. The Journal of Supercomputing, 28, 2004.
[25] R. M. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units". IBM Journal of Research and Development, 11(1), 1967.
Appendix 1
   This section contains the code of the primary service extracted from the implementation of the MB problem of NGB. The dataflow of the MB problem is as follows [15]:

Figure 6. Dataflow of MB Problem

   The following code simply describes the dataflow of the MB problem sequentially. AVM is able to execute this section of code in parallel automatically; the speedup is 2.609 on three machines, which is close to the ideal speedup for this problem.

//This service needs a taskmanager service that has already been deployed
extern TaskManager TaskManager.task_manager;

service NGBServer{
   IOService ios=new IOService();
   SystemService ss=new SystemService();

   published BMResults[] solve_mb_problem(BMRequest req){
      int i;
      DGraph dfg=req.getdfg();
      int dfg_node_num=dfg.getnumNodes();

      //initialization
      for(i=0;i<dfg_node_num;i+=1){
         DGNode nd=dfg.getnode(i);
         nd.setverified(1);
         dfg.getnode(i).setattribute(0);
      }

      //invokes each task service in sequence
      BMRequest[] task_input=null;
      BMRequest[] task_result=new BMRequest[dfg_node_num];
      task_result[0]=task_manager.getExecutor(0).step(task_input);
      task_result[1]=task_manager.getExecutor(1).step(task_input);
      task_result[2]=task_manager.getExecutor(2).step(task_input);

      task_input=new BMRequest[2];
      task_input[0]=task_result[0];

      task_input=new BMRequest[1];
      task_input[0]=task_result[2];
      task_result[5]=task_manager.getExecutor(5).step(task_input);

      task_input=new BMRequest[2];
      task_input[0]=task_result[3];
      task_input[1]=task_result[4];
      task_result[6]=task_manager.getExecutor(6).step(task_input);

      task_input=new BMRequest[3];
      task_input[0]=task_result[3];
      task_input[1]=task_result[4];
      task_input[2]=task_result[5];
      task_result[7]=task_manager.getExecutor(7).step(task_input);

      task_input=new BMRequest[1];
      task_input[0]=task_result[5];
      task_result[8]=task_manager.getExecutor(8).step(task_input);

      //waits for all tasks to finish and gets their results
      int finnum=0;
      BMResults ret[]=new BMResults[dfg.getnumNodes()];
      while(finnum<dfg_node_num){
         ss.sleep(1);
         for(i=0;i<dfg_node_num;i+=1){
            if(dfg.getnode(i).getattribute()==1) continue;
            finnum+=1;
            dfg.getnode(i).setattribute(1);
            ret[i]=task_result[i].getResults();
            dfg.getnode(i).setverified(ret[i].getverified());
         }
      }
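For contrast, the following Java sketch shows roughly what a developer would have to write by hand with explicit futures to obtain the overlap that AVM derives from the sequential code above automatically. The sketch is hypothetical: step is a stand-in for task_manager.getExecutor(i).step(...), and only a representative subset of the MB task graph's edges is wired up.

```java
import java.util.concurrent.*;

// Manually parallelized fragment of the appendix dataflow (sketch only):
// every edge of the task graph becomes an explicit future dependency.
public class ManualMB {
    // Stand-in for task_manager.getExecutor(id).step(...):
    // returns the task id plus the values of its inputs.
    static Integer step(int id, Integer... inputs) {
        int sum = id;
        for (Integer in : inputs) sum += in;
        return sum;
    }

    // Wires a chain t2 -> t5 -> t8 plus one fan-in node by hand.
    public static int runFragment() {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            CompletableFuture<Integer> t2 =
                CompletableFuture.supplyAsync(() -> step(2), pool);
            CompletableFuture<Integer> t5 =
                t2.thenApplyAsync(r -> step(5, r), pool);
            CompletableFuture<Integer> t8 =
                t5.thenApplyAsync(r -> step(8, r), pool);
            // A node with several inputs must combine its producers explicitly:
            CompletableFuture<Integer> fanIn =
                t5.thenCombineAsync(t8, (a, b) -> step(7, a, b), pool);
            return fanIn.join();
        } finally {
            pool.shutdown();
        }
    }
}
```

Every additional node multiplies this wiring, which is exactly the tedious, error-prone concurrent code that the dynamic dependence analysis in AVM spares the developer from writing.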