New Technique Significantly Speeds Up Computer Programs Without Fear of Errors
Computer scientists have developed a new system that accelerates the execution of computer programs while guaranteeing that their results remain correct.
Researchers have developed a technique that can dramatically speed up certain types of computer programs automatically, while ensuring that program results remain accurate.
Their system increases the speed of programs that run in the Unix shell, a ubiquitous programming environment that was created 50 years ago and is still widely used today. Their method parallelizes these programs, which means that it divides the program components into chunks that can be executed simultaneously on multiple computer processors.
This allows programs to perform tasks such as web indexing, natural language processing, or data analysis in a fraction of their original runtime.
“There are so many people who use these types of programs, like data scientists, biologists, engineers and economists. Now they can automatically speed up their programs without worrying about getting incorrect results,” says Nikos Vasilakis, a researcher at the Computer Science and Artificial Intelligence Laboratory (CSAIL) of MIT.
The system also makes it easier for programmers to develop tools used by data scientists, biologists, engineers and others. They don’t need to make any special adjustments to their program commands to enable this automatic, error-free parallelization, adds Vasilakis, who chairs a committee of researchers from around the world who have been working on this system for nearly two years.
Vasilakis is senior author of the group’s latest research paper, which includes MIT co-author and CSAIL graduate student Tammam Mustafa and will be presented at the USENIX Symposium on Operating Systems Design and Implementation. Co-authors include lead author Konstantinos Kallas, a graduate student at the University of Pennsylvania; Jan Bielak, a student at Staszic High School in Warsaw; Dimitris Karnikis, a software engineer at Aarno Labs; Thurston H.Y. Dang, a former MIT postdoc who is now a software engineer at Google; and Michael Greenberg, assistant professor of computer science at the Stevens Institute of Technology.
A decades-old problem
This new system, known as PaSh, focuses on programs or scripts that run in the Unix shell. A script is a sequence of commands that instructs a computer to perform a calculation. The correct and automatic parallelization of shell scripts is a thorny problem that researchers have faced for decades.
The Unix shell remains popular, in part because it is the only programming environment that allows a script to be composed of functions written in multiple programming languages. Different programming languages are better suited to specific tasks or data types; if a developer uses the right language, solving a problem can be much easier.
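A minimal illustration of this multilingual composition (not taken from the paper): a single pipeline can chain a compiled C utility (grep), a program in the AWK language, and a sorting tool, with the shell only wiring their inputs and outputs together.

```shell
#!/bin/sh
# Each pipeline stage is a different tool, potentially written in a
# different language; the shell composes them through pipes.
printf 'error: disk\ninfo: ok\nerror: net\n' |
  grep 'error' |             # keep only the error lines (a C utility)
  awk -F': ' '{print $2}' |  # extract the message field (AWK language)
  sort                       # order the results
```

Running this prints the two error messages, "disk" and "net", in sorted order.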
“People also like to develop in different programming languages, so composing all of these components in one program is something that happens very frequently,” Vasilakis adds.
While the Unix shell allows for multilingual scripting, its flexible and dynamic structure makes such scripts difficult to parallelize using traditional methods.
Parallelizing a program is usually tricky because some parts of the program depend on other parts. This determines the order in which the components must run; get the order wrong and the program fails.
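A small hand-written sketch of such a dependency (illustrative only): the second command reads what the first produces, so their order is fixed and they cannot safely run in parallel.

```shell
#!/bin/sh
# Two dependent stages: the consumer reads the producer's output file,
# so running them concurrently (or in the wrong order) would fail.
tmp=$(mktemp)
seq 5 > "$tmp"                               # producer: must complete first
total=$(awk '{s+=$1} END {print s}' "$tmp")  # consumer: needs $tmp filled
echo "$total"                                # prints 15 (1+2+3+4+5)
rm -f "$tmp"
```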
When a program is written in a single language, developers have explicit knowledge of its behavior, and the language provides tools that help them determine which components can be parallelized. No such tools exist for Unix shell scripts: users cannot easily see what is happening inside components or extract information that would aid parallelization.
A just-in-time solution
To overcome this problem, PaSh uses a preprocessing step that inserts simple annotations on program components that it thinks might be parallelizable. Then PaSh attempts to parallelize these parts of the script while the program is running, at the exact moment it reaches each component.
This sidesteps another problem in shell programming: it is impossible to predict a program’s behavior in advance.
By parallelizing program components “just in time”, the system is able to efficiently accelerate many more of them than traditional methods, which attempt to perform all parallelization ahead of time.
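The kind of rewrite being described can be sketched by hand on a toy pipeline (filenames and chunk sizes here are illustrative, not PaSh’s actual transformation): a sequential `grep | count` over one input becomes two halves processed concurrently, with the partial counts merged afterward, and the result is identical either way.

```shell
#!/bin/sh
# Hand-written version of the parallelizing transformation PaSh automates:
# split the input, run the filter on each half in the background, merge.
cd "$(mktemp -d)"
seq 1 1000 > input.txt

# Sequential version: one grep over the whole input.
seq_count=$(grep -c '7' input.txt)

# Parallel version: process the two halves concurrently, then combine.
split -l 500 input.txt part_       # produces part_aa and part_ab
grep -c '7' part_aa > c1 &
grep -c '7' part_ab > c2 &
wait                               # join both background jobs
par_count=$(( $(cat c1) + $(cat c2) ))

echo "$seq_count $par_count"       # both counts agree by construction
```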
Just-in-time parallelization also ensures that the accelerated program always returns accurate results. If PaSh encounters a program component that cannot be parallelized (perhaps because it depends on a component that has not yet executed), it simply runs the original version, avoiding any error.
“No matter the performance benefits – if you promise to get something working in a second instead of a year – if there’s a chance of returning incorrect results, no one will use your method,” says Vasilakis.
Users do not need to make any changes to use PaSh; they can simply add the tool to their existing Unix shell and tell their scripts to use it.
Acceleration and precision
Researchers have tested PaSh on hundreds of scripts, from classic to modern programs, and it hasn’t broken a single one. The system ran programs six times faster, on average, than unparallelized scripts, and it reached a peak speedup of nearly 34 times.
It also increased the speed of scripts that other approaches were unable to parallelize.
“Our system is the first to show that this type of transformation can be done fully correctly, but there is also an indirect benefit. The way our system is designed allows other researchers and industry users to build on this work,” says Vasilakis.
He is excited to get additional feedback from users and see how they improve the system. The open-source project joined the Linux Foundation last year, making it widely available to users in industry and academia.
In the future, Vasilakis wants to use PaSh to tackle the problem of distribution: dividing a program to run on multiple computers, rather than on multiple processors within a single computer. He also aims to improve the annotation scheme so that it is more user-friendly and can better describe complex program components.
“Unix shell scripts play a key role in data analysis and software engineering tasks. These scripts could run faster by making the various programs they invoke use the multiple processing units available in modern processors. However, the shell’s dynamic nature makes it difficult to devise parallel execution plans ahead of time,” says Diomidis Spinellis, professor of software engineering at the Athens University of Economics and Business and professor of software analytics at the Delft University of Technology, who did not participate in this research. “Through just-in-time analysis, PaSh-JIT successfully conquers the dynamic complexity of the shell and thus reduces script execution times while maintaining the correctness of the corresponding results.”
“As a drop-in replacement for a regular shell that orchestrates steps but doesn’t reorder or divide them, PaSh provides a simple way to improve the performance of big data processing jobs,” adds Douglas McIlroy, adjunct professor in the Department of Computer Science at Dartmouth College, who previously led the Computing Techniques Research Department at Bell Laboratories, the birthplace of the Unix operating system. “Manual optimization to exploit parallelism must be done at a level for which ordinary programming languages, including shells, do not provide abstractions of their own. The resulting code mixes questions of logic and efficiency. It is difficult to read and difficult to maintain in the face of changing requirements. PaSh intelligently intervenes at this level, preserving the original logic on the surface while achieving efficiency when the program is running.”
Reference: “Practically Correct, Just-in-Time Shell Script Parallelization” by Konstantinos Kallas, Tammam Mustafa, Jan Bielak, Dimitris Karnikis, Thurston HY Dang, Michael Greenberg and Nikos Vasilakis.
This work was supported, in part, by the Defense Advanced Research Projects Agency and the National Science Foundation.