# Forking an input stream to several programs

This describes how to have N arbitrary programs read a copy of a single input stream.

If your data lives on shared central storage, it is sometimes most efficient to have it read by several different programs on the same node at the same time. For example, suppose your input is a large file and you want to count its lines, grep out the lines containing `SOMETHING`, and compute its MD5 checksum. These are three IO-bound operations that each need sequential access to the input. The naive method transfers the file three times from the shared storage to the worker node:

wc -l input.txt > linecount.txt
grep SOMETHING input.txt > something.txt
md5sum input.txt > md5sum.txt

Using bash's process substitution operator, we can make this almost three times faster by transferring the file to the worker node only once:

tee < input.txt >(wc -l > linecount.txt) >(grep SOMETHING > something.txt) | md5sum > md5sum.txt

Here `tee` copies its stdin to each `>(...)` process substitution; its stdout, which would otherwise dump the whole file to the terminal, is piped into the third command.

More complex schemes with multiple inputs and outputs are of course possible. Keep in mind that this construct only helps if your processes need sequential access to the same input and are IO bound on it.
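As one possible extension (a sketch; the pipe names `pipe1` and `pipe2` are made up for illustration): a program that insists on reading a named file rather than stdin can still share a single transfer if it reads from a named pipe created with `mkfifo`:

```shell
#!/bin/sh
# Create two named pipes; each consumer reads its own copy of the stream.
mkfifo pipe1 pipe2

# Start the consumers in the background, each reading from its own pipe.
wc -l < pipe1 > linecount.txt &
md5sum < pipe2 > md5sum.txt &

# tee writes one copy of the input into each pipe:
# pipe1 as an argument, pipe2 via stdout redirection.
tee pipe1 < input.txt > pipe2

# Wait for the background consumers to finish, then clean up.
wait
rm pipe1 pipe2
```

Unlike `>(...)`, named pipes persist as filesystem entries, so they can be passed as filename arguments to programs that open their input themselves; the trade-off is the extra setup, background jobs, and cleanup shown above.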