David Mathog
2018-11-01 22:28:01 UTC
Greetings,
I would like to request a "pre-sorted" (p?) modifier for keys. I
frequently sort very large files with one or two of the sort keys
already in the correct order as a consequence of how those files were
generated. Sort currently has no way of accepting this information, so
sort -k 1,1n -k 2,2 -k 3,3 <infile >outfile
pulls the entire file into a buffer and sorts it. That is slow, takes
up a lot of memory, and is mostly a waste of CPU cycles. The suggested
syntax of:
sort -k 1,1np -k 2,2p -k 3,3 <infile >outfile
would sort only those small groups of lines in which the indicated
pre-sorted keys are all the same. In typical use cases the majority of
such groups have only one or a few members. Single-line groups
need not be sorted at all, just emitted, and the other groups would sort
quickly. Memory requirements would typically be greatly reduced as
well: a 100M-line file might have a maximum group size of only 10.
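To make the intended semantics concrete, here is a hypothetical
five-line input whose first column is already numerically sorted (the
data and the single pre-sorted key are only for illustration):

  1 b x
  2 c y
  2 a z
  3 d w
  3 b v

With something like "sort -k 1,1np -k 2,2" the lone "1" line would be
emitted untouched, and only the pair of "2" lines and the pair of "3"
lines would each be sorted on column 2.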
This was discussed previously in this thread:
https://superuser.com/questions/1297157/optimally-sort-a-dataset-of-n-keys-when-first-key-is-already-ordered
where external methods that chop up the input and spoon-feed it to
multiple runs of sort are shown. None of these will be as fast as it
would be if sort itself handled this.
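For reference, a minimal sketch of that kind of external workaround,
assuming whitespace-delimited fields and that only column 1 is
pre-sorted (the key numbers and file names are placeholders). It still
pays one sort process per group, which is part of why built-in support
should be faster:

  awk -v cmd='sort -k 2,2 -k 3,3' '
      NR > 1 && $1 != prev { close(cmd) }  # flush the finished group, in order
      { prev = $1; print | cmd }           # feed the current group to a small sort
      END { close(cmd) }
  ' <infile >outfile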
Of course, if the input data is not in the expected order, that would
be detected and the program would exit with an error status. A partial
output file may be created in that case. If such a partial file would
be a problem, the issue can be avoided by first running sort with its
(existing) "check" option on only the pre-sorted keys.
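For example, using the keys from the command above (the -s keeps the
check limited to the named keys by disabling the last-resort whole-line
comparison; the second command uses the proposed, not-yet-existing "p"
modifier):

  sort -c -s -k 1,1n -k 2,2 <infile && \
  sort -k 1,1np -k 2,2p -k 3,3 <infile >outfile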
Thank you,
David Mathog
***@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech