Request for feature, sort

Discussion:

David Mathog

2018-11-01 22:28:01 UTC

Greetings,

I would like to request a "pre-sorted" (p?) modifier for keys. I
frequently sort very large files with one or two of the sort keys
already in the correct order as a consequence of how those files were
generated. Sort currently has no way of accepting this information so
that

sort -k 1,1n -k 2,2 -k 3,3 <infile >outfile

pulls the entire file into a buffer and sorts it. That is slow, takes
up a lot of memory, and is mostly a waste of CPU cycles. The suggested
syntax of:

sort -k 1,1np -k 2,2p -k 3,3 <infile >outfile

would only sort those small groups of lines where indicated presorted
keys are all the same. In typical usage cases the majority of such
groups have only 1 or a small number of members. The single line groups
need not be sorted at all, just emitted, and the other groups would sort
quickly. Memory requirements would typically be greatly reduced as well
- a 100M line file might have a maximum group size of only 10.

This was discussed previously in this thread:

https://superuser.com/questions/1297157/optimally-sort-a-dataset-of-n-keys-when-first-key-is-already-ordered

where external methods to chop up the input and spoon feed it to
multiple runs of sort are shown. None of these are going to be as fast
as it would be if sort itself handled this.

Of course if the input data is not in the expected order that would be
detected and a fatal error would occur, with the program exiting with an
error status. A partial output file may be created in that case. If
that partial file would be a problem the issue can be avoided by running
sort with its (existing) "check" option using only the pre-sorted keys.

Thank you,

David Mathog
***@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

Pádraig Brady

2018-11-04 07:45:21 UTC

Permalink

Post by David Mathog
Greetings,
I would like to request a "pre-sorted" (p?) modifier for keys. I
frequently sort very large files with one or two of the sort keys
already in the correct order as a consequence of how those files were
generated. Sort currently has no way of accepting this information so
that
sort -k 1,1n -k 2,2 -k 3,3 <infile >outfile
pulls the entire file into a buffer and sorts it. That is slow, takes
up a lot of memory, and is mostly a waste of CPU cycles. The suggested
sort -k 1,1np -k 2,2p -k 3,3 <infile >outfile
would only sort those small groups of lines where indicated presorted
keys are all the same. In typical usage cases the majority of such
groups have only 1 or a small number of members. The single line groups
need not be sorted at all, just emitted, and the other groups would sort
quickly. Memory requirements would typically be greatly reduced as well
- a 100M line file might have a maximum group size of only 10.
https://superuser.com/questions/1297157/optimally-sort-a-dataset-of-n-keys-when-first-key-is-already-ordered
where external methods to chop up the input and spoon feed it to
multiple runs of sort are shown. None of these are going to be as fast
as it would be if sort itself handled this.
Of course if the input data is not in the expected order that would be
detected and a fatal error would occur, with the program exiting with an
error status. A partial output file may be created in that case. If
that partial file would be a problem the issue can be avoided by running
sort with its (existing) "check" option using only the pre-sorted keys.
Thank you,
David Mathog
Manager, Sequence Analysis Facility, Biology Division, Caltech

I like this idea.
I'm not sure how frequent this partially sorted data pattern is,
though given the complexity improvements it seems worthwhile.

So 'p' would be a modifier applicable to each key.
When the data in the key changes we could do
the full key comparison to ensure the order is correct,
and if so output all current sorted data for the just presorted key.
'p' modified keys would need to come before any non 'p' keys.

cheers,
Pádraig

David Mathog

2018-11-05 17:34:48 UTC

Permalink

Post by PÃ¡draig Brady
I'm not sure how frequent this partially sorted data pattern is,
though given the complexity improvements it seems worthwhile.

It is extremely common in Biology applications. For instance, search a
set of DNA sequences (with pre-ordered names like contig_1 -> contig_N,
or frequently just 1 -> N) against a target database and the output will
(for most applications) have clusters of results for 1, then 2, ... then
N.

There are some programs like that which when run multithreaded would
emit those same blocks of results but with the block order shuffled over
a window corresponding to the number of threads. That is, they just
emit each block when it finishes without doing anything to keep the
blocks in order. Taking advantage of "block order", but not "absolute
order", in sort would require yet another key modifier. While this is a
very closely related problem I'm not sure it is common enough to warrant
the extra modifier. That said, I think the only modification needed
from an "absolute order" implementation to a "block order" one is that
the program not consider an out of sorted order key change to be an
error.

Post by PÃ¡draig Brady
So 'p' would be a modifier applicable to each key.
When the data in the key changes we could do
the full key comparison to ensure the order is correct,
and if so output all current sorted data for the just presorted key.
'p' modified keys would need to come before any non 'p' keys.

Exactly.

Conceivably there might be situations where the input file was already
sorted on keys 1 and 3, but not on key 2. However, retaining the order
information for the 3rd key while sorting the 2nd is likely more effort
than it is worth. Certainly diminishing returns set in very quickly in
that arena - consider how difficult it would be to retain the order
information for "-k 1,1 -k 2,2 -k 3,3p" for an input file of a billion
lines. Anyway, requiring all "p" keys to precede all non "p" keys is
not an onerous restriction.

Regards,

David Mathog
***@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech