Compressing an array of shorts

I've got an array of shorts I want to compress. The probability of one short being close to the next is high, and the same short will rarely appear twice in a row. Compression and decompression both need to be cheap. I could use pkzip or LZMA2, but in one case this is applied on top of a first compression step, and I'd rather have something cheap that doesn't need multiple passes over the data (although that's an option).

I was thinking of compressing it into deltas (probably as bytes) and, in the rare case where a delta doesn't fit in a byte, storing a zero and then writing the full short after it.

ie: uncompressed:
(short)500 (short)506 (short)1000 (short)1007 (short)1100 (short)1050 (short)1200
size = 14 bytes

compressed:
(short)500 (byte)6 (byte)0 (short)1000 (byte)7 (byte)93 (byte)-50 (byte)150
size = 10 bytes
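
Rough, untested sketch of what I mean (assuming the full short after the zero escape byte is just stored as its two raw bytes, low byte first):

#include <cstddef>
#include <cstdint>
#include <vector>

// First short is stored raw, then one signed byte per delta.
// A delta byte of 0 is an escape: the full short follows.
std::vector<uint8_t> compress(const std::vector<int16_t>& in)
{
    std::vector<uint8_t> out;
    if (in.empty()) return out;

    uint16_t first = static_cast<uint16_t>(in[0]);
    out.push_back(first & 0xFF);                // first value, raw
    out.push_back(first >> 8);

    for (std::size_t i = 1; i < in.size(); ++i)
    {
        int delta = in[i] - in[i - 1];
        if (delta != 0 && delta >= -128 && delta <= 127)
            out.push_back(static_cast<uint8_t>(delta));   // two's-complement delta byte
        else
        {
            uint16_t full = static_cast<uint16_t>(in[i]);
            out.push_back(0);                   // escape marker
            out.push_back(full & 0xFF);         // full short follows
            out.push_back(full >> 8);
        }
    }
    return out;
}

Decompression just walks the bytes the other way, adding deltas and swapping in the full shorts at the escapes.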

(Another data point... the size of these lists ranges from 1 item to 100 items.)

I could store the deltas as 7 or 6 bits if I determine that's a good fit for the data, though of course bytes are easier to work with. I was wondering if there's a cheap way to do variable-bit-sized compression so I could better match the stored size to each delta?
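
For example, something like a zig-zag mapping of the signed deltas would keep small magnitudes in the low bits, which should make packing them into 6 or 7 bits (or a variable width) straightforward. Untested sketch:

#include <cstdint>

// Map a signed delta to unsigned so small magnitudes stay small:
// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
uint16_t zigzag(int16_t d)
{
    return (static_cast<uint16_t>(d) << 1) ^ static_cast<uint16_t>(d >> 15);
}

int16_t unzigzag(uint16_t u)
{
    return static_cast<int16_t>((u >> 1) ^ -static_cast<int>(u & 1));
}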

Anyway, I thought it was an interesting problem and was wondering if anyone knew of, or could think of, a good compression algorithm to improve on this.
It seems like there's very little variation between pairs of shorts. How about splitting the array into two arrays, one for the low bytes and the other for the high bytes? Then you append one to the other (it doesn't matter in which order) and compress that. That should give a slightly better compression ratio, I think.
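
Something along these lines, roughly (untested):

#include <cstdint>
#include <vector>

// Split each short into its low and high byte, keeping element order
// within each half. The i-th low byte plus the i-th high byte
// reassemble the i-th original short on decompression.
std::vector<uint8_t> split(const std::vector<int16_t>& in)
{
    std::vector<uint8_t> out;
    out.reserve(in.size() * 2);
    for (int16_t v : in)                              // all low bytes first...
        out.push_back(static_cast<uint16_t>(v) & 0xFF);
    for (int16_t v : in)                              // ...then all high bytes
        out.push_back(static_cast<uint16_t>(v) >> 8);
    return out;
}

Then you feed the result to whatever general-purpose compressor you're already using; the run of slowly-changing high bytes should compress well.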
Thanks for the suggestion, Helios. How would you suggest I maintain order, since that is also important?
That's kind of a dumb question. Simply make your read and write routines mutually consistent.
Sorry, I think I misunderstood what you were saying. So what compression algorithm would you suggest for that? Perhaps RLE or count compression? I guess it would make sense to run-length encode the high bytes and not the low bytes (since the low bytes are rarely repetitive). So:


uncompressed:

low bytes (hex) -> F4, FA, E8, EF, 4C, 1A, B0
high bytes -> 1, 1, 3, 3, 4, 4, 4

compressed:

(bytes) F4, FA, E8, EF, 4C, 1A, B0
(bytes) (count)2 (value)1 (count)2 (value)3 (count)3 (value)4

13 bytes.
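
A rough sketch of that (untested; the decoder would need the element count up front to know where the low bytes end and the run-length pairs begin):

#include <cstddef>
#include <cstdint>
#include <vector>

// Low bytes go out verbatim; high bytes are run-length encoded as
// (count, value) pairs, since they rarely change between neighbours.
std::vector<uint8_t> rle_pack(const std::vector<int16_t>& in)
{
    std::vector<uint8_t> out;
    for (int16_t v : in)                              // raw low bytes
        out.push_back(static_cast<uint16_t>(v) & 0xFF);

    std::size_t i = 0;
    while (i < in.size())
    {
        uint8_t high = static_cast<uint16_t>(in[i]) >> 8;
        uint8_t count = 1;
        while (i + count < in.size() && count < 255 &&
               (static_cast<uint16_t>(in[i + count]) >> 8) == high)
            ++count;
        out.push_back(count);                         // run length
        out.push_back(high);                          // high-byte value
        i += count;
    }
    return out;
}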

But it's a good idea, because maybe there's more we can do with this to get it down lower... like, since we know the maximum is about 100 items, the counts don't need to be full bytes, although that would still require more than 10 bytes in the contrived test case above.

One thing I worry about with this approach is that on byte boundaries it's likely to switch more radically than the first approach. Although how common that is would be a matter of experimenting with the data.

Keep the ideas coming :p
