Add ALP (Adaptive Lossless floating-Point) encoding support #3390

Conversation
Implements ALP encoding for FLOAT and DOUBLE types, which converts floating-point values to integers using decimal scaling, then applies Frame of Reference (FOR) encoding and bit-packing for compression.

New files:
- AlpConstants.java: constants for ALP encoding
- AlpEncoderDecoder.java: core encoding/decoding logic
- AlpValuesWriter.java: writer implementation
- AlpValuesReaderForFloat/Double.java: reader implementations

Includes comprehensive unit tests and interop test infrastructure.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Restore original comment indentation that was accidentally changed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Escape <= characters as &lt;= in javadoc comments to avoid malformed HTML errors during documentation generation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ALP encoding is not yet part of the parquet-format Thrift specification, so it cannot be converted to org.apache.parquet.format.Encoding. Skip it in the testEnumEquivalence test and add a clear error message in the converter for when ALP conversion is attempted. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -0,0 +1,136 @@
/*
There was a recent comment on AI-generated code on Arrow; if we aren't doing a lot of edits, I'm not sure how it should be licensed.
// ========== Vector Constants ==========

/** Default number of elements per compressed vector (2^10 = 1024) */
public static final int ALP_VECTOR_SIZE = 1024;
nit: this is the default, but it should be configurable.
public static final int ALP_VECTOR_SIZE = 1024;

/** Log2 of the default vector size */
public static final int ALP_VECTOR_SIZE_LOG = 10;
same comment on configurable.
I think this is less important to make configurable than vector size. Since we are specifically calling it out as configurable in the spec, we should make sure we generate some data at different bit-widths.
// ========== Sampling Constants ==========

/** Number of values sampled per vector */
public static final int SAMPLER_SAMPLES_PER_VECTOR = 256;
Wonder if these should be configurable somehow? Probably OK if not. Can these be package-private?
// Try encoding and check for round-trip failure
float multiplier = FLOAT_POW10[exponent];
if (factor > 0) {
Not sure I understand the > 0 check; I need to examine the encoding logic in more detail. Is this just an optimization to avoid dividing by 1?
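For context, here is a toy version of the round-trip test as I understand it from the ALP paper; the class, table, and values are illustrative, not the PR's actual code:

    // Toy round-trip check, assuming FLOAT_POW10[i] == 10^i.
    public class AlpRoundTripSketch {
      static final float[] FLOAT_POW10 = {1f, 10f, 100f, 1000f, 10000f};

      // Returns true if (exponent, factor) encodes `value` losslessly.
      static boolean roundTrips(float value, int exponent, int factor) {
        int encoded = Math.round(value * FLOAT_POW10[exponent] / FLOAT_POW10[factor]);
        float decoded = (float) encoded * FLOAT_POW10[factor] / FLOAT_POW10[exponent];
        // Compare bit patterns so -0.0f and NaN are not conflated
        return Float.floatToIntBits(decoded) == Float.floatToIntBits(value);
      }

      public static void main(String[] args) {
        System.out.println(roundTrips(1.25f, 2, 0));   // true: 125 decodes back to 1.25f
        System.out.println(roundTrips(1f / 3f, 2, 0)); // false: 33 decodes to 0.33f
      }
    }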
// Try encoding and check for round-trip failure
double multiplier = DOUBLE_POW10[exponent];
if (factor > 0) {
I think we can just call the encode method here?
multiplier /= DOUBLE_POW10[factor];
}

double scaled = value * multiplier;
Maybe just call decode. I imagine JIT inlining should be able to do the optimizations for the lookup.
*/
public static long encodeDouble(double value, int exponent, int factor) {
double multiplier = DOUBLE_POW10[exponent];
if (factor > 0) {
Micro-optimization: would it be faster to just have 1.0 in the array at index 0? (Also, just to check: I believe this is numerically stable, but not 100% sure.)
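A minimal sketch of that suggestion, with the table rebuilt here for illustration: if index 0 holds 1.0, the factor > 0 branch can go away, since dividing by 1.0 is exact in IEEE 754.

    public class Pow10Sketch {
      static final double[] DOUBLE_POW10 = new double[19];
      static {
        double p = 1.0;
        for (int i = 0; i < DOUBLE_POW10.length; i++) {
          DOUBLE_POW10[i] = p; // DOUBLE_POW10[0] == 1.0
          p *= 10.0;
        }
      }

      static long encodeDouble(double value, int exponent, int factor) {
        // Branch-free: for factor == 0 this divides by exactly 1.0, a no-op
        double multiplier = DOUBLE_POW10[exponent] / DOUBLE_POW10[factor];
        return Math.round(value * multiplier);
      }

      public static void main(String[] args) {
        System.out.println(encodeDouble(1.23, 2, 0)); // 123
      }
    }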
* Calculate the bit width needed to store unsigned values up to maxDelta.
*
* @param maxDelta the maximum delta value (unsigned)
* @return the number of bits needed (0-32 for int, 0-64 for long)
long is not covered here.
if (maxDelta == 0) {
return 0;
}
return 32 - Integer.numberOfLeadingZeros(maxDelta);
nit: Integer.SIZE?
if (maxDelta == 0) {
return 0;
}
return 64 - Long.numberOfLeadingZeros(maxDelta);
nit: Long.SIZE
* @param maxDelta the maximum delta value (unsigned)
* @return the number of bits needed (0-32 for int, 0-64 for long)
*/
public static int bitWidth(int maxDelta) {
I worry about accidental type coercion that might call this method instead of the long overload. Maybe give this one an explicit name.
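A sketch of the explicit-name idea; bitWidthForInt and bitWidthForLong are hypothetical names, not the PR's:

    public class BitWidths {
      static int bitWidthForInt(int maxDelta) {
        return maxDelta == 0 ? 0 : Integer.SIZE - Integer.numberOfLeadingZeros(maxDelta);
      }

      static int bitWidthForLong(long maxDelta) {
        return maxDelta == 0L ? 0 : Long.SIZE - Long.numberOfLeadingZeros(maxDelta);
      }

      public static void main(String[] args) {
        // With distinct names, an int argument still widens when passed to the
        // long variant, but a call can no longer silently bind to the wrong overload.
        System.out.println(bitWidthForInt(7));        // 3
        System.out.println(bitWidthForLong(1L << 40)); // 41
      }
    }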
* @param value the float value to check
* @return true if the value is an exception
*/
public static boolean isFloatException(float value) {
Please audit the visibility of these methods; it seems most could be at least package-private?
bestFactor = f;
bestExceptions = exceptions;
// Early exit if we found perfect encoding
if (bestExceptions == 0) {
Need to read the paper, but does the order in which we are exploring the exceptions and factors make sense? (Is there any benefit to exploring from largest to smallest?)
if (integerEncoding != ALP_INTEGER_ENCODING_FOR) {
throw new ParquetDecodingException("Unsupported ALP integer encoding: " + integerEncoding);
}
Let's check 0 <= logVectorSize <= 16 (see the combined validation sketch after the next comment below).
int compressionMode = headerBuf.get() & 0xFF;
int integerEncoding = headerBuf.get() & 0xFF;
int logVectorSize = headerBuf.get() & 0xFF;
int numElements = headerBuf.getInt();
check numElements <= valuesCount ?
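A combined sketch of this check and the logVectorSize bound suggested above; the field names follow the diff, the method itself is hypothetical:

    import org.apache.parquet.io.ParquetDecodingException;

    class AlpHeaderValidation {
      static void validate(int logVectorSize, int numElements, int valuesCount) {
        if (logVectorSize < 0 || logVectorSize > 16) {
          throw new ParquetDecodingException("Invalid ALP log vector size: " + logVectorSize);
        }
        if (numElements < 0 || numElements > valuesCount) {
          throw new ParquetDecodingException(
              "ALP header claims " + numElements + " elements but the page holds " + valuesCount);
        }
      }
    }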
LOG.debug("init from page at offset {} for length {}", stream.position(), stream.available());

// Read and validate header
ByteBuffer headerBuf = stream.slice(ALP_HEADER_SIZE).order(ByteOrder.LITTLE_ENDIAN);
It looks like this call does a copy? I wonder if we are better off using DataInputStream and reversing the bytes for numElements?
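Roughly what the DataInputStream variant could look like; the class and method here are hypothetical:

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Stream the header fields instead of slicing a buffer. DataInputStream
    // reads big-endian, so the little-endian int is byte-reversed on the way in.
    class AlpHeaderReadSketch {
      static void readHeader(InputStream in) throws IOException {
        DataInputStream din = new DataInputStream(in);
        int compressionMode = din.readUnsignedByte();
        int integerEncoding = din.readUnsignedByte();
        int logVectorSize = din.readUnsignedByte();
        int numElements = Integer.reverseBytes(din.readInt()); // little-endian on disk
        // these mirror the headerBuf reads in the diff
      }
    }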
this.currentIndex = 0;

// Read AlpInfo array
ByteBuffer alpInfoBuf = stream.slice(ALP_INFO_SIZE * numVectors).order(ByteOrder.LITTLE_ENDIAN);
same comment as above on avoiding the slice.
}

// Read ForInfo array
ByteBuffer forInfoBuf = stream.slice(DOUBLE_FOR_INFO_SIZE * numVectors).order(ByteOrder.LITTLE_ENDIAN);
We might as well slice this together with AlpInfo if we keep the slices.
}
// else: all deltas are 0 (bitWidth == 0)

// Reverse FOR and decode
nit: spell out FOR?
* Note: This method reads from the buffer's current position, using absolute indexing
* relative to that position.
*/
private void unpackLongs(ByteBuffer packed, long[] values, int count, int bitWidth) {
shouldn't there already be a method someplace that does this?
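parquet-column ships generated byte packers in org.apache.parquet.column.values.bitpacking; from memory (worth verifying the exact signatures), reusing them would look something like:

    import org.apache.parquet.column.values.bitpacking.BytePackerForLong;
    import org.apache.parquet.column.values.bitpacking.Packer;

    // Assumes `count` is padded to a multiple of 8, since the packers
    // work in groups of 8 values.
    class UnpackSketch {
      static void unpackLongs(byte[] packed, long[] values, int count, int bitWidth) {
        BytePackerForLong packer = Packer.LITTLE_ENDIAN.newBytePackerForLong(bitWidth);
        int inPos = 0;
        for (int outPos = 0; outPos < count; outPos += 8) {
          packer.unpack8Values(packed, inPos, values, outPos);
          inPos += bitWidth; // 8 values * bitWidth bits == bitWidth bytes
        }
      }
    }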
bitWidths[v] = forInfoBuf.get() & 0xFF;
}

// Decode each vector
Do we really want to do this all up-front? Don't we only want to decode one vector at a time?
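A hypothetical shape for per-vector lazy decoding; all names and the stub are illustrative:

    // Materialize only the vector the current read position falls into.
    class LazyDecodeSketch {
      private static final int ALP_VECTOR_SIZE_LOG = 10;
      private static final int ALP_VECTOR_SIZE = 1 << ALP_VECTOR_SIZE_LOG;

      private double[] currentVector;     // decoded values of one vector
      private int loadedVectorIndex = -1; // which vector currentVector holds

      double readValue(int index) {
        int vector = index >>> ALP_VECTOR_SIZE_LOG;
        if (vector != loadedVectorIndex) {
          currentVector = decodeVector(vector); // unpack + reverse FOR + decode one vector
          loadedVectorIndex = vector;
        }
        return currentVector[index & (ALP_VECTOR_SIZE - 1)];
      }

      // Stub standing in for the per-vector decode path.
      private double[] decodeVector(int vector) {
        return new double[ALP_VECTOR_SIZE];
      }
    }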
@@ -0,0 +1,201 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
Not reviewing this directly, but all comments on the double version apply here too, I think.
public FloatAlpValuesWriter(int initialCapacity, int pageSize, ByteBufferAllocator allocator) {
super(initialCapacity, pageSize, allocator);
// Initial buffer size - will grow as needed
this.buffer = new float[Math.max(ALP_VECTOR_SIZE, initialCapacity / Float.BYTES)];
I don't think I understand this logic.
This should just be pageSize?
@Override
public void skip(int n) {
if (n < 0 || currentIndex + n > totalCount) {
When we move to incremental decoding, this should skip by fast-forwarding, avoiding decoding vector data for intermediate results.
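Building on the lazy-decoding sketch above, skip() could then be as cheap as moving the cursor; the fields here are assumed:

    import org.apache.parquet.io.ParquetDecodingException;

    class SkipSketch {
      private int currentIndex;
      private int totalCount;

      void skip(int n) {
        if (n < 0 || currentIndex + n > totalCount) {
          throw new ParquetDecodingException("cannot skip " + n + " values at index " + currentIndex);
        }
        currentIndex += n; // decoding happens on the next actual read, if any
      }
    }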
@Override
public long getBufferedSize() {
// Estimate: each float value contributes roughly 2-4 bytes after compression
I think we should be flushing incrementally to byte buffers for the encoded data, to get a more accurate value. For unflushed data the size heuristic is maybe OK for an initial version.
@Override
public void writeFloat(float v) {
if (count >= buffer.length) {
I don't think we should be growing here to hold all floats. Once we reach vector size we should be encoding the page.
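A sketch of bounding the buffer at one vector and flushing eagerly; encodeAndFlushVector is hypothetical:

    class VectorFlushSketch {
      private static final int ALP_VECTOR_SIZE = 1024;
      private final float[] buffer = new float[ALP_VECTOR_SIZE];
      private int count;

      void writeFloat(float v) {
        buffer[count++] = v;
        if (count == ALP_VECTOR_SIZE) {
          encodeAndFlushVector(buffer, count); // encode this full vector now
          count = 0;                           // reuse the fixed-size buffer
        }
      }

      private void encodeAndFlushVector(float[] values, int n) {
        // hypothetical: find params, scale, FOR-encode, bit-pack, append bytes
      }
    }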
int vectorLen = vectorEnd - vectorStart;

// Find best encoding parameters
AlpEncoderDecoder.EncodingParams params =
Don't we want to cache encodings and use those first (and stop searching after a while)?
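One possible shape for such a cache; the threshold and helper methods are made up for illustration:

    // Try the previous vector's (exponent, factor) pair first and re-run the
    // full search only when its exception rate degrades.
    class ParamCacheSketch {
      private static final double MAX_EXCEPTION_RATE = 0.05; // assumed threshold
      private int cachedExponent = -1;
      private int cachedFactor;

      int[] chooseParams(double[] vector, int len) {
        if (cachedExponent >= 0
            && countExceptions(vector, len, cachedExponent, cachedFactor) <= len * MAX_EXCEPTION_RATE) {
          return new int[] {cachedExponent, cachedFactor}; // cache hit: skip the search
        }
        int[] best = fullSearch(vector, len); // the existing exhaustive search
        cachedExponent = best[0];
        cachedFactor = best[1];
        return best;
      }

      private int countExceptions(double[] v, int len, int e, int f) { return 0; } // stub
      private int[] fullSearch(double[] v, int len) { return new int[] {0, 0}; }   // stub
    }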
emkornfield left a comment:
Haven't looked at the tests, but I left a bunch of comments. The most important one is that we should be reading and writing incrementally, not decoding everything at once, I think.
For incremental reads, I left a comment on the spec doc, but I think we should probably be interleaving the headers on writes to avoid unnecessary buffering.
Following Micah's suggestion yesterday, I took a stab at using Claude to produce a Java implementation of ALP based on Prateek's spec and C++ implementation.
Bear in mind that I haven't closely reviewed it yet; it is fairly experimental, but it seems promising.
Implements ALP encoding for FLOAT and DOUBLE types, which converts floating-point values to integers using decimal scaling, then applies Frame of Reference (FOR) encoding and bit-packing for compression.
New files:
- AlpConstants.java: constants for ALP encoding
- AlpEncoderDecoder.java: core encoding/decoding logic
- AlpValuesWriter.java: writer implementation
- AlpValuesReaderForFloat/Double.java: reader implementations
Includes comprehensive unit tests and interop test infrastructure.
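For readers unfamiliar with ALP, a toy end-to-end sketch of the pipeline described above (decimal scaling, then Frame of Reference, then computing the packed bit width); all names and values are illustrative:

    public class AlpPipelineToy {
      public static void main(String[] args) {
        double[] values = {1.23, 4.56, 7.89, 2.46};

        // 1. Decimal scaling by 10^2 (exponent=2, factor=0): doubles become integers.
        long[] ints = new long[values.length];
        for (int i = 0; i < values.length; i++) {
          ints[i] = Math.round(values[i] * 100.0); // 123, 456, 789, 246
        }

        // 2. Frame of Reference: subtract the minimum so deltas are small.
        long min = Long.MAX_VALUE;
        for (long x : ints) min = Math.min(min, x);
        long maxDelta = 0;
        for (int i = 0; i < ints.length; i++) {
          ints[i] -= min; // 0, 333, 666, 123
          maxDelta = Math.max(maxDelta, ints[i]);
        }

        // 3. Bit-pack width: 666 needs 10 bits instead of 64 per value.
        int bitWidth = Long.SIZE - Long.numberOfLeadingZeros(maxDelta);
        System.out.println("reference=" + min + " bitWidth=" + bitWidth);
      }
    }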
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?