Conversation

@julienledem (Member) commented Jan 22, 2026

Following Micah's suggestion yesterday, I took a stab at using Claude to produce a Java implementation of ALP based on Prateek's spec and C++ implementation.
Bear in mind that I haven't closely reviewed it yet; it is fairly experimental, but it seems promising.

Implements ALP encoding for FLOAT and DOUBLE types, which converts floating-point values to integers using decimal scaling, then applies Frame of Reference (FOR) encoding and bit-packing for compression.

New files:

  • AlpConstants.java: Constants for ALP encoding
  • AlpEncoderDecoder.java: Core encoding/decoding logic
  • AlpValuesWriter.java: Writer implementation
  • AlpValuesReaderForFloat/Double.java: Reader implementations

Includes comprehensive unit tests and interop test infrastructure.
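
For a rough sense of the scheme, here is a minimal illustrative sketch of the decimal-scaling round trip; the names (AlpSketch, POW10) and the Math.round call are assumptions for illustration, not the PR's exact code:

/*
 * Minimal sketch of ALP's decimal scaling: a double is scaled by
 * 10^exponent / 10^factor and rounded to a long; decoding reverses the
 * scaling. Values that do not round-trip are stored separately as
 * exceptions, and the encoded longs are then FOR-encoded and bit-packed.
 */
public class AlpSketch {
  private static final double[] POW10 = {1e0, 1e1, 1e2, 1e3, 1e4, 1e5};

  static long encode(double value, int exponent, int factor) {
    return Math.round(value * POW10[exponent] / POW10[factor]);
  }

  static double decode(long encoded, int exponent, int factor) {
    return encoded * POW10[factor] / POW10[exponent];
  }

  public static void main(String[] args) {
    double v = 12.34;
    long enc = encode(v, 2, 0);     // 12.34 * 100 -> 1234
    double dec = decode(enc, 2, 0); // 1234 / 100 -> 12.34
    System.out.println(enc + " -> " + dec + ", roundtrip=" + (dec == v));
  }
}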


julienledem and others added 5 commits January 22, 2026 08:44
Implements ALP encoding for FLOAT and DOUBLE types, which converts
floating-point values to integers using decimal scaling, then applies
Frame of Reference (FOR) encoding and bit-packing for compression.

New files:
- AlpConstants.java: Constants for ALP encoding
- AlpEncoderDecoder.java: Core encoding/decoding logic
- AlpValuesWriter.java: Writer implementation
- AlpValuesReaderForFloat/Double.java: Reader implementations

Includes comprehensive unit tests and interop test infrastructure.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Restore original comment indentation that was accidentally changed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Escape <= characters as &lt;= in javadoc comments to avoid
malformed HTML errors during documentation generation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ALP encoding is not yet part of the parquet-format Thrift specification,
so it cannot be converted to org.apache.parquet.format.Encoding. Skip it
in the testEnumEquivalence test and add a clear error message in the
converter for when ALP conversion is attempted.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -0,0 +1,136 @@
/*

There was a recent comment on AI-generated code on Arrow; if we aren't doing a lot of edits, I'm not sure how it should be licensed.

// ========== Vector Constants ==========

/** Default number of elements per compressed vector (2^10 = 1024) */
public static final int ALP_VECTOR_SIZE = 1024;

nit: this is the default, but it should be configurable.

public static final int ALP_VECTOR_SIZE = 1024;

/** Log2 of the default vector size */
public static final int ALP_VECTOR_SIZE_LOG = 10;

Same comment on making this configurable.


I think this is less important to make configurable than the vector size. Since we are specifically calling it out as configurable in the spec, we should make sure we generate some data at different bit-widths.

// ========== Sampling Constants ==========

/** Number of values sampled per vector */
public static final int SAMPLER_SAMPLES_PER_VECTOR = 256;

I wonder if these should be configurable somehow? Probably OK if not.

Can these be package-private?


// Try encoding and check for round-trip failure
float multiplier = FLOAT_POW10[exponent];
if (factor > 0) {

Not sure I understand the > 0 check; I need to examine the encoding logic in more detail. Is this just an optimization to avoid dividing by 1?


// Try encoding and check for round-trip failure
double multiplier = DOUBLE_POW10[exponent];
if (factor > 0) {

I think we can just call encodeDouble here?

multiplier /= DOUBLE_POW10[factor];
}

double scaled = value * multiplier;

Maybe just call decode. I imagine JIT inlining should be able to do the optimizations for the lookup.

*/
public static long encodeDouble(double value, int exponent, int factor) {
double multiplier = DOUBLE_POW10[exponent];
if (factor > 0) {

Micro-optimization: would it be faster to just have 1.0 in the array at index 0? (Also, just to check: I believe this is numerically stable, but I'm not 100% sure.)
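
For illustration, the suggestion would reduce the body to something like this (a sketch, assuming DOUBLE_POW10[0] == 1.0; dividing by 1.0 when factor == 0 is exact, so the arithmetic is unchanged):

double multiplier = DOUBLE_POW10[exponent] / DOUBLE_POW10[factor]; // no factor > 0 branch needed
double scaled = value * multiplier;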

* Calculate the bit width needed to store unsigned values up to maxDelta.
*
* @param maxDelta the maximum delta value (unsigned)
* @return the number of bits needed (0-32 for int, 0-64 for long)

long is not covered here.

if (maxDelta == 0) {
return 0;
}
return 32 - Integer.numberOfLeadingZeros(maxDelta);

nit: Integer.SIZE?
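
That is (and likewise Long.SIZE in the long variant below):

return Integer.SIZE - Integer.numberOfLeadingZeros(maxDelta);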

if (maxDelta == 0) {
return 0;
}
return 64 - Long.numberOfLeadingZeros(maxDelta);

nit: Long.SIZE

* @param maxDelta the maximum delta value (unsigned)
* @return the number of bits needed (0-32 for int, 0-64 for long)
*/
public static int bitWidth(int maxDelta) {

I worry about accidental type coercion that might call this method instead of the long overload. Maybe give this one an explicit name.
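
A hypothetical rename along these lines (intBitWidth/longBitWidth are made-up names, not the PR's API):

// Distinct names so an argument can never silently pick an unintended overload.
public static int intBitWidth(int maxDelta) {
  return maxDelta == 0 ? 0 : Integer.SIZE - Integer.numberOfLeadingZeros(maxDelta);
}

public static int longBitWidth(long maxDelta) {
  return maxDelta == 0 ? 0 : Long.SIZE - Long.numberOfLeadingZeros(maxDelta);
}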

* @param value the float value to check
* @return true if the value is an exception
*/
public static boolean isFloatException(float value) {

Please audit the visibility of these methods; it seems most could be at least package-private?

bestFactor = f;
bestExceptions = exceptions;
// Early exit if we found perfect encoding
if (bestExceptions == 0) {

I need to read the paper, but does the order in which we explore the exponents and factors make sense? (Is there any benefit to exploring from largest to smallest?)

if (integerEncoding != ALP_INTEGER_ENCODING_FOR) {
throw new ParquetDecodingException("Unsupported ALP integer encoding: " + integerEncoding);
}


Let's check 0 <= logVectorSize <= 16.
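
A sketch of the suggested check (the & 0xFF read already guarantees the value is non-negative):

if (logVectorSize > 16) { // 0 <= logVectorSize is implied by the & 0xFF mask
  throw new ParquetDecodingException("Invalid ALP log vector size: " + logVectorSize);
}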

int compressionMode = headerBuf.get() & 0xFF;
int integerEncoding = headerBuf.get() & 0xFF;
int logVectorSize = headerBuf.get() & 0xFF;
int numElements = headerBuf.getInt();

check numElements <= valuesCount ?

LOG.debug("init from page at offset {} for length {}", stream.position(), stream.available());

// Read and validate header
ByteBuffer headerBuf = stream.slice(ALP_HEADER_SIZE).order(ByteOrder.LITTLE_ENDIAN);

It looks like this call does a copy? I wonder if we are better off using DataInputStream and reversing the bytes for numElements?
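
A sketch of that alternative (wrapping the page stream in java.io.DataInputStream and assuming the surrounding method can propagate IOException; DataInputStream reads big-endian, so the little-endian int needs a byte swap):

DataInputStream in = new DataInputStream(stream);
int compressionMode = in.readUnsignedByte();
int integerEncoding = in.readUnsignedByte();
int logVectorSize = in.readUnsignedByte();
int numElements = Integer.reverseBytes(in.readInt()); // little-endian on the wire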

this.currentIndex = 0;

// Read AlpInfo array
ByteBuffer alpInfoBuf = stream.slice(ALP_INFO_SIZE * numVectors).order(ByteOrder.LITTLE_ENDIAN);

Same comment as above on avoiding the slice.

}

// Read ForInfo array
ByteBuffer forInfoBuf = stream.slice(DOUBLE_FOR_INFO_SIZE * numVectors).order(ByteOrder.LITTLE_ENDIAN);

We might as well slice this together with AlpInfo if we keep the slices.

}
// else: all deltas are 0 (bitWidth == 0)

// Reverse FOR and decode

nit: spell out FOR (Frame of Reference)?

* Note: This method reads from the buffer's current position, using absolute indexing
* relative to that position.
*/
private void unpackLongs(ByteBuffer packed, long[] values, int count, int bitWidth) {

Shouldn't there already be a method somewhere that does this?

bitWidths[v] = forInfoBuf.get() & 0xFF;
}

// Decode each vector

Do we really want to do this all up-front? Don't we want to decode only one vector at a time?

@@ -0,0 +1,201 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one

Not reviewing this directly, but I think all the comments on the double version apply here too.

public FloatAlpValuesWriter(int initialCapacity, int pageSize, ByteBufferAllocator allocator) {
super(initialCapacity, pageSize, allocator);
// Initial buffer size - will grow as needed
this.buffer = new float[Math.max(ALP_VECTOR_SIZE, initialCapacity / Float.BYTES)];

I don't think I understand this logic.


This should just be pageSize?


@Override
public void skip(int n) {
if (n < 0 || currentIndex + n > totalCount) {

When we move to incremental decoding, this should skip by fast-forwarding, avoiding decoding vector data for intermediate results.
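
A sketch of that skip-forwarding idea (skipPackedVector and decodeCurrentVector are hypothetical helpers; the vector bookkeeping fields are assumed):

int target = currentIndex + n;
while (target >= currentVectorEnd) {
  skipPackedVector();          // step over the bit-packed bytes without decoding
  currentVectorEnd += ALP_VECTOR_SIZE;
}
decodeCurrentVector();         // materialize only the vector we land in
currentIndex = target;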


@Override
public long getBufferedSize() {
// Estimate: each float value contributes roughly 2-4 bytes after compression

I think we should be flushing incrementally to byte buffers for the encoded data to get a more accurate value. For unflushed data, the size heuristic is maybe OK for an initial version.


@Override
public void writeFloat(float v) {
if (count >= buffer.length) {

I don't think we should be growing the buffer here to hold all the floats. Once we reach the vector size, we should be encoding the page.
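
A sketch of that behavior (encodeVector is a hypothetical helper that flushes one vector's worth of values to the page bytes):

@Override
public void writeFloat(float v) {
  buffer[count++] = v;
  if (count == ALP_VECTOR_SIZE) {
    encodeVector(buffer, count); // encode eagerly instead of growing the buffer
    count = 0;
  }
}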

int vectorLen = vectorEnd - vectorStart;

// Find best encoding parameters
AlpEncoderDecoder.EncodingParams params =

Don't we want to cache encodings and use those first (and stop searching after a while)?
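
A sketch of the caching idea (lastParams, roundTrips, and findBestParams are hypothetical names, not the PR's API):

// Try the previously successful parameters first; fall back to a full search.
AlpEncoderDecoder.EncodingParams params =
    (lastParams != null && roundTrips(values, vectorStart, vectorLen, lastParams))
        ? lastParams
        : findBestParams(values, vectorStart, vectorLen);
lastParams = params;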

@emkornfield left a comment

I haven't looked at the tests, but I left a bunch of comments. The most important one is that we should be reading and writing incrementally, not decoding everything at once, I think.

For incremental reads, I left a comment on the spec doc, but I think we should probably be interleaving the headers on writes to avoid unnecessary buffering.
