GH-343: Fix ListVector offset buffer not properly serialized for nested empty arrays #967
Conversation
@Yicong-Huang can you please rebase the PR? Thanks!
Force-pushed from 8b09237 to 7dbdcc4
vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java (three outdated review comments, resolved)
I think you are referring to the writers in Spark. That is out of scope here and not related to the root cause; we should update the description to explain the issue clearly. The offset buffers are actually allocated properly, but during IPC serialization they are ignored:

    public long readableBytes() {
        return writerIndex - readerIndex;
    }

So when ListVector.setReaderAndWriterIndex() sets writerIndex(0) and readerIndex(0), readableBytes() returns 0 - 0 = 0. Then when MessageSerializer.writeBatchBuffers() calls WriteChannel.write(buffer), it writes 0 bytes. So the flow is: setReaderAndWriterIndex() sets both indices to 0, readableBytes() therefore returns 0, and writeBatchBuffers() serializes 0 bytes for the offset buffer.
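The index arithmetic above can be modeled with a minimal sketch. This is plain Java standing in for Arrow's ArrowBuf (which tracks readerIndex/writerIndex internally); the class name and constructor are hypothetical, only the readableBytes() formula mirrors the snippet quoted above.

```java
// Minimal stand-in for ArrowBuf's reader/writer index bookkeeping.
// Not Arrow's real class; it only models readableBytes() = writerIndex - readerIndex.
public class ReadableBytesDemo {
    final long readerIndex;
    final long writerIndex;

    ReadableBytesDemo(long readerIndex, long writerIndex) {
        this.readerIndex = readerIndex;
        this.writerIndex = writerIndex;
    }

    long readableBytes() {
        return writerIndex - readerIndex;
    }

    public static void main(String[] args) {
        // setReaderAndWriterIndex() with valueCount == 0 set both indices to 0,
        // so the serializer saw nothing to write:
        System.out.println(new ReadableBytesDemo(0, 0).readableBytes()); // 0

        // With writerIndex advanced past offset[0] (4 bytes for ListVector),
        // the buffer is visible to the serializer again:
        System.out.println(new ReadableBytesDemo(0, 4).readableBytes()); // 4
    }
}
```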
    validityBuffer.writerIndex(BitVectorHelper.getValidityBufferSizeFromCount(valueCount));
    offsetBuffer.writerIndex((valueCount + 1) * OFFSET_WIDTH);
    }
    validityBuffer.writerIndex(BitVectorHelper.getValidityBufferSizeFromCount(valueCount));
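A sketch of the corrected offset arithmetic in the diff above, pulled out as a hypothetical standalone method (the real logic lives inside setReaderAndWriterIndex(); OFFSET_WIDTH is 4 for ListVector and 8 for LargeListVector):

```java
// Sketch of the fixed offset writerIndex computation, not the actual Arrow method.
public class OffsetIndexSketch {
    static final int OFFSET_WIDTH = 4; // ListVector; LargeListVector uses 8

    // The offset buffer always holds (valueCount + 1) entries, so even an
    // empty vector keeps offset[0] == 0 inside the readable region.
    static long offsetWriterIndex(int valueCount) {
        return (long) (valueCount + 1) * OFFSET_WIDTH;
    }

    public static void main(String[] args) {
        System.out.println(offsetWriterIndex(0)); // 4: offset[0] is still serialized
        System.out.println(offsetWriterIndex(3)); // 16: offsets [0..3]
    }
}
```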
When valueCount == 0, I think the validity buffer writer index should be validityBuffer.writerIndex(0);.
I believe when valueCount == 0, BitVectorHelper.getValidityBufferSizeFromCount(valueCount) also returns 0, so it is equivalent. The current version might be simpler, but if you prefer an if branch to handle it separately, I can apply that too.
I added the if branch back to handle the valueCount == 0 case, but I still think it is not necessary?
Oh, if BitVectorHelper.getValidityBufferSizeFromCount(valueCount) returns 0 for valueCount == 0, then it is okay.
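The equivalence discussed above rests on the validity-buffer size formula. As a sketch, assuming getValidityBufferSizeFromCount computes ceil(valueCount / 8) (one validity bit per value, rounded up to whole bytes; this method name is from the diff, the formula here is an assumption):

```java
// Models the assumed behavior of BitVectorHelper.getValidityBufferSizeFromCount:
// ceil(valueCount / 8) bytes, one bit per value.
public class ValiditySizeSketch {
    static int validityBufferSize(int valueCount) {
        return (valueCount + 7) / 8;
    }

    public static void main(String[] args) {
        System.out.println(validityBufferSize(0)); // 0: same as writerIndex(0)
        System.out.println(validityBufferSize(1)); // 1
        System.out.println(validityBufferSize(9)); // 2
    }
}
```

Under this assumption the general expression and the explicit valueCount == 0 branch set the same writer index, which is why the if branch is redundant.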
vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
    // Allocate outer only - simulates case where inner is never written to
    outerList.allocateNew();
    outerList.setValueCount(0);
I think innerList should also call allocateNew? Not being allocated is different from not being written.
added
    // Only allocate level0 - simulates case where all nested levels are empty
    level0.allocateNew();
    level0.setValueCount(0);
ditto
added!
    try (LargeListVector outerList = LargeListVector.empty("outer", allocator)) {
      // Setup LargeList<LargeList<Int>>
      outerList.addOrGetVector(FieldType.nullable(MinorType.LARGELIST.getType()));
      LargeListVector innerList = (LargeListVector) outerList.getDataVector();
      innerList.addOrGetVector(FieldType.nullable(MinorType.INT.getType()));
I looked at these tests again. I think this nested structure is not necessary for the unit tests here; it only matters for our usage on the Spark side (in the ArrowWriters). Here we just need to make sure that a ListVector/LargeListVector has a meaningful and correct readableBytes value after it is allocated.
Maybe we can simplify these tests.
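The simplified assertion the comment suggests could look like the sketch below. It checks the expected offset-buffer byte counts with the modeled arithmetic rather than a real ListVector, so it stays self-contained; the class and method names are hypothetical, and the expected sizes follow the (valueCount + 1) * OFFSET_WIDTH formula from the diff.

```java
// Sketch of a simplified test: an empty list vector's offset buffer must still
// expose one offset entry (offset[0]) to the IPC serializer.
public class EmptyListOffsetTest {
    // Expected readableBytes for the offset buffer after setValueCount(valueCount).
    static long expectedOffsetReadableBytes(int valueCount, int offsetWidth) {
        return (long) (valueCount + 1) * offsetWidth;
    }

    public static void main(String[] args) {
        // Empty ListVector: OFFSET_WIDTH = 4, so 4 readable bytes, not 0.
        System.out.println(expectedOffsetReadableBytes(0, 4)); // 4
        // Empty LargeListVector: OFFSET_WIDTH = 8, so 8 readable bytes.
        System.out.println(expectedOffsetReadableBytes(0, 8)); // 8
    }
}
```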
viirya left a comment:
What's Changed
Fix ListVector/LargeListVector IPC serialization when valueCount is 0.

Problem
When valueCount == 0, setReaderAndWriterIndex() was setting offsetBuffer.writerIndex(0), which means readableBytes() == 0. The IPC serializer uses readableBytes() to determine buffer size, so 0 bytes were written to the IPC stream. This crashes IPC readers in other libraries because the Arrow spec requires the offset buffer to have at least one entry (offset[0]).
@viirya:

Fix
Simplify setReaderAndWriterIndex() to always use (valueCount + 1) * OFFSET_WIDTH for the offset buffer's writerIndex. When valueCount == 0, this correctly sets writerIndex to OFFSET_WIDTH, ensuring offset[0] is included in serialization.

Testing
Added tests for nested empty lists verifying that the offset buffer has the correct readableBytes().

Closes #343.
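To make the spec requirement concrete: in the Arrow columnar format, a variable-size list array's offsets buffer has length + 1 entries, so even a zero-length array carries one entry. The sketch below computes the expected buffer byte counts for an empty List<Int> under that layout (assumed sizes derived from the format, not output measured from Arrow itself):

```java
// Expected buffer sizes for an empty List<Int> under the Arrow columnar format:
// validity = ceil(len / 8) bytes, offsets = (len + 1) * 4 bytes (32-bit offsets).
public class EmptyListBufferSizes {
    static long validityBytes(int len) { return (len + 7) / 8; }
    static long offsetBytes(int len)   { return (long) (len + 1) * 4; }

    public static void main(String[] args) {
        System.out.println(validityBytes(0)); // 0: empty validity buffer is fine
        System.out.println(offsetBytes(0));   // 4: offset[0] must survive IPC, not be truncated to 0
    }
}
```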