[SYSTEMDS-3779] Add ColGroupDDCLZW with LZW-compressed MapToData #2398
Conversation
…extending on APreAgg like ColGroupDDC for easier implementation. Idea: store only the compressed version of the _data vector plus important metadata. If decompression is needed, we reconstruct the _data vector from the metadata and the compressed _data vector. Decompression takes place at most once. This is just an idea, and there are other ways of implementing it.
…ng of Decompress
…and decompress and the data structures it uses compatible.
…lgorithms and tried to make them compatible.
…DC test for ColGroupDDCTest. Improved compress/decompress methods in LZW class.
…lemented from its interface.
…mapping

This commit adds an initial implementation of ColGroupDDCLZW, a new column group that stores the mapping vector in LZW-compressed form instead of materializing MapToData explicitly. The design focuses on enabling sequential access directly on the compressed representation, while complex access patterns are intended to fall back to DDC. No cache or lazy decompression mechanism is introduced at this stage.
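As a rough, self-contained illustration of the idea described in this commit (class and method names here are hypothetical, not the PR's actual code), a minimal LZW compress/decompress round trip over a mapping vector of small integer indices could look like this:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of LZW over a mapping vector of dictionary indices.
// Names are illustrative only, not SystemDS API.
public class LzwMappingSketch {

	// nUnique distinct indices seed the LZW dictionary with codes 0..nUnique-1.
	// Symbols are encoded as chars for simple string-keyed lookup (sketch only).
	public static List<Integer> compress(int[] data, int nUnique) {
		Map<String, Integer> dict = new HashMap<>();
		for(int i = 0; i < nUnique; i++)
			dict.put(String.valueOf((char) i), i);
		int nextCode = nUnique;
		List<Integer> out = new ArrayList<>();
		String w = "";
		for(int symbol : data) {
			String wk = w + (char) symbol;
			if(dict.containsKey(wk))
				w = wk; // extend the current phrase
			else {
				out.add(dict.get(w)); // emit code for the longest known phrase
				dict.put(wk, nextCode++); // learn the new phrase
				w = String.valueOf((char) symbol);
			}
		}
		if(!w.isEmpty())
			out.add(dict.get(w));
		return out;
	}

	public static int[] decompress(List<Integer> codes, int nUnique, int outLen) {
		List<int[]> dict = new ArrayList<>();
		for(int i = 0; i < nUnique; i++)
			dict.add(new int[] {i});
		int[] out = new int[outLen];
		int pos = 0;
		int[] w = null;
		for(int code : codes) {
			int[] entry;
			if(code < dict.size())
				entry = dict.get(code);
			else { // cScSc corner case: code refers to the entry being built
				entry = java.util.Arrays.copyOf(w, w.length + 1);
				entry[w.length] = w[0];
			}
			for(int s : entry)
				out[pos++] = s;
			if(w != null) { // learn: previous entry + first symbol of current
				int[] learned = java.util.Arrays.copyOf(w, w.length + 1);
				learned[w.length] = entry[0];
				dict.add(learned);
			}
			w = entry;
		}
		return out;
	}
}
```

The round trip reconstructs the original mapping exactly; repeated patterns shrink because learned phrases are emitted as single codes.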
…press(). Decompress will now return an empty map if the index is zero.
Thank you for the PR. I left some comments in the code.
In general, please use tabs instead of spaces to make the diff more readable (can be done by importing the codestyle xml). It would be good if we are able to create the column group similar to this:
CompressionSettingsBuilder csb = new CompressionSettingsBuilder().setSamplingRatio(1.0)
	.setValidCompressions(EnumSet.of(AColGroup.CompressionType.DDCLZW))
	.setTransposeInput("false");
CompressionSettings cs = csb.create();
final CompressedSizeInfoColGroup cgi = new ComEstExact(mbt, cs).getColGroupInfo(colIndexes);
CompressedSizeInfo csi = new CompressedSizeInfo(cgi);
AColGroup cg = ColGroupFactory.compressColGroups(mbt, csi, cs, 1).get(0);

So corresponding features / methods to support this should be implemented.
src/main/java/org/apache/sysds/runtime/compress/colgroup/ColGroupDDCLZW.java
In general, all methods that are not properly implemented should throw a NotImplementedException.
Also, you should implement some of the operations that can be done on the compressed representation (e.g., scalar ops, unary, ...). Further, getExactSizeOnDisk() should be implemented.
src/main/java/org/apache/sysds/runtime/compress/colgroup/ColGroupDDCLZW.java
@Override
protected int[] getCounts(int[] out) {
	return new int[0];
}
If not properly implemented, throw NotImplementedException
if (_nUnique <= dict.getNumberOfValues(colIndexes.size()))
	throw new DMLCompressionException("Invalid map to dict Map has:" + _nUnique + " while dict has "
		+ dict.getNumberOfValues(colIndexes.size()));
Incorrect
All implemented methods must be covered by tests
@Override
protected void decompressToDenseBlockDenseDictionary(DenseBlock db, int rl, int ru, int offR, int offC, double[] values) {
}
Decompression to dense block should be supported
Please add some more tests to really verify correctness. For example, you should do a full compression and then decompress it again. The result should then be compared to the original data.
src/main/java/org/apache/sysds/runtime/compress/colgroup/ColGroupDDC.java
src/test/java/org/apache/sysds/test/component/compress/colgroup/ColGroupDDCTest.java
…GroupDDCTest back to correct formatting. Added LZWMappingIterator to decompress values on the fly without having to allocate the full decompressed mapping [WIP]. Added test class ColGroupDDCLZWTest.
Signed-off-by: Luka Dekanozishvili <luka.dekanozishvili1@gmail.com>
Added new unit tests for ColGroupDDCLZW (they're subject to change and only an initial draft). They might include redundant/unnecessary checks. The rest of the methods are also untested; I'll do that later and possibly refactor the helper functions for the tests.
…ded decompressToDenseBlockDenseDictionary [WIP]; needs to be tested further. Added fallbacks to DDC for various functions. Added scalar and unary ops and various other simple methods from DDC.
…erns. Added append and appendNInternal, recompress and various other functions that needed to be implemented. No tests yet.
Baunsgaard
left a comment
Good progress, I have left some comments.
I would love to see some performance numbers.
src/main/java/org/apache/sysds/runtime/compress/colgroup/ColGroupDDC.java
import org.apache.sysds.runtime.matrix.data.MatrixBlock;

public abstract class DDCLZWScheme extends DDCScheme {
	// TODO: private int nUnique; too data-specific, is it even useful?
Probably not so meaningful to implement a specialization for the Scheme class.
The main goal of the Scheme classes is serialization and applying similar schemes to other groups. For the LZW project, it is out of scope, so in my opinion you can ignore all Scheme parts.
return (((long) prefixCode) << 32) | (nextSymbol & 0xffffffffL);
}

// Compresses a mapping (AMapToData) into an LZW-compressed byte/integer/? array.
You probably want to compress into a byte[] array or, if you want to bit-shift a bit, pack into a long[] array.
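The long[] variant the reviewer mentions could be sketched as follows (a hedged, standalone illustration with hypothetical names, not the PR's code): fixed-width codes are packed back to back into 64-bit words, including codes that straddle a word boundary.

```java
// Sketch: pack fixed-width codes (bits <= 32) densely into a long[].
// Names are illustrative only.
public class CodePacker {

	public static long[] pack(int[] codes, int bits) {
		long[] packed = new long[(codes.length * bits + 63) / 64];
		for(int i = 0; i < codes.length; i++) {
			long bitPos = (long) i * bits;
			int word = (int) (bitPos >>> 6); // bitPos / 64
			int off = (int) (bitPos & 63);   // bitPos % 64
			packed[word] |= ((long) codes[i]) << off;
			if(off + bits > 64) // code spans two words: spill the high bits
				packed[word + 1] |= ((long) codes[i]) >>> (64 - off);
		}
		return packed;
	}

	public static int unpack(long[] packed, int i, int bits) {
		long bitPos = (long) i * bits;
		int word = (int) (bitPos >>> 6);
		int off = (int) (bitPos & 63);
		long v = packed[word] >>> off;
		if(off + bits > 64) // pull in the spilled high bits
			v |= packed[word + 1] << (64 - off);
		return (int) (v & ((1L << bits) - 1));
	}
}
```

A fixed code width keeps random access O(1); a variable-width LZW code stream would instead need sequential decoding or a side index.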
src/main/java/org/apache/sysds/runtime/compress/colgroup/ColGroupDDCLZW.java
src/main/java/org/apache/sysds/runtime/compress/colgroup/ColGroupDDCLZW.java
}

@Override
public void leftMultByMatrixNoPreAgg(MatrixBlock matrix, MatrixBlock result, int rl, int ru, int cl, int cu) {
This is the cool one to support! It is a bit hard, but will probably pay off with LZW.
You can keep a soft reference to a hashmap mapping different rl values to offsets into your data structure. That would make it possible to skip the initial scan up to rl. Furthermore, the hashmap's growth would be limited, since the callers of these rl interfaces are typically bounded by the number of CPU cores. You can use the same trick in other functions where you scan until rl.
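The soft-reference trick suggested above could look roughly like this (a sketch under the assumption that an int offset into the compressed stream is enough to resume decoding at row rl; class and method names are hypothetical):

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: cache the decoder offset reached at a given row lower bound rl,
// so parallel callers need not rescan the LZW stream from the start.
// A SoftReference lets the JVM drop the cache under memory pressure.
public class RowOffsetCache {

	private SoftReference<Map<Integer, Integer>> cache = new SoftReference<>(null);

	// Returns the cached stream offset for rl, or null on a miss.
	public synchronized Integer getOffset(int rl) {
		Map<Integer, Integer> m = cache.get();
		return m == null ? null : m.get(rl);
	}

	// Records the offset reached after scanning up to rl.
	public synchronized void putOffset(int rl, int offset) {
		Map<Integer, Integer> m = cache.get();
		if(m == null) { // cache was cleared (or never built): recreate it
			m = new ConcurrentHashMap<>();
			cache = new SoftReference<>(m);
		}
		m.put(rl, offset);
	}
}
```

As the reviewer notes, the map stays small in practice because the distinct rl values come from a thread pool bounded by the core count.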
src/main/java/org/apache/sysds/runtime/compress/colgroup/scheme/DDCLZWSchemeMC.java
…mapping sequentially. reverted ColGroupDDC formatting again. Reverted CompressedSizeInfoColGroup formatting and adding DDCLZW part for testing. Added various tests for which functionality in the testing pipeline need to be added in order to work.
Signed-off-by: Luka Dekanozishvili <luka.dekanozishvili1@gmail.com>
Added a few benchmarks that mostly compare memory as well as operation times for methods (so far, only for …). Right now, the comparison is only done for …. There are sizable memory savings for datasets with repeating patterns or large datasets:

================================================================================
Benchmark: benchmarkRandomData
================================================================================
Size:      1 | DDC:     61 bytes | DDCLZW:    67 bytes | Memory reduction:  -9.84% | De-/Compression speedup: 0.09/0.00 times
Size:     10 | DDC:     70 bytes | DDCLZW:    95 bytes | Memory reduction: -35.71% | De-/Compression speedup: 0.04/0.00 times
Size:    100 | DDC:    160 bytes | DDCLZW:   299 bytes | Memory reduction: -86.87% | De-/Compression speedup: 0.01/0.00 times
Size:   1000 | DDC:   1060 bytes | DDCLZW:  1551 bytes | Memory reduction: -46.32% | De-/Compression speedup: 0.00/0.00 times
Size:  10000 | DDC:  10060 bytes | DDCLZW: 10487 bytes | Memory reduction:  -4.24% | De-/Compression speedup: 0.00/0.00 times
Size: 100000 | DDC: 100060 bytes | DDCLZW: 78783 bytes | Memory reduction:  21.26% | De-/Compression speedup: 0.00/0.00 times

I also added the …. I also added a benchmark for the slides, but it doesn't look too useful at the moment:
Benchmark: benchmarkSlice
================================================================================
Size: 1 | Slice[ 0: 0] | DDC: 0 ms | DDCLZW: 1 ms | Slowdown: 37.09 times
Size: 10 | Slice[ 2: 7] | DDC: 0 ms | DDCLZW: 20 ms | Slowdown: 1141.72 times
Size: 100 | Slice[ 25: 75] | DDC: 0 ms | DDCLZW: 3 ms | Slowdown: 169.34 times
Size: 1000 | Slice[ 250: 750] | DDC: 0 ms | DDCLZW: 3 ms | Slowdown: 348.98 times
Size: 10000 | Slice[ 2500: 7500] | DDC: 0 ms | DDCLZW: 6 ms | Slowdown: 483.40 times
Size: 100000 | Slice[25000:75000] | DDC: 0 ms | DDCLZW: 24 ms | Slowdown: 325.22 timesThe file might be in a wrong directory as well and wrongly labeled as a "test". We wouldn't want benchmarks running on every GitHub Actions trigger etc. Would it make more sense to refactor it into a |
@LukaDeka
…cheme according to guidelines.
Status update: Many methods that operate sequentially on the original mapping have been implemented using partial, on-the-fly decoding of the compressed LZW mapping via an iterator. Methods with more complex or non-sequential access patterns are not yet handled in this way (for example leftMultByMatrixNoPreAgg) and currently fall back to DDC. These will be addressed in follow-up work.

Most decompression paths now rely on partial decoding of the LZW mapping rather than full materialization. Scalar and unary operations have also been implemented. Several previously reported issues have been fixed. I have reverted the unintended formatting changes in the affected files and ensured alignment with the existing code style.

I will continue working on the remaining improvements suggested by @Baunsgaard and @janniklinde. What is still missing at this point are more dedicated tests for the individual methods to ensure correctness, which @LukaDeka is working on. Thanks for the detailed feedback and reviews, they were very helpful!
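The iterator-based partial decoding described in this status update could be sketched as follows (a standalone illustration with hypothetical names, not the PR's LZWMappingIterator): it streams one mapping symbol per next() call from an LZW code stream, keeping only the phrase dictionary and the current entry in memory instead of the full decompressed vector.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Sketch: lazy, sequential decoding of an LZW code stream over a mapping
// of dictionary indices. Names are illustrative only.
public class LzwSymbolIterator implements Iterator<Integer> {

	private final int[] codes;
	private final List<int[]> dict = new ArrayList<>();
	private int codeIdx = 0;
	private int[] current = new int[0]; // symbols of the current entry
	private int posInCurrent = 0;
	private int[] prev = null;

	public LzwSymbolIterator(int[] codes, int nUnique) {
		this.codes = codes;
		for(int i = 0; i < nUnique; i++) // seed base symbols 0..nUnique-1
			dict.add(new int[] {i});
	}

	@Override
	public boolean hasNext() {
		return posInCurrent < current.length || codeIdx < codes.length;
	}

	@Override
	public Integer next() {
		if(posInCurrent >= current.length)
			advance();
		return current[posInCurrent++];
	}

	// Decode the next code into the current entry, growing the dictionary.
	private void advance() {
		if(codeIdx >= codes.length)
			throw new NoSuchElementException();
		int code = codes[codeIdx++];
		int[] entry;
		if(code < dict.size())
			entry = dict.get(code);
		else // cScSc corner case: code refers to the entry being built
			entry = extend(prev, prev[0]);
		if(prev != null) // learn: previous entry + first symbol of current
			dict.add(extend(prev, entry[0]));
		prev = entry;
		current = entry;
		posInCurrent = 0;
	}

	private static int[] extend(int[] w, int s) {
		int[] r = java.util.Arrays.copyOf(w, w.length + 1);
		r[w.length] = s;
		return r;
	}
}
```

Sequential consumers (decompress-to-block, row-wise aggregates) can drive such an iterator directly; random access still needs either a fallback to DDC or a cached offset as discussed above.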
When you process some of the comments, feel free to mark them as resolved!

I wanted to before, but I think I don't have the permission on GitHub to do that. Not sure if Florian has it.

Alternatively, if you do not have permissions, make a comment saying "resolved". Then when we go through the PR, it is cleaner.
… it into the compression pipeline and serialization framework.

Summary
This PR introduces a new column group ColGroupDDCLZW that stores the mapping vector in LZW-compressed form.

Key design points
- MapToData is not stored explicitly; only the compressed LZW representation is kept.
- Sequential access operates directly on the compressed _dataLZW without full decompression.

Current status
Feedback on design and integration is very welcome.