Skip to content

Conversation

@florian-jobs

Summary

This PR introduces a new column group ColGroupDDCLZW that stores the mapping vector in LZW-compressed form.

Key design points

  • MapToData is not stored explicitly; only the compressed LZW representation is kept.
  • Operations that allow sequential access operate directly on _dataLZW without full decompression (see the decoding sketch after this list).
  • For complex or random-access patterns, the implementation falls back to DDC (uncompressed).
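A minimal sketch of what sequential decoding over _dataLZW could look like, i.e. a textbook LZW decoder specialized to integer map values (the code layout, dictionary seeding, and class shape here are assumptions, not this PR's actual implementation):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hedged sketch: sequential decoder over the LZW-compressed mapping. */
final class LZWMappingIterator {
	private final int[] codes; // assumed: one int LZW code per entry of _dataLZW
	private final List<int[]> dict = new ArrayList<>(); // code -> decoded run of map values
	private int pos = 0; // next code to read
	private int[] cur; // run currently being emitted
	private int curIdx = 0;

	LZWMappingIterator(int[] codes, int nUnique) {
		this.codes = codes;
		for(int v = 0; v < nUnique; v++)
			dict.add(new int[] {v}); // initial dictionary: one entry per distinct value
	}

	boolean hasNext() {
		return (cur != null && curIdx < cur.length) || pos < codes.length;
	}

	int next() {
		if(cur == null || curIdx == cur.length) {
			final int code = codes[pos++];
			final int[] entry;
			if(code < dict.size())
				entry = dict.get(code);
			else { // special LZW case: the code refers to the entry being built
				entry = Arrays.copyOf(cur, cur.length + 1);
				entry[cur.length] = cur[0];
			}
			if(cur != null) { // grow dictionary: previous run + first value of this run
				final int[] e = Arrays.copyOf(cur, cur.length + 1);
				e[cur.length] = entry[0];
				dict.add(e);
			}
			cur = entry;
			curIdx = 0;
		}
		return cur[curIdx++];
	}
}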

Current status

  • Core data structure and compression/decompression are in place.
  • Work in progress on operations that can be implemented via sequential decoding without full materialization.
  • Work in progress on performance.

Feedback on design and integration is very welcome.

florian-jobs and others added 14 commits January 7, 2026 13:39
…extending on APreAgg like ColGroupDDC for easier implementation. Idea: store only the compressed version of the _data vector and important metadata. If decompression is needed, we reconstruct the _data vector using the metadata and the compressed _data vector. Decompression takes place at most once. This is just an idea and there are other ways of implementing it.
 * - DDCLZW stores the mapping vector exclusively in compressed form.
 * - No persistent MapToData cache is maintained.
 * - Sequential operations decode on-the-fly, while operations requiring random access explicitly materialize and fall back to DDC.
 */
…and decompress and the data structures they use compatible.
…DC test for ColGroupDDCTest. Improved compress/decompress methods in LZW class.
…mapping

This commit adds an initial implementation of ColGroupDDCLZW, a new column
group that stores the mapping vector in LZW-compressed form instead of
materializing MapToData explicitly.

The design focuses on enabling sequential access directly on the compressed
representation, while complex access patterns are intended to fall back to
DDC. No cache or lazy decompression mechanism is introduced at this stage.
@github-project-automation github-project-automation bot moved this to In Progress in SystemDS PR Queue Jan 13, 2026
@florian-jobs florian-jobs changed the title Add ColGroupDDCLZW with LZW-compressed MapToData [SYSTEMDS-3779] Add ColGroupDDCLZW with LZW-compressed MapToData Jan 13, 2026
@janniklinde janniklinde self-requested a review January 16, 2026 08:26
…press(). Decompress will now return an empty map if the index is zero.
Contributor

@janniklinde janniklinde left a comment

Thank you for the PR. I left some comments in the code.

In general, please use tabs instead of spaces to make the diff more readable (this can be done by importing the codestyle xml). It would be good if we were able to create the column group similarly to this:

CompressionSettingsBuilder csb = new CompressionSettingsBuilder().setSamplingRatio(1.0)
	.setValidCompressions(EnumSet.of(AColGroup.CompressionType.DDCLZW))
	.setTransposeInput("false");
CompressionSettings cs = csb.create();

final CompressedSizeInfoColGroup cgi = new ComEstExact(mbt, cs).getColGroupInfo(colIndexes);
CompressedSizeInfo csi = new CompressedSizeInfo(cgi);
AColGroup cg = ColGroupFactory.compressColGroups(mbt, csi, cs, 1).get(0);

So corresponding features / methods to support this should be implemented.

Contributor

In general, all methods that are not properly implemented should throw a NotImplementedException.
Also, you should implement some of the operations that can be done on the compressed representation (e.g., scalar ops, unary ops, ...). Further, getExactSizeOnDisk() should be implemented; a sketch follows below.
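For getExactSizeOnDisk(), a minimal sketch under the assumption that the serialized layout is a small header plus one int per LZW code (field names follow this PR; the super call covers the shared column-group payload as in ColGroupDDC):

@Override
public long getExactSizeOnDisk() {
	long ret = super.getExactSizeOnDisk(); // column indexes, dictionary, etc.
	ret += 4; // _nUnique
	ret += 4; // length of the LZW code array
	ret += 4L * _dataLZW.length; // codes, assuming one int per code
	return ret;
}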

Comment on lines 622 to 625
@Override
protected int[] getCounts(int[] out) {
return new int[0];
}
Contributor

If not properly implemented, throw NotImplementedException
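A minimal sketch of the requested change, assuming the NotImplementedException already used throughout SystemDS:

@Override
protected int[] getCounts(int[] out) {
	throw new NotImplementedException();
}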

Comment on lines 321 to 323
if (_nUnique <= dict.getNumberOfValues(colIndexes.size()))
throw new DMLCompressionException("Invalid map to dict Map has:" + _nUnique + " while dict has "
+ dict.getNumberOfValues(colIndexes.size()));
Contributor

Incorrect
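Presumably the comparison is inverted; a hedged guess at the intended check (the review only flags the original as incorrect) would throw when the map references more values than the dictionary actually holds:

if(_nUnique > dict.getNumberOfValues(colIndexes.size()))
	throw new DMLCompressionException("Invalid map to dict. Map has: " + _nUnique + " while dict has "
		+ dict.getNumberOfValues(colIndexes.size()));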

Contributor

All implemented methods must be covered by tests

Comment on lines 497 to 500
@Override
protected void decompressToDenseBlockDenseDictionary(DenseBlock db, int rl, int ru, int offR, int offC, double[] values) {

}
Contributor

Decompression to dense block should be supported
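A hedged sketch of how this could look via sequential decoding (the DenseBlock accessors mirror ColGroupDDC; LZWMappingIterator is the iterator added in this PR, though its exact API and the skip-to-rl scan are assumptions):

@Override
protected void decompressToDenseBlockDenseDictionary(DenseBlock db, int rl, int ru, int offR, int offC, double[] values) {
	final int nCol = _colIndexes.size();
	final LZWMappingIterator it = new LZWMappingIterator(_dataLZW, _nUnique);
	for(int r = 0; r < rl; r++)
		it.next(); // sequential skip until rl; LZW offers no random access
	for(int r = rl; r < ru; r++) {
		final double[] c = db.values(offR + r);
		final int off = db.pos(offR + r) + offC;
		final int vr = it.next() * nCol;
		for(int j = 0; j < nCol; j++)
			c[off + _colIndexes.get(j)] += values[vr + j];
	}
}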

@github-project-automation github-project-automation bot moved this from In Progress to In Review in SystemDS PR Queue Jan 16, 2026
@janniklinde
Contributor

Please add some more tests to really verify correctness. For example, you should do a full compression and then decompress it again, comparing the result to the original data.
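A hedged sketch of such a round-trip test, reusing the creation path suggested earlier in this review (the decompression call and comparison helper are assumptions about the final API):

import java.util.EnumSet;
import org.apache.sysds.runtime.compress.CompressionSettings;
import org.apache.sysds.runtime.compress.CompressionSettingsBuilder;
import org.apache.sysds.runtime.compress.colgroup.AColGroup;
import org.apache.sysds.runtime.compress.colgroup.ColGroupFactory;
import org.apache.sysds.runtime.compress.colgroup.indexes.ColIndexFactory;
import org.apache.sysds.runtime.compress.colgroup.indexes.IColIndex;
import org.apache.sysds.runtime.compress.estim.ComEstExact;
import org.apache.sysds.runtime.compress.estim.CompressedSizeInfo;
import org.apache.sysds.runtime.matrix.data.MatrixBlock;
import org.apache.sysds.test.TestUtils;
import org.junit.Test;

public class ColGroupDDCLZWRoundTripTest {
	@Test
	public void testCompressDecompressRoundTrip() {
		// dense random input; values need not repeat for a correctness check
		MatrixBlock mb = TestUtils.generateTestMatrixBlock(1000, 3, 0, 5, 1.0, 7);
		CompressionSettings cs = new CompressionSettingsBuilder().setSamplingRatio(1.0)
			.setValidCompressions(EnumSet.of(AColGroup.CompressionType.DDCLZW))
			.setTransposeInput("false").create();
		IColIndex colIndexes = ColIndexFactory.create(mb.getNumColumns());
		CompressedSizeInfo csi = new CompressedSizeInfo(new ComEstExact(mb, cs).getColGroupInfo(colIndexes));
		AColGroup cg = ColGroupFactory.compressColGroups(mb, csi, cs, 1).get(0);
		// decompress into a fresh dense block and compare against the original
		MatrixBlock ret = new MatrixBlock(mb.getNumRows(), mb.getNumColumns(), false);
		ret.allocateDenseBlock();
		cg.decompressToDenseBlock(ret.getDenseBlock(), 0, mb.getNumRows(), 0, 0);
		ret.recomputeNonZeros();
		TestUtils.compareMatricesBitAvgDistance(mb, ret, 0, 0, "LZW round-trip mismatch");
	}
}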

florian-jobs and others added 4 commits January 16, 2026 16:26
…GroupDDCTest back to correct formatting. Added LZWMappingIterator to decompress values on the fly without having to allocate the full compression map [WIP]. Added test class ColGroupDDCLZWTest.
Signed-off-by: Luka Dekanozishvili <luka.dekanozishvili1@gmail.com>
Signed-off-by: Luka Dekanozishvili <luka.dekanozishvili1@gmail.com>
@LukaDeka

Added new unit tests for ColGroupDDCLZW (they're subject to change and only an initial draft).

They might include redundant/unnecessary checks.

The rest of the methods are still untested; I'll cover them later and possibly refactor the test helper functions.

…ded decompressToDenseBlockDenseDictionary [WIP], needs to be tested further. Added fallbacks to DDC for various functions. Added scalar and unary ops and various other simple methods from DDC.
…erns. Added append and appendNInternal, recompress and various other functions that needed to be implemented. No tests yet.
Contributor

@Baunsgaard Baunsgaard left a comment

Good progress, I have left some comments.

I would love to see some performance numbers.

import org.apache.sysds.runtime.matrix.data.MatrixBlock;

public abstract class DDCLZWScheme extends DDCScheme {
// TODO: private int nUnique; too data-specific, is it even useful?
Contributor

Probably not so meaningful to implement a specialization for the Scheme class.

The main goal of this is serialization and applying similar schemes to other groups. For the LZW project, it is out of scope, so in my opinion you can ignore all Scheme parts.

return (((long) prefixCode) << 32) | (nextSymbol & 0xffffffffL);
}

// Compresses a mapping (AMapToData) into an LZW-compressed byte/integer/? array.
Contributor

you probably want to compress into a byte[] array, or if you want to bit shift a bit, pack into a long[] array.
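For the long[] option, a hedged sketch of fixed-width code packing (this PR may instead settle on variable-width codes or a plain byte[]; pack/get are hypothetical helpers):

static long[] pack(int[] codes, int bitsPerCode) {
	final long[] out = new long[(int) (((long) codes.length * bitsPerCode + 63) / 64)];
	for(int i = 0; i < codes.length; i++) {
		final long bitPos = (long) i * bitsPerCode;
		final int w = (int) (bitPos >>> 6), off = (int) (bitPos & 63);
		out[w] |= ((long) codes[i]) << off;
		if(off + bitsPerCode > 64) // code straddles a 64-bit word boundary
			out[w + 1] |= ((long) codes[i]) >>> (64 - off);
	}
	return out;
}

static int get(long[] packed, int i, int bitsPerCode) {
	final long bitPos = (long) i * bitsPerCode;
	final int w = (int) (bitPos >>> 6), off = (int) (bitPos & 63);
	long v = packed[w] >>> off;
	if(off + bitsPerCode > 64)
		v |= packed[w + 1] << (64 - off);
	return (int) (v & ((1L << bitsPerCode) - 1));
}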

}

@Override
public void leftMultByMatrixNoPreAgg(MatrixBlock matrix, MatrixBlock result, int rl, int ru, int cl, int cu) {
Contributor

This is the cool one to support! It is a bit hard, but will probably pay off with LZW.

You can keep a soft reference to a hashmap mapping different rl to offsets into your data structure. That would make it possible to skip the initial scan until rl. Furthermore, the hashmap's growth would be limited, since the callers of these rl interfaces are typically bounded by the number of CPU cores. You can use the same trick in some other functions where you scan until rl.
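A hedged sketch of that trick, using java.lang.ref.SoftReference and java.util.concurrent.ConcurrentHashMap (all names are hypothetical). One caveat for LZW: an offset alone is not enough to resume decoding, since the dictionary state at rl is also needed, so this sketch caches whole decoder states and copies them on use:

private SoftReference<ConcurrentHashMap<Integer, LZWMappingIterator>> _skipCache;

private LZWMappingIterator iteratorAt(int rl) {
	ConcurrentHashMap<Integer, LZWMappingIterator> m = _skipCache == null ? null : _skipCache.get();
	if(m == null) { // (re)create after GC reclaimed the soft reference
		m = new ConcurrentHashMap<>();
		_skipCache = new SoftReference<>(m);
	}
	final LZWMappingIterator cached = m.get(rl);
	if(cached != null)
		return cached.copy(); // copy() (hypothetical) keeps the cached state reusable
	final LZWMappingIterator it = new LZWMappingIterator(_dataLZW, _nUnique);
	for(int r = 0; r < rl; r++)
		it.next(); // pay the scan up to rl once per distinct rl
	m.put(rl, it.copy());
	return it;
}

As noted above, growth stays bounded because the distinct rl values passed by callers are typically one per CPU core.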

florian-jobs and others added 2 commits January 21, 2026 10:51
…mapping sequentially. Reverted ColGroupDDC formatting again. Reverted CompressedSizeInfoColGroup formatting and added the DDCLZW part for testing. Added various tests for which functionality in the testing pipeline needs to be added in order to work.
Signed-off-by: Luka Dekanozishvili <luka.dekanozishvili1@gmail.com>
@LukaDeka

Added a few benchmarks that mostly compare memory usage, as well as operation times for individual methods (so far, only getIdx).

Right now, the comparison is only done between DDCLZW and DDC.

There are sizable memory savings for datasets with repeating patterns or large datasets:

================================================================================
Benchmark: benchmarkRandomData
================================================================================

Size:       1 | DDC:       61 bytes | DDCLZW:       67 bytes | Memory reduction:  -9.84% | De-/Compression speedup: 0.09/0.00 times
Size:      10 | DDC:       70 bytes | DDCLZW:       95 bytes | Memory reduction: -35.71% | De-/Compression speedup: 0.04/0.00 times
Size:     100 | DDC:      160 bytes | DDCLZW:      299 bytes | Memory reduction: -86.87% | De-/Compression speedup: 0.01/0.00 times
Size:    1000 | DDC:     1060 bytes | DDCLZW:     1551 bytes | Memory reduction: -46.32% | De-/Compression speedup: 0.00/0.00 times
Size:   10000 | DDC:    10060 bytes | DDCLZW:    10487 bytes | Memory reduction:  -4.24% | De-/Compression speedup: 0.00/0.00 times
Size:  100000 | DDC:   100060 bytes | DDCLZW:    78783 bytes | Memory reduction:  21.26% | De-/Compression speedup: 0.00/0.00 times

I also added the De-/Compression speedup field to compare other compression types with each other as well.

I also added a benchmark for slicing, but it doesn't look too useful at the moment:

================================================================================
Benchmark: benchmarkSlice
================================================================================

Size:       1 | Slice[    0:    0] | DDC:      0 ms | DDCLZW:      1 ms | Slowdown: 37.09 times
Size:      10 | Slice[    2:    7] | DDC:      0 ms | DDCLZW:     20 ms | Slowdown: 1141.72 times
Size:     100 | Slice[   25:   75] | DDC:      0 ms | DDCLZW:      3 ms | Slowdown: 169.34 times
Size:    1000 | Slice[  250:  750] | DDC:      0 ms | DDCLZW:      3 ms | Slowdown: 348.98 times
Size:   10000 | Slice[ 2500: 7500] | DDC:      0 ms | DDCLZW:      6 ms | Slowdown: 483.40 times
Size:  100000 | Slice[25000:75000] | DDC:      0 ms | DDCLZW:     24 ms | Slowdown: 325.22 times

The file might also be in the wrong directory, and it is wrongly labeled as a "test". We wouldn't want benchmarks running on every GitHub Actions trigger, etc.

Would it make more sense to refactor it into a main function?

@Baunsgaard
Contributor

@LukaDeka
Good to see some numbers. However, the ones you have reported are a bit unfortunate. I have a few points you should consider:

  1. Random data is not very compressible, and in actuality, truly random data would tend to make DDC superior for your use case. What you are looking for is to control the entropy of your data. If the entropy is low, you should get more benefits from LZW; if it is high, then your compression ratio should tend towards DDC.

  2. As an additional experiment, you can generate data that has exploitable patterns specific to LZW. Try to generate some data that is in the "best" possible structure. This should ideally show scaling close to O(sqrt(n)) of the input size with standard LZW, while DDC, being a dense format, is always O(n). For instance, a constant mapping of n identical symbols encodes to roughly sqrt(2n) LZW codes, since each emitted code extends the matched run by one.

  3. Do not worry about input data that is smaller than 100 elements for these experiments. For instance, experiments with 1 input row trivially show that other encodings can perform better than DDC. It starts getting interesting at larger sizes.

  4. Control and explicitly mention the number of distinct items you have as a parameter for your experiment. Additionally, calculate the entropy and use it as an additional measure of the compressibility of the data (see the sketch below). These two changes will improve the experiments.
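A minimal sketch of the entropy measure from point 4 (Shannon entropy in bits per symbol over the mapping values; a hypothetical benchmark helper, not SystemDS API):

static double entropyBitsPerSymbol(int[] map, int nUnique) {
	final int[] counts = new int[nUnique];
	for(int v : map)
		counts[v]++;
	double h = 0;
	for(int c : counts) {
		if(c == 0)
			continue;
		final double p = (double) c / map.length;
		h -= p * (Math.log(p) / Math.log(2)); // log base 2
	}
	return h; // low entropy: more LZW benefit; high entropy: ratio tends towards DDC
}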

@florian-jobs
Author

florian-jobs commented Jan 22, 2026

Status update:

Many methods that operate sequentially on the original mapping have been implemented using partial on-the-fly decoding of the compressed LZW mapping via an iterator.

Methods with more complex or non-sequential access patterns are not yet handled in this way (for example, leftMultByMatrixNoPreAgg) and currently fall back to DDC. These will be addressed in follow-up work.

Most decompression paths now rely on partial decoding of the LZW mapping rather than full materialization. Scalar and unary operations have also been implemented.

Several previously reported issues have been fixed. I have reverted the unintended formatting changes in the affected files and ensured alignment with the existing code style.

I will continue working on the remaining improvements suggested by @Baunsgaard and @janniklinde.

What is still missing at this point are more dedicated tests for the individual methods to ensure correctness, which @LukaDeka is working on.

Thanks for the detailed feedback and reviews, they were very helpful!

@Baunsgaard
Contributor

When you process some of the comments feel free to mark them as resolved!

@LukaDeka

When you process some of the comments feel free to mark them as resolved!

I wanted to before, but I think I don't have the permission in GitHub to do that. Not sure if Florian has it.

@Baunsgaard
Contributor

When you process some of the comments feel free to mark them as resolved!

I wanted to before, but I think I don't have the permission in GitHub to do that. Not sure if Florian has it.

Alternatively, if you do not have permissions, make a comment saying resolved. Then when we go through the PR, it is cleaner.

… it into the compression pipeline and serialization framework.