Skip to content

Conversation

@florian-jobs

Summary

This PR introduces a new column group ColGroupDDCLZW that stores the mapping vector in LZW-compressed form.

Key design points

  • MapToData is not stored explicitly; only the compressed LZW representation is kept.
  • Operations that allow sequential access operate directly on _dataLZW without full decompression (see the decoding sketch after this list).
  • For complex or random-access patterns, the implementation falls back to DDC (uncompressed).
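A minimal sketch of what sequential decoding over _dataLZW could look like, i.e. a textbook LZW decoder specialized to integer map values (the code layout, dictionary seeding, and class shape here are assumptions, not this PR's actual implementation):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hedged sketch: sequential decoder over the LZW-compressed mapping. */
final class LZWMappingIterator {
	private final int[] codes; // assumed: one int LZW code per entry of _dataLZW
	private final List<int[]> dict = new ArrayList<>(); // code -> decoded run of map values
	private int pos = 0; // next code to read
	private int[] cur; // run currently being emitted
	private int curIdx = 0;

	LZWMappingIterator(int[] codes, int nUnique) {
		this.codes = codes;
		for(int v = 0; v < nUnique; v++)
			dict.add(new int[] {v}); // initial dictionary: one entry per distinct value
	}

	boolean hasNext() {
		return (cur != null && curIdx < cur.length) || pos < codes.length;
	}

	int next() {
		if(cur == null || curIdx == cur.length) {
			final int code = codes[pos++];
			final int[] entry;
			if(code < dict.size())
				entry = dict.get(code);
			else { // special LZW case: the code refers to the entry being built
				entry = Arrays.copyOf(cur, cur.length + 1);
				entry[cur.length] = cur[0];
			}
			if(cur != null) { // grow dictionary: previous run + first value of this run
				final int[] e = Arrays.copyOf(cur, cur.length + 1);
				e[cur.length] = entry[0];
				dict.add(e);
			}
			cur = entry;
			curIdx = 0;
		}
		return cur[curIdx++];
	}
}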

Current status

  • Core data structure and compression/decompression are in place.
  • Work in progress on operations that can be implemented via sequential decoding without full materialization.
  • Work in progress on performance.

Feedback on design and integration is very welcome.

florian-jobs and others added 14 commits January 7, 2026 13:39
…extending on APreAgg like ColGroupDDC for easier implementation. Idea: store only the compressed version of the _data vector and important metadata. If decompression is needed, we reconstruct the _data vector using the metadata and the compressed _data vector. Decompression takes place at most once. This is just an idea and there are other ways of implementing it.
 * - DDCLZW stores the mapping vector exclusively in compressed form.
 * - No persistent MapToData cache is maintained.
 * - Sequential operations decode on-the-fly, while operations requiring random access explicitly materialize and fall back to DDC.
 */
…and decompress and the data structures they use compatible.
…DC test for ColGroupDDCTest. Improved compress/decompress methods in LZW class.
…mapping

This commit adds an initial implementation of ColGroupDDCLZW, a new column
group that stores the mapping vector in LZW-compressed form instead of
materializing MapToData explicitly.

The design focuses on enabling sequential access directly on the compressed
representation, while complex access patterns are intended to fall back to
DDC. No cache or lazy decompression mechanism is introduced at this stage.
@github-project-automation github-project-automation bot moved this to In Progress in SystemDS PR Queue Jan 13, 2026
@florian-jobs florian-jobs changed the title Add ColGroupDDCLZW with LZW-compressed MapToData [SYSTEMDS-3779] Add ColGroupDDCLZW with LZW-compressed MapToData Jan 13, 2026
@janniklinde janniklinde self-requested a review January 16, 2026 08:26
…press(). Decompress will now return an empty map if the index is zero.
Contributor

@janniklinde janniklinde left a comment

Thank you for the PR. I left some comments in the code.

In general, please use tabs instead of spaces to make the diff more readable (this can be done by importing the codestyle xml). It would be good if we were able to create the column group similarly to this:

CompressionSettingsBuilder csb = new CompressionSettingsBuilder().setSamplingRatio(1.0)
	.setValidCompressions(EnumSet.of(AColGroup.CompressionType.DDCLZW))
	.setTransposeInput("false");
CompressionSettings cs = csb.create();

final CompressedSizeInfoColGroup cgi = new ComEstExact(mbt, cs).getColGroupInfo(colIndexes);
CompressedSizeInfo csi = new CompressedSizeInfo(cgi);
AColGroup cg = ColGroupFactory.compressColGroups(mbt, csi, cs, 1).get(0);

So corresponding features / methods to support this should be implemented.

Contributor

In general, all methods that are not properly implemented should throw a NotImplementedException.
Also, you should implement some of the operations that can be done on the compressed representation (e.g., scalar ops, unary ops, ...). Further, getExactSizeOnDisk() should be implemented; a sketch follows below.
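For getExactSizeOnDisk(), a minimal sketch under the assumption that the serialized layout is a small header plus one int per LZW code (field names follow this PR; the super call covers the shared column-group payload as in ColGroupDDC):

@Override
public long getExactSizeOnDisk() {
	long ret = super.getExactSizeOnDisk(); // column indexes, dictionary, etc.
	ret += 4; // _nUnique
	ret += 4; // length of the LZW code array
	ret += 4L * _dataLZW.length; // codes, assuming one int per code
	return ret;
}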

Comment on lines 622 to 625
@Override
protected int[] getCounts(int[] out) {
return new int[0];
}
Contributor

If not properly implemented, throw NotImplementedException
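A minimal sketch of the requested change, assuming the NotImplementedException already used throughout SystemDS:

@Override
protected int[] getCounts(int[] out) {
	throw new NotImplementedException();
}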

Comment on lines 321 to 323
if (_nUnique <= dict.getNumberOfValues(colIndexes.size()))
throw new DMLCompressionException("Invalid map to dict Map has:" + _nUnique + " while dict has "
+ dict.getNumberOfValues(colIndexes.size()));
Contributor

Incorrect
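Presumably the comparison is inverted; a hedged guess at the intended check (the review only flags the original as incorrect) would throw when the map references more values than the dictionary actually holds:

if(_nUnique > dict.getNumberOfValues(colIndexes.size()))
	throw new DMLCompressionException("Invalid map to dict. Map has: " + _nUnique + " while dict has "
		+ dict.getNumberOfValues(colIndexes.size()));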

Contributor

All implemented methods must be covered by tests

Comment on lines 497 to 500
@Override
protected void decompressToDenseBlockDenseDictionary(DenseBlock db, int rl, int ru, int offR, int offC, double[] values) {

}
Contributor

Decompression to dense block should be supported
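A hedged sketch of how this could look via sequential decoding (the DenseBlock accessors mirror ColGroupDDC; LZWMappingIterator is the iterator added in this PR, though its exact API and the skip-to-rl scan are assumptions):

@Override
protected void decompressToDenseBlockDenseDictionary(DenseBlock db, int rl, int ru, int offR, int offC, double[] values) {
	final int nCol = _colIndexes.size();
	final LZWMappingIterator it = new LZWMappingIterator(_dataLZW, _nUnique);
	for(int r = 0; r < rl; r++)
		it.next(); // sequential skip until rl; LZW offers no random access
	for(int r = rl; r < ru; r++) {
		final double[] c = db.values(offR + r);
		final int off = db.pos(offR + r) + offC;
		final int vr = it.next() * nCol;
		for(int j = 0; j < nCol; j++)
			c[off + _colIndexes.get(j)] += values[vr + j];
	}
}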

@github-project-automation github-project-automation bot moved this from In Progress to In Review in SystemDS PR Queue Jan 16, 2026
@janniklinde
Contributor

Please add some more tests to really verify correctness. For example, you should do a full compression and then decompress it again, comparing the result to the original data.
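A hedged sketch of such a round-trip test, reusing the creation path suggested earlier in this review (the decompression call and comparison helper are assumptions about the final API):

import java.util.EnumSet;
import org.apache.sysds.runtime.compress.CompressionSettings;
import org.apache.sysds.runtime.compress.CompressionSettingsBuilder;
import org.apache.sysds.runtime.compress.colgroup.AColGroup;
import org.apache.sysds.runtime.compress.colgroup.ColGroupFactory;
import org.apache.sysds.runtime.compress.colgroup.indexes.ColIndexFactory;
import org.apache.sysds.runtime.compress.colgroup.indexes.IColIndex;
import org.apache.sysds.runtime.compress.estim.ComEstExact;
import org.apache.sysds.runtime.compress.estim.CompressedSizeInfo;
import org.apache.sysds.runtime.matrix.data.MatrixBlock;
import org.apache.sysds.test.TestUtils;
import org.junit.Test;

public class ColGroupDDCLZWRoundTripTest {
	@Test
	public void testCompressDecompressRoundTrip() {
		// dense random input; values need not repeat for a correctness check
		MatrixBlock mb = TestUtils.generateTestMatrixBlock(1000, 3, 0, 5, 1.0, 7);
		CompressionSettings cs = new CompressionSettingsBuilder().setSamplingRatio(1.0)
			.setValidCompressions(EnumSet.of(AColGroup.CompressionType.DDCLZW))
			.setTransposeInput("false").create();
		IColIndex colIndexes = ColIndexFactory.create(mb.getNumColumns());
		CompressedSizeInfo csi = new CompressedSizeInfo(new ComEstExact(mb, cs).getColGroupInfo(colIndexes));
		AColGroup cg = ColGroupFactory.compressColGroups(mb, csi, cs, 1).get(0);
		// decompress into a fresh dense block and compare against the original
		MatrixBlock ret = new MatrixBlock(mb.getNumRows(), mb.getNumColumns(), false);
		ret.allocateDenseBlock();
		cg.decompressToDenseBlock(ret.getDenseBlock(), 0, mb.getNumRows(), 0, 0);
		ret.recomputeNonZeros();
		TestUtils.compareMatricesBitAvgDistance(mb, ret, 0, 0, "LZW round-trip mismatch");
	}
}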

florian-jobs and others added 4 commits January 16, 2026 16:26
…GroupDDCTest back to correct formatting. Added LZWMappingIterator to decompress values on the fly without having to allocate the full compression map [WIP]. Added test class ColGroupDDCLZWTest.
Signed-off-by: Luka Dekanozishvili <luka.dekanozishvili1@gmail.com>
Signed-off-by: Luka Dekanozishvili <luka.dekanozishvili1@gmail.com>
@LukaDeka

Added new unit tests for ColGroupDDCLZW (they're subject to change and only an initial draft).

They might include redundant/unnecessary checks.

The rest of the methods are still untested; I'll cover them later and possibly refactor the test helper functions.

…ded decompressToDenseBlockDenseDictionary [WIP], needs to be tested further. Added fallbacks to DDC for various functions. Added scalar and unary ops and various other simple methods from DDC.
…erns. Added append and appendNInternal, recompress and various other functions that needed to be implemented. No tests yet.
Contributor

@Baunsgaard Baunsgaard left a comment

Good progress, I have left some comments.

I would love to see some performance numbers.

import org.apache.sysds.runtime.matrix.data.MatrixBlock;

public abstract class DDCLZWScheme extends DDCScheme {
// TODO: private int nUnique; too data-specific, is it even useful?
Contributor

Probably not so meaningful to implement a specialization for the Scheme class.

The main goal of this is serialization and applying similar schemes to other groups. For the LZW project, it is out of scope, so in my opinion you can ignore all Scheme parts.

return (((long) prefixCode) << 32) | (nextSymbol & 0xffffffffL);
}

// Compresses a mapping (AMapToData) into an LZW-compressed byte/integer/? array.
Contributor

you probably want to compress into a byte[] array, or if you want to bit shift a bit, pack into a long[] array.
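For the long[] option, a hedged sketch of fixed-width code packing (this PR may instead settle on variable-width codes or a plain byte[]; pack/get are hypothetical helpers):

static long[] pack(int[] codes, int bitsPerCode) {
	final long[] out = new long[(int) (((long) codes.length * bitsPerCode + 63) / 64)];
	for(int i = 0; i < codes.length; i++) {
		final long bitPos = (long) i * bitsPerCode;
		final int w = (int) (bitPos >>> 6), off = (int) (bitPos & 63);
		out[w] |= ((long) codes[i]) << off;
		if(off + bitsPerCode > 64) // code straddles a 64-bit word boundary
			out[w + 1] |= ((long) codes[i]) >>> (64 - off);
	}
	return out;
}

static int get(long[] packed, int i, int bitsPerCode) {
	final long bitPos = (long) i * bitsPerCode;
	final int w = (int) (bitPos >>> 6), off = (int) (bitPos & 63);
	long v = packed[w] >>> off;
	if(off + bitsPerCode > 64)
		v |= packed[w + 1] << (64 - off);
	return (int) (v & ((1L << bitsPerCode) - 1));
}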

}

@Override
public void leftMultByMatrixNoPreAgg(MatrixBlock matrix, MatrixBlock result, int rl, int ru, int cl, int cu) {
Contributor

This is the cool one to support! It is a bit hard, but will probably pay off with LZW.

You can keep a soft reference to a hashmap mapping different rl to offsets into your data structure. That would make it possible to skip the initial scan until rl. Furthermore, the hashmap's growth would be limited, since the callers of these rl interfaces are typically bounded by the number of CPU cores. You can use the same trick in some other functions where you scan until rl.
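A hedged sketch of that trick, using java.lang.ref.SoftReference and java.util.concurrent.ConcurrentHashMap (all names are hypothetical). One caveat for LZW: an offset alone is not enough to resume decoding, since the dictionary state at rl is also needed, so this sketch caches whole decoder states and copies them on use:

private SoftReference<ConcurrentHashMap<Integer, LZWMappingIterator>> _skipCache;

private LZWMappingIterator iteratorAt(int rl) {
	ConcurrentHashMap<Integer, LZWMappingIterator> m = _skipCache == null ? null : _skipCache.get();
	if(m == null) { // (re)create after GC reclaimed the soft reference
		m = new ConcurrentHashMap<>();
		_skipCache = new SoftReference<>(m);
	}
	final LZWMappingIterator cached = m.get(rl);
	if(cached != null)
		return cached.copy(); // copy() (hypothetical) keeps the cached state reusable
	final LZWMappingIterator it = new LZWMappingIterator(_dataLZW, _nUnique);
	for(int r = 0; r < rl; r++)
		it.next(); // pay the scan up to rl once per distinct rl
	m.put(rl, it.copy());
	return it;
}

As noted above, growth stays bounded because the distinct rl values passed by callers are typically one per CPU core.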

florian-jobs and others added 2 commits January 21, 2026 10:51
…mapping sequentially. Reverted ColGroupDDC formatting again. Reverted CompressedSizeInfoColGroup formatting and added the DDCLZW part for testing. Added various tests for which functionality in the testing pipeline needs to be added in order to work.
Signed-off-by: Luka Dekanozishvili <luka.dekanozishvili1@gmail.com>
@LukaDeka

Added a few benchmarks that mostly compare memory usage, as well as operation times for individual methods (so far, only getIdx).

Right now, the comparison is only done between DDCLZW and DDC.

There are sizable memory savings for datasets with repeating patterns or large datasets:

================================================================================
Benchmark: benchmarkRandomData
================================================================================

Size:       1 | DDC:       61 bytes | DDCLZW:       67 bytes | Memory reduction:  -9.84% | De-/Compression speedup: 0.09/0.00 times
Size:      10 | DDC:       70 bytes | DDCLZW:       95 bytes | Memory reduction: -35.71% | De-/Compression speedup: 0.04/0.00 times
Size:     100 | DDC:      160 bytes | DDCLZW:      299 bytes | Memory reduction: -86.87% | De-/Compression speedup: 0.01/0.00 times
Size:    1000 | DDC:     1060 bytes | DDCLZW:     1551 bytes | Memory reduction: -46.32% | De-/Compression speedup: 0.00/0.00 times
Size:   10000 | DDC:    10060 bytes | DDCLZW:    10487 bytes | Memory reduction:  -4.24% | De-/Compression speedup: 0.00/0.00 times
Size:  100000 | DDC:   100060 bytes | DDCLZW:    78783 bytes | Memory reduction:  21.26% | De-/Compression speedup: 0.00/0.00 times

I also added the De-/Compression speedup field to compare other compression types with each other as well.

I also added a benchmark for slicing, but it doesn't look too useful at the moment:

================================================================================
Benchmark: benchmarkSlice
================================================================================

Size:       1 | Slice[    0:    0] | DDC:      0 ms | DDCLZW:      1 ms | Slowdown: 37.09 times
Size:      10 | Slice[    2:    7] | DDC:      0 ms | DDCLZW:     20 ms | Slowdown: 1141.72 times
Size:     100 | Slice[   25:   75] | DDC:      0 ms | DDCLZW:      3 ms | Slowdown: 169.34 times
Size:    1000 | Slice[  250:  750] | DDC:      0 ms | DDCLZW:      3 ms | Slowdown: 348.98 times
Size:   10000 | Slice[ 2500: 7500] | DDC:      0 ms | DDCLZW:      6 ms | Slowdown: 483.40 times
Size:  100000 | Slice[25000:75000] | DDC:      0 ms | DDCLZW:     24 ms | Slowdown: 325.22 times

The file might also be in the wrong directory, and it is wrongly labeled as a "test". We wouldn't want benchmarks running on every GitHub Actions trigger, etc.

Would it make more sense to refactor it into a main function?

@Baunsgaard
Contributor

@LukaDeka
Good to see some numbers. However, the ones you have reported are a bit unfortunate. I have a few points you should consider:

  1. Random data is not very compressible, and in actuality, truly random data would tend to make DDC superior for your use case. What you are looking for is to control the entropy of your data. If the entropy is low, you should get more benefits from LZW; if it is high, then your compression ratio should tend towards DDC.

  2. As an additional experiment, you can generate data that has exploitable patterns specific to LZW. Try to generate some data that is in the "best" possible structure. This should ideally show scaling close to O(sqrt(n)) of the input size with standard LZW, while DDC, being a dense format, is always O(n). For instance, a constant mapping of n identical symbols encodes to roughly sqrt(2n) LZW codes, since each emitted code extends the matched run by one.

  3. Do not worry about input data that is smaller than 100 elements for these experiments. For instance, experiments with 1 input row trivially show that other encodings can perform better than DDC. It starts getting interesting at larger sizes.

  4. Control and explicitly mention the number of distinct items you have as a parameter for your experiment. Additionally, calculate the entropy and use it as an additional measure of the compressibility of the data (see the sketch below). These two changes will improve the experiments.
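A minimal sketch of the entropy measure from point 4 (Shannon entropy in bits per symbol over the mapping values; a hypothetical benchmark helper, not SystemDS API):

static double entropyBitsPerSymbol(int[] map, int nUnique) {
	final int[] counts = new int[nUnique];
	for(int v : map)
		counts[v]++;
	double h = 0;
	for(int c : counts) {
		if(c == 0)
			continue;
		final double p = (double) c / map.length;
		h -= p * (Math.log(p) / Math.log(2)); // log base 2
	}
	return h; // low entropy: more LZW benefit; high entropy: ratio tends towards DDC
}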

@florian-jobs
Author

florian-jobs commented Jan 22, 2026

Status update:

Many methods that operate sequentially on the original mapping have been implemented using partial on-the-fly decoding of the compressed LZW mapping via an iterator.

Methods with more complex or non-sequential access patterns are not yet handled in this way (for example, leftMultByMatrixNoPreAgg) and currently fall back to DDC. These will be addressed in follow-up work.

Most decompression paths now rely on partial decoding of the LZW mapping rather than full materialization. Scalar and unary operations have also been implemented.

Several previously reported issues have been fixed. I have reverted the unintended formatting changes in the affected files and ensured alignment with the existing code style.

I will continue working on the remaining improvements suggested by @Baunsgaard and @janniklinde.

What is still missing at this point are more dedicated tests for the individual methods to ensure correctness, which @LukaDeka is working on.

Thanks for the detailed feedback and reviews, they were very helpful!

@Baunsgaard
Contributor

When you process some of the comments feel free to mark them as resolved!

@LukaDeka

When you process some of the comments feel free to mark them as resolved!

I wanted to before, but I think I don't have the permission in GitHub to do that. Not sure if Florian has it.

@Baunsgaard
Contributor

When you process some of the comments feel free to mark them as resolved!

I wanted to before, but I think I don't have the permission in GitHub to do that. Not sure if Florian has it.

Alternatively, if you do not have permissions, make a comment saying resolved. Then when we go through the PR, it is cleaner.

… it into the compression pipeline and serialization framework.