
Optimized version of sliding window for semantic chunking #5

@labdmitriy


Hi Greg,

Thanks a lot for your work!

I want to share a more optimized version of your combine_sentences function from the text splitting tutorial.
Instead of this function:

def combine_sentences(sentences, buffer_size=1):
    # Go through each sentence dict
    for i in range(len(sentences)):

        # Create a string that will hold the sentences which are joined
        combined_sentence = ''

        # Add sentences before the current one, based on the buffer size.
        for j in range(i - buffer_size, i):
            # Check if the index j is not negative (to avoid index out of range like on the first one)
            if j >= 0:
                # Add the sentence at index j to the combined_sentence string
                combined_sentence += sentences[j]['sentence'] + ' '

        # Add the current sentence
        combined_sentence += sentences[i]['sentence']

        # Add sentences after the current one, based on the buffer size
        for j in range(i + 1, i + 1 + buffer_size):
            # Check if the index j is within the range of the sentences list
            if j < len(sentences):
                # Add the sentence at index j to the combined_sentence string
                combined_sentence += ' ' + sentences[j]['sentence']

        # Store the combined sentence in the current sentence dict
        sentences[i]['combined_sentence'] = combined_sentence

    return sentences

We can use generators and the Python standard library to generate the windows more efficiently:

from collections import deque
from itertools import islice

def sliding_window(sentences, buffer_size=1):
    # Pre-fill the window with the first buffer_size sentences; maxlen caps
    # the window at the current sentence plus buffer_size on each side
    window = deque(islice(sentences, buffer_size), maxlen=2*buffer_size+1)

    # Slide forward: each append yields the window for the next position
    # (once full, the deque drops the leftmost sentence automatically)
    for sentence in sentences[buffer_size:]:
        window.append(sentence)
        yield tuple(window)

    # Drain the tail: the last buffer_size positions have nothing left to
    # append, so shrink the window from the left instead
    while len(window) > buffer_size + 1:
        window.popleft()
        yield tuple(window)

# Attach each combined window to the corresponding sentence dict
for i, window_sentences in enumerate(sliding_window(single_sentences_list, buffer_size=2)):
    sentence_dicts[i]['combined_sentence'] = ' '.join(window_sentences)
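
As a quick sanity check (using a toy list in place of single_sentences_list), each position gets the same neighborhood that the original nested loops would build:

sentences = ['s0', 's1', 's2', 's3', 's4']

for i, window in enumerate(sliding_window(sentences, buffer_size=2)):
    print(i, ' '.join(window))
# 0 s0 s1 s2
# 1 s0 s1 s2 s3
# 2 s0 s1 s2 s3 s4
# 3 s1 s2 s3 s4
# 4 s2 s3 s4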

By the way, I found that splitting by punctuation symbols does not work in the tutorial when there is no space between a punctuation symbol and the next sentence (because of the regex (?<=[.?!])\s+). Could you please tell me whether you have explored more sophisticated methods of splitting text into sentences, and how these methods influence the overall quality of semantic chunking?
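
Here is a minimal sketch of the failure mode (the sample text and the \s* workaround are my own illustration, not from the tutorial):

import re

text = "First sentence.Second sentence. Third sentence."

# The tutorial's regex requires at least one whitespace character after the
# punctuation, so the first two sentences stay glued together:
print(re.split(r'(?<=[.?!])\s+', text))
# ['First sentence.Second sentence.', 'Third sentence.']

# A possible workaround: accept zero or more spaces (zero-width splits need
# Python 3.7+). Note this also splits abbreviations like "e.g.", so a real
# sentence tokenizer may still be the safer choice.
print([s for s in re.split(r'(?<=[.?!])\s*', text) if s])
# ['First sentence.', 'Second sentence.', 'Third sentence.']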

Thank you.
