
Optimized version of sliding window for semantic chunking #5

@labdmitriy


Hi Greg,

Thanks a lot for your work!

I want to share a more optimized version of your combine_sentences function from the text splitting tutorial.
Instead of this function:

def combine_sentences(sentences, buffer_size=1):
    # Go through each sentence dict
    for i in range(len(sentences)):

        # Create a string that will hold the sentences which are joined
        combined_sentence = ''

        # Add sentences before the current one, based on the buffer size.
        for j in range(i - buffer_size, i):
            # Check if the index j is not negative (to avoid index out of range like on the first one)
            if j >= 0:
                # Add the sentence at index j to the combined_sentence string
                combined_sentence += sentences[j]['sentence'] + ' '

        # Add the current sentence
        combined_sentence += sentences[i]['sentence']

        # Add sentences after the current one, based on the buffer size
        for j in range(i + 1, i + 1 + buffer_size):
            # Check if the index j is within the range of the sentences list
            if j < len(sentences):
                # Add the sentence at index j to the combined_sentence string
                combined_sentence += ' ' + sentences[j]['sentence']

        # Store the combined sentence in the current sentence dict
        sentences[i]['combined_sentence'] = combined_sentence

    return sentences

We can use generators and the Python standard library to generate the windows more efficiently:

from collections import deque
from itertools import islice

def sliding_window(sentences, buffer_size=1):
    # Pre-fill the window with the first buffer_size sentences; maxlen caps
    # the window at the current sentence plus buffer_size on each side
    window = deque(islice(sentences, buffer_size), maxlen=2*buffer_size+1)

    # Slide forward: each append yields the window for the next position
    # (once full, the deque drops the leftmost sentence automatically)
    for sentence in sentences[buffer_size:]:
        window.append(sentence)
        yield tuple(window)

    # Drain the tail: the last buffer_size positions have nothing left to
    # append, so shrink the window from the left instead
    while len(window) > buffer_size + 1:
        window.popleft()
        yield tuple(window)

# Attach each combined window to the corresponding sentence dict
for i, window_sentences in enumerate(sliding_window(single_sentences_list, buffer_size=2)):
    sentence_dicts[i]['combined_sentence'] = ' '.join(window_sentences)
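
As a quick sanity check (using a toy list in place of single_sentences_list), each position gets the same neighborhood that the original nested loops would build:

sentences = ['s0', 's1', 's2', 's3', 's4']

for i, window in enumerate(sliding_window(sentences, buffer_size=2)):
    print(i, ' '.join(window))
# 0 s0 s1 s2
# 1 s0 s1 s2 s3
# 2 s0 s1 s2 s3 s4
# 3 s1 s2 s3 s4
# 4 s2 s3 s4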

By the way, I found that splitting by punctuation symbols does not work in the tutorial when there is no space between a punctuation symbol and the next sentence (because of the regex (?<=[.?!])\s+). Could you please tell me whether you have explored more sophisticated methods of splitting text into sentences, and how these methods influence the overall quality of semantic chunking?
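
Here is a minimal sketch of the failure mode (the sample text and the \s* workaround are my own illustration, not from the tutorial):

import re

text = "First sentence.Second sentence. Third sentence."

# The tutorial's regex requires at least one whitespace character after the
# punctuation, so the first two sentences stay glued together:
print(re.split(r'(?<=[.?!])\s+', text))
# ['First sentence.Second sentence.', 'Third sentence.']

# A possible workaround: accept zero or more spaces (zero-width splits need
# Python 3.7+). Note this also splits abbreviations like "e.g.", so a real
# sentence tokenizer may still be the safer choice.
print([s for s in re.split(r'(?<=[.?!])\s*', text) if s])
# ['First sentence.', 'Second sentence.', 'Third sentence.']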

Thank you.
