Hi Greg,
Thanks a lot for your work!
I want to share a more optimized version of your function combine_sentences from the tutorial about text splitting.
Instead of this function:
def combine_sentences(sentences, buffer_size=1):
    # Go through each sentence dict
    for i in range(len(sentences)):
        # Create a string that will hold the sentences which are joined
        combined_sentence = ''

        # Add sentences before the current one, based on the buffer size.
        for j in range(i - buffer_size, i):
            # Check if the index j is not negative (to avoid index out of range like on the first one)
            if j >= 0:
                # Add the sentence at index j to the combined_sentence string
                combined_sentence += sentences[j]['sentence'] + ' '

        # Add the current sentence
        combined_sentence += sentences[i]['sentence']

        # Add sentences after the current one, based on the buffer size
        for j in range(i + 1, i + 1 + buffer_size):
            # Check if the index j is within the range of the sentences list
            if j < len(sentences):
                # Add the sentence at index j to the combined_sentence string
                combined_sentence += ' ' + sentences[j]['sentence']

        # Then add the whole thing to your dict
        # Store the combined sentence in the current sentence dict
        sentences[i]['combined_sentence'] = combined_sentence

    return sentences

We can use generators and the Python standard library to generate the windows more efficiently:
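from collections import deque
from itertools import islice

def sliding_window(sentences, buffer_size=1):
    # Note: yields one window per sentence only when
    # len(sentences) >= 2 * buffer_size + 1
    # Pre-fill with the first buffer_size sentences; maxlen drops the oldest item automatically
    window = deque(islice(sentences, buffer_size), maxlen=2*buffer_size+1)
    for sentence in sentences[buffer_size:]:
        window.append(sentence)
        yield tuple(window)
    # Shrink the window from the left for the last buffer_size sentences
    while len(window) > buffer_size + 1:
        window.popleft()
        yield tuple(window)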
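
One caveat, as far as I can tell: for inputs with fewer than 2*buffer_size+1 sentences, the generator above yields fewer windows than there are sentences. Here is a rough sketch of a variant that also covers short inputs, checked against plain slicing. The names sliding_window_safe and expected_windows are my own illustrative helpers and are not part of the tutorial.

```python
from collections import deque


def expected_windows(sentences, buffer_size=1):
    """Reference windows built by slicing (hypothetical helper, for comparison only)."""
    return [
        tuple(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]


def sliding_window_safe(sentences, buffer_size=1):
    """Deque-based sketch that yields exactly one window per sentence."""
    # Start with the sentences that precede the first right-hand edge
    window = deque(sentences[:buffer_size], maxlen=2 * buffer_size + 1)
    for i in range(len(sentences)):
        if i + buffer_size < len(sentences):
            # A new sentence enters on the right; maxlen evicts on the left when full
            window.append(sentences[i + buffer_size])
        elif i > buffer_size:
            # Nothing left to add, so only the left edge moves forward
            window.popleft()
        yield tuple(window)


# Short input: three sentences with buffer_size=2 (less than a full window of 5)
sample = ["A.", "B.", "C."]
assert list(sliding_window_safe(sample, buffer_size=2)) == expected_windows(sample, buffer_size=2)
```

This keeps the deque idea, but drives the loop by index so that every sentence gets a window even when the list is shorter than 2*buffer_size+1.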
for i, window_sentences in enumerate(sliding_window(single_sentences_list, buffer_size=2)):
    sentence_dicts[i]['combined_sentence'] = ' '.join(window_sentences)

By the way, I found that splitting on punctuation in the tutorial does not work when there is no space between a punctuation mark and the start of the next sentence, because the regex (?<=[.?!])\s+ requires at least one whitespace character. Could you please tell me whether you have explored more sophisticated methods of splitting text into sentences, and how these methods affect the overall quality of semantic chunking?
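
To illustrate the splitting issue, here is a small sketch. The sample text and the relaxed pattern are my own assumptions, not from the tutorial. The relaxed pattern splits before any capital letter that follows punctuation, so it will mis-split abbreviations such as "U.S.A." and is only meant as a starting point.

```python
import re

# Hypothetical sample text with missing spaces after some punctuation marks
text = "First sentence.Second sentence! Third one?Fourth."

# Pattern from the tutorial: needs at least one whitespace character after . ? or !
tutorial_split = re.split(r'(?<=[.?!])\s+', text)

# Relaxed pattern (an assumption): whitespace is optional, but the next
# character must be an uppercase ASCII letter.
# Splitting on empty matches requires Python 3.7 or newer.
relaxed_split = re.split(r'(?<=[.?!])\s*(?=[A-Z])', text)

print(tutorial_split)
print(relaxed_split)
```

With this sample, the tutorial pattern returns two pieces and the relaxed pattern returns four.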
Thank you.
ivanbaldo