Open
Description
This type of sentence splitting won't work with phrases that use a dot to abbreviate a word.
Example:
"In December 2021, Green Fields Agriculture Ltd. made a major investment of $15 million in Project Greenhouse."
will be cut off at the abbreviation and become
"In December 2021, Green Fields Agriculture Ltd."
This loses all the context that makes the sentence relevant to the model.
A way to fix this is to split only when the character following the whitespace after the dot is a capital letter.
This is the simplest approach and the one I have been using. However, you could also use nltk to do the splitting.
This is the function I'm using:
import re

def custom_sentence_split(text):
    # Split on whitespace that follows a "." and precedes a capital letter.
    sentences = re.split(r'(?<=[.])\s+(?=[A-Z])', text)
    return [s.strip() for s in sentences if s.strip()]
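To show how the function behaves, here is a small sketch that runs it on the example from this issue. The second sentence ("The project starts next year.") and the "Dr. Smith" input are made up for illustration and are not part of the original report. The last call shows a limitation of the capital-letter check: an abbreviation followed by a capitalized word still gets split.

```python
import re

def custom_sentence_split(text):
    # Split on whitespace that follows a "." and precedes a capital letter.
    sentences = re.split(r'(?<=[.])\s+(?=[A-Z])', text)
    return [s.strip() for s in sentences if s.strip()]

# Example from the issue, plus a hypothetical second sentence.
example = (
    "In December 2021, Green Fields Agriculture Ltd. made a major investment "
    "of $15 million in Project Greenhouse. The project starts next year."
)
result = custom_sentence_split(example)
for sentence in result:
    print(sentence)
# "Ltd." is followed by a lowercase word, so the first sentence stays intact.

# Limitation (hypothetical input): an abbreviation followed by a capitalized
# word still triggers a split.
limitation = custom_sentence_split("She met Dr. Smith yesterday.")
print(limitation)
```

If inputs like the last one matter, a known-abbreviation list or a trained tokenizer such as the nltk one mentioned above is probably a better fit than the regex alone.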