ExtractiveRefiner batch_run sentence splitter. #175

Open
@Jose-Ribeir

This type of sentence splitting won't work with phrases that contain a period used as part of an abbreviation.

Example:
"In December 2021, Green Fields Agriculture Ltd. made a major investment of $15 million in Project Greenhouse."
becomes
"In December 2021, Green Fields Agriculture Ltd."

which loses all the content that is relevant to the model.

A way to fix this is to check whether the token following the " " starts with a capital letter.
This is the simplest approach and the one I have been using; however, you could also use nltk to do the splitting.

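This is the function I'm using:

    import re

    def custom_sentence_split(text):
        # Split on whitespace that follows a period and precedes a capital letter.
        sentences = re.split(r'(?<=[.])\s+(?=[A-Z])', text)
        # Drop empty pieces and trim surrounding whitespace.
        return [s.strip() for s in sentences if s.strip()]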
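To show how the capital-letter check behaves, here is a small self-contained sketch that runs the same regex on the example from this issue. The second sentence ("The project started in 2022.") and the "Dr. Smith" input are made up for illustration and are not from the codebase.

```python
import re

def custom_sentence_split(text):
    # Split on whitespace that follows a period and precedes a capital letter.
    sentences = re.split(r'(?<=[.])\s+(?=[A-Z])', text)
    return [s.strip() for s in sentences if s.strip()]

# Example from the issue, plus one invented follow-up sentence.
example = (
    "In December 2021, Green Fields Agriculture Ltd. made a major investment "
    "of $15 million in Project Greenhouse. The project started in 2022."
)
parts = custom_sentence_split(example)
for part in parts:
    print(part)

# Known limitation: an abbreviation followed by a capitalized word still splits.
limitation = custom_sentence_split("Dr. Smith arrived.")
print(limitation)
```

With this input, "Ltd. made" stays together because "made" is lowercase, while the split before "The" still happens. The last call shows the remaining weak spot: an abbreviation followed by a capitalized word is still split, which is where a tokenizer such as nltk might do better.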
