Model:
1-800-BAD-CODE/sentence_boundary_detection_multilang
This model performs text sentence boundary detection (SBD) in 49 common languages.
This model segments a long, punctuated text into one or more constituent sentences.
A key feature is that the model is multi-lingual and language-agnostic at inference time. Therefore, language tags do not need to be used and a single batch can contain multiple languages.
The model inputs should be punctuated texts.
For each input subword t, this model predicts the probability that t is the final token of a sentence (i.e., a sentence boundary).
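To make the per-subword prediction concrete, here is a minimal sketch of how boundary probabilities can be turned into sentence segments. The helper, token strings, and probabilities are invented for illustration; they are not part of the punctuators package.

```python
# Hypothetical illustration: split a subword sequence into sentences
# wherever the predicted boundary probability exceeds a threshold.
# Tokens use the SentencePiece-style "▁" word-start marker.

def segment(tokens, probs, threshold=0.5):
    """Group subword tokens into sentences at predicted boundaries."""
    sentences, current = [], []
    for tok, p in zip(tokens, probs):
        current.append(tok)
        if p >= threshold:  # predicted sentence-final token
            sentences.append("".join(current).replace("▁", " ").strip())
            current = []
    if current:  # trailing tokens with no predicted boundary
        sentences.append("".join(current).replace("▁", " ").strip())
    return sentences

tokens = ["▁hello", "▁there", ".", "▁how", "▁are", "▁you", "?"]
probs  = [0.01, 0.02, 0.97, 0.01, 0.02, 0.03, 0.99]
print(segment(tokens, probs))  # → ['hello there.', 'how are you?']
```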
The easiest way to use this model is to install punctuators:

```bash
$ pip install punctuators
```

Example Usage
```python
from typing import List

from punctuators.models import SBDModelONNX

# Instantiate this model
# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
m = SBDModelONNX.from_pretrained("sbd_multi_lang")

input_texts: List[str] = [
    # English (with a lot of acronyms)
    "the new d.n.a. sample has been multiplexed, and the gametes are already dividing. let's get the c.p.d. over there. dinner's at 630 p.m. see that piece on you in the l.a. times? chicago p.d. will eat him alive.",
    # Chinese
    "魔鬼兵團都死了?但是如果这让你不快乐就别做了。您就不能发个电报吗。我們都準備好了。",
    # Spanish
    "él es uno de aquellos. ¿tiene algo de beber? cómo el aislamiento no vale la pena.",
    # Thai
    "พวกเขาต้องโกรธมากเลยใช่ไหม โทษทีนะลูกของเราไม่เป็นอะไรใช่ไหม ถึงเจ้าจะลากข้าไปเจ้าก็ไม่ได้อะไรอยู่ดี ผมคิดว่าจะดีกว่านะถ้าคุณไม่ออกไปไหน",
    # Ukrainian
    "розігни і зігни, будь ласка. я знаю, ваши люди храбры. было приятно, правда? для начала, тебе нужен собственный свой самолет.",
    # Polish
    "szedłem tylko do. pamiętaj, nigdy się nie obawiaj żyć na krawędzi ryzyka. ćwiczę już od dwóch tygodni a byłem zabity tylko raz.",
]

# Run inference
results: List[List[str]] = m.infer(input_texts)

# Print each input and its segmented outputs
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()
```

Expected outputs
```text
Input: the new d.n.a. sample has been multiplexed, and the gametes are already dividing. let's get the c.p.d. over there. dinner's at 630 p.m. see that piece on you in the l.a. times? chicago p.d. will eat him alive.
Outputs:
	the new d.n.a. sample has been multiplexed, and the gametes are already dividing.
	let's get the c.p.d. over there.
	dinner's at 630 p.m.
	see that piece on you in the l.a. times?
	chicago p.d. will eat him alive.

Input: 魔鬼兵團都死了?但是如果这让你不快乐就别做了。您就不能发个电报吗。我們都準備好了。
Outputs:
	魔鬼兵團都死了?
	但是如果这让你不快乐就别做了。
	您就不能发个电报吗。
	我們都準備好了。

Input: él es uno de aquellos. ¿tiene algo de beber? cómo el aislamiento no vale la pena.
Outputs:
	él es uno de aquellos.
	¿tiene algo de beber?
	cómo el aislamiento no vale la pena.

Input: พวกเขาต้องโกรธมากเลยใช่ไหม โทษทีนะลูกของเราไม่เป็นอะไรใช่ไหม ถึงเจ้าจะลากข้าไปเจ้าก็ไม่ได้อะไรอยู่ดี ผมคิดว่าจะดีกว่านะถ้าคุณไม่ออกไปไหน
Outputs:
	พวกเขาต้องโกรธมากเลยใช่ไหม
	โทษทีนะลูกของเราไม่เป็นอะไรใช่ไหม
	ถึงเจ้าจะลากข้าไปเจ้าก็ไม่ได้อะไรอยู่ดี
	ผมคิดว่าจะดีกว่านะถ้าคุณไม่ออกไปไหน

Input: розігни і зігни, будь ласка. я знаю, ваши люди храбры. было приятно, правда? для начала, тебе нужен собственный свой самолет.
Outputs:
	розігни і зігни, будь ласка.
	я знаю, ваши люди храбры.
	было приятно, правда?
	для начала, тебе нужен собственный свой самолет.

Input: szedłem tylko do. pamiętaj, nigdy się nie obawiaj żyć na krawędzi ryzyka. ćwiczę już od dwóch tygodni a byłem zabity tylko raz.
Outputs:
	szedłem tylko do.
	pamiętaj, nigdy się nie obawiaj żyć na krawędzi ryzyka.
	ćwiczę już od dwóch tygodni a byłem zabity tylko raz.
```
This is a data-driven approach to SBD. The model uses a SentencePiece tokenizer, a BERT-style encoder, and a linear classifier to predict which subwords are sentence boundaries.
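The classifier stage can be sketched schematically: the encoder yields one hidden vector per subword, and a linear layer maps each vector to a single boundary logit. The dimensions and random weights below are invented placeholders, not the released model's parameters.

```python
import math
import random

# Schematic sketch of the token-classification head: per-subword hidden
# vectors (stand-ins for BERT encoder output) pass through a linear layer
# to produce one boundary logit per subword.
HIDDEN_DIM, SEQ_LEN = 8, 5
rng = random.Random(0)

# Stand-in for the encoder output: one hidden vector per subword.
encoded = [[rng.gauss(0, 1) for _ in range(HIDDEN_DIM)] for _ in range(SEQ_LEN)]

# Linear classifier: a weight vector plus bias, yielding one logit per subword.
w = [rng.gauss(0, 1) for _ in range(HIDDEN_DIM)]
b = 0.0

logits = [sum(h_i * w_i for h_i, w_i in zip(h, w)) + b for h in encoded]
probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]  # P(subword ends a sentence)

print(len(probs))  # 5 — one boundary probability per subword
```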
Given that this is a relatively easy NLP task, the model contains only ~9M parameters (~8.2M of which are embeddings). This makes the model very fast and cheap at inference time, as SBD should be.
The BERT encoder is based on the following configuration:
This model was trained on a personal fork of NeMo, specifically this sbd branch.
The model was trained for several hundred thousand steps with ~1M lines of text per language (~49M lines total) and a global batch size of 256 examples. Batches were multilingual and generated by randomly sampling each language.
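The multilingual batching described above can be sketched as follows. The corpora, batch size, and helper name here are illustrative placeholders; each example is drawn from a randomly chosen language, so a single batch mixes languages freely.

```python
import random

# Sketch of multilingual batch construction: pick a language at random for
# each slot in the batch, then pick a line from that language's corpus.
corpora = {
    "en": ["an english line."],
    "zh": ["一个中文句子。"],
    "es": ["una línea en español."],
}

def sample_batch(corpora, batch_size, rng):
    langs = sorted(corpora)
    return [rng.choice(corpora[rng.choice(langs)]) for _ in range(batch_size)]

batch = sample_batch(corpora, batch_size=8, rng=random.Random(0))
print(len(batch))  # 8
```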
This model was trained on OpenSubtitles data.
Although this corpus is very noisy, it is one of the few large-scale text corpora that have been manually segmented.
Automatically-segmented corpora are undesirable for at least two reasons:
Heuristics were used to attempt to clean the data before training. Some examples of the cleaning are:
To create examples for the model, we concatenated sentences from the corpus. For this particular model, each example consisted of between 1 and 9 sentences concatenated together, which shows the model between 0 and 8 positive targets (sentence boundaries). The number of sentences to use was chosen uniformly at random, so each example had, on average, 4 sentence boundaries.
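A quick arithmetic check of the averaging claim above: sentence counts are drawn uniformly from 1..9, yielding 0..8 positive targets, so the expected number of boundaries per example is the mean of 0..8, which is 4.

```python
# Verify the expected number of boundary targets per training example.
sentence_counts = range(1, 10)              # 1..9 sentences per example
targets = [n - 1 for n in sentence_counts]  # 0..8 positive targets
expected_boundaries = sum(targets) / len(targets)
print(expected_boundaries)  # 4.0
```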
This model uses a maximum sequence length of 256, which for OpenSubtitles is relatively long. If, after concatenating sentences, an example contains more than 256 tokens, the sequence is simply truncated to the first 256 subwords.
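The truncation rule above amounts to a simple slice: if a concatenated example exceeds the maximum sequence length, keep only the first 256 subwords. The token IDs below are placeholders.

```python
# Minimal sketch of the truncation rule for over-long examples.
MAX_SEQ_LEN = 256
example = list(range(300))           # pretend subword IDs of a long example
truncated = example[:MAX_SEQ_LEN]    # keep only the first 256 subwords
print(len(truncated))  # 256
```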
50% of input texts were lower-cased for both the tokenizer and classification models. This provides some augmentation, but more importantly allows this model to be inserted into an NLP pipeline either before or after true-casing. Using this model before true-casing would allow the true-casing model to exploit the conditional probability of sentence boundaries w.r.t. capitalization.
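The lower-casing augmentation can be sketched as below: each training text is lower-cased with probability 0.5 before tokenization. The helper name and texts are illustrative, not taken from the training code.

```python
import random

# Sketch of the 50% lower-casing augmentation applied to training texts.
def maybe_lowercase(text, rng, p=0.5):
    return text.lower() if rng.random() < p else text

rng = random.Random(42)
texts = ["Hello World.", "Dinner's At 6:30 P.M."]
augmented = [maybe_lowercase(t, rng) for t in texts]
print(augmented)
```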
The training data was pre-processed for language-specific punctuation and spacing rules.
The following guidelines were used during training. If inference inputs differ, the model may perform poorly.
This model was trained on OpenSubtitles , data which is notoriously noisy. The model may have learned some bad habits from this data.
Training assumed that every input line is exactly one sentence, but that is not always true. The model may therefore produce some false negatives, attributable to training lines that actually contained multiple sentences.
As discussed in a previous section, each language should be formatted and punctuated per that language's rules.
E.g., Chinese text should contain full-width periods, not Latin periods, and should contain no spaces.
In practice, data often does not adhere to these rules, but the model has not been augmented to deal with this potential issue.
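As an illustration of the Chinese formatting rule above, a validator might check for the full-width period "。" and the absence of Latin periods and spaces. This helper is hypothetical, not part of the punctuators package.

```python
# Illustrative check of the Chinese input-formatting guideline: full-width
# periods, no Latin periods, and no spaces between sentences.
def follows_chinese_rules(text):
    return "。" in text and "." not in text and " " not in text

print(follows_chinese_rules("您就不能发个电报吗。"))   # True
print(follows_chinese_rules("您就不能发个电报吗. "))  # False
```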
It's difficult to properly evaluate this model, since we rely on the proposition that the input data contains exactly one sentence per line. In reality, the data sets used thus far are noisy and often contain more than one sentence per line.
Metrics are not published for now, and evaluation is limited to manual spot-checking.
Work to identify suitable test sets for this task is ongoing.