Model:
ckiplab/bert-base-han-chinese-ws
This model performs word segmentation on ancient Chinese text. Our training dataset covers four eras of the Chinese language.
The copyright of the datasets belongs to the Institute of Linguistics, Academia Sinica.
Using our model in your script
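from transformers import (
  AutoTokenizer,
  AutoModel,
)
# Download the tokenizer and the pretrained weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese-ws")
# Note: AutoModel returns the bare encoder, without the B/I tagging head.
model = AutoModel.from_pretrained("ckiplab/bert-base-han-chinese-ws")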
Using our model for inference
>>> from transformers import pipeline
>>> classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")
>>> classifier("帝堯曰放勳")
# output
[{'entity': 'B',
  'score': 0.9999793,
  'index': 1,
  'word': '帝',
  'start': 0,
  'end': 1},
 {'entity': 'I',
  'score': 0.9915047,
  'index': 2,
  'word': '堯',
  'start': 1,
  'end': 2},
 {'entity': 'B',
  'score': 0.99992275,
  'index': 3,
  'word': '曰',
  'start': 2,
  'end': 3},
 {'entity': 'B',
  'score': 0.99905187,
  'index': 4,
  'word': '放',
  'start': 3,
  'end': 4},
 {'entity': 'I',
  'score': 0.96299917,
  'index': 5,
  'word': '勳',
  'start': 4,
  'end': 5}]
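
The pipeline returns one tag per character, not the segmented words themselves. The sketch below shows one way to merge those tags into words. It assumes that 'B' marks the first character of a word and 'I' a continuation, which is what the output above suggests but is not stated explicitly here. The helper name tags_to_words and the trimmed sample list are illustrative and not part of the model or of transformers.

```python
from typing import Dict, List


def tags_to_words(tokens: List[Dict]) -> List[str]:
    """Merge per-character B/I predictions into a list of words.

    Assumption: 'B' begins a new word and 'I' continues the current one.
    """
    words: List[str] = []
    for tok in tokens:
        if tok["entity"] == "B" or not words:
            # Start a new word (this also absorbs a stray leading 'I').
            words.append(tok["word"])
        else:
            # Continuation: append the character to the current word.
            words[-1] += tok["word"]
    return words


# Trimmed copy of the pipeline output shown above (scores and offsets omitted).
sample = [
    {"entity": "B", "word": "帝"},
    {"entity": "I", "word": "堯"},
    {"entity": "B", "word": "曰"},
    {"entity": "B", "word": "放"},
    {"entity": "I", "word": "勳"},
]

print(tags_to_words(sample))  # ['帝堯', '曰', '放勳']
```

In a real script, the list returned by classifier("帝堯曰放勳") could be passed to tags_to_words directly, since it carries the same 'entity' and 'word' keys.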