🎹 Speaker diarization

Relies on pyannote.audio 2.0: see installation instructions.

TL;DR

# load the pipeline from Hugging Face Hub
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2022.07")

# apply the pipeline to an audio file
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
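
To read the RTTM output back without pyannote.audio, the sketch below parses RTTM lines in plain Python. It assumes the usual ten-field SPEAKER record layout (onset in the fourth field, duration in the fifth, speaker name in the eighth). The Turn type, the parse_rttm helper, and the sample lines are hypothetical and serve only as illustration.

```python
from typing import Iterable, List, NamedTuple


class Turn(NamedTuple):
    start: float   # onset, in seconds
    end: float     # onset + duration, in seconds
    speaker: str   # speaker label as written in the RTTM file


def parse_rttm(lines: Iterable[str]) -> List[Turn]:
    """Parse RTTM lines into speaker turns, skipping blank or non-SPEAKER rows."""
    turns = []
    for line in lines:
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start, duration = float(fields[3]), float(fields[4])
        turns.append(Turn(start, start + duration, fields[7]))
    return turns


# Illustrative lines only; real content comes from the "audio.rttm" file.
sample = [
    "SPEAKER audio 1 0.500 2.000 <NA> <NA> SPEAKER_00 <NA> <NA>",
    "SPEAKER audio 1 2.750 1.250 <NA> <NA> SPEAKER_01 <NA> <NA>",
]
for turn in parse_rttm(sample):
    print(f"{turn.start:.2f}s - {turn.end:.2f}s: {turn.speaker}")
```

To parse the file written above, pass the open file object instead of the sample list, for example parse_rttm(open("audio.rttm")).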

Advanced usage

If the number of speakers is known in advance, use the num_speakers option:

diarization = pipeline("audio.wav", num_speakers=2)

You can also provide lower and/or upper bounds on the number of speakers using the min_speakers and max_speakers options:

diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
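
A minimal sketch of how one might validate these options before calling the pipeline. The helper name speaker_count_options and its checks (in particular, treating num_speakers and the bounds as mutually exclusive) are assumptions made for illustration, not documented behavior of pyannote.audio.

```python
from typing import Dict, Optional


def speaker_count_options(
    num_speakers: Optional[int] = None,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
) -> Dict[str, int]:
    """Return keyword arguments for the pipeline call (hypothetical helper)."""
    options = {
        "num_speakers": num_speakers,
        "min_speakers": min_speakers,
        "max_speakers": max_speakers,
    }
    # Keep only the options the caller actually provided.
    options = {name: value for name, value in options.items() if value is not None}

    for name, value in options.items():
        if value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value}")

    # Assumption of this sketch: an exact count and bounds are not combined.
    if "num_speakers" in options and len(options) > 1:
        raise ValueError("pass either num_speakers or min/max bounds, not both")

    if options.get("min_speakers", 1) > options.get("max_speakers", float("inf")):
        raise ValueError("min_speakers must not exceed max_speakers")

    return options


print(speaker_count_options(min_speakers=2, max_speakers=5))
# With pyannote.audio installed, the result could be passed along as:
# diarization = pipeline("audio.wav", **speaker_count_options(min_speakers=2, max_speakers=5))
```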

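If you feel adventurous, you can try tuning the various pipeline hyper-parameters. For instance, you can make voice activity detection more aggressive by increasing the segmentation_onset threshold: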

hparams = pipeline.parameters(instantiated=True)
hparams["segmentation_onset"] += 0.1
pipeline.instantiate(hparams)

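Benchmarks

Real-time factor

The real-time factor is around 5% using one Nvidia Tesla V100 SXM2 GPU (for the neural inference part) and one Intel Cascade Lake 6248 CPU (for the clustering part).

In other words, it takes approximately 3 minutes to process a one-hour conversation.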
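
The arithmetic behind that figure can be sketched as follows. The function name processing_time_seconds is hypothetical, and the 5% default applies only to the hardware quoted above; other setups will differ.

```python
def processing_time_seconds(audio_seconds: float, real_time_factor: float = 0.05) -> float:
    """Estimate processing time as audio duration times the real-time factor."""
    return audio_seconds * real_time_factor


one_hour = 60 * 60  # seconds of audio
minutes = processing_time_seconds(one_hour) / 60
print(f"{minutes:.1f} minutes")  # prints "3.0 minutes"
```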

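Accuracy

This pipeline is benchmarked on a growing collection of datasets.

Processing is fully automatic:

no manual voice activity detection (as is sometimes the case in the literature)
no manual specification of the number of speakers (though it is possible to provide it to the pipeline)
no fine-tuning of the internal models and no tuning of the pipeline hyper-parameters for each dataset

... with the least forgiving diarization error rate (DER) setup (named "Full" in this paper):

no forgiveness collar
evaluation of overlapped speech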

Benchmark	DER%	FA%	Miss%	Conf%	Expected output	File-level evaluation
1238321	14.61	3.31	4.35	6.95	RTTM	eval
1239321 12310321	18.21	3.28	11.07	3.87	RTTM	eval
12311321 12310321	29.00	2.71	21.61	4.68	RTTM	eval
12313321 12314321	30.24	3.71	16.86	9.66	RTTM	eval
12315321	20.99	4.25	10.74	6.00	RTTM	eval
12316321	12.62	1.55	3.30	7.76	RTTM	eval
12317321	12.76	3.45	3.85	5.46	RTTM	eval
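
In the table above, the DER column equals the sum of the false alarm (FA), missed detection (Miss), and speaker confusion (Conf) rates, up to rounding of the published figures. The sketch below checks this for each row; the helper name der_from_components and the 0.015 tolerance are choices made for this illustration.

```python
# (DER%, FA%, Miss%, Conf%) copied row by row from the table above
rows = [
    (14.61, 3.31, 4.35, 6.95),
    (18.21, 3.28, 11.07, 3.87),
    (29.00, 2.71, 21.61, 4.68),
    (30.24, 3.71, 16.86, 9.66),
    (20.99, 4.25, 10.74, 6.00),
    (12.62, 1.55, 3.30, 7.76),
    (12.76, 3.45, 3.85, 5.46),
]


def der_from_components(fa: float, miss: float, conf: float) -> float:
    """Diarization error rate as the sum of its three error components."""
    return fa + miss + conf


for der, fa, miss, conf in rows:
    recomputed = der_from_components(fa, miss, conf)
    # Published values are rounded to two decimals, so allow a small gap.
    status = "ok" if abs(recomputed - der) <= 0.015 else "mismatch"
    print(f"reported {der:.2f}  recomputed {recomputed:.2f}  {status}")
```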

Support

For commercial enquiries and scientific consulting, please contact me. For technical questions and bug reports, please check the pyannote.audio GitHub repository.

Citation

@inproceedings{Bredin2021,
  Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
  Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
  Booktitle = {Proc. Interspeech 2021},
  Address = {Brno, Czech Republic},
  Month = {August},
  Year = {2021},
}

@inproceedings{Bredin2020,
  Title = {{pyannote.audio: neural building blocks for speaker diarization}},
  Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
  Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
  Address = {Barcelona, Spain},
  Month = {May},
  Year = {2020},
}

Author:

Philipp Schmid

Dataset size:

11.32 MB