
Speaker diarization

Relies on pyannote.audio 2.0: see installation instructions.

TL;DR

# load the pipeline from the Hugging Face Hub
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2022.07")

# apply the pipeline to an audio file
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
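The RTTM file written above is plain text, with one SPEAKER line per speech turn. As a rough illustration (not part of pyannote.audio; the example line below is made up, and the field layout is assumed from the usual RTTM convention, with onset, duration, and speaker label in the 4th, 5th, and 8th fields), it can be read back with the standard library alone:

```python
# Parse RTTM "SPEAKER" lines back into (speaker, start, end) tuples.
# A typical line looks like:
#   SPEAKER audio 1 6.84 2.10 <NA> <NA> SPEAKER_01 <NA> <NA>
def read_rttm(lines):
    turns = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "SPEAKER":
            onset, duration = float(fields[3]), float(fields[4])
            turns.append((fields[7], onset, round(onset + duration, 3)))
    return turns

example = ["SPEAKER audio 1 6.84 2.10 <NA> <NA> SPEAKER_01 <NA> <NA>"]
print(read_rttm(example))  # [('SPEAKER_01', 6.84, 8.94)]
```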

Advanced usage

If the number of speakers is known in advance, one can use the num_speakers option:

diarization = pipeline("audio.wav", num_speakers=2)

One can also provide lower and/or upper bounds on the number of speakers using the min_speakers and max_speakers options:

diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)

If you feel adventurous, you can play with various pipeline hyper-parameters. For instance, you can make voice activity detection more aggressive by increasing the segmentation_onset threshold:

hparams = pipeline.parameters(instantiated=True)
hparams["segmentation_onset"] += 0.1
pipeline.instantiate(hparams)

Benchmark

Real-time factor

The real-time factor is around 5% using one Nvidia Tesla V100 SXM2 GPU (for the neural inference part) and one Intel Cascade Lake 6248 CPU (for the clustering part).

In other words, it takes approximately 3 minutes to process a one-hour conversation.
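As a quick sanity check of that figure, using the 5% real-time factor quoted above:

```python
# Real-time factor = processing time / audio duration.
rtf = 0.05
audio_minutes = 60.0
processing_minutes = rtf * audio_minutes  # ≈ 3 minutes for one hour of audio
print(processing_minutes)
```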

Accuracy

This pipeline is benchmarked on a growing collection of datasets.

Processing is fully automatic:

  • no manual voice activity detection (as is sometimes the case in the literature)
  • no manual number of speakers (though it can be provided to the pipeline)
  • no fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset

... with the least forgiving diarization error rate (DER) setup (named "Full" in this paper):

  • no forgiveness collar
  • evaluation of overlapped speech
Benchmark          DER%   FA%   Miss%  Conf%  Expected output  File-level evaluation
1238321            14.61  3.31  4.35   6.95   RTTM             eval
1239321 12310321   18.21  3.28  11.07  3.87   RTTM             eval
12311321 12310321  29.00  2.71  21.61  4.68   RTTM             eval
12313321 12314321  30.24  3.71  16.86  9.66   RTTM             eval
12315321           20.99  4.25  10.74  6.00   RTTM             eval
12316321           12.62  1.55  3.30   7.76   RTTM             eval
12317321           12.76  3.45  3.85   5.46   RTTM             eval
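DER is the sum of its three components: false alarm, missed detection, and speaker confusion. A quick check against the first benchmark row above (values in percent, up to rounding):

```python
# DER% decomposes into false alarm + missed detection + speaker confusion.
fa, miss, conf = 3.31, 4.35, 6.95  # first row of the benchmark table
der = fa + miss + conf
print(round(der, 2))  # 14.61, matching the reported DER%
```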

Support

For commercial enquiries and scientific consulting, please contact me. For technical questions and bug reports, please check pyannote.audio's GitHub repository.

Citations

@inproceedings{Bredin2021,
  Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
  Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
  Booktitle = {Proc. Interspeech 2021},
  Address = {Brno, Czech Republic},
  Month = {August},
  Year = {2021},
}
@inproceedings{Bredin2020,
  Title = {{pyannote.audio: neural building blocks for speaker diarization}},
  Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
  Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
  Address = {Barcelona, Spain},
  Month = {May},
  Year = {2020},
}