Model:
rinna/japanese-hubert-base
This is a Japanese HuBERT (Hidden Unit Bidirectional Encoder Representations from Transformers) model trained by rinna Co., Ltd.
This model was trained on a large-scale Japanese audio dataset, the ReazonSpeech corpus.
```python
import torch
from transformers import HubertModel

model = HubertModel.from_pretrained("rinna/japanese-hubert-base")
model.eval()

wav_input_16khz = torch.randn(1, 10000)
outputs = model(wav_input_16khz)

print(f"Input:  {wav_input_16khz.size()}")                 # [1, 10000]
print(f"Output: {outputs.last_hidden_state.size()}")       # [1, 31, 768]
```
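With real recordings, the waveform should be preprocessed with the model's feature extractor rather than passed in as a raw tensor. A minimal sketch, assuming a 16 kHz mono recording at `speech.wav` (a hypothetical path) and that `librosa` is installed:

```python
import librosa
import torch
from transformers import AutoFeatureExtractor, HubertModel

model_id = "rinna/japanese-hubert-base"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = HubertModel.from_pretrained(model_id)
model.eval()

# Load and resample the recording to the 16 kHz rate the model expects.
# "speech.wav" is a hypothetical path used only for illustration.
waveform, sr = librosa.load("speech.wav", sr=16000)

# The feature extractor normalizes the waveform and returns input_values.
inputs = feature_extractor(waveform, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per ~20 ms frame of audio.
print(outputs.last_hidden_state.shape)  # [1, num_frames, 768]
```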
The model architecture is the same as the original HuBERT base model, which contains 12 transformer layers with 12 attention heads. The model was trained using code from the official repository, and the detailed training configuration can be found in the same repository and the original paper.
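Since downstream tasks often probe intermediate layers rather than only the final one, the activations of all 12 transformer layers can be requested directly. A brief sketch (the layer index is illustrative, not a recommendation):

```python
import torch
from transformers import HubertModel

model = HubertModel.from_pretrained("rinna/japanese-hubert-base")
model.eval()

wav_input_16khz = torch.randn(1, 10000)
with torch.no_grad():
    outputs = model(wav_input_16khz, output_hidden_states=True)

# hidden_states holds the convolutional feature projection plus one
# entry per transformer layer: 1 + 12 = 13 tensors of shape [1, 31, 768].
print(len(outputs.hidden_states))      # 13
print(outputs.hidden_states[9].shape)  # layer-9 features, [1, 31, 768]
```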
A fairseq checkpoint file is also available here.
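For fairseq-based pipelines, the checkpoint can be loaded with fairseq's standard checkpoint utilities. A hedged sketch, assuming the file has been downloaded locally as `japanese-hubert-base.pt` (a hypothetical filename) and that fairseq is installed:

```python
import torch
from fairseq import checkpoint_utils

# Hypothetical local path to the downloaded fairseq checkpoint.
ckpt_path = "japanese-hubert-base.pt"

models, cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
model = models[0]
model.eval()

wav_input_16khz = torch.randn(1, 10000)
with torch.no_grad():
    # fairseq's HuBERT exposes extract_features for representation extraction.
    features, _ = model.extract_features(source=wav_input_16khz, padding_mask=None)

print(features.shape)  # [1, 31, 768]
```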
The model was trained on approximately 19,000 hours of the ReazonSpeech corpus.
```bibtex
@article{hubert2021hsu,
    author={Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman},
    journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
    year={2021},
    volume={29},
    pages={3451-3460},
    doi={10.1109/TASLP.2021.3122291}
}
```