Merge branch 'main' into patch-2

Commit d6db821ac7 by 源文雨 on 2023-04-24 20:22:08 +08:00, committed by GitHub.
36 changed files with 2062 additions and 763 deletions


@@ -2,18 +2,20 @@ name: pull format
 on: [pull_request]
+permissions:
+  contents: write
 jobs:
   pull_format:
+    permissions:
+      actions: write
+      checks: write
+      contents: write
     runs-on: ubuntu-latest
     continue-on-error: true
     steps:
-      - uses: actions/checkout@v3
+      - name: checkout
+        continue-on-error: true
+        uses: actions/checkout@v3
         with:
           ref: ${{ github.head_ref }}
+          fetch-depth: 0
       - name: Set up Python ${{ matrix.python-version }}
         uses: actions/setup-python@v4

.github/workflows/push_format.yml (new file, 46 lines)

@@ -0,0 +1,46 @@
name: push format

on:
  push:
    branches:
      - main

permissions:
  contents: write
  pull-requests: write

jobs:
  push_format:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          ref: ${{github.ref_name}}
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install Black
        run: pip install black
      - name: Run Black
        # run: black $(git ls-files '*.py')
        run: black .
      - name: Commit Back
        continue-on-error: true
        id: commitback
        run: |
          git config --local user.email "github-actions[bot]@users.noreply.github.com"
          git config --local user.name "github-actions[bot]"
          git add --all
          git commit -m "Format code"
      - name: Create Pull Request
        if: steps.commitback.outcome == 'success'
        continue-on-error: true
        uses: peter-evans/create-pull-request@v4
        with:
          body: Apply Code Formatter Change
          commit-message: Automatic code format


@@ -1,15 +1,23 @@
-20230409
-修正训练参数提升显卡平均利用率A100最高从25%提升至90%左右V100:50%->90%左右2060S:60%->85%左右P40:25%->95%左右,训练速度显著提升
-修正参数总batch_size改为每张卡的batch_size
-修正total_epoch最大限制100解锁至1000默认10提升至默认20
-修复ckpt提取识别是否带音高错误导致推理异常的问题
-修复分布式训练每个rank都保存一次ckpt的问题
-特征提取进行nan特征过滤
-修复静音输入输出随机辅音or噪声的问题老版模型需要重做训练集重训
+### 20230409
+- 修正训练参数提升显卡平均利用率A100最高从25%提升至90%左右V100:50%->90%左右2060S:60%->85%左右P40:25%->95%左右,训练速度显著提升
+- 修正参数总batch_size改为每张卡的batch_size
+- 修正total_epoch最大限制100解锁至1000默认10提升至默认20
+- 修复ckpt提取识别是否带音高错误导致推理异常的问题
+- 修复分布式训练每个rank都保存一次ckpt的问题
+- 特征提取进行nan特征过滤
+- 修复静音输入输出随机辅音or噪声的问题老版模型需要重做训练集重训
+
+### 20230416更新
+- 新增本地实时变声迷你GUI双击go-realtime-gui.bat启动
+- 训练推理均对<50Hz的频段进行滤波过滤
+- 训练推理音高提取pyworld最低音高从默认80下降至50,50-80hz间的男声低音不会哑
+- WebUI支持根据系统区域变更语言现支持en_USja_JPzh_CNzh_HKzh_SGzh_TW不支持的默认en_US
+- 修正部分显卡识别例如V100-16G识别失败P4识别失败
+
+### 后续计划:
+- 收集呼吸wav加入训练集修正呼吸变声电音的问题
+- 研究更优的默认faiss索引配置计划将索引打包进weights/xxx.pth中取消推理界面的 特征/检索库 选择
+- 根据显存情况和显卡架构自动给到最优配置batch size训练集切块推理音频长度相关的config训练是否fp16未来所有>=4G显存的>=pascal架构的显卡都可以训练或推理<4G显存的显卡不会进行支持
+- 我们正在训练增加了歌声训练集的底模,未来会公开
+- 推理音高识别选项加入"是否开启中值滤波"
+- 增加选项:每次epoch保存的小模型均进行提取; 增加选项:设置默认测试集音频每次保存的小模型均在保存后对其进行推理导出用户可试听来选择哪个中间epoch最好


@@ -60,7 +60,7 @@ poetry install
 你也可以通过pip来安装依赖
-**注意**: `MacOS``faiss 1.7.2`版本会导致抛出段错误,请将`requirements.txt`的对应条目改为`faiss-cpu==1.7.0`
+**注意**: `MacOS``faiss 1.7.2`版本会导致抛出段错误,在手动安装时请使用命令`pip install faiss-cpu==1.7.0`指定使用`1.7.0`版本
 ```bash
 pip install -r requirements.txt


@@ -30,7 +30,7 @@ parser.add_argument(
 cmd_opts = parser.parse_args()
 python_cmd = cmd_opts.pycmd
-listen_port = cmd_opts.port
+listen_port = cmd_opts.port if 0 <= cmd_opts.port <= 65535 else 7865
 iscolab = cmd_opts.colab
 noparallel = cmd_opts.noparallel
 noautoopen = cmd_opts.noautoopen


@@ -24,6 +24,9 @@ An easy-to-use SVC framework based on VITS.<br><br>
 > Realtime Voice Conversion Software using RVC : [w-okada/voice-changer](https://github.com/w-okada/voice-changer)
+> The dataset for the pre-training model uses nearly 50 hours of high quality VCTK open source dataset.
+> High quality licensed song datasets will be added to training-set one after another for your use, without worrying about copyright infringement.
 ## Summary
 This repository has the following features:
 + Reduce tone leakage by replacing source feature to training-set feature using top1 retrieval;
@@ -32,7 +35,6 @@ This repository has the following features:
 + Supporting model fusion to change timbres (using ckpt processing tab->ckpt merge);
 + Easy-to-use Webui interface;
 + Use the UVR5 model to quickly separate vocals and instruments.
-+ The dataset for the pre-training model uses nearly 50 hours of high quality VCTK open source dataset, and high quality licensed song datasets will be added to training-set one after another for your use, without worrying about copyright infringement.
 ## Preparing the environment
 We recommend you install the dependencies through poetry.
@@ -43,8 +45,7 @@ The following commands need to be executed in the environment of Python version
 pip install torch torchvision torchaudio
 #For Windows + Nvidia Ampere Architecture(RTX30xx), you need to specify the cuda version corresponding to pytorch according to the experience of https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI/issues/21
-pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
+#pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
 # Install the Poetry dependency management tool, skip if installed
 # Reference: https://python-poetry.org/docs/#installation
@@ -55,7 +56,7 @@ poetry install
 ```
 You can also use pip to install the dependencies
-**Notice**: `faiss 1.7.2` will raise Segmentation Fault: 11 under `MacOS`, please change corresponding line in `requirements.txt` to `faiss-cpu==1.7.0`
+**Notice**: `faiss 1.7.2` will raise Segmentation Fault: 11 under `MacOS`, please use `pip install faiss-cpu==1.7.0` if you use pip to install it manually.
 ```bash
 pip install -r requirements.txt
@@ -83,12 +84,16 @@ python infer-web.py
 ```
 If you are using Windows, you can download and extract `RVC-beta.7z` to use RVC directly and use `go-web.bat` to start Webui.
-We will develop an English version of the WebUI in 2 weeks.
 There's also a tutorial on RVC in Chinese and you can check it out if needed.
 ## Credits
++ [ContentVec](https://github.com/auspicious3000/contentvec/)
++ [VITS](https://github.com/jaywalnut310/vits)
++ [HIFIGAN](https://github.com/jik876/hifi-gan)
++ [Gradio](https://github.com/gradio-app/gradio)
++ [FFmpeg](https://github.com/FFmpeg/FFmpeg)
++ [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
++ [audio-slicer](https://github.com/openvpi/audio-slicer)
 ## Thanks to all contributors for their efforts
 <a href="https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI/graphs/contributors" target="_blank">


@@ -21,58 +21,58 @@ VITSに基づく使いやすい音声変換voice changerframework<br><br>
 [**English**](./README.en.md) | [**中文简体**](../README.md) | [**日本語**](./README.ja.md)
 > デモ動画は[こちら](https://www.bilibili.com/video/BV1pm4y1z7Gm/)でご覧ください
 > RVCによるリアルタイム音声変換: [w-okada/voice-changer](https://github.com/w-okada/voice-changer)
-> 基底modelを訓練(training)したのは、約50時間の高品質なオープンソースのデータセット。著作権侵害を心配することなく使用できるように
+> 著作権侵害を心配することなく使用できるように、基底モデルは約50時間の高品質なオープンソースデータセットで訓練されています。
-> 今後は次々と使用許可のある高品質歌声資料集を追加し、基底modelを訓練する
+> 今後も、次々と使用許可のある高品質な歌声の資料集を追加し、基底モデルを訓練する予定です。
 ## はじめに
-repoは下記の特徴があります
+リポジトリには下記の特徴があります。
-+ 調子(tone)の漏洩が下がれるためtop1検索で源特徴量を訓練集特徴量に置換
++ Top1検索を用いることで、生の特徴量を訓練用データセット特徴量に変換し、トーンリーケージを削減します。
-+ 古い又は安いGPUでも高速に訓練できる
++ 比較的貧弱なGPUでも、高速かつ簡単に訓練できます。
-+ 小さい訓練集でもかなりいいmodelを得られる(10分以上の低noise音声を推奨)
++ 少量のデータセットからでも、比較的良い結果を得ることができます。10分以上のノイズの少ない音声を推奨します。
-+ modelを融合し音色をmergeできる(ckpt processing->ckpt mergeで使用)
++ モデルを融合することで、音声を混ぜることができます。ckpt processingタブの、ckpt mergeを使用します。
 + 使いやすいWebUI
-+ UVR5 Modelも含めるため人声とBGMを素早く分離できる
++ UVR5 Modelも含んでいるため、人の声とBGMを素早く分離できます。
 ## 環境構築
-poetryで依存関係をinstallすることをお勧めします。
+Poetryで依存関係をインストールすることをお勧めします。
-下記のcommandsは、Python3.8以上の環境で実行する必要があります:
+下記のコマンドは、Python3.8以上の環境で実行する必要があります:
 ```bash
-# PyTorch関連の依存関係をinstall。install済の場合はskip
+# PyTorch関連の依存関係をインストール。インストール済の場合は省略。
 # 参照先: https://pytorch.org/get-started/locally/
 pip install torch torchvision torchaudio
 #Windows Nvidia Ampere Architecture(RTX30xx)の場合、 #21 に従い、pytorchに対応するcuda versionを指定する必要があります。
 #pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
-# PyTorch関連の依存関係をinstall。install済の場合はskip
+# PyTorch関連の依存関係をインストール。インストール済の場合は省略。
 # 参照先: https://python-poetry.org/docs/#installation
 curl -sSL https://install.python-poetry.org | python3 -
-# Poetry経由で依存関係をinstall
+# Poetry経由で依存関係をインストール
 poetry install
 ```
-pipでも依存関係のinstallが可能です:
+pipでも依存関係のインストールが可能です:
-**注意**:`faiss 1.7.2``macOS``Segmentation Fault: 11`を起こすので、`requirements.txt`の該当行を `faiss-cpu==1.7.0`に変更してください。
+**注意**:`faiss 1.7.2``macOS``Segmentation Fault: 11`を起こすので、マニュアルインストールする場合は、 `pip install faiss-cpu==1.7.0`を実行してください。
 ```bash
 pip install -r requirements.txt
 ```
 ## 基底modelsを準備
-RVCは推論/訓練のために、様々な事前訓練を行った基底modelsが必要です。
+RVCは推論/訓練のために、様々な事前訓練を行った基底モデルを必要とします。
 modelsは[Hugging Face space](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/)からダウンロードできます。
-以下は、RVCに必要な基底modelsやその他のfilesの一覧です。
+以下は、RVCに必要な基底モデルやその他のファイルの一覧です。
 ```bash
 hubert_base.pt
@@ -80,16 +80,16 @@ hubert_base.pt
 ./uvr5_weights
-# ffmpegがすでにinstallされている場合はskip
+# ffmpegがすでにinstallされている場合は省略
 ./ffmpeg
 ```
-その後、下記のcommandでWebUIを起動
+その後、下記のコマンドでWebUIを起動します。
 ```bash
 python infer-web.py
 ```
-Windowsをお使いの方は、直接`RVC-beta.7z`をダウンロード後に展開し、`go-web.bat`clickでWebUIを起動。(7zipが必要です)
+Windowsをお使いの方は、直接`RVC-beta.7z`をダウンロード後に展開し、`go-web.bat`クリックすることで、WebUIを起動することができます。(7zipが必要です。)
-また、repoに[小白简易教程.doc](./小白简易教程.doc)がありますので、参考にしてください(中国語版のみ)。
+また、リポジトリに[小白简易教程.doc](./小白简易教程.doc)がありますので、参考にしてください(中国語版のみ)。
 ## 参考プロジェクト
 + [ContentVec](https://github.com/auspicious3000/contentvec/)
@@ -100,7 +100,7 @@ Windowsをお使いの方は、直接に`RVC-beta.7z`をダウンロード後に
 + [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
 + [audio-slicer](https://github.com/openvpi/audio-slicer)
-## 貢献者(contributer)の皆様の尽力に感謝します
+## 貢献者(contributor)の皆様の尽力に感謝します
 <a href="https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI/graphs/contributors" target="_blank">
   <img src="https://contrib.rocks/image?repo=liujing04/Retrieval-based-Voice-Conversion-WebUI" />
 </a>

docs/faiss_tips_en.md (new file, 146 lines)

@@ -0,0 +1,146 @@
faiss tuning TIPS
==================
# about faiss
faiss is a library of neighborhood searches for dense vectors, developed by facebook research, which efficiently implements many approximate neighborhood search methods.
Approximate Neighbor Search finds similar vectors quickly while sacrificing some accuracy.
## faiss in RVC
In RVC, for the embedding of features converted by HuBERT, we search for embeddings similar to the embedding generated from the training data and mix them to achieve a conversion that is closer to the original speech. However, since this search takes time if performed naively, high-speed conversion is realized by using approximate neighborhood search.
# implementation overview
The directory '/logs/your-experiment/3_feature256' (alongside the model) contains the features extracted by HuBERT from each piece of audio data.
From here we read the npy files in order sorted by filename and concatenate the vectors to create big_npy. (This vector has shape [N, 256].)
After saving big_npy as /logs/your-experiment/total_fea.npy, train it with faiss.
As of 2023/04/18, an IVF index based on L2 distance is built using the index factory function of faiss.
The number of IVF divisions (n_ivf) is N//39, and n_probe uses int(np.power(n_ivf, 0.3)). (Look around train_index in infer-web.py.)
In this article, I will first explain the meaning of these parameters, and then write advice for developers to create a better index.
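As a point of reference, here is a minimal sketch of that pipeline under the assumptions stated above (the output file name is illustrative; the real index follows the add_XXX.index pattern and infer-web.py's train_index may include additional details):

```python
import os
import glob
import numpy as np
import faiss

exp_dir = "logs/your-experiment"

# Read the HuBERT features (one [n_frames, 256] array per utterance) in filename order.
files = sorted(glob.glob(os.path.join(exp_dir, "3_feature256", "*.npy")))
big_npy = np.concatenate([np.load(f) for f in files], axis=0).astype("float32")  # shape [N, 256]
np.save(os.path.join(exp_dir, "total_fea.npy"), big_npy)

# IVF over L2 distance, with the n_ivf / n_probe heuristics quoted above.
n_ivf = big_npy.shape[0] // 39
index = faiss.index_factory(256, "IVF%s,Flat" % n_ivf)
faiss.extract_index_ivf(index).nprobe = int(np.power(n_ivf, 0.3))
index.train(big_npy)
index.add(big_npy)
faiss.write_index(index, os.path.join(exp_dir, "added_index_example.index"))
```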
# Explanation of the method
## index factory
An index factory is a unique faiss notation that expresses a pipeline that connects multiple approximate neighborhood search methods as a string.
This allows you to try various approximate neighborhood search methods simply by changing the index factory string.
In RVC it is used like this:
```python
index = faiss.index_factory(256, "IVF%s,Flat" % n_ivf)
```
Among the arguments of index_factory, the first is the number of dimensions of the vector, the second is the index factory string, and the third is the distance metric to use.
For more detailed notation, see https://github.com/facebookresearch/faiss/wiki/The-index-factory
## index for distance
Two metrics are typically used as the similarity measure for embeddings:
- Euclidean distance (METRIC_L2)
- inner product (METRIC_INNER_PRODUCT)
Euclidean distance takes the squared difference in each dimension, sums the differences over all dimensions, and then takes the square root. This is the same distance we use in everyday 2D and 3D space.
The inner product is usually not used as a similarity measure directly; instead, cosine similarity, i.e. the inner product of L2-normalized vectors, is generally used.
Which is better depends on the case, but cosine similarity is commonly used for embeddings obtained from word2vec and for similar-image retrieval models trained with ArcFace. To L2-normalize a vector X with numpy, use the following code, with eps small enough to avoid division by zero.
```python
X_normed = X / np.maximum(eps, np.linalg.norm(X, ord=2, axis=-1, keepdims=True))
```
You can also change the distance metric used for the calculation by choosing the value passed as the third argument of the index factory.
```python
index = faiss.index_factory(dimension, text, faiss.METRIC_INNER_PRODUCT)
```
## IVF
IVF (Inverted file indexes) is an algorithm similar to the inverted index in full-text search.
During learning, the search target is clustered with kmeans, and Voronoi partitioning is performed using the cluster center. Each data point is assigned a cluster, so we create a dictionary that looks up the data points from the clusters.
For example, if clusters are assigned as follows
|index|Cluster|
|-----|-------|
|1|A|
|2|B|
|3|A|
|4|C|
|5|B|
The resulting inverted index looks like this:
|cluster|index|
|-------|-----|
|A|1, 3|
|B|2, 5|
|C|4|
When searching, we first search n_probe clusters from the clusters, and then calculate the distances for the data points belonging to each cluster.
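To make the search step concrete, here is a hedged, self-contained sketch of training an IVF index, querying it with nprobe, and blending the retrieved training features back in, mirroring the top-1 retrieval and index_rate mixing used in the realtime GUI elsewhere in this commit (the random data, variable names and index_rate value are purely illustrative):

```python
import numpy as np
import faiss

rng = np.random.default_rng(0)
big_npy = rng.standard_normal((10000, 256)).astype("float32")  # stand-in training features
query = rng.standard_normal((200, 256)).astype("float32")      # stand-in HuBERT frames to convert

n_ivf = big_npy.shape[0] // 39
index = faiss.index_factory(256, "IVF%s,Flat" % n_ivf)
index.train(big_npy)
index.add(big_npy)
faiss.extract_index_ivf(index).nprobe = 1     # number of Voronoi cells visited per query

_, I = index.search(query, 1)                 # top-1 neighbour for every frame
retrieved = big_npy[I.squeeze()]              # nearest training feature per frame

index_rate = 0.5                              # illustrative blending weight
mixed = index_rate * retrieved + (1 - index_rate) * query
```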
# recommend parameter
There are official guidelines on how to choose an index, so I will explain accordingly.
https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
For datasets below 1M, 4bit-PQ is the most efficient method available in faiss as of April 2023.
Combining this with IVF, narrowing down the candidates with 4bit-PQ and finally recalculating the distances with an exact index, can be expressed with the following index factory string.
```python
index = faiss.index_factory(256, "IVF1024,PQ128x4fs,RFlat")
```
## Recommended parameters for IVF
Consider the case of too many IVF lists: if, for example, the number of lists equals the number of data points, coarse quantization degenerates into a naive exhaustive search and is inefficient.
For 1M points or fewer, IVF values between 4*sqrt(N) and 16*sqrt(N) are recommended for N data points.
Since the calculation time grows in proportion to n_probe, choose it by trading off against the accuracy you need. Personally, I don't think RVC needs that much accuracy, so n_probe = 1 is fine. A small helper below illustrates the guideline.
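This is only the rule of thumb quoted above, not a tuned recommendation:

```python
import numpy as np

def recommended_n_ivf_range(n_points):
    """4*sqrt(N) .. 16*sqrt(N) IVF lists are suggested for datasets of 1M points or fewer."""
    root = np.sqrt(n_points)
    return int(4 * root), int(16 * root)

# e.g. 100k feature frames -> roughly 1264 .. 5059 IVF lists
print(recommended_n_ivf_range(100_000))
```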
## FastScan
FastScan is a method that approximates distances with product quantization and evaluates them inside CPU registers, which makes it fast.
Product quantization clusters each group of d dimensions independently (usually d = 2) during training, precomputes the distances between clusters, and stores them in a lookup table. At prediction time the distance of each dimension group can then be read from the lookup table in O(1).
The number you specify after PQ therefore usually specifies half the dimension of the vector.
For a more detailed description of FastScan, please refer to the official documentation.
https://github.com/facebookresearch/faiss/wiki/Fast-accumulation-of-PQ-and-AQ-codes-(FastScan)
## RFlat
RFlat is an instruction to recalculate the rough distance calculated by FastScan with the exact distance specified by the third argument of index factory.
When getting k neighbors, k*k_factor points are recalculated.
# Techniques for embedding
## alpha query expansion
Query expansion is a technique used in search, for example in full-text search, where a few words are added to the query to improve accuracy. Several variants have also been proposed for vector search; among them, α-query expansion is known to be highly effective and requires no additional training. It is described in [Attention-Based Query Expansion Learning](https://arxiv.org/abs/2007.08019) and was used in the [2nd place solution of the kaggle shopee competition](https://www.kaggle.com/code/lyakaap/2nd-place-solution/notebook).
α-query expansion sums each vector with its neighboring vectors, weighted by the similarity raised to the power α. The code example below applies this; replace big_npy with its α-query-expanded version.
```python
alpha = 3.
index = faiss.index_factory(256, "IVF512,PQ128x4fs,RFlat")
original_norm = np.maximum(np.linalg.norm(big_npy, ord=2, axis=1, keepdims=True), 1e-9)
big_npy /= original_norm
index.train(big_npy)
index.add(big_npy)
dist, neighbor = index.search(big_npy, num_expand)
expand_arrays = []
ixs = np.arange(big_npy.shape[0])
for i in range(-(-big_npy.shape[0]//batch_size)):
ix = ixs[i*batch_size:(i+1)*batch_size]
weight = np.power(np.einsum("nd,nmd->nm", big_npy[ix], big_npy[neighbor[ix]]), alpha)
expand_arrays.append(np.sum(big_npy[neighbor[ix]] * np.expand_dims(weight, axis=2),axis=1))
big_npy = np.concatenate(expand_arrays, axis=0)
# normalize index version
big_npy = big_npy / np.maximum(np.linalg.norm(big_npy, ord=2, axis=1, keepdims=True), 1e-9)
```
This is a technique that can be applied both to the query that does the search and to the DB being searched.
## Compress embedding with MiniBatch KMeans
If total_fea.npy is too large, it is possible to shrink the set of vectors using KMeans.
The following code compresses the embeddings: set n_clusters to the target size, and set batch_size to 256 * the number of CPU cores to fully benefit from CPU parallelization.
```python
import multiprocessing
from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(n_clusters=10000, batch_size=256 * multiprocessing.cpu_count(), init="random")
kmeans.fit(big_npy)
sample_npy = kmeans.cluster_centers_
```

docs/faiss_tips_ja.md (new file, 146 lines)

@@ -0,0 +1,146 @@
faiss tuning TIPS
==================
# about faiss
faissはfacebook researchの開発する、密なベクトルに対する近傍探索をまとめたライブラリで、多くの近似近傍探索の手法を効率的に実装しています。
近似近傍探索はある程度精度を犠牲にしながら高速に類似するベクトルを探します。
## faiss in RVC
RVCではHuBERTで変換した特徴量のEmbeddingに対し、学習データから生成されたEmbeddingと類似するものを検索し、混ぜることでより元の音声に近い変換を実現しています。ただ、この検索は愚直に行うと時間がかかるため、近似近傍探索を用いることで高速な変換を実現しています。
# 実装のoverview
モデルが配置されている '/logs/your-experiment/3_feature256'には各音声データからHuBERTで抽出された特徴量が配置されています。
ここからnpyファイルをファイル名でソートした順番で読み込み、ベクトルを連結してbig_npyを作成します。(このベクトルのshapeは[N, 256]です。)
big_npyを/logs/your-experiment/total_fea.npyとして保存した後、faissを学習させます。
2023/04/18時点ではfaissのindex factoryの機能を用いて、L2距離に基づくIVFを用いています。
IVFの分割数(n_ivf)はN//39で、n_probeはint(np.power(n_ivf, 0.3))が採用されています。(infer-web.pyのtrain_index周りを探してください。)
本Tipsではまずこれらのパラメータの意味を解説し、その後よりよいindexを作成するための開発者向けアドバイスを書きます。
# 手法の解説
## index factory
index factoryは複数の近似近傍探索の手法を繋げるパイプラインをstringで表記するfaiss独自の記法です。
これにより、index factoryの文字列を変更するだけで様々な近似近傍探索の手法を試せます。
RVCでは以下のように使われています。
```python
index = faiss.index_factory(256, "IVF%s,Flat" % n_ivf)
```
index_factoryの引数のうち、1つ目はベクトルの次元数、2つ目はindex factoryの文字列で、3つ目には用いる距離を指定することができます。
より詳細な記法については
https://github.com/facebookresearch/faiss/wiki/The-index-factory
## 距離指標
embeddingの類似度として用いられる代表的な指標として以下の二つがあります。
- ユークリッド距離(METRIC_L2)
- 内積(METRIC_INNER_PRODUCT)
ユークリッド距離では各次元において二乗の差をとり、全次元の差を足してから平方根をとります。これは日常的に用いる2次元、3次元での距離と同じです。
内積はこのままでは類似度の指標として用いず、一般的にはL2ノルムで正規化してから内積をとるコサイン類似度を用います。
どちらがよいかは場合によりますが、word2vec等で得られるembeddingやArcFace等で学習した類似画像検索のモデルではコサイン類似度が用いられることが多いです。ベクトルXに対してl2正規化をnumpyで行う場合は、0 divisionを避けるために十分に小さな値をepsとして以下のコードで可能です。
```python
X_normed = X / np.maximum(eps, np.linalg.norm(X, ord=2, axis=-1, keepdims=True))
```
また、index factoryには第3引数に渡す値を選ぶことで計算に用いる距離指標を変更できます。
```python
index = faiss.index_factory(dimention, text, faiss.METRIC_INNER_PRODUCT)
```
## IVF
IVF(Inverted file indexes)は全文検索における転置インデックスと似たようなアルゴリズムです。
学習時には検索対象に対してkmeansでクラスタリングを行い、クラスタ中心を用いてボロノイ分割を行います。各データ点には一つずつクラスタが割り当てられるので、クラスタからデータ点を逆引きする辞書を作成します。
例えば以下のようにクラスタが割り当てられた場合
|index|クラスタ|
|-----|-------|
|1|A|
|2|B|
|3|A|
|4|C|
|5|B|
作成される転置インデックスは以下のようになります。
|クラスタ|index|
|-------|-----|
|A|1, 3|
|B|2, 5|
|C|4|
検索時にはまずクラスタからn_probe個のクラスタを検索し、次にそれぞれのクラスタに属するデータ点について距離を計算します。
# 推奨されるパラメータ
indexの選び方については公式にガイドラインがあるので、それに準じて説明します。
https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
1M以下のデータセットにおいては4bit-PQが2023年4月時点ではfaissで利用できる最も効率的な手法です。
これをIVFと組み合わせ、4bit-PQで候補を絞り、最後に正確な指標で距離を再計算するには以下のindex factoryを用いることで記載できます。
```python
index = faiss.index_factory(256, "IVF1024,PQ128x4fs,RFlat")
```
## IVFの推奨パラメータ
IVFの数が多すぎる場合、たとえばデータ数の数だけIVFによる粗量子化を行うと、これは愚直な全探索と同じになり効率が悪いです。
1M以下の場合ではIVFの値はデータ点の数Nに対して4*sqrt(N) ~ 16*sqrt(N)に推奨しています。
n_probeはn_probeの数に比例して計算時間が増えるので、精度と相談して適切に選んでください。個人的にはRVCにおいてそこまで精度は必要ないと思うのでn_probe = 1で良いと思います。
## FastScan
FastScanは直積量子化で大まかに距離を近似するのを、レジスタ内で行うことにより高速に行うようにした手法です。
直積量子化は学習時にd次元ごと(通常はd=2)に独立してクラスタリングを行い、クラスタ同士の距離を事前計算してlookup tableを作成します。予測時はlookup tableを見ることで各次元の距離をO(1)で計算できます。
そのため、PQの次に指定する数字は通常ベクトルの半分の次元を指定します。
FastScanに関するより詳細な説明は公式のドキュメントを参照してください。
https://github.com/facebookresearch/faiss/wiki/Fast-accumulation-of-PQ-and-AQ-codes-(FastScan)
## RFlat
RFlatはFastScanで計算した大まかな距離を、index factoryの第三引数で指定した正確な距離で再計算する指示です。
k個の近傍を取得する際は、k*k_factor個の点について再計算が行われます。
# Embeddingに関するテクニック
## alpha query expansion
クエリ拡張は検索で使われるテクニックで、例えば全文検索では入力された検索文に単語を幾つか追加することで検索精度を上げることがあります。ベクトル検索にもいくつか提唱されていて、その内追加の学習がいらず効果が高い手法としてα-query expansionが知られています。論文では[Attention-Based Query Expansion Learning](https://arxiv.org/abs/2007.08019)などで紹介されていて、[kaggleのshopeeコンペの2位の解法](https://www.kaggle.com/code/lyakaap/2nd-place-solution/notebook)にも用いられていました。
α-query expansionはあるベクトルに対し、近傍のベクトルを類似度のα乗した重みで足し合わせることでできます。いかにコードの例を張ります。big_npyをα query expansionしたものに置き換えます。
```python
alpha = 3.
index = faiss.index_factory(256, "IVF512,PQ128x4fs,RFlat")
original_norm = np.maximum(np.linalg.norm(big_npy, ord=2, axis=1, keepdims=True), 1e-9)
big_npy /= original_norm
index.train(big_npy)
index.add(big_npy)
dist, neighbor = index.search(big_npy, num_expand)
expand_arrays = []
ixs = np.arange(big_npy.shape[0])
for i in range(-(-big_npy.shape[0]//batch_size)):
ix = ixs[i*batch_size:(i+1)*batch_size]
weight = np.power(np.einsum("nd,nmd->nm", big_npy[ix], big_npy[neighbor[ix]]), alpha)
expand_arrays.append(np.sum(big_npy[neighbor[ix]] * np.expand_dims(weight, axis=2),axis=1))
big_npy = np.concatenate(expand_arrays, axis=0)
# normalize index version
big_npy = big_npy / np.maximum(np.linalg.norm(big_npy, ord=2, axis=1, keepdims=True), 1e-9)
```
これは、検索を行うクエリにも、検索対象のDBにも適応可能なテクニックです。
## MiniBatch KMeansによるembeddingの圧縮
total_fea.npyが大きすぎる場合、KMeansを用いてベクトルを小さくすることが可能です。
以下のコードで、embeddingの圧縮が可能です。n_clustersは圧縮したい大きさを指定し、batch_sizeは256 * CPUのコア数を指定することでCPUの並列化の恩恵を十分に得ることができます。
```python
import multiprocessing
from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(n_clusters=10000, batch_size=256 * multiprocessing.cpu_count(), init="random")
kmeans.fit(big_npy)
sample_npy = kmeans.cluster_centers_
```

docs/training_tips_en.md (new file, 52 lines)

@@ -0,0 +1,52 @@
Instructions and tips for RVC training
======================================
This TIPS explains how data training is done.
# Training flow
I will explain along the steps in the training tab of the GUI.
## step1
Set the experiment name here. You can also set here whether the model should take pitch into account.
Data for each experiment is placed in `/logs/experiment name/`.
## step2a
Loads and preprocesses audio.
### load audio
If you specify a folder with audio, the audio files in that folder will be read automatically.
For example, if you specify `C:\Users\hoge\voices`, then `C:\Users\hoge\voices\voice.mp3` will be loaded, but `C:\Users\hoge\voices\dir\voice.mp3` will not be loaded.
Since ffmpeg is used internally for reading audio, if the extension is supported by ffmpeg, it will be read automatically.
After converting to int16 with ffmpeg, the audio is converted to float32 and normalized to between -1 and 1.
### denoising
The audio is smoothed by scipy's filtfilt.
### Audio Split
First, the input audio is split by detecting silent sections that last longer than a certain period (max_sil_kept=5 seconds?). After splitting on silence, the audio is cut every 4 seconds with an overlap of 0.3 seconds. For the resulting chunks of at most 4 seconds, the volume is normalized and the wav files are written to `/logs/experiment name/0_gt_wavs`, then resampled to a 16k sampling rate and written as wav files to `/logs/experiment name/1_16k_wavs`.
## step2b
### Extract pitch
Pitch information (=f0) is extracted from the wav files using the method built into parselmouth or pyworld and saved in `/logs/experiment name/2a_f0`. The pitch information is then converted logarithmically to an integer between 1 and 255 and saved in `/logs/experiment name/2b-f0nsf`.
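For reference, a minimal sketch of such a logarithmic (mel-style) coarse mapping, patterned after the get_f0 conversion in gui.py elsewhere in this commit; the exact bounds and rounding used by the training scripts may differ:

```python
import numpy as np

def f0_to_coarse(f0_hz, f0_min=50.0, f0_max=1100.0):
    """Map f0 in Hz to integers 1..255 on a log (mel-like) scale."""
    f0_mel_min = 1127 * np.log(1 + f0_min / 700)
    f0_mel_max = 1127 * np.log(1 + f0_max / 700)
    f0_mel = 1127 * np.log(1 + f0_hz / 700)
    f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - f0_mel_min) * 254 / (f0_mel_max - f0_mel_min) + 1
    f0_mel[f0_mel <= 1] = 1        # unvoiced frames (f0 == 0) stay at 1
    f0_mel[f0_mel > 255] = 255
    return np.rint(f0_mel).astype(np.int32)

coarse = f0_to_coarse(np.array([0.0, 110.0, 440.0]))  # -> approximately [1, 23, 122]
```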
### Extract feature_print
Convert the wav file to embedding in advance using HuBERT. Read the wav file saved in `/logs/experiment name/1_16k_wavs`, convert the wav file to 256-dimensional features with HuBERT, and save in npy format in `/logs/experiment name/3_feature256`.
## step3
train the model.
### Glossary for Beginners
In deep learning, the data set is divided and the learning proceeds little by little. In one model update (step), batch_size data are retrieved and predictions and error corrections are performed. Doing this once for a dataset counts as one epoch.
Therefore, the learning time is the learning time per step x (the number of data in the dataset / batch size) x the number of epochs. In general, a larger batch size makes learning more stable and reduces (learning time per step ÷ batch size), but it uses more GPU memory. GPU RAM can be checked with the nvidia-smi command. Training can be completed in a shorter time by increasing the batch size as far as the machine in your execution environment allows.
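For example, with 1,000 audio segments in the dataset, a batch size of 8 and 20 epochs, one epoch is 1,000 / 8 = 125 steps, so training runs 125 x 20 = 2,500 steps in total (the numbers here are purely illustrative).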
### Specify pretrained model
RVC starts training the model from pretrained weights instead of from scratch, so it can be trained with a small dataset. By default it loads `rvc-location/pretrained/f0G40k.pth` and `rvc-location/pretrained/f0D40k.pth`. During training, model parameters are saved to `logs/experiment name/G_{}.pth` and `logs/experiment name/D_{}.pth` every save_every_epoch; by specifying one of these paths you can resume training, or start training from model weights learned in a different experiment.
### learning index
RVC saves the HuBERT feature values used during training, and during inference, searches for feature values that are similar to the feature values used during learning to perform inference. In order to perform this search at high speed, the index is learned in advance.
For index learning, we use the approximate nearest-neighbor search library faiss. The feature values in `/logs/experiment name/3_feature256` are read, the concatenated features are saved as `/logs/experiment name/total_fea.npy`, and the index trained from them is saved as `/logs/experiment name/add_XXX.index`.
### Button description
- Train model: After executing step2b, press this button to train the model.
- Train feature index: After training the model, perform index learning.
- One-click training: step2b, model training and feature index training all at once.

docs/training_tips_ja.md (new file, 53 lines)

@@ -0,0 +1,53 @@
RVCの訓練における説明、およびTIPS
===============================
本TIPSではどのようにデータの訓練が行われているかを説明します。
# 訓練の流れ
GUIの訓練タブのstepに沿って説明します。
## step1
実験名の設定を行います。また、モデルにピッチを考慮させるかもここで設定できます。
各実験のデータは`/logs/実験名/`に配置されます。
## step2a
音声の読み込みと前処理を行います。
### load audio
音声のあるフォルダを指定すると、そのフォルダ内にある音声ファイルを自動で読み込みます。
例えば`C:Users\hoge\voices`を指定した場合、`C:Users\hoge\voices\voice.mp3`は読み込まれますが、`C:Users\hoge\voices\dir\voice.mp3`は読み込まれません。
音声の読み込みには内部でffmpegを利用しているので、ffmpegで対応している拡張子であれば自動的に読み込まれます。
ffmpegでint16に変換した後、float32に変換し、-1 ~ 1の間に正規化されます。
### denoising
音声についてscipyのfiltfiltによる平滑化を行います。
### 音声の分割
入力した音声はまず、一定期間(max_sil_kept=5秒?)より長く無音が続く部分を検知して音声を分割します。無音で音声を分割した後は、0.3秒のoverlapを含む4秒ごとに音声を分割します。4秒以内に区切られた音声は、音量の正規化を行った後wavファイルを`/logs/実験名/0_gt_wavs`に、そこから16kのサンプリングレートに変換して`/logs/実験名/1_16k_wavs`にwavファイルで保存します。
## step2b
### ピッチの抽出
wavファイルからピッチ(音の高低)の情報を抽出します。parselmouthやpyworldに内蔵されている手法でピッチ情報(=f0)を抽出し、`/logs/実験名/2a_f0`に保存します。その後、ピッチ情報を対数で変換して1~255の整数に変換し、`/logs/実験名/2b-f0nsf`に保存します。
### feature_printの抽出
HuBERTを用いてwavファイルを事前にembeddingに変換します。`/logs/実験名/1_16k_wavs`に保存したwavファイルを読み込み、HuBERTでwavファイルを256次元の特徴量に変換し、npy形式で`/logs/実験名/3_feature256`に保存します。
## step3
モデルのトレーニングを行います。
### 初心者向け用語解説
深層学習ではデータセットを分割し、少しずつ学習を進めていきます。一回のモデルの更新(step)では、batch_size個のデータを取り出し予測と誤差の修正を行います。これをデータセットに対して一通り行うと一epochと数えます。
そのため、学習時間は 1step当たりの学習時間 x (データセット内のデータ数 ÷ バッチサイズ) x epoch数 かかります。一般にバッチサイズを大きくするほど学習は安定し、(1step当たりの学習時間÷バッチサイズ)は小さくなりますが、その分GPUのメモリを多く使用します。GPUのRAMはnvidia-smiコマンド等で確認できます。実行環境のマシンに合わせてバッチサイズをできるだけ大きくするとより短時間で学習が可能です。
### pretrained modelの指定
RVCではモデルの訓練を0からではなく、事前学習済みの重みから開始するため、少ないデータセットで学習を行えます。デフォルトでは`RVCのある場所/pretrained/f0G40k.pth``RVCのある場所/pretrained/f0D40k.pth`を読み込みます。学習時はsave_every_epochごとにモデルのパラメータが`logs/実験名/G_{}.pth``logs/実験名/D_{}.pth`に保存されますが、このパスを指定することで学習を再開したり、もしくは違う実験で学習したモデルの重みから学習を開始できます。
### indexの学習
RVCでは学習時に使われたHuBERTの特徴量を保存し、推論時は学習時の特徴量から近い特徴量を探してきて推論を行います。この検索を高速に行うために事前にindexの学習を行います。
indexの学習には近似近傍探索ライブラリのfaissを用います。`/logs/実験名/3_feature256`の特徴量を読み込み、全て結合させた特徴量を`/logs/実験名/total_fea.npy`として保存、それを用いて学習したindexを`/logs/実験名/add_XXX.index`として保存します。
### ボタンの説明
- モデルのトレーニング: step2bまでを実行した後、このボタンを押すとモデルの学習を行います。
- 特徴インデックスのトレーニング: モデルのトレーニング後、indexの学習を行います。
- ワンクリックトレーニング: step2bまでとモデルのトレーニング、特徴インデックスのトレーニングを一括で行います。


@@ -1,28 +1,34 @@
-from infer_pack.models_onnx import SynthesizerTrnMs256NSFsid
+from infer_pack.models_onnx_moess import SynthesizerTrnMs256NSFsidM
+from infer_pack.models_onnx import SynthesizerTrnMs256NSFsidO
 import torch
-person = "Shiroha/shiroha.pth"
-exported_path = "model.onnx"
-cpt = torch.load(person, map_location="cpu")
-cpt["config"][-3] = cpt["weight"]["emb_g.weight"].shape[0]  # n_spk
-print(*cpt["config"])
-net_g = SynthesizerTrnMs256NSFsid(*cpt["config"], is_half=False)
-net_g.load_state_dict(cpt["weight"], strict=False)
-test_phone = torch.rand(1, 200, 256)
-test_phone_lengths = torch.tensor([200]).long()
-test_pitch = torch.randint(size=(1, 200), low=5, high=255)
-test_pitchf = torch.rand(1, 200)
-test_ds = torch.LongTensor([0])
-test_rnd = torch.rand(1, 192, 200)
-input_names = ["phone", "phone_lengths", "pitch", "pitchf", "ds", "rnd"]
-output_names = [
-    "audio",
-]
-device = "cpu"
-torch.onnx.export(
+if __name__ == '__main__':
+    MoeVS = True  # 模型是否为MoeVoiceStudio原MoeSS使用
+    ModelPath = "Shiroha/shiroha.pth"  # 模型路径
+    ExportedPath = "model.onnx"  # 输出路径
+    hidden_channels = 256  # hidden_channels为768Vec做准备
+    cpt = torch.load(ModelPath, map_location="cpu")
+    cpt["config"][-3] = cpt["weight"]["emb_g.weight"].shape[0]  # n_spk
+    print(*cpt["config"])
+    test_phone = torch.rand(1, 200, hidden_channels)  # hidden unit
+    test_phone_lengths = torch.tensor([200]).long()  # hidden unit 长度(貌似没啥用)
+    test_pitch = torch.randint(size=(1, 200), low=5, high=255)  # 基频(单位赫兹)
+    test_pitchf = torch.rand(1, 200)  # nsf基频
+    test_ds = torch.LongTensor([0])  # 说话人ID
+    test_rnd = torch.rand(1, 192, 200)  # 噪声(加入随机因子)
+    device = "cpu"  # 导出时设备(不影响使用模型)
+    if MoeVS:
+        net_g = SynthesizerTrnMs256NSFsidM(*cpt["config"], is_half=False)  # fp32导出C++要支持fp16必须手动将内存重新排列所以暂时不用fp16
+        net_g.load_state_dict(cpt["weight"], strict=False)
+        input_names = ["phone", "phone_lengths", "pitch", "pitchf", "ds", "rnd"]
+        output_names = [
+            "audio",
+        ]
+        torch.onnx.export(
             net_g,
             (
                 test_phone.to(device),
@@ -32,7 +38,7 @@ torch.onnx.export(
                 test_ds.to(device),
                 test_rnd.to(device),
             ),
-    exported_path,
+            ExportedPath,
             dynamic_axes={
                 "phone": [1],
                 "pitch": [1],
@@ -44,4 +50,32 @@ torch.onnx.export(
             verbose=False,
             input_names=input_names,
             output_names=output_names,
         )
+    else:
+        net_g = SynthesizerTrnMs256NSFsidO(*cpt["config"], is_half=False)  # fp32导出C++要支持fp16必须手动将内存重新排列所以暂时不用fp16
+        net_g.load_state_dict(cpt["weight"], strict=False)
+        input_names = ["phone", "phone_lengths", "pitch", "pitchf", "ds"]
+        output_names = [
+            "audio",
+        ]
+        torch.onnx.export(
+            net_g,
+            (
+                test_phone.to(device),
+                test_phone_lengths.to(device),
+                test_pitch.to(device),
+                test_pitchf.to(device),
+                test_ds.to(device),
+            ),
+            ExportedPath,
+            dynamic_axes={
+                "phone": [1],
+                "pitch": [1],
+                "pitchf": [1],
+            },
+            do_constant_folding=False,
+            opset_version=16,
+            verbose=False,
+            input_names=input_names,
+            output_names=output_names,
+        )

export_onnx_old.py (new file, 47 lines)

@@ -0,0 +1,47 @@
from infer_pack.models_onnx_moess import SynthesizerTrnMs256NSFsidM
import torch
person = "Shiroha/shiroha.pth"
exported_path = "model.onnx"
cpt = torch.load(person, map_location="cpu")
cpt["config"][-3] = cpt["weight"]["emb_g.weight"].shape[0] # n_spk
print(*cpt["config"])
net_g = SynthesizerTrnMs256NSFsidM(*cpt["config"], is_half=False)
net_g.load_state_dict(cpt["weight"], strict=False)
test_phone = torch.rand(1, 200, 256)
test_phone_lengths = torch.tensor([200]).long()
test_pitch = torch.randint(size=(1, 200), low=5, high=255)
test_pitchf = torch.rand(1, 200)
test_ds = torch.LongTensor([0])
test_rnd = torch.rand(1, 192, 200)
input_names = ["phone", "phone_lengths", "pitch", "pitchf", "ds", "rnd"]
output_names = [
    "audio",
]
device = "cpu"
torch.onnx.export(
    net_g,
    (
        test_phone.to(device),
        test_phone_lengths.to(device),
        test_pitch.to(device),
        test_pitchf.to(device),
        test_ds.to(device),
        test_rnd.to(device),
    ),
    exported_path,
    dynamic_axes={
        "phone": [1],
        "pitch": [1],
        "pitchf": [1],
        "rnd": [2],
    },
    do_constant_folding=False,
    opset_version=16,
    verbose=False,
    input_names=input_names,
    output_names=output_names,
)


@@ -33,7 +33,9 @@ class FeatureInput(object):
         self.f0_mel_max = 1127 * np.log(1 + self.f0_max / 700)

     def compute_f0(self, path, f0_method):
-        x, sr = librosa.load(path, self.fs)
+        # default resample type of librosa.resample is "soxr_hq".
+        # Quality: soxr_vhq > soxr_hq
+        x, sr = librosa.load(path, self.fs)  # , res_type='soxr_vhq'
         p_len = x.shape[0] // self.hop
         assert sr == self.fs
         if f0_method == "pm":


@@ -28,3 +28,4 @@ process("gui.py")
 # Save as a JSON file
 with open("./i18n/zh_CN.json", "w", encoding="utf-8") as f:
     json.dump(data, f, ensure_ascii=False, indent=4)
+    f.write("\n")

go-realtime-gui.bat (new file, 2 lines)

@@ -0,0 +1,2 @@
runtime\python.exe gui.py
pause

gui.py (115 lines changed)

@@ -1,18 +1,22 @@
+import os, sys
+
+now_dir = os.getcwd()
+sys.path.append(now_dir)
 import PySimpleGUI as sg
 import sounddevice as sd
 import noisereduce as nr
 import numpy as np
 from fairseq import checkpoint_utils
-import librosa, torch, parselmouth, faiss, time, threading
+import librosa, torch, pyworld, faiss, time, threading
 import torch.nn.functional as F
 import torchaudio.transforms as tat
+import scipy.signal as signal

 # import matplotlib.pyplot as plt
 from infer_pack.models import SynthesizerTrnMs256NSFsid, SynthesizerTrnMs256NSFsid_nono
 from i18n import I18nAuto

 i18n = I18nAuto()
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
@@ -23,16 +27,20 @@ class RVC:
         """
         初始化
         """
+        try:
             self.f0_up_key = key
             self.time_step = 160 / 16000 * 1000
             self.f0_min = 50
             self.f0_max = 1100
             self.f0_mel_min = 1127 * np.log(1 + self.f0_min / 700)
             self.f0_mel_max = 1127 * np.log(1 + self.f0_max / 700)
+            self.sr = 16000
+            self.window = 160
+            if index_rate != 0:
                 self.index = faiss.read_index(index_path)
-        self.index_rate = index_rate
-        """NOT YET USED"""
                 self.big_npy = np.load(npy_path)
+                print("index search enabled")
+            self.index_rate = index_rate
             model_path = hubert_path
             print("load model(s) from {}".format(model_path))
             models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task(
@@ -46,8 +54,8 @@ class RVC:
             cpt = torch.load(pth_path, map_location="cpu")
             tgt_sr = cpt["config"][-1]
             cpt["config"][-3] = cpt["weight"]["emb_g.weight"].shape[0]  # n_spk
-        if_f0 = cpt.get("f0", 1)
-        if if_f0 == 1:
+            self.if_f0 = cpt.get("f0", 1)
+            if self.if_f0 == 1:
                 self.net_g = SynthesizerTrnMs256NSFsid(*cpt["config"], is_half=True)
             else:
                 self.net_g = SynthesizerTrnMs256NSFsid_nono(*cpt["config"])
@@ -55,38 +63,46 @@ class RVC:
             print(self.net_g.load_state_dict(cpt["weight"], strict=False))
             self.net_g.eval().to(device)
             self.net_g.half()
+        except Exception as e:
+            print(e)

-    def get_f0_coarse(self, f0):
+    def get_f0(self, x, f0_up_key, inp_f0=None):
+        x_pad = 1
+        f0_min = 50
+        f0_max = 1100
+        f0_mel_min = 1127 * np.log(1 + f0_min / 700)
+        f0_mel_max = 1127 * np.log(1 + f0_max / 700)
+        f0, t = pyworld.harvest(
+            x.astype(np.double),
+            fs=self.sr,
+            f0_ceil=f0_max,
+            f0_floor=f0_min,
+            frame_period=10,
+        )
+        f0 = pyworld.stonemask(x.astype(np.double), f0, t, self.sr)
+        f0 = signal.medfilt(f0, 3)
+        f0 *= pow(2, f0_up_key / 12)
+        # with open("test.txt","w")as f:f.write("\n".join([str(i)for i in f0.tolist()]))
+        tf0 = self.sr // self.window  # 每秒f0点数
+        if inp_f0 is not None:
+            delta_t = np.round(
+                (inp_f0[:, 0].max() - inp_f0[:, 0].min()) * tf0 + 1
+            ).astype("int16")
+            replace_f0 = np.interp(
+                list(range(delta_t)), inp_f0[:, 0] * 100, inp_f0[:, 1]
+            )
+            shape = f0[x_pad * tf0 : x_pad * tf0 + len(replace_f0)].shape[0]
+            f0[x_pad * tf0 : x_pad * tf0 + len(replace_f0)] = replace_f0[:shape]
+        # with open("test_opt.txt","w")as f:f.write("\n".join([str(i)for i in f0.tolist()]))
+        f0bak = f0.copy()
         f0_mel = 1127 * np.log(1 + f0 / 700)
-        f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - self.f0_mel_min) * 254 / (
-            self.f0_mel_max - self.f0_mel_min
+        f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - f0_mel_min) * 254 / (
+            f0_mel_max - f0_mel_min
         ) + 1
         f0_mel[f0_mel <= 1] = 1
         f0_mel[f0_mel > 255] = 255
-        # f0_mel[f0_mel > 188] = 188
         f0_coarse = np.rint(f0_mel).astype(np.int)
-        return f0_coarse
-
-    def get_f0(self, x, p_len, f0_up_key=0):
-        f0 = (
-            parselmouth.Sound(x, 16000)
-            .to_pitch_ac(
-                time_step=self.time_step / 1000,
-                voicing_threshold=0.6,
-                pitch_floor=self.f0_min,
-                pitch_ceiling=self.f0_max,
-            )
-            .selected_array["frequency"]
-        )
-        pad_size = (p_len - len(f0) + 1) // 2
-        if pad_size > 0 or p_len - len(f0) - pad_size > 0:
-            f0 = np.pad(f0, [[pad_size, p_len - len(f0) - pad_size]], mode="constant")
-        f0 *= pow(2, f0_up_key / 12)
-        # f0=suofang(f0)
-        f0bak = f0.copy()
-        f0_coarse = self.get_f0_coarse(f0)
-        return f0_coarse, f0bak
+        return f0_coarse, f0bak  # 1-0

     def infer(self, feats: torch.Tensor) -> np.ndarray:
         """
@@ -107,11 +123,7 @@ class RVC:
         feats = self.model.final_proj(logits[0])

         ####索引优化
-        if (
-            isinstance(self.index, type(None)) == False
-            and isinstance(self.big_npy, type(None)) == False
-            and self.index_rate != 0
-        ):
+        if hasattr(self, "index") and hasattr(self, "big_npy") and self.index_rate != 0:
             npy = feats[0].cpu().numpy().astype("float32")
             _, I = self.index.search(npy, 1)
             npy = self.big_npy[I.squeeze()].astype("float16")
@@ -119,30 +131,42 @@ class RVC:
                 torch.from_numpy(npy).unsqueeze(0).to(device) * self.index_rate
                 + (1 - self.index_rate) * feats
             )
+        else:
+            print("index search FAIL or disabled")
         feats = F.interpolate(feats.permute(0, 2, 1), scale_factor=2).permute(0, 2, 1)
         torch.cuda.synchronize()
-        # p_len = min(feats.shape[1],10000,pitch.shape[0])#太大了爆显存
-        p_len = min(feats.shape[1], 12000)  #
         print(feats.shape)
-        pitch, pitchf = self.get_f0(audio, p_len, self.f0_up_key)
-        p_len = min(feats.shape[1], 12000, pitch.shape[0])  # 太大了爆显存
+        if self.if_f0 == 1:
+            pitch, pitchf = self.get_f0(audio, self.f0_up_key)
+            p_len = min(feats.shape[1], 13000, pitch.shape[0])  # 太大了爆显存
+        else:
+            pitch, pitchf = None, None
+            p_len = min(feats.shape[1], 13000)  # 太大了爆显存
         torch.cuda.synchronize()
         # print(feats.shape,pitch.shape)
         feats = feats[:, :p_len, :]
+        if self.if_f0 == 1:
             pitch = pitch[:p_len]
             pitchf = pitchf[:p_len]
-        p_len = torch.LongTensor([p_len]).to(device)
             pitch = torch.LongTensor(pitch).unsqueeze(0).to(device)
             pitchf = torch.FloatTensor(pitchf).unsqueeze(0).to(device)
+        p_len = torch.LongTensor([p_len]).to(device)
         ii = 0  # sid
         sid = torch.LongTensor([ii]).to(device)
         with torch.no_grad():
+            if self.if_f0 == 1:
                 infered_audio = (
                     self.net_g.infer(feats, p_len, pitch, pitchf, sid)[0][0, 0]
                     .data.cpu()
                     .float()
-            )  # nsf
+                )
+            else:
+                infered_audio = (
+                    self.net_g.infer(feats, p_len, sid)[0][0, 0]
+                    .data.cpu()
+                    .float()
+                )
         torch.cuda.synchronize()
         return infered_audio
@@ -363,7 +387,7 @@ class GUI:
                 self.config.pth_path,
                 self.config.index_path,
                 self.config.npy_path,
-                self.config.index_rate,
+                self.config.index_rate
             )
             self.input_wav: np.ndarray = np.zeros(
                 self.extra_frame
@@ -485,8 +509,9 @@ class GUI:
             else:
                 outdata[:] = self.output_wav[:].repeat(2, 1).t().cpu().numpy()
             total_time = time.perf_counter() - start_time
+            print("infer time:" + str(total_time))
             self.window["infer_time"].update(int(total_time * 1000))
-            print("infer time:" + str(total_time))

     def get_devices(self, update: bool = True):
         """获取设备列表"""

@@ -11,10 +11,8 @@ def load_language_list(language):

 class I18nAuto:
     def __init__(self, language=None):
-        if language is None:
-            language = "auto"
-        if language == "auto":
-            language = locale.getdefaultlocale()[0]
+        if language in ['auto', None]:
+            language = locale.getdefaultlocale()[0]  # getlocale can't identify the system's language ((None, None))
         if not os.path.exists(f"./i18n/{language}.json"):
             language = "en_US"
         self.language = language


@@ -27,8 +27,8 @@
     "人声伴奏分离批量处理, 使用UVR5模型. <br>不带和声用HP2, 带和声且提取的人声不需要和声用HP5<br>合格的文件夹路径格式举例: E:\\codes\\py39\\vits_vc_gpu\\白鹭霜华测试样例(去文件管理器地址栏拷就行了)": "Batch processing of vocal accompaniment separation, using UVR5 model. <br>Without harmony, use HP2, with harmony and extracted vocals do not need harmony, use HP5<br>Example of qualified folder path format: E:\\ codes\\py39\\vits_vc_gpu\\Egret Shuanghua test sample (just go to the address bar of the file manager and copy it)",
     "输入待处理音频文件夹路径": "Input audio folder path",
     "模型": "Model",
-    "指定输出人声文件夹": "Specify output vocal folder",
+    "指定输出人声文件夹": "Specify vocals output folder",
-    "指定输出乐器文件夹": "Specify output instrumental folder",
+    "指定输出乐器文件夹": "Specify instrumentals output folder",
     "训练": "Train",
     "step1: 填写实验配置. 实验数据放在logs下, 每个实验一个文件夹, 需手工输入实验名路径, 内含实验配置, 日志, 训练得到的模型文件. ": "step1: Fill in the experimental configuration. The experimental data is placed under logs, and each experiment has a folder. You need to manually enter the experimental name path, which contains the experimental configuration, logs, and model files obtained from training. ",
     "输入实验名": "Input experiment name",
@@ -41,7 +41,7 @@
     "step2b: 使用CPU提取音高(如果模型带音高), 使用GPU提取特征(选择卡号)": "step2b: Use CPU to extract pitch (if the model has pitch), use GPU to extract features (select card number)",
     "以-分隔输入使用的卡号, 例如 0-1-2 使用卡0和卡1和卡2": "Enter the card numbers used separated by -, for example 0-1-2 use card 0 and card 1 and card 2",
     "显卡信息": "GPU information",
-    "提取音高使用的CPU进程数": "Number of CPU processes used for pitch extraction",
+    "提取音高使用的CPU进程数": "Number of CPU threads to use for pitch extraction",
     "选择音高提取算法:输入歌声可用pm提速,高质量语音但CPU差可用dio提速,harvest质量更好但慢": "Select pitch extraction algorithm: Use 'pm' for faster processing of singing voice, 'dio' for high-quality speech but slower processing, and 'harvest' for the best quality but slowest processing.",
     "特征提取": "Feature extraction",
     "step3: 填写训练设置, 开始训练模型和索引": "step3: Fill in the training settings, start training the model and index",
@@ -58,7 +58,7 @@
     "模型融合, 可用于测试音色融合": "Model Fusion, which can be used to test sound fusion",
     "A模型路径": "A model path.",
     "B模型路径": "B model path.",
-    "A模型权重": "A model weight.",
+    "A模型权重": "A model weight for model A.",
     "模型是否带音高指导": "Whether the model has pitch guidance.",
     "要置入的模型信息": "Model information to be placed.",
     "保存的模型名不带后缀": "Saved model name without extension.",
@@ -74,7 +74,7 @@
     "查看": "View",
     "模型提取(输入logs文件夹下大文件模型路径),适用于训一半不想训了模型没有自动提取保存小文件模型,或者想测试中间模型的情况": "Model extraction (enter the path of the large file model under the logs folder), which is suitable for half of the training and does not want to train the model without automatically extracting and saving the small file model, or if you want to test the intermediate model",
     "保存名": "Save Name",
-    "模型是否带音高指导,1是0否": "Whether the model comes with pitch guidance, 1 for yes, 0 for no",
+    "模型是否带音高指导,1是0否": "Whether the model has pitch guidance, 1 for yes, 0 for no",
     "提取": "Extract",
     "招募音高曲线前端编辑器": "Recruit front-end editors for pitch curves",
     "加开发群联系我xxxxx": "Add development group to contact me xxxxx",


@@ -11,17 +11,17 @@
     "选择音高提取算法,输入歌声可用pm提速,harvest低音好但巨慢无比": "ピッチ抽出アルゴリズムを選択してください。歌声の場合は、pmを使用して速度を上げることができます。低音が重要な場合は、harvestを使用できますが、非常に遅くなります。",
     "特征检索库文件路径": "特徴量検索データベースのファイルパス",
     "特征文件路径": "特徴量ファイルのパス",
-    "F0曲线文件, 可选, 一行一个音高, 代替默认F0及升降调": "F0曲线文件, 可选, 一行一个音高, 代替默认F0及升降调",
+    "F0曲线文件, 可选, 一行一个音高, 代替默认F0及升降调": "F0(最低共振周波数)カーブファイル(オプション、1行に1ピッチ、デフォルトのF0(最低共振周波数)とエレベーションを置き換えます。)",
     "转换": "変換",
     "输出信息": "出力情報",
     "输出音频(右下角三个点,点了可以下载)": "出力音声(右下の三点をクリックしてダウンロードできます)",
-    "批量转换, 输入待转换音频文件夹, 或上传多个音频文件, 在指定文件夹(默认opt)下输出转换的音频. ": "批量转换, 输入待转换音频文件夹, 或上传多个音频文件, 在指定文件夹(默认opt)下输出转换的音频. ",
+    "批量转换, 输入待转换音频文件夹, 或上传多个音频文件, 在指定文件夹(默认opt)下输出转换的音频. ": "一括変換、変換する音声フォルダを入力、または複数の音声ファイルをアップロードし、指定したフォルダ(デフォルトのopt)に変換した音声を出力します。",
     "指定输出文件夹": "出力フォルダを指定してください",
     "检索特征占比": "検索特徴率",
     "输入待处理音频文件夹路径(去文件管理器地址栏拷就行了)": "処理対象音声フォルダーのパスを入力してください(ファイルマネージャのアドレスバーからコピーしてください)",
     "也可批量输入音频文件, 二选一, 优先读文件夹": "複数の音声ファイルを一括で入力することもできますが、フォルダーを優先して読み込みます",
     "伴奏人声分离": "伴奏とボーカルの分離",
-    "人声伴奏分离批量处理, 使用UVR5模型. <br>不带和声用HP2, 带和声且提取的人声不需要和声用HP5<br>合格的文件夹路径格式举例: E:\\codes\\py39\\vits_vc_gpu\\白鹭霜华测试样例(去文件管理器地址栏拷就行了)": "人声伴奏分离批量处理, 使用UVR5模型. <br>不带和声用HP2, 带和声且提取的人声不需要和声用HP5<br>合格的文件夹路径格式举例: E:\\codes\\py39\\vits_vc_gpu\\白鹭霜华测试样例(去文件管理器地址栏拷就行了)",
+    "人声伴奏分离批量处理, 使用UVR5模型. <br>不带和声用HP2, 带和声且提取的人声不需要和声用HP5<br>合格的文件夹路径格式举例: E:\\codes\\py39\\vits_vc_gpu\\白鹭霜华测试样例(去文件管理器地址栏拷就行了)": "UVR5モデルを使用した、声帯分離バッチ処理です。<br>HP2はハーモニー、ハーモニーのあるボーカルとハーモニーのないボーカルを抽出したものはHP5を使ってください <br>フォルダーパスの形式例: E:\\codes\\py39\\vits_vc_gpu\\白鹭霜华测试样例(エクスプローラーのアドレスバーからコピーするだけです)",
     "输入待处理音频文件夹路径": "処理するオーディオファイルのフォルダパスを入力してください",
     "模型": "モデル",
     "指定输出人声文件夹": "人の声を出力するフォルダを指定してください",
@@ -60,7 +60,7 @@
     "要置入的模型信息": "挿入するモデル情報",
     "保存的模型名不带后缀": "拡張子のない保存するモデル名",
     "融合": "フュージョン",
-    "修改模型信息(仅支持weights文件夹下提取的小模型文件)": "修改模型信息(仅支持weights文件夹下提取的小模型文件)",
+    "修改模型信息(仅支持weights文件夹下提取的小模型文件)": "モデル情報の修正(weightsフォルダから抽出された小さなモデルファイルのみ対応)",
     "模型路径": "モデルパス",
     "要改的模型信息": "変更するモデル情報",
     "保存的文件名, 默认空为和源文件同名": "保存するファイル名、デフォルトでは空欄で元のファイル名と同じ名前になります",
@@ -68,18 +68,18 @@
     "查看模型信息(仅支持weights文件夹下提取的小模型文件)": "モデル情報を表示する(小さいモデルファイルはweightsフォルダーからのみサポートされています)",
     "查看": "表示",
     "模型提取(输入logs文件夹下大文件模型路径),适用于训一半不想训了模型没有自动提取保存小文件模型,或者想测试中间模型的情况": "モデル抽出(ログフォルダー内の大きなファイルのモデルパスを入力)、モデルを半分までトレーニングし、自動的に小さいファイルモデルを保存しなかったり、中間モデルをテストしたい場合に適用されます。",
-    "保存名": "保存するファイル名",
+    "保存名": "保存ファイル名",
     "模型是否带音高指导,1是0否": "モデルに音高ガイドを付けるかどうか、1は付ける、0は付けない",
     "提取": "抽出",
     "招募音高曲线前端编辑器": "音高曲線フロントエンドエディターを募集",
     "加开发群联系我xxxxx": "開発グループに参加して私に連絡してくださいxxxxx",
     "点击查看交流、问题反馈群号": "クリックして交流、問題フィードバックグループ番号を表示",
     "xxxxx": "xxxxx",
-    "加载模型": "モデルをロードする",
+    "加载模型": "モデルをロード",
     "Hubert模型": "Hubert模型",
-    "选择.pth文件": ".pthファイルを選択する",
+    "选择.pth文件": ".pthファイルを選択",
-    "选择.index文件": ".indexファイルを選択する",
+    "选择.index文件": ".indexファイルを選択",
-    "选择.npy文件": ".npyファイルを選択する",
+    "选择.npy文件": ".npyファイルを選択",
     "输入设备": "入力デバイス",
     "输出设备": "出力デバイス",
     "音频设备(请使用同种类驱动)": "オーディオデバイス(同じ種類のドライバーを使用してください)",
@@ -93,7 +93,12 @@
     "输入降噪": "入力ノイズの低減",
     "输出降噪": "出力ノイズの低減",
     "性能设置": "パフォーマンス設定",
-    "开始音频转换": "音声変換を開始する",
+    "开始音频转换": "音声変換を開始",
-    "停止音频转换": "音声変換を停止する",
+    "停止音频转换": "音声変換を停止",
-    "推理时间(ms):": "推論時間(ms):"
+    "推理时间(ms):": "推論時間(ms):",
+    "Onnx导出": "Onnx",
+    "RVC模型路径": "RVCルパス",
+    "Onnx输出路径": "Onnx出力パス",
+    "导出Onnx模型": "Onnxに変換",
+    "MoeVS模型": "MoeSS"
 }

View File

@ -7,7 +7,9 @@ standard_file = "zh_CN.json"
# Find all JSON files in the directory # Find all JSON files in the directory
dir_path = "./" dir_path = "./"
languages = [f for f in os.listdir(dir_path) if f.endswith(".json") and f != standard_file] languages = [
f for f in os.listdir(dir_path) if f.endswith(".json") and f != standard_file
]
# Load the standard file # Load the standard file
with open(standard_file, "r", encoding="utf-8") as f: with open(standard_file, "r", encoding="utf-8") as f:
@ -40,3 +42,4 @@ for lang_file in languages:
# Save the updated language file # Save the updated language file
with open(lang_file, "w", encoding="utf-8") as f: with open(lang_file, "w", encoding="utf-8") as f:
json.dump(lang_data, f, ensure_ascii=False, indent=4) json.dump(lang_data, f, ensure_ascii=False, indent=4)
f.write("\n")

View File

@ -95,5 +95,10 @@
"性能设置": "性能设置", "性能设置": "性能设置",
"开始音频转换": "开始音频转换", "开始音频转换": "开始音频转换",
"停止音频转换": "停止音频转换", "停止音频转换": "停止音频转换",
"推理时间(ms):": "推理时间(ms):" "推理时间(ms):": "推理时间(ms):",
"Onnx导出": "Onnx导出",
"RVC模型路径": "RVC模型路径",
"Onnx输出路径": "Onnx输出路径",
"导出Onnx模型": "导出Onnx模型",
"MoeVS模型": "MoeSS模型"
} }

View File

@ -95,5 +95,10 @@
"性能设置": "效能設定", "性能设置": "效能設定",
"开始音频转换": "開始音訊轉換", "开始音频转换": "開始音訊轉換",
"停止音频转换": "停止音訊轉換", "停止音频转换": "停止音訊轉換",
"推理时间(ms):": "推理時間(ms):" "推理时间(ms):": "推理時間(ms):",
"Onnx导出": "Onnx导出",
"RVC模型路径": "RVC模型路径",
"Onnx输出路径": "Onnx输出路径",
"导出Onnx模型": "导出Onnx模型",
"MoeVS模型": "MoeSS模型"
} }

View File

@ -95,5 +95,10 @@
"性能设置": "效能設定", "性能设置": "效能設定",
"开始音频转换": "開始音訊轉換", "开始音频转换": "開始音訊轉換",
"停止音频转换": "停止音訊轉換", "停止音频转换": "停止音訊轉換",
"推理时间(ms):": "推理時間(ms):" "推理时间(ms):": "推理時間(ms):",
"Onnx导出": "Onnx导出",
"RVC模型路径": "RVC模型路径",
"Onnx输出路径": "Onnx输出路径",
"导出Onnx模型": "导出Onnx模型",
"MoeVS模型": "MoeSS模型"
} }

View File

@ -95,5 +95,10 @@
"性能设置": "效能設定", "性能设置": "效能設定",
"开始音频转换": "開始音訊轉換", "开始音频转换": "開始音訊轉換",
"停止音频转换": "停止音訊轉換", "停止音频转换": "停止音訊轉換",
"推理时间(ms):": "推理時間(ms):" "推理时间(ms):": "推理時間(ms):",
"Onnx导出": "Onnx导出",
"RVC模型路径": "RVC模型路径",
"Onnx输出路径": "Onnx输出路径",
"导出Onnx模型": "导出Onnx模型",
"MoeVS模型": "MoeSS模型"
} }

View File

@ -36,6 +36,10 @@ else:
or "20" in gpu_name or "20" in gpu_name
or "30" in gpu_name or "30" in gpu_name
or "40" in gpu_name or "40" in gpu_name
or "A2" in gpu_name.upper()
or "A3" in gpu_name.upper()
or "A4" in gpu_name.upper()
or "P4" in gpu_name.upper()
or "A50" in gpu_name.upper() or "A50" in gpu_name.upper()
or "70" in gpu_name or "70" in gpu_name
or "80" in gpu_name or "80" in gpu_name
@ -115,6 +119,7 @@ for name in os.listdir(weight_uvr5_root):
uvr5_names.append(name.replace(".pth", "")) uvr5_names.append(name.replace(".pth", ""))
def vc_single( def vc_single(
sid, sid,
input_audio, input_audio,
@ -135,6 +140,17 @@ def vc_single(
if hubert_model == None: if hubert_model == None:
load_hubert() load_hubert()
if_f0 = cpt.get("f0", 1) if_f0 = cpt.get("f0", 1)
file_index = (
file_index.strip(" ")
.strip('"')
.strip("\n")
.strip('"')
.strip(" ")
.replace("trained", "added")
) # prevent a common user mistake: automatically replace "trained" with "added" in the index path
file_big_npy = (
file_big_npy.strip(" ").strip('"').strip("\n").strip('"').strip(" ")
)
audio_opt = vc.pipeline( audio_opt = vc.pipeline(
hubert_model, hubert_model,
net_g, net_g,
@ -870,6 +886,83 @@ def change_info_(ckpt_path):
return {"__type__": "update"}, {"__type__": "update"} return {"__type__": "update"}, {"__type__": "update"}
from infer_pack.models_onnx_moess import SynthesizerTrnMs256NSFsidM
from infer_pack.models_onnx import SynthesizerTrnMs256NSFsidO
def export_onnx(ModelPath, ExportedPath, MoeVS=True):
hidden_channels = 256 # hidden unit dimension; 768 is reserved for 768-dim Vec features
cpt = torch.load(ModelPath, map_location="cpu")
cpt["config"][-3] = cpt["weight"]["emb_g.weight"].shape[0] # n_spk
print(*cpt["config"])
test_phone = torch.rand(1, 200, hidden_channels) # hidden unit
test_phone_lengths = torch.tensor([200]).long() # hidden unit length (apparently unused)
test_pitch = torch.randint(size=(1, 200), low=5, high=255) # fundamental frequency (in Hz)
test_pitchf = torch.rand(1, 200) # NSF f0 curve
test_ds = torch.LongTensor([0]) # speaker id
test_rnd = torch.rand(1, 192, 200) # noise (adds a random factor)
device = "cpu" # device used for export (does not affect how the model is used later)
if MoeVS:
net_g = SynthesizerTrnMs256NSFsidM(*cpt["config"], is_half=False) # export in fp32; fp16 would require manual memory re-layout for C++ support, so it is skipped for now
net_g.load_state_dict(cpt["weight"], strict=False)
input_names = ["phone", "phone_lengths", "pitch", "pitchf", "ds", "rnd"]
output_names = [
"audio",
]
torch.onnx.export(
net_g,
(
test_phone.to(device),
test_phone_lengths.to(device),
test_pitch.to(device),
test_pitchf.to(device),
test_ds.to(device),
test_rnd.to(device),
),
ExportedPath,
dynamic_axes={
"phone": [1],
"pitch": [1],
"pitchf": [1],
"rnd": [2],
},
do_constant_folding=False,
opset_version=16,
verbose=False,
input_names=input_names,
output_names=output_names,
)
else:
net_g = SynthesizerTrnMs256NSFsidO(*cpt["config"], is_half=False) # export in fp32; fp16 would require manual memory re-layout for C++ support, so it is skipped for now
net_g.load_state_dict(cpt["weight"], strict=False)
input_names = ["phone", "phone_lengths", "pitch", "pitchf", "ds"]
output_names = [
"audio",
]
torch.onnx.export(
net_g,
(
test_phone.to(device),
test_phone_lengths.to(device),
test_pitch.to(device),
test_pitchf.to(device),
test_ds.to(device),
),
ExportedPath,
dynamic_axes={
"phone": [1],
"pitch": [1],
"pitchf": [1],
},
do_constant_folding=False,
opset_version=16,
verbose=False,
input_names=input_names,
output_names=output_names,
)
return "Finished"
with gr.Blocks() as app: with gr.Blocks() as app:
gr.Markdown( gr.Markdown(
value=i18n( value=i18n(
@ -932,7 +1025,7 @@ with gr.Blocks() as app:
minimum=0, minimum=0,
maximum=1, maximum=1,
label="检索特征占比", label="检索特征占比",
value=1, value=0.6,
interactive=True, interactive=True,
) )
f0_file = gr.File(label=i18n("F0曲线文件, 可选, 一行一个音高, 代替默认F0及升降调")) f0_file = gr.File(label=i18n("F0曲线文件, 可选, 一行一个音高, 代替默认F0及升降调"))
@ -1346,6 +1439,18 @@ with gr.Blocks() as app:
info7, info7,
) )
with gr.TabItem(i18n("Onnx导出")):
with gr.Row():
ckpt_dir = gr.Textbox(label=i18n("RVC模型路径"), value="", interactive=True)
with gr.Row():
onnx_dir = gr.Textbox(label=i18n("Onnx输出路径"), value="", interactive=True)
with gr.Row():
moevs = gr.Checkbox(label=i18n("MoeVS模型"), value=True)
infoOnnx = gr.Label(label="Null")
with gr.Row():
butOnnx = gr.Button(i18n("导出Onnx模型"), variant="primary")
butOnnx.click(export_onnx, [ckpt_dir, onnx_dir, moevs], infoOnnx)
# with gr.TabItem(i18n("招募音高曲线前端编辑器")): # with gr.TabItem(i18n("招募音高曲线前端编辑器")):
# gr.Markdown(value=i18n("加开发群联系我xxxxx")) # gr.Markdown(value=i18n("加开发群联系我xxxxx"))
# with gr.TabItem(i18n("点击查看交流、问题反馈群号")): # with gr.TabItem(i18n("点击查看交流、问题反馈群号")):

View File

@ -527,7 +527,7 @@ sr2sr = {
} }
class SynthesizerTrnMs256NSFsid(nn.Module): class SynthesizerTrnMs256NSFsidO(nn.Module):
def __init__( def __init__(
self, self,
spec_channels, spec_channels,
@ -612,104 +612,15 @@ class SynthesizerTrnMs256NSFsid(nn.Module):
self.flow.remove_weight_norm() self.flow.remove_weight_norm()
self.enc_q.remove_weight_norm() self.enc_q.remove_weight_norm()
def forward(self, phone, phone_lengths, pitch, nsff0, sid, rnd, max_len=None): def forward(self, phone, phone_lengths, pitch, nsff0, sid, max_len=None):
g = self.emb_g(sid).unsqueeze(-1) g = self.emb_g(sid).unsqueeze(-1)
m_p, logs_p, x_mask = self.enc_p(phone, pitch, phone_lengths) m_p, logs_p, x_mask = self.enc_p(phone, pitch, phone_lengths)
z_p = (m_p + torch.exp(logs_p) * rnd) * x_mask z_p = (m_p + torch.exp(logs_p) * torch.randn_like(m_p) * 0.66666) * x_mask
z = self.flow(z_p, x_mask, g=g, reverse=True) z = self.flow(z_p, x_mask, g=g, reverse=True)
o = self.dec((z * x_mask)[:, :, :max_len], nsff0, g=g) o = self.dec((z * x_mask)[:, :, :max_len], nsff0, g=g)
return o return o
class SynthesizerTrnMs256NSFsid_sim(nn.Module):
"""
Synthesizer for Training
"""
def __init__(
self,
spec_channels,
segment_size,
inter_channels,
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout,
resblock,
resblock_kernel_sizes,
resblock_dilation_sizes,
upsample_rates,
upsample_initial_channel,
upsample_kernel_sizes,
spk_embed_dim,
# hop_length,
gin_channels=0,
use_sdp=True,
**kwargs
):
super().__init__()
self.spec_channels = spec_channels
self.inter_channels = inter_channels
self.hidden_channels = hidden_channels
self.filter_channels = filter_channels
self.n_heads = n_heads
self.n_layers = n_layers
self.kernel_size = kernel_size
self.p_dropout = p_dropout
self.resblock = resblock
self.resblock_kernel_sizes = resblock_kernel_sizes
self.resblock_dilation_sizes = resblock_dilation_sizes
self.upsample_rates = upsample_rates
self.upsample_initial_channel = upsample_initial_channel
self.upsample_kernel_sizes = upsample_kernel_sizes
self.segment_size = segment_size
self.gin_channels = gin_channels
# self.hop_length = hop_length#
self.spk_embed_dim = spk_embed_dim
self.enc_p = TextEncoder256Sim(
inter_channels,
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout,
)
self.dec = GeneratorNSF(
inter_channels,
resblock,
resblock_kernel_sizes,
resblock_dilation_sizes,
upsample_rates,
upsample_initial_channel,
upsample_kernel_sizes,
gin_channels=gin_channels,
is_half=kwargs["is_half"],
)
self.flow = ResidualCouplingBlock(
inter_channels, hidden_channels, 5, 1, 3, gin_channels=gin_channels
)
self.emb_g = nn.Embedding(self.spk_embed_dim, gin_channels)
print("gin_channels:", gin_channels, "self.spk_embed_dim:", self.spk_embed_dim)
def remove_weight_norm(self):
self.dec.remove_weight_norm()
self.flow.remove_weight_norm()
self.enc_q.remove_weight_norm()
def forward(
self, phone, phone_lengths, pitch, pitchf, ds, max_len=None
): # y (the spec input) is no longer needed here
g = self.emb_g(ds.unsqueeze(0)).unsqueeze(-1) # [b, 256, 1]; the trailing 1 broadcasts over t
x, x_mask = self.enc_p(phone, pitch, phone_lengths)
x = self.flow(x, x_mask, g=g, reverse=True)
o = self.dec((x * x_mask)[:, :, :max_len], pitchf, g=g)
return o
class MultiPeriodDiscriminator(torch.nn.Module): class MultiPeriodDiscriminator(torch.nn.Module):
def __init__(self, use_spectral_norm=False): def __init__(self, use_spectral_norm=False):
super(MultiPeriodDiscriminator, self).__init__() super(MultiPeriodDiscriminator, self).__init__()

View File

@ -0,0 +1,849 @@
import math, pdb, os
from time import time as ttime
import torch
from torch import nn
from torch.nn import functional as F
from infer_pack import modules
from infer_pack import attentions
from infer_pack import commons
from infer_pack.commons import init_weights, get_padding
from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
from infer_pack.commons import init_weights
import numpy as np
from infer_pack import commons
class TextEncoder256(nn.Module):
def __init__(
self,
out_channels,
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout,
f0=True,
):
super().__init__()
self.out_channels = out_channels
self.hidden_channels = hidden_channels
self.filter_channels = filter_channels
self.n_heads = n_heads
self.n_layers = n_layers
self.kernel_size = kernel_size
self.p_dropout = p_dropout
self.emb_phone = nn.Linear(256, hidden_channels)
self.lrelu = nn.LeakyReLU(0.1, inplace=True)
if f0 == True:
self.emb_pitch = nn.Embedding(256, hidden_channels) # pitch 256
self.encoder = attentions.Encoder(
hidden_channels, filter_channels, n_heads, n_layers, kernel_size, p_dropout
)
self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
def forward(self, phone, pitch, lengths):
if pitch == None:
x = self.emb_phone(phone)
else:
x = self.emb_phone(phone) + self.emb_pitch(pitch)
x = x * math.sqrt(self.hidden_channels) # [b, t, h]
x = self.lrelu(x)
x = torch.transpose(x, 1, -1) # [b, h, t]
x_mask = torch.unsqueeze(commons.sequence_mask(lengths, x.size(2)), 1).to(
x.dtype
)
x = self.encoder(x * x_mask, x_mask)
stats = self.proj(x) * x_mask
m, logs = torch.split(stats, self.out_channels, dim=1)
return m, logs, x_mask
class TextEncoder256Sim(nn.Module):
def __init__(
self,
out_channels,
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout,
f0=True,
):
super().__init__()
self.out_channels = out_channels
self.hidden_channels = hidden_channels
self.filter_channels = filter_channels
self.n_heads = n_heads
self.n_layers = n_layers
self.kernel_size = kernel_size
self.p_dropout = p_dropout
self.emb_phone = nn.Linear(256, hidden_channels)
self.lrelu = nn.LeakyReLU(0.1, inplace=True)
if f0 == True:
self.emb_pitch = nn.Embedding(256, hidden_channels) # pitch 256
self.encoder = attentions.Encoder(
hidden_channels, filter_channels, n_heads, n_layers, kernel_size, p_dropout
)
self.proj = nn.Conv1d(hidden_channels, out_channels, 1)
def forward(self, phone, pitch, lengths):
if pitch == None:
x = self.emb_phone(phone)
else:
x = self.emb_phone(phone) + self.emb_pitch(pitch)
x = x * math.sqrt(self.hidden_channels) # [b, t, h]
x = self.lrelu(x)
x = torch.transpose(x, 1, -1) # [b, h, t]
x_mask = torch.unsqueeze(commons.sequence_mask(lengths, x.size(2)), 1).to(
x.dtype
)
x = self.encoder(x * x_mask, x_mask)
x = self.proj(x) * x_mask
return x, x_mask
class ResidualCouplingBlock(nn.Module):
def __init__(
self,
channels,
hidden_channels,
kernel_size,
dilation_rate,
n_layers,
n_flows=4,
gin_channels=0,
):
super().__init__()
self.channels = channels
self.hidden_channels = hidden_channels
self.kernel_size = kernel_size
self.dilation_rate = dilation_rate
self.n_layers = n_layers
self.n_flows = n_flows
self.gin_channels = gin_channels
self.flows = nn.ModuleList()
for i in range(n_flows):
self.flows.append(
modules.ResidualCouplingLayer(
channels,
hidden_channels,
kernel_size,
dilation_rate,
n_layers,
gin_channels=gin_channels,
mean_only=True,
)
)
self.flows.append(modules.Flip())
def forward(self, x, x_mask, g=None, reverse=False):
if not reverse:
for flow in self.flows:
x, _ = flow(x, x_mask, g=g, reverse=reverse)
else:
for flow in reversed(self.flows):
x = flow(x, x_mask, g=g, reverse=reverse)
return x
def remove_weight_norm(self):
for i in range(self.n_flows):
self.flows[i * 2].remove_weight_norm()
class PosteriorEncoder(nn.Module):
def __init__(
self,
in_channels,
out_channels,
hidden_channels,
kernel_size,
dilation_rate,
n_layers,
gin_channels=0,
):
super().__init__()
self.in_channels = in_channels
self.out_channels = out_channels
self.hidden_channels = hidden_channels
self.kernel_size = kernel_size
self.dilation_rate = dilation_rate
self.n_layers = n_layers
self.gin_channels = gin_channels
self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
self.enc = modules.WN(
hidden_channels,
kernel_size,
dilation_rate,
n_layers,
gin_channels=gin_channels,
)
self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
def forward(self, x, x_lengths, g=None):
x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(
x.dtype
)
x = self.pre(x) * x_mask
x = self.enc(x, x_mask, g=g)
stats = self.proj(x) * x_mask
m, logs = torch.split(stats, self.out_channels, dim=1)
z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
return z, m, logs, x_mask
def remove_weight_norm(self):
self.enc.remove_weight_norm()
class Generator(torch.nn.Module):
def __init__(
self,
initial_channel,
resblock,
resblock_kernel_sizes,
resblock_dilation_sizes,
upsample_rates,
upsample_initial_channel,
upsample_kernel_sizes,
gin_channels=0,
):
super(Generator, self).__init__()
self.num_kernels = len(resblock_kernel_sizes)
self.num_upsamples = len(upsample_rates)
self.conv_pre = Conv1d(
initial_channel, upsample_initial_channel, 7, 1, padding=3
)
resblock = modules.ResBlock1 if resblock == "1" else modules.ResBlock2
self.ups = nn.ModuleList()
for i, (u, k) in enumerate(zip(upsample_rates, upsample_kernel_sizes)):
self.ups.append(
weight_norm(
ConvTranspose1d(
upsample_initial_channel // (2**i),
upsample_initial_channel // (2 ** (i + 1)),
k,
u,
padding=(k - u) // 2,
)
)
)
self.resblocks = nn.ModuleList()
for i in range(len(self.ups)):
ch = upsample_initial_channel // (2 ** (i + 1))
for j, (k, d) in enumerate(
zip(resblock_kernel_sizes, resblock_dilation_sizes)
):
self.resblocks.append(resblock(ch, k, d))
self.conv_post = Conv1d(ch, 1, 7, 1, padding=3, bias=False)
self.ups.apply(init_weights)
if gin_channels != 0:
self.cond = nn.Conv1d(gin_channels, upsample_initial_channel, 1)
def forward(self, x, g=None):
x = self.conv_pre(x)
if g is not None:
x = x + self.cond(g)
for i in range(self.num_upsamples):
x = F.leaky_relu(x, modules.LRELU_SLOPE)
x = self.ups[i](x)
xs = None
for j in range(self.num_kernels):
if xs is None:
xs = self.resblocks[i * self.num_kernels + j](x)
else:
xs += self.resblocks[i * self.num_kernels + j](x)
x = xs / self.num_kernels
x = F.leaky_relu(x)
x = self.conv_post(x)
x = torch.tanh(x)
return x
def remove_weight_norm(self):
for l in self.ups:
remove_weight_norm(l)
for l in self.resblocks:
l.remove_weight_norm()
class SineGen(torch.nn.Module):
"""Definition of sine generator
SineGen(samp_rate, harmonic_num = 0,
sine_amp = 0.1, noise_std = 0.003,
voiced_threshold = 0,
flag_for_pulse=False)
samp_rate: sampling rate in Hz
harmonic_num: number of harmonic overtones (default 0)
sine_amp: amplitude of sine-waveform (default 0.1)
noise_std: std of Gaussian noise (default 0.003)
voiced_threshold: F0 threshold for U/V classification (default 0)
flag_for_pulse: this SineGen is used inside PulseGen (default False)
Note: when flag_for_pulse is True, the first time step of a voiced
segment is always sin(np.pi) or cos(0)
"""
def __init__(
self,
samp_rate,
harmonic_num=0,
sine_amp=0.1,
noise_std=0.003,
voiced_threshold=0,
flag_for_pulse=False,
):
super(SineGen, self).__init__()
self.sine_amp = sine_amp
self.noise_std = noise_std
self.harmonic_num = harmonic_num
self.dim = self.harmonic_num + 1
self.sampling_rate = samp_rate
self.voiced_threshold = voiced_threshold
def _f02uv(self, f0):
# generate uv signal
uv = torch.ones_like(f0)
uv = uv * (f0 > self.voiced_threshold)
return uv
def forward(self, f0, upp):
"""sine_tensor, uv = forward(f0)
input F0: tensor(batchsize=1, length, dim=1)
f0 for unvoiced steps should be 0
output sine_tensor: tensor(batchsize=1, length, dim)
output uv: tensor(batchsize=1, length, 1)
"""
with torch.no_grad():
f0 = f0[:, None].transpose(1, 2)
f0_buf = torch.zeros(f0.shape[0], f0.shape[1], self.dim, device=f0.device)
# fundamental component
f0_buf[:, :, 0] = f0[:, :, 0]
for idx in np.arange(self.harmonic_num):
f0_buf[:, :, idx + 1] = f0_buf[:, :, 0] * (
idx + 2
) # idx + 2: the (idx+1)-th overtone, (idx+2)-th harmonic
rad_values = (f0_buf / self.sampling_rate) % 1 # the %1 here means the n_har products cannot be optimized away in post-processing
rand_ini = torch.rand(
f0_buf.shape[0], f0_buf.shape[2], device=f0_buf.device
)
rand_ini[:, 0] = 0
rad_values[:, 0, :] = rad_values[:, 0, :] + rand_ini
tmp_over_one = torch.cumsum(rad_values, 1) # % 1 # applying %1 here would mean the later cumsum could no longer be optimized
tmp_over_one *= upp
tmp_over_one = F.interpolate(
tmp_over_one.transpose(2, 1),
scale_factor=upp,
mode="linear",
align_corners=True,
).transpose(2, 1)
rad_values = F.interpolate(
rad_values.transpose(2, 1), scale_factor=upp, mode="nearest"
).transpose(
2, 1
) #######
tmp_over_one %= 1
tmp_over_one_idx = (tmp_over_one[:, 1:, :] - tmp_over_one[:, :-1, :]) < 0
cumsum_shift = torch.zeros_like(rad_values)
cumsum_shift[:, 1:, :] = tmp_over_one_idx * -1.0
sine_waves = torch.sin(
torch.cumsum(rad_values + cumsum_shift, dim=1) * 2 * np.pi
)
sine_waves = sine_waves * self.sine_amp
uv = self._f02uv(f0)
uv = F.interpolate(
uv.transpose(2, 1), scale_factor=upp, mode="nearest"
).transpose(2, 1)
noise_amp = uv * self.noise_std + (1 - uv) * self.sine_amp / 3
noise = noise_amp * torch.randn_like(sine_waves)
sine_waves = sine_waves * uv + noise
return sine_waves, uv, noise
class SourceModuleHnNSF(torch.nn.Module):
"""SourceModule for hn-nsf
SourceModule(sampling_rate, harmonic_num=0, sine_amp=0.1,
add_noise_std=0.003, voiced_threshod=0)
sampling_rate: sampling_rate in Hz
harmonic_num: number of harmonic above F0 (default: 0)
sine_amp: amplitude of sine source signal (default: 0.1)
add_noise_std: std of additive Gaussian noise (default: 0.003)
note that amplitude of noise in unvoiced is decided
by sine_amp
voiced_threshold: threshold to set U/V given F0 (default: 0)
Sine_source, noise_source = SourceModuleHnNSF(F0_sampled)
F0_sampled (batchsize, length, 1)
Sine_source (batchsize, length, 1)
noise_source (batchsize, length 1)
uv (batchsize, length, 1)
"""
def __init__(
self,
sampling_rate,
harmonic_num=0,
sine_amp=0.1,
add_noise_std=0.003,
voiced_threshod=0,
is_half=True,
):
super(SourceModuleHnNSF, self).__init__()
self.sine_amp = sine_amp
self.noise_std = add_noise_std
self.is_half = is_half
# to produce sine waveforms
self.l_sin_gen = SineGen(
sampling_rate, harmonic_num, sine_amp, add_noise_std, voiced_threshod
)
# to merge source harmonics into a single excitation
self.l_linear = torch.nn.Linear(harmonic_num + 1, 1)
self.l_tanh = torch.nn.Tanh()
def forward(self, x, upp=None):
sine_wavs, uv, _ = self.l_sin_gen(x, upp)
if self.is_half:
sine_wavs = sine_wavs.half()
sine_merge = self.l_tanh(self.l_linear(sine_wavs))
return sine_merge, None, None # noise, uv
class GeneratorNSF(torch.nn.Module):
def __init__(
self,
initial_channel,
resblock,
resblock_kernel_sizes,
resblock_dilation_sizes,
upsample_rates,
upsample_initial_channel,
upsample_kernel_sizes,
gin_channels,
sr,
is_half=False,
):
super(GeneratorNSF, self).__init__()
self.num_kernels = len(resblock_kernel_sizes)
self.num_upsamples = len(upsample_rates)
self.f0_upsamp = torch.nn.Upsample(scale_factor=np.prod(upsample_rates))
self.m_source = SourceModuleHnNSF(
sampling_rate=sr, harmonic_num=0, is_half=is_half
)
self.noise_convs = nn.ModuleList()
self.conv_pre = Conv1d(
initial_channel, upsample_initial_channel, 7, 1, padding=3
)
resblock = modules.ResBlock1 if resblock == "1" else modules.ResBlock2
self.ups = nn.ModuleList()
for i, (u, k) in enumerate(zip(upsample_rates, upsample_kernel_sizes)):
c_cur = upsample_initial_channel // (2 ** (i + 1))
self.ups.append(
weight_norm(
ConvTranspose1d(
upsample_initial_channel // (2**i),
upsample_initial_channel // (2 ** (i + 1)),
k,
u,
padding=(k - u) // 2,
)
)
)
if i + 1 < len(upsample_rates):
stride_f0 = np.prod(upsample_rates[i + 1 :])
self.noise_convs.append(
Conv1d(
1,
c_cur,
kernel_size=stride_f0 * 2,
stride=stride_f0,
padding=stride_f0 // 2,
)
)
else:
self.noise_convs.append(Conv1d(1, c_cur, kernel_size=1))
self.resblocks = nn.ModuleList()
for i in range(len(self.ups)):
ch = upsample_initial_channel // (2 ** (i + 1))
for j, (k, d) in enumerate(
zip(resblock_kernel_sizes, resblock_dilation_sizes)
):
self.resblocks.append(resblock(ch, k, d))
self.conv_post = Conv1d(ch, 1, 7, 1, padding=3, bias=False)
self.ups.apply(init_weights)
if gin_channels != 0:
self.cond = nn.Conv1d(gin_channels, upsample_initial_channel, 1)
self.upp = np.prod(upsample_rates)
def forward(self, x, f0, g=None):
har_source, noi_source, uv = self.m_source(f0, self.upp)
har_source = har_source.transpose(1, 2)
x = self.conv_pre(x)
if g is not None:
x = x + self.cond(g)
for i in range(self.num_upsamples):
x = F.leaky_relu(x, modules.LRELU_SLOPE)
x = self.ups[i](x)
x_source = self.noise_convs[i](har_source)
x = x + x_source
xs = None
for j in range(self.num_kernels):
if xs is None:
xs = self.resblocks[i * self.num_kernels + j](x)
else:
xs += self.resblocks[i * self.num_kernels + j](x)
x = xs / self.num_kernels
x = F.leaky_relu(x)
x = self.conv_post(x)
x = torch.tanh(x)
return x
def remove_weight_norm(self):
for l in self.ups:
remove_weight_norm(l)
for l in self.resblocks:
l.remove_weight_norm()
sr2sr = {
"32k": 32000,
"40k": 40000,
"48k": 48000,
}
class SynthesizerTrnMs256NSFsidM(nn.Module):
def __init__(
self,
spec_channels,
segment_size,
inter_channels,
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout,
resblock,
resblock_kernel_sizes,
resblock_dilation_sizes,
upsample_rates,
upsample_initial_channel,
upsample_kernel_sizes,
spk_embed_dim,
gin_channels,
sr,
**kwargs
):
super().__init__()
if type(sr) == type("strr"):
sr = sr2sr[sr]
self.spec_channels = spec_channels
self.inter_channels = inter_channels
self.hidden_channels = hidden_channels
self.filter_channels = filter_channels
self.n_heads = n_heads
self.n_layers = n_layers
self.kernel_size = kernel_size
self.p_dropout = p_dropout
self.resblock = resblock
self.resblock_kernel_sizes = resblock_kernel_sizes
self.resblock_dilation_sizes = resblock_dilation_sizes
self.upsample_rates = upsample_rates
self.upsample_initial_channel = upsample_initial_channel
self.upsample_kernel_sizes = upsample_kernel_sizes
self.segment_size = segment_size
self.gin_channels = gin_channels
# self.hop_length = hop_length#
self.spk_embed_dim = spk_embed_dim
self.enc_p = TextEncoder256(
inter_channels,
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout,
)
self.dec = GeneratorNSF(
inter_channels,
resblock,
resblock_kernel_sizes,
resblock_dilation_sizes,
upsample_rates,
upsample_initial_channel,
upsample_kernel_sizes,
gin_channels=gin_channels,
sr=sr,
is_half=kwargs["is_half"],
)
self.enc_q = PosteriorEncoder(
spec_channels,
inter_channels,
hidden_channels,
5,
1,
16,
gin_channels=gin_channels,
)
self.flow = ResidualCouplingBlock(
inter_channels, hidden_channels, 5, 1, 3, gin_channels=gin_channels
)
self.emb_g = nn.Embedding(self.spk_embed_dim, gin_channels)
print("gin_channels:", gin_channels, "self.spk_embed_dim:", self.spk_embed_dim)
def remove_weight_norm(self):
self.dec.remove_weight_norm()
self.flow.remove_weight_norm()
self.enc_q.remove_weight_norm()
def forward(self, phone, phone_lengths, pitch, nsff0, sid, rnd, max_len=None):
g = self.emb_g(sid).unsqueeze(-1)
m_p, logs_p, x_mask = self.enc_p(phone, pitch, phone_lengths)
z_p = (m_p + torch.exp(logs_p) * rnd) * x_mask
z = self.flow(z_p, x_mask, g=g, reverse=True)
o = self.dec((z * x_mask)[:, :, :max_len], nsff0, g=g)
return o
class SynthesizerTrnMs256NSFsid_sim(nn.Module):
"""
Synthesizer for Training
"""
def __init__(
self,
spec_channels,
segment_size,
inter_channels,
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout,
resblock,
resblock_kernel_sizes,
resblock_dilation_sizes,
upsample_rates,
upsample_initial_channel,
upsample_kernel_sizes,
spk_embed_dim,
# hop_length,
gin_channels=0,
use_sdp=True,
**kwargs
):
super().__init__()
self.spec_channels = spec_channels
self.inter_channels = inter_channels
self.hidden_channels = hidden_channels
self.filter_channels = filter_channels
self.n_heads = n_heads
self.n_layers = n_layers
self.kernel_size = kernel_size
self.p_dropout = p_dropout
self.resblock = resblock
self.resblock_kernel_sizes = resblock_kernel_sizes
self.resblock_dilation_sizes = resblock_dilation_sizes
self.upsample_rates = upsample_rates
self.upsample_initial_channel = upsample_initial_channel
self.upsample_kernel_sizes = upsample_kernel_sizes
self.segment_size = segment_size
self.gin_channels = gin_channels
# self.hop_length = hop_length#
self.spk_embed_dim = spk_embed_dim
self.enc_p = TextEncoder256Sim(
inter_channels,
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout,
)
self.dec = GeneratorNSF(
inter_channels,
resblock,
resblock_kernel_sizes,
resblock_dilation_sizes,
upsample_rates,
upsample_initial_channel,
upsample_kernel_sizes,
gin_channels=gin_channels,
is_half=kwargs["is_half"],
)
self.flow = ResidualCouplingBlock(
inter_channels, hidden_channels, 5, 1, 3, gin_channels=gin_channels
)
self.emb_g = nn.Embedding(self.spk_embed_dim, gin_channels)
print("gin_channels:", gin_channels, "self.spk_embed_dim:", self.spk_embed_dim)
def remove_weight_norm(self):
self.dec.remove_weight_norm()
self.flow.remove_weight_norm()
self.enc_q.remove_weight_norm()
def forward(
self, phone, phone_lengths, pitch, pitchf, ds, max_len=None
): # y (the spec input) is no longer needed here
g = self.emb_g(ds.unsqueeze(0)).unsqueeze(-1) # [b, 256, 1]; the trailing 1 broadcasts over t
x, x_mask = self.enc_p(phone, pitch, phone_lengths)
x = self.flow(x, x_mask, g=g, reverse=True)
o = self.dec((x * x_mask)[:, :, :max_len], pitchf, g=g)
return o
class MultiPeriodDiscriminator(torch.nn.Module):
def __init__(self, use_spectral_norm=False):
super(MultiPeriodDiscriminator, self).__init__()
periods = [2, 3, 5, 7, 11, 17]
# periods = [3, 5, 7, 11, 17, 23, 37]
discs = [DiscriminatorS(use_spectral_norm=use_spectral_norm)]
discs = discs + [
DiscriminatorP(i, use_spectral_norm=use_spectral_norm) for i in periods
]
self.discriminators = nn.ModuleList(discs)
def forward(self, y, y_hat):
y_d_rs = [] #
y_d_gs = []
fmap_rs = []
fmap_gs = []
for i, d in enumerate(self.discriminators):
y_d_r, fmap_r = d(y)
y_d_g, fmap_g = d(y_hat)
# for j in range(len(fmap_r)):
# print(i,j,y.shape,y_hat.shape,fmap_r[j].shape,fmap_g[j].shape)
y_d_rs.append(y_d_r)
y_d_gs.append(y_d_g)
fmap_rs.append(fmap_r)
fmap_gs.append(fmap_g)
return y_d_rs, y_d_gs, fmap_rs, fmap_gs
class DiscriminatorS(torch.nn.Module):
def __init__(self, use_spectral_norm=False):
super(DiscriminatorS, self).__init__()
norm_f = weight_norm if use_spectral_norm == False else spectral_norm
self.convs = nn.ModuleList(
[
norm_f(Conv1d(1, 16, 15, 1, padding=7)),
norm_f(Conv1d(16, 64, 41, 4, groups=4, padding=20)),
norm_f(Conv1d(64, 256, 41, 4, groups=16, padding=20)),
norm_f(Conv1d(256, 1024, 41, 4, groups=64, padding=20)),
norm_f(Conv1d(1024, 1024, 41, 4, groups=256, padding=20)),
norm_f(Conv1d(1024, 1024, 5, 1, padding=2)),
]
)
self.conv_post = norm_f(Conv1d(1024, 1, 3, 1, padding=1))
def forward(self, x):
fmap = []
for l in self.convs:
x = l(x)
x = F.leaky_relu(x, modules.LRELU_SLOPE)
fmap.append(x)
x = self.conv_post(x)
fmap.append(x)
x = torch.flatten(x, 1, -1)
return x, fmap
class DiscriminatorP(torch.nn.Module):
def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False):
super(DiscriminatorP, self).__init__()
self.period = period
self.use_spectral_norm = use_spectral_norm
norm_f = weight_norm if use_spectral_norm == False else spectral_norm
self.convs = nn.ModuleList(
[
norm_f(
Conv2d(
1,
32,
(kernel_size, 1),
(stride, 1),
padding=(get_padding(kernel_size, 1), 0),
)
),
norm_f(
Conv2d(
32,
128,
(kernel_size, 1),
(stride, 1),
padding=(get_padding(kernel_size, 1), 0),
)
),
norm_f(
Conv2d(
128,
512,
(kernel_size, 1),
(stride, 1),
padding=(get_padding(kernel_size, 1), 0),
)
),
norm_f(
Conv2d(
512,
1024,
(kernel_size, 1),
(stride, 1),
padding=(get_padding(kernel_size, 1), 0),
)
),
norm_f(
Conv2d(
1024,
1024,
(kernel_size, 1),
1,
padding=(get_padding(kernel_size, 1), 0),
)
),
]
)
self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))
def forward(self, x):
fmap = []
# 1d to 2d
b, c, t = x.shape
if t % self.period != 0: # pad first
n_pad = self.period - (t % self.period)
x = F.pad(x, (0, n_pad), "reflect")
t = t + n_pad
x = x.view(b, c, t // self.period, self.period)
for l in self.convs:
x = l(x)
x = F.leaky_relu(x, modules.LRELU_SLOPE)
fmap.append(x)
x = self.conv_post(x)
fmap.append(x)
x = torch.flatten(x, 1, -1)
return x, fmap

View File

@ -12,10 +12,10 @@ def load_audio(file, sr):
) # 防止小白拷路径头尾带了空格和"和回车 ) # 防止小白拷路径头尾带了空格和"和回车
out, _ = ( out, _ = (
ffmpeg.input(file, threads=0) ffmpeg.input(file, threads=0)
.output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr) .output("-", format="f32le", acodec="pcm_f32le", ac=1, ar=sr)
.run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True) .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)
) )
except Exception as e: except Exception as e:
raise RuntimeError(f"Failed to load audio: {e}") raise RuntimeError(f"Failed to load audio: {e}")
return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0 return np.frombuffer(out, np.float32).flatten()
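With the pipe switched to f32le, ffmpeg already emits normalized float32 samples, so the int16 decode and the / 32768.0 rescale are dropped. A tiny illustrative check of the new decode step (not part of the commit):

import numpy as np

raw = b"\x00\x00\x00\x00" + b"\x00\x00\x80\x3f"   # two little-endian float32 samples: 0.0 and 1.0
print(np.frombuffer(raw, np.float32))             # [0. 1.] -- already in [-1, 1], no further scaling needed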

View File

@ -4,7 +4,8 @@ scipy==1.9.3
librosa==0.9.2 librosa==0.9.2
llvmlite==0.39.0 llvmlite==0.39.0
fairseq==0.12.2 fairseq==0.12.2
faiss-cpu==1.7.2 faiss-cpu==1.7.0; sys_platform == "darwin"
faiss-cpu==1.7.2; sys_platform != "darwin"
gradio gradio
Cython Cython
future>=0.18.3 future>=0.18.3

View File

@ -98,7 +98,10 @@ class TextAudioLoaderMultiNSFsid(torch.utils.data.Dataset):
sampling_rate, self.sampling_rate sampling_rate, self.sampling_rate
) )
) )
audio_norm = audio / self.max_wav_value audio_norm = audio
# audio_norm = audio / self.max_wav_value
# audio_norm = audio / np.abs(audio).max()
audio_norm = audio_norm.unsqueeze(0) audio_norm = audio_norm.unsqueeze(0)
spec_filename = filename.replace(".wav", ".spec.pt") spec_filename = filename.replace(".wav", ".spec.pt")
if os.path.exists(spec_filename): if os.path.exists(spec_filename):
@ -287,7 +290,10 @@ class TextAudioLoader(torch.utils.data.Dataset):
sampling_rate, self.sampling_rate sampling_rate, self.sampling_rate
) )
) )
audio_norm = audio / self.max_wav_value audio_norm = audio
# audio_norm = audio / self.max_wav_value
# audio_norm = audio / np.abs(audio).max()
audio_norm = audio_norm.unsqueeze(0) audio_norm = audio_norm.unsqueeze(0)
spec_filename = filename.replace(".wav", ".spec.pt") spec_filename = filename.replace(".wav", ".spec.pt")
if os.path.exists(spec_filename): if os.path.exists(spec_filename):

View File

@ -1,18 +1,8 @@
import math
import os
import random
import torch import torch
from torch import nn
import torch.nn.functional as F
import torch.utils.data import torch.utils.data
import numpy as np
import librosa
import librosa.util as librosa_util
from librosa.util import normalize, pad_center, tiny
from scipy.signal import get_window
from scipy.io.wavfile import read
from librosa.filters import mel as librosa_mel_fn from librosa.filters import mel as librosa_mel_fn
MAX_WAV_VALUE = 32768.0 MAX_WAV_VALUE = 32768.0
@ -35,25 +25,38 @@ def dynamic_range_decompression_torch(x, C=1):
def spectral_normalize_torch(magnitudes): def spectral_normalize_torch(magnitudes):
output = dynamic_range_compression_torch(magnitudes) return dynamic_range_compression_torch(magnitudes)
return output
def spectral_de_normalize_torch(magnitudes): def spectral_de_normalize_torch(magnitudes):
output = dynamic_range_decompression_torch(magnitudes) return dynamic_range_decompression_torch(magnitudes)
return output
# Reusable banks
mel_basis = {} mel_basis = {}
hann_window = {} hann_window = {}
def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False): def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False):
"""Convert waveform into Linear-frequency Linear-amplitude spectrogram.
Args:
y :: (B, T) - Audio waveforms
n_fft
sampling_rate
hop_size
win_size
center
Returns:
:: (B, Freq, Frame) - Linear-frequency Linear-amplitude spectrogram
"""
# Validation
if torch.min(y) < -1.0: if torch.min(y) < -1.0:
print("min value is ", torch.min(y)) print("min value is ", torch.min(y))
if torch.max(y) > 1.0: if torch.max(y) > 1.0:
print("max value is ", torch.max(y)) print("max value is ", torch.max(y))
# Window - Cache if needed
global hann_window global hann_window
dtype_device = str(y.dtype) + "_" + str(y.device) dtype_device = str(y.dtype) + "_" + str(y.device)
wnsize_dtype_device = str(win_size) + "_" + dtype_device wnsize_dtype_device = str(win_size) + "_" + dtype_device
@ -62,6 +65,7 @@ def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False)
dtype=y.dtype, device=y.device dtype=y.dtype, device=y.device
) )
# Padding
y = torch.nn.functional.pad( y = torch.nn.functional.pad(
y.unsqueeze(1), y.unsqueeze(1),
(int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)), (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)),
@ -69,6 +73,7 @@ def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False)
) )
y = y.squeeze(1) y = y.squeeze(1)
# Complex Spectrogram :: (B, T) -> (B, Freq, Frame, RealComplex=2)
spec = torch.stft( spec = torch.stft(
y, y,
n_fft, n_fft,
@ -82,79 +87,44 @@ def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False)
return_complex=False, return_complex=False,
) )
# Linear-frequency Linear-amplitude spectrogram :: (B, Freq, Frame, RealComplex=2) -> (B, Freq, Frame)
spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6) spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6)
return spec return spec
def spec_to_mel_torch(spec, n_fft, num_mels, sampling_rate, fmin, fmax): def spec_to_mel_torch(spec, n_fft, num_mels, sampling_rate, fmin, fmax):
# MelBasis - Cache if needed
global mel_basis global mel_basis
dtype_device = str(spec.dtype) + "_" + str(spec.device) dtype_device = str(spec.dtype) + "_" + str(spec.device)
fmax_dtype_device = str(fmax) + "_" + dtype_device fmax_dtype_device = str(fmax) + "_" + dtype_device
if fmax_dtype_device not in mel_basis: if fmax_dtype_device not in mel_basis:
mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax) mel = librosa_mel_fn(
sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax
)
mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to( mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to(
dtype=spec.dtype, device=spec.device dtype=spec.dtype, device=spec.device
) )
spec = torch.matmul(mel_basis[fmax_dtype_device], spec)
spec = spectral_normalize_torch(spec) # Mel-frequency Log-amplitude spectrogram :: (B, Freq=num_mels, Frame)
return spec melspec = torch.matmul(mel_basis[fmax_dtype_device], spec)
melspec = spectral_normalize_torch(melspec)
return melspec
def mel_spectrogram_torch( def mel_spectrogram_torch(
y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False
): ):
if torch.min(y) < -1.0: """Convert waveform into Mel-frequency Log-amplitude spectrogram.
print("min value is ", torch.min(y))
if torch.max(y) > 1.0:
print("max value is ", torch.max(y))
global mel_basis, hann_window Args:
dtype_device = str(y.dtype) + "_" + str(y.device) y :: (B, T) - Waveforms
fmax_dtype_device = str(fmax) + "_" + dtype_device Returns:
wnsize_dtype_device = str(win_size) + "_" + dtype_device melspec :: (B, Freq, Frame) - Mel-frequency Log-amplitude spectrogram
if fmax_dtype_device not in mel_basis: """
mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax) # Linear-frequency Linear-amplitude spectrogram :: (B, T) -> (B, Freq, Frame)
mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to( spec = spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center)
dtype=y.dtype, device=y.device
)
if wnsize_dtype_device not in hann_window:
hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to(
dtype=y.dtype, device=y.device
)
y = torch.nn.functional.pad( # Mel-frequency Log-amplitude spectrogram :: (B, Freq, Frame) -> (B, Freq=num_mels, Frame)
y.unsqueeze(1), melspec = spec_to_mel_torch(spec, n_fft, num_mels, sampling_rate, fmin, fmax)
(int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)),
mode="reflect",
)
y = y.squeeze(1)
# spec = torch.stft( return melspec
# y,
# n_fft,
# hop_length=hop_size,
# win_length=win_size,
# window=hann_window[wnsize_dtype_device],
# center=center,
# pad_mode="reflect",
# normalized=False,
# onesided=True,
# )
spec = torch.stft(
y,
n_fft,
hop_length=hop_size,
win_length=win_size,
window=hann_window[wnsize_dtype_device],
center=center,
pad_mode="reflect",
normalized=False,
onesided=True,
return_complex=False,
)
spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6)
spec = torch.matmul(mel_basis[fmax_dtype_device], spec)
spec = spectral_normalize_torch(spec)
return spec
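After this refactor, mel_spectrogram_torch simply chains spectrogram_torch and spec_to_mel_torch. A rough usage sketch, assuming the functions can be imported from mel_processing; the n_fft/hop/mel values below are illustrative, not taken from the project configs:

import torch
from mel_processing import spectrogram_torch, spec_to_mel_torch, mel_spectrogram_torch

y = torch.randn(1, 16000).clamp(-1, 1)   # (B, T) waveform in [-1, 1]
spec = spectrogram_torch(y, n_fft=1024, sampling_rate=32000, hop_size=320, win_size=1024, center=False)
mel = spec_to_mel_torch(spec, n_fft=1024, num_mels=80, sampling_rate=32000, fmin=0.0, fmax=None)
# the one-call helper should produce the same (B, num_mels, frames) result
mel2 = mel_spectrogram_torch(y, 1024, 80, 32000, 320, 1024, 0.0, None, center=False)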

View File

@ -21,7 +21,7 @@ import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP from torch.nn.parallel import DistributedDataParallel as DDP
from torch.cuda.amp import autocast, GradScaler from torch.cuda.amp import autocast, GradScaler
from infer_pack import commons from infer_pack import commons
from time import sleep
from time import time as ttime from time import time as ttime
from data_utils import ( from data_utils import (
TextAudioLoaderMultiNSFsid, TextAudioLoaderMultiNSFsid,
@ -45,7 +45,7 @@ global_step = 0
def main(): def main():
# n_gpus = torch.cuda.device_count() # n_gpus = torch.cuda.device_count()
os.environ["MASTER_ADDR"] = "localhost" os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "5555" os.environ["MASTER_PORT"] = "51515"
mp.spawn( mp.spawn(
run, run,
@ -157,7 +157,7 @@ def run(rank, n_gpus, hps):
# epoch_str = 1 # epoch_str = 1
# global_step = 0 # global_step = 0
except: # 如果首次不能加载加载pretrain except: # 如果首次不能加载加载pretrain
traceback.print_exc() # traceback.print_exc()
epoch_str = 1 epoch_str = 1
global_step = 0 global_step = 0
if rank == 0: if rank == 0:
@ -230,39 +230,50 @@ def train_and_evaluate(
net_g.train() net_g.train()
net_d.train() net_d.train()
if cache == [] or hps.if_cache_data_in_gpu == False: # 第一个epoch把cache全部填满训练集
# print("caching") # Prepare data iterator
for batch_idx, info in enumerate(train_loader):
if hps.if_f0 == 1:
(
phone,
phone_lengths,
pitch,
pitchf,
spec,
spec_lengths,
wave,
wave_lengths,
sid,
) = info
else:
phone, phone_lengths, spec, spec_lengths, wave, wave_lengths, sid = info
if torch.cuda.is_available():
phone, phone_lengths = phone.cuda(
rank, non_blocking=True
), phone_lengths.cuda(rank, non_blocking=True)
if hps.if_f0 == 1:
pitch, pitchf = pitch.cuda(rank, non_blocking=True), pitchf.cuda(
rank, non_blocking=True
)
sid = sid.cuda(rank, non_blocking=True)
spec, spec_lengths = spec.cuda(
rank, non_blocking=True
), spec_lengths.cuda(rank, non_blocking=True)
wave, wave_lengths = wave.cuda(
rank, non_blocking=True
), wave_lengths.cuda(rank, non_blocking=True)
if hps.if_cache_data_in_gpu == True: if hps.if_cache_data_in_gpu == True:
# Use Cache
data_iterator = cache
if cache == []:
# Make new cache
for batch_idx, info in enumerate(train_loader):
# Unpack
if hps.if_f0 == 1:
(
phone,
phone_lengths,
pitch,
pitchf,
spec,
spec_lengths,
wave,
wave_lengths,
sid,
) = info
else:
(
phone,
phone_lengths,
spec,
spec_lengths,
wave,
wave_lengths,
sid,
) = info
# Load on CUDA
if torch.cuda.is_available():
phone = phone.cuda(rank, non_blocking=True)
phone_lengths = phone_lengths.cuda(rank, non_blocking=True)
if hps.if_f0 == 1:
pitch = pitch.cuda(rank, non_blocking=True)
pitchf = pitchf.cuda(rank, non_blocking=True)
sid = sid.cuda(rank, non_blocking=True)
spec = spec.cuda(rank, non_blocking=True)
spec_lengths = spec_lengths.cuda(rank, non_blocking=True)
wave = wave.cuda(rank, non_blocking=True)
wave_lengths = wave_lengths.cuda(rank, non_blocking=True)
# Cache on list
if hps.if_f0 == 1: if hps.if_f0 == 1:
cache.append( cache.append(
( (
@ -295,184 +306,17 @@ def train_and_evaluate(
), ),
) )
) )
with autocast(enabled=hps.train.fp16_run):
if hps.if_f0 == 1:
(
y_hat,
ids_slice,
x_mask,
z_mask,
(z, z_p, m_p, logs_p, m_q, logs_q),
) = net_g(
phone, phone_lengths, pitch, pitchf, spec, spec_lengths, sid
)
else: else:
( # Load shuffled cache
y_hat,
ids_slice,
x_mask,
z_mask,
(z, z_p, m_p, logs_p, m_q, logs_q),
) = net_g(phone, phone_lengths, spec, spec_lengths, sid)
mel = spec_to_mel_torch(
spec,
hps.data.filter_length,
hps.data.n_mel_channels,
hps.data.sampling_rate,
hps.data.mel_fmin,
hps.data.mel_fmax,
)
y_mel = commons.slice_segments(
mel, ids_slice, hps.train.segment_size // hps.data.hop_length
)
with autocast(enabled=False):
y_hat_mel = mel_spectrogram_torch(
y_hat.float().squeeze(1),
hps.data.filter_length,
hps.data.n_mel_channels,
hps.data.sampling_rate,
hps.data.hop_length,
hps.data.win_length,
hps.data.mel_fmin,
hps.data.mel_fmax,
)
if hps.train.fp16_run == True:
y_hat_mel = y_hat_mel.half()
wave = commons.slice_segments(
wave, ids_slice * hps.data.hop_length, hps.train.segment_size
) # slice
# Discriminator
y_d_hat_r, y_d_hat_g, _, _ = net_d(wave, y_hat.detach())
with autocast(enabled=False):
loss_disc, losses_disc_r, losses_disc_g = discriminator_loss(
y_d_hat_r, y_d_hat_g
)
optim_d.zero_grad()
scaler.scale(loss_disc).backward()
scaler.unscale_(optim_d)
grad_norm_d = commons.clip_grad_value_(net_d.parameters(), None)
scaler.step(optim_d)
with autocast(enabled=hps.train.fp16_run):
# Generator
y_d_hat_r, y_d_hat_g, fmap_r, fmap_g = net_d(wave, y_hat)
with autocast(enabled=False):
loss_mel = F.l1_loss(y_mel, y_hat_mel) * hps.train.c_mel
loss_kl = kl_loss(z_p, logs_q, m_p, logs_p, z_mask) * hps.train.c_kl
loss_fm = feature_loss(fmap_r, fmap_g)
loss_gen, losses_gen = generator_loss(y_d_hat_g)
loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl
optim_g.zero_grad()
scaler.scale(loss_gen_all).backward()
scaler.unscale_(optim_g)
grad_norm_g = commons.clip_grad_value_(net_g.parameters(), None)
scaler.step(optim_g)
scaler.update()
if rank == 0:
if global_step % hps.train.log_interval == 0:
lr = optim_g.param_groups[0]["lr"]
logger.info(
"Train Epoch: {} [{:.0f}%]".format(
epoch, 100.0 * batch_idx / len(train_loader)
)
)
# Amor For Tensorboard display
if loss_mel > 50:
loss_mel = 50
if loss_kl > 5:
loss_kl = 5
logger.info([global_step, lr])
logger.info(
f"loss_disc={loss_disc:.3f}, loss_gen={loss_gen:.3f}, loss_fm={loss_fm:.3f},loss_mel={loss_mel:.3f}, loss_kl={loss_kl:.3f}"
)
scalar_dict = {
"loss/g/total": loss_gen_all,
"loss/d/total": loss_disc,
"learning_rate": lr,
"grad_norm_d": grad_norm_d,
"grad_norm_g": grad_norm_g,
}
scalar_dict.update(
{
"loss/g/fm": loss_fm,
"loss/g/mel": loss_mel,
"loss/g/kl": loss_kl,
}
)
scalar_dict.update(
{"loss/g/{}".format(i): v for i, v in enumerate(losses_gen)}
)
scalar_dict.update(
{
"loss/d_r/{}".format(i): v
for i, v in enumerate(losses_disc_r)
}
)
scalar_dict.update(
{
"loss/d_g/{}".format(i): v
for i, v in enumerate(losses_disc_g)
}
)
image_dict = {
"slice/mel_org": utils.plot_spectrogram_to_numpy(
y_mel[0].data.cpu().numpy()
),
"slice/mel_gen": utils.plot_spectrogram_to_numpy(
y_hat_mel[0].data.cpu().numpy()
),
"all/mel": utils.plot_spectrogram_to_numpy(
mel[0].data.cpu().numpy()
),
}
utils.summarize(
writer=writer,
global_step=global_step,
images=image_dict,
scalars=scalar_dict,
)
global_step += 1
# if global_step % hps.train.eval_interval == 0:
if epoch % hps.save_every_epoch == 0 and rank == 0:
if hps.if_latest == 0:
utils.save_checkpoint(
net_g,
optim_g,
hps.train.learning_rate,
epoch,
os.path.join(hps.model_dir, "G_{}.pth".format(global_step)),
)
utils.save_checkpoint(
net_d,
optim_d,
hps.train.learning_rate,
epoch,
os.path.join(hps.model_dir, "D_{}.pth".format(global_step)),
)
else:
utils.save_checkpoint(
net_g,
optim_g,
hps.train.learning_rate,
epoch,
os.path.join(hps.model_dir, "G_{}.pth".format(2333333)),
)
utils.save_checkpoint(
net_d,
optim_d,
hps.train.learning_rate,
epoch,
os.path.join(hps.model_dir, "D_{}.pth".format(2333333)),
)
else: # 后续的epoch直接使用打乱的cache
shuffle(cache) shuffle(cache)
# print("using cache") else:
for batch_idx, info in cache: # Loader
data_iterator = enumerate(train_loader)
# Run steps
for batch_idx, info in data_iterator:
# Data
## Unpack
if hps.if_f0 == 1: if hps.if_f0 == 1:
( (
phone, phone,
@ -487,6 +331,20 @@ def train_and_evaluate(
) = info ) = info
else: else:
phone, phone_lengths, spec, spec_lengths, wave, wave_lengths, sid = info phone, phone_lengths, spec, spec_lengths, wave, wave_lengths, sid = info
## Load on CUDA
if (hps.if_cache_data_in_gpu == False) and torch.cuda.is_available():
phone = phone.cuda(rank, non_blocking=True)
phone_lengths = phone_lengths.cuda(rank, non_blocking=True)
if hps.if_f0 == 1:
pitch = pitch.cuda(rank, non_blocking=True)
pitchf = pitchf.cuda(rank, non_blocking=True)
sid = sid.cuda(rank, non_blocking=True)
spec = spec.cuda(rank, non_blocking=True)
spec_lengths = spec_lengths.cuda(rank, non_blocking=True)
wave = wave.cuda(rank, non_blocking=True)
wave_lengths = wave_lengths.cuda(rank, non_blocking=True)
# Calculate
with autocast(enabled=hps.train.fp16_run): with autocast(enabled=hps.train.fp16_run):
if hps.if_f0 == 1: if hps.if_f0 == 1:
( (
@ -495,9 +353,7 @@ def train_and_evaluate(
x_mask, x_mask,
z_mask, z_mask,
(z, z_p, m_p, logs_p, m_q, logs_q), (z, z_p, m_p, logs_p, m_q, logs_q),
) = net_g( ) = net_g(phone, phone_lengths, pitch, pitchf, spec, spec_lengths, sid)
phone, phone_lengths, pitch, pitchf, spec, spec_lengths, sid
)
else: else:
( (
y_hat, y_hat,
@ -552,7 +408,6 @@ def train_and_evaluate(
with autocast(enabled=False): with autocast(enabled=False):
loss_mel = F.l1_loss(y_mel, y_hat_mel) * hps.train.c_mel loss_mel = F.l1_loss(y_mel, y_hat_mel) * hps.train.c_mel
loss_kl = kl_loss(z_p, logs_q, m_p, logs_p, z_mask) * hps.train.c_kl loss_kl = kl_loss(z_p, logs_q, m_p, logs_p, z_mask) * hps.train.c_kl
loss_fm = feature_loss(fmap_r, fmap_g) loss_fm = feature_loss(fmap_r, fmap_g)
loss_gen, losses_gen = generator_loss(y_d_hat_g) loss_gen, losses_gen = generator_loss(y_d_hat_g)
loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl
@ -600,16 +455,10 @@ def train_and_evaluate(
{"loss/g/{}".format(i): v for i, v in enumerate(losses_gen)} {"loss/g/{}".format(i): v for i, v in enumerate(losses_gen)}
) )
scalar_dict.update( scalar_dict.update(
{ {"loss/d_r/{}".format(i): v for i, v in enumerate(losses_disc_r)}
"loss/d_r/{}".format(i): v
for i, v in enumerate(losses_disc_r)
}
) )
scalar_dict.update( scalar_dict.update(
{ {"loss/d_g/{}".format(i): v for i, v in enumerate(losses_disc_g)}
"loss/d_g/{}".format(i): v
for i, v in enumerate(losses_disc_g)
}
) )
image_dict = { image_dict = {
"slice/mel_org": utils.plot_spectrogram_to_numpy( "slice/mel_org": utils.plot_spectrogram_to_numpy(
@ -629,7 +478,8 @@ def train_and_evaluate(
scalars=scalar_dict, scalars=scalar_dict,
) )
global_step += 1 global_step += 1
# if global_step % hps.train.eval_interval == 0: # /Run steps
if epoch % hps.save_every_epoch == 0 and rank == 0: if epoch % hps.save_every_epoch == 0 and rank == 0:
if hps.if_latest == 0: if hps.if_latest == 0:
utils.save_checkpoint( utils.save_checkpoint(
@ -676,6 +526,7 @@ def train_and_evaluate(
"saving final ckpt:%s" "saving final ckpt:%s"
% (savee(ckpt, hps.sample_rate, hps.if_f0, hps.name, epoch)) % (savee(ckpt, hps.sample_rate, hps.if_f0, hps.name, epoch))
) )
sleep(1)
os._exit(2333333) os._exit(2333333)

View File

@ -1,4 +1,5 @@
import sys, os, multiprocessing import sys, os, multiprocessing
from scipy import signal
now_dir = os.getcwd() now_dir = os.getcwd()
sys.path.append(now_dir) sys.path.append(now_dir)
@ -31,13 +32,14 @@ class PreProcess:
def __init__(self, sr, exp_dir): def __init__(self, sr, exp_dir):
self.slicer = Slicer( self.slicer = Slicer(
sr=sr, sr=sr,
threshold=-32, threshold=-40,
min_length=800, min_length=800,
min_interval=400, min_interval=400,
hop_size=15, hop_size=15,
max_sil_kept=150, max_sil_kept=150,
) )
self.sr = sr self.sr = sr
self.bh, self.ah = signal.butter(N=5, Wn=48, btype="high", fs=self.sr)
self.per = 3.7 self.per = 3.7
self.overlap = 0.3 self.overlap = 0.3
self.tail = self.per + self.overlap self.tail = self.per + self.overlap
@ -57,18 +59,22 @@ class PreProcess:
wavfile.write( wavfile.write(
"%s/%s_%s.wav" % (self.gt_wavs_dir, idx0, idx1), "%s/%s_%s.wav" % (self.gt_wavs_dir, idx0, idx1),
self.sr, self.sr,
(tmp_audio * 32768).astype(np.int16), tmp_audio.astype(np.float32),
) )
tmp_audio = librosa.resample(tmp_audio, orig_sr=self.sr, target_sr=16000) tmp_audio = librosa.resample(tmp_audio, orig_sr=self.sr, target_sr=16000)#, res_type="soxr_vhq"
wavfile.write( wavfile.write(
"%s/%s_%s.wav" % (self.wavs16k_dir, idx0, idx1), "%s/%s_%s.wav" % (self.wavs16k_dir, idx0, idx1),
16000, 16000,
(tmp_audio * 32768).astype(np.int16), tmp_audio.astype(np.float32),
) )
def pipeline(self, path, idx0): def pipeline(self, path, idx0):
try: try:
audio = load_audio(path, self.sr) audio = load_audio(path, self.sr)
# zero-phase digital filtering causes pre-ringing noise...
# audio = signal.filtfilt(self.bh, self.ah, audio)
audio = signal.lfilter(self.bh, self.ah, audio)
idx1 = 0 idx1 = 0
for audio in self.slicer.slice(audio): for audio in self.slicer.slice(audio):
i = 0 i = 0
@ -81,6 +87,7 @@ class PreProcess:
idx1 += 1 idx1 += 1
else: else:
tmp_audio = audio[start:] tmp_audio = audio[start:]
idx1 += 1
break break
self.norm_write(tmp_audio, idx0, idx1) self.norm_write(tmp_audio, idx0, idx1)
println("%s->Suc." % path) println("%s->Suc." % path)
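The preprocessing pipeline now high-passes every input before slicing: a 5th-order Butterworth with a 48 Hz cutoff, applied causally with lfilter (the zero-phase filtfilt call is left commented out because its pre-ringing is audible). A standalone sketch of that filter, with an assumed 40 kHz sample rate:

import numpy as np
from scipy import signal

sr = 40000                                          # assumed sample rate; the class uses self.sr
bh, ah = signal.butter(N=5, Wn=48, btype="high", fs=sr)
audio = np.random.randn(2 * sr)                     # stand-in waveform
filtered = signal.lfilter(bh, ah, audio)            # causal high-pass: attenuates content below ~48 Hz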

View File

@ -4,7 +4,7 @@ from tqdm import tqdm
import json import json
def load_data(file_name: str = "./uvr5_pack/data.json") -> dict: def load_data(file_name: str = "./uvr5_pack/name_params.json") -> dict:
with open(file_name, "r") as f: with open(file_name, "r") as f:
data = json.load(f) data = json.load(f)

View File

@ -4,6 +4,9 @@ import torch.nn.functional as F
from config import x_pad, x_query, x_center, x_max from config import x_pad, x_query, x_center, x_max
import scipy.signal as signal import scipy.signal as signal
import pyworld, os, traceback, faiss import pyworld, os, traceback, faiss
from scipy import signal
bh, ah = signal.butter(N=5, Wn=48, btype="high", fs=16000)
class VC(object): class VC(object):
@ -189,6 +192,7 @@ class VC(object):
index = big_npy = None index = big_npy = None
else: else:
index = big_npy = None index = big_npy = None
audio = signal.filtfilt(bh, ah, audio)
audio_pad = np.pad(audio, (self.window // 2, self.window // 2), mode="reflect") audio_pad = np.pad(audio, (self.window // 2, self.window // 2), mode="reflect")
opt_ts = [] opt_ts = []
if audio_pad.shape[0] > self.t_max: if audio_pad.shape[0] > self.t_max: