Merge branch 'main' into patch-2

Commit d6db821ac7 by 源文雨 on 2023-04-24 20:22:08 +08:00, committed by GitHub.
36 changed files with 2062 additions and 763 deletions


@@ -2,18 +2,20 @@ name: pull format
 on: [pull_request]
+permissions:
+  contents: write
 jobs:
   pull_format:
+    permissions:
+      actions: write
+      checks: write
+      contents: write
     runs-on: ubuntu-latest
     continue-on-error: true
     steps:
-      - uses: actions/checkout@v3
+      - name: checkout
+        continue-on-error: true
+        uses: actions/checkout@v3
         with:
           ref: ${{ github.head_ref }}
+          fetch-depth: 0
       - name: Set up Python ${{ matrix.python-version }}
         uses: actions/setup-python@v4

.github/workflows/push_format.yml (new file, 46 lines)

@@ -0,0 +1,46 @@
name: push format

on:
  push:
    branches:
      - main

permissions:
  contents: write
  pull-requests: write

jobs:
  push_format:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          ref: ${{github.ref_name}}
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install Black
        run: pip install black
      - name: Run Black
        # run: black $(git ls-files '*.py')
        run: black .
      - name: Commit Back
        continue-on-error: true
        id: commitback
        run: |
          git config --local user.email "github-actions[bot]@users.noreply.github.com"
          git config --local user.name "github-actions[bot]"
          git add --all
          git commit -m "Format code"
      - name: Create Pull Request
        if: steps.commitback.outcome == 'success'
        continue-on-error: true
        uses: peter-evans/create-pull-request@v4
        with:
          body: Apply Code Formatter Change
          commit-message: Automatic code format


@@ -1,15 +1,23 @@
-20230409
-修正训练参数提升显卡平均利用率A100最高从25%提升至90%左右V100:50%->90%左右2060S:60%->85%左右P40:25%->95%左右,训练速度显著提升
-修正参数总batch_size改为每张卡的batch_size
-修正total_epoch最大限制100解锁至1000默认10提升至默认20
-修复ckpt提取识别是否带音高错误导致推理异常的问题
-修复分布式训练每个rank都保存一次ckpt的问题
-特征提取进行nan特征过滤
-修复静音输入输出随机辅音or噪声的问题老版模型需要重做训练集重训
+### 20230409
+- 修正训练参数提升显卡平均利用率A100最高从25%提升至90%左右V100:50%->90%左右2060S:60%->85%左右P40:25%->95%左右,训练速度显著提升
+- 修正参数总batch_size改为每张卡的batch_size
+- 修正total_epoch最大限制100解锁至1000默认10提升至默认20
+- 修复ckpt提取识别是否带音高错误导致推理异常的问题
+- 修复分布式训练每个rank都保存一次ckpt的问题
+- 特征提取进行nan特征过滤
+- 修复静音输入输出随机辅音or噪声的问题老版模型需要重做训练集重训
+
+### 20230416更新
+- 新增本地实时变声迷你GUI双击go-realtime-gui.bat启动
+- 训练推理均对<50Hz的频段进行滤波过滤
+- 训练推理音高提取pyworld最低音高从默认80下降至50,50-80hz间的男声低音不会哑
+- WebUI支持根据系统区域变更语言现支持en_USja_JPzh_CNzh_HKzh_SGzh_TW不支持的默认en_US
+- 修正部分显卡识别例如V100-16G识别失败P4识别失败
+
+### 后续计划:
+- 收集呼吸wav加入训练集修正呼吸变声电音的问题
+- 研究更优的默认faiss索引配置计划将索引打包进weights/xxx.pth中取消推理界面的 特征/检索库 选择
+- 根据显存情况和显卡架构自动给到最优配置batch size训练集切块推理音频长度相关的config训练是否fp16未来所有>=4G显存的>=pascal架构的显卡都可以训练或推理<4G显存的显卡不会进行支持
+- 我们正在训练增加了歌声训练集的底模,未来会公开
+- 推理音高识别选项加入"是否开启中值滤波"
+- 增加选项:每次epoch保存的小模型均进行提取; 增加选项:设置默认测试集音频每次保存的小模型均在保存后对其进行推理导出用户可试听来选择哪个中间epoch最好


@@ -60,7 +60,7 @@ poetry install
 你也可以通过pip来安装依赖
-**注意**: `MacOS``faiss 1.7.2`版本会导致抛出段错误,请将`requirements.txt`的对应条目改为`faiss-cpu==1.7.0`
+**注意**: `MacOS``faiss 1.7.2`版本会导致抛出段错误,在手动安装时请使用命令`pip install faiss-cpu==1.7.0`指定使用`1.7.0`版本
 ```bash
 pip install -r requirements.txt


@@ -30,7 +30,7 @@ parser.add_argument(
 cmd_opts = parser.parse_args()
 python_cmd = cmd_opts.pycmd
-listen_port = cmd_opts.port
+listen_port = cmd_opts.port if 0 <= cmd_opts.port <= 65535 else 7865
 iscolab = cmd_opts.colab
 noparallel = cmd_opts.noparallel
 noautoopen = cmd_opts.noautoopen


@@ -24,6 +24,9 @@ An easy-to-use SVC framework based on VITS.<br><br>
 > Realtime Voice Conversion Software using RVC : [w-okada/voice-changer](https://github.com/w-okada/voice-changer)
+> The dataset for the pre-training model uses nearly 50 hours of high quality VCTK open source dataset.
+> High quality licensed song datasets will be added to training-set one after another for your use, without worrying about copyright infringement.
 ## Summary
 This repository has the following features:
 + Reduce tone leakage by replacing source feature to training-set feature using top1 retrieval;
@@ -32,7 +35,6 @@ This repository has the following features:
 + Supporting model fusion to change timbres (using ckpt processing tab->ckpt merge);
 + Easy-to-use Webui interface;
 + Use the UVR5 model to quickly separate vocals and instruments.
-+ The dataset for the pre-training model uses nearly 50 hours of high quality VCTK open source dataset, and high quality licensed song datasets will be added to training-set one after another for your use, without worrying about copyright infringement.
 ## Preparing the environment
 We recommend you install the dependencies through poetry.
@@ -43,8 +45,7 @@ The following commands need to be executed in the environment of Python version
 pip install torch torchvision torchaudio
 #For Windows + Nvidia Ampere Architecture(RTX30xx), you need to specify the cuda version corresponding to pytorch according to the experience of https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI/issues/21
-pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
+#pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
 # Install the Poetry dependency management tool, skip if installed
 # Reference: https://python-poetry.org/docs/#installation
@@ -55,7 +56,7 @@ poetry install
 ```
 You can also use pip to install the dependencies
-**Notice**: `faiss 1.7.2` will raise Segmentation Fault: 11 under `MacOS`, please change corresponding line in `requirements.txt` to `faiss-cpu==1.7.0`
+**Notice**: `faiss 1.7.2` will raise Segmentation Fault: 11 under `MacOS`, please use `pip install faiss-cpu==1.7.0` if you use pip to install it manually.
 ```bash
 pip install -r requirements.txt
@@ -83,12 +84,16 @@ python infer-web.py
 ```
 If you are using Windows, you can download and extract `RVC-beta.7z` to use RVC directly and use `go-web.bat` to start Webui.
-We will develop an English version of the WebUI in 2 weeks.
 There's also a tutorial on RVC in Chinese and you can check it out if needed.
 ## Credits
++ [ContentVec](https://github.com/auspicious3000/contentvec/)
++ [VITS](https://github.com/jaywalnut310/vits)
++ [HIFIGAN](https://github.com/jik876/hifi-gan)
++ [Gradio](https://github.com/gradio-app/gradio)
++ [FFmpeg](https://github.com/FFmpeg/FFmpeg)
++ [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
++ [audio-slicer](https://github.com/openvpi/audio-slicer)
 ## Thanks to all contributors for their efforts
 <a href="https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI/graphs/contributors" target="_blank">


@@ -21,58 +21,58 @@ VITSに基づく使いやすい音声変換voice changerframework<br><br>
 [**English**](./README.en.md) | [**中文简体**](../README.md) | [**日本語**](./README.ja.md)
 > デモ動画は[こちら](https://www.bilibili.com/video/BV1pm4y1z7Gm/)でご覧ください
 > RVCによるリアルタイム音声変換: [w-okada/voice-changer](https://github.com/w-okada/voice-changer)
-> 基底modelを訓練(training)したのは、約50時間の高品質なオープンソースのデータセット。著作権侵害を心配することなく使用できるように
+> 著作権侵害を心配することなく使用できるように、基底モデルは約50時間の高品質なオープンソースデータセットで訓練されています。
-> 今後は次々と使用許可のある高品質歌声資料集を追加し、基底modelを訓練する
+> 今後も、次々と使用許可のある高品質な歌声の資料集を追加し、基底モデルを訓練する予定です。
 ## はじめに
-repoは下記の特徴があります
+リポジトリには下記の特徴があります。
-+ 調子(tone)の漏洩が下がれるためtop1検索で源特徴量を訓練集特徴量に置換
++ Top1検索を用いることで、生の特徴量を訓練用データセット特徴量に変換し、トーンリーケージを削減します。
-+ 古い又は安いGPUでも高速に訓練できる
++ 比較的貧弱なGPUでも、高速かつ簡単に訓練できます。
-+ 小さい訓練集でもかなりいいmodelを得られる(10分以上の低noise音声を推奨)
++ 少量のデータセットからでも、比較的良い結果を得ることができます。10分以上のノイズの少ない音声を推奨します。
-+ modelを融合し音色をmergeできる(ckpt processing->ckpt mergeで使用)
++ モデルを融合することで、音声を混ぜることができます。ckpt processingタブの、ckpt mergeを使用します。
 + 使いやすいWebUI
-+ UVR5 Modelも含めるため人声とBGMを素早く分離できる
++ UVR5 Modelも含んでいるため、人の声とBGMを素早く分離できます。
 ## 環境構築
-poetryで依存関係をinstallすることをお勧めします。
+Poetryで依存関係をインストールすることをお勧めします。
-下記のcommandsは、Python3.8以上の環境で実行する必要があります:
+下記のコマンドは、Python3.8以上の環境で実行する必要があります:
 ```bash
-# PyTorch関連の依存関係をinstall。install済の場合はskip
+# PyTorch関連の依存関係をインストール。インストール済の場合は省略。
 # 参照先: https://pytorch.org/get-started/locally/
 pip install torch torchvision torchaudio
 #Windows Nvidia Ampere Architecture(RTX30xx)の場合、 #21 に従い、pytorchに対応するcuda versionを指定する必要があります。
 #pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
-# PyTorch関連の依存関係をinstall。install済の場合はskip
+# PyTorch関連の依存関係をインストール。インストール済の場合は省略。
 # 参照先: https://python-poetry.org/docs/#installation
 curl -sSL https://install.python-poetry.org | python3 -
-# Poetry経由で依存関係をinstall
+# Poetry経由で依存関係をインストール
 poetry install
 ```
-pipでも依存関係のinstallが可能です:
+pipでも依存関係のインストールが可能です:
-**注意**:`faiss 1.7.2``macOS``Segmentation Fault: 11`を起こすので、`requirements.txt`の該当行を `faiss-cpu==1.7.0`に変更してください。
+**注意**:`faiss 1.7.2``macOS``Segmentation Fault: 11`を起こすので、マニュアルインストールする場合は、 `pip install faiss-cpu==1.7.0`を実行してください。
 ```bash
 pip install -r requirements.txt
 ```
 ## 基底modelsを準備
-RVCは推論/訓練のために、様々な事前訓練を行った基底modelsが必要です。
+RVCは推論/訓練のために、様々な事前訓練を行った基底モデルを必要とします。
 modelsは[Hugging Face space](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/)からダウンロードできます。
-以下は、RVCに必要な基底modelsやその他のfilesの一覧です。
+以下は、RVCに必要な基底モデルやその他のファイルの一覧です。
 ```bash
 hubert_base.pt
@@ -80,16 +80,16 @@ hubert_base.pt
 ./uvr5_weights
-# ffmpegがすでにinstallされている場合はskip
+# ffmpegがすでにinstallされている場合は省略
 ./ffmpeg
 ```
-その後、下記のcommandでWebUIを起動
+その後、下記のコマンドでWebUIを起動します。
 ```bash
 python infer-web.py
 ```
-Windowsをお使いの方は、直接`RVC-beta.7z`をダウンロード後に展開し、`go-web.bat`clickでWebUIを起動。(7zipが必要です)
+Windowsをお使いの方は、直接`RVC-beta.7z`をダウンロード後に展開し、`go-web.bat`クリックすることで、WebUIを起動することができます。(7zipが必要です。)
-また、repoに[小白简易教程.doc](./小白简易教程.doc)がありますので、参考にしてください(中国語版のみ)。
+また、リポジトリに[小白简易教程.doc](./小白简易教程.doc)がありますので、参考にしてください(中国語版のみ)。
 ## 参考プロジェクト
 + [ContentVec](https://github.com/auspicious3000/contentvec/)
@@ -100,7 +100,7 @@ Windowsをお使いの方は、直接に`RVC-beta.7z`をダウンロード後に
 + [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
 + [audio-slicer](https://github.com/openvpi/audio-slicer)
-## 貢献者(contributer)の皆様の尽力に感謝します
+## 貢献者(contributor)の皆様の尽力に感謝します
 <a href="https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI/graphs/contributors" target="_blank">
   <img src="https://contrib.rocks/image?repo=liujing04/Retrieval-based-Voice-Conversion-WebUI" />
 </a>

docs/faiss_tips_en.md (new file, 146 lines)

@@ -0,0 +1,146 @@
faiss tuning TIPS
==================
# about faiss
faiss is a library of neighborhood searches for dense vectors, developed by facebook research, which efficiently implements many approximate neighborhood search methods.
Approximate Neighbor Search finds similar vectors quickly while sacrificing some accuracy.
## faiss in RVC
In RVC, for the embedding of features converted by HuBERT, we search for embeddings similar to the embedding generated from the training data and mix them to achieve a conversion that is closer to the original speech. However, since this search takes time if performed naively, high-speed conversion is realized by using approximate neighborhood search.
# implementation overview
The directory '/logs/your-experiment/3_feature256' (alongside the model) contains the features extracted by HuBERT from each piece of audio data.
From here we read the npy files in order sorted by filename and concatenate the vectors to create big_npy. (This vector has shape [N, 256].)
After saving big_npy as /logs/your-experiment/total_fea.npy, train it with faiss.
As of 2023/04/18, an IVF index based on L2 distance is built using the index factory function of faiss.
The number of IVF divisions (n_ivf) is N//39, and n_probe uses int(np.power(n_ivf, 0.3)). (Look around train_index in infer-web.py.)
In this article, I will first explain the meaning of these parameters, and then write advice for developers to create a better index.
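As a point of reference, here is a minimal sketch of that pipeline under the assumptions stated above (the output file name is illustrative; the real index follows the add_XXX.index pattern and infer-web.py's train_index may include additional details):

```python
import os
import glob
import numpy as np
import faiss

exp_dir = "logs/your-experiment"

# Read the HuBERT features (one [n_frames, 256] array per utterance) in filename order.
files = sorted(glob.glob(os.path.join(exp_dir, "3_feature256", "*.npy")))
big_npy = np.concatenate([np.load(f) for f in files], axis=0).astype("float32")  # shape [N, 256]
np.save(os.path.join(exp_dir, "total_fea.npy"), big_npy)

# IVF over L2 distance, with the n_ivf / n_probe heuristics quoted above.
n_ivf = big_npy.shape[0] // 39
index = faiss.index_factory(256, "IVF%s,Flat" % n_ivf)
faiss.extract_index_ivf(index).nprobe = int(np.power(n_ivf, 0.3))
index.train(big_npy)
index.add(big_npy)
faiss.write_index(index, os.path.join(exp_dir, "added_index_example.index"))
```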
# Explanation of the method
## index factory
An index factory is a unique faiss notation that expresses a pipeline that connects multiple approximate neighborhood search methods as a string.
This allows you to try various approximate neighborhood search methods simply by changing the index factory string.
In RVC it is used like this:
```python
index = faiss.index_factory(256, "IVF%s,Flat" % n_ivf)
```
Among the arguments of index_factory, the first is the number of dimensions of the vector, the second is the index factory string, and the third is the distance metric to use.
For more detailed notation, see https://github.com/facebookresearch/faiss/wiki/The-index-factory
## index for distance
Two metrics are typically used as the similarity measure for embeddings:
- Euclidean distance (METRIC_L2)
- inner product (METRIC_INNER_PRODUCT)
Euclidean distance takes the squared difference in each dimension, sums the differences over all dimensions, and then takes the square root. This is the same distance we use in everyday 2D and 3D space.
The inner product is usually not used as a similarity measure directly; instead, cosine similarity, i.e. the inner product of L2-normalized vectors, is generally used.
Which is better depends on the case, but cosine similarity is commonly used for embeddings obtained from word2vec and for similar-image retrieval models trained with ArcFace. To L2-normalize a vector X with numpy, use the following code, with eps small enough to avoid division by zero.
```python
X_normed = X / np.maximum(eps, np.linalg.norm(X, ord=2, axis=-1, keepdims=True))
```
You can also change the distance metric used for the calculation by choosing the value passed as the third argument of the index factory.
```python
index = faiss.index_factory(dimension, text, faiss.METRIC_INNER_PRODUCT)
```
## IVF
IVF (Inverted file indexes) is an algorithm similar to the inverted index in full-text search.
During learning, the search target is clustered with kmeans, and Voronoi partitioning is performed using the cluster center. Each data point is assigned a cluster, so we create a dictionary that looks up the data points from the clusters.
For example, if clusters are assigned as follows
|index|Cluster|
|-----|-------|
|1|A|
|2|B|
|3|A|
|4|C|
|5|B|
The resulting inverted index looks like this:
|cluster|index|
|-------|-----|
|A|1, 3|
|B|2, 5|
|C|4|
When searching, we first search n_probe clusters from the clusters, and then calculate the distances for the data points belonging to each cluster.
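To make the search step concrete, here is a hedged, self-contained sketch of training an IVF index, querying it with nprobe, and blending the retrieved training features back in, mirroring the top-1 retrieval and index_rate mixing used in the realtime GUI elsewhere in this commit (the random data, variable names and index_rate value are purely illustrative):

```python
import numpy as np
import faiss

rng = np.random.default_rng(0)
big_npy = rng.standard_normal((10000, 256)).astype("float32")  # stand-in training features
query = rng.standard_normal((200, 256)).astype("float32")      # stand-in HuBERT frames to convert

n_ivf = big_npy.shape[0] // 39
index = faiss.index_factory(256, "IVF%s,Flat" % n_ivf)
index.train(big_npy)
index.add(big_npy)
faiss.extract_index_ivf(index).nprobe = 1     # number of Voronoi cells visited per query

_, I = index.search(query, 1)                 # top-1 neighbour for every frame
retrieved = big_npy[I.squeeze()]              # nearest training feature per frame

index_rate = 0.5                              # illustrative blending weight
mixed = index_rate * retrieved + (1 - index_rate) * query
```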
# recommend parameter
There are official guidelines on how to choose an index, so I will explain accordingly.
https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
For datasets below 1M, 4bit-PQ is the most efficient method available in faiss as of April 2023.
Combining this with IVF, narrowing down the candidates with 4bit-PQ and finally recalculating the distances with an exact index, can be expressed with the following index factory string.
```python
index = faiss.index_factory(256, "IVF1024,PQ128x4fs,RFlat")
```
## Recommended parameters for IVF
Consider the case of too many IVF lists: if, for example, the number of lists equals the number of data points, coarse quantization degenerates into a naive exhaustive search and is inefficient.
For 1M points or fewer, IVF values between 4*sqrt(N) and 16*sqrt(N) are recommended for N data points.
Since the calculation time grows in proportion to n_probe, choose it by trading off against the accuracy you need. Personally, I don't think RVC needs that much accuracy, so n_probe = 1 is fine. A small helper below illustrates the guideline.
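This is only the rule of thumb quoted above, not a tuned recommendation:

```python
import numpy as np

def recommended_n_ivf_range(n_points):
    """4*sqrt(N) .. 16*sqrt(N) IVF lists are suggested for datasets of 1M points or fewer."""
    root = np.sqrt(n_points)
    return int(4 * root), int(16 * root)

# e.g. 100k feature frames -> roughly 1264 .. 5059 IVF lists
print(recommended_n_ivf_range(100_000))
```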
## FastScan
FastScan is a method that approximates distances with product quantization and evaluates them inside CPU registers, which makes it fast.
Product quantization clusters each group of d dimensions independently (usually d = 2) during training, precomputes the distances between clusters, and stores them in a lookup table. At prediction time the distance of each dimension group can then be read from the lookup table in O(1).
The number you specify after PQ therefore usually specifies half the dimension of the vector.
For a more detailed description of FastScan, please refer to the official documentation.
https://github.com/facebookresearch/faiss/wiki/Fast-accumulation-of-PQ-and-AQ-codes-(FastScan)
## RFlat
RFlat is an instruction to recalculate the rough distance calculated by FastScan with the exact distance specified by the third argument of index factory.
When getting k neighbors, k*k_factor points are recalculated.
# Techniques for embedding
## alpha query expansion
Query expansion is a technique used in search, for example in full-text search, where a few words are added to the query to improve accuracy. Several variants have also been proposed for vector search; among them, α-query expansion is known to be highly effective and requires no additional training. It is described in [Attention-Based Query Expansion Learning](https://arxiv.org/abs/2007.08019) and was used in the [2nd place solution of the kaggle shopee competition](https://www.kaggle.com/code/lyakaap/2nd-place-solution/notebook).
α-query expansion sums each vector with its neighboring vectors, weighted by the similarity raised to the power α. The code example below applies this; replace big_npy with its α-query-expanded version.
```python
alpha = 3.
index = faiss.index_factory(256, "IVF512,PQ128x4fs,RFlat")
original_norm = np.maximum(np.linalg.norm(big_npy, ord=2, axis=1, keepdims=True), 1e-9)
big_npy /= original_norm
index.train(big_npy)
index.add(big_npy)
dist, neighbor = index.search(big_npy, num_expand)
expand_arrays = []
ixs = np.arange(big_npy.shape[0])
for i in range(-(-big_npy.shape[0]//batch_size)):
ix = ixs[i*batch_size:(i+1)*batch_size]
weight = np.power(np.einsum("nd,nmd->nm", big_npy[ix], big_npy[neighbor[ix]]), alpha)
expand_arrays.append(np.sum(big_npy[neighbor[ix]] * np.expand_dims(weight, axis=2),axis=1))
big_npy = np.concatenate(expand_arrays, axis=0)
# normalize index version
big_npy = big_npy / np.maximum(np.linalg.norm(big_npy, ord=2, axis=1, keepdims=True), 1e-9)
```
This is a technique that can be applied both to the query that does the search and to the DB being searched.
## Compress embedding with MiniBatch KMeans
If total_fea.npy is too large, it is possible to shrink the set of vectors using KMeans.
The following code compresses the embeddings: set n_clusters to the target size, and set batch_size to 256 * the number of CPU cores to fully benefit from CPU parallelization.
```python
import multiprocessing
from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(n_clusters=10000, batch_size=256 * multiprocessing.cpu_count(), init="random")
kmeans.fit(big_npy)
sample_npy = kmeans.cluster_centers_
```

docs/faiss_tips_ja.md (new file, 146 lines)

@@ -0,0 +1,146 @@
faiss tuning TIPS
==================
# about faiss
faissはfacebook researchの開発する、密なベクトルに対する近傍探索をまとめたライブラリで、多くの近似近傍探索の手法を効率的に実装しています。
近似近傍探索はある程度精度を犠牲にしながら高速に類似するベクトルを探します。
## faiss in RVC
RVCではHuBERTで変換した特徴量のEmbeddingに対し、学習データから生成されたEmbeddingと類似するものを検索し、混ぜることでより元の音声に近い変換を実現しています。ただ、この検索は愚直に行うと時間がかかるため、近似近傍探索を用いることで高速な変換を実現しています。
# 実装のoverview
モデルが配置されている '/logs/your-experiment/3_feature256'には各音声データからHuBERTで抽出された特徴量が配置されています。
ここからnpyファイルをファイル名でソートした順番で読み込み、ベクトルを連結してbig_npyを作成します。(このベクトルのshapeは[N, 256]です。)
big_npyを/logs/your-experiment/total_fea.npyとして保存した後、faissを学習させます。
2023/04/18時点ではfaissのindex factoryの機能を用いて、L2距離に基づくIVFを用いています。
IVFの分割数(n_ivf)はN//39で、n_probeはint(np.power(n_ivf, 0.3))が採用されています。(infer-web.pyのtrain_index周りを探してください。)
本Tipsではまずこれらのパラメータの意味を解説し、その後よりよいindexを作成するための開発者向けアドバイスを書きます。
# 手法の解説
## index factory
index factoryは複数の近似近傍探索の手法を繋げるパイプラインをstringで表記するfaiss独自の記法です。
これにより、index factoryの文字列を変更するだけで様々な近似近傍探索の手法を試せます。
RVCでは以下のように使われています。
```python
index = faiss.index_factory(256, "IVF%s,Flat" % n_ivf)
```
index_factoryの引数のうち、1つ目はベクトルの次元数、2つ目はindex factoryの文字列で、3つ目には用いる距離を指定することができます。
より詳細な記法については
https://github.com/facebookresearch/faiss/wiki/The-index-factory
## 距離指標
embeddingの類似度として用いられる代表的な指標として以下の二つがあります。
- ユークリッド距離(METRIC_L2)
- 内積(METRIC_INNER_PRODUCT)
ユークリッド距離では各次元において二乗の差をとり、全次元の差を足してから平方根をとります。これは日常的に用いる2次元、3次元での距離と同じです。
内積はこのままでは類似度の指標として用いず、一般的にはL2ノルムで正規化してから内積をとるコサイン類似度を用います。
どちらがよいかは場合によりますが、word2vec等で得られるembeddingやArcFace等で学習した類似画像検索のモデルではコサイン類似度が用いられることが多いです。ベクトルXに対してl2正規化をnumpyで行う場合は、0 divisionを避けるために十分に小さな値をepsとして以下のコードで可能です。
```python
X_normed = X / np.maximum(eps, np.linalg.norm(X, ord=2, axis=-1, keepdims=True))
```
また、index factoryには第3引数に渡す値を選ぶことで計算に用いる距離指標を変更できます。
```python
index = faiss.index_factory(dimention, text, faiss.METRIC_INNER_PRODUCT)
```
## IVF
IVF(Inverted file indexes)は全文検索における転置インデックスと似たようなアルゴリズムです。
学習時には検索対象に対してkmeansでクラスタリングを行い、クラスタ中心を用いてボロノイ分割を行います。各データ点には一つずつクラスタが割り当てられるので、クラスタからデータ点を逆引きする辞書を作成します。
例えば以下のようにクラスタが割り当てられた場合
|index|クラスタ|
|-----|-------|
|1|A|
|2|B|
|3|A|
|4|C|
|5|B|
作成される転置インデックスは以下のようになります。
|クラスタ|index|
|-------|-----|
|A|1, 3|
|B|2, 5|
|C|4|
検索時にはまずクラスタからn_probe個のクラスタを検索し、次にそれぞれのクラスタに属するデータ点について距離を計算します。
# 推奨されるパラメータ
indexの選び方については公式にガイドラインがあるので、それに準じて説明します。
https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
1M以下のデータセットにおいては4bit-PQが2023年4月時点ではfaissで利用できる最も効率的な手法です。
これをIVFと組み合わせ、4bit-PQで候補を絞り、最後に正確な指標で距離を再計算するには以下のindex factoryを用いることで記載できます。
```python
index = faiss.index_factory(256, "IVF1024,PQ128x4fs,RFlat")
```
## IVFの推奨パラメータ
IVFの数が多すぎる場合、たとえばデータ数の数だけIVFによる粗量子化を行うと、これは愚直な全探索と同じになり効率が悪いです。
1M以下の場合ではIVFの値はデータ点の数Nに対して4*sqrt(N) ~ 16*sqrt(N)に推奨しています。
n_probeはn_probeの数に比例して計算時間が増えるので、精度と相談して適切に選んでください。個人的にはRVCにおいてそこまで精度は必要ないと思うのでn_probe = 1で良いと思います。
## FastScan
FastScanは直積量子化で大まかに距離を近似するのを、レジスタ内で行うことにより高速に行うようにした手法です。
直積量子化は学習時にd次元ごと(通常はd=2)に独立してクラスタリングを行い、クラスタ同士の距離を事前計算してlookup tableを作成します。予測時はlookup tableを見ることで各次元の距離をO(1)で計算できます。
そのため、PQの次に指定する数字は通常ベクトルの半分の次元を指定します。
FastScanに関するより詳細な説明は公式のドキュメントを参照してください。
https://github.com/facebookresearch/faiss/wiki/Fast-accumulation-of-PQ-and-AQ-codes-(FastScan)
## RFlat
RFlatはFastScanで計算した大まかな距離を、index factoryの第三引数で指定した正確な距離で再計算する指示です。
k個の近傍を取得する際は、k*k_factor個の点について再計算が行われます。
# Embeddingに関するテクニック
## alpha query expansion
クエリ拡張は検索で使われるテクニックで、例えば全文検索では入力された検索文に単語を幾つか追加することで検索精度を上げることがあります。ベクトル検索にもいくつか提唱されていて、その内追加の学習がいらず効果が高い手法としてα-query expansionが知られています。論文では[Attention-Based Query Expansion Learning](https://arxiv.org/abs/2007.08019)などで紹介されていて、[kaggleのshopeeコンペの2位の解法](https://www.kaggle.com/code/lyakaap/2nd-place-solution/notebook)にも用いられていました。
α-query expansionはあるベクトルに対し、近傍のベクトルを類似度のα乗した重みで足し合わせることでできます。いかにコードの例を張ります。big_npyをα query expansionしたものに置き換えます。
```python
alpha = 3.
index = faiss.index_factory(256, "IVF512,PQ128x4fs,RFlat")
original_norm = np.maximum(np.linalg.norm(big_npy, ord=2, axis=1, keepdims=True), 1e-9)
big_npy /= original_norm
index.train(big_npy)
index.add(big_npy)
dist, neighbor = index.search(big_npy, num_expand)
expand_arrays = []
ixs = np.arange(big_npy.shape[0])
for i in range(-(-big_npy.shape[0]//batch_size)):
ix = ixs[i*batch_size:(i+1)*batch_size]
weight = np.power(np.einsum("nd,nmd->nm", big_npy[ix], big_npy[neighbor[ix]]), alpha)
expand_arrays.append(np.sum(big_npy[neighbor[ix]] * np.expand_dims(weight, axis=2),axis=1))
big_npy = np.concatenate(expand_arrays, axis=0)
# normalize index version
big_npy = big_npy / np.maximum(np.linalg.norm(big_npy, ord=2, axis=1, keepdims=True), 1e-9)
```
これは、検索を行うクエリにも、検索対象のDBにも適応可能なテクニックです。
## MiniBatch KMeansによるembeddingの圧縮
total_fea.npyが大きすぎる場合、KMeansを用いてベクトルを小さくすることが可能です。
以下のコードで、embeddingの圧縮が可能です。n_clustersは圧縮したい大きさを指定し、batch_sizeは256 * CPUのコア数を指定することでCPUの並列化の恩恵を十分に得ることができます。
```python
import multiprocessing
from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(n_clusters=10000, batch_size=256 * multiprocessing.cpu_count(), init="random")
kmeans.fit(big_npy)
sample_npy = kmeans.cluster_centers_
```

docs/training_tips_en.md (new file, 52 lines)

@@ -0,0 +1,52 @@
Instructions and tips for RVC training
======================================
This TIPS explains how data training is done.
# Training flow
I will explain along the steps in the training tab of the GUI.
## step1
Set the experiment name here. You can also set here whether the model should take pitch into account.
Data for each experiment is placed in `/logs/experiment name/`.
## step2a
Loads and preprocesses audio.
### load audio
If you specify a folder with audio, the audio files in that folder will be read automatically.
For example, if you specify `C:\Users\hoge\voices`, then `C:\Users\hoge\voices\voice.mp3` will be loaded, but `C:\Users\hoge\voices\dir\voice.mp3` will not be loaded.
Since ffmpeg is used internally for reading audio, if the extension is supported by ffmpeg, it will be read automatically.
After converting to int16 with ffmpeg, the audio is converted to float32 and normalized to between -1 and 1.
### denoising
The audio is smoothed by scipy's filtfilt.
### Audio Split
First, the input audio is split by detecting silent sections that last longer than a certain period (max_sil_kept=5 seconds?). After splitting on silence, the audio is cut every 4 seconds with an overlap of 0.3 seconds. For the resulting chunks of at most 4 seconds, the volume is normalized and the wav files are written to `/logs/experiment name/0_gt_wavs`, then resampled to a 16k sampling rate and written as wav files to `/logs/experiment name/1_16k_wavs`.
## step2b
### Extract pitch
Pitch information (=f0) is extracted from the wav files using the method built into parselmouth or pyworld and saved in `/logs/experiment name/2a_f0`. The pitch information is then converted logarithmically to an integer between 1 and 255 and saved in `/logs/experiment name/2b-f0nsf`.
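For reference, a minimal sketch of such a logarithmic (mel-style) coarse mapping, patterned after the get_f0 conversion in gui.py elsewhere in this commit; the exact bounds and rounding used by the training scripts may differ:

```python
import numpy as np

def f0_to_coarse(f0_hz, f0_min=50.0, f0_max=1100.0):
    """Map f0 in Hz to integers 1..255 on a log (mel-like) scale."""
    f0_mel_min = 1127 * np.log(1 + f0_min / 700)
    f0_mel_max = 1127 * np.log(1 + f0_max / 700)
    f0_mel = 1127 * np.log(1 + f0_hz / 700)
    f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - f0_mel_min) * 254 / (f0_mel_max - f0_mel_min) + 1
    f0_mel[f0_mel <= 1] = 1        # unvoiced frames (f0 == 0) stay at 1
    f0_mel[f0_mel > 255] = 255
    return np.rint(f0_mel).astype(np.int32)

coarse = f0_to_coarse(np.array([0.0, 110.0, 440.0]))  # -> approximately [1, 23, 122]
```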
### Extract feature_print
Convert the wav file to embedding in advance using HuBERT. Read the wav file saved in `/logs/experiment name/1_16k_wavs`, convert the wav file to 256-dimensional features with HuBERT, and save in npy format in `/logs/experiment name/3_feature256`.
## step3
train the model.
### Glossary for Beginners
In deep learning, the data set is divided and the learning proceeds little by little. In one model update (step), batch_size data are retrieved and predictions and error corrections are performed. Doing this once for a dataset counts as one epoch.
Therefore, the learning time is the learning time per step x (the number of data in the dataset / batch size) x the number of epochs. In general, a larger batch size makes learning more stable and reduces (learning time per step ÷ batch size), but it uses more GPU memory. GPU RAM can be checked with the nvidia-smi command. Training can be completed in a shorter time by increasing the batch size as far as the machine in your execution environment allows.
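For example, with 1,000 audio segments in the dataset, a batch size of 8 and 20 epochs, one epoch is 1,000 / 8 = 125 steps, so training runs 125 x 20 = 2,500 steps in total (the numbers here are purely illustrative).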
### Specify pretrained model
RVC starts training the model from pretrained weights instead of from scratch, so it can be trained with a small dataset. By default it loads `rvc-location/pretrained/f0G40k.pth` and `rvc-location/pretrained/f0D40k.pth`. During training, model parameters are saved to `logs/experiment name/G_{}.pth` and `logs/experiment name/D_{}.pth` every save_every_epoch; by specifying one of these paths you can resume training, or start training from model weights learned in a different experiment.
### learning index
RVC saves the HuBERT feature values used during training, and during inference, searches for feature values that are similar to the feature values used during learning to perform inference. In order to perform this search at high speed, the index is learned in advance.
For index learning, we use the approximate nearest-neighbor search library faiss. The feature values in `/logs/experiment name/3_feature256` are read, the concatenated features are saved as `/logs/experiment name/total_fea.npy`, and the index trained from them is saved as `/logs/experiment name/add_XXX.index`.
### Button description
- Train model: After executing step2b, press this button to train the model.
- Train feature index: After training the model, perform index learning.
- One-click training: step2b, model training and feature index training all at once.

docs/training_tips_ja.md (new file, 53 lines)

@@ -0,0 +1,53 @@
RVCの訓練における説明、およびTIPS
===============================
本TIPSではどのようにデータの訓練が行われているかを説明します。
# 訓練の流れ
GUIの訓練タブのstepに沿って説明します。
## step1
実験名の設定を行います。また、モデルにピッチを考慮させるかもここで設定できます。
各実験のデータは`/logs/実験名/`に配置されます。
## step2a
音声の読み込みと前処理を行います。
### load audio
音声のあるフォルダを指定すると、そのフォルダ内にある音声ファイルを自動で読み込みます。
例えば`C:Users\hoge\voices`を指定した場合、`C:Users\hoge\voices\voice.mp3`は読み込まれますが、`C:Users\hoge\voices\dir\voice.mp3`は読み込まれません。
音声の読み込みには内部でffmpegを利用しているので、ffmpegで対応している拡張子であれば自動的に読み込まれます。
ffmpegでint16に変換した後、float32に変換し、-1 ~ 1の間に正規化されます。
### denoising
音声についてscipyのfiltfiltによる平滑化を行います。
### 音声の分割
入力した音声はまず、一定期間(max_sil_kept=5秒?)より長く無音が続く部分を検知して音声を分割します。無音で音声を分割した後は、0.3秒のoverlapを含む4秒ごとに音声を分割します。4秒以内に区切られた音声は、音量の正規化を行った後wavファイルを`/logs/実験名/0_gt_wavs`に、そこから16kのサンプリングレートに変換して`/logs/実験名/1_16k_wavs`にwavファイルで保存します。
## step2b
### ピッチの抽出
wavファイルからピッチ(音の高低)の情報を抽出します。parselmouthやpyworldに内蔵されている手法でピッチ情報(=f0)を抽出し、`/logs/実験名/2a_f0`に保存します。その後、ピッチ情報を対数で変換して1~255の整数に変換し、`/logs/実験名/2b-f0nsf`に保存します。
### feature_printの抽出
HuBERTを用いてwavファイルを事前にembeddingに変換します。`/logs/実験名/1_16k_wavs`に保存したwavファイルを読み込み、HuBERTでwavファイルを256次元の特徴量に変換し、npy形式で`/logs/実験名/3_feature256`に保存します。
## step3
モデルのトレーニングを行います。
### 初心者向け用語解説
深層学習ではデータセットを分割し、少しずつ学習を進めていきます。一回のモデルの更新(step)では、batch_size個のデータを取り出し予測と誤差の修正を行います。これをデータセットに対して一通り行うと一epochと数えます。
そのため、学習時間は 1step当たりの学習時間 x (データセット内のデータ数 ÷ バッチサイズ) x epoch数 かかります。一般にバッチサイズを大きくするほど学習は安定し、(1step当たりの学習時間÷バッチサイズ)は小さくなりますが、その分GPUのメモリを多く使用します。GPUのRAMはnvidia-smiコマンド等で確認できます。実行環境のマシンに合わせてバッチサイズをできるだけ大きくするとより短時間で学習が可能です。
### pretrained modelの指定
RVCではモデルの訓練を0からではなく、事前学習済みの重みから開始するため、少ないデータセットで学習を行えます。デフォルトでは`RVCのある場所/pretrained/f0G40k.pth``RVCのある場所/pretrained/f0D40k.pth`を読み込みます。学習時はsave_every_epochごとにモデルのパラメータが`logs/実験名/G_{}.pth``logs/実験名/D_{}.pth`に保存されますが、このパスを指定することで学習を再開したり、もしくは違う実験で学習したモデルの重みから学習を開始できます。
### indexの学習
RVCでは学習時に使われたHuBERTの特徴量を保存し、推論時は学習時の特徴量から近い特徴量を探してきて推論を行います。この検索を高速に行うために事前にindexの学習を行います。
indexの学習には近似近傍探索ライブラリのfaissを用います。`/logs/実験名/3_feature256`の特徴量を読み込み、全て結合させた特徴量を`/logs/実験名/total_fea.npy`として保存、それを用いて学習したindexを`/logs/実験名/add_XXX.index`として保存します。
### ボタンの説明
- モデルのトレーニング: step2bまでを実行した後、このボタンを押すとモデルの学習を行います。
- 特徴インデックスのトレーニング: モデルのトレーニング後、indexの学習を行います。
- ワンクリックトレーニング: step2bまでとモデルのトレーニング、特徴インデックスのトレーニングを一括で行います。


@@ -1,28 +1,34 @@
-from infer_pack.models_onnx import SynthesizerTrnMs256NSFsid
+from infer_pack.models_onnx_moess import SynthesizerTrnMs256NSFsidM
+from infer_pack.models_onnx import SynthesizerTrnMs256NSFsidO
 import torch
-person = "Shiroha/shiroha.pth"
-exported_path = "model.onnx"
-cpt = torch.load(person, map_location="cpu")
-cpt["config"][-3] = cpt["weight"]["emb_g.weight"].shape[0]  # n_spk
-print(*cpt["config"])
-net_g = SynthesizerTrnMs256NSFsid(*cpt["config"], is_half=False)
-net_g.load_state_dict(cpt["weight"], strict=False)
-test_phone = torch.rand(1, 200, 256)
-test_phone_lengths = torch.tensor([200]).long()
-test_pitch = torch.randint(size=(1, 200), low=5, high=255)
-test_pitchf = torch.rand(1, 200)
-test_ds = torch.LongTensor([0])
-test_rnd = torch.rand(1, 192, 200)
-input_names = ["phone", "phone_lengths", "pitch", "pitchf", "ds", "rnd"]
-output_names = [
-    "audio",
-]
-device = "cpu"
-torch.onnx.export(
+if __name__ == '__main__':
+    MoeVS = True  # 模型是否为MoeVoiceStudio原MoeSS使用
+    ModelPath = "Shiroha/shiroha.pth"  # 模型路径
+    ExportedPath = "model.onnx"  # 输出路径
+    hidden_channels = 256  # hidden_channels为768Vec做准备
+    cpt = torch.load(ModelPath, map_location="cpu")
+    cpt["config"][-3] = cpt["weight"]["emb_g.weight"].shape[0]  # n_spk
+    print(*cpt["config"])
+    test_phone = torch.rand(1, 200, hidden_channels)  # hidden unit
+    test_phone_lengths = torch.tensor([200]).long()  # hidden unit 长度(貌似没啥用)
+    test_pitch = torch.randint(size=(1, 200), low=5, high=255)  # 基频(单位赫兹)
+    test_pitchf = torch.rand(1, 200)  # nsf基频
+    test_ds = torch.LongTensor([0])  # 说话人ID
+    test_rnd = torch.rand(1, 192, 200)  # 噪声(加入随机因子)
+    device = "cpu"  # 导出时设备(不影响使用模型)
+    if MoeVS:
+        net_g = SynthesizerTrnMs256NSFsidM(*cpt["config"], is_half=False)  # fp32导出C++要支持fp16必须手动将内存重新排列所以暂时不用fp16
+        net_g.load_state_dict(cpt["weight"], strict=False)
+        input_names = ["phone", "phone_lengths", "pitch", "pitchf", "ds", "rnd"]
+        output_names = [
+            "audio",
+        ]
+        torch.onnx.export(
             net_g,
             (
                 test_phone.to(device),
@@ -32,7 +38,7 @@ torch.onnx.export(
                 test_ds.to(device),
                 test_rnd.to(device),
             ),
-    exported_path,
+            ExportedPath,
             dynamic_axes={
                 "phone": [1],
                 "pitch": [1],
@@ -44,4 +50,32 @@ torch.onnx.export(
             verbose=False,
             input_names=input_names,
             output_names=output_names,
         )
+    else:
+        net_g = SynthesizerTrnMs256NSFsidO(*cpt["config"], is_half=False)  # fp32导出C++要支持fp16必须手动将内存重新排列所以暂时不用fp16
+        net_g.load_state_dict(cpt["weight"], strict=False)
+        input_names = ["phone", "phone_lengths", "pitch", "pitchf", "ds"]
+        output_names = [
+            "audio",
+        ]
+        torch.onnx.export(
+            net_g,
+            (
+                test_phone.to(device),
+                test_phone_lengths.to(device),
+                test_pitch.to(device),
+                test_pitchf.to(device),
+                test_ds.to(device),
+            ),
+            ExportedPath,
+            dynamic_axes={
+                "phone": [1],
+                "pitch": [1],
+                "pitchf": [1],
+            },
+            do_constant_folding=False,
+            opset_version=16,
+            verbose=False,
+            input_names=input_names,
+            output_names=output_names,
+        )

export_onnx_old.py (new file, 47 lines)

@@ -0,0 +1,47 @@
from infer_pack.models_onnx_moess import SynthesizerTrnMs256NSFsidM
import torch
person = "Shiroha/shiroha.pth"
exported_path = "model.onnx"
cpt = torch.load(person, map_location="cpu")
cpt["config"][-3] = cpt["weight"]["emb_g.weight"].shape[0] # n_spk
print(*cpt["config"])
net_g = SynthesizerTrnMs256NSFsidM(*cpt["config"], is_half=False)
net_g.load_state_dict(cpt["weight"], strict=False)
test_phone = torch.rand(1, 200, 256)
test_phone_lengths = torch.tensor([200]).long()
test_pitch = torch.randint(size=(1, 200), low=5, high=255)
test_pitchf = torch.rand(1, 200)
test_ds = torch.LongTensor([0])
test_rnd = torch.rand(1, 192, 200)
input_names = ["phone", "phone_lengths", "pitch", "pitchf", "ds", "rnd"]
output_names = [
    "audio",
]
device = "cpu"
torch.onnx.export(
    net_g,
    (
        test_phone.to(device),
        test_phone_lengths.to(device),
        test_pitch.to(device),
        test_pitchf.to(device),
        test_ds.to(device),
        test_rnd.to(device),
    ),
    exported_path,
    dynamic_axes={
        "phone": [1],
        "pitch": [1],
        "pitchf": [1],
        "rnd": [2],
    },
    do_constant_folding=False,
    opset_version=16,
    verbose=False,
    input_names=input_names,
    output_names=output_names,
)


@@ -33,7 +33,9 @@ class FeatureInput(object):
         self.f0_mel_max = 1127 * np.log(1 + self.f0_max / 700)

     def compute_f0(self, path, f0_method):
-        x, sr = librosa.load(path, self.fs)
+        # default resample type of librosa.resample is "soxr_hq".
+        # Quality: soxr_vhq > soxr_hq
+        x, sr = librosa.load(path, self.fs)  # , res_type='soxr_vhq'
         p_len = x.shape[0] // self.hop
         assert sr == self.fs
         if f0_method == "pm":


@@ -28,3 +28,4 @@ process("gui.py")
 # Save as a JSON file
 with open("./i18n/zh_CN.json", "w", encoding="utf-8") as f:
     json.dump(data, f, ensure_ascii=False, indent=4)
+    f.write("\n")

go-realtime-gui.bat (new file, 2 lines)

@@ -0,0 +1,2 @@
runtime\python.exe gui.py
pause

gui.py (115 lines changed)

@@ -1,18 +1,22 @@
+import os, sys
+
+now_dir = os.getcwd()
+sys.path.append(now_dir)
 import PySimpleGUI as sg
 import sounddevice as sd
 import noisereduce as nr
 import numpy as np
 from fairseq import checkpoint_utils
-import librosa, torch, parselmouth, faiss, time, threading
+import librosa, torch, pyworld, faiss, time, threading
 import torch.nn.functional as F
 import torchaudio.transforms as tat
+import scipy.signal as signal

 # import matplotlib.pyplot as plt
 from infer_pack.models import SynthesizerTrnMs256NSFsid, SynthesizerTrnMs256NSFsid_nono
 from i18n import I18nAuto

 i18n = I18nAuto()
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
@@ -23,16 +27,20 @@ class RVC:
         """
         初始化
         """
+        try:
             self.f0_up_key = key
             self.time_step = 160 / 16000 * 1000
             self.f0_min = 50
             self.f0_max = 1100
             self.f0_mel_min = 1127 * np.log(1 + self.f0_min / 700)
             self.f0_mel_max = 1127 * np.log(1 + self.f0_max / 700)
+            self.sr = 16000
+            self.window = 160
+            if index_rate != 0:
                 self.index = faiss.read_index(index_path)
-        self.index_rate = index_rate
-        """NOT YET USED"""
                 self.big_npy = np.load(npy_path)
+                print("index search enabled")
+            self.index_rate = index_rate
             model_path = hubert_path
             print("load model(s) from {}".format(model_path))
             models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task(
@@ -46,8 +54,8 @@ class RVC:
             cpt = torch.load(pth_path, map_location="cpu")
             tgt_sr = cpt["config"][-1]
             cpt["config"][-3] = cpt["weight"]["emb_g.weight"].shape[0]  # n_spk
-        if_f0 = cpt.get("f0", 1)
-        if if_f0 == 1:
+            self.if_f0 = cpt.get("f0", 1)
+            if self.if_f0 == 1:
                 self.net_g = SynthesizerTrnMs256NSFsid(*cpt["config"], is_half=True)
             else:
                 self.net_g = SynthesizerTrnMs256NSFsid_nono(*cpt["config"])
@@ -55,38 +63,46 @@ class RVC:
             print(self.net_g.load_state_dict(cpt["weight"], strict=False))
             self.net_g.eval().to(device)
             self.net_g.half()
+        except Exception as e:
+            print(e)

-    def get_f0_coarse(self, f0):
+    def get_f0(self, x, f0_up_key, inp_f0=None):
+        x_pad = 1
+        f0_min = 50
+        f0_max = 1100
+        f0_mel_min = 1127 * np.log(1 + f0_min / 700)
+        f0_mel_max = 1127 * np.log(1 + f0_max / 700)
+        f0, t = pyworld.harvest(
+            x.astype(np.double),
+            fs=self.sr,
+            f0_ceil=f0_max,
+            f0_floor=f0_min,
+            frame_period=10,
+        )
+        f0 = pyworld.stonemask(x.astype(np.double), f0, t, self.sr)
+        f0 = signal.medfilt(f0, 3)
+        f0 *= pow(2, f0_up_key / 12)
+        # with open("test.txt","w")as f:f.write("\n".join([str(i)for i in f0.tolist()]))
+        tf0 = self.sr // self.window  # 每秒f0点数
+        if inp_f0 is not None:
+            delta_t = np.round(
+                (inp_f0[:, 0].max() - inp_f0[:, 0].min()) * tf0 + 1
+            ).astype("int16")
+            replace_f0 = np.interp(
+                list(range(delta_t)), inp_f0[:, 0] * 100, inp_f0[:, 1]
+            )
+            shape = f0[x_pad * tf0 : x_pad * tf0 + len(replace_f0)].shape[0]
+            f0[x_pad * tf0 : x_pad * tf0 + len(replace_f0)] = replace_f0[:shape]
+        # with open("test_opt.txt","w")as f:f.write("\n".join([str(i)for i in f0.tolist()]))
+        f0bak = f0.copy()
         f0_mel = 1127 * np.log(1 + f0 / 700)
-        f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - self.f0_mel_min) * 254 / (
-            self.f0_mel_max - self.f0_mel_min
+        f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - f0_mel_min) * 254 / (
+            f0_mel_max - f0_mel_min
         ) + 1
         f0_mel[f0_mel <= 1] = 1
         f0_mel[f0_mel > 255] = 255
-        # f0_mel[f0_mel > 188] = 188
         f0_coarse = np.rint(f0_mel).astype(np.int)
-        return f0_coarse
-
-    def get_f0(self, x, p_len, f0_up_key=0):
-        f0 = (
-            parselmouth.Sound(x, 16000)
-            .to_pitch_ac(
-                time_step=self.time_step / 1000,
-                voicing_threshold=0.6,
-                pitch_floor=self.f0_min,
-                pitch_ceiling=self.f0_max,
-            )
-            .selected_array["frequency"]
-        )
-        pad_size = (p_len - len(f0) + 1) // 2
-        if pad_size > 0 or p_len - len(f0) - pad_size > 0:
-            f0 = np.pad(f0, [[pad_size, p_len - len(f0) - pad_size]], mode="constant")
-        f0 *= pow(2, f0_up_key / 12)
-        # f0=suofang(f0)
-        f0bak = f0.copy()
-        f0_coarse = self.get_f0_coarse(f0)
-        return f0_coarse, f0bak
+        return f0_coarse, f0bak  # 1-0

     def infer(self, feats: torch.Tensor) -> np.ndarray:
         """
@@ -107,11 +123,7 @@ class RVC:
         feats = self.model.final_proj(logits[0])

         ####索引优化
-        if (
-            isinstance(self.index, type(None)) == False
-            and isinstance(self.big_npy, type(None)) == False
-            and self.index_rate != 0
-        ):
+        if hasattr(self, "index") and hasattr(self, "big_npy") and self.index_rate != 0:
             npy = feats[0].cpu().numpy().astype("float32")
             _, I = self.index.search(npy, 1)
             npy = self.big_npy[I.squeeze()].astype("float16")
@@ -119,30 +131,42 @@ class RVC:
                 torch.from_numpy(npy).unsqueeze(0).to(device) * self.index_rate
                 + (1 - self.index_rate) * feats
             )
+        else:
+            print("index search FAIL or disabled")
         feats = F.interpolate(feats.permute(0, 2, 1), scale_factor=2).permute(0, 2, 1)
         torch.cuda.synchronize()
-        # p_len = min(feats.shape[1],10000,pitch.shape[0])#太大了爆显存
-        p_len = min(feats.shape[1], 12000)  #
         print(feats.shape)
-        pitch, pitchf = self.get_f0(audio, p_len, self.f0_up_key)
-        p_len = min(feats.shape[1], 12000, pitch.shape[0])  # 太大了爆显存
+        if self.if_f0 == 1:
+            pitch, pitchf = self.get_f0(audio, self.f0_up_key)
+            p_len = min(feats.shape[1], 13000, pitch.shape[0])  # 太大了爆显存
+        else:
+            pitch, pitchf = None, None
+            p_len = min(feats.shape[1], 13000)  # 太大了爆显存
         torch.cuda.synchronize()
         # print(feats.shape,pitch.shape)
         feats = feats[:, :p_len, :]
+        if self.if_f0 == 1:
             pitch = pitch[:p_len]
             pitchf = pitchf[:p_len]
-        p_len = torch.LongTensor([p_len]).to(device)
             pitch = torch.LongTensor(pitch).unsqueeze(0).to(device)
             pitchf = torch.FloatTensor(pitchf).unsqueeze(0).to(device)
+        p_len = torch.LongTensor([p_len]).to(device)
         ii = 0  # sid
         sid = torch.LongTensor([ii]).to(device)
         with torch.no_grad():
+            if self.if_f0 == 1:
                 infered_audio = (
                     self.net_g.infer(feats, p_len, pitch, pitchf, sid)[0][0, 0]
                     .data.cpu()
                     .float()
-            )  # nsf
+                )
+            else:
+                infered_audio = (
+                    self.net_g.infer(feats, p_len, sid)[0][0, 0]
+                    .data.cpu()
+                    .float()
+                )
         torch.cuda.synchronize()
         return infered_audio
@@ -363,7 +387,7 @@ class GUI:
                 self.config.pth_path,
                 self.config.index_path,
                 self.config.npy_path,
-                self.config.index_rate,
+                self.config.index_rate
             )
             self.input_wav: np.ndarray = np.zeros(
                 self.extra_frame
@@ -485,8 +509,9 @@ class GUI:
             else:
                 outdata[:] = self.output_wav[:].repeat(2, 1).t().cpu().numpy()
             total_time = time.perf_counter() - start_time
+            print("infer time:" + str(total_time))
             self.window["infer_time"].update(int(total_time * 1000))
-            print("infer time:" + str(total_time))

     def get_devices(self, update: bool = True):
         """获取设备列表"""

@@ -11,10 +11,8 @@ def load_language_list(language):

 class I18nAuto:
     def __init__(self, language=None):
-        if language is None:
-            language = "auto"
-        if language == "auto":
-            language = locale.getdefaultlocale()[0]
+        if language in ['auto', None]:
+            language = locale.getdefaultlocale()[0]  # getlocale can't identify the system's language ((None, None))
         if not os.path.exists(f"./i18n/{language}.json"):
             language = "en_US"
         self.language = language


@@ -27,8 +27,8 @@
     "人声伴奏分离批量处理, 使用UVR5模型. <br>不带和声用HP2, 带和声且提取的人声不需要和声用HP5<br>合格的文件夹路径格式举例: E:\\codes\\py39\\vits_vc_gpu\\白鹭霜华测试样例(去文件管理器地址栏拷就行了)": "Batch processing of vocal accompaniment separation, using UVR5 model. <br>Without harmony, use HP2, with harmony and extracted vocals do not need harmony, use HP5<br>Example of qualified folder path format: E:\\ codes\\py39\\vits_vc_gpu\\Egret Shuanghua test sample (just go to the address bar of the file manager and copy it)",
     "输入待处理音频文件夹路径": "Input audio folder path",
     "模型": "Model",
-    "指定输出人声文件夹": "Specify output vocal folder",
+    "指定输出人声文件夹": "Specify vocals output folder",
-    "指定输出乐器文件夹": "Specify output instrumental folder",
+    "指定输出乐器文件夹": "Specify instrumentals output folder",
     "训练": "Train",
     "step1: 填写实验配置. 实验数据放在logs下, 每个实验一个文件夹, 需手工输入实验名路径, 内含实验配置, 日志, 训练得到的模型文件. ": "step1: Fill in the experimental configuration. The experimental data is placed under logs, and each experiment has a folder. You need to manually enter the experimental name path, which contains the experimental configuration, logs, and model files obtained from training. ",
     "输入实验名": "Input experiment name",
@@ -41,7 +41,7 @@
     "step2b: 使用CPU提取音高(如果模型带音高), 使用GPU提取特征(选择卡号)": "step2b: Use CPU to extract pitch (if the model has pitch), use GPU to extract features (select card number)",
     "以-分隔输入使用的卡号, 例如 0-1-2 使用卡0和卡1和卡2": "Enter the card numbers used separated by -, for example 0-1-2 use card 0 and card 1 and card 2",
     "显卡信息": "GPU information",
-    "提取音高使用的CPU进程数": "Number of CPU processes used for pitch extraction",
+    "提取音高使用的CPU进程数": "Number of CPU threads to use for pitch extraction",
     "选择音高提取算法:输入歌声可用pm提速,高质量语音但CPU差可用dio提速,harvest质量更好但慢": "Select pitch extraction algorithm: Use 'pm' for faster processing of singing voice, 'dio' for high-quality speech but slower processing, and 'harvest' for the best quality but slowest processing.",
     "特征提取": "Feature extraction",
     "step3: 填写训练设置, 开始训练模型和索引": "step3: Fill in the training settings, start training the model and index",
@@ -58,7 +58,7 @@
     "模型融合, 可用于测试音色融合": "Model Fusion, which can be used to test sound fusion",
     "A模型路径": "A model path.",
     "B模型路径": "B model path.",
-    "A模型权重": "A model weight.",
+    "A模型权重": "A model weight for model A.",
     "模型是否带音高指导": "Whether the model has pitch guidance.",
     "要置入的模型信息": "Model information to be placed.",
     "保存的模型名不带后缀": "Saved model name without extension.",
@@ -74,7 +74,7 @@
     "查看": "View",
     "模型提取(输入logs文件夹下大文件模型路径),适用于训一半不想训了模型没有自动提取保存小文件模型,或者想测试中间模型的情况": "Model extraction (enter the path of the large file model under the logs folder), which is suitable for half of the training and does not want to train the model without automatically extracting and saving the small file model, or if you want to test the intermediate model",
     "保存名": "Save Name",
-    "模型是否带音高指导,1是0否": "Whether the model comes with pitch guidance, 1 for yes, 0 for no",
+    "模型是否带音高指导,1是0否": "Whether the model has pitch guidance, 1 for yes, 0 for no",
     "提取": "Extract",
     "招募音高曲线前端编辑器": "Recruit front-end editors for pitch curves",
     "加开发群联系我xxxxx": "Add development group to contact me xxxxx",


@@ -11,17 +11,17 @@
     "选择音高提取算法,输入歌声可用pm提速,harvest低音好但巨慢无比": "ピッチ抽出アルゴリズムを選択してください。歌声の場合は、pmを使用して速度を上げることができます。低音が重要な場合は、harvestを使用できますが、非常に遅くなります。",
     "特征检索库文件路径": "特徴量検索データベースのファイルパス",
     "特征文件路径": "特徴量ファイルのパス",
-    "F0曲线文件, 可选, 一行一个音高, 代替默认F0及升降调": "F0曲线文件, 可选, 一行一个音高, 代替默认F0及升降调",
+    "F0曲线文件, 可选, 一行一个音高, 代替默认F0及升降调": "F0(最低共振周波数)カーブファイル(オプション、1行に1ピッチ、デフォルトのF0(最低共振周波数)とエレベーションを置き換えます。)",
     "转换": "変換",
     "输出信息": "出力情報",
     "输出音频(右下角三个点,点了可以下载)": "出力音声(右下の三点をクリックしてダウンロードできます)",
-    "批量转换, 输入待转换音频文件夹, 或上传多个音频文件, 在指定文件夹(默认opt)下输出转换的音频. ": "批量转换, 输入待转换音频文件夹, 或上传多个音频文件, 在指定文件夹(默认opt)下输出转换的音频. ",
+    "批量转换, 输入待转换音频文件夹, 或上传多个音频文件, 在指定文件夹(默认opt)下输出转换的音频. ": "一括変換、変換する音声フォルダを入力、または複数の音声ファイルをアップロードし、指定したフォルダ(デフォルトのopt)に変換した音声を出力します。",
     "指定输出文件夹": "出力フォルダを指定してください",
     "检索特征占比": "検索特徴率",
     "输入待处理音频文件夹路径(去文件管理器地址栏拷就行了)": "処理対象音声フォルダーのパスを入力してください(ファイルマネージャのアドレスバーからコピーしてください)",
     "也可批量输入音频文件, 二选一, 优先读文件夹": "複数の音声ファイルを一括で入力することもできますが、フォルダーを優先して読み込みます",
     "伴奏人声分离": "伴奏とボーカルの分離",
-    "人声伴奏分离批量处理, 使用UVR5模型. <br>不带和声用HP2, 带和声且提取的人声不需要和声用HP5<br>合格的文件夹路径格式举例: E:\\codes\\py39\\vits_vc_gpu\\白鹭霜华测试样例(去文件管理器地址栏拷就行了)": "人声伴奏分离批量处理, 使用UVR5模型. <br>不带和声用HP2, 带和声且提取的人声不需要和声用HP5<br>合格的文件夹路径格式举例: E:\\codes\\py39\\vits_vc_gpu\\白鹭霜华测试样例(去文件管理器地址栏拷就行了)",
+    "人声伴奏分离批量处理, 使用UVR5模型. <br>不带和声用HP2, 带和声且提取的人声不需要和声用HP5<br>合格的文件夹路径格式举例: E:\\codes\\py39\\vits_vc_gpu\\白鹭霜华测试样例(去文件管理器地址栏拷就行了)": "UVR5モデルを使用した、声帯分離バッチ処理です。<br>HP2はハーモニー、ハーモニーのあるボーカルとハーモニーのないボーカルを抽出したものはHP5を使ってください <br>フォルダーパスの形式例: E:\\codes\\py39\\vits_vc_gpu\\白鹭霜华测试样例(エクスプローラーのアドレスバーからコピーするだけです)",
     "输入待处理音频文件夹路径": "処理するオーディオファイルのフォルダパスを入力してください",
     "模型": "モデル",
     "指定输出人声文件夹": "人の声を出力するフォルダを指定してください",
@@ -60,7 +60,7 @@
     "要置入的模型信息": "挿入するモデル情報",
     "保存的模型名不带后缀": "拡張子のない保存するモデル名",
     "融合": "フュージョン",
-    "修改模型信息(仅支持weights文件夹下提取的小模型文件)": "修改模型信息(仅支持weights文件夹下提取的小模型文件)",
+    "修改模型信息(仅支持weights文件夹下提取的小模型文件)": "モデル情報の修正(weightsフォルダから抽出された小さなモデルファイルのみ対応)",
     "模型路径": "モデルパス",
     "要改的模型信息": "変更するモデル情報",
     "保存的文件名, 默认空为和源文件同名": "保存するファイル名、デフォルトでは空欄で元のファイル名と同じ名前になります",
@@ -68,18 +68,18 @@
     "查看模型信息(仅支持weights文件夹下提取的小模型文件)": "モデル情報を表示する(小さいモデルファイルはweightsフォルダーからのみサポートされています)",
     "查看": "表示",
     "模型提取(输入logs文件夹下大文件模型路径),适用于训一半不想训了模型没有自动提取保存小文件模型,或者想测试中间模型的情况": "モデル抽出(ログフォルダー内の大きなファイルのモデルパスを入力)、モデルを半分までトレーニングし、自動的に小さいファイルモデルを保存しなかったり、中間モデルをテストしたい場合に適用されます。",
-    "保存名": "保存するファイル名",
+    "保存名": "保存ファイル名",
     "模型是否带音高指导,1是0否": "モデルに音高ガイドを付けるかどうか、1は付ける、0は付けない",
     "提取": "抽出",
     "招募音高曲线前端编辑器": "音高曲線フロントエンドエディターを募集",
     "加开发群联系我xxxxx": "開発グループに参加して私に連絡してくださいxxxxx",
     "点击查看交流、问题反馈群号": "クリックして交流、問題フィードバックグループ番号を表示",
     "xxxxx": "xxxxx",
-    "加载模型": "モデルをロードする",
+    "加载模型": "モデルをロード",
     "Hubert模型": "Hubert模型",
-    "选择.pth文件": ".pthファイルを選択する",
+    "选择.pth文件": ".pthファイルを選択",
-    "选择.index文件": ".indexファイルを選択する",
+    "选择.index文件": ".indexファイルを選択",
-    "选择.npy文件": ".npyファイルを選択する",
+    "选择.npy文件": ".npyファイルを選択",
     "输入设备": "入力デバイス",
     "输出设备": "出力デバイス",
     "音频设备(请使用同种类驱动)": "オーディオデバイス(同じ種類のドライバーを使用してください)",
@@ -93,7 +93,12 @@
     "输入降噪": "入力ノイズの低減",
     "输出降噪": "出力ノイズの低減",
     "性能设置": "パフォーマンス設定",
-    "开始音频转换": "音声変換を開始する",
+    "开始音频转换": "音声変換を開始",
-    "停止音频转换": "音声変換を停止する",
+    "停止音频转换": "音声変換を停止",
-    "推理时间(ms):": "推論時間(ms):"
+    "推理时间(ms):": "推論時間(ms):",
+    "Onnx导出": "Onnx",
+    "RVC模型路径": "RVCルパス",
+    "Onnx输出路径": "Onnx出力パス",
+    "导出Onnx模型": "Onnxに変換",
+    "MoeVS模型": "MoeSS"
 }

View File

@ -7,7 +7,9 @@ standard_file = "zh_CN.json"
# Find all JSON files in the directory # Find all JSON files in the directory
dir_path = "./" dir_path = "./"
languages = [f for f in os.listdir(dir_path) if f.endswith(".json") and f != standard_file] languages = [
f for f in os.listdir(dir_path) if f.endswith(".json") and f != standard_file
]
# Load the standard file # Load the standard file
with open(standard_file, "r", encoding="utf-8") as f: with open(standard_file, "r", encoding="utf-8") as f:
@ -40,3 +42,4 @@ for lang_file in languages:
# Save the updated language file # Save the updated language file
with open(lang_file, "w", encoding="utf-8") as f: with open(lang_file, "w", encoding="utf-8") as f:
json.dump(lang_data, f, ensure_ascii=False, indent=4) json.dump(lang_data, f, ensure_ascii=False, indent=4)
f.write("\n")

View File

@ -95,5 +95,10 @@
"性能设置": "性能设置", "性能设置": "性能设置",
"开始音频转换": "开始音频转换", "开始音频转换": "开始音频转换",
"停止音频转换": "停止音频转换", "停止音频转换": "停止音频转换",
"推理时间(ms):": "推理时间(ms):" "推理时间(ms):": "推理时间(ms):",
"Onnx导出": "Onnx导出",
"RVC模型路径": "RVC模型路径",
"Onnx输出路径": "Onnx输出路径",
"导出Onnx模型": "导出Onnx模型",
"MoeVS模型": "MoeSS模型"
} }

View File

@ -95,5 +95,10 @@
"性能设置": "效能設定", "性能设置": "效能設定",
"开始音频转换": "開始音訊轉換", "开始音频转换": "開始音訊轉換",
"停止音频转换": "停止音訊轉換", "停止音频转换": "停止音訊轉換",
"推理时间(ms):": "推理時間(ms):" "推理时间(ms):": "推理時間(ms):",
"Onnx导出": "Onnx导出",
"RVC模型路径": "RVC模型路径",
"Onnx输出路径": "Onnx输出路径",
"导出Onnx模型": "导出Onnx模型",
"MoeVS模型": "MoeSS模型"
} }

View File

@ -95,5 +95,10 @@
"性能设置": "效能設定", "性能设置": "效能設定",
"开始音频转换": "開始音訊轉換", "开始音频转换": "開始音訊轉換",
"停止音频转换": "停止音訊轉換", "停止音频转换": "停止音訊轉換",
"推理时间(ms):": "推理時間(ms):" "推理时间(ms):": "推理時間(ms):",
"Onnx导出": "Onnx导出",
"RVC模型路径": "RVC模型路径",
"Onnx输出路径": "Onnx输出路径",
"导出Onnx模型": "导出Onnx模型",
"MoeVS模型": "MoeSS模型"
} }

View File

@ -95,5 +95,10 @@
"性能设置": "效能設定", "性能设置": "效能設定",
"开始音频转换": "開始音訊轉換", "开始音频转换": "開始音訊轉換",
"停止音频转换": "停止音訊轉換", "停止音频转换": "停止音訊轉換",
"推理时间(ms):": "推理時間(ms):" "推理时间(ms):": "推理時間(ms):",
"Onnx导出": "Onnx导出",
"RVC模型路径": "RVC模型路径",
"Onnx输出路径": "Onnx输出路径",
"导出Onnx模型": "导出Onnx模型",
"MoeVS模型": "MoeSS模型"
} }

View File

@ -36,6 +36,10 @@ else:
or "20" in gpu_name or "20" in gpu_name
or "30" in gpu_name or "30" in gpu_name
or "40" in gpu_name or "40" in gpu_name
or "A2" in gpu_name.upper()
or "A3" in gpu_name.upper()
or "A4" in gpu_name.upper()
or "P4" in gpu_name.upper()
or "A50" in gpu_name.upper() or "A50" in gpu_name.upper()
or "70" in gpu_name or "70" in gpu_name
or "80" in gpu_name or "80" in gpu_name
@ -115,6 +119,7 @@ for name in os.listdir(weight_uvr5_root):
uvr5_names.append(name.replace(".pth", "")) uvr5_names.append(name.replace(".pth", ""))
def vc_single( def vc_single(
sid, sid,
input_audio, input_audio,
@ -135,6 +140,17 @@ def vc_single(
if hubert_model == None: if hubert_model == None:
load_hubert() load_hubert()
if_f0 = cpt.get("f0", 1) if_f0 = cpt.get("f0", 1)
file_index = (
file_index.strip(" ")
.strip('"')
.strip("\n")
.strip('"')
.strip(" ")
.replace("trained", "added")
) # prevent a common user mistake: automatically replace "trained" with "added" in the index path
file_big_npy = (
file_big_npy.strip(" ").strip('"').strip("\n").strip('"').strip(" ")
)
audio_opt = vc.pipeline( audio_opt = vc.pipeline(
hubert_model, hubert_model,
net_g, net_g,
@ -870,6 +886,83 @@ def change_info_(ckpt_path):
return {"__type__": "update"}, {"__type__": "update"} return {"__type__": "update"}, {"__type__": "update"}
from infer_pack.models_onnx_moess import SynthesizerTrnMs256NSFsidM
from infer_pack.models_onnx import SynthesizerTrnMs256NSFsidO
def export_onnx(ModelPath, ExportedPath, MoeVS=True):
hidden_channels = 256 # hidden unit dimension; 768 is reserved for 768-dim Vec features
cpt = torch.load(ModelPath, map_location="cpu")
cpt["config"][-3] = cpt["weight"]["emb_g.weight"].shape[0] # n_spk
print(*cpt["config"])
test_phone = torch.rand(1, 200, hidden_channels) # hidden unit
test_phone_lengths = torch.tensor([200]).long() # hidden unit length (apparently unused)
test_pitch = torch.randint(size=(1, 200), low=5, high=255) # fundamental frequency (in Hz)
test_pitchf = torch.rand(1, 200) # NSF f0 curve
test_ds = torch.LongTensor([0]) # speaker id
test_rnd = torch.rand(1, 192, 200) # noise (adds a random factor)
device = "cpu" # device used for export (does not affect how the model is used later)
if MoeVS:
net_g = SynthesizerTrnMs256NSFsidM(*cpt["config"], is_half=False) # export in fp32; fp16 would require manual memory re-layout for C++ support, so it is skipped for now
net_g.load_state_dict(cpt["weight"], strict=False)
input_names = ["phone", "phone_lengths", "pitch", "pitchf", "ds", "rnd"]
output_names = [
"audio",
]
torch.onnx.export(
net_g,
(
test_phone.to(device),
test_phone_lengths.to(device),
test_pitch.to(device),
test_pitchf.to(device),
test_ds.to(device),
test_rnd.to(device),
),
ExportedPath,
dynamic_axes={
"phone": [1],
"pitch": [1],
"pitchf": [1],
"rnd": [2],
},
do_constant_folding=False,
opset_version=16,
verbose=False,
input_names=input_names,
output_names=output_names,
)
else:
net_g = SynthesizerTrnMs256NSFsidO(*cpt["config"], is_half=False) # export in fp32; fp16 would require manual memory re-layout for C++ support, so it is skipped for now
net_g.load_state_dict(cpt["weight"], strict=False)
input_names = ["phone", "phone_lengths", "pitch", "pitchf", "ds"]
output_names = [
"audio",
]
torch.onnx.export(
net_g,
(
test_phone.to(device),
test_phone_lengths.to(device),
test_pitch.to(device),
test_pitchf.to(device),
test_ds.to(device),
),
ExportedPath,
dynamic_axes={
"phone": [1],
"pitch": [1],
"pitchf": [1],
},
do_constant_folding=False,
opset_version=16,
verbose=False,
input_names=input_names,
output_names=output_names,
)
return "Finished"
with gr.Blocks() as app: with gr.Blocks() as app:
gr.Markdown( gr.Markdown(
value=i18n( value=i18n(
@ -932,7 +1025,7 @@ with gr.Blocks() as app:
minimum=0, minimum=0,
maximum=1, maximum=1,
label="检索特征占比", label="检索特征占比",
value=1, value=0.6,
interactive=True, interactive=True,
) )
f0_file = gr.File(label=i18n("F0曲线文件, 可选, 一行一个音高, 代替默认F0及升降调")) f0_file = gr.File(label=i18n("F0曲线文件, 可选, 一行一个音高, 代替默认F0及升降调"))
@ -1346,6 +1439,18 @@ with gr.Blocks() as app:
info7, info7,
) )
with gr.TabItem(i18n("Onnx导出")):
with gr.Row():
ckpt_dir = gr.Textbox(label=i18n("RVC模型路径"), value="", interactive=True)
with gr.Row():
onnx_dir = gr.Textbox(label=i18n("Onnx输出路径"), value="", interactive=True)
with gr.Row():
moevs = gr.Checkbox(label=i18n("MoeVS模型"), value=True)
infoOnnx = gr.Label(label="Null")
with gr.Row():
butOnnx = gr.Button(i18n("导出Onnx模型"), variant="primary")
butOnnx.click(export_onnx, [ckpt_dir, onnx_dir, moevs], infoOnnx)
# with gr.TabItem(i18n("招募音高曲线前端编辑器")): # with gr.TabItem(i18n("招募音高曲线前端编辑器")):
# gr.Markdown(value=i18n("加开发群联系我xxxxx")) # gr.Markdown(value=i18n("加开发群联系我xxxxx"))
# with gr.TabItem(i18n("点击查看交流、问题反馈群号")): # with gr.TabItem(i18n("点击查看交流、问题反馈群号")):

View File

@ -527,7 +527,7 @@ sr2sr = {
} }
class SynthesizerTrnMs256NSFsid(nn.Module): class SynthesizerTrnMs256NSFsidO(nn.Module):
def __init__( def __init__(
self, self,
spec_channels, spec_channels,
@ -612,104 +612,15 @@ class SynthesizerTrnMs256NSFsid(nn.Module):
self.flow.remove_weight_norm() self.flow.remove_weight_norm()
self.enc_q.remove_weight_norm() self.enc_q.remove_weight_norm()
def forward(self, phone, phone_lengths, pitch, nsff0, sid, rnd, max_len=None): def forward(self, phone, phone_lengths, pitch, nsff0, sid, max_len=None):
g = self.emb_g(sid).unsqueeze(-1) g = self.emb_g(sid).unsqueeze(-1)
m_p, logs_p, x_mask = self.enc_p(phone, pitch, phone_lengths) m_p, logs_p, x_mask = self.enc_p(phone, pitch, phone_lengths)
z_p = (m_p + torch.exp(logs_p) * rnd) * x_mask z_p = (m_p + torch.exp(logs_p) * torch.randn_like(m_p) * 0.66666) * x_mask
z = self.flow(z_p, x_mask, g=g, reverse=True) z = self.flow(z_p, x_mask, g=g, reverse=True)
o = self.dec((z * x_mask)[:, :, :max_len], nsff0, g=g) o = self.dec((z * x_mask)[:, :, :max_len], nsff0, g=g)
return o return o
class SynthesizerTrnMs256NSFsid_sim(nn.Module):
"""
Synthesizer for Training
"""
def __init__(
self,
spec_channels,
segment_size,
inter_channels,
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout,
resblock,
resblock_kernel_sizes,
resblock_dilation_sizes,
upsample_rates,
upsample_initial_channel,
upsample_kernel_sizes,
spk_embed_dim,
# hop_length,
gin_channels=0,
use_sdp=True,
**kwargs
):
super().__init__()
self.spec_channels = spec_channels
self.inter_channels = inter_channels
self.hidden_channels = hidden_channels
self.filter_channels = filter_channels
self.n_heads = n_heads
self.n_layers = n_layers
self.kernel_size = kernel_size
self.p_dropout = p_dropout
self.resblock = resblock
self.resblock_kernel_sizes = resblock_kernel_sizes
self.resblock_dilation_sizes = resblock_dilation_sizes
self.upsample_rates = upsample_rates
self.upsample_initial_channel = upsample_initial_channel
self.upsample_kernel_sizes = upsample_kernel_sizes
self.segment_size = segment_size
self.gin_channels = gin_channels
# self.hop_length = hop_length#
self.spk_embed_dim = spk_embed_dim
self.enc_p = TextEncoder256Sim(
inter_channels,
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout,
)
self.dec = GeneratorNSF(
inter_channels,
resblock,
resblock_kernel_sizes,
resblock_dilation_sizes,
upsample_rates,
upsample_initial_channel,
upsample_kernel_sizes,
gin_channels=gin_channels,
is_half=kwargs["is_half"],
)
self.flow = ResidualCouplingBlock(
inter_channels, hidden_channels, 5, 1, 3, gin_channels=gin_channels
)
self.emb_g = nn.Embedding(self.spk_embed_dim, gin_channels)
print("gin_channels:", gin_channels, "self.spk_embed_dim:", self.spk_embed_dim)
def remove_weight_norm(self):
self.dec.remove_weight_norm()
self.flow.remove_weight_norm()
self.enc_q.remove_weight_norm()
def forward(
self, phone, phone_lengths, pitch, pitchf, ds, max_len=None
): # y (the spec input) is no longer needed here
g = self.emb_g(ds.unsqueeze(0)).unsqueeze(-1) # [b, 256, 1]; the trailing 1 broadcasts over t
x, x_mask = self.enc_p(phone, pitch, phone_lengths)
x = self.flow(x, x_mask, g=g, reverse=True)
o = self.dec((x * x_mask)[:, :, :max_len], pitchf, g=g)
return o
class MultiPeriodDiscriminator(torch.nn.Module): class MultiPeriodDiscriminator(torch.nn.Module):
def __init__(self, use_spectral_norm=False): def __init__(self, use_spectral_norm=False):
super(MultiPeriodDiscriminator, self).__init__() super(MultiPeriodDiscriminator, self).__init__()

View File

@ -0,0 +1,849 @@
import math, pdb, os
from time import time as ttime
import torch
from torch import nn
from torch.nn import functional as F
from infer_pack import modules
from infer_pack import attentions
from infer_pack import commons
from infer_pack.commons import init_weights, get_padding
from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
from infer_pack.commons import init_weights
import numpy as np
from infer_pack import commons
class TextEncoder256(nn.Module):
def __init__(
self,
out_channels,
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout,
f0=True,
):
super().__init__()
self.out_channels = out_channels
self.hidden_channels = hidden_channels
self.filter_channels = filter_channels
self.n_heads = n_heads
self.n_layers = n_layers
self.kernel_size = kernel_size
self.p_dropout = p_dropout
self.emb_phone = nn.Linear(256, hidden_channels)
self.lrelu = nn.LeakyReLU(0.1, inplace=True)
if f0 == True:
self.emb_pitch = nn.Embedding(256, hidden_channels) # pitch 256
self.encoder = attentions.Encoder(
hidden_channels, filter_channels, n_heads, n_layers, kernel_size, p_dropout
)
self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
def forward(self, phone, pitch, lengths):
if pitch == None:
x = self.emb_phone(phone)
else:
x = self.emb_phone(phone) + self.emb_pitch(pitch)
x = x * math.sqrt(self.hidden_channels) # [b, t, h]
x = self.lrelu(x)
x = torch.transpose(x, 1, -1) # [b, h, t]
x_mask = torch.unsqueeze(commons.sequence_mask(lengths, x.size(2)), 1).to(
x.dtype
)
x = self.encoder(x * x_mask, x_mask)
stats = self.proj(x) * x_mask
m, logs = torch.split(stats, self.out_channels, dim=1)
return m, logs, x_mask
class TextEncoder256Sim(nn.Module):
def __init__(
self,
out_channels,
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout,
f0=True,
):
super().__init__()
self.out_channels = out_channels
self.hidden_channels = hidden_channels
self.filter_channels = filter_channels
self.n_heads = n_heads
self.n_layers = n_layers
self.kernel_size = kernel_size
self.p_dropout = p_dropout
self.emb_phone = nn.Linear(256, hidden_channels)
self.lrelu = nn.LeakyReLU(0.1, inplace=True)
if f0 == True:
self.emb_pitch = nn.Embedding(256, hidden_channels) # pitch 256
self.encoder = attentions.Encoder(
hidden_channels, filter_channels, n_heads, n_layers, kernel_size, p_dropout
)
self.proj = nn.Conv1d(hidden_channels, out_channels, 1)
def forward(self, phone, pitch, lengths):
if pitch == None:
x = self.emb_phone(phone)
else:
x = self.emb_phone(phone) + self.emb_pitch(pitch)
x = x * math.sqrt(self.hidden_channels) # [b, t, h]
x = self.lrelu(x)
x = torch.transpose(x, 1, -1) # [b, h, t]
x_mask = torch.unsqueeze(commons.sequence_mask(lengths, x.size(2)), 1).to(
x.dtype
)
x = self.encoder(x * x_mask, x_mask)
x = self.proj(x) * x_mask
return x, x_mask
class ResidualCouplingBlock(nn.Module):
def __init__(
self,
channels,
hidden_channels,
kernel_size,
dilation_rate,
n_layers,
n_flows=4,
gin_channels=0,
):
super().__init__()
self.channels = channels
self.hidden_channels = hidden_channels
self.kernel_size = kernel_size
self.dilation_rate = dilation_rate
self.n_layers = n_layers
self.n_flows = n_flows
self.gin_channels = gin_channels
self.flows = nn.ModuleList()
for i in range(n_flows):
self.flows.append(
modules.ResidualCouplingLayer(
channels,
hidden_channels,
kernel_size,
dilation_rate,
n_layers,
gin_channels=gin_channels,
mean_only=True,
)
)
self.flows.append(modules.Flip())
def forward(self, x, x_mask, g=None, reverse=False):
if not reverse:
for flow in self.flows:
x, _ = flow(x, x_mask, g=g, reverse=reverse)
else:
for flow in reversed(self.flows):
x = flow(x, x_mask, g=g, reverse=reverse)
return x
def remove_weight_norm(self):
for i in range(self.n_flows):
self.flows[i * 2].remove_weight_norm()
class PosteriorEncoder(nn.Module):
def __init__(
self,
in_channels,
out_channels,
hidden_channels,
kernel_size,
dilation_rate,
n_layers,
gin_channels=0,
):
super().__init__()
self.in_channels = in_channels
self.out_channels = out_channels
self.hidden_channels = hidden_channels
self.kernel_size = kernel_size
self.dilation_rate = dilation_rate
self.n_layers = n_layers
self.gin_channels = gin_channels
self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
self.enc = modules.WN(
hidden_channels,
kernel_size,
dilation_rate,
n_layers,
gin_channels=gin_channels,
)
self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
def forward(self, x, x_lengths, g=None):
x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(
x.dtype
)
x = self.pre(x) * x_mask
x = self.enc(x, x_mask, g=g)
stats = self.proj(x) * x_mask
m, logs = torch.split(stats, self.out_channels, dim=1)
z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
return z, m, logs, x_mask
def remove_weight_norm(self):
self.enc.remove_weight_norm()
class Generator(torch.nn.Module):
def __init__(
self,
initial_channel,
resblock,
resblock_kernel_sizes,
resblock_dilation_sizes,
upsample_rates,
upsample_initial_channel,
upsample_kernel_sizes,
gin_channels=0,
):
super(Generator, self).__init__()
self.num_kernels = len(resblock_kernel_sizes)
self.num_upsamples = len(upsample_rates)
self.conv_pre = Conv1d(
initial_channel, upsample_initial_channel, 7, 1, padding=3
)
resblock = modules.ResBlock1 if resblock == "1" else modules.ResBlock2
self.ups = nn.ModuleList()
for i, (u, k) in enumerate(zip(upsample_rates, upsample_kernel_sizes)):
self.ups.append(
weight_norm(
ConvTranspose1d(
upsample_initial_channel // (2**i),
upsample_initial_channel // (2 ** (i + 1)),
k,
u,
padding=(k - u) // 2,
)
)
)
self.resblocks = nn.ModuleList()
for i in range(len(self.ups)):
ch = upsample_initial_channel // (2 ** (i + 1))
for j, (k, d) in enumerate(
zip(resblock_kernel_sizes, resblock_dilation_sizes)
):
self.resblocks.append(resblock(ch, k, d))
self.conv_post = Conv1d(ch, 1, 7, 1, padding=3, bias=False)
self.ups.apply(init_weights)
if gin_channels != 0:
self.cond = nn.Conv1d(gin_channels, upsample_initial_channel, 1)
def forward(self, x, g=None):
x = self.conv_pre(x)
if g is not None:
x = x + self.cond(g)
for i in range(self.num_upsamples):
x = F.leaky_relu(x, modules.LRELU_SLOPE)
x = self.ups[i](x)
xs = None
for j in range(self.num_kernels):
if xs is None:
xs = self.resblocks[i * self.num_kernels + j](x)
else:
xs += self.resblocks[i * self.num_kernels + j](x)
x = xs / self.num_kernels
x = F.leaky_relu(x)
x = self.conv_post(x)
x = torch.tanh(x)
return x
def remove_weight_norm(self):
for l in self.ups:
remove_weight_norm(l)
for l in self.resblocks:
l.remove_weight_norm()
class SineGen(torch.nn.Module):
"""Definition of sine generator
SineGen(samp_rate, harmonic_num = 0,
sine_amp = 0.1, noise_std = 0.003,
voiced_threshold = 0,
flag_for_pulse=False)
samp_rate: sampling rate in Hz
harmonic_num: number of harmonic overtones (default 0)
sine_amp: amplitude of sine-waveform (default 0.1)
noise_std: std of Gaussian noise (default 0.003)
voiced_threshold: F0 threshold for U/V classification (default 0)
flag_for_pulse: this SineGen is used inside PulseGen (default False)
Note: when flag_for_pulse is True, the first time step of a voiced
segment is always sin(np.pi) or cos(0)
"""
def __init__(
self,
samp_rate,
harmonic_num=0,
sine_amp=0.1,
noise_std=0.003,
voiced_threshold=0,
flag_for_pulse=False,
):
super(SineGen, self).__init__()
self.sine_amp = sine_amp
self.noise_std = noise_std
self.harmonic_num = harmonic_num
self.dim = self.harmonic_num + 1
self.sampling_rate = samp_rate
self.voiced_threshold = voiced_threshold
def _f02uv(self, f0):
# generate uv signal
uv = torch.ones_like(f0)
uv = uv * (f0 > self.voiced_threshold)
return uv
def forward(self, f0, upp):
"""sine_tensor, uv = forward(f0)
input F0: tensor(batchsize=1, length, dim=1)
f0 for unvoiced steps should be 0
output sine_tensor: tensor(batchsize=1, length, dim)
output uv: tensor(batchsize=1, length, 1)
"""
with torch.no_grad():
f0 = f0[:, None].transpose(1, 2)
f0_buf = torch.zeros(f0.shape[0], f0.shape[1], self.dim, device=f0.device)
# fundamental component
f0_buf[:, :, 0] = f0[:, :, 0]
for idx in np.arange(self.harmonic_num):
f0_buf[:, :, idx + 1] = f0_buf[:, :, 0] * (
idx + 2
) # idx + 2: the (idx+1)-th overtone, (idx+2)-th harmonic
rad_values = (f0_buf / self.sampling_rate) % 1 # the %1 here means the n_har products cannot be optimized away in post-processing
rand_ini = torch.rand(
f0_buf.shape[0], f0_buf.shape[2], device=f0_buf.device
)
rand_ini[:, 0] = 0
rad_values[:, 0, :] = rad_values[:, 0, :] + rand_ini
tmp_over_one = torch.cumsum(rad_values, 1) # % 1 # applying %1 here would mean the later cumsum could no longer be optimized
tmp_over_one *= upp
tmp_over_one = F.interpolate(
tmp_over_one.transpose(2, 1),
scale_factor=upp,
mode="linear",
align_corners=True,
).transpose(2, 1)
rad_values = F.interpolate(
rad_values.transpose(2, 1), scale_factor=upp, mode="nearest"
).transpose(
2, 1
) #######
tmp_over_one %= 1
tmp_over_one_idx = (tmp_over_one[:, 1:, :] - tmp_over_one[:, :-1, :]) < 0
cumsum_shift = torch.zeros_like(rad_values)
cumsum_shift[:, 1:, :] = tmp_over_one_idx * -1.0
sine_waves = torch.sin(
torch.cumsum(rad_values + cumsum_shift, dim=1) * 2 * np.pi
)
sine_waves = sine_waves * self.sine_amp
uv = self._f02uv(f0)
uv = F.interpolate(
uv.transpose(2, 1), scale_factor=upp, mode="nearest"
).transpose(2, 1)
noise_amp = uv * self.noise_std + (1 - uv) * self.sine_amp / 3
noise = noise_amp * torch.randn_like(sine_waves)
sine_waves = sine_waves * uv + noise
return sine_waves, uv, noise
class SourceModuleHnNSF(torch.nn.Module):
"""SourceModule for hn-nsf
SourceModule(sampling_rate, harmonic_num=0, sine_amp=0.1,
add_noise_std=0.003, voiced_threshod=0)
sampling_rate: sampling_rate in Hz
harmonic_num: number of harmonic above F0 (default: 0)
sine_amp: amplitude of sine source signal (default: 0.1)
add_noise_std: std of additive Gaussian noise (default: 0.003)
note that amplitude of noise in unvoiced is decided
by sine_amp
voiced_threshold: threshold to set U/V given F0 (default: 0)
Sine_source, noise_source = SourceModuleHnNSF(F0_sampled)
F0_sampled (batchsize, length, 1)
Sine_source (batchsize, length, 1)
noise_source (batchsize, length 1)
uv (batchsize, length, 1)
"""
def __init__(
self,
sampling_rate,
harmonic_num=0,
sine_amp=0.1,
add_noise_std=0.003,
voiced_threshod=0,
is_half=True,
):
super(SourceModuleHnNSF, self).__init__()
self.sine_amp = sine_amp
self.noise_std = add_noise_std
self.is_half = is_half
# to produce sine waveforms
self.l_sin_gen = SineGen(
sampling_rate, harmonic_num, sine_amp, add_noise_std, voiced_threshod
)
# to merge source harmonics into a single excitation
self.l_linear = torch.nn.Linear(harmonic_num + 1, 1)
self.l_tanh = torch.nn.Tanh()
def forward(self, x, upp=None):
sine_wavs, uv, _ = self.l_sin_gen(x, upp)
if self.is_half:
sine_wavs = sine_wavs.half()
sine_merge = self.l_tanh(self.l_linear(sine_wavs))
return sine_merge, None, None # noise, uv
class GeneratorNSF(torch.nn.Module):
def __init__(
self,
initial_channel,
resblock,
resblock_kernel_sizes,
resblock_dilation_sizes,
upsample_rates,
upsample_initial_channel,
upsample_kernel_sizes,
gin_channels,
sr,
is_half=False,
):
super(GeneratorNSF, self).__init__()
self.num_kernels = len(resblock_kernel_sizes)
self.num_upsamples = len(upsample_rates)
self.f0_upsamp = torch.nn.Upsample(scale_factor=np.prod(upsample_rates))
self.m_source = SourceModuleHnNSF(
sampling_rate=sr, harmonic_num=0, is_half=is_half
)
self.noise_convs = nn.ModuleList()
self.conv_pre = Conv1d(
initial_channel, upsample_initial_channel, 7, 1, padding=3
)
resblock = modules.ResBlock1 if resblock == "1" else modules.ResBlock2
self.ups = nn.ModuleList()
for i, (u, k) in enumerate(zip(upsample_rates, upsample_kernel_sizes)):
c_cur = upsample_initial_channel // (2 ** (i + 1))
self.ups.append(
weight_norm(
ConvTranspose1d(
upsample_initial_channel // (2**i),
upsample_initial_channel // (2 ** (i + 1)),
k,
u,
padding=(k - u) // 2,
)
)
)
if i + 1 < len(upsample_rates):
stride_f0 = np.prod(upsample_rates[i + 1 :])
self.noise_convs.append(
Conv1d(
1,
c_cur,
kernel_size=stride_f0 * 2,
stride=stride_f0,
padding=stride_f0 // 2,
)
)
else:
self.noise_convs.append(Conv1d(1, c_cur, kernel_size=1))
self.resblocks = nn.ModuleList()
for i in range(len(self.ups)):
ch = upsample_initial_channel // (2 ** (i + 1))
for j, (k, d) in enumerate(
zip(resblock_kernel_sizes, resblock_dilation_sizes)
):
self.resblocks.append(resblock(ch, k, d))
self.conv_post = Conv1d(ch, 1, 7, 1, padding=3, bias=False)
self.ups.apply(init_weights)
if gin_channels != 0:
self.cond = nn.Conv1d(gin_channels, upsample_initial_channel, 1)
self.upp = np.prod(upsample_rates)
def forward(self, x, f0, g=None):
har_source, noi_source, uv = self.m_source(f0, self.upp)
har_source = har_source.transpose(1, 2)
x = self.conv_pre(x)
if g is not None:
x = x + self.cond(g)
for i in range(self.num_upsamples):
x = F.leaky_relu(x, modules.LRELU_SLOPE)
x = self.ups[i](x)
x_source = self.noise_convs[i](har_source)
x = x + x_source
xs = None
for j in range(self.num_kernels):
if xs is None:
xs = self.resblocks[i * self.num_kernels + j](x)
else:
xs += self.resblocks[i * self.num_kernels + j](x)
x = xs / self.num_kernels
x = F.leaky_relu(x)
x = self.conv_post(x)
x = torch.tanh(x)
return x
def remove_weight_norm(self):
for l in self.ups:
remove_weight_norm(l)
for l in self.resblocks:
l.remove_weight_norm()
sr2sr = {
"32k": 32000,
"40k": 40000,
"48k": 48000,
}
class SynthesizerTrnMs256NSFsidM(nn.Module):
def __init__(
self,
spec_channels,
segment_size,
inter_channels,
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout,
resblock,
resblock_kernel_sizes,
resblock_dilation_sizes,
upsample_rates,
upsample_initial_channel,
upsample_kernel_sizes,
spk_embed_dim,
gin_channels,
sr,
**kwargs
):
super().__init__()
if type(sr) == type("strr"):
sr = sr2sr[sr]
self.spec_channels = spec_channels
self.inter_channels = inter_channels
self.hidden_channels = hidden_channels
self.filter_channels = filter_channels
self.n_heads = n_heads
self.n_layers = n_layers
self.kernel_size = kernel_size
self.p_dropout = p_dropout
self.resblock = resblock
self.resblock_kernel_sizes = resblock_kernel_sizes
self.resblock_dilation_sizes = resblock_dilation_sizes
self.upsample_rates = upsample_rates
self.upsample_initial_channel = upsample_initial_channel
self.upsample_kernel_sizes = upsample_kernel_sizes
self.segment_size = segment_size
self.gin_channels = gin_channels
# self.hop_length = hop_length#
self.spk_embed_dim = spk_embed_dim
self.enc_p = TextEncoder256(
inter_channels,
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout,
)
self.dec = GeneratorNSF(
inter_channels,
resblock,
resblock_kernel_sizes,
resblock_dilation_sizes,
upsample_rates,
upsample_initial_channel,
upsample_kernel_sizes,
gin_channels=gin_channels,
sr=sr,
is_half=kwargs["is_half"],
)
self.enc_q = PosteriorEncoder(
spec_channels,
inter_channels,
hidden_channels,
5,
1,
16,
gin_channels=gin_channels,
)
self.flow = ResidualCouplingBlock(
inter_channels, hidden_channels, 5, 1, 3, gin_channels=gin_channels
)
self.emb_g = nn.Embedding(self.spk_embed_dim, gin_channels)
print("gin_channels:", gin_channels, "self.spk_embed_dim:", self.spk_embed_dim)
def remove_weight_norm(self):
self.dec.remove_weight_norm()
self.flow.remove_weight_norm()
self.enc_q.remove_weight_norm()
def forward(self, phone, phone_lengths, pitch, nsff0, sid, rnd, max_len=None):
g = self.emb_g(sid).unsqueeze(-1)
m_p, logs_p, x_mask = self.enc_p(phone, pitch, phone_lengths)
z_p = (m_p + torch.exp(logs_p) * rnd) * x_mask
z = self.flow(z_p, x_mask, g=g, reverse=True)
o = self.dec((z * x_mask)[:, :, :max_len], nsff0, g=g)
return o
class SynthesizerTrnMs256NSFsid_sim(nn.Module):
"""
Synthesizer for Training
"""
def __init__(
self,
spec_channels,
segment_size,
inter_channels,
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout,
resblock,
resblock_kernel_sizes,
resblock_dilation_sizes,
upsample_rates,
upsample_initial_channel,
upsample_kernel_sizes,
spk_embed_dim,
# hop_length,
gin_channels=0,
use_sdp=True,
**kwargs
):
super().__init__()
self.spec_channels = spec_channels
self.inter_channels = inter_channels
self.hidden_channels = hidden_channels
self.filter_channels = filter_channels
self.n_heads = n_heads
self.n_layers = n_layers
self.kernel_size = kernel_size
self.p_dropout = p_dropout
self.resblock = resblock
self.resblock_kernel_sizes = resblock_kernel_sizes
self.resblock_dilation_sizes = resblock_dilation_sizes
self.upsample_rates = upsample_rates
self.upsample_initial_channel = upsample_initial_channel
self.upsample_kernel_sizes = upsample_kernel_sizes
self.segment_size = segment_size
self.gin_channels = gin_channels
# self.hop_length = hop_length#
self.spk_embed_dim = spk_embed_dim
self.enc_p = TextEncoder256Sim(
inter_channels,
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout,
)
self.dec = GeneratorNSF(
inter_channels,
resblock,
resblock_kernel_sizes,
resblock_dilation_sizes,
upsample_rates,
upsample_initial_channel,
upsample_kernel_sizes,
gin_channels=gin_channels,
is_half=kwargs["is_half"],
)
self.flow = ResidualCouplingBlock(
inter_channels, hidden_channels, 5, 1, 3, gin_channels=gin_channels
)
self.emb_g = nn.Embedding(self.spk_embed_dim, gin_channels)
print("gin_channels:", gin_channels, "self.spk_embed_dim:", self.spk_embed_dim)
def remove_weight_norm(self):
self.dec.remove_weight_norm()
self.flow.remove_weight_norm()
self.enc_q.remove_weight_norm()
def forward(
self, phone, phone_lengths, pitch, pitchf, ds, max_len=None
): # y (the spec input) is no longer needed here
g = self.emb_g(ds.unsqueeze(0)).unsqueeze(-1) # [b, 256, 1]; the trailing 1 broadcasts over t
x, x_mask = self.enc_p(phone, pitch, phone_lengths)
x = self.flow(x, x_mask, g=g, reverse=True)
o = self.dec((x * x_mask)[:, :, :max_len], pitchf, g=g)
return o
class MultiPeriodDiscriminator(torch.nn.Module):
def __init__(self, use_spectral_norm=False):
super(MultiPeriodDiscriminator, self).__init__()
periods = [2, 3, 5, 7, 11, 17]
# periods = [3, 5, 7, 11, 17, 23, 37]
discs = [DiscriminatorS(use_spectral_norm=use_spectral_norm)]
discs = discs + [
DiscriminatorP(i, use_spectral_norm=use_spectral_norm) for i in periods
]
self.discriminators = nn.ModuleList(discs)
def forward(self, y, y_hat):
y_d_rs = [] #
y_d_gs = []
fmap_rs = []
fmap_gs = []
for i, d in enumerate(self.discriminators):
y_d_r, fmap_r = d(y)
y_d_g, fmap_g = d(y_hat)
# for j in range(len(fmap_r)):
# print(i,j,y.shape,y_hat.shape,fmap_r[j].shape,fmap_g[j].shape)
y_d_rs.append(y_d_r)
y_d_gs.append(y_d_g)
fmap_rs.append(fmap_r)
fmap_gs.append(fmap_g)
return y_d_rs, y_d_gs, fmap_rs, fmap_gs
class DiscriminatorS(torch.nn.Module):
def __init__(self, use_spectral_norm=False):
super(DiscriminatorS, self).__init__()
norm_f = weight_norm if use_spectral_norm == False else spectral_norm
self.convs = nn.ModuleList(
[
norm_f(Conv1d(1, 16, 15, 1, padding=7)),
norm_f(Conv1d(16, 64, 41, 4, groups=4, padding=20)),
norm_f(Conv1d(64, 256, 41, 4, groups=16, padding=20)),
norm_f(Conv1d(256, 1024, 41, 4, groups=64, padding=20)),
norm_f(Conv1d(1024, 1024, 41, 4, groups=256, padding=20)),
norm_f(Conv1d(1024, 1024, 5, 1, padding=2)),
]
)
self.conv_post = norm_f(Conv1d(1024, 1, 3, 1, padding=1))
def forward(self, x):
fmap = []
for l in self.convs:
x = l(x)
x = F.leaky_relu(x, modules.LRELU_SLOPE)
fmap.append(x)
x = self.conv_post(x)
fmap.append(x)
x = torch.flatten(x, 1, -1)
return x, fmap
class DiscriminatorP(torch.nn.Module):
def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False):
super(DiscriminatorP, self).__init__()
self.period = period
self.use_spectral_norm = use_spectral_norm
norm_f = weight_norm if use_spectral_norm == False else spectral_norm
self.convs = nn.ModuleList(
[
norm_f(
Conv2d(
1,
32,
(kernel_size, 1),
(stride, 1),
padding=(get_padding(kernel_size, 1), 0),
)
),
norm_f(
Conv2d(
32,
128,
(kernel_size, 1),
(stride, 1),
padding=(get_padding(kernel_size, 1), 0),
)
),
norm_f(
Conv2d(
128,
512,
(kernel_size, 1),
(stride, 1),
padding=(get_padding(kernel_size, 1), 0),
)
),
norm_f(
Conv2d(
512,
1024,
(kernel_size, 1),
(stride, 1),
padding=(get_padding(kernel_size, 1), 0),
)
),
norm_f(
Conv2d(
1024,
1024,
(kernel_size, 1),
1,
padding=(get_padding(kernel_size, 1), 0),
)
),
]
)
self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))
def forward(self, x):
fmap = []
# 1d to 2d
b, c, t = x.shape
if t % self.period != 0: # pad first
n_pad = self.period - (t % self.period)
x = F.pad(x, (0, n_pad), "reflect")
t = t + n_pad
x = x.view(b, c, t // self.period, self.period)
for l in self.convs:
x = l(x)
x = F.leaky_relu(x, modules.LRELU_SLOPE)
fmap.append(x)
x = self.conv_post(x)
fmap.append(x)
x = torch.flatten(x, 1, -1)
return x, fmap

View File

@ -12,10 +12,10 @@ def load_audio(file, sr):
) # 防止小白拷路径头尾带了空格和"和回车 ) # 防止小白拷路径头尾带了空格和"和回车
out, _ = ( out, _ = (
ffmpeg.input(file, threads=0) ffmpeg.input(file, threads=0)
.output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr) .output("-", format="f32le", acodec="pcm_f32le", ac=1, ar=sr)
.run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True) .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)
) )
except Exception as e: except Exception as e:
raise RuntimeError(f"Failed to load audio: {e}") raise RuntimeError(f"Failed to load audio: {e}")
return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0 return np.frombuffer(out, np.float32).flatten()
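With the pipe switched to f32le, ffmpeg already emits normalized float32 samples, so the int16 decode and the / 32768.0 rescale are dropped. A tiny illustrative check of the new decode step (not part of the commit):

import numpy as np

raw = b"\x00\x00\x00\x00" + b"\x00\x00\x80\x3f"   # two little-endian float32 samples: 0.0 and 1.0
print(np.frombuffer(raw, np.float32))             # [0. 1.] -- already in [-1, 1], no further scaling needed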

View File

@ -4,7 +4,8 @@ scipy==1.9.3
librosa==0.9.2 librosa==0.9.2
llvmlite==0.39.0 llvmlite==0.39.0
fairseq==0.12.2 fairseq==0.12.2
faiss-cpu==1.7.2 faiss-cpu==1.7.0; sys_platform == "darwin"
faiss-cpu==1.7.2; sys_platform != "darwin"
gradio gradio
Cython Cython
future>=0.18.3 future>=0.18.3

View File

@ -98,7 +98,10 @@ class TextAudioLoaderMultiNSFsid(torch.utils.data.Dataset):
sampling_rate, self.sampling_rate sampling_rate, self.sampling_rate
) )
) )
audio_norm = audio / self.max_wav_value audio_norm = audio
# audio_norm = audio / self.max_wav_value
# audio_norm = audio / np.abs(audio).max()
audio_norm = audio_norm.unsqueeze(0) audio_norm = audio_norm.unsqueeze(0)
spec_filename = filename.replace(".wav", ".spec.pt") spec_filename = filename.replace(".wav", ".spec.pt")
if os.path.exists(spec_filename): if os.path.exists(spec_filename):
@ -287,7 +290,10 @@ class TextAudioLoader(torch.utils.data.Dataset):
sampling_rate, self.sampling_rate sampling_rate, self.sampling_rate
) )
) )
audio_norm = audio / self.max_wav_value audio_norm = audio
# audio_norm = audio / self.max_wav_value
# audio_norm = audio / np.abs(audio).max()
audio_norm = audio_norm.unsqueeze(0) audio_norm = audio_norm.unsqueeze(0)
spec_filename = filename.replace(".wav", ".spec.pt") spec_filename = filename.replace(".wav", ".spec.pt")
if os.path.exists(spec_filename): if os.path.exists(spec_filename):

View File

@ -1,18 +1,8 @@
import math
import os
import random
import torch import torch
from torch import nn
import torch.nn.functional as F
import torch.utils.data import torch.utils.data
import numpy as np
import librosa
import librosa.util as librosa_util
from librosa.util import normalize, pad_center, tiny
from scipy.signal import get_window
from scipy.io.wavfile import read
from librosa.filters import mel as librosa_mel_fn from librosa.filters import mel as librosa_mel_fn
MAX_WAV_VALUE = 32768.0 MAX_WAV_VALUE = 32768.0
@ -35,25 +25,38 @@ def dynamic_range_decompression_torch(x, C=1):
def spectral_normalize_torch(magnitudes): def spectral_normalize_torch(magnitudes):
output = dynamic_range_compression_torch(magnitudes) return dynamic_range_compression_torch(magnitudes)
return output
def spectral_de_normalize_torch(magnitudes): def spectral_de_normalize_torch(magnitudes):
output = dynamic_range_decompression_torch(magnitudes) return dynamic_range_decompression_torch(magnitudes)
return output
# Reusable banks
mel_basis = {} mel_basis = {}
hann_window = {} hann_window = {}
def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False): def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False):
"""Convert waveform into Linear-frequency Linear-amplitude spectrogram.
Args:
y :: (B, T) - Audio waveforms
n_fft
sampling_rate
hop_size
win_size
center
Returns:
:: (B, Freq, Frame) - Linear-frequency Linear-amplitude spectrogram
"""
# Validation
if torch.min(y) < -1.0: if torch.min(y) < -1.0:
print("min value is ", torch.min(y)) print("min value is ", torch.min(y))
if torch.max(y) > 1.0: if torch.max(y) > 1.0:
print("max value is ", torch.max(y)) print("max value is ", torch.max(y))
# Window - Cache if needed
global hann_window global hann_window
dtype_device = str(y.dtype) + "_" + str(y.device) dtype_device = str(y.dtype) + "_" + str(y.device)
wnsize_dtype_device = str(win_size) + "_" + dtype_device wnsize_dtype_device = str(win_size) + "_" + dtype_device
@ -62,6 +65,7 @@ def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False)
dtype=y.dtype, device=y.device dtype=y.dtype, device=y.device
) )
# Padding
y = torch.nn.functional.pad( y = torch.nn.functional.pad(
y.unsqueeze(1), y.unsqueeze(1),
(int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)), (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)),
@ -69,6 +73,7 @@ def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False)
) )
y = y.squeeze(1) y = y.squeeze(1)
# Complex Spectrogram :: (B, T) -> (B, Freq, Frame, RealComplex=2)
spec = torch.stft( spec = torch.stft(
y, y,
n_fft, n_fft,
@ -82,79 +87,44 @@ def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False)
return_complex=False, return_complex=False,
) )
# Linear-frequency Linear-amplitude spectrogram :: (B, Freq, Frame, RealComplex=2) -> (B, Freq, Frame)
spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6) spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6)
return spec return spec
def spec_to_mel_torch(spec, n_fft, num_mels, sampling_rate, fmin, fmax): def spec_to_mel_torch(spec, n_fft, num_mels, sampling_rate, fmin, fmax):
# MelBasis - Cache if needed
global mel_basis global mel_basis
dtype_device = str(spec.dtype) + "_" + str(spec.device) dtype_device = str(spec.dtype) + "_" + str(spec.device)
fmax_dtype_device = str(fmax) + "_" + dtype_device fmax_dtype_device = str(fmax) + "_" + dtype_device
if fmax_dtype_device not in mel_basis: if fmax_dtype_device not in mel_basis:
mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax) mel = librosa_mel_fn(
sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax
)
mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to( mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to(
dtype=spec.dtype, device=spec.device dtype=spec.dtype, device=spec.device
) )
spec = torch.matmul(mel_basis[fmax_dtype_device], spec)
spec = spectral_normalize_torch(spec) # Mel-frequency Log-amplitude spectrogram :: (B, Freq=num_mels, Frame)
return spec melspec = torch.matmul(mel_basis[fmax_dtype_device], spec)
melspec = spectral_normalize_torch(melspec)
return melspec
def mel_spectrogram_torch( def mel_spectrogram_torch(
y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False
): ):
if torch.min(y) < -1.0: """Convert waveform into Mel-frequency Log-amplitude spectrogram.
print("min value is ", torch.min(y))
if torch.max(y) > 1.0:
print("max value is ", torch.max(y))
global mel_basis, hann_window Args:
dtype_device = str(y.dtype) + "_" + str(y.device) y :: (B, T) - Waveforms
fmax_dtype_device = str(fmax) + "_" + dtype_device Returns:
wnsize_dtype_device = str(win_size) + "_" + dtype_device melspec :: (B, Freq, Frame) - Mel-frequency Log-amplitude spectrogram
if fmax_dtype_device not in mel_basis: """
mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax) # Linear-frequency Linear-amplitude spectrogram :: (B, T) -> (B, Freq, Frame)
mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to( spec = spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center)
dtype=y.dtype, device=y.device
)
if wnsize_dtype_device not in hann_window:
hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to(
dtype=y.dtype, device=y.device
)
y = torch.nn.functional.pad( # Mel-frequency Log-amplitude spectrogram :: (B, Freq, Frame) -> (B, Freq=num_mels, Frame)
y.unsqueeze(1), melspec = spec_to_mel_torch(spec, n_fft, num_mels, sampling_rate, fmin, fmax)
(int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)),
mode="reflect",
)
y = y.squeeze(1)
# spec = torch.stft( return melspec
# y,
# n_fft,
# hop_length=hop_size,
# win_length=win_size,
# window=hann_window[wnsize_dtype_device],
# center=center,
# pad_mode="reflect",
# normalized=False,
# onesided=True,
# )
spec = torch.stft(
y,
n_fft,
hop_length=hop_size,
win_length=win_size,
window=hann_window[wnsize_dtype_device],
center=center,
pad_mode="reflect",
normalized=False,
onesided=True,
return_complex=False,
)
spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6)
spec = torch.matmul(mel_basis[fmax_dtype_device], spec)
spec = spectral_normalize_torch(spec)
return spec
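After this refactor, mel_spectrogram_torch simply chains spectrogram_torch and spec_to_mel_torch. A rough usage sketch, assuming the functions can be imported from mel_processing; the n_fft/hop/mel values below are illustrative, not taken from the project configs:

import torch
from mel_processing import spectrogram_torch, spec_to_mel_torch, mel_spectrogram_torch

y = torch.randn(1, 16000).clamp(-1, 1)   # (B, T) waveform in [-1, 1]
spec = spectrogram_torch(y, n_fft=1024, sampling_rate=32000, hop_size=320, win_size=1024, center=False)
mel = spec_to_mel_torch(spec, n_fft=1024, num_mels=80, sampling_rate=32000, fmin=0.0, fmax=None)
# the one-call helper should produce the same (B, num_mels, frames) result
mel2 = mel_spectrogram_torch(y, 1024, 80, 32000, 320, 1024, 0.0, None, center=False)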

View File

@ -21,7 +21,7 @@ import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP from torch.nn.parallel import DistributedDataParallel as DDP
from torch.cuda.amp import autocast, GradScaler from torch.cuda.amp import autocast, GradScaler
from infer_pack import commons from infer_pack import commons
from time import sleep
from time import time as ttime from time import time as ttime
from data_utils import ( from data_utils import (
TextAudioLoaderMultiNSFsid, TextAudioLoaderMultiNSFsid,
@ -45,7 +45,7 @@ global_step = 0
def main(): def main():
# n_gpus = torch.cuda.device_count() # n_gpus = torch.cuda.device_count()
os.environ["MASTER_ADDR"] = "localhost" os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "5555" os.environ["MASTER_PORT"] = "51515"
mp.spawn( mp.spawn(
run, run,
@ -157,7 +157,7 @@ def run(rank, n_gpus, hps):
# epoch_str = 1 # epoch_str = 1
# global_step = 0 # global_step = 0
except: # 如果首次不能加载加载pretrain except: # 如果首次不能加载加载pretrain
traceback.print_exc() # traceback.print_exc()
epoch_str = 1 epoch_str = 1
global_step = 0 global_step = 0
if rank == 0: if rank == 0:
@ -230,39 +230,50 @@ def train_and_evaluate(
net_g.train() net_g.train()
net_d.train() net_d.train()
if cache == [] or hps.if_cache_data_in_gpu == False: # 第一个epoch把cache全部填满训练集
# print("caching") # Prepare data iterator
for batch_idx, info in enumerate(train_loader):
if hps.if_f0 == 1:
(
phone,
phone_lengths,
pitch,
pitchf,
spec,
spec_lengths,
wave,
wave_lengths,
sid,
) = info
else:
phone, phone_lengths, spec, spec_lengths, wave, wave_lengths, sid = info
if torch.cuda.is_available():
phone, phone_lengths = phone.cuda(
rank, non_blocking=True
), phone_lengths.cuda(rank, non_blocking=True)
if hps.if_f0 == 1:
pitch, pitchf = pitch.cuda(rank, non_blocking=True), pitchf.cuda(
rank, non_blocking=True
)
sid = sid.cuda(rank, non_blocking=True)
spec, spec_lengths = spec.cuda(
rank, non_blocking=True
), spec_lengths.cuda(rank, non_blocking=True)
wave, wave_lengths = wave.cuda(
rank, non_blocking=True
), wave_lengths.cuda(rank, non_blocking=True)
if hps.if_cache_data_in_gpu == True: if hps.if_cache_data_in_gpu == True:
# Use Cache
data_iterator = cache
if cache == []:
# Make new cache
for batch_idx, info in enumerate(train_loader):
# Unpack
if hps.if_f0 == 1:
(
phone,
phone_lengths,
pitch,
pitchf,
spec,
spec_lengths,
wave,
wave_lengths,
sid,
) = info
else:
(
phone,
phone_lengths,
spec,
spec_lengths,
wave,
wave_lengths,
sid,
) = info
# Load on CUDA
if torch.cuda.is_available():
phone = phone.cuda(rank, non_blocking=True)
phone_lengths = phone_lengths.cuda(rank, non_blocking=True)
if hps.if_f0 == 1:
pitch = pitch.cuda(rank, non_blocking=True)
pitchf = pitchf.cuda(rank, non_blocking=True)
sid = sid.cuda(rank, non_blocking=True)
spec = spec.cuda(rank, non_blocking=True)
spec_lengths = spec_lengths.cuda(rank, non_blocking=True)
wave = wave.cuda(rank, non_blocking=True)
wave_lengths = wave_lengths.cuda(rank, non_blocking=True)
# Cache on list
if hps.if_f0 == 1: if hps.if_f0 == 1:
cache.append( cache.append(
( (
@ -295,184 +306,17 @@ def train_and_evaluate(
), ),
) )
) )
with autocast(enabled=hps.train.fp16_run):
if hps.if_f0 == 1:
(
y_hat,
ids_slice,
x_mask,
z_mask,
(z, z_p, m_p, logs_p, m_q, logs_q),
) = net_g(
phone, phone_lengths, pitch, pitchf, spec, spec_lengths, sid
)
else: else:
( # Load shuffled cache
y_hat,
ids_slice,
x_mask,
z_mask,
(z, z_p, m_p, logs_p, m_q, logs_q),
) = net_g(phone, phone_lengths, spec, spec_lengths, sid)
mel = spec_to_mel_torch(
spec,
hps.data.filter_length,
hps.data.n_mel_channels,
hps.data.sampling_rate,
hps.data.mel_fmin,
hps.data.mel_fmax,
)
y_mel = commons.slice_segments(
mel, ids_slice, hps.train.segment_size // hps.data.hop_length
)
with autocast(enabled=False):
y_hat_mel = mel_spectrogram_torch(
y_hat.float().squeeze(1),
hps.data.filter_length,
hps.data.n_mel_channels,
hps.data.sampling_rate,
hps.data.hop_length,
hps.data.win_length,
hps.data.mel_fmin,
hps.data.mel_fmax,
)
if hps.train.fp16_run == True:
y_hat_mel = y_hat_mel.half()
wave = commons.slice_segments(
wave, ids_slice * hps.data.hop_length, hps.train.segment_size
) # slice
# Discriminator
y_d_hat_r, y_d_hat_g, _, _ = net_d(wave, y_hat.detach())
with autocast(enabled=False):
loss_disc, losses_disc_r, losses_disc_g = discriminator_loss(
y_d_hat_r, y_d_hat_g
)
optim_d.zero_grad()
scaler.scale(loss_disc).backward()
scaler.unscale_(optim_d)
grad_norm_d = commons.clip_grad_value_(net_d.parameters(), None)
scaler.step(optim_d)
with autocast(enabled=hps.train.fp16_run):
# Generator
y_d_hat_r, y_d_hat_g, fmap_r, fmap_g = net_d(wave, y_hat)
with autocast(enabled=False):
loss_mel = F.l1_loss(y_mel, y_hat_mel) * hps.train.c_mel
loss_kl = kl_loss(z_p, logs_q, m_p, logs_p, z_mask) * hps.train.c_kl
loss_fm = feature_loss(fmap_r, fmap_g)
loss_gen, losses_gen = generator_loss(y_d_hat_g)
loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl
optim_g.zero_grad()
scaler.scale(loss_gen_all).backward()
scaler.unscale_(optim_g)
grad_norm_g = commons.clip_grad_value_(net_g.parameters(), None)
scaler.step(optim_g)
scaler.update()
if rank == 0:
if global_step % hps.train.log_interval == 0:
lr = optim_g.param_groups[0]["lr"]
logger.info(
"Train Epoch: {} [{:.0f}%]".format(
epoch, 100.0 * batch_idx / len(train_loader)
)
)
# Amor For Tensorboard display
if loss_mel > 50:
loss_mel = 50
if loss_kl > 5:
loss_kl = 5
logger.info([global_step, lr])
logger.info(
f"loss_disc={loss_disc:.3f}, loss_gen={loss_gen:.3f}, loss_fm={loss_fm:.3f},loss_mel={loss_mel:.3f}, loss_kl={loss_kl:.3f}"
)
scalar_dict = {
"loss/g/total": loss_gen_all,
"loss/d/total": loss_disc,
"learning_rate": lr,
"grad_norm_d": grad_norm_d,
"grad_norm_g": grad_norm_g,
}
scalar_dict.update(
{
"loss/g/fm": loss_fm,
"loss/g/mel": loss_mel,
"loss/g/kl": loss_kl,
}
)
scalar_dict.update(
{"loss/g/{}".format(i): v for i, v in enumerate(losses_gen)}
)
scalar_dict.update(
{
"loss/d_r/{}".format(i): v
for i, v in enumerate(losses_disc_r)
}
)
scalar_dict.update(
{
"loss/d_g/{}".format(i): v
for i, v in enumerate(losses_disc_g)
}
)
image_dict = {
"slice/mel_org": utils.plot_spectrogram_to_numpy(
y_mel[0].data.cpu().numpy()
),
"slice/mel_gen": utils.plot_spectrogram_to_numpy(
y_hat_mel[0].data.cpu().numpy()
),
"all/mel": utils.plot_spectrogram_to_numpy(
mel[0].data.cpu().numpy()
),
}
utils.summarize(
writer=writer,
global_step=global_step,
images=image_dict,
scalars=scalar_dict,
)
global_step += 1
# if global_step % hps.train.eval_interval == 0:
if epoch % hps.save_every_epoch == 0 and rank == 0:
if hps.if_latest == 0:
utils.save_checkpoint(
net_g,
optim_g,
hps.train.learning_rate,
epoch,
os.path.join(hps.model_dir, "G_{}.pth".format(global_step)),
)
utils.save_checkpoint(
net_d,
optim_d,
hps.train.learning_rate,
epoch,
os.path.join(hps.model_dir, "D_{}.pth".format(global_step)),
)
else:
utils.save_checkpoint(
net_g,
optim_g,
hps.train.learning_rate,
epoch,
os.path.join(hps.model_dir, "G_{}.pth".format(2333333)),
)
utils.save_checkpoint(
net_d,
optim_d,
hps.train.learning_rate,
epoch,
os.path.join(hps.model_dir, "D_{}.pth".format(2333333)),
)
else: # 后续的epoch直接使用打乱的cache
shuffle(cache) shuffle(cache)
# print("using cache") else:
for batch_idx, info in cache: # Loader
data_iterator = enumerate(train_loader)
# Run steps
for batch_idx, info in data_iterator:
# Data
## Unpack
if hps.if_f0 == 1: if hps.if_f0 == 1:
( (
phone, phone,
@ -487,6 +331,20 @@ def train_and_evaluate(
) = info ) = info
else: else:
phone, phone_lengths, spec, spec_lengths, wave, wave_lengths, sid = info phone, phone_lengths, spec, spec_lengths, wave, wave_lengths, sid = info
## Load on CUDA
if (hps.if_cache_data_in_gpu == False) and torch.cuda.is_available():
phone = phone.cuda(rank, non_blocking=True)
phone_lengths = phone_lengths.cuda(rank, non_blocking=True)
if hps.if_f0 == 1:
pitch = pitch.cuda(rank, non_blocking=True)
pitchf = pitchf.cuda(rank, non_blocking=True)
sid = sid.cuda(rank, non_blocking=True)
spec = spec.cuda(rank, non_blocking=True)
spec_lengths = spec_lengths.cuda(rank, non_blocking=True)
wave = wave.cuda(rank, non_blocking=True)
wave_lengths = wave_lengths.cuda(rank, non_blocking=True)
# Calculate
with autocast(enabled=hps.train.fp16_run): with autocast(enabled=hps.train.fp16_run):
if hps.if_f0 == 1: if hps.if_f0 == 1:
( (
@ -495,9 +353,7 @@ def train_and_evaluate(
x_mask, x_mask,
z_mask, z_mask,
(z, z_p, m_p, logs_p, m_q, logs_q), (z, z_p, m_p, logs_p, m_q, logs_q),
) = net_g( ) = net_g(phone, phone_lengths, pitch, pitchf, spec, spec_lengths, sid)
phone, phone_lengths, pitch, pitchf, spec, spec_lengths, sid
)
else: else:
( (
y_hat, y_hat,
@ -552,7 +408,6 @@ def train_and_evaluate(
with autocast(enabled=False): with autocast(enabled=False):
loss_mel = F.l1_loss(y_mel, y_hat_mel) * hps.train.c_mel loss_mel = F.l1_loss(y_mel, y_hat_mel) * hps.train.c_mel
loss_kl = kl_loss(z_p, logs_q, m_p, logs_p, z_mask) * hps.train.c_kl loss_kl = kl_loss(z_p, logs_q, m_p, logs_p, z_mask) * hps.train.c_kl
loss_fm = feature_loss(fmap_r, fmap_g) loss_fm = feature_loss(fmap_r, fmap_g)
loss_gen, losses_gen = generator_loss(y_d_hat_g) loss_gen, losses_gen = generator_loss(y_d_hat_g)
loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl
@ -600,16 +455,10 @@ def train_and_evaluate(
{"loss/g/{}".format(i): v for i, v in enumerate(losses_gen)} {"loss/g/{}".format(i): v for i, v in enumerate(losses_gen)}
) )
scalar_dict.update( scalar_dict.update(
{ {"loss/d_r/{}".format(i): v for i, v in enumerate(losses_disc_r)}
"loss/d_r/{}".format(i): v
for i, v in enumerate(losses_disc_r)
}
) )
scalar_dict.update( scalar_dict.update(
{ {"loss/d_g/{}".format(i): v for i, v in enumerate(losses_disc_g)}
"loss/d_g/{}".format(i): v
for i, v in enumerate(losses_disc_g)
}
) )
image_dict = { image_dict = {
"slice/mel_org": utils.plot_spectrogram_to_numpy( "slice/mel_org": utils.plot_spectrogram_to_numpy(
@ -629,7 +478,8 @@ def train_and_evaluate(
scalars=scalar_dict, scalars=scalar_dict,
) )
global_step += 1 global_step += 1
# if global_step % hps.train.eval_interval == 0: # /Run steps
if epoch % hps.save_every_epoch == 0 and rank == 0: if epoch % hps.save_every_epoch == 0 and rank == 0:
if hps.if_latest == 0: if hps.if_latest == 0:
utils.save_checkpoint( utils.save_checkpoint(
@ -676,6 +526,7 @@ def train_and_evaluate(
"saving final ckpt:%s" "saving final ckpt:%s"
% (savee(ckpt, hps.sample_rate, hps.if_f0, hps.name, epoch)) % (savee(ckpt, hps.sample_rate, hps.if_f0, hps.name, epoch))
) )
sleep(1)
os._exit(2333333) os._exit(2333333)

View File

@ -1,4 +1,5 @@
import sys, os, multiprocessing import sys, os, multiprocessing
from scipy import signal
now_dir = os.getcwd() now_dir = os.getcwd()
sys.path.append(now_dir) sys.path.append(now_dir)
@ -31,13 +32,14 @@ class PreProcess:
def __init__(self, sr, exp_dir): def __init__(self, sr, exp_dir):
self.slicer = Slicer( self.slicer = Slicer(
sr=sr, sr=sr,
threshold=-32, threshold=-40,
min_length=800, min_length=800,
min_interval=400, min_interval=400,
hop_size=15, hop_size=15,
max_sil_kept=150, max_sil_kept=150,
) )
self.sr = sr self.sr = sr
self.bh, self.ah = signal.butter(N=5, Wn=48, btype="high", fs=self.sr)
self.per = 3.7 self.per = 3.7
self.overlap = 0.3 self.overlap = 0.3
self.tail = self.per + self.overlap self.tail = self.per + self.overlap
@ -57,18 +59,22 @@ class PreProcess:
wavfile.write( wavfile.write(
"%s/%s_%s.wav" % (self.gt_wavs_dir, idx0, idx1), "%s/%s_%s.wav" % (self.gt_wavs_dir, idx0, idx1),
self.sr, self.sr,
(tmp_audio * 32768).astype(np.int16), tmp_audio.astype(np.float32),
) )
tmp_audio = librosa.resample(tmp_audio, orig_sr=self.sr, target_sr=16000) tmp_audio = librosa.resample(tmp_audio, orig_sr=self.sr, target_sr=16000)#, res_type="soxr_vhq"
wavfile.write( wavfile.write(
"%s/%s_%s.wav" % (self.wavs16k_dir, idx0, idx1), "%s/%s_%s.wav" % (self.wavs16k_dir, idx0, idx1),
16000, 16000,
(tmp_audio * 32768).astype(np.int16), tmp_audio.astype(np.float32),
) )
def pipeline(self, path, idx0): def pipeline(self, path, idx0):
try: try:
audio = load_audio(path, self.sr) audio = load_audio(path, self.sr)
# zero-phase digital filtering causes pre-ringing noise...
# audio = signal.filtfilt(self.bh, self.ah, audio)
audio = signal.lfilter(self.bh, self.ah, audio)
idx1 = 0 idx1 = 0
for audio in self.slicer.slice(audio): for audio in self.slicer.slice(audio):
i = 0 i = 0
@ -81,6 +87,7 @@ class PreProcess:
idx1 += 1 idx1 += 1
else: else:
tmp_audio = audio[start:] tmp_audio = audio[start:]
idx1 += 1
break break
self.norm_write(tmp_audio, idx0, idx1) self.norm_write(tmp_audio, idx0, idx1)
println("%s->Suc." % path) println("%s->Suc." % path)
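The preprocessing pipeline now high-passes every input before slicing: a 5th-order Butterworth with a 48 Hz cutoff, applied causally with lfilter (the zero-phase filtfilt call is left commented out because its pre-ringing is audible). A standalone sketch of that filter, with an assumed 40 kHz sample rate:

import numpy as np
from scipy import signal

sr = 40000                                          # assumed sample rate; the class uses self.sr
bh, ah = signal.butter(N=5, Wn=48, btype="high", fs=sr)
audio = np.random.randn(2 * sr)                     # stand-in waveform
filtered = signal.lfilter(bh, ah, audio)            # causal high-pass: attenuates content below ~48 Hz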

View File

@ -4,7 +4,7 @@ from tqdm import tqdm
import json import json
def load_data(file_name: str = "./uvr5_pack/data.json") -> dict: def load_data(file_name: str = "./uvr5_pack/name_params.json") -> dict:
with open(file_name, "r") as f: with open(file_name, "r") as f:
data = json.load(f) data = json.load(f)

View File

@ -4,6 +4,9 @@ import torch.nn.functional as F
from config import x_pad, x_query, x_center, x_max from config import x_pad, x_query, x_center, x_max
import scipy.signal as signal import scipy.signal as signal
import pyworld, os, traceback, faiss import pyworld, os, traceback, faiss
from scipy import signal
bh, ah = signal.butter(N=5, Wn=48, btype="high", fs=16000)
class VC(object): class VC(object):
@ -189,6 +192,7 @@ class VC(object):
index = big_npy = None index = big_npy = None
else: else:
index = big_npy = None index = big_npy = None
audio = signal.filtfilt(bh, ah, audio)
audio_pad = np.pad(audio, (self.window // 2, self.window // 2), mode="reflect") audio_pad = np.pad(audio, (self.window // 2, self.window // 2), mode="reflect")
opt_ts = [] opt_ts = []
if audio_pad.shape[0] > self.t_max: if audio_pad.shape[0] > self.t_max: