A Python implementation of “Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer” [TASLP 2024]

Audio-WestlakeU/SAR-SSL

SAR-SSL

A Python implementation of “Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer”, IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 2024.

  • Contributions
    • Self-supervised learning of spatial acoustic representation (SSL-SAR)

      • first self-supervised learning method in spatial acoustic representation learning and multi-channel audio signal processing
      • designs cross-channel signal reconstruction pretext task to learn the spatial acoustic and the spectral pattern information
      • learns useful knowledge that can be transferred to the spatial acoustics-related tasks
    • Multi-channel audio Conformer (MC-Conformer)

      • unified architecture for both the pretext and downstream tasks
      • learns the local and global properties of spatial acoustics present in the time-frequency domain
      • boosts the performance of both pretext and downstream tasks

Datasets

  • Source signals: from WSJ0 database
  • Simulated RIRs: generated by gpuRIR toolbox
  • Simulated noise: generated by arbitrary noise field generator
  • Real-world RIRs or microphone signals: from MIR, MeshRIR, DCASE, dEchorate, BUTReverb, ACE, LOCATA, MC-WSJ-AV, LibriCSS, AMIMeeting, AISHELL-4, AliMeeting, RealMAN databases
    | Datasets | #Room | Microphone Array | #Mic. Pair | #Room x #Source position x #Array position | Noise Type |
    |---|---|---|---|---|---|
    | MIR | 3 | Three 8-channel linear arrays | 60 | 3 x 26 x 1 | W/o |
    | MeshRIR | 1 | 441 microphones | 8874 | 1 x 32 x 1 | W/o |
    | DCASE | 9 | A 4-channel tetrahedral array (EM32) | 3 | 38530 | Ambience |
    | dEchorate | 11 | Six 5-channel linear arrays | 48 | 11 x 3 x 1 | Ambience, babble, white |
    | BUTReverb | 9 | An 8-channel spherical array | 28 | 51 | Ambience |
    | ACE | 7 | A 2-channel array (Chromebook), a 3-channel right-angled triangle array (Mobile), an 8-channel linear array (Lin8Ch), a 32-channel spherical array (EM32) | 433 | 7 x 1 x 2 | Ambience, babble, fan |
    | LOCATA | 1 | A 15-channel linear array (DICIT), a 12-channel robot array (Robot head), a 32-channel spherical array (Eigenmike) | 492 | Moving/static | Ambience |
    | MC-WSJ-AV | 3 | Two 8-channel linear arrays | | | |
    | LibriCSS | 1 | A 7-channel circular array | | | |
    | AMIMeeting | 3 | An 8-channel circular array | | | |
    | AISHELL-4 | 10 | An 8-channel circular array | | | |
    | AliMeeting | 21 | An 8-channel circular array | | | |
    | RealMAN | 32 | A 32-channel high-precision array | | | |

Quick start

Version update

  • code: 202407, the results are still being tested (to be updated).
  • code_v1: 202402, the results are the same as the paper.

Data generation

1. Download the datasets into folders arranged according to the following directory structure

.-SAR-SSL
| .-code
| .-data
| .-exp
.-data
  .-SrcSig
  | .-wsj0
  |   .-dt
  |   .-et
  |   .-tr
  .-RIR
  | .-Mesh
  | | .-S32-M441_npy
  | .-MIRDB
  | | .-Impulse_response_Acoustic_Lab_Bar-Ilan_University
  | .-DCASE
  | | .-TAU-SRIR_DB
  | | .-TAU-SNoise_DB
  | .-dEchorate
  | | .-dEchorate_database.csv
  | | .-dEchorate_rir.h5
  | | .-dEchorate_annotations.h5
  | | .-dEchorate_noise_gzip7.hdf5
  | | .-dEchorate_babble_gzip7.hdf5
  | | .-dEchorate_silence_gzip7.hdf5
  | .-BUTReverb
  | | .-RIRs
  | .-ACE
  |   .-RIRN
  |   .-Data
  .-MicSig
    .-LOCATA
      .-dev
      .-eval
    .-MC_WSJ_AV
    .-LibriCSS
    .-AMIMeeting
    .-AISHELL-4
    .-AliMeeting
    .-RealMAN
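
Before running the generation scripts, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_dirs` and the list `EXPECTED_DIRS` are hypothetical, and the paths are simply the directory entries of the tree above, relative to the top-level `data` folder (the dEchorate files are not checked individually).

```python
from pathlib import Path

# Directory entries copied from the tree above, relative to the data root.
# EXPECTED_DIRS and missing_dirs are illustrative names, not repository code.
EXPECTED_DIRS = [
    "SrcSig/wsj0/dt",
    "SrcSig/wsj0/et",
    "SrcSig/wsj0/tr",
    "RIR/Mesh/S32-M441_npy",
    "RIR/MIRDB/Impulse_response_Acoustic_Lab_Bar-Ilan_University",
    "RIR/DCASE/TAU-SRIR_DB",
    "RIR/DCASE/TAU-SNoise_DB",
    "RIR/dEchorate",
    "RIR/BUTReverb/RIRs",
    "RIR/ACE/RIRN",
    "RIR/ACE/Data",
    "MicSig/LOCATA/dev",
    "MicSig/LOCATA/eval",
    "MicSig/MC_WSJ_AV",
    "MicSig/LibriCSS",
    "MicSig/AMIMeeting",
    "MicSig/AISHELL-4",
    "MicSig/AliMeeting",
    "MicSig/RealMAN",
]


def missing_dirs(data_root):
    """Return the expected sub-directories that do not exist under data_root."""
    root = Path(data_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

Calling `missing_dirs("path/to/data")` with the actual location of the data root returns an empty list when every folder is present.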

2. Generate room impulse responses or microphone signals

  • Data for simulated experiments

    • pre-training
      python gen_simu.py --mode sig --stage pretrain --data_num 512000 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0,1]
      python gen_simu.py --mode sig --stage preval --data_num 4000 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0]
      python gen_simu.py --mode sig --stage pretest --data_num 4000 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0]
      
    • some test instances
      python gen_simu.py --mode sig --stage pretest_ins_T1000 --data_num 10 --room_sz_range [[5,10],[3,6],[2.5,3]] --T60_range [1.0,1.0] --snr_range [20,20] --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0]
      
    • downstream training
      python gen_simu_certain_room.py --mode sig --stage train --room_num 1000 --sig_num_each_rir 2 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu_ds 
      python gen_simu_certain_room.py --mode sig --stage val --room_num 20 --sig_num_each_rir 1 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu_ds 
      python gen_simu_certain_room.py --mode sig --stage test --room_num 20 --sig_num_each_rir 4 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu_ds 
      
  • Data for real-world experiments

    • real-world RIR and noise signals
      python gen_real_rir.py --dataset DCASE dEchorate BUTReverb ACE --data_type rir noise --read_dir ../../../data/RIR --save_dir ../../data/RIR/real
      python gen_real_rir.py --dataset Mesh MIR --data_type rir --read_dir ../../../data/RIR --save_dir ../../data/RIR/real
      
    • microphone signals for pre-training with selected RIRs and noise signals
      python gen_sig_from_real_rir.py --stage pretrain --dataset Mesh MIR DCASE dEchorate BUTReverb ACE --src_dir ../../../data/SrcSig/wsj0 --rir_dir ../../../data/RIR/real --save_dir ../../data/MicSig/real 
      python gen_sig_from_real_rir.py --stage preval --dataset DCASE BUTReverb --src_dir ../../../data/SrcSig/wsj0 --rir_dir ../../../data/RIR/real --save_dir ../../data/MicSig/real  
      
    • LOCATA microphone signals for downstream training (TDOA estimation)
      python gen_LOCATA.py --stage train --save-to ../../data/MicSig/real_ds_locata
      python gen_LOCATA.py --stage val --save-to ../../data/MicSig/real_ds_locata
      python gen_LOCATA.py --stage test --save-to ../../data/MicSig/real_ds_locata
      
    • additional RIRs for downstream training
      python gen_simu_certain_room.py --mode rir --stage train --room_num 1000 --save_to ../../data/RIR/simu 
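
As a rough illustration of what generating microphone signals from RIRs and noise involves, the sketch below convolves a source signal with one RIR per microphone and adds noise scaled to a target SNR. It is not the repository's implementation: the function name, the argument layout, and the SNR definition (signal and noise power averaged over all channels) are assumptions made for this example.

```python
import numpy as np


def simulate_mic_signals(src, rirs, noise, snr_db):
    """Sketch of multi-channel signal simulation (illustrative, not repository code).

    src:   source signal, shape (n_samples,)
    rirs:  one RIR per microphone, shape (n_mics, rir_len)
    noise: noise signals, shape (n_mics, n_noise) with n_noise >= n_samples + rir_len - 1
    Returns (mixture, clean), both of shape (n_mics, n_samples + rir_len - 1).
    """
    # Reverberant (noise-free) signal at each microphone.
    clean = np.stack([np.convolve(src, h) for h in rirs])
    if noise.shape[1] < clean.shape[1]:
        raise ValueError("noise is shorter than the reverberant signal")
    noise = noise[:, : clean.shape[1]]

    # Scale the noise so that the power ratio matches the requested SNR.
    sig_pow = np.mean(clean ** 2)
    noise_pow = np.mean(noise ** 2)
    gain = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return clean + gain * noise, clean
```

A single gain is applied to all channels here so that the relative noise levels between microphones are preserved; the actual scripts may handle this differently.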
      

Pretext Task

1. Preparation

  • Install: numpy, scipy, soundfile, gpuRIR, etc.

2. Training

  • Simulated experiments

    • Pretext task: pre-training

      python run_pretrain.py --pretrain --simu-exp --gpu-id 0,
      
    • Pretext task: evaluation

      # Replace * with the time stamp (version) of the pre-trained model
      # --test-mode: all or ins
      python run_pretrain.py --test --simu-exp --time * --test-mode all --gpu-id 0, 
      
    • Downstream task: training

      # --ds-nsimroom: 2, 4, 8, 16, 32, 64, 128 or 256
      # --ds-task: TDOA, DRR, T60, C50 or ABS
      # --ds-trainmode: finetune, scratchLOW or lineareval
      python run_downstream.py --ds-train --ds-trainmode finetune --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0, 
      
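      | Stage | Trials | nRooms | nRIRs/Room | nSrcSig/RIR | nMicSig |
      |---|---|---|---|---|---|
      | train | x16 | 2 | 50 | 2 | 200 |
      | train | x8 | 4 | 50 | 2 | 400 |
      | train | x4 | 8 | 50 | 2 | 800 |
      | train | x2 | 16 | 50 | 2 | 1600 |
      | train | x1 | 32 | 50 | 2 | 3200 |
      | train | x1 | 64 | 50 | 2 | 6400 |
      | train | x1 | 128 | 50 | 2 | 12800 |
      | train | x1 | 256 | 50 | 2 | 25600 |
      | val | - | 20 | 50 | 1 | 1000 |
      | test | - | 20 | 50 | 4 | 4000 |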
    • Downstream task: evaluation

      # --ds-nsimroom: 2, 4, 8, 16, 32, 64, 128 or 256
      # --ds-task: TDOA, DRR, T60, C50, or ABS
      # --ds-trainmode: finetune, scratchLOW or lineareval
      # --test_mode: cal_metric, cal_metric_wo_info or vis_embed
      python run_downstream.py --ds-test --test_mode cal_metric --ds-trainmode finetune --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0, 
      
  • Real-world experiments

    • Pretext task: pre-training

      When using real-world data, first train on simulated data with the default cosine-decay learning rate (initialized to 0.001), then fine-tune on real-world data with a learning rate of 0.0001.

      python run_pretrain.py --pretrain --gpu-id 0, 
      
    • Downstream task: training

      # --ds-task: TDOA, DRR, T60, C50 or ABS
      # --ds-trainmode: finetune, scratchLOW or lineareval
      # --ds-real-sim-ratio: 1 1, 1 0 or 0 1
      python run_downstream.py --ds-train --ds-trainmode finetune --ds-real-sim-ratio 1 0 --ds-task TDOA --time * --gpu-id 0, 
      python run_downstream.py --ds-train --ds-trainmode scratchLOW --ds-real-sim-ratio 1 0 --ds-task TDOA --time * --gpu-id 0, 
      
    • Downstream task: read downstream results (MAEs of TDOA, DRR, T60, C50, SNR, ABS estimation) from saved mat files

      python read_result_from_downstream_matfile.py --time *
      python read_lossmetric_simdata_from_logfile.py
      python read_lossmetric_realdata_from_logfile.py
      
  • Trained models

    • pretext task
      • best_model.tar
    • downstream task
      • ensemble_model.tar
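
The simulated downstream experiments use several training-set sizes, and in the table of training configurations nMicSig is the product nRooms x nRIRs/Room x nSrcSig/RIR. The sketch below reproduces that arithmetic and builds the corresponding `run_downstream.py` command lines. The helper names are hypothetical; the flags and values are taken from the commands and table in this README. How the repeated trials (x16, x8, ...) are launched is not stated here, so the sketch only prints one command per room count.

```python
# Values copied from the downstream training table in this README.
N_SIM_ROOMS = [2, 4, 8, 16, 32, 64, 128, 256]
N_RIRS_PER_ROOM = 50
N_SRC_PER_RIR = 2  # training stage


def n_mic_sig(n_rooms):
    """Number of training microphone signals for a given number of rooms."""
    return n_rooms * N_RIRS_PER_ROOM * N_SRC_PER_RIR


def downstream_train_cmd(n_rooms, task="TDOA", trainmode="finetune", time_tag="*"):
    """Build the training command; time_tag stands for the pre-trained model version."""
    return (
        f"python run_downstream.py --ds-train --ds-trainmode {trainmode} "
        f"--simu-exp --ds-nsimroom {n_rooms} --ds-task {task} "
        f"--time {time_tag} --gpu-id 0,"
    )


for n in N_SIM_ROOMS:
    print(n_mic_sig(n), downstream_train_cmd(n))
```

The commands are printed rather than executed so that they can be inspected, or adapted to a job scheduler, before any training is started.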

Others

If "OSError: [Errno 24] Too many open files" occurs, raise the open-file limit by running the following in the shell:

ulimit -n 2048
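
On Unix-like systems the soft limit can also be raised from inside a Python process with the standard-library `resource` module. This is only a sketch of an alternative and not something the repository scripts are known to do; the function name `raise_open_file_limit` is hypothetical. Unlike `ulimit -n 2048`, it changes only the soft limit of the current process and never lowers an existing limit.

```python
import resource


def raise_open_file_limit(target=2048):
    """Raise the soft open-file limit to `target` (capped by the hard limit).

    Returns the soft limit in effect afterwards. Unix only.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Nothing to do if the current soft limit is already high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target
```

The call would have to run before the data loaders are created, for example at the top of a training script.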

Citation

If you find our work useful in your research, please consider citing:

@Article{yang2024sarssl,
    Author = "Bing Yang and Xiaofei Li",
    Title = "Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer",
    Journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)",
    Volume = "32",
    Pages = "4211--4225",
    Year = "2024"}

Licence

MIT
