Gender recognition by voice is a technique in which you can determine the gender category of a speak

Vishwanath Akuthota
Oct 25, 2022
Updated: May 28, 2024

Gender recognition can be helpful in many fields, including automatic speech recognition, where it can help improve system performance. It can also be used to categorise calls by gender, or added as a feature to a virtual assistant so that it can distinguish the talker's gender.

Here is the table of contents:

Preparing the Dataset
Building the Model
Training the Model
Testing the Model
Testing the Model with your own voice

Preparing the Dataset for Gender recognition

We won't be using raw audio data since audio samples can be of any length and can be problematic in terms of noise. As a result, we need to perform some feature extraction before feeding it into the neural network.

Feature extraction is typically the first phase of a speech analysis task. It takes audio of any length as input and outputs a fixed-length vector that is suitable for classification. Common feature extraction methods for audio include MFCCs (Mel-frequency cepstral coefficients) and the Mel Spectrogram.
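
To make the fixed-length idea concrete, here is a minimal sketch using NumPy only. It assumes a spectrogram-like matrix with 128 frequency bands and one column per time frame (random numbers stand in for real audio), and it averages over the time axis. The helper name to_fixed_length is hypothetical, and averaging is just one common way to collapse the time axis, not necessarily the exact step used to build this dataset.

```python
import numpy as np

def to_fixed_length(spectrogram):
    """Collapse a (n_bands, n_frames) matrix to a (n_bands,) vector by averaging over time."""
    return spectrogram.mean(axis=1)

rng = np.random.default_rng(0)
# two fake "clips" of different duration: 40 frames and 250 frames, 128 bands each
short_clip = rng.random((128, 40))
long_clip = rng.random((128, 250))

short_vec = to_fixed_length(short_clip)
long_vec = to_fixed_length(long_clip)
print(short_vec.shape, long_vec.shape)  # (128,) (128,)
```

Both clips end up as vectors of length 128 regardless of their duration, which is what lets a dense network with a fixed input size consume them.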

We'll be using Mozilla's Common Voice Dataset, a corpus of speech data read by users on the Common Voice website. Its purpose is to enable the training and testing of automatic speech recognition. However, when I looked at the dataset, I found that many of the samples are labeled in the gender column. Therefore, we can extract these labeled samples and perform gender recognition.

Here is what I did to prepare the dataset for gender recognition:

First, I kept only the samples that are labeled in the gender field.
After that, I balanced the dataset so that the number of female samples equals the number of male samples; this helps keep the neural network from becoming biased toward one gender.
Finally, I used the Mel Spectrogram extraction technique to get a vector of length 128 from each voice sample.
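
The filtering and balancing steps above can be sketched with pandas. This is an illustration on a tiny made-up table, not the script used to build the dataset; the file names and the helper balance_by_gender are assumptions, and downsampling the larger class is only one way to balance.

```python
import pandas as pd

def balance_by_gender(df, seed=7):
    """Keep only male/female rows, then downsample each class to the size of the smaller one."""
    labeled = df[df["gender"].isin(["male", "female"])]
    n_per_class = labeled["gender"].value_counts().min()
    return labeled.groupby("gender").sample(n=n_per_class, random_state=seed)

# made-up metadata: 4 male, 2 female, 2 unlabeled rows
raw = pd.DataFrame({
    "filename": [f"clip-{i}.mp3" for i in range(8)],
    "gender": ["male", "male", "female", None, "male", "female", None, "male"],
})
balanced = balance_by_gender(raw)
print(balanced["gender"].value_counts().to_dict())
```

On this toy table the unlabeled rows are dropped and two of the four male rows are discarded, leaving two samples per class.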

You can look at the prepared dataset for this tutorial in this repository: https://github.com/vishwachintu/Gender-recgo

To get started, install the following libraries using pip:

pip install numpy pandas tqdm scikit-learn tensorflow pyaudio librosa

To follow along, open up a new notebook and import the modules we are going to need:

import pandas as pd
import numpy as np
import os
import tqdm
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard, EarlyStopping
from sklearn.model_selection import train_test_split

To get the gender of each sample, we use a CSV metadata file (balanced-all.csv) that links each audio sample's file path to its gender:

df = pd.read_csv("balanced-all.csv")
df.head()
                                
                                filename  gender
0  data/cv-other-train/sample-069205.npy  female
1  data/cv-valid-train/sample-063134.npy  female
2  data/cv-other-train/sample-080873.npy  female
3  data/cv-other-train/sample-105595.npy  female
4  data/cv-valid-train/sample-144613.npy  female

Let's see the number of samples of each gender:

# get total samples
n_samples = len(df)
# get total male samples
n_male_samples = len(df[df['gender'] == 'male'])
# get total female samples
n_female_samples = len(df[df['gender'] == 'female'])
print("Total samples:", n_samples)
print("Total male samples:", n_male_samples)
print("Total female samples:", n_female_samples)

Output:

Total samples: 66938
Total male samples: 33469
Total female samples: 33469

Perfect, we have a large number of balanced audio samples. The following function loads all the files into a single array; we don't need a generator mechanism because everything fits in memory (each audio sample is stored only as its extracted feature vector, about 1KB in size):

label2int = {
    "male": 1,
    "female": 0
}

def load_data(vector_length=128):
    """A function to load gender recognition dataset from `data` folder
    From the second run onward, this loads from results/features.npy and results/labels.npy
    instead, which is much faster."""
    # make sure results folder exists
    os.makedirs("results", exist_ok=True)
    # if features & labels were already bundled on a previous run, load them from there instead
    if os.path.isfile("results/features.npy") and os.path.isfile("results/labels.npy"):
        X = np.load("results/features.npy")
        y = np.load("results/labels.npy")
        return X, y
    # read dataframe
    df = pd.read_csv("balanced-all.csv")
    # get total samples
    n_samples = len(df)
    # get total male samples
    n_male_samples = len(df[df['gender'] == 'male'])
    # get total female samples
    n_female_samples = len(df[df['gender'] == 'female'])
    print("Total samples:", n_samples)
    print("Total male samples:", n_male_samples)
    print("Total female samples:", n_female_samples)
    # initialize an empty array for all audio features
    X = np.zeros((n_samples, vector_length))
    # initialize an empty array for all audio labels (1 for male and 0 for female)
    y = np.zeros((n_samples, 1))
    for i, (filename, gender) in tqdm.tqdm(enumerate(zip(df['filename'], df['gender'])), "Loading data", total=n_samples):
        features = np.load(filename)
        X[i] = features
        y[i] = label2int[gender]
    # save the audio features and labels into files
    # so we won't load each one of them next run
    np.save("results/features", X)
    np.save("results/labels", y)
    return X, y

The above function is responsible for reading that CSV file and loading all audio samples into a single array. This takes some time the first time you run it, but it saves the bundled arrays in the results folder, which saves time on later runs.
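
As a rough check that the whole feature array really fits in memory, the arithmetic can be worked out directly. The sketch below assumes the array keeps the float64 default of np.zeros (8 bytes per value) and uses the sample count printed earlier.

```python
n_samples = 66938        # total samples reported above
vector_length = 128      # features per sample
bytes_per_value = 8      # np.zeros defaults to float64

bytes_per_sample = vector_length * bytes_per_value
total_bytes = n_samples * bytes_per_sample

print(bytes_per_sample)                    # 1024 bytes, i.e. 1KB per sample
print(round(total_bytes / 1024 ** 2, 1))   # 65.4 (MiB) for the whole feature array
```

At roughly 65 MiB, the bundled array is small enough to load in one go on most machines.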

The label2int dictionary simply maps each gender to an integer value; we need it in the load_data() function to translate string labels to integer labels.
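
Going the other way is useful when you want to turn a predicted probability back into a readable label. Here is a small sketch; the int2label dictionary, the decode helper, and the 0.5 threshold are assumptions for illustration and are not part of the code above.

```python
label2int = {"male": 1, "female": 0}
# inverse lookup, built from the mapping above
int2label = {value: key for key, value in label2int.items()}

def decode(probability, threshold=0.5):
    """Map a probability of the 'male' class (label 1) to a string label."""
    return int2label[int(probability >= threshold)]

print(decode(0.92), decode(0.08))  # male female
```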

Now, this is a single array, but we need to split our dataset into training, validation, and testing sets. The function below does that:

def split_data(X, y, test_size=0.1, valid_size=0.1):
    # split training set and testing set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=7)
    # split training set and validation set
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=valid_size, random_state=7)
    # return a dictionary of values
    return {
        "X_train": X_train,
        "X_valid": X_valid,
        "X_test": X_test,
        "y_train": y_train,
        "y_valid": y_valid,
        "y_test": y_test
    }

We're using sklearn's convenient train_test_split() function, which shuffles our dataset and splits it into training and testing sets. We then run it again on the training set to carve out the validation set, so valid_size is a fraction of the remaining training data rather than of the whole dataset. Let's use these functions:

# load the dataset
X, y = load_data()
# split the data into training, validation and testing sets
data = split_data(X, y, test_size=0.1, valid_size=0.1)
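
Because the second split runs on what is left after the first, the three sets are not 80/10/10. The sketch below works out the expected sizes in plain Python. It assumes the held-out count is rounded up at each split, which is my understanding of how train_test_split treats a fractional test_size; the helper split_sizes is hypothetical.

```python
import math

def split_sizes(n_samples, test_size=0.1, valid_size=0.1):
    """Expected (train, valid, test) counts for two chained fractional splits.

    Assumption: the held-out count is rounded up at each split.
    """
    n_test = math.ceil(n_samples * test_size)
    n_remaining = n_samples - n_test
    n_valid = math.ceil(n_remaining * valid_size)
    n_train = n_remaining - n_valid
    return n_train, n_valid, n_test

n_train, n_valid, n_test = split_sizes(66938)
print(n_train, n_valid, n_test)  # 54219 6025 6694
```

That is roughly 81% for training, 9% for validation, and 10% for testing. You can compare these numbers against the shapes of the arrays in the data dictionary.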

Building the Model

For this tutorial, we are going to use a deep feed-forward neural network with 5 hidden layers. It isn't the perfect architecture, but it does the job:

def create_model(vector_length=128):
    """5 hidden dense layers from 256 units to 64, not the best model."""
    model = Sequential()
    model.add(Dense(256, input_shape=(vector_length,)))
    model.add(Dropout(0.3))
    model.add(Dense(256, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(128, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(128, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(64, activation="relu"))
    model.add(Dropout(0.3))
    # one output neuron with sigmoid activation function, 0 means female, 1 means male
    model.add(Dense(1, activation="sigmoid"))
    # using binary crossentropy as it's male/female classification (binary)
    model.compile(loss="binary_crossentropy", metrics=["accuracy"], optimizer="adam")
    # print summary of the model
    model.summary()
    return model
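
As a sanity check on what model.summary() prints, the number of trainable parameters can be worked out by hand. A dense layer has one weight per input-unit pair plus one bias per unit, and dropout layers add no parameters. The helper dense_params below is hypothetical and only does the arithmetic; it does not build the Keras model.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return (n_inputs + 1) * n_units

# input size, five hidden layers, one output neuron
layer_sizes = [128, 256, 256, 128, 128, 64, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer)       # [33024, 65792, 32896, 16512, 8256, 65]
print(sum(per_layer))  # 156545
```

If the total reported by model.summary() differs, the layer sizes or the input vector length are probably not what you expect.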
