Python Project: Audio Transcription and Text-to-Speech Conversion Using Wav2Vec2 and Pyttsx3

This Python project combines speech-to-text transcription and text-to-speech synthesis using Librosa, SciPy, PyTorch, and Hugging Face's Transformers library. The script loads an audio file, resamples it to 16 kHz, transcribes the speech with Facebook's Wav2Vec2 model (facebook/wav2vec2-base-960h), and then speaks text aloud with pyttsx3, with a selectable voice and speaking rate. It is a good starting point for anyone interested in speech processing, voice technology, or natural language processing in Python.



import librosa

from scipy.signal import resample

import torch

from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

import pyttsx3


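# Load the audio file at its native sample rate (sr=None disables librosa's default resampling)

audio_file = "path/to/audio.wav"  # replace with the path to your audio file

audio, sr = librosa.load(audio_file, sr=None)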



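def resample_audio(audio, orig_sr, target_sr):

    """Resample a 1-D signal from orig_sr to target_sr using SciPy's FFT-based resample."""

    duration = audio.shape[0] / orig_sr

    target_length = int(duration * target_sr)

    # Cast back to the input dtype (librosa loads float32) so the model receives float32 input

    resampled_audio = resample(audio, target_length).astype(audio.dtype)

    return resampled_audio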


# Example usage:

# resampled_audio = resample_audio(audio, 48000, 16000)


# Resample if necessary

if sr != 16000:

    audio = resample_audio(audio, sr, 16000)

    sr = 16000


print(f"Audio loaded; sample rate is now {sr} Hz.")
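
The FFT-based `resample` call above takes a target length in samples. As an alternative sketch, SciPy also provides `scipy.signal.resample_poly`, which takes integer `up` and `down` factors instead. The helper below (`resample_ratio` is a hypothetical name, not part of the original script) only computes those factors with the standard library; the commented lines show how they might be passed to `resample_poly`.

```python
from math import gcd


def resample_ratio(orig_sr, target_sr):
    """Return the smallest integer (up, down) pair with up / down == target_sr / orig_sr."""
    divisor = gcd(orig_sr, target_sr)
    return target_sr // divisor, orig_sr // divisor


up, down = resample_ratio(48000, 16000)
print(up, down)  # 1 3

# Assumed usage with SciPy (not run here):
# from scipy.signal import resample_poly
# audio_16k = resample_poly(audio, up, down)
```

For a 44.1 kHz recording the same helper gives the factors 160 and 441.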


# Load Wav2Vec2 model and tokenizer (Wav2Vec2Tokenizer is deprecated in recent Transformers releases in favor of Wav2Vec2Processor)

tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")


print("Model loaded successfully.")


# Normalize the waveform and convert it to a PyTorch tensor (the model expects 16 kHz mono audio)

input_values = tokenizer(audio, return_tensors="pt").input_values


# Perform inference

with torch.no_grad():

    logits = model(input_values).logits


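# Greedy CTC decoding: take the most likely token id at each time step

predicted_ids = torch.argmax(logits, dim=-1)


# Decode the ids to text (collapses repeated tokens and removes CTC blank tokens)

transcription = tokenizer.batch_decode(predicted_ids)[0]

print("Transcription:", transcription)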
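
To make the decoding step less of a black box, here is a minimal pure-Python sketch of greedy CTC decoding: repeated ids are collapsed, blank tokens are dropped, and the word delimiter becomes a space. The function name `ctc_greedy_decode` and the four-entry vocabulary are hypothetical and chosen only for illustration; the real model's vocabulary is larger, and `batch_decode` handles these steps for you.

```python
def ctc_greedy_decode(ids, vocab, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and map the word delimiter to a space."""
    tokens = []
    previous = None
    for token_id in ids:
        # Keep a token only when it differs from the previous time step
        # and is not the CTC blank
        if token_id != previous and token_id != blank_id:
            tokens.append(vocab[token_id])
        previous = token_id
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary: blank, word delimiter, and two letters
toy_vocab = ["<pad>", "|", "H", "I"]
frame_ids = [2, 2, 0, 3, 3, 1, 1, 2, 0, 3]
print(ctc_greedy_decode(frame_ids, toy_vocab))  # HI HI
```

Note that a blank between two identical ids keeps both letters, which is how CTC represents doubled characters.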


# Text-to-Speech

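def text_to_speech(text, voice_gender='female', rate=150):

    """Speak text aloud with pyttsx3 using the requested voice and speaking rate."""

    engine = pyttsx3.init()

    voices = engine.getProperty('voices')

    # The order of installed voices is platform-dependent; this assumes
    # index 0 is a male voice and index 1 a female one.

    index = 0 if voice_gender == 'male' else 1

    if voices:

        # Clamp the index so a system with a single voice does not raise IndexError

        engine.setProperty('voice', voices[min(index, len(voices) - 1)].id)

    engine.setProperty('rate', rate)

    engine.say(text)

    engine.runAndWait()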


# Example usage

long_text = "hi what happened"

text_to_speech(long_text, voice_gender='male', rate=150)  # For male voice

text_to_speech(long_text, voice_gender='female', rate=180)  # For female voice


print("Text-to-Speech conversion completed.")
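
Because the position of a voice in the list varies between systems, selecting by name can be more robust than selecting by index. The sketch below assumes each voice object exposes `id` and `name` attributes, as pyttsx3 voices do; `pick_voice_id` is a hypothetical helper, and the "Alpha" and "Beta" voices are stand-ins so the example runs without a speech engine.

```python
from types import SimpleNamespace


def pick_voice_id(voices, keyword, default_index=0):
    """Return the id of the first voice whose name contains keyword (case-insensitive)."""
    for voice in voices:
        if keyword.lower() in (voice.name or "").lower():
            return voice.id
    # Fall back to a default voice when nothing matches; None if no voices exist
    return voices[default_index].id if voices else None


# Hypothetical stand-ins for the objects returned by engine.getProperty('voices')
fake_voices = [
    SimpleNamespace(id="voice-0", name="Example Voice Alpha"),
    SimpleNamespace(id="voice-1", name="Example Voice Beta"),
]
print(pick_voice_id(fake_voices, "beta"))   # voice-1
print(pick_voice_id(fake_voices, "gamma"))  # voice-0
```

With a real engine, the returned id could be passed to `engine.setProperty('voice', ...)` after checking that it is not `None`.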



#PythonProject
#AudioTranscription
#TextToSpeech
#Wav2Vec2
#PyTorch
#Librosa
#NLP
#SpeechRecognition
#VoiceSynthesis
#AIinPython
#NaturalLanguageProcessing
#SpeechToText
#Pyttsx3
#MachineLearning
#DeepLearning
#AudioProcessing
#PythonAI
#TransformersLibrary
#PythonCoding
#PythonTutorial
