Back with another Tinker Tuesday! Today we're building a speech-to-text transcription service for audio file uploads with Python and Flask, using the SpeechRecognition module. This is one of my favorite projects: I love working with audio, and it's a unique one to add to your portfolio!

Here's a roadmap for today's project:

  1. We'll learn how to use the SpeechRecognition module in Python
  2. Then, we'll use Flask to take in an audio file, handling both GET and POST requests on the same route
  3. Finally, we'll render the transcribed results of the speech file to the user.

Before we begin, I want to mention that the guide below is an abridged version of the free video tutorial. You can find more free courses and projects on my website, TheCodex, to learn how to design and build applications. You can find all the code for this project at my GitHub Repo here.

Our final result!

Step 1: Getting the Audio File Input in Flask

The first step in this project is to build a simple Flask web application that takes in an input audio file from the user. Let's go ahead and initialize an empty project (PyCharm is my preference) and then create our Flask file, app.py.

For now, our app.py should just contain the simple Flask structure with one route, our home page that will facilitate both the audio upload and the rendering of the transcription.

from flask import Flask, render_template, request, redirect
import speech_recognition as sr

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    return "Hello World"

if __name__ == "__main__":
    app.run(debug=True, threaded=True)
app.py | Basic Flask app

You'll notice we've already imported the speech_recognition for future use. If you need to install this module in your environment, run:

python3 -m pip install SpeechRecognition

Awesome! You'll notice that we added two methods to our route: GET and POST. The GET method loads the content of the page, while the POST method handles retrieving the audio file from the user and transcribing it.

The next step is to create an html template file for rendering the view for audio file input. In your project, create a new folder called templates and inside of that, create a new file called index.html with the following content:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Speech Recognition in Python</title>
    <link href="https://fonts.googleapis.com/css2?family=Lato:wght@400;700&display=swap" rel="stylesheet">
    <link rel="stylesheet" href="{{ url_for('static', filename='styles/index.css') }}" />
</head>
<body>
<div id="speechContainer">
    <h1>Upload new File</h1>
    <form method="post" enctype="multipart/form-data">
        <input type="file" name="file"/>
        <br>
        <input type="submit" id="submitButton" value="Transcribe"/>
    </form>
</div>
</body>
</html>
templates/index.html

And then create a new folder called static, with a sub-folder called styles. Inside of styles, create a new file called index.css and leave it blank for now.

The HTML code here enables us to get the audio file as input from the user. We created a form with a POST method that will trigger the endpoint in the route we set up above. There are also a few ids on the elements that will be used in the last step of this post to style our page.

To show that our page can now handle both the GET and POST request as well as render our newly created template, let's modify our app.py file slightly:

@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        print("POST request received!")
    return render_template('index.html')
Updated app.py

If you go ahead and run the project now with the command python3 app.py, you should see a simple file upload page that prints the message above to the console once the submit button is clicked.

Rendered Localhost

Step 2: Analyzing and Transcribing the Audio File

Now that we have a simple UI that can accept our audio file, let's go ahead and retrieve the file from our POST request and start analyzing it.

First, let's make sure that a file was actually sent in the POST request. Even if the file object exists, we also want to make sure it has an associated filename: if the Transcribe button is pressed with no file selected, an empty file is sent, and we want to catch this edge case.

Once we've gotten the file object, we simply have to pass it into the SpeechRecognition module. There's a nifty class called sr.AudioFile that takes in an audio file and gives us an audio source we can read from. Once we have that, we simply record the file into an AudioData object that the SpeechRecognition module can recognize, and then pass it to one of the several different speech recognition engines in the module.

To put all the above steps together, let's update our app.py as such:

@app.route("/", methods=["GET", "POST"])
def index():
    transcript = ""
    if request.method == "POST":
        print("FORM DATA RECEIVED")

        if "file" not in request.files:
            return redirect(request.url)

        file = request.files["file"]
        if file.filename == "":
            return redirect(request.url)
            
        if file:
            recognizer = sr.Recognizer()
            audioFile = sr.AudioFile(file)
            with audioFile as source:
                data = recognizer.record(source)
            transcript = recognizer.recognize_google(data, key=None)
            print(transcript)

    return render_template('index.html', transcript=transcript)
app.py | Speech Analysis

Breaking this down, we can see that we first check to make sure "file" exists in request.files. Once that's done, we ensure that the file actually has a filename.

If a file has successfully been delivered, we run the SpeechRecognition code above to convert it into an analyzable format and then run Google's speech recognition API on the file. Read more about the linked module to see all the amazing speech recognition tasks you can perform with it.

Note: The default API key used by recognizer.recognize_google allows for roughly one minute of audio transcription. If you want to analyze larger files, you'll need to supply your own API key for Google's speech API. Check out SpeechRecognition on PyPI for more info: https://pypi.org/project/SpeechRecognition/

Try playing around with the recognizer object of the module. SpeechRecognition is an amazing library that hosts a wide variety of different recognizers that you can implement.

A list of all the possible recognizers

Good work, but hold up. We wrote all this code, but we don't have a file to test it with. The Google speech recognizer expects a WAV file for analysis, and there's a high chance you don't have one lying around. Let's head over to the Open Speech Repository and download a sample WAV file of someone speaking. Any of the WAV files listed there should work. Download one and upload it to your web service running on localhost.

If everything worked, you should now see the printed out transcript at the bottom of your Python console.

Step 3: Displaying the Transcription + Final Touches

Almost done! The last step we have is to take the transcription we're printing out and pass it to our template, rendering the results to our user.

We did something very similar in a previous project - Building a Weather Dashboard with Python and Flask. Check it out if you enjoyed this project and want to build more Python applications!

Let's go ahead and pass in the transcription as a variable in our render_template method. Your final app.py index function should look like this:

@app.route("/", methods=["GET", "POST"])
def index():
    transcript = ""
    if request.method == "POST":
        print("FORM DATA RECEIVED")

        if "file" not in request.files:
            return redirect(request.url)

        file = request.files["file"]
        if file.filename == "":
            return redirect(request.url)

        if file:
            recognizer = sr.Recognizer()
            audioFile = sr.AudioFile(file)
            with audioFile as source:
                data = recognizer.record(source)
            transcript = recognizer.recognize_google(data, key=None)

    return render_template('index.html', transcript=transcript)
Final app.py

Now, all we have to do is use some Jinja2 in our template and render the transcript to the user. Heading back over to our index.html file, let's update the div holding our form with the following code:

<div id="speechContainer">
    <h1>Upload new File</h1>
    <form method="post" enctype="multipart/form-data">
        <input type="file" name="file"/>
        <br>
        <input type="submit" id="submitButton" value="Transcribe"/>
    </form>

    {% if transcript != "" %}
        <div id="speechTranscriptContainer">
            <h1>Transcript</h1>
            <p id="speechText">{{ transcript }}</p>
        </div>
    {% endif %}
</div>

We're writing a simple if statement in Jinja2 that only renders the transcription div if the transcription has been received from our script. If the transcript exists, we're rendering it and displaying the final text to the user.

The final task is to beautify our app. Remember that index.css file you made in Step 1? Let's head back over to it and add the following code:

h1, p, input {
    font-family: 'Lato', sans-serif;
}

#speechContainer {
    margin: 20px;
}

#submitButton {
    background-color: #0191FE;
    color: white;
    border-radius: 5px;
    border: none;
    padding: 10px 30px;
    margin-top: 20px;
}

#submitButton:hover {
    cursor: pointer;
}

#speechTranscriptContainer {
    margin-top: 20px;
}

#speechText {
    font-size: 18px;
    width: 500px;
}

Feel free to modify the CSS however you like; this just helps beautify our Speech Recognition application. And voila! You're done. If you've followed everything along, you should see the following result when uploading your WAV file:

View of Final Project

That's it folks! You just built an end-to-end Flask app that can take in any WAV file and transcribe the speech spoken in the audio file. You can find all the code for this project at our GitHub Repo here. As always, if you face any trouble building this project, join our Discord and the TheCodex community can help!


For those of you interested in more project walkthroughs: Every Tuesday, I release a new Python/Data Science Project tutorial. I was honestly just tired of watching webcasted lectures and YouTube videos of instructors droning on with robotic voices teaching pure theory, so I started recording my own fun and practical projects. Next Tuesday, I'll be releasing a tutorial on how to build a COVID-19 Case Tracker to map the global spread of the virus!

Want to get notified every time a new project launches?

Subscribe to get Tinker Tuesday delivered to your inbox.


    Hey! I'm Avi - your new Python and data science teacher. I've taught over 600,000 students around the world not just how to code, but how to build real projects. I'm on a mission to help you jumpstart your career by helping you master python and data science. Start your journey on TheCodex here: https://thecodex.me/