Understanding the IBM Watson Transcription Services

Silence is the language of God. All else is poor translation.

― Rumi

When it comes down to it, my workday pretty much consists of doing what I am told to do and doing what I want to do.  While there are certainly occasions where those two aspects don’t quite line up, there are more than enough instances when they do.  Case in point is the subject of today’s article.

A salesperson came to me with a customer request for information about using the IBM Watson Transcription Services (Speech to Text and Text to Speech).  While I have a great deal of experience using many of the other Watson services (e.g., translation, assistant, tone, and language detection), transcription wasn’t something I’d had the pleasure of working with.  Thankfully, her inquiry piqued my interest, and after a couple of days of off-and-on investigation, I felt capable of not only explaining IBM transcription, but also writing software that used both services.

So, without any further delay, here is enough information to make you as transcription dangerous as I have become.

Getting Started

With any Watson Service, you need to have an IBM Cloud account.  They are easy to get.  Simply go to https://cloud.ibm.com/login and click on Create IBM Cloud account.  If you already have an account, enter your credentials and click Log in.

After logging in, you are presented with the Dashboard page.  The Dashboard provides access to all your configured services.

My account has three configured Cloud Foundry services and nine Services.  Transcription falls under Services.  When I click on Services, I am presented with a list that includes Speech to Text and Text to Speech.

Of course, if you are reading this article you probably haven’t configured your own transcription services.  To do so, click on Create resource.  This brings you to the resource library.  Scroll down until you find Speech to Text and Text to Speech.

One at a time, click on each service and add it to your Services list.

After you have created both services, return to the Resource list page and one at a time, click on each service.  This is what I see when I click on my Speech to Text service.

To access a service programmatically, you will need its API Key and URL.  Copy both and store the values in a safe place.  Do the same for the Text to Speech service.
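
One possible "safe place" is a pair of environment variables per service, so the values never end up in source code.  The sketch below assumes a naming convention of my own invention (WATSON_STT_APIKEY, WATSON_STT_URL, and so on); IBM does not define these names, and the helper function is purely illustrative.

```python
import os


def load_watson_credentials(service, env=os.environ):
    """Return (api_key, url) for a service, read from environment variables.

    The variable names (for example WATSON_STT_APIKEY and WATSON_STT_URL)
    are a hypothetical convention for this sketch, not names IBM defines.
    """
    prefix = f"WATSON_{service.upper()}_"
    try:
        api_key = env[prefix + "APIKEY"]
        url = env[prefix + "URL"]
    except KeyError as missing:
        raise RuntimeError(
            f"Set the {missing.args[0]} environment variable"
        ) from None
    # Strip any trailing slash so paths such as /v1/synthesize append cleanly.
    return api_key, url.rstrip("/")
```

Calling load_watson_credentials("stt") and load_watson_credentials("tts") keeps the two services' keys separate, which matters because each service has its own API Key.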

At this point, the services are configured and ready to use.

Text to Speech

As I nearly always do when I want to explore a new web service, I start with Postman.  Postman allows me to send messages and look at how a service responds without having to write a lick of code.

I began with the easier of the two: Text to Speech.  For this, I needed to configure the REST type, the service URL, the Authorization values, the required Headers, and the Message Body.  The following screenshots show this.

First, I set the type to POST and the URL to the URL found on the IBM Text to Speech Services page.  You need to append “/v1/synthesize” to the URL.  For Authentication, I used Basic Auth and set username to “apikey” and password to the value found on the Services page.

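You need two headers: Content-Type and Accept.  For a JSON request that returns WAV audio, those would be application/json and audio/wav respectively.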

Next, I set the Message Body.  This is a JSON object that contains the text to be converted to audio.  It has a single "text" field, as in {"text": "hello Orlando"}; replace “hello Orlando” with the words you want spoken.

Now that the REST call has been configured, I can click Send to invoke the Text to Speech service.  Postman presents me with a player to hear the words “hello Orlando.”  Way to go, Andrew!
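
The same Postman configuration can be written down as code.  What follows is a minimal sketch using only Python's standard library; the function name and the example.test URL are my own placeholders, and the Accept value of audio/wav is an assumption based on the .wav output used later in this article.  It builds the request without sending it, so you can inspect exactly what Postman would put on the wire.

```python
import base64
import json
import urllib.request


def build_synthesize_request(service_url, api_key, text, accept="audio/wav"):
    """Build (but do not send) a POST request for Text to Speech."""
    # Basic Auth: the username is the literal string "apikey" and the
    # password is the API Key copied from the service page.
    token = base64.b64encode(f"apikey:{api_key}".encode("utf-8")).decode("ascii")
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        service_url.rstrip("/") + "/v1/synthesize",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
            "Accept": accept,
        },
    )


# Placeholder URL and key for illustration only.
request = build_synthesize_request(
    "https://example.test/instances/abc", "YOUR-API-KEY", "hello Orlando"
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the resulting object to urllib.request.urlopen would send it, and the response body would be the audio bytes.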

Speech to Text

You use Postman to invoke Speech to Text in a very similar way.  The biggest difference is found in the Message Body.  Instead of passing JSON containing text, you send the raw bytes of an audio file.  In this example, I am using a FLAC-encoded (.flac) file.

Here is how the call is configured.  Note that “/v1/recognize” has been appended to the URL.  Make sure you use the API Key from the Speech to Text service.  You cannot use the API Key from the Text to Speech service.

I set Content-Type to audio/flac since I am using a .flac file.

Set the body type to binary and choose a .flac file on your PC to transcribe.  You can download a test .flac file from a number of different Internet sources.  Search and you will find.

After pressing Send, IBM returns a JSON interpretation of the audio file.

For the example file, the transcription is flawless.  I have tried other files and found it to be less than perfect, but the transcription was still very good.
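
If all you want from the reply is the words themselves, the JSON can be picked apart in a few lines.  This is a sketch under the assumption that the reply carries a "results" list whose entries each hold an "alternatives" list with "transcript" and "confidence" fields, and that the first alternative is the preferred one.  The sample body is hand-written for illustration and is not real service output.

```python
import json


def extract_transcripts(response_text):
    """Return (transcript, confidence) pairs from a recognize response body."""
    payload = json.loads(response_text)
    pairs = []
    for result in payload.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        # Assumption: the first alternative is the service's preferred one.
        best = alternatives[0]
        pairs.append((best["transcript"].strip(), best.get("confidence")))
    return pairs


# Illustrative response body, hand-written for this sketch.
sample = """
{
  "result_index": 0,
  "results": [
    {"final": true,
     "alternatives": [{"transcript": "hello Orlando ", "confidence": 0.94}]}
  ]
}
"""
print(extract_transcripts(sample))  # [('hello Orlando', 0.94)]
```

The confidence value is a handy way to spot the "less than perfect" files: low numbers tell you which stretches of audio deserve a second listen.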

Let’s Get Real Geeky

Postman is a fantastic tool for playing around with web services, but in the end, I like to get down and dirty with real programming.  For this article, I chose to code the same functions using Python.  They look as follows:

To protect my privacy, I blacked out my API Key.

For Speech to Text, I read in an audio file and display its JSON representation.  For Text to Speech, I read in text and save the output from Watson in a .wav file.  I skipped showing you all the GUI aspects because they are not important to this discussion.  The good stuff is in those two functions.
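
For readers who would like something to type in, here is a minimal sketch of two functions along the lines just described.  It is not my original code: it uses only Python's standard library, the function names are chosen for this example, and the send parameter is a hypothetical hook that lets you swap in a fake sender when testing without touching the network.

```python
import base64
import json
import urllib.request


def _post(url, api_key, body, content_type, accept, send=urllib.request.urlopen):
    """POST body to a Watson endpoint with Basic Auth; return the raw reply bytes."""
    token = base64.b64encode(f"apikey:{api_key}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, data=body, method="POST", headers={
        "Authorization": f"Basic {token}",
        "Content-Type": content_type,
        "Accept": accept,
    })
    with send(request) as response:
        return response.read()


def speech_to_text(stt_url, stt_key, audio_path, send=urllib.request.urlopen):
    """Read a .flac file and return the service's JSON reply as a dict."""
    with open(audio_path, "rb") as audio:
        raw = _post(stt_url.rstrip("/") + "/v1/recognize", stt_key,
                    audio.read(), "audio/flac", "application/json", send)
    return json.loads(raw)


def text_to_speech(tts_url, tts_key, text, wav_path, send=urllib.request.urlopen):
    """Send text for synthesis and save the returned audio as a .wav file."""
    body = json.dumps({"text": text}).encode("utf-8")
    audio = _post(tts_url.rstrip("/") + "/v1/synthesize", tts_key,
                  body, "application/json", "audio/wav", send)
    with open(wav_path, "wb") as out:
        out.write(audio)
```

A call such as text_to_speech(url, key, "hello Orlando", "hello.wav") followed by speech_to_text(other_url, other_key, "hello.flac") mirrors the two Postman experiments.  Remember that each service needs its own URL and API Key.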

Mischief Managed

I have always felt that I can’t claim to understand something until I’ve explained it to someone else.  After writing this article, I can honestly say, “I get it.”

There is more that you can do with these services (e.g. keyword search), but this is enough for now.  If you are up for a little fun, take a crack at configuring your own transcription services and let me know how they turn out.
