Using Artificial Intelligence to Create Smart Transcriptions

“Think before you speak. Read before you think.”

Fran Lebowitz

Here I am again waxing poetic (check out the video) about artificial intelligence and video analysis. This time around, I am exploring how AI can be used to create smart transcriptions. By “smart” I mean transcriptions that contain not only what was said, but also who said it and when. This becomes very useful with long meeting recordings where you are only interested in certain subjects discussed by certain people. Rather than making you watch or scan through the entire recording, smart transcriptions let you go straight to the parts that interest you.
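To make the idea concrete, here is a rough sketch of what a smart transcription might look like as data, and how it could be searched by speaker and subject. The `Segment` type, its field names, the `find_segments` helper, and the sample meeting are all hypothetical illustrations, not the code from my demo.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One stretch of speech: who spoke, when (seconds into the recording), and what was said."""
    speaker: str
    start: float
    end: float
    text: str


def find_segments(segments, speaker=None, keyword=None):
    """Return the segments matching an optional speaker and an optional case-insensitive keyword."""
    matches = []
    for seg in segments:
        if speaker is not None and seg.speaker != speaker:
            continue
        if keyword is not None and keyword.lower() not in seg.text.lower():
            continue
        matches.append(seg)
    return matches


# Hypothetical sample data standing in for a real meeting transcription.
meeting = [
    Segment("spk_0", 0.0, 6.5, "Welcome everyone, let's start with the budget."),
    Segment("spk_1", 6.5, 15.2, "The budget is on track for this quarter."),
    Segment("spk_0", 15.2, 21.0, "Great. Next topic is hiring."),
]

# Jump straight to what one speaker said about one subject.
for seg in find_segments(meeting, speaker="spk_1", keyword="budget"):
    print(f"{seg.start:.1f}s {seg.speaker}: {seg.text}")
```

The start time on each match is what would let a player seek directly to that moment in the recording.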

My demo solution consists of several components. I have a desktop application that watches a folder for new video files, and when one appears, it sends the file off to the AWS AI Transcription engine for processing. Once the engine has accepted the file, it returns a Job Id. The processing takes some time, so rather than sitting around waiting, I developed a second application that runs when the transcription job is complete. That application runs on my public Linux server and is kicked off by a webhook invocation. The invocation contains a reference to the completed Job Id, which tells me where to find the transcription results. Those results are then examined and the relevant information is sent to an Avaya Spaces room. A “real solution” would connect the results to the video file, but that’s more work than I am ready to take on for this blog article.
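The “results are examined” step could look something like the sketch below. It assumes the transcription results have already been reduced to a simple list of word-level entries, each carrying a speaker label and start and end times in seconds. That input shape, the field names, and both helper functions are assumptions made for illustration; the engine's actual output format is more involved, and this is not the code from my demo.

```python
def group_into_turns(words):
    """Merge consecutive words from the same speaker into a single turn."""
    turns = []
    for w in words:
        if turns and turns[-1]["speaker"] == w["speaker"]:
            # Same speaker is still talking: extend the current turn.
            turn = turns[-1]
            turn["text"] += " " + w["word"]
            turn["end"] = w["end"]
        else:
            # Speaker changed (or this is the first word): start a new turn.
            turns.append({
                "speaker": w["speaker"],
                "start": w["start"],
                "end": w["end"],
                "text": w["word"],
            })
    return turns


def format_turn(turn):
    """Render a turn as 'MM:SS speaker: text', suitable for posting to a chat room."""
    minutes, seconds = divmod(int(turn["start"]), 60)
    return f"{minutes:02d}:{seconds:02d} {turn['speaker']}: {turn['text']}"


# Hypothetical word-level entries standing in for real transcription results.
words = [
    {"speaker": "spk_0", "start": 0.0, "end": 0.4, "word": "Good"},
    {"speaker": "spk_0", "start": 0.4, "end": 0.9, "word": "morning."},
    {"speaker": "spk_1", "start": 65.2, "end": 65.6, "word": "Hello"},
    {"speaker": "spk_1", "start": 65.6, "end": 66.1, "word": "all."},
]

turns = group_into_turns(words)
for turn in turns:
    print(format_turn(turn))
```

Each formatted line is the kind of message that could then be sent on to the collaboration room.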

To see all of the above in action, please take a look at my latest Cheapo-Cheapo Productions video:

Mischief Managed

My interest in AI is taking me to some very interesting places. What began as simple image analysis has blossomed into exploring solutions that use AI to solve common and very real problems.

These endeavors also reinforce my belief in cloud services and composable solutions. While I had to do some significant coding to pull these demos together, developing my own AI engine and collaboration platform are undertakings far beyond my humble abilities. Thankfully, there are smart companies out there that understand the need to wrap their products in publicly accessible APIs. So, rather than reinvent wheel after wheel, I take what I need to create my own unique offerings.

As always, thank you for reading and watching. Feel free to reach out to me if you want to dig deeper. I am always happy to speak geek.


  1. Chris Bain

    Fascinating stuff.

    If you combined this with your facial recognition work, could speakers be identified by name rather than number? Or I guess you could identify speakers through voice recognition? So in an analysis you could create a red flag when, for example, an unrecognised person asks for sensitive information.

    Another thought. Would it be possible, instead of noting the time in the video, to note the actual time of day that something was said? I can think of a few verticals that might want this. For instance – the legal profession may need to prove conclusively when a statement was made or a conversation happened for purposes of a court case, or a sports betting firm may want to know exactly when a development happened in the field of play in order to adjust the odds.

    1. Thanks for your thoughts, Chris. All of that is possible, although some parts will be harder than others. Creating a timeline based on date and time would be the easiest of the bunch. I would simply base my offsets on the start of the conference. Speaker recognition requires work if I don’t have any reference points. Red flags would be easy, too, since I know everything that is said and could look for sensitive words and data.
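As a rough illustration of the two easy parts, the sketch below converts an offset into a time of day and scans segments for sensitive words. The conference start time, the sample segments, the list of sensitive terms, and the function names are all hypothetical.

```python
from datetime import datetime, timedelta


def to_wall_clock(conference_start, offset_seconds):
    """Convert an offset into the recording to the actual date and time it was said."""
    return conference_start + timedelta(seconds=offset_seconds)


def red_flags(segments, sensitive_terms):
    """Return (offset, speaker, term) for every sensitive term found in a segment's text."""
    flags = []
    for offset, speaker, text in segments:
        lowered = text.lower()
        for term in sensitive_terms:
            if term.lower() in lowered:
                flags.append((offset, speaker, term))
    return flags


# Hypothetical values: a conference that started at 2:00 PM and two transcribed segments.
start = datetime(2021, 3, 1, 14, 0, 0)
segments = [
    (75.0, "spk_0", "Can you send me the account password?"),
    (120.0, "spk_1", "Let's move on to the next item."),
]

for offset, speaker, term in red_flags(segments, ["password", "social security"]):
    said_at = to_wall_clock(start, offset)
    print(f"{said_at:%Y-%m-%d %H:%M:%S} {speaker} mentioned '{term}'")
```

The time-of-day value is only as accurate as the recorded start of the conference, so that start time would need to come from a trustworthy clock.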
