【Utilizing ChatGPT】Development of a Speech-to-Text Conversion Tool and its Implementation Methods

Hello everyone, my name is Hieu from the System Department. I’m Vietnamese and it has been about six months since I joined GMO Research.

This article is about ChatGPT, which is currently attracting a great deal of attention.

I started a project to create a web-based tool that transcribes spoken audio to text and then improves the grammar and offers suggestions. The primary goal was to test how effective ChatGPT is for developing new applications and to build a proof of concept with the shortest lead time possible.

With the help of ChatGPT, I implemented several functions using React.js: audio recording, speech-to-text conversion, text analysis, and display of the results. A working demo was built in just one day.

In this report, I will provide a detailed description of my approach to implementing these functions, the tools and technologies used, the accuracy achieved with ChatGPT, and the innovative aspects of the project. I hope this report will serve as a testament to the potential and usefulness of ChatGPT and inspire developers to explore new AI-based solutions.

Basic Design

Since the idea is simple and straightforward, I split it into smaller, manageable parts as shown below.

graph TD;
    A(Record Audio)-->B(Transcribe Audio with Google Speech-to-text);
    B-->C(Improve Grammar/Vocabulary with OpenAI API);
    C-->D(Display Result on Web Page);
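
The flow in the diagram can also be read as a chain of asynchronous steps. The sketch below is only an illustration under that assumption: runPipeline and the three stage functions passed to it (transcribe, improve, display) are hypothetical names and are not taken from the actual project.

```javascript
// Hypothetical orchestration of the stages shown in the diagram.
// Each stage is passed in as a function so it can be swapped or mocked.
async function runPipeline(audio, { transcribe, improve, display }) {
  const transcript = await transcribe(audio);  // speech-to-text step
  const improved = await improve(transcript);  // grammar/vocabulary step
  display({ transcript, improved });           // render step
  return { transcript, improved };
}
```

Passing the stages in as arguments keeps each part independent, which matches the idea of building and checking the parts one at a time.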

Audio Recording

I used the Web APIs available in modern browsers to capture audio input from the user’s device. With the MediaRecorder class, I was able to record and store the audio.
※A few years ago, capturing audio in the browser required custom implementation work. It has since become much easier.
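
As a rough illustration of the recording step, here is a minimal sketch built on getUserMedia and MediaRecorder. It is not the project's actual code: the function name startRecording is hypothetical, and the media devices object and recorder class are passed in as parameters (an assumption made here) so that the function can also be exercised outside a browser.

```javascript
// Minimal recorder sketch. In a page you would call:
//   startRecording(navigator.mediaDevices, MediaRecorder)
async function startRecording(mediaDevices, RecorderClass) {
  // Ask for microphone access; the browser shows a permission prompt.
  const stream = await mediaDevices.getUserMedia({ audio: true });
  const recorder = new RecorderClass(stream);
  const chunks = [];

  // Collect encoded audio chunks as they become available.
  recorder.ondataavailable = (event) => {
    if (event.data && event.data.size > 0) chunks.push(event.data);
  };

  // Resolves with the collected chunks once recording has stopped.
  const finished = new Promise((resolve) => {
    recorder.onstop = () => {
      stream.getTracks().forEach((track) => track.stop()); // release the mic
      resolve(chunks);
    };
  });

  recorder.start();
  return { stop: () => recorder.stop(), finished };
}
```

In the browser, the chunks returned by finished can be combined with new Blob(chunks) and played back through an audio element or sent on for transcription.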

Speech-to-Text Conversion

To convert speech to text, I used Google’s Cloud Speech-to-Text API, which accepts an audio file as input and generates a transcript of the spoken words. I chose it over other speech recognition APIs for the following reasons.

①Abundant Documentation:
There are easily accessible resources such as reference documents, tutorials, and sample code, allowing for immediate problem-solving if any issues arise while using the API.

②Long-standing History:
The Cloud Speech-to-Text API is one of Google’s APIs and has been developed over many years, making it more mature and stable compared to other APIs.

③Easy Integration:
By utilizing SDKs and client libraries, seamless access to the API is possible. This enables quick and straightforward utilization of the API’s functionalities.

④High Accuracy and Language Support:
The Cloud Speech-to-Text API demonstrates high accuracy in converting speech to text, providing reliable results in fields such as business and scientific research. Additionally, it supports a wide range of languages, making it suitable for speech-to-text conversion in multiple languages spoken worldwide.

I used the Axios library to send requests to the Google Speech-to-Text API. On receiving the transcript, I processed it and stored it in local storage.
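
The request and response handling might look roughly like the sketch below. It assumes the REST speech:recognize endpoint with API-key authentication and a WebM/Opus recording at 48 kHz; those settings, as well as the function names buildRecognizeRequest, extractTranscript, and transcribe, are assumptions for illustration and not the project's actual code. The http parameter stands for any Axios-like client, for example the default export of axios.

```javascript
const SPEECH_URL = "https://speech.googleapis.com/v1/speech:recognize";

// Build the JSON body for a synchronous recognize call.
// encoding and sampleRateHertz are assumptions matching a WebM/Opus recording.
function buildRecognizeRequest(base64Audio, languageCode = "en-US") {
  return {
    config: { encoding: "WEBM_OPUS", sampleRateHertz: 48000, languageCode },
    audio: { content: base64Audio },
  };
}

// Join the top alternative of each result into a single transcript string.
function extractTranscript(responseData) {
  const results = (responseData && responseData.results) || [];
  return results
    .map((r) => (r.alternatives && r.alternatives[0] ? r.alternatives[0].transcript : ""))
    .join(" ")
    .replace(/\s+/g, " ")
    .trim();
}

// `http` is an Axios-like client: http.post(url, body, config) -> { data }.
async function transcribe(http, apiKey, base64Audio, languageCode) {
  const response = await http.post(
    SPEECH_URL,
    buildRecognizeRequest(base64Audio, languageCode),
    { params: { key: apiKey } }
  );
  return extractTranscript(response.data);
}

// Browser usage (the storage key "transcript" is hypothetical):
//   const text = await transcribe(axios, API_KEY, base64Audio);
//   localStorage.setItem("transcript", text);
```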

Installation and Commands for Axios

To install Axios, execute the following command:

npm install axios

The dependency is then recorded in the package.json file.

※As of July 2023, the latest version is 1.4.0.

Text Analysis and Generation

For the next step, I used OpenAI’s API. I sent the transcript text to the API together with a predefined prompt. The API then generated an improved version of the original text, taking into account both the context of the transcript and the prompt.
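
A call of this kind could be sketched as follows. The article does not show the actual prompt, endpoint, or model, so everything specific here is an assumption: the sketch uses the Chat Completions endpoint, the model name gpt-3.5-turbo, and a made-up instruction in IMPROVE_PROMPT. The names buildImproveRequest and improveText are hypothetical, and http is again any Axios-like client.

```javascript
const OPENAI_URL = "https://api.openai.com/v1/chat/completions";

// Hypothetical instruction; the project's real prompt is not shown in the article.
const IMPROVE_PROMPT =
  "Correct the grammar of the following transcript and suggest better vocabulary.";

// Build the request body: the predefined prompt plus the transcript.
function buildImproveRequest(transcript, model = "gpt-3.5-turbo") {
  return {
    model, // assumed model name
    messages: [
      { role: "system", content: IMPROVE_PROMPT },
      { role: "user", content: transcript },
    ],
  };
}

// `http` is an Axios-like client: http.post(url, body, config) -> { data }.
async function improveText(http, apiKey, transcript) {
  const response = await http.post(OPENAI_URL, buildImproveRequest(transcript), {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  const choice = response.data.choices && response.data.choices[0];
  return choice ? choice.message.content.trim() : "";
}
```

Keeping the prompt in one constant makes it easy to adjust the instruction without touching the request code.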

Using ChatGPT for Code Generation

To build a proof of concept quickly, I used ChatGPT to generate the code required for all parts of the application.
Before asking ChatGPT questions, it helps to summarize your project and spell out in detail what you want to accomplish and the results you expect.
Provide as much information as possible. After that, you can ask ChatGPT to generate code for specific parts as needed.
※If you only need ideas, you can ask ChatGPT about the design considerations and the necessary components for your project.

[Code Generation] Example Questions for ChatGPT

For example, when I wanted to record audio and convert it to text, I asked ChatGPT the following question.

question: want to create a react app with following features.
1. Record user’s speech 
2. Transcript speech to text using external API

Can you help me design what parts are needed, and also what api I should use to achieve above targets. The api should be easy to register and implement.

[Code Generation] Response from ChatGPT

The response from ChatGPT is as follows.

To create a React app with the desired features of recording user’s speech and transcribing it to text using an external API, you will need the following components:

1. User Interface (UI): Design a UI that allows the user to initiate and control the speech recording process. You can use buttons to start and stop the recording, and provide feedback to the user through text or visual cues.
2. Speech Recording: Implement functionality to capture the user’s speech input in real-time using the Web Speech API of the browser. This API will allow you to access the user’s microphone and start/stop recording. You can use libraries like react-mic or react-microphone to simplify this process.
3. Transcription API Integration: Choose an external API that can convert speech to text. There are multiple options available, and a popular choice is the Google Cloud Speech-to-Text API. Other alternatives include Microsoft Azure Speech-to-Text and IBM Watson Speech-to-Text. These APIs provide a simple and developer-friendly interface to convert audio recordings into text.

To implement the transcription feature, you need to:

・Register for an account with the chosen provider and obtain your API credentials (e.g., API key, access tokens).
・Utilize the provider’s SDK or RESTful API to send the recorded audio file and receive the transcribed text response.
・Ensure that you properly handle the API communication, including error handling and response parsing.
Keep in mind that the integration steps may vary depending on the specific API you choose. The documentation provided by the API provider should guide you through the implementation process.
Remember to handle user data and privacy considerations, such as obtaining user consent for recording audio and being transparent about how the recorded data will be used and stored.

Good luck with your React app development!

[Code Generation] Example Question for ChatGPT ②

Similarly, to generate a voice recording component in React, I asked ChatGPT the following question.

I want to create a React component to record speech. 
This component will allow users to record their speech and its layout will have the following items 

1. “Record” button to start recording your speech. 
2. “Stop” button to stop recording your speech. 
3. “Play” button to replay recorded speech

[Code Generation] Example Response from ChatGPT ②

The response from ChatGPT is as follows.

You can create a React component to record speech using the Web Speech API and the MediaRecorder API. Here’s an example implementation of the component with the requested layout:

In this project, I used ChatGPT to generate code snippets for the audio recording component and for sending and receiving data with the Google and OpenAI APIs.

I entered prompts that described the required functionality in natural language and received context-aware code snippets as output. This approach saved time: I did not have to write boilerplate code, and I got a head start on implementing the more complex functions.

However, the code generated by ChatGPT may not always be accurate and may not necessarily conform to industry standards. Therefore, relying solely on code generated by ChatGPT may not be appropriate. It is important to thoroughly evaluate the code generated before implementing it.

Results and Conclusion

The completed demonstration can be found here.

Final Result

Upon completion of the project, I successfully developed a proof-of-concept application that allows users to record and transcribe audio content and enhance the output with OpenAI’s GPT-3 API. The application was functional, and all the features worked seamlessly as intended.

The audio recording function was efficient and took less than a minute to record and save an audio file. The Google Speech-to-Text API performed exceptionally well, with an accuracy rate of over 85%, and provided a reliable transcription service. Lastly, OpenAI’s GPT-3 API improved the grammar and coherence of the transcripts and produced an enhanced final output.

Conclusion of Development Utilizing ChatGPT

In summary, this development project allowed us to achieve the goals we set and proved to be an enjoyable experiment. Regarding the use of ChatGPT, we have reached the following conclusions.

– ChatGPT is an excellent tool for generating code and accelerating the development process of new applications.
– Instead of spending time on writing code, developers can concentrate on thinking through the actual logic of the application.
– However, it’s crucial to use it in combination with industry-standard practices and evaluation methods to ensure the accuracy and quality of the generated code.
– Additionally, the code provided by ChatGPT is not always correct, and troubleshooting it can be tricky and can take considerable time and effort.

Thank you for reading.