
Monday, May 1, 2017

Neuromorphic Computing Technology




 Recently, processors such as GPUs, which can perform fast computation on large-scale data through parallel processing, have been developing at a remarkable pace. Accordingly, machine learning applications such as deep learning, whose use had been limited by computational constraints, are spreading rapidly, and their performance advantages are being verified along the way.

Meanwhile, research has continued on imitating the human brain itself and implementing it in hardware. That is, there are ongoing efforts to reproduce the thinking, memory, recognition, and judgment that take place in the human brain, in the same form, through neurons implemented in hardware. Under the term "neuromorphic", this field is attracting great interest and active research abroad as well.

Besides global companies such as IBM and Qualcomm, startups such as General Vision, as well as national research institutes and university laboratories in Korea, are carrying out a great deal of preliminary research on the subject.

The link below is an introduction to neuromorphic computing technology and recent research trends. If you are interested in neuromorphic technology, as I am, it is well worth a read. ^^


Click Here

Sunday, April 23, 2017

Calibration of IR Camera (Nonuniformity Correction)

On April 13, 2017, an invited seminar on infrared camera design and image correction was held at the Korea Aerospace Research Institute in Daejeon. I am sharing the material I presented there at the link below, with sensitive content removed. Recently, as infrared sensors have become cheaper, and thanks to the unique advantages available in the infrared band, their range of applications seems to be gradually expanding beyond the aerospace and defense domains into civilian industries such as automotive. If infrared sensor prices drop below $100 while maintaining today's level of performance, I expect they could be adopted in high-volume products such as automotive night vision systems.

Friday, January 6, 2017

Human speech recognition techniques (Cepstrum and MFCC based)

1. Introduction

The movie 'Iron Man' features an artificial intelligence computer, 'Jarvis', that perfectly comprehends human language, as shown in Fig. 1. 'Jarvis' can be commanded by voice alone, without an input device such as a keyboard or mouse. Speech recognition is the most fundamental technology needed to realize such a system. To build a machine that recognizes human speech, it is necessary to understand how the voice is produced and what characterizes voice signals. In this article, we describe MFCCs, the most widely used features for analyzing the characteristics of human speech signals.


Fig 1. Jarvis in the Iron Man movie

2. The generation of speech sound

The generation of a speech sound begins with vibrations of the vocal cords, as shown in Fig. 2. During breathing, air passes quickly through the airway, and the vocal cords are set into vibration by the air flowing between them due to the Bernoulli effect. Between the pair of vocal cords lies the glottis, where the sound waves are first generated. The average vibration frequency of the vocal cords is about 120 Hz for men and about 230 Hz for women. This is the fundamental frequency, and harmonic components are synthesized at integer multiples of it.

Fig 2. Structure of vocal organ
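To make the harmonic structure concrete, here is a minimal sketch (not from the original post) that models the glottal excitation as a 1-second impulse train at the male average fundamental of 120 Hz and locates the harmonic peaks in its spectrum; the sampling rate is an assumed value.

```python
import numpy as np
from scipy.signal import find_peaks

fs = 16000                      # sampling rate in Hz (assumed)
f0 = 120                        # average male fundamental frequency (Hz)

# Crude stand-in for the glottal excitation: a 1-second impulse train
# whose period is about fs/f0 samples.
excitation = np.zeros(fs)
excitation[::int(fs / f0)] = 1.0

# The magnitude spectrum has peaks at f0, 2*f0, 3*f0, ... (the harmonics).
spectrum = np.abs(np.fft.rfft(excitation))
freqs = np.fft.rfftfreq(fs, d=1.0 / fs)
peaks, _ = find_peaks(spectrum, height=0.5 * spectrum.max())
print(freqs[peaks][:5])         # roughly multiples of 120 Hz
```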

  The sound waves generated at the vocal cords and glottis then pass through the vocal tract and are shaped into a speech signal, as shown in Fig. 3. The vocal tract is the path of the sound, composed of the throat, mouth, oral cavity, nasal cavity, tongue, and so on. Depending on the shape of the vocal tract, speech sounds are produced as syllable units. In terms of digital signal processing, the shape of the vocal tract can be regarded as a kind of transfer function, so we can call it the vocal tract transfer function. The excitation signal generated at the glottis is the source of the voice, but it is not important for speech recognition. Rather, the main goal is to recover the vocal tract response, from which the shape of the oral cavity and the position of the tongue can be estimated. In other words, it is essential to extract the spectral envelope of the speech signal and to remove the excitation signal.

Fig 3. The process of speech sound generation

3. Characteristics of speech signal



The speech sound is generated by applying an excitation signal, produced by vibrations of the vocal cords, to the vocal tract transfer function. In the speaker recognition field, the excitation signal can serve as an important cue to individual identity. In the speech recognition field, however, only the envelope of the speech signal, which is determined by the vocal tract transfer function, matters. Therefore, to recognize speech, it is necessary to extract the envelope of the speech signal while suppressing the characteristics of the individual speaker as much as possible.
  Figure 4 shows one frame of a speech signal in the time domain and in the frequency domain. The frequency-domain representation is obtained by an FFT. As shown in Fig. 4, the frequency-domain signal contains many ripples. These ripples are due to the vibrations of the vocal cords and act as noise components that interfere with speech recognition. Therefore, to increase the recognition rate, it is necessary to focus on the spectral envelope of the speech signal.

Fig 4. Speech signal in time domain and frequency domain
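Since no recording accompanies this post, the following minimal sketch builds a synthetic voiced frame (decaying harmonics of a 120 Hz fundamental) and computes the log-magnitude spectrum shown in Fig. 4; the sampling rate and frame length are assumptions.

```python
import numpy as np

fs = 16000                       # sampling rate in Hz (assumed)
frame_len = 512                  # one 32 ms analysis frame (assumed)

# Synthetic 'voiced' frame: decaying harmonics of a 120 Hz fundamental,
# standing in for one frame of a real recording.
n = np.arange(frame_len)
frame = sum(np.exp(-k / 4.0) * np.sin(2 * np.pi * 120.0 * k * n / fs)
            for k in range(1, 30))

# Window the frame and transform it, as in Fig. 4.
windowed = frame * np.hamming(frame_len)
mag_db = 20 * np.log10(np.abs(np.fft.rfft(windowed)) + 1e-12)
freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
# mag_db shows the harmonic 'ripples' riding on the spectral envelope.
```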

4. Speech signal analysis using Cepstrum

First of all, a person's voice is generated by the vibration of the vocal cords (the excitation signal) and is shaped into language-bearing speech as it passes through the vocal tract. From the viewpoint of digital signal processing, it can be assumed that the excitation signal of the vocal cords passes through a digital filter described by the vocal tract transfer function, as shown in Fig. 5.


Fig 5. Speech generation from the perspective of digital filter

  Thus, the speech signal s(n) is the result of convolving the excitation signal x(n) with the vocal tract impulse response h(n), and this is expressed in the frequency domain as a multiplication of X(f) and H(f):

    s(n) = x(n) * h(n)   ⇔   S(f) = X(f) · H(f)

  Here, h(n) is what we want to extract for speech recognition purposes: the shape of the lips and the position of the tongue can be estimated from h(n). However, it is difficult to separate the two signals from S(f), which is their product. This is where the cepstrum becomes useful: taking the logarithm on both sides converts the multiplication into an addition, log|S(f)| = log|X(f)| + log|H(f)|, and two signals combined additively can be separated easily. As shown in Fig. 6, the excitation signal and the spectral envelope are successfully separated using the cepstrum. Then 'liftering', the cepstral-domain analogue of 'filtering' in ordinary digital signal processing, can be used to keep only the envelope region of the cepstrum.

Fig 6. Separation of envelope and excitation signal by using cepstrum

The entire process of speech signal analysis using the cepstrum is illustrated in Fig. 7. As described above, it consists of Hamming windowing, a DFT (FFT) for the frequency-domain transform, a log operation, and an inverse transform, as in the sketch after the figure. Finally, we obtain cepstral coefficients that can be used for pattern matching or classification.

Fig 7. Cepstrum process of speech signal
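As a minimal sketch of the pipeline in Fig. 7, the function below computes the real cepstrum of one frame and applies a low-quefrency lifter; the lifter cutoff of 30 quefrency samples is an assumed value, not from the original post.

```python
import numpy as np

def real_cepstrum(frame, lifter_cutoff=30):
    """Real cepstrum of one speech frame with low-quefrency liftering."""
    windowed = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.fft(windowed)) + 1e-12)   # log|S(f)|
    cepstrum = np.fft.ifft(log_mag).real                     # real cepstrum

    # 'Liftering': keep only the low-quefrency coefficients, which carry
    # the slowly varying spectral envelope (the vocal tract part), and
    # discard the high-quefrency excitation. The mirrored tail is kept
    # to preserve the symmetry of the real cepstrum.
    liftered = np.zeros_like(cepstrum)
    liftered[:lifter_cutoff] = cepstrum[:lifter_cutoff]
    liftered[-(lifter_cutoff - 1):] = cepstrum[-(lifter_cutoff - 1):]

    # Transforming back gives the smoothed log spectrum (the envelope),
    # converted here from natural log to decibels.
    envelope_db = 20.0 / np.log(10.0) * np.fft.fft(liftered).real
    return cepstrum, envelope_db
```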


5. Speech signal analysis using Mel-Frequency Cepstral Coefficients (MFCC)

We have seen that the cepstrum by itself can provide features for recognizing the speech signal. Why, then, do we need the Mel-frequency cepstral coefficients (MFCC)? The only difference between the cepstrum and MFCC is the Mel-filter bank inserted into the process.
The Mel-filter bank is designed to mimic the sound-scale perception of the human ear. Therefore, the Mel-filter bank allocates fewer sub-filters in the high-frequency range and a larger number of sub-filters in the low-frequency range, as shown in Fig. 8 and in the sketch after it. A detailed description of the Mel scale and Mel-filter banks is available at the following link:


Fig 8. Structure of Mel-filter bank
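A minimal sketch of constructing such a triangular filter bank follows, using the common 2595·log10(1 + f/700) Mel formula; the number of filters, FFT size, and sampling rate are assumed values.

```python
import numpy as np

def mel_filter_bank(n_filters=26, n_fft=512, fs=16000):
    """Triangular Mel filter bank, shape (n_filters, n_fft // 2 + 1)."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Center frequencies equally spaced on the Mel scale: this places
    # many narrow filters at low frequencies and few wide filters at
    # high frequencies, as in Fig. 8.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)

    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        up = np.arange(left, center)
        down = np.arange(center, right)
        bank[i - 1, left:center] = (up - left) / max(center - left, 1)
        bank[i - 1, center:right] = (right - down) / max(right - center, 1)
    return bank
```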

The entire MFCC process is illustrated in Fig. 9. As mentioned above, it is almost identical to the cepstrum process except for the Mel-filter bank. In the MFCC process, the IDFT (IFFT) step can also be replaced by the Discrete Cosine Transform (DCT). When the IDFT or IFFT is used, the result again contains imaginary terms, which is a disadvantage. With the DCT, by contrast, the result consists only of real-valued coefficients and the processing time is shorter. In addition, because the DCT coefficients are ordered from low to high frequency, the envelope components concentrated in the low-frequency region can be separated effectively.

Fig 9. Entire process of MFCC
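Putting the pieces together, here is a minimal MFCC sketch following Fig. 9 (window → FFT power spectrum → Mel filter bank → log → DCT). It reuses the mel_filter_bank function sketched above; the number of retained coefficients is an assumed value.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(frame, fs=16000, n_filters=26, n_coeffs=13):
    """MFCCs of one speech frame (uses mel_filter_bank defined above)."""
    n_fft = len(frame)
    windowed = frame * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(windowed)) ** 2        # power spectrum

    bank = mel_filter_bank(n_filters, n_fft, fs)      # from the sketch above
    log_mel = np.log(bank @ power + 1e-12)            # log Mel-filter energies

    # DCT instead of IDFT: the output is purely real, and the envelope
    # information is concentrated in the first few coefficients.
    return dct(log_mel, type=2, norm='ortho')[:n_coeffs]
```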

  Finally, what are the advantages of using MFCC in speech recognition? Thanks to the Mel-filter bank, subtle differences in pronunciation arising from individual characteristics can be assumed to be smoothed out. For speech recognition we need to find the characteristics common to everyone who speaks the same language, and recognition becomes difficult when each individual's pronunciation traits are reflected. In other words, the Mel-filter bank removes the subtle pronunciation characteristics of each individual. In addition, since the high-frequency range is represented coarsely, high-frequency ambient noise picked up when recording through a microphone is also suppressed. Therefore, while the cepstrum alone can be used for human speech recognition, MFCC is used in more applications.

6. Conclusion

1. The speech sound is generated by applying an excitation signal generated by vibrations of the vocal cords to the vocal tract transfer function.
2. Both the cepstrum and MFCC can be used for human speech recognition.
3. The only difference between cepstrum and MFCC is that there is a Mel-filter bank in the MFCC process.

4. Thanks to the Mel-filter bank in the MFCC process, subtle differences in pronunciation due to individual characteristics can be eliminated. It can also suppress the high-frequency noise that may be picked up during voice recording. Because of these properties, MFCC can increase the speech recognition rate significantly.



End.

Tuesday, January 3, 2017

MFCC algorithm for natural language processing

I recently posted a question about the MFCC algorithm in LinkedIn's Audio Engineering Society group.

The question was:

Question about speech recognition using the MFCC algorithm
MFCC is the most widely used algorithm for human speech recognition. MFCC mimics how humans perceive sounds and decomposes speech signals into the Mel-frequency domain.
Here is the question. In a real speech recognition system, the subject that receives the voice is not a person but a microphone. The voice is captured by the microphone and the signal processing is performed by a CPU or DSP, and in this process the use of a Mel-filter bank seems to distort the original signal.
In my opinion, converting the voice signal received through the microphone to the frequency domain or to a cepstrum without using a Mel-filter bank could be a way to preserve meaningful information. What are the advantages of converting to the Mel-frequency domain instead of processing the original audio signal acquired through the microphone? Please let me know. Thank you.

Two weeks have passed since I posted the question, and no one has replied. Thanks to that, however, I found myself working out the answer during this time, and I leave a summary for others below.
First of all, a person's voice is created by the vibration of the vocal cords (the excitation signal) and is shaped into language-bearing speech as it passes through the vocal tract. From the viewpoint of signal processing, it can be assumed that the vibration signal of the vocal cords passes through a filter described by the vocal tract transfer function.


Thus, the speech signal s(n) is the result of convolving the excitation signal x(n) with the transfer function h(n), and this is expressed in the frequency domain as a multiplication of X(f) and H(f).
Here, h(n) is the object that we want to extract for natural language processing purposes. The shape of the lips and the position of the tongue can be estimated through h(n). However, it is difficult to separate the two signals from S(f), which is the product of the two. Therefore, it is useful to take the logarithm of both sides, which converts the multiplication into an addition: this is the cepstrum. Two signals combined in an additive form can be separated easily.
So why do we need MFCC? MFCC uses a Mel-filter bank that allocates a larger number of filters to the low-frequency range. In this process, subtle differences in pronunciation due to individual characteristics can be assumed to be eliminated. For natural language processing, we need to find the characteristics common to everyone who speaks the same language, and recognition becomes difficult when each individual's pronunciation traits are reflected. In other words, the Mel-filter bank removes the subtle pronunciation characteristics of each individual. In addition, since the high-frequency range is suppressed, the high-frequency ambient noise that can be picked up when recording through a microphone is also suppressed.
Therefore, the cepstrum can be used for natural language recognition, but MFCC is the most commonly used.