Object Detection System with Image and Speech Recognition


Robotics has made human life more convenient, not only in industrial applications but also in education, medical science, and entertainment [1]. Many companies, such as Boston Dynamics, and many researchers have developed robots to meet the requirements of their respective fields and to create a fruitful synergy in human-robot interaction. Robots can also handle complex tasks that are difficult for humans. Of the many robots now on the market, the robotic arm remains one of the most successful [2]. It is one of the most important tools in factories for assembly processes, especially car assembly, and for large manufacturing machines. Accuracy, stability, and precision play an important role in controlling the coordination and movement of the robotic arm. Figure 5.1 shows how robotics connects with other related technologies.

Object-recognition technology has advanced rapidly in many domains, such as object detection in 2D and 3D images and movement detection. Combined with artificial intelligence (AI), it enables robots that behave like humans, called humanoid robots (HR). Moreover, an HR learns from its mistakes and improves its skills with the use of reinforcement learning. These days robots provide a helping hand to humans in their day-to-day life.

FIGURE 5.1 Robotics connected with other technologies.

Vision is one of the core technologies of intelligent robotics. The research area called “Computer Vision” investigates, from a scientific point of view, how artificial vision can make a robot humanlike and which algorithms underlie it [3].

Open Source Computer Vision (OpenCV) is a real-time computer vision library first developed by Gary Bradski in 1999 [4]. It is open source and can be used for both educational and commercial purposes. It offers C, C++, and Python interfaces and includes nearly 2,500 optimized algorithms [5]. OpenCV supports the ongoing development of computer vision and helps millions of people extend what they can achieve in productive work.

Many researchers have proposed and developed robotic arms and visual systems over the last few decades. Furuta [6] proposed a method for trajectory tracking control using a sensor-based feedback system, in which a laser beam is used to obtain the desired coordinates for the robotic arm. Munasinghe [7] proposed an algorithm for the contouring problem of industrial robotic arms under Cartesian velocity and joint torque constraints; a simulation computes the coordinates of each joint in the robotic arm. Koga [8] developed a virtual model of the robotic arm to calculate joint coordinates when it picks up and places an object. Efe [9] presented a scheme to tune the fuzzy sliding-mode controller of a robotic arm using an adaptive neuro-fuzzy inference system (ANFIS).

Wang [10] proposed a robotic arm mounted on a mobile robot to detect signs or numbers, with image processing and detection carried out through a microcamera fitted on the arm. Juang [11] developed a robotic arm system that grabs objects via visual recognition. The system is equipped with two webcams: one captures commands on a screen and the other is used for word recognition. Chen et al. [12] developed an application (app) integrated with a robotic arm and a Raspberry Pi that runs convolutional neural networks. A camera fitted on the robotic arm selects the donuts that match the customer's chosen flavor, and a Gaussian mixture model segments the foreground from the background of the picture. Kakde [13] discussed the real-time implementation of deep learning models in a robotics application, in which an Arduino Uno detects objects and classifies them by category, with a convolutional neural network used for further processing. Kumbhar [14] presented a low-cost robotic arm enabled with camera vision. An Arduino Mega 2560 serves as the main controller for six motors, and object detection and image edge detection are performed on a Raspberry Pi.

The main concept of this chapter is to give vision capability to the robotic arm through image processing and speech recognition. An ultrasonic sensor is also integrated with the camera on the robotic arm to measure the distance between the object and the arm. The rest of the chapter is organized as follows: Section 5.2 explains the methodology, Section 5.3 presents the results and discussion, and Section 5.4 concludes the study.



Our proposed work is divided into the following parts: speech recognition, image capturing and processing, sensing the distance of the object, the flow of the algorithm, and the robotic arm functions.

FIGURE 5.2 Proposed System Architecture.

Raspberry Pi serves as the backbone of our proposed system. Image processing is done with a camera and OpenCV. The Raspberry Pi runs the Raspbian operating system (OS) [15], which is based on Debian and provides over 35,000 packages of pre-compiled software; tools such as Python, Sonic Pi, and Java come pre-installed, which makes it more than a pure OS. OpenCV, installed through Linux commands, provides library packages to process images taken by the camera attached to the Raspberry Pi. These libraries are used through Python. Two different sensors, an ultrasonic sensor and an IR sensor, check the distance and location of the object. The ultrasonic sensor works by timing a reflected wave and calculating the distance from it. The formula for distance calculation of the reflected wave is:

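$$D = \frac{T \times SS}{2}$$

where D is the distance, T is the time between emitting the pulse and receiving its echo, and SS is the speed of sound. The product is halved because the wave travels to the object and back. SS varies with humidity and temperature.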
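The distance relation described above can be sketched in Python as follows. This is a minimal illustration, not the chapter's implementation: it assumes T is the round-trip echo time (hence the division by two), and it uses a common linear approximation for the speed of sound in dry air, so the humidity dependence is ignored. The function names `speed_of_sound` and `echo_distance` are hypothetical.

```python
def speed_of_sound(temp_c: float = 20.0) -> float:
    """Approximate speed of sound in dry air (m/s) at temp_c degrees Celsius.

    Assumption: linear approximation 331.3 + 0.606 * T; humidity is ignored.
    """
    return 331.3 + 0.606 * temp_c


def echo_distance(round_trip_s: float, temp_c: float = 20.0) -> float:
    """Distance (m) to the object, given the round-trip echo time in seconds."""
    if round_trip_s < 0:
        raise ValueError("echo time must be non-negative")
    # The pulse travels to the object and back, so halve the path length.
    return round_trip_s * speed_of_sound(temp_c) / 2.0


if __name__ == "__main__":
    # A 2 ms echo at room temperature corresponds to roughly a third of a metre.
    print(f"{echo_distance(0.002):.3f} m")
```

Passing the ambient temperature explicitly makes the dependence of SS on temperature visible; a deployed system might read it from a separate sensor.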

Servo motors are employed for the robotic arm. These motors work on the pulse width modulation (PWM) principle and are controlled through three wires: 1) power, 2) ground, and 3) signal. In PWM, three pulse parameters determine how the motor works: the minimum pulse, the maximum pulse, and the repetition rate. The shaft can rotate through about 180 degrees (90 degrees to either side), and the duration of the PWM pulse determines the position of the motor shaft. In the proposed work, three servo motors are employed in the robotic arm: one moves the arm up and down, one grasps the object, and one rotates the arm.
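The relationship between pulse duration and shaft position can be illustrated with a short Python sketch. The numeric values are assumptions typical of hobby servos and are not taken from the chapter: a minimum pulse of 1.0 ms, a maximum pulse of 2.0 ms, and a 20 ms repetition period (50 Hz). The function names are hypothetical, and the actual limits should be taken from the servo's datasheet.

```python
def angle_to_pulse_ms(angle_deg: float,
                      min_pulse_ms: float = 1.0,
                      max_pulse_ms: float = 2.0) -> float:
    """Map a shaft angle in [-90, 90] degrees to a pulse width in milliseconds.

    Assumption: the pulse width varies linearly between the minimum pulse
    (-90 degrees) and the maximum pulse (+90 degrees).
    """
    if not -90.0 <= angle_deg <= 90.0:
        raise ValueError("angle must be within -90..90 degrees")
    span = max_pulse_ms - min_pulse_ms
    return min_pulse_ms + (angle_deg + 90.0) / 180.0 * span


def pulse_to_duty_cycle(pulse_ms: float, period_ms: float = 20.0) -> float:
    """Convert a pulse width to a duty cycle in percent of the repetition period."""
    return 100.0 * pulse_ms / period_ms


if __name__ == "__main__":
    for angle in (-90, 0, 90):
        pulse = angle_to_pulse_ms(angle)
        print(angle, pulse, pulse_to_duty_cycle(pulse))
```

A duty cycle in percent is the form that GPIO PWM libraries on the Raspberry Pi commonly accept, so the result could be passed to whichever PWM interface drives the signal wire.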

Circuit Diagram

In the proposed study, the Raspberry Pi plays a major role: it controls the robotic arm and captures and processes images as required. The Raspberry Pi is a small, credit-card-sized computer that runs a Debian-based OS. Table 5.1 lists the technical specifications of the Raspberry Pi.

The camera is connected to the Raspberry Pi through a USB port. The IR sensor has three pins, 1) power, 2) ground, and 3) output, and is connected to a general-purpose input-output (GPIO) pin of the Raspberry Pi. The three servo motors are connected to GPIO PWM pins (see Figure 5.3).

Speech Recognition

Speech recognition [16][17] is a language-based program that takes human speech as input, decodes it, and converts it into readable text. This technique helps to filter words, convert them into a digitized format, and analyze the sound. Figure 5.4 represents the general speech recognition system.

TABLE 5.1 Technical Specification of Raspberry Pi.

Processor: Quad-core Arm Cortex
Memory: 2 GB
Storage: microSDHC card
Power: 5 V, 2 A
Graphics: Broadcom VideoCore

FIGURE 5.3 Circuit Diagram of Proposed work.

FIGURE 5.4 General Speech Recognition Process.

A simple speech recognition module has three parts:

  • 1. Input: The human voice is the source of input for this module.
  • 2. Neural Network: Natural language processing (NLP) and a neural network (NN) break speech into a number of components that are easier to interpret. These components are converted into a digital state and analyzed. The NN is trained on a dataset of specific words and phrases and predicts the words in new voice input.
  • 3. Output: In the last stage, the input voice is transcribed into text.
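The digitizing and splitting steps mentioned in the list above can be illustrated with a small preprocessing sketch. It does not perform recognition and is not the chapter's code; it only shows how a waveform might be quantized and cut into short components before a neural network sees it. The 16 kHz sample rate, 16-bit depth, 25 ms window, and 10 ms hop are assumed values, and all function names are hypothetical.

```python
import math


def digitize(samples, bits=16):
    """Quantize samples in [-1.0, 1.0] to signed integers of the given bit depth."""
    peak = 2 ** (bits - 1) - 1
    # Clip out-of-range values before scaling to the integer range.
    return [round(max(-1.0, min(1.0, s)) * peak) for s in samples]


def split_frames(samples, frame_len, hop):
    """Split a sample list into overlapping frames of frame_len, stepping by hop."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def frame_energy(frame):
    """Mean squared amplitude of one frame, a simple per-component feature."""
    return sum(x * x for x in frame) / len(frame)


if __name__ == "__main__":
    rate = 16000  # assumed sample rate in Hz
    # A 0.1 s test tone stands in for recorded speech.
    tone = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 10)]
    pcm = digitize(tone)
    frames = split_frames(pcm, frame_len=400, hop=160)  # 25 ms window, 10 ms hop
    print(len(frames), "frames")
```

In a full system, features computed per frame would be fed to the trained network, which then produces the text output of the final stage.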

TensorFlow

TensorFlow [18] is an open-source platform designed for machine learning and deep learning that provides end-to-end solutions. It offers a large set of tools and a flexible platform that encourage researchers to push the limits of their creativity. TensorFlow provides stable Python and C++ APIs; APIs for other languages are also available, but without the same stability guarantees. The main features of TensorFlow are as follows:

  • 1. Model Building: Training and building machine learning models is comparatively easy. The introduction of Keras helps to improve model debugging and model iteration.
  • 2. Robust: With the power of the cloud, it is easy to train and deploy models anywhere.
  • 3. Research Experimentation: A flexible architecture helps take an idea through to practical implementation. Results can be presented as graphs and in statistical form.

ALGORITHM 5.1 Proposed Workflow.

Proposed Algorithm

Algorithm 5.1 presents the proposed workflow. The arm is activated by voice; if no matching voice is found, the system rechecks. In the next step, the voice is converted into plain text, which is processed in TensorFlow; the libraries are checked and the object is detected with a convolutional network. The arm is synchronized with TensorFlow. A camera captures the image, which is processed with OpenCV installed on the Raspberry Pi. Sensors check the location of the object. If the object is within range of the robotic arm, the arm grasps it; otherwise the arm does not move. Text-to-speech converts information such as the object name and specification into voice output.
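The decision logic of this workflow can be sketched as a single function. This is an illustrative outline under stated assumptions, not the authors' algorithm: the speech-to-text and object detection stages are represented only by their outputs, the wake word "hello" and the 0.30 m reach are hypothetical values, and the names `Detection` and `run_cycle` are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Detection:
    label: str         # object class reported by the detector
    distance_m: float  # distance reported by the sensors


def run_cycle(spoken_text: Optional[str],
              detection: Optional[Detection],
              wake_word: str = "hello",
              reach_m: float = 0.30) -> str:
    """Return the action for one pass of the workflow."""
    # Step 1: the arm only activates on a matching voice command.
    if spoken_text is None or wake_word not in spoken_text.lower().split():
        return "recheck voice"
    # Step 2: the detector must have found an object in the camera image.
    if detection is None:
        return "no object detected"
    # Step 3: grasp only when the object is within reach of the arm.
    if detection.distance_m <= reach_m:
        return f"grasp {detection.label}"
    return "stay idle"


if __name__ == "__main__":
    print(run_cycle("hello robot", Detection("cup", 0.20)))
    print(run_cycle("hello robot", Detection("cup", 0.50)))
```

Keeping the decision logic separate from the hardware calls makes it possible to check the branching without a camera, microphone, or arm attached.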

Results and Discussion

In this section, the input signal (see Figure 5.5) is used to recognize a particular object. TensorFlow is deployed for image recognition, and the model is evaluated by its accuracy score. The accuracy on the training dataset is 0.697.
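For reference, an accuracy score is the fraction of predictions that match the true labels. The sketch below shows the arithmetic on made-up labels; it assumes plain classification accuracy, since the chapter does not state how the score was computed, and the data shown is illustrative only.

```python
def accuracy_score(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one label is required")
    # Count positions where the prediction equals the true label.
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


if __name__ == "__main__":
    # Hypothetical labels: three of the four predictions are correct.
    truth = ["cup", "pen", "cup", "box"]
    preds = ["cup", "pen", "box", "box"]
    print(accuracy_score(truth, preds))
```

On this scale, the reported value of 0.697 corresponds to roughly seven correct predictions out of ten.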

FIGURE 5.5 Input Signal of Hello.

FIGURE 5.6 Left Side: Direction of Robotic Arm; Right Side: Detection of Object.

FIGURE 5.7 Activated mode of Voice and Image Recognition.

Figure 5.6 and Figure 5.7 show the results of our proposed work. On the left side of Figure 5.6, the terminal shows the direction of the robotic arm, which moves forward or backward. The right side of Figure 5.6 shows the detection of an object on a yellow screen. The most noteworthy point of the proposed system is that it works remotely over the internet: through SSH, the robotic arm can be directed to capture images and grasp crucial objects. Figure 5.7 presents the activated mode of image and speech recognition.

Figure 5.8 shows the proposed system for fetching objects, with a camera mounted on the arm. It is activated by voice with the aid of speech recognition. The IR and ultrasonic sensors mounted below the arm help the camera locate the object and maintain an exact distance between the object and the arm.

FIGURE 5.8 Robotic arm enabled with Camera.


In this chapter, a vision- and voice-based robotic arm is proposed. The proposed system provides an opportunity to tackle real-world problems such as remote surveillance and tasks that are impossible for humans. Integrating day-to-day problems with complex vision tasks can improve image processing research. The OpenCV libraries installed on the Raspberry Pi allow a focus on image processing with minimal effort. In this study, the robotic arm is equipped with a camera, an ultrasonic sensor, and an IR sensor, which increase the accuracy of selecting objects at exact coordinates. Object detection and location extraction are carried out with image processing methods, namely object extraction techniques and matching against preinstalled templates.


  • 1. Juang, J. G., Tsai, Y. J., & Fan, Y. W. (2015). Visual recognition and its application to robot arm control. Applied Sciences, 5(4), 851-880.
  • 2. Manigpan, S. (2010). A Simulation of 6R Industrial Articulated Robot Arm Using Neural Network (Doctoral dissertation, University of the Thai Chamber of Commerce).
  • 3. Ejiri, M. (2007, November). Machine vision in early days: Japan’s pioneering contributions. In Asian Conference on Computer Vision (pp. 35-53). Springer, Berlin, Heidelberg.
  • 4. OpenCV Introduction: https://opencv.org/ (Accessed on 16/11/2019).
  • 5. Bradski, G., & Kaehler, A. (2008). Learning OpenCV: Computer vision with the OpenCV library. “O’Reilly Media, Inc.”.
  • 6. Furuta, K., Kosuge, K., & Mukai, N. (1988). Control of articulated robot arm with sensory feedback: Laser beam tracking system. IEEE Transactions on Industrial Electronics, 35(1), 31-39.
  • 7. Munasinghe, S. R., Nakamura, M., Goto, S., & Kyura, N. (2001). Optimum contouring of industrial robot arms under assigned velocity and torque constraints. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 31(2), 159-167.
  • 8. Koga, M., Kosuge, K., Furuta, K., & Nosaki, K. (1992). Coordinated motion control of robot arms based on the virtual internal model. IEEE Transactions on Robotics and Automation, 8(1), 77-85.
  • 9. Efe, M. O. (2008). Fractional fuzzy adaptive sliding-mode control of a 2-DOF direct-drive robot arm. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 38(6), 1561-1570.
  • 10. Wang, W. J., Huang, C. H., Lai, I. H., & Chen, H. C. (2010, August). A robot arm for pushing elevator buttons. In Proceedings of SICE Annual Conference 2010 (pp. 1844-1848). IEEE.
  • 11. Juang, J. G., Tsai, Y. J., & Fan, Y. W. (2015). Visual recognition and its application to robot arm control. Applied Sciences, 5(4), 851-880.
  • 12. Chen, O. T. C., Zhang, Y. C., Lin, Z. K., Kuo, P. I., & Lee, Y. L. (2019, August). Camera-in-Hand Robotic Arm Using a Deep Neural Network to Realize Unmanned Store Service. In 2019 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech) (pp. 833-839). IEEE.
  • 13. Kakde, Y., Bothe, N., & Paul, A. (2019). Real Life Implementation of Object Detection and Classification Using Deep Learning and Robotic Arm. Available at SSRN 3372199.
  • 14. Kumbhar, S., Mathurekar, D., & Lobo, D. (2019). Robotic Arm with Vision (Doctoral dissertation).
  • 15. Raspbian Operating System, www.raspbian.org/ (Accessed on 17/11/2019)
  • 16. Graves, A., Mohamed, A. R., & Hinton, G. (2013, May). Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing (pp. 6645-6649). IEEE.
  • 17. Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A. R., Jaitly, N., ... & Sainath, T. (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29.
  • 18. TensorFlow: www.tensorflow.org/(Accessed on 10/12/2019)