What are you going to learn?
I will show you how to detect faces and their features using the Vision framework in an iOS app. We will receive live frames from the front camera of an iOS device and analyze each frame with the Vision framework’s face detection. For every analyzed frame, Vision reports any face it finds along with its features.
In this post, you’ll learn how to use the Vision framework to:
- Create requests for face detection and detecting face landmarks.
- Process these requests and return the results.
- Overlay the results on the camera feed to get real-time face detection.
What is face detection?
Face detection means that a system can identify that a human face is present in an image or video. For example, face detection can be used for the auto-focus functionality in cameras.
Why use the Vision framework?
According to Apple, Vision’s algorithms are more accurate and less prone to returning false positives or negatives, because the framework leverages the latest machine learning (deep learning) and computer vision techniques to improve results and performance. The Vision framework can detect and track rectangles, faces, and other salient objects across a sequence of images.
The Vision framework follows a simple three-part mechanism: a request, a request handler, and the observations produced by that request.
Let's look at some of Vision's request classes. Under the hood, three request classes are worth knowing for this post:
- VNDetectFaceRectanglesRequest for Face detection.
- VNDetectBarcodesRequest for Barcode detection.
- VNDetectTextRectanglesRequest for text region detection.
For face detection, we just need to use VNDetectFaceRectanglesRequest.
How to stream the front camera feed?
The first step is to stream the camera feed from the front camera or the back camera to the screen.
Let’s start by adding the camera feed to the ViewController. First, we need access to the front camera. To get it, we will use the AVFoundation framework that Apple provides on iOS. AVFoundation lets us access the camera and delivers the camera output in the format we want for processing. To use AVFoundation, add the following line after import Foundation in ViewController.swift:
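As a minimal sketch, the top of ViewController.swift would then look roughly like this (the file layout is assumed from the description above):

```swift
import Foundation
import AVFoundation  // capture devices, capture sessions, and preview layers
```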
Now we create an instance of a class called AVCaptureSession. This class coordinates multiple inputs, such as the microphone and camera, with multiple outputs.
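A minimal sketch of that property follows. The property name `captureSession` is an assumption for illustration, and only the member relevant to this step is shown.

```swift
import UIKit
import AVFoundation

final class ViewController: UIViewController {
    // Coordinates the flow of data from capture inputs (the camera)
    // to capture outputs (preview, video frames).
    // `captureSession` is a hypothetical name chosen for this sketch.
    let captureSession = AVCaptureSession()
}
```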
Now we add the front camera as an input to our capture session. The function starts by fetching the front camera device. Call this function at the end of viewDidLoad.
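One way this step could look is sketched below. It is written as a free function so that it compiles on its own; in the post's ViewController it would more naturally be a private method. The function name and the Bool return value are assumptions made for illustration.

```swift
import AVFoundation

/// Adds the built-in front wide-angle camera as an input to `session`.
/// Returns false when no front camera is available (for example, in the
/// Simulator) or when the session refuses the input.
/// `addFrontCameraInput` is a hypothetical helper name.
@discardableResult
func addFrontCameraInput(to session: AVCaptureSession) -> Bool {
    guard
        // Fetch the front camera device.
        let device = AVCaptureDevice.default(.builtInWideAngleCamera,
                                             for: .video,
                                             position: .front),
        // Wrap the device in an input object the session can consume.
        let input = try? AVCaptureDeviceInput(device: device),
        session.canAddInput(input)
    else {
        return false
    }
    session.addInput(input)
    return true
}
```

Returning a Bool instead of crashing keeps the sketch usable on hardware without a front camera; a real app would probably surface the failure to the user.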
To use the back camera as the input to the capture session instead, use the following function:
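A hedged sketch of such a function is shown here. It takes the camera position as a parameter, so passing `.back` selects the back camera. The name `addCameraInput(to:position:)` is hypothetical.

```swift
import AVFoundation

/// Adds the built-in wide-angle camera at `position` to `session`.
/// Returns false if the device is missing or the input cannot be added.
/// `addCameraInput` is a hypothetical helper name.
@discardableResult
func addCameraInput(to session: AVCaptureSession,
                    position: AVCaptureDevice.Position) -> Bool {
    guard
        let device = AVCaptureDevice.default(.builtInWideAngleCamera,
                                             for: .video,
                                             position: position),
        let input = try? AVCaptureDeviceInput(device: device),
        session.canAddInput(input)
    else {
        return false
    }
    session.addInput(input)
    return true
}

// Usage, e.g. at the end of viewDidLoad:
//     addCameraInput(to: captureSession, position: .back)
```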
If you want a flip-camera button that switches between the front and back cameras, use the following function:
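The sketch below shows one possible shape for it, assuming the session has at most one video input at a time. The function name is hypothetical; a button action in the ViewController would simply call it.

```swift
import AVFoundation

/// Replaces the session's current camera input with the camera on the
/// opposite side. Does nothing if there is no camera input yet.
/// `flipCamera` is a hypothetical helper name.
func flipCamera(in session: AVCaptureSession) {
    // Find the video input that is currently attached.
    guard let current = session.inputs
        .compactMap({ $0 as? AVCaptureDeviceInput })
        .first(where: { $0.device.hasMediaType(.video) })
    else { return }

    let target: AVCaptureDevice.Position =
        (current.device.position == .front) ? .back : .front

    guard
        let device = AVCaptureDevice.default(.builtInWideAngleCamera,
                                             for: .video,
                                             position: target),
        let replacement = try? AVCaptureDeviceInput(device: device)
    else { return }

    // Batch the changes so the session applies them atomically.
    session.beginConfiguration()
    defer { session.commitConfiguration() }

    session.removeInput(current)
    if session.canAddInput(replacement) {
        session.addInput(replacement)
    } else {
        // Fall back to the previous camera rather than leaving no input.
        session.addInput(current)
    }
}
```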
For an app to access the camera, it must declare in its Info.plist file that it needs to use the camera. Open Info.plist and add a new entry to the property list. For the key, add NSCameraUsageDescription, and for the value, enter a message such as "Required for front camera access".
Once we have the front camera feed, we have to display it on screen. For this task we will use the AVCaptureVideoPreviewLayer class. AVCaptureVideoPreviewLayer is a subclass of CALayer and is used for displaying the camera feed. Add it as a new property to ViewController. The property is declared with the lazy keyword because it needs the capture session to exist first; lazy defers its initialization to the first access, by which point the capture session has already been created.
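A minimal sketch of the lazy property follows, with property names assumed for illustration and only the relevant members shown:

```swift
import UIKit
import AVFoundation

final class ViewController: UIViewController {
    // Hypothetical property names chosen for this sketch.
    let captureSession = AVCaptureSession()

    // `lazy` defers creation until first use, so `captureSession`
    // is guaranteed to be initialized when this initializer runs.
    lazy var previewLayer = AVCaptureVideoPreviewLayer(session: captureSession)
}
```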
We have to add the preview layer as a sublayer of the view controller's container UIView. Put this in a function and call it at the end of viewDidLoad.
We also adapt the preview layer’s frame whenever the container view’s frame changes, because that frame can change at different points in the UIViewController instance lifecycle.
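Both steps can be sketched together as follows. The method name `showCameraFeed` and the choice of `.resizeAspectFill` are assumptions for illustration.

```swift
import UIKit
import AVFoundation

final class ViewController: UIViewController {
    let captureSession = AVCaptureSession()
    lazy var previewLayer = AVCaptureVideoPreviewLayer(session: captureSession)

    override func viewDidLoad() {
        super.viewDidLoad()
        showCameraFeed()
    }

    // Hypothetical helper: attaches the preview layer to the root view.
    private func showCameraFeed() {
        previewLayer.videoGravity = .resizeAspectFill
        view.layer.addSublayer(previewLayer)
        previewLayer.frame = view.bounds
    }

    // Called whenever the view's bounds change (rotation, split view, ...),
    // so the preview always fills the container.
    override func viewDidLayoutSubviews() {
        super.viewDidLayoutSubviews()
        previewLayer.frame = view.bounds
    }
}
```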
Finally, we start the capture session so that it begins coordinating its inputs and outputs, which is what feeds the preview. Add the following line at the end of viewDidLoad:
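The essential call is `startRunning()` on the session. Because that call blocks until the session has started, the sketch below runs it on a background queue; the queue and the helper name are assumptions added here.

```swift
import AVFoundation

// Hypothetical serial queue dedicated to capture-session work.
let sessionQueue = DispatchQueue(label: "camera.session.queue")

/// Starts `session` off the main thread, since startRunning() blocks
/// until the session is up.
func startSession(_ session: AVCaptureSession) {
    sessionQueue.async {
        guard !session.isRunning else { return }
        session.startRunning()
    }
}
```

In the simplest form, `captureSession.startRunning()` at the end of viewDidLoad achieves the same thing, at the cost of briefly blocking the main thread.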
How to detect faces and draw bounding boxes on the face?
Let’s extract the frames. For this task, we need our capture session to output each image, which is what AVCaptureVideoDataOutput is for. Within the ViewController class, create an instance of AVCaptureVideoDataOutput.
Additionally, let’s configure the video data output to deliver each frame to the ViewController.
Now add the function that receives the frames from the capture session. The delegate method captureOutput(_:didOutput:from:) receives the frames from the video data output.
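The three steps above can be sketched as one class. The names `videoDataOutput` and `getCameraFrames`, the queue label, and the BGRA pixel format are assumptions for illustration.

```swift
import UIKit
import AVFoundation

final class ViewController: UIViewController,
                            AVCaptureVideoDataOutputSampleBufferDelegate {
    let captureSession = AVCaptureSession()
    let videoDataOutput = AVCaptureVideoDataOutput()

    // Hypothetical helper: wires the output into the session and makes
    // this view controller the receiver of every frame.
    func getCameraFrames() {
        videoDataOutput.videoSettings = [
            kCVPixelBufferPixelFormatTypeKey as String: Int(kCVPixelFormatType_32BGRA)
        ]
        // Drop frames that arrive while a previous one is still processing.
        videoDataOutput.alwaysDiscardsLateVideoFrames = true
        videoDataOutput.setSampleBufferDelegate(
            self,
            queue: DispatchQueue(label: "camera.frame.processing.queue"))
        if captureSession.canAddOutput(videoDataOutput) {
            captureSession.addOutput(videoDataOutput)
        }
    }

    // Called once per captured frame on the queue passed above.
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        guard let frame = CMSampleBufferGetImageBuffer(sampleBuffer) else {
            return
        }
        // `frame` is a CVPixelBuffer; this is where it would be handed
        // to Vision for face detection.
        _ = frame
    }
}
```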
Now we will use the Vision framework's VNDetectFaceLandmarksRequest for landmark detection. To access Vision, we must first import it at the top of the file that uses it, ViewController.swift.
Before the Vision framework can track an object, it has to know which object to track. Determine which face to track by creating a VNImageRequestHandler and passing it a still image frame. In the case of video, submit individual frames to the request handler as they arrive in the delegate method.
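A minimal sketch of that request and handler pair is shown below, written as a free function that returns the observations for one frame. The function name is hypothetical, and the orientation is left as a parameter because the correct value depends on the camera and device orientation; it should be verified on a device.

```swift
import Vision
import CoreVideo
import ImageIO

/// Runs face-landmark detection on a single frame and returns the faces
/// found. `detectFaces` is a hypothetical helper name.
func detectFaces(in pixelBuffer: CVPixelBuffer,
                 orientation: CGImagePropertyOrientation) -> [VNFaceObservation] {
    // The request describes what to look for.
    let request = VNDetectFaceLandmarksRequest()

    // The handler binds the request to one concrete image.
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                        orientation: orientation,
                                        options: [:])
    do {
        // perform(_:) is synchronous: results are ready when it returns.
        try handler.perform([request])
    } catch {
        print("Face detection failed: \(error)")
        return []
    }
    // The cast keeps this compiling on SDKs where `results` is untyped.
    return request.results as? [VNFaceObservation] ?? []
}
```

Called from captureOutput(_:didOutput:from:), this runs on the frame-processing queue, so any UI work with the returned observations needs to hop back to the main queue.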
The results contain a property named boundingBox for each observed face. We will take each face in turn and extract its bounding box.
The face observation returns a bounding box with the location of the face in the image. However, the image resolution differs from the screen resolution, so we need a conversion function. Apple provides one on the AVCaptureVideoPreviewLayer instance, named layerRectConverted(fromMetadataOutputRect:), to convert from image coordinates to screen coordinates.
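Following the approach described above, the conversion and drawing could be sketched like this. The function name, stroke color, and line width are assumptions for illustration.

```swift
import UIKit
import AVFoundation
import Vision

/// Builds one outlined rectangle layer per detected face, positioned in
/// the preview layer's coordinate space.
/// `makeFaceBoxLayers` is a hypothetical helper name.
func makeFaceBoxLayers(for faces: [VNFaceObservation],
                       in previewLayer: AVCaptureVideoPreviewLayer) -> [CAShapeLayer] {
    return faces.map { face -> CAShapeLayer in
        // Normalized image rect -> rect in on-screen layer coordinates.
        let boxOnScreen = previewLayer.layerRectConverted(
            fromMetadataOutputRect: face.boundingBox)

        let shape = CAShapeLayer()
        shape.path = CGPath(rect: boxOnScreen, transform: nil)
        shape.fillColor = UIColor.clear.cgColor
        shape.strokeColor = UIColor.green.cgColor
        shape.lineWidth = 2
        return shape
    }
}
```

A handleFaceDetectionResults method would typically remove the layers drawn for the previous frame, then add the new ones to the view's layer on the main queue. If the boxes appear mirrored or flipped on a device, the orientation passed to the request handler is the first thing to check.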
Now call our new handleFaceDetectionResults from the face detection function.
The function below is used for drawing features on the face:
Let's draw some face features. I won’t cover all the face features available, but what we cover here can be applied to any of them. For this post, we will draw the eyes on the screen. We can access the path of each face feature through the landmarks property of the face detection result, VNFaceObservation.
Note that in the drawEye function we have to convert each point of the eye contour to screen points, as we did for the face bounding box.
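A hedged sketch of the eye drawing follows. Landmark points are normalized to the face bounding box, so each one is first mapped into normalized image coordinates and then converted to layer coordinates. The helper names are hypothetical, and the point conversion assumes the same normalized coordinate space as the rect conversion used for the bounding box.

```swift
import UIKit
import AVFoundation
import Vision

/// Maps a landmark point (normalized to the face box) into normalized
/// image coordinates. `imagePoint` is a hypothetical helper name.
func imagePoint(for landmarkPoint: CGPoint, inFaceBox box: CGRect) -> CGPoint {
    return CGPoint(x: box.origin.x + landmarkPoint.x * box.width,
                   y: box.origin.y + landmarkPoint.y * box.height)
}

/// Builds a closed outline layer for each eye Vision found on `face`.
func makeEyeLayers(for face: VNFaceObservation,
                   in previewLayer: AVCaptureVideoPreviewLayer) -> [CAShapeLayer] {
    guard let landmarks = face.landmarks else { return [] }
    let eyes = [landmarks.leftEye, landmarks.rightEye].compactMap { $0 }

    return eyes.map { eye -> CAShapeLayer in
        let screenPoints = eye.normalizedPoints.map { point -> CGPoint in
            let normalized = imagePoint(for: point, inFaceBox: face.boundingBox)
            // Assumption: same coordinate space as
            // layerRectConverted(fromMetadataOutputRect:).
            return previewLayer.layerPointConverted(fromCaptureDevicePoint: normalized)
        }
        let path = CGMutablePath()
        path.addLines(between: screenPoints)
        path.closeSubpath()

        let shape = CAShapeLayer()
        shape.path = path
        shape.fillColor = UIColor.clear.cgColor
        shape.strokeColor = UIColor.green.cgColor
        return shape
    }
}
```

The same pattern applies to other regions of VNFaceLandmarks2D, such as the nose or outer lips, by swapping the region that is read from `landmarks`.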
Vision is a high-level framework for computer vision that makes image-processing tasks such as face detection straightforward on Apple platforms. It aims for high accuracy with short processing times, which is what makes real-time use cases like this one practical. On-device processing that keeps users’ data private, a consistent interface, and no additional cost make Vision an attractive choice.