Augmented reality (AR) describes digital experiences where virtual content, such as 2D or 3D graphics, is blended with and aligned to the real world. In that sense, AR extends, enhances, or augments a user's view of the real world. When an app superimposes that content on a live camera image, the user experiences augmented reality: the illusion that virtual elements exist as part of the real world.
Blending content with the live camera image of a mobile device in this way is called the video-see-through effect. On HoloLens and other mixed reality glasses, which let you see reality through a transparent display instead of perceiving it through a video stream, the same kind of blending is called the optical-see-through effect.
Regardless of which device you use, the core requirement for any AR experience is the ability to create and track correspondences between the user's real world and a virtual space. VisionLib leverages computer vision techniques to enable this matching and tracking of correspondences.
Computer vision offers a multitude of tracking techniques, ranging from marker-based over feature-based to edge-based approaches. It remains an active and evolving research field, yet a few popular techniques have proven successful and are considered state of the art.
For you as a developer, creating AR experiences involves several layers. First, the tracking layer: VisionLib manages the real-world acquisition through computer vision. The second layer is responsible for rendering visual elements, i.e. your virtual content. With VisionLib's API, you can choose from several development environments, of which Unity3D is the most popular and the easiest to get started with. The first two layers are synchronized through VisionLib's API, which passes the math from tracking to rendering. The third layer is your specific application logic.
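To make the layering concrete, here is a minimal, purely illustrative sketch. The class and function names are assumptions for this example and do not represent the VisionLib API; the point is only how a per-frame pose from the tracking layer drives the rendering layer while the application logic ties both together.

```python
# Illustrative sketch only: hypothetical names, not the VisionLib API.
import numpy as np

class TrackingLayer:
    """Stands in for the computer-vision tracking (layer 1)."""
    def update(self, frame) -> tuple[np.ndarray, bool]:
        # In a real system this runs pose estimation on the camera image.
        model_view = np.eye(4)   # 4x4 pose (rotation + translation)
        is_tracking = True
        return model_view, is_tracking

class RenderingLayer:
    """Stands in for the render engine, e.g. a Unity scene (layer 2)."""
    def draw(self, model_view: np.ndarray) -> None:
        # Apply the tracked pose to the virtual camera, then draw the content.
        pass

def app_loop(camera_frames, tracker: TrackingLayer, renderer: RenderingLayer):
    """Layer 3: application logic connects tracking and rendering."""
    for frame in camera_frames:
        pose, ok = tracker.update(frame)
        if ok:                      # only augment while tracking is valid
            renderer.draw(pose)
```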
In a more technical sense, computer vision tracking incorporates algorithms and mathematical techniques to describe or process images, video, or 3D depth-data streams. In Mixed Reality, this concerns, for example, the so-called pose estimation of a camera, which incorporates extrinsic data (such as camera translation and rotation) and intrinsic data (such as focal length and other parameters of the camera's optics). Both are needed to create a precise superimposition of the augmented content.
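The following sketch shows how intrinsic and extrinsic data combine in a basic pinhole projection: a 3D point is transformed by the extrinsic pose (R, t) and projected through the intrinsic matrix K onto the image. The numeric values are made up for illustration.

```python
import numpy as np

# Intrinsics: focal lengths and principal point of the (calibrated) camera.
fx, fy, cx, cy = 800.0, 800.0, 320.0, 240.0
K = np.array([[fx, 0, cx],
              [0, fy, cy],
              [0,  0,  1]])

# Extrinsics: camera rotation R and translation t (here: 2 m in front).
R = np.eye(3)
t = np.array([0.0, 0.0, 2.0])

def project(X_world: np.ndarray) -> np.ndarray:
    """Project a 3D world point to 2D pixel coordinates."""
    X_cam = R @ X_world + t          # world -> camera coordinates
    x = K @ X_cam                    # camera -> image plane
    return x[:2] / x[2]              # perspective divide

print(project(np.array([0.1, 0.0, 0.0])))  # -> [360. 240.]
```

If K, R, or t are off, the projected content lands on the wrong pixels, which is exactly why calibration matters.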
That is why well-calibrated camera optics are essential for mobile AR on tablets and smartphones, and likewise why a proper eye calibration or camera-to-camera calibration is essential on XR glasses.
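As a hedged example of how such intrinsic calibration is typically obtained, the sketch below uses OpenCV and images of a printed checkerboard; the folder name and board size are assumptions. The resulting camera matrix and distortion coefficients are the intrinsic data a tracker needs for an accurate superimposition.

```python
import glob
import cv2
import numpy as np

board = (9, 6)                              # inner corners of the checkerboard
objp = np.zeros((board[0] * board[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("calibration_images/*.jpg"):   # assumed folder
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, board)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Estimate focal length, principal point and lens distortion.
ret, camera_matrix, dist_coeffs, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print(camera_matrix, dist_coeffs)
```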
Earlier forms of Augmented Reality on mobile devices solely used inertial sensors (like compass, gyroscope, and accelerometer) to place and align information in the real world. Content augments our view, but, technology-wise, there is no deeper understanding of reality, nor are there any particular correspondences other than the inertial sensor readings. The sensor values are used to estimate the user's view and position as well as possible. Because inertial sensors alone tend to drift, they miss important cues, which prevents the AR view from anchoring content unambiguously.
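The drift problem is easy to demonstrate: naively integrating a gyroscope reading with a small constant bias accumulates into a noticeable orientation error within a minute. The numbers below are made up but of a typical consumer-grade order.

```python
import numpy as np

dt = 0.01                 # 100 Hz sensor rate
bias = 0.05               # deg/s gyroscope bias
true_rate = 0.0           # the device is actually not rotating

angle = 0.0
for _ in range(60 * 100):                 # integrate for one minute
    measured = true_rate + bias + np.random.normal(0, 0.1)
    angle += measured * dt                # naive dead reckoning

print(f"estimated rotation after 1 min: {angle:.1f} deg")  # roughly 3 deg off
```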
Tracking images, posters, and markers by means of computer vision shares the same foundation: a (known) image or pattern is recognized and then tracked within the captured video stream. Based on what is called feature tracking, such 2D materials are good targets because they result in fixed feature maps. These trackers were and still are popular for many AR cases. For example, with a printed product catalog, you can use images on particular pages to superimpose the depicted product as a 3D, AR view.
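A hedged sketch of this idea with OpenCV follows; the file names are assumptions. Features of a known reference image are matched against the camera frame, and a homography localizes the image in that frame, which, with calibrated intrinsics, can be turned into a camera pose for AR.

```python
import cv2
import numpy as np

reference = cv2.imread("catalog_page.jpg", cv2.IMREAD_GRAYSCALE)  # known target
frame = cv2.imread("camera_frame.jpg", cv2.IMREAD_GRAYSCALE)      # live image

orb = cv2.ORB_create(nfeatures=1000)
kp_ref, des_ref = orb.detectAndCompute(reference, None)
kp_frm, des_frm = orb.detectAndCompute(frame, None)

# Match descriptors and keep the best correspondences.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_ref, des_frm), key=lambda m: m.distance)[:50]

src = np.float32([kp_ref[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp_frm[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# The homography H maps the reference image into the camera frame.
H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
print(H)
```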
Image tracking usually enables precise augmentation results, but printed images are only 2D. You could "tag" your environment with images upfront and align physical 3D objects to these "spatial markers" in order to mimic true 3D object tracking. However, when things change, e.g. the aligned objects move or markers are removed or repositioned, the superimposition no longer matches reality and the experience breaks. In all cases, image markers require preparation upfront.
SLAM (Simultaneous Localization and Mapping) has become a solid AR enabler. This technique lets you spontaneously reconstruct maps of your current environment by means of computer vision. It makes it possible to blend content into reality quite stably and works well for placing holograms with some basic environmental understanding.
But SLAM is not capable of precise model detection. Because it only reconstructs maps of objects or spaces, it struggles with changing environments or lighting conditions and is not very stable over time. As a consequence, it is hard to almost impossible for a developer to pin information to particular spots in reality upfront, and experiences that rely on stored or anchored SLAM maps can eventually break.
Whenever you want to create AR apps in which information must be augmented precisely and unambiguously onto a specific point or object in reality, model tracking is the state of the art to work with. Model tracking localizes and tracks objects by means of 3D and CAD data and is a key enabler for all AR applications that need to pin information and virtual content exactly to a certain point or location.
And because VisionLib's model tracking overcomes typical AR and computer vision obstacles, such as varying light or moving elements, no further preparation is needed: neither tagging objects or environments with markers nor acquiring SLAM maps upfront.
For developers, this is a game changer for AR cases that rely on stable tracking and detection. And by using CAD and 3D data as the reference, you can position your AR content in relation to the digital twin.
For users, this is a game changer, too: you get reliable and valuable AR apps that support you in many different areas, for example AR-enhanced manuals that guide you visually through a procedure step by step and place torque values or other special information right on the screw they belong to. This puts AR views on a whole new scale: industrial scale.
Next chapter: VisionLib Technical Overview