Informatik
Refine
Document Type
- Conference proceeding (17)
- Journal article (7)
Has full text
- yes (24)
Is part of the Bibliography
- yes (24) (remove)
Institute
- Informatik (24)
Publisher
- IEEE (12)
- De Gruyter (2)
- Springer (2)
- ARVO (1)
- Association for Computing Machinery (1)
- Deutsche Gesellschaft für Computer- und Roboterassistierte Chirurgie e.V. (1)
- PLOS (1)
- Sage Publishing (1)
- Taylor & Francis (1)
- Universität des Saarlandes (1)
With the progress of technology in modern hospitals, an intelligent perioperative situation recognition will gain more relevance due to its potential to substantially improve surgical workflows by providing situation knowledge in real-time. Such knowledge can be extracted from image data by machine learning techniques but poses a privacy threat to the staff’s and patients’ personal data. De-identification is a possible solution for removing visual sensitive information. In this work, we developed a YOLO v3 based prototype to detect sensitive areas in the image in real-time. These are then deidentified using common image obfuscation techniques. Our approach shows that it is principle suitable for de-identifying sensitive data in OR images and contributes to a privacyrespectful way of processing in the context of situation recognition in the OR.
The basic idea behind a wearable robotic grasp assistancesystem is to support people that suffer from severe motor impairments in daily activities. Such a system needs to act mostly autonomously and according to the user’s intent. Vision-based hand pose estimation could be an integral part of a larger control and assistance framework. In this paper we evaluate the performance of egocentric monocular hand pose estimation for a robot-controlled hand exoskeleton in a simulation. For hand pose estimation we adopt a Convolutional Neural Network (CNN). We train and evaluate this network with computer graphics, created by our own data generator. In order to guide further design decisions we focus in our experiments on two egocentric camera viewpoints tested on synthetic data with the help of a 3D-scanned hand model, with and without an exoskeleton attached to it.We observe that hand pose estimation with a wrist-mounted camera performs more accurate than with a head-mounted camera in the context of our simulation. Further, a grasp assistance system attached to the hand alters visual appearance and can improve hand pose estimation. Our experiment provides useful insights for the integration of sensors into a context sensitive analysis framework for intelligent assistance.
Distraction of the driver is one of the most frequent causes for car accidents. We aim for a computational cognitive model predicting the driver’s degree of distraction during driving while performing a secondary task, such as talking with co-passengers. The secondary task might cognitively involve the driver to differing degrees depending on the topic of the conversation or the number of co-passengers. In order to detect these subtle differences in everyday driving situations, we aim to analyse in-car audio signals and combine this information with head pose and face tracking information. In the first step, we will assess driving, video and audio parameters reliably predicting cognitive distraction of the driver. These parameters will be used to train the cognitive model in estimating the degree of the driver’s distraction. In the second step, we will train and test the cognitive model during conversations of the driver with co-passengers during active driving. This paper describes the work in progress of our first experiment with preliminary results concerning driving parameters corresponding to the driver’s degree of distraction. In addition, the technical implementation of our experiment combining driving, video and audio data and first methodological results concerning the auditory analysis will be presented. The overall aim for the application of the cognitive distraction model is the development of a mobile user profile computing the individual distraction degree and being applicable also to other systems.
Human pose estimation (HPE) is integral to scene understanding in numerous safety-critical domains involving human-machine interaction, such as autonomous driving or semi-automated work environments. Avoiding costly mistakes is synonymous with anticipating failure in model predictions, which necessitates meta-judgments on the accuracy of the applied models. Here, we propose a straightforward human pose regression framework to examine the behavior of two established methods for simultaneous aleatoric and epistemic uncertainty estimation: maximum a-posteriori (MAP) estimation with Monte-Carlo variational inference and deep evidential regression (DER). First, we evaluate both approaches on the quality of their predicted variances and whether these truly capture the expected model error. The initial assessment indicates that both methods exhibit the overconfidence issue common in deep probabilistic models. This observation motivates our implementation of an additional recalibration step to extract reliable confidence intervals. We then take a closer look at deep evidential regression, which, to our knowledge, is applied comprehensively for the first time to the HPE problem. Experimental results indicate that DER behaves as expected in challenging and adverse conditions commonly occurring in HPE and that the predicted uncertainties match their purported aleatoric and epistemic sources. Notably, DER achieves smooth uncertainty estimates without the need for a costly sampling step, making it an attractive candidate for uncertainty estimation on resource-limited platforms.
Reliable and accurate car driver head pose estimation is an important function for the next generation of advanced driver assistance systems that need to consider the driver state in their analysis. For optimal performance, head pose estimation needs to be non-invasive, calibration-free and accurate for varying driving and illumination conditions. In this pilot study we investigate a 3D head pose estimation system that automatically fits a statistical 3D face model to measurements of a driver’s face, acquired with a low-cost depth sensor on challenging real-world data. We evaluate the results of our sensor-independent, driver-adaptive approach to those of a state-of-the-art camera-based 2D face tracking system as well as a non-adaptive 3D model relative to own ground-truth data, and compare to other 3D benchmarks. We find large accuracy benefits of the adaptive 3D approach.
We present a multitask network that supports various deep neural network based pedestrian detection functions. Besides 2D and 3D human pose, it also supports body and head orientation estimation based on full body bounding box input. This eliminates the need for explicit face recognition. We show that the performance of 3D human pose estimation and orientation estimation is comparable to the state-of-the-art. Since very few data sets exist for 3D human pose and in particular body and head orientation estimation based on full body data, we further show the benefit of particular simulation data to train the network. The network architecture is relatively simple, yet powerful, and easily adaptable for further research and applications.
Perceptual integration of kinematic components in the recognition of emotional facial expressions
(2018)
According to a long-standing hypothesis in motor control, complex body motion is organized in terms of movement primitives, reducing massively the dimensionality of the underlying control problems. For body movements, this low dimensional organization has been convincingly demonstrated by the learning of low-dimensional representations from kinematic and EMG data. In contrast, the effective dimensionality of dynamic facial expressions is unknown, and dominant analysis approaches have been based on heuristically defined facial ‘‘action units,’’ which reflect contributions of individual face muscles. We determined the effective dimensionality of dynamic facial expressions by learning of a low dimensional model from 11 facial expressions. We found an amazingly low dimensionality with only two movement primitives being sufficient to simulate these dynamic expressions with high accuracy. This low dimensionality is confirmed statistically, by Bayesian model comparison of models with different numbers of primitives, and by a psychophysical experiment that demonstrates that expressions, simulated with only two primitives, are indistinguishable from natural ones.
In addition, we find statistically optimal integration of the emotion information specified by these primitives in visual perception. Taken together, our results indicate that facial expressions might be controlled by a very small number of independent control units, permitting very low dimensional parametrization of the associated facial expression.
Prominent theories of action recognition suggest that during the recognition of actions the physical patterns of the action is associated with only one action interpretation (e.g., a person waving his arm is recognized as waving). In contrast to this view, studies examining the visual categorization of objects show that objects are recognized in multiple ways (e.g., a VW Beetle can be recognized as a car or a beetle) and that categorization performance is based on the visual and motor movement similarity between objects. Here, we studied whether we find evidence for multiple levels of categorization for social interactions (physical interactions with another person, e.g., handshakes). To do so, we compared visual categorization of objects and social interactions (Experiments 1 and 2) in a grouping task and assessed the usefulness of motor and visual cues (Experiments 3, 4, and 5) for object and social interaction categorization. Additionally, we measured recognition performance associated with recognizing objects and social interactions at different categorization levels (Experiment 6). We found that basic level object categories were associated with a clear recognition advantage compared to subordinate recognition but basic level social interaction categories provided only a little recognition advantage. Moreover, basic level object categories were more strongly associated with similar visual and motor cues than basic level social interaction categories. The results suggest that cognitive categories underlying the recognition of objects and social interactions are associated with different performances. These results are in line with the idea that the same action can be associated with several action interpretations (e.g., a person waving his arm can be recognized as waving or greeting).
Motor-based theories of facial expression recognition propose that the visual perception of facial expression is aided by sensorimotor processes that are also used for the production of the same expression. Accordingly, sensorimotor and visual processes should provide congruent emotional information about a facial expression. Here, we report evidence that challenges this view. Specifically, the repeated execution of facial expressions has the opposite effect on the recognition of a subsequent facial expression than the repeated viewing of facial expressions. Moreover, the findings of the motor condition, but not of the visual condition, were correlated with a nonsensory condition in which participants imagined an emotional situation. These results can be well accounted for by the idea that facial expression recognition is not always mediated by motor processes but can also be recognized on visual information alone.
Putting actions in context: visual action adaptation aftereffects are modulated by social contexts
(2014)
The social context in which an action is embedded provides important information for the interpretation of an action. Is this social context integrated during the visual recognition of an action? We used a behavioural visual adaptation paradigm to address this question and measured participants’ perceptual bias of a test action after they were adapted to one of two adaptors (adaptation after-effect). The action adaptation after effect was measured for the same set of adaptors in two different social contexts. Our results indicate that the size of the adaptation effect varied with social context (social context modulation) although the physical appearance of the adaptors remained unchanged. Three additional experiments provided evidence that the observed social context modulation of the adaptation effect are owed to the adaptation of visual action recognition processes. We found that adaptation is critical for the social context modulation (experiment 2). Moreover, the effect is not mediated by emotional content of the action alone (experiment 3) and visual information about the action seems to be critical for the emergence of action adaptation effects (experiment 4). Taken together these results suggest that processes underlying visual action recognition are sensitive to the social context of an action.
Learning to translate between real world and simulated 3D sensors while transferring task models
(2019)
Learning-based vision tasks are usually specialized on the sensor technology for which data has been labeled. The knowledge of a learned model is simply useless when it comes to data which differs from the data on which the model has been initially trained or if the model should be applied to a totally different imaging or sensor source. New labeled data has to be acquired on which a new model can be trained. Depending on the sensor, this can even get more complicated when the sensor data becomes more abstract and hard to be interpreted and labeled by humans. To enable reuse of models trained for a specific task across different sensors minimizes the data acquisition effort. Therefore, this work focuses on learning sensor models and translating between them, thus aiming for sensor interoperability. We show that even for the complex task of human pose estimation from 3D depth data recorded with different sensors, i.e. a simulated and a Kinect 2TM depth sensor, human pose estimation can greatly improve by translating between sensor models without modifying the original task model. This process especially benefits sensors and applications for which labels and models are difficult if at all possible to retrieve from raw sensor data.
We present an approach for segmenting individual cells and lamellipodia in epithelial cell clusters using fully convolutional neural networks. The method will set the basis for measuring cell cluster dynamics and expansion to improve the investigation of collective cell migration phenomena. The fully learning-based front-end avoids classical feature engineering, yet the network architecture needs to be designed carefully. Our network predicts how likely each pixel belongs to one of the classes and, thus, is able to segment the image. Besides characterizing segmentation performance, we discuss how the network will be further employed.
Based on well-established robotic concepts of autonomous localization and navigation we present a system prototype to assist camera-based indoor navigation for human utilization implemented in the Robot Operating System (ROS). Our prototype takes advantage of state-of-the-art computer vision and robotic methods. Our system is designed for assistive indoor guidance. We employ a vibro tactile belt to serve as a guiding device to render derived motion suggestions to the user via vibration patterns. We evaluated the effectiveness of a variety of vibro-tactile feedback patterns for guidance of blindfolded users. Our prototype demonstrates that a vision-based system can support human navigation, and may also assist the visually impaired in a human-centered way.
RoPose-Real: real world dataset acquisition for data-driven industrial robot arm pose estimation
(2019)
It is necessary to employ smart sensory systems in dynamic and mobile workspaces where industrial robots are mounted on mobile platforms. Such systems should be aware of flexible and non-stationary workspaces and able to react autonomously to changing situations. Building upon our previously presented RoPose-system, which employs a convolutional neural network architecture that has been trained on pure synthetic data to estimate the kinematic chain of an industrial robot arm system, we now present RoPose-Real. RoPose-Real extends the prior system with a comfortable and targetless extrinsic calibration tool, to allow for the production of automatically annotated datasets for real robot systems. Furthermore, we use the novel datasets to train the estimation network with real world data. The extracted pose information is used to automatically estimate the observing sensor pose relative to the robot system. Finally we evaluate the performance of the presented subsystems in a real world robotic scenario.
As production workspaces become more mobile and dynamic it becomes increasingly important to reliably monitor the overall state of the environment. Therein manipulators or other robotic systems likely have to be able to act autonomously together with humans and other systems within a joint workspace. Such interactions require that all components in non-stationary environments are able to perceive the state relative to each other. As vision-sensors provide a rich source of information to accomplish this, we present RoPose, a convolutional neural network (CNN) based approach, to estimate the two dimensional joint configuration of a simulated industrial manipulator from a camera image. This pose information can further be used by a novel targetless calibration setup to estimate the pose of the camera relative to the manipulator’s space. We present a pipeline to automatically generate synthetic training data and conclude with a discussion of the potential usage of the same pipeline to acquire real image datasets of physically existent robots.
Die Segmentierung und das Tracking von minimal-invasiven robotergeführten Instrumenten ist ein wesentlicher Bestandteil für verschiedene computer assistierte Eingriffe. Allerdings treten in der minimal-invasiven Chirurgie, die das Anwendungsfeld für den hier beschriebenen Ansatz darstellt, häufig Schwierigkeiten durch Reflexionen, Schatten oder visuelle Verdeckungen durch Rauch und Organe auf und erschweren die Segmentierung und das Tracking der Instrumente.
Dieser Beitrag stellt einen Deep Learning Ansatz für ein markerloses Tracking von minimal-invasiven Instrumenten vor und wird sowohl auf simulierten als auch realen Daten getestet. Es wird ein simulierter als auch realer Datensatz mit Ground Truth Kennzeichnung für die binäre Segmentierung von Instrument und Hintergrund erstellt. Für den simulierten Datensatz werden Bilder aus einem simulierten Instrument und realem Hintergrund zusammengesetzt. Im Falle des realen Datensatzes spricht man von der Zusammensetzung der Bilder aus einem realen Instrument und Hintergrund. Insgesamt wird auf den simulierten Daten eine Pixelgenauigkeit von 94.70 Prozent und auf den realen Daten eine Pixelgenauigkeit von 87.30 Prozent erreicht.
Recognizing human actions is a core challenge for autonomous systems as they directly share the same space with humans. Systems must be able to recognize and assess human actions in real-time. To train the corresponding data-driven algorithms, a significant amount of annotated training data is required. We demonstrate a pipeline to detect humans, estimate their pose, track them over time and recognize their actions in real-time with standard monocular camera sensors. For action recognition, we transform noisy human pose estimates in an image like format we call Encoded Human Pose Image (EHPI). This encoded information can further be classified using standard methods from the computer vision community. With this simple procedure, we achieve competitive state-of-the-art performance in pose based action detection and can ensure real-time performance. In addition, we show a use case in the context of autonomous driving to demonstrate how such a system can be trained to recognize human actions using simulation data.
Enhancing data-driven algorithms for human pose estimation and action recognition through simulation
(2020)
Recognizing human actions, reliably inferring their meaning and being able to potentially exchange mutual social information are core challenges for autonomous systems when they directly share the same space with humans. Intelligent transport systems in particular face this challenge, as interactions with people are often required. The development and testing of technical perception solutions is done mostly on standard vision benchmark datasets for which manual labelling of sensory ground truth has been a tedious but necessary task. Furthermore, rarely occurring human activities are underrepresented in these datasets, leading to algorithms not recognizing such activities. For this purpose, we introduce a modular simulation framework, which offers to train and validate algorithms on various human-centred scenarios. We describe the usage of simulation data to train a state-of-the-art human pose estimation algorithm to recognize unusual human activities in urban areas. Since the recognition of human actions can be an important component of intelligent transport systems, we investigated how simulations can be applied for his purpose. Laboratory experiments show that we can train a recurrent neural network with only simulated data based on motion capture data and 3D avatars, which achieves an almost perfect performance in the classification of those human actions on real data.
Recognizing actions of humans, reliably inferring their meaning and being able to potentially exchange mutual social information are core challenges for autonomous systems when they directly share the same space with humans. Today’s technical perception solutions have been developed and tested mostly on standard vision benchmark datasets where manual labeling of sensory ground truth is a tedious but necessary task. Furthermore, rarely occurring human activities are underrepresented in such data leading to algorithms not recognizing such activities. For this purpose, we introduce a modular simulation framework which offers to train and validate algorithms on various environmental conditions. For this paper we created a dataset, containing rare human activities in urban areas, on which a current state of the art algorithm for pose estimation fails and demonstrate how to train such rare poses with simulated data only.
Avatars are in use when interacting in virtual environments in different contexts, in collaborative work, as well as in gaming and also in virtual meetings with friends. Therefore it is important to understand how the relationship between user and avatar works. In this study, an online survey is used to determine how the perception of an avatar changes in different contexts by relating it to existing avatar relationship typologies. Additionally, it is determined whether in each context a realistic, abstract or comic-like representation is preferred by the participants. One result was a preference of low poly representations in the work context, which are associated with the perception of the avatar as a tool. In the context of meeting friends, a realistic representation is perceived as more appropriate, which is perceived as an accurate self-representation. In the gaming context, the results are less clear, which can be attributed to different gaming preferences. Here, unlike in the other contexts, a comic-like representation is also perceived as appropriate, which is associated with the perception of the avatar as a friend. A symbiotic user-avatar relationship is not directly related to any form of representation, but always lies in the midfield, which is attributed to the fact that it represents a whole spectrum between other categories.