The experiment was conducted for a total of 20 days in 2017 (November 27; December 2, 4, 5), 2018 (April 1, 2, 16; September 10, 11, 20, 21; December 3, 5, 6, 10, 11), and 2019 (April 17, 20, 22, 25), and provided service to 330 groups of customers.
3.3.1. Starting a Conversation with a Customer
Previous studies have used robots to estimate the internal state of humans [14,15] and to obtain information about conversation partners before the robot starts a conversation [16]. Allowing a robot to read social cues is one of the indispensable means of communicating with humans. One study also demonstrated that a robot's ability to recognize and respond to human behavior is important for successfully engaging people and persuading them to nod and reply to its comments [10]. Based on this finding, we developed an attention estimation model (AEM) from data on customers' nonverbal cues in the field experiment and examined whether the model was effective in a real-world environment. The field experiment was divided into two steps: (1) training the AEM and (2) verifying the effectiveness of the AEM in a real-world environment. We used the data from the first step of the field experiment, conducted over 10 days in 2017, and performed the second step of the field experiment over 9 days in 2018.
According to Poggi [17], engagement is defined as the value that a participant in an interaction attributes to the goal of being together with the other participants and continuing the interaction. In this study, we defined customers' engagement as the probability that they responded to Pepper, the robotic salesperson. Previous studies indicate that one of the most common cues for estimating people's engagement is eye contact, which is a strong sign of attention to a conversation partner [18,19]. Moreover, eye contact makes it possible to see facial expressions, such as smiles and frowns. Reactions to the robot, such as nodding and posture, can also provide vital information [19,20], and the physical distance between the robot and customers is another important cue [21].
Based on the above, we selected the following five nonverbal cues, all detectable by Pepper's sensors, to calculate the degree of engagement: (1) eye contact, (2) duration of eye contact, (3) distance, (4) approaching, and (5) laughing. Verbal cues were not used because the shop was noisy and the robot could not detect them accurately. We collected data from video of the first step of the field experiment, in which 86 groups had conversations with Pepper. In 11 of these 86 groups, however, the conversation was started by the customers; we excluded these 11 groups from the data because their interactions were not consistent with the situations we aimed to investigate. If there were several customers in one group, we focused on the customer closest to the robot. Nonverbal cues from the time at which the customers entered the shop to the robot's first spoken words were defined as the following variables (a sketch of the corresponding feature encoding follows the list):
- Y: Engagement, binary data defining whether the customer responded to the robot's speech (1 if yes, 0 if no).
- X1: Eye contact, binary data defining whether the customer looked at the robot when the robot greeted him or her (1 if yes, 0 if no).
- X2: Duration of eye contact, continuous data defining the length of eye contact in seconds, measured between the customer's entrance into the shop and the robot's first greeting.
- X3: Distance, continuous data defining the distance in meters between the robot and the customer during the robot's greeting.
- X4: Approaching, binary data defining whether the customer approached the robot when it greeted him or her (1 if yes, 0 if no).
- X5: Laughing, binary data defining whether the customer laughed when the robot greeted him or her (1 if yes, 0 if no).
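For concreteness, the following is a minimal sketch of how these variables could be encoded as a feature vector in Python; the class and function names are illustrative, not part of the original system.

```python
from dataclasses import dataclass

@dataclass
class NonverbalCues:
    eye_contact: int        # X1: 1 if the customer looked at the robot, else 0
    gaze_duration_s: float  # X2: duration of eye contact in seconds
    distance_m: float       # X3: robot-customer distance in meters
    approaching: int        # X4: 1 if the customer approached the robot, else 0
    laughing: int           # X5: 1 if the customer laughed, else 0

def to_feature_vector(cues: NonverbalCues) -> list[float]:
    """Return the features in the order X1..X5 used throughout the paper."""
    return [float(cues.eye_contact), cues.gaze_duration_s,
            cues.distance_m, float(cues.approaching), float(cues.laughing)]
```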
In this study, inter-rater reliability was assessed for 40 groups of customers (approximately 25% of the entire dataset). The customers' micro-behaviors were scored by two coders: one who scored the entire dataset and another who scored the 40 groups. The inter-rater reliability results were as follows: global inter-rater reliability (kappa = 0.76), engagement (kappa = 0.79), eye contact (kappa = 0.75), duration of eye contact (kappa = 0.77), approaching (kappa = 0.83), distance (kappa = 0.71), and laughing (kappa = 0.69).
To estimate the engagement of customers, we adopted logistic regression, a simple machine learning model for binary classification problems. The training results are presented in Table 1: eye contact, duration of eye contact, approaching, and laughing all had positive weights (i.e., were positively associated with engagement), whereas distance had a negative weight, and the bias was −0.20. We conducted 5-fold cross-validation to validate the attention estimation model on our dataset and achieved 88.9% accuracy, 87.1% precision, and 90.7% recall, which we considered sufficient for estimating the engagement of customers.
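As a rough illustration of this training and validation procedure, the sketch below uses scikit-learn's logistic regression with 5-fold cross-validation; the data files are hypothetical placeholders for the coded field data, and the default solver settings are an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Hypothetical placeholders for the coded field data:
# X has one row per customer group with columns X1..X5; y is engagement (0/1).
X = np.load("nonverbal_cues.npy")
y = np.load("engagement.npy")

model = LogisticRegression()
scores = cross_validate(model, X, y, cv=5,
                        scoring=("accuracy", "precision", "recall"))
for metric in ("accuracy", "precision", "recall"):
    print(metric, scores[f"test_{metric}"].mean())

# Fitting on the full dataset exposes the learned weights (cf. Table 1).
model.fit(X, y)
print("weights:", model.coef_[0], "bias:", model.intercept_[0])
```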
Before the second step of the field experiment, we installed in the robot an automatic greeting mode based on the AEM, illustrated in Figure 4. The camera on Pepper's head captured the customers' faces to judge whether they made eye contact with Pepper and to detect their facial expressions. In addition, we determined whether the customers approached Pepper by measuring the distance with the laser radar at Pepper's feet. The sensors could only detect customers within approximately 3 m of Pepper.
As shown in Figure 4, the output of the AEM was a probability from 0 to 1, and we designed two thresholds to divide the customers' engagement levels into three categories. If the engagement was lower than 0.33, Pepper was programmed to do nothing, because such customers were not interested in the robot at all. If the engagement was between 0.33 and 0.66, Pepper was programmed to say "Hey," because spoken words can attract customers' attention without forcing them to reply. If the engagement rose above 0.66, Pepper said "Hello" or asked the customers to shake hands, as their engagement was sufficiently high that they were likely to respond to the robot's statements.
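The two-threshold policy can be summarized as a small decision function; the following is a sketch in which the action names are simplified stand-ins for Pepper's actual behaviors.

```python
def greeting_action(engagement: float) -> str:
    """Map the AEM output probability (0-1) to one of three behaviors."""
    if engagement < 0.33:
        return "do_nothing"   # customer shows no interest in the robot
    elif engagement < 0.66:
        return "say_hey"      # attract attention without forcing a reply
    else:
        return "greet"        # say "Hello" or offer a handshake
```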
In this experiment, we compared two groups of customers: those who responded to Pepper after it greeted them automatically based on the AEM, and those who responded when an experimenter controlled Pepper remotely. The results of the second step of the field experiment are presented in Figure 5. The baseline represents the condition in which the experimenters controlled Pepper remotely (greeting only when the customers approached within 3 m of the robot, as this was the distance from the robot to the entrance of the shop). During the experiments, we alternated between the baseline and automatic greeting mode conditions every 30 min. In the AEM condition illustrated in Figure 5, 30 out of 39 customer groups (76.9%) responded to the robot, whereas in the baseline condition, 22 out of 42 customer groups (52.4%) responded. The proportion of customer responses in the AEM condition was thus higher than that in the baseline condition, and a chi-squared test indicated a significant difference between the two conditions (χ² = 5.29, p = 0.021 < 0.05). In addition, during the first step of the field experiment, 75 out of 153 customer groups (49.0%) responded to the robot's remarks. Therefore, the automatic greeting mode led to a higher proportion of customer responses than the baseline mode. To investigate the cause of these results, we analyzed the interactions of the following groups.
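The reported statistic can be reproduced from the 2 × 2 response table with a standard chi-squared test, e.g., in scipy; disabling Yates' continuity correction matches the reported values.

```python
from scipy.stats import chi2_contingency

#         responded  did not respond
table = [[30, 9],    # AEM condition (39 groups)
         [22, 20]]   # baseline condition (42 groups)
chi2, p, dof, _ = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")  # chi2 = 5.29, p = 0.021
```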
A conversation with a customer group in which the robot was remotely controlled is presented in Table 2. The transcript indicates that when the robot said "Hello," the customer (C1) approached the robot but did not look at it (lines 4 and 5), and the robot was thereafter ignored. We considered that, during the remote control experiments, it was difficult for the controller to determine what the customers were looking at.
A conversation with a customer group in which the robot was set to the automatic greeting mode is presented in Table 3. When the robot said "Hey," the customer (C1) turned her head and looked at the robot (lines 3 and 4). The robot then perceived C1's eye contact and quickly said "Hello" (line 5), after which C1 responded and a conversation was initiated (line 6). The automatic greeting mode appeared to perceive people's states more precisely, allowing the robot to make decisions more quickly. Therefore, when the robot greeted customers based on the AEM, they were likely to feel that the robot could understand their degree of attention to it, and they may then have felt a sense of guilt if they ignored it. Consequently, they may have been more willing to respond to the robot [10]. Thus, automatic greeting based on the AEM was effective in a real-world environment and strengthened the robot's social presence. The robot's AEM-based greeting led customers to believe that the robot could understand their behaviors, owing to its ability to turn its head toward them or make eye contact with them. Creating the impression that a robot can detect people's behaviors can thus lead to a higher probability of customers responding to the robot's remarks [22].
3.3.2. Attracting Customers' Attention
However, for some customers, the engagement value did not increase enough for the robot to greet them. For such customers, the robot must take action to attract their attention. Rhythmic speech, for example, can be novel and interesting; therefore, in this experiment, rap singing was adopted to further attract customers' attention. The robot performed rap singing when the target customer's engagement value was not high enough for the robot to greet him or her automatically and when the customer was likely to leave the shop if the robot did nothing. To identify these scenarios, we counted, in 10 s bins after entrance into the shop, the number of groups that the robot greeted and the number of groups that left the shop before the robot's automatic greeting function was activated, and examined their relation.
The results presented in Figure 6 indicate that 73 customer groups visited while the automatic greeting mode was in use: 42 groups were greeted by the robot, and 31 groups left the shop before the robot could greet them. The horizontal axis in Figure 6 represents the elapsed time from the customer's entrance into the shop until the robot greeted them or until they left the shop, while the vertical axis represents the proportion of customers in each 10 s bin. The figure indicates that from 21 to 30 s after entrance, the number of groups leaving the shop before the robot could greet them exceeded the number of groups with a high engagement level. This signifies that if the robot had not greeted the customers via the automatic greeting mode within 21 to 30 s of their entrance, they were likely to leave. We therefore considered this window the appropriate time to perform rap singing to attract the customers' attention: if the engagement value was still not high enough for the robot to greet the customers after 25 s (within the 21-30 s window), the robot sang a rap song.
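This timing rule can be sketched as a simple trigger. The 25 s delay follows the text, while tying the trigger to the lower greeting threshold (0.33) is our assumption, since the exact cutoff is left implicit; the function names are illustrative.

```python
import time

RAP_DELAY_S = 25.0       # within the critical 21-30 s window
GREET_THRESHOLD = 0.33   # assumed cutoff below which no greeting has fired

def should_sing_rap(entry_time: float, engagement: float) -> bool:
    """Trigger rap singing if the group has lingered without being greeted."""
    elapsed = time.monotonic() - entry_time
    return elapsed >= RAP_DELAY_S and engagement < GREET_THRESHOLD
```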
3.3.3. Shifting Customers’ Attention to Products
We also considered a motion suggesting that customers taste a sample as a method of shifting attention from the robot to the products. Sample tasting has been shown to be effective in promoting sales [24]: through the psychological effect called the norm of reciprocity, a customer who receives something from another person feels that something must be returned. In addition, a previous field experiment used a robot to offer sample tastings [25]; the authors compared the abilities of a robot and of humans to advertise sample tastings to customers in large shopping centers, and demonstrated that the robot was more effective. When our robot offered a sample to the customers, it said "Would you like to try a sample? You can taste here!" and indicated the sample with its hand and eyes, as illustrated in Figure 8.
However, even when the robot suggests tasting a sample, the customers do not always respond to the request. A more assertive approach, such as handing a sample to customers, can lead them to behave according to the consistency principle, a psychological process in which people tend to accept purchase requests in order to keep their behavior consistent. However, it is not easy for a robot to pick something up with its mechanical hand and hand it to a customer: not only are dexterous hand movements difficult for a robot, but customers may also not believe that the robot can respond accordingly even if they express a desire to try a sample. It was therefore necessary to consider another method of handing the sample directly to the customers. We decided that it would be sufficient to have the customers agree to a sample tasting; accordingly, in this experiment, we installed a function for the robot to request that a salesperson give customers a sample. Specifically, the robot called a salesperson with the motion illustrated in Figure 9 while saying "Excuse me. Please give them a sample."
To determine when to activate the motion to suggest tasting a sample and the request for a salesperson’s assistance, we compared a group that agreed to taste a sample and a group that did not agree after the robot’s suggestive motions.
Table 4 presents a conversation with a group to which the robot suggested tasting a sample. In this conversation, the robot requested the help of a salesperson to distribute the sample to the customers, and the customers then purchased the product.
In the example displayed in Table 4, when the customer (C1) looked at the robot while talking, the robot suggested tasting a sample by saying "Would you like to try a sample? You can taste here" (line 16), and the customers turned their heads to the sample and replied to the robot (line 18). Seven seconds later, while the customer was looking at the sample, the robot requested help from the salesperson (line 21), and the salesperson passed the sample directly to the customer (line 25). The customer then tasted and purchased the product. In this example, the customers were obviously interested in the robot, as they took pictures of it; however, the robot appeared to successfully promote sales by directing their attention to the products with suggestions to taste a sample.
We also analyzed a conversation with a group that did not taste a sample after the robot’s suggestion.
Table 5 presents a conversation with a group that responded to the suggestion to taste a sample but did not taste it even after the robot asked the salesperson for assistance.
In Table 5, the customer (C1) looked at the robot in response to the robot's suggestion "Would you like to try a sample?" (line 9). When the robot said "You can taste here" (line 10), the customer looked at the sample (line 11). However, during the robot's description of the sample, the customer turned away from the sample, seeming uninterested (lines 14 and 15). At this time, the robot requested assistance from the salesperson (line 18), but the customer ignored the sample and left the shop. Comparing this with the conversation in Table 4, in which the robot achieved a sample tasting, we observed a difference in the customers' line of sight at the moment the robot requested assistance from the salesperson.
Next, Table 6 presents a conversation with a group that did not taste the sample when the robot offered it.
In Table 6, the customer (C1) immediately engaged with the robot and responded to the request for a handshake (line 13). Thereafter, when C1 looked away from the robot (line 19), the robot suggested tasting the sample (lines 20 and 22). Notably, C1 responded to the robot's suggestion to taste (line 21), indicating that he had heard it; however, he did not taste the sample and did not even look at it. Comparing this conversation with those in Table 4 and Table 5, there was a difference in the customers' line of sight when the robot suggested tasting the sample. This analysis therefore reveals the following two points:
- If the robot offers a sample while the customer is looking at the robot, the customer often looks at the sample for tasting.
- If the robot asks the salesperson for assistance while the customer is looking at the sample, the customer often tastes it.
Accordingly, Table 7 presents the results obtained by examining the line of sight of 16 customer groups at the time the robot requested assistance from the salesperson.
According to Table 7, seven groups of customers who looked at the sample before the robot's request received a sample from the salesperson and tasted it after the robot requested assistance. In contrast, eight groups who did not taste the sample had not been looking at the sample before the robot's request to the salesperson. Applying Fisher's exact test [26] to these results, as illustrated in Figure 10, we determined that the groups of customers who were looking at the sample when the robot requested assistance had a significantly higher tasting rate (p < 0.05).
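For reference, the test can be run in scipy on the 2 × 2 gaze-by-tasting table; the text accounts for 15 of the 16 groups, so placing the remaining group in the looked-but-did-not-taste cell is an assumption on our part.

```python
from scipy.stats import fisher_exact

#          tasted  did not taste
table = [[7, 1],   # looked at the sample before the request (assumed split)
         [0, 8]]   # did not look at the sample
odds_ratio, p = fisher_exact(table)
print(f"p = {p:.4f}")  # p < 0.05, consistent with the reported result
```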
Therefore, the results indicate that if the robot requests assistance from the salesperson while customers are looking at the sample, and the salesperson then gives the sample to the customers, the customers will taste the sample. It is also possible that if the robot additionally describes the tasting while the customers are looking at the sample (i.e., while they are interested in tasting), the customers may taste it. We also determined that the customer's line of sight is important at the moment the robot requests the salesperson's assistance. We therefore examined the line of sight of the 96 groups of customers to whom the robot suggested tasting a sample during the experiment; Table 8 presents the results.
According to Table 8, 39 of the 42 customer groups (92.9%) who were looking at the robot during its suggestion looked at the sample afterwards, whereas only 8 of the 54 customer groups (14.8%) who were not looking at the robot during the suggestion did so. A chi-squared test on these results indicated that customers who were looking at the robot when it suggested tasting a sample were significantly more likely to look at the sample afterwards than customers who were not looking at the robot (see Figure 11). In addition, all 16 groups who tasted a sample were looking at the robot when it made the suggestion. Therefore, we determined that, to successfully persuade customers to taste the sample, it is important for the robot to make the suggestion while the customers are looking at it.