How do Humans Interact with Chatbots?: An Analysis of Transcripts

Chatbots enable machines to emulate human conversation, and recent developments have resulted in many online systems for the public to use. Although a few studies have investigated how humans interact with such programs, we are not aware of any that have analyzed transcripts in depth. In this study, students interacted with two Web-based chatbots, Rose and Mitsuku, for five minutes and evaluated how well they thought the software emulated human conversation. We reviewed the transcripts and found that students used fairly simple language and made many text errors. There were no significant differences between the two systems in our experimental measures, but we found that Rose tended to change the topic more often and Mitsuku seemed more argumentative.

actual women to provide the illusion that there was a large number of females using the site, and thus, increase the number of male customers [17]. In addition, these systems have been used in chat rooms to entice people into revealing personal information, such as bank account numbers.
The use of chatbots is fairly new and still growing, and relatively little research has been conducted into how they are used by people. In the next section, we review prior research and then describe an experiment in which students conversed with two prominent chatbots, Rose and Mitsuku. We analyze the results and discuss future possible studies.

Background

Chatbots
The study of conversational agents is more than 50 years old: the first chatbot, Eliza, was developed in 1966 [29]. At this early stage in artificial intelligence, Eliza did not truly 'understand' the user's conversation but merely rephrased the typed text or generated generic responses in an attempt to appear human. Subsequently, dozens of chatbots were developed, but relatively few are publicly available online. Two of these include:
• Mitsuku (http://www.square-bear.co.uk/mitsuku/chat.htm). Mitsuku is perhaps the best chatbot available online (most adept at emulating a human conversation), as it achieved first place in the 2018 competition for the Loebner Prize, the fourth time it has done so [26,30]. In this replication of the Turing Test [6,21,23,27], four human judges typed messages to two chatbots and a human, without knowing which was which, and scores were assigned to each entity based upon the naturalness of its responses (most human-like). Mitsuku takes the perspective of an 18-year-old girl, and a cartoon image of her appears on the Web page (see Figure 1). Development of Mitsuku began in 2005, and the site claims to have had about 10 million visitors since 2010. As with many chatbots, its knowledge is stored in AIML files, and visitors are allowed to teach the system.

Conversation
Visitors to the sites may type comments and questions, and some allow users to talk rather than type, receiving responses via computer-synthesized speech [31]. Many chatbots answer questions, and even perform simple math. Some, such as Mitsuku, include a limited amount of reasoning, such as being able to correctly answer, "Which is bigger, Chicago or a bread box?" The following conversation with Mitsuku provides an illustration of chatbot use:
Mitsuku: Much the same as it would be for you.

Prior studies
Research has investigated how users converse with these agents by analyzing the transcripts generated [8,19]. In one study [9], 100 human conversations were compared to 100 exchanges with the Cleverbot chatbot. Results showed that people used more, but shorter, messages when communicating with chatbots. People also used a more restricted vocabulary and greater profanity with chatbots. Microsoft discovered obscene and rude text in conversations with its Tay chatbot, and in 2016, the company shut down access to it [20]. Users of these systems are often disinhibited because their identities are hidden, and they may enter unsavory text, perhaps for entertainment or to test the system to see how it will respond [1]. Another study [4] found that 10% of the text used in chatbot conversations contained abusive language, and 11% was related to hardcore sex. As a result, several chatbot websites have admonishments for users to keep the conversation 'clean'.
People might behave differently in other ways when communicating with chatbots. For example, even though some might perceive them as living entities [19], at least one study [28] found that chatbots are perceived more negatively. In addition, people communicating with chatbots might be less open, agreeable, extroverted, and conscientious than when communicating with others [18].

Purpose
As discussed earlier, prior analyses of conversations between humans and chatbots have found that humans type less text when dealing with chatbots than they do when communicating with other humans, and they tend to use more abusive language. In this study, we attempt to replicate these results and to investigate relationships among the variables in more depth [22,25].

Description
A group of 29 students at a business school in the northeast region of the United States exchanged text with Rose as if they were conversing with a friend. Another group of 29 students exchanged comments with Mitsuku. Approximately 34% of the students in the two groups were women. Both groups interacted with the systems for about five minutes, an interval we assumed would be sufficient for an informal conversation. After the sessions, students were asked to evaluate the experience on a scale from 1 (strongly disagree) to 7 (strongly agree): "When communicating with the chatbot, the responses were very natural, similar to what a human would say." An objective reviewer analyzed the transcripts of all the conversations by counting the number of comments, the number of words, the number of spelling and grammatical errors (as determined by the Microsoft Word spell checker), and the Flesch Reading Ease [5] score of the text generated by the students (also determined by MS Word). This score combines average sentence length in words and average word length in syllables to estimate the overall difficulty of the text on a scale from 0 (difficult) to 100 (easy). In addition, the reviewer counted the number of comments she deemed to be rude, insulting, vulgar, or sexually explicit. Finally, the reviewer rated the appropriateness of the chatbots' replies to users' comments on a scale from 1 (nonsensical reply) to 7 (good reply) [10,12,13].
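The Flesch Reading Ease score follows a published formula: 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). The Python sketch below illustrates that calculation. The function names and the vowel-group syllable heuristic are assumptions made for illustration; MS Word, which produced the scores reported in this study, applies its own tokenization and syllable counting, so its values may differ somewhat.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (hypothetical heuristic, not MS Word's method)."""
    lowered = word.lower()
    # Each run of consecutive vowels is treated as one syllable.
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Discount a trailing silent "e" (but keep endings such as "-le").
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    # Short sentences built from one-syllable words score as very easy text.
    print(round(flesch_reading_ease("The cat sat on the mat."), 3))
```

Because the score falls as sentences and words get longer, short chat messages built from simple words tend to score near the top of the scale.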

Experimental Results
Results of the experiment are shown in Table 1. There was a significant negative correlation between the number of comments and reading ease (R = -0.31, p = 0.01) (that is, more comments indicated easier text), but there was no significant correlation between the number of words and reading ease. Thus, many comments with few words were associated with less difficult text, as this is how the reading ease score is calculated. There was also a significant correlation between the number of comments and the number of errors (R = 0.32, p = 0.01) and between the number of words and the number of errors (R = 0.27, p = 0.04). That is, there were more opportunities for errors with more text. Also, the number of inappropriate comments was significantly correlated with text errors (R = 0.31, p = 0.02). However, because there were so few inappropriate comments, this statistic might not be reliable. There was no significant difference between the two groups on any variable.
The students were relatively informal in their conversations with the chatbots, as the average Reading Ease score of 81.2 is equivalent to about a 5th-grade reading level, which differed significantly from a grade-12 or college reading level (Reading Ease = 46) (t = 8.6, p < 0.01). This is not surprising because students often chat with each other through instant messaging using short sentences and simple words.
The students' evaluations of the chatbots' performance were neither good nor bad, as there was no significant difference between the mean rating of 3.9 and the midpoint of 4 on the 1-to-7 scale.
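Correlations of the kind reported above can be computed with the Pearson coefficient. The following Python sketch shows the calculation and the t statistic commonly used to test whether a coefficient differs from zero. The function names are hypothetical, the sample size of 58 assumes that both groups were pooled (the text does not state this), and the p-value lookup from the t distribution is omitted.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


def t_statistic_for_r(r, n):
    """t statistic (n - 2 degrees of freedom) for testing r against zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))


if __name__ == "__main__":
    # Assumed pooled sample of 58 students; R = 0.32 is taken from the text.
    print(t_statistic_for_r(0.32, 58))
```

A larger absolute t value corresponds to a smaller p-value for a given sample size.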
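Comparing a sample mean with a fixed reference value, as done above for the reading ease score (against 46) and the ratings (against 4), is typically handled with a one-sample t test. The Python sketch below computes the test statistic only. The function name and the sample ratings are hypothetical and are not data from this study.

```python
import math
import statistics


def one_sample_t(values, reference):
    """Return (t, degrees of freedom) for a one-sample t test."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("t is undefined when all values are identical")
    standard_error = sd / math.sqrt(n)
    return (mean - reference) / standard_error, n - 1


if __name__ == "__main__":
    # Hypothetical ratings on the 1-to-7 scale, for illustration only.
    ratings = [3, 4, 5, 4, 3, 5, 4, 4]
    t, df = one_sample_t(ratings, 4)
    print(t, df)
```

A t value near zero indicates that the sample mean is close to the reference value relative to the variability in the data.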

Transcript Analysis
A further detailed analysis of the transcripts revealed the following characteristics:
Very few rude comments. Unlike in previous studies, e.g. [4], there were very few comments judged inappropriate by the reviewer (1.2% of all comments as compared to 10% in the other study). Perhaps this was because the students were told that the transcripts would be reviewed later, and thus were more careful of what they wrote.
Many users were very informal and made many errors. As noted earlier, the readability of their text was equivalent to a 5th-grade level rather than college level. In addition, the students made several spelling and grammatical errors, as determined by MS Word. For example, many comments lacked capitalization at the beginning of sentences, and many contractions were missing apostrophes. Some comments with errors include:
thats just the name ive been given
i don't really like it
its kind of slow and boring
Poor transitions and responses. The reviewer gave an overall, neutral score of 3.5 for appropriate responses to the users' comments, with Rose scoring slightly less than Mitsuku. That is, several chatbot comments had little if anything to do with what the user typed and could be considered a change of topic. However, humans also exhibit such behavior in normal conversations. In other cases, the system's reply was not something a human would normally say. For example, the following conversations with Mitsuku seemed odd:
Human: Do you think global warming is real?
Mitsuku: Global warming is when the Earth heats up due to humans causing too much pollution.
Human: So is that a yes?
Mitsuku: It seems we have something in common.
Human: Okay good, you're smart.
It appeared that Rose changed the topic more frequently than Mitsuku, perhaps as a mechanism to avoid replying to a comment and making a mistake.
Knowledge of current events. The chatbots seemed to be abreast of several current topics, as indicated:
Human: What do you think of Donald Trump?
Mitsuku: I don't understand all the fuss about him wanting to build a wall.
Rose: I'm amazed at the price of real estate in London, especially after the real estate crash of 2008. Why is it so nuts?
Rose: I think illegal immigrants should not be entitled to services in America.
Argumentative. While Rose tended to change the topic often, Mitsuku was often argumentative, perhaps too much so:
Human: Not that expensive, but energy is pricey here. Especially gas.

Conclusion
In a review of two leading online chatbots, Rose and Mitsuku, we found that students interacting with the systems communicated informally with many spelling and grammatical errors but were not rude. We also found that Mitsuku was often combative and tended to identify itself as a chatbot whereas Rose pretended to be human and more frequently changed the topic. Both seemed to have some knowledge of current events in order to further the conversations.
A more thorough examination with different, longer transcripts is necessary. In addition, other online chatbots and more human contributors should be included in the evaluation. Because some of the conversation seemed argumentative, perhaps research can be focused on altering the mood of the chatbot (friendly/combative) and how it communicates (formal/informal). Other avenues for research include evaluating how 'intelligent' chatbots are by comparing their answers to those from a sample of humans.