1
Recommender Systems and HCI
Francesco Ricci
eCommerce and Tourism Research Laboratory
ITC-irst
Trento – Italy
ricci@itc.it
http://ectrl.itc.it
2
Content
Usability study of 6 recommender systems
–
What are the factors that have impact on user
satisfaction? – go beyond accuracy
Explanation in collaborative filtering
Explanations can change the “performance” of the
algorithm
Usability study in a travel recommender system
–
Multiple decision styles
3
HCI and Recommender
The accuracy of a recommendation (how is it measured?) depends on the
recommendation algorithm
But the effectiveness of a RS is dependent on factors
that go beyond the quality of the algorithm
The ultimate goal of a RS is to introduce users to items
that might interest them and convince users to consider
those items
[Swearingen & Sinha, 2001]
4
Swearingen & Sinha
Have shown that an effective recommender system:
Must inspire trust in the system
Has system logic that is somewhat transparent
Points users towards new, not-yet-experienced items
Provides details about recommended items (e.g.
pictures and community ratings)
Provides ways to refine recommendations (e.g. by
including or excluding particular genres).
5
Experimental Study
A total of 19 people participated in the experiment
Each participant tested either 3 book or 3 movie systems, as well as
evaluating recommendations made by 3 friends
For each of the three book/movie recommender systems (presented in a
random order), users completed the following tasks:
–
(a) Completed online registration process
–
(b) Rated items on each RS in order to get recommendations
–
(c) Reviewed list of recommendations
–
(d) If the initial set of recommendations did not provide anything that
was both new and interesting, users were asked to look at additional
items
–
(e) Completed satisfaction and usability questionnaire for each RS.
After the user had tested and evaluated all three systems, a post-test
interview was conducted.
6
Measures for the Evaluation
Good Recommendations: Percentage of recommended items
that the user liked. Good Recommendations were divided into
the following two subcategories
–
Useful Recommendations were “good” recommendations
that the user had not experienced before
–
Previously Liked Recommendations (Trust-Generating
Recommendations) were “good” recommendations that
the user had already experienced and enjoyed - such
items indexed users’ confidence in the RS
Overall satisfaction with recommendations and with RS
(survey)
Time spent registering and receiving recommendations from
the system.
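The measures above can be illustrated with a minimal sketch. The `Recommendation` record and its two flags are hypothetical names introduced here for illustration; the study itself collected these judgments through questionnaires, not code.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    liked: bool        # the user judged the item "good" (assumed flag)
    experienced: bool  # the user had already experienced the item (assumed flag)


def recommendation_measures(recs):
    """Return the percentage of good, useful and trust-generating items."""
    total = len(recs)
    if total == 0:
        return {"good": 0.0, "useful": 0.0, "trust_generating": 0.0}
    good = [r for r in recs if r.liked]
    useful = [r for r in good if not r.experienced]   # good and new
    trust = [r for r in good if r.experienced]        # good and already known
    return {
        "good": 100.0 * len(good) / total,
        "useful": 100.0 * len(useful) / total,
        "trust_generating": 100.0 * len(trust) / total,
    }


recs = [
    Recommendation(liked=True, experienced=False),
    Recommendation(liked=True, experienced=True),
    Recommendation(liked=False, experienced=False),
    Recommendation(liked=True, experienced=False),
]
measures = recommendation_measures(recs)
print(measures)
```

By construction, the useful and trust-generating percentages add up to the good percentage.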
7
Perceived Usefulness = Overall Satisfaction
Users perceived RS as being useful
Users did not like all RS equally
And this is not because of the recommendation algorithm!
8
Factors that predict usefulness
Correlations between usefulness and other aspects
Good and useful recommendations (accuracy) are important
and strongly correlated with usefulness, BUT …
9
Good and Useful Recommendations
These recommender systems are very similar in terms of
good and useful recommendations, but (as seen before) quite
different in terms of overall satisfaction
There are other factors.
10
Impact of the Time dimension
A moderate increase in the number of ratings required
does not have a strong negative impact
Users appeared to be willing to invest a little more time
and effort if that outcome seemed likely
Users expressed some impatience, but this seemed to be related not
to the number of ratings but to the way the
information was displayed (e.g. many movies on a
screen or no detailed information).
11
Time
12
Trust-Generating Recommendations
Recommendations of items with which the user has previously had a
positive experience correlate with perceived usability
These recommendations are not useful – do not offer new
information – but they index the degree of confidence a
user can feel
13
Unexpected Items
Recommender systems are better than friends in
recommending unexpected items
Recommender systems are useful because they “expand
the horizons”
14
Exploratory Search
[Marchionini, 2006]
15
Information about recommended items
Two versions of rating zone are compared: no
descriptions and descriptions
Description of individual items correlates
positively with both perceived usefulness and
ease of use
16
Interface issues
Interface matters when it gets in the way
Navigation and layout seemed to be the most important
factors
–
Correlate with the ease of use and perceived
usefulness
It is important to invest time in user-testing the
navigational structure of the RS
–
This can have a great impact on the user satisfaction
17
Predicting the degree of liking
Recommending an item and predicting the degree
of liking is not the same
Ratings can make users more critical of the
recommendations (“why such a rating?”)
And if the system recommends items with low or
medium “predicted liking” ratings?
–
The user may be confused about why this item is
recommended
Presenting the degree of liking is a high-risk feature –
the system would need to have a very high
degree of accuracy for users to benefit
18
Reflections
If predicting the degree of liking is not the same as
recommending an item, then why bother about the
MAE (mean absolute error)?
Why should the system be right in predicting the
degree of liking for items that have average (or low)
ratings?
The recommendation algorithm is only a “component”
of the recommender system
Instead of building sophisticated prediction algorithms
we should build sophisticated “user manipulation”
methods?
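As a reminder of what the MAE measures, here is a minimal sketch with made-up ratings. The second call restricts the error to items the system would actually recommend; the threshold of 4 is an assumption chosen only for illustration.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Illustrative data only: predicted vs. actual ratings on a 1-5 scale.
predicted = [4.0, 3.5, 2.0, 5.0]
actual = [5, 3, 2, 4]

mae_all = mean_absolute_error(predicted, actual)

# MAE over the items that would be recommended (predicted rating >= 4).
top = [(p, a) for p, a in zip(predicted, actual) if p >= 4]
mae_top = mean_absolute_error([p for p, _ in top], [a for _, a in top])

print(mae_all, mae_top)
```

The two values can differ noticeably, which is the point of the question raised above: a global MAE also rewards accuracy on items that would never be shown to the user.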
19
System Transparency
Users like to understand what was driving a system’s
recommendation
The reasoning of the RS should be at least somewhat
transparent
Users are confused if all recommendations are
unrelated to the items they rated
20
Conclusion of [Swearingen & Sinha, 2001]
The “goodness” of a recommendation and perceived
usefulness of a RS depends heavily upon the user’s
expectation
–
i.e. the expected range of information that the
recommender can provide
E.g. in this study they found users interested in
–
Reminder recommendations = recommendations for items
that the user had already thought about
–
“more like this” recommendations (e.g. more movies in a
particular genre)
–
New items (e.g. those recently released in a particular
genre)
–
“broaden my horizon” – really new and unexpected items
21
User Tasks
Annotation in Context: GroupLens – suggest which news items are worth reading
[Resnick et al. 1994]
Find Good Items: suggest some items as a ranked list [Shardanand and Maes,
1995]
Find all Good Items: all items satisfying some user needs and wants [Ricci et al,
2002]
Recommend Sequence: recommending a sequence that is pleasing as a whole
[Hayes and Cunningham, 2001] [Aguzzoli et al., 2002]
Recommend a Bundle: suggest a group of products that fits well together [Ricci
et al, 2002]
Just Browsing: users find it pleasant to browse product recommendations
Find Credible Recommender: users try to change the input (e.g. user profile)
to find bias in the recommender algorithms
Improve Profile: users add ratings or other user info to “improve” their profile
Express Self: users feel good contributing to the system performance by adding their
comments or ratings
Help Others or Influence Others.
Extended from [Herlocker et al. 2004]
22
User Tasks from a HCI perspective
The previous list is biased by the technology (what we can
provide)
User tasks and goals should be analyzed case by case
For instance in a Travel Recommender System
–
A place similar
–
Other offers “like this”
–
A quieter place
–
The attraction that I should not miss in Girona
–
The hotel closest to Main Square but not on a high traffic
road …
Different tasks require different evaluation approaches.
23
Explanations in Recommender Systems
Transparency of the reasoning process improves user
satisfaction [Swearingen & Sinha, 2001]
Most of the recommendation technologies shown so far
are far too complex to be explained
Collaborative filtering is mainly used as a black box
If we want to use RS in high risk domains (recommend
a camera - a travel – an investment plan …) we must
add an explanation component.
24
Benefits of explanations
Can build trust between the user and the system
Can increase credibility of the system and confidence in the
recommendations
Can reduce the errors (an explanation makes clear why the
system is making an error)
–
The error may be due to lack of data (e.g. missing ratings
or missing user information or not enough products)
–
The error may be due to the process (wrong similarity
function, or ACF not considering the context in the
prediction)
[Herlocker et al., 2000]
25
Benefits of explanations (2)
Can increase user involvement – this can push the user
to further add her knowledge (ratings) to the system
Can increase the educational role of a recommender –
the information provided becomes source of new
knowledge
Can increase acceptance
26
Case study in Collaborative Filtering
Investigation about the roles of the explanation in
collaborative filtering – 3 questions
What models and techniques are effective in supporting
explanation in an ACF system?
Can explanation facilities increase the acceptance of
automated collaborative filtering systems?
Can explanation facilities increase the filtering
performance of ACF system users?
27
White and Black models of explanation
White box model – the ACF recommendation model is
“simple” – there are three steps
–
User enters ratings
–
ACF locates people with similar interests
–
Neighbors’ ratings are combined to form recommendations
The explanations can be linked to the process/algorithm used
to generate the recommendations – White Model
Black box model – generate explanations independent from
the algorithm that is really used
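The three white-box steps can be sketched as follows. This is a minimal illustration, assuming Pearson correlation as the similarity metric and a similarity-weighted average as the combination rule; the slides do not state which variants the studied system used, and the data and names below are made up.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation over the items that both users have rated."""
    common = [i for i in a if i in b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = sqrt(sum((a[i] - mean_a) ** 2 for i in common)) * sqrt(
        sum((b[i] - mean_b) ** 2 for i in common)
    )
    return num / den if den else 0.0


def predict(ratings, user, item, k=2):
    """Predict a rating and return it together with the neighbors used."""
    # Step 2: locate people with similar interests who rated the item.
    neighbors = []
    for other, other_ratings in ratings.items():
        if other != user and item in other_ratings:
            sim = pearson(ratings[user], other_ratings)
            if sim > 0:
                neighbors.append((sim, other))
    neighbors = sorted(neighbors, reverse=True)[:k]
    if not neighbors:
        return None, []
    # Step 3: combine the neighbors' ratings, weighted by closeness.
    weighted = sum(sim * ratings[other][item] for sim, other in neighbors)
    return weighted / sum(sim for sim, _ in neighbors), neighbors


# Step 1: users enter ratings (illustrative data).
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 5, "B": 3, "C": 4, "D": 4},
    "carol": {"A": 1, "B": 5, "C": 2, "D": 1},
    "dave": {"A": 4, "B": 2, "C": 3, "D": 5},
}
prediction, used = predict(ratings, "alice", "D")
print(prediction, used)
```

Returning the neighbors alongside the prediction is what makes a white-box explanation possible: the interface can show who was used and how close they were.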
28
White model
1) User enters ratings
–
Explain what is the current content of the user
profile (ratings)
–
Explain the ratings that have been used the most
–
Explain that the user has (not) rated enough items
to make the recommendations sufficiently reliable
–
Indicate products that should be rated to improve
the quality of recommendation
29
White model
2) ACF locates people with similar interests
–
Explaining the behavior of the similarity metric
–
Illustrate the concept of “closeness” used by the
similarity metric
–
Illustrate how many neighbors are considered
–
Illustrate the profile (ratings) of the neighbor users
30
White model
3) Neighbors’ ratings are combined to form
recommendations
–
Illustrate the ratings of the neighbors for the target
item
–
Illustrate the distribution of these ratings
–
Show the combination of a neighbor rating and
neighbor closeness
–
Illustrate the method used to combine the ratings of
the neighbors in a single prediction
31
Black box model
Black box model – generate explanations independent
from the algorithm that is really used
–
Explain that the recommender was correct x% of the
time in the past
–
Bring information that has not been used in the
prediction
E.g. show the product reviews collected from
another web site
Or explain how many examples of that item have
been sold in the last month
32
Experimental Study 1
Each user is provided with 21 individual movie
recommendations each with a different explanation
component
The 21 different explanation interfaces all describe the
same movie recommendation (!)
The user was then asked to rate on a 1-7 scale “how
likely they would be to go and see the movie”
The 21 different interfaces were presented in a random
order for each user (to account for learning effects)
33
Results
Past performance is the accuracy of MovieLens in the past
Explanation 5 = “this movie is similar to 4 other movies that you
rated 4 stars or higher”
Explanation 6 – the importance of providing additional content info
Explanation 17 – worked badly!
34
Histogram with grouping: 1st choice
Neighbor ratings
histogram (explanation
3) is similar to this –
one bar for each rating
Histogram with
grouping performs
better than full
histogram because it
reduces the
dimensionality
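The difference between the two histograms can be sketched in a few lines. The grouping used here (1-2 bad, 3 neutral, 4-5 good) and the sample ratings are assumptions for illustration; the slide does not spell out the exact bins.

```python
from collections import Counter

# Assumed mapping from a 1-5 star rating to a coarse group.
GROUPS = {1: "bad", 2: "bad", 3: "neutral", 4: "good", 5: "good"}


def full_histogram(neighbor_ratings):
    """One bar per rating value (five bars)."""
    counts = Counter(neighbor_ratings)
    return {rating: counts.get(rating, 0) for rating in range(1, 6)}


def grouped_histogram(neighbor_ratings):
    """One bar per group (three bars), reducing the dimensionality."""
    counts = Counter(GROUPS[r] for r in neighbor_ratings)
    return {group: counts.get(group, 0) for group in ("bad", "neutral", "good")}


neighbor_ratings = [5, 4, 4, 3, 1, 5, 2, 4]  # illustrative neighbor ratings
full = full_histogram(neighbor_ratings)
grouped = grouped_histogram(neighbor_ratings)
print(full)
print(grouped)
```

With three bars instead of five, the user can see at a glance that most neighbors liked the item.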
35
Table of neighbor ratings: 4th choice
36
Influence of explanation on the user
Prior to the main study, in a small pilot study participants
were interviewed after they took the survey
–
many users perceived each “recommendation” as having
been generated using a different model – which was then
explained
–
Each explanation was changing the user’s internal
conceptual model of how the recommender
computed predictions
In the primary study they attempted to control for this effect
by clearly stating to study participants up front that the model
was going to be the same in each case.
37
Conclusion 1
What models and techniques are effective in supporting
explanation in an ACF system?
–
There are differences in explanation techniques effects
–
Rating histograms seem to be the most compelling ways
–
Other good methods
Indication of past performance
Comparison with similar (highly rated) items
Domain specific content features
38
The other two hypotheses
Hypothesis 1: adding explanation interfaces to an ACF
system will improve the acceptance of that system among
users
Hypothesis 2: adding explanation interfaces to an ACF
system will improve the performance of filtering decisions
made by users of the ACF system.
–
This means that one can measure differences in the
prediction accuracy of the system when using different
explanation capabilities
–
In principle this should not be true
–
Unless the explanation capability can convince you that
the system prediction is correct – and change your true
evaluation
39
New experiment
7 alternative systems are compared
–
2 are: the old system, the old system with aesthetic
changes
–
5 different explanation functionalities – mixtures of
the following two
confidence
Distribution of ratings
40
Procedure
A survey at the beginning and a survey on exit
The subjects were asked to return to MovieLens whenever they saw a
new movie and fill out a mini-survey
1.
Which movie did you see?
2.
Did you go because you thought you would enjoy the movie or
did you go for other reasons (such as other viewers)?
3.
Did you consult MovieLens before going?
4.
If you consulted MovieLens, what did MovieLens predict?
5.
How much did MovieLens influence your decision?
6.
Was the movie worth seeing?
7.
What would you now rate the movie?
–
Questions 4 and 7 were used to compute the accuracy of the prediction
41
Results
210 users (210 standard surveys)
743 mini-surveys
–
In 315 cases the users consulted MovieLens before
seeing the movie
–
In 257 cases MovieLens had some effect on user
decision
–
In 213 of the cases above (83%) – the MovieLens
recommendation was not the sole reason for
choosing a movie
42
Effect on performance
NO statistically significant difference between any two
experimental groups
Hypothesis 2 is rejected
43
Effect on user acceptance
In exit surveys given at the end of the study, users in
non-control groups were asked if they would like to see
the explanation interface they had experienced added
to the main MovieLens interface.
97 experimental subjects filled out the exit survey
86% of these users said that they would like to see
their explanation interface added to the system
Hypothesis 1: adding explanation interfaces to an ACF
system will improve the acceptance of that system
among users
–
Is accepted
44
Dietorecs development and evaluation
Steps in the development process
–
Development of a user decision model
–
Design of the recommendation technologies
–
First prototype design
Iterative design and evaluation (mock up)
key technologies implementation
Prototype management and evaluation
Technology improvement
Final recommender system
[Zins et al., 2004]
[Bauernfeind et al., 2003]
45
Observational study – real travel planning
sessions
N = 200
10% dialogues in travel agents (Berlin)
40% trip planning from catalogues (Berlin)
50% trip planning on the Internet (Vienna)
–
25% TisCover
–
25% AllesReisen.com
46
47
48
Exit survey
Immediately after the trip planning task
Attended computer-interactive interview
Perceptions and reflections about the planning process
Characteristics of the prepared trip (main purpose, travel
budget, experience, organisation)
General travel decision making
49
Coding
Study material: written transcripts, videos, screen clips,
catalogues
15 coders
2 independent observations
matched afterwards
50
Content I
Which trip elements are initially verbalized by the
customer?
Which elements determine the trip at the end of the
planning process?
What drives information delivery: The user request or
information shown by the medium?
Timing of trip elements: earlier/later?
The way of processing: fixed or flexible?
51
Content II
Role of additional travel characteristics: Travel
motivations, travel experience
Technical process characteristics:
–
e.g. length of interview, interrupts, interface problems,
number of alternatives
Additional process characteristics:
–
e.g. decision mode, decision role, involvement
52
Average frequencies in % of respondents
Columns: What? (Start, End), Who? (User, System), When? (Earlier, Later), How? (Fixed, Variable)
Trip elements                           Start      End     User   System  Earlier    Later    Fixed Variable
Activities/facilities                      47       63       59       54       53       18       46       20
Type of transportation                     42       77       51       56       54       22       55       10
Attractions                                45       24       51       45       44       13       38       14
Length of stay                             29       77       60       64       62       22       39       27
Destination: country                       78       94       87       72       93        1       58       21
Destination: community                     19       81       49       79       71       16       13       56
Destination: region                        53       88       77       82       90        4       26       46
Accessibility of the destination           13       48       39       43       30       28       35       13
Geographical area                          71       84       73       62       81        0       70        6
Natural factors                            52       78       59       55       67        9       64        9
Price                                      29       83       79       89       69       29       24       63
Travel party                               60       88       63       59       66       15       70        5
Travel type in general                     40       81       72       59       81        3       60       18
Travel type: All Inclusive                 14       13        8       10        9        2        6        4
Travel type: Independent Traveller         22       43       39       34       43        4       33       10
Travel type: Last Minute                    9       10       16        9       14        3        5       10
Travel type: Low Budget                     5       15       15        9       14        4       14        3
Travel type: Tour operator product         15       33       34       38       42        4       22       17
Travel type: Special Offer                  2        7       14        9        9        8        5       11
Transfer to accommodation                  11       30       19       23        9       24       13       11
Accommodation: equipment                   15       74       47       77       47       40       22       52
Accommodation: pictures                    10       74       54       87       53       36       36       34
Accommodation: category                    25       73       43       81       61       24       18       46
Accommodation: place                       26       81       55       75       45       34       47       21
Accommodation: catering                    28       80       53       85       55       38       35       41
Type of accommodation                      51       89       75       86       78       14       41       37
Time of travel                             45       77       70       70       76       12       40       42
Additional geographic information        n.a.       64       35       30       31       18       22       14
Additional information                   n.a.       42       52       18       25       34       40       11
Get in contact                           n.a.       30       42     n.a.        9       22       35        8
Number of elements                        8.5     17.9     14.6     15.6     14.8      5.0     10.3      6.8
63
How to define decision styles?
53
Six Decision Styles found
DS1: Highly pre-defined users (15%)
DS2: Accommodation-oriented users (18%)
DS3: Recommendation-oriented users (10%)
DS4: Geography-oriented users (18%)
DS5: Price-oriented users (18%)
DS6: The individual traveler (32%)
54
Decision Styles I/II
Name: Highly pre-defined
Decision style characteristics: Many trip attributes pre-defined; natural resources very important
Recommendation/reduction strategy: Let user specify many attributes, maybe phased: first destination, then accommodation and price, then further details

Name: Accommodation-oriented
Decision style characteristics: Highest importance on accommodation; high quality, not price sensitive
Recommendation/reduction strategy: Only broad geographical area, then ask for characteristics of accommodation; list attributes of recommended destinations for comparison

Name: Recommendation-oriented
Decision style characteristics: Few trip attributes pre-defined; affinity for certain travel types
Recommendation/reduction strategy: Come up quickly with pictures, let user ‘feel’ recommendations
55
Decision Styles II/II
Name: Geography-oriented
Decision style characteristics: Clear conception of geographical area and region
Recommendation/reduction strategy: Let user search by map (giving detailed information about the areas clicked); concrete accommodation offers not before village is determined

Name: Price-oriented
Decision style characteristics: Price as most important feature, searching for benefits within a certain price range
Recommendation/reduction strategy: Ask for price range and natural resources sought; begin list from cheapest

Name: Activity-driven traveller
Decision style characteristics: Destination as cue for benefits and activities sought
Recommendation/reduction strategy: Ask for benefits and activities sought; determine travel typology; describe offers in detail
56
Six Decision styles ...
... are not:
–
exhaustive
–
homogeneous in their preferred travel ‘product’
–
easily predictable
however, they:
–
have similar search strategies
–
have specific needs for a specific travel arrangement
–
are prototypes which may be used particularly in the
initial phase of a search/reduction process
57
Common Sequence of a TR Session
Phase 1: Filtering
User: Specification of details according to styles
System: Show # of avail. alternatives; show alternatives (pictures); proceed to ‘specification’; recommend action(s) for relaxing constraints
Alternatives may be shown

Phase 2: Specification
User: Specification of further details; ask for more information
System: Show # of avail. alternatives; show alternatives (pictures); proceed to ‘selection/sorting’; recommend action(s) for relaxing constraints
Alternatives must be shown

Phase 3: Selection/Sorting
User: Ask for more information; browse through ordered list; compare alternatives; get recommendations from others
System: Show # of avail. alternatives; show alternatives (pictures); present recommendations; recommend action(s) for relaxing constraints
Learning from others, products?
58
The Ladder of Intelligence in
Recommendation Systems
59
Other things learned ...
Most of the users do not like a long procedure of
answering questions but want to see things quickly
They become very impatient when they specify their
needs and the system does not contain one single offer
They generally show scepticism because they suppose
that there must be more
–
Important features: Trust, Competence, Usability
–
System implications: Be fast, easy and transparent
60
Recommendations
Facilitate tourist life
–
Take account of different ‘decision styles’
–
Enhance adaptivity, add capability of learning and real-
time personalization
–
Reduce the user’s effort & arouse excitement
–
Avoid eliciting redundant user input
–
Mediate between language levels (consumption goal &
experience oriented versus package production
oriented)
61
Challenges raised by the findings
Different decision styles require complex system design
that can cope with all these variations
–
Design issues are very important
Decision styles are fuzzy concepts, i.e. users never
follow only one decision style
–
Technical approach to switching behaviour between
decision styles is unclear
–
Number and characteristics of decision styles may
change over time
62
GUI Design
63
64
1st GUI Mock-up
2nd GUI Mock-up
65
V0.5 GUI
V1.0 GUI
66
Cognitive Walkthrough and Heuristic Inspection
Applied to a GUI mock-up without functionalities
Qualitative assessment of some user interface design choices
The goal was to detect substantial weaknesses of the user
interface design
exploratory learning while solving the user’s problem
identifying violations of heuristics
Applying guidelines from Nielsen (2000): know the user,
reduce cognitive work, avoid design errors, keep consistency
67
Examples: problems found in the walkthrough
“There is no reason for the link to “recommendation market
place” to appear in the main area (and not on the left like the
other functions)”
“A usability problem can be envisaged for the registration
goal. The achievement of this goal seems a pre-condition for
accessing SA1 and SA2, these choices should be deactivated
unless the user is registered and logged in.”
“Another problem may arise from the fact that most of the
choices are duplicated (for example, “kind of accommodation”
in “advanced travel wish” and “accommodation”) and the
interface does not seem to help the user in keeping the
consistency (P3).”
…
68
Heuristic evaluation
Conducted on the Prototype V0.5 (with partially
implemented functions)
5 Experts (2 Trento, 2 Linz, 1 Vienna)
“evaluated the system functioning, the interface, and
the user-system interaction, according to their
preferred heuristic procedure” by
answering the PUTQ questionnaire
providing a list of comments (including any problem or
error message, improvement suggestions and any other
remarks and observations relevant to the usability of
the system)
69
PUTQ
composed of 100 questions on system interface
structured by eight factors that are relevant to human-
computer interaction
–
compatibility, consistency, flexibility, learnability,
minimal action, minimal memory load, perceptual
limitation, and user guidance
It is possible to compute an index based on the ratings
and put into relation to the possible perfect score
http://www.acm.org/~perlman/question.cgi?form=PUTQ
70
PUTQ - Summary of Results
Factor                Effectiveness          PUTQ Index             Not Applicable 1)
                      Average (Std. Dev.)    Average (Std. Dev.)    Average
Compatibility         5.1 (1.4)              73.5 (18.4)            1.0
Consistency           5.1 (1.7)              70.3 (24.9)            0.8
Flexibility           3.8 (1.1)              41.8 (8.5) 2)          1.1
Learnability          5.4 (1.2)              72.9 (16.9)            0.7
Minimal action        4.9 (1.8)              63.6 (20.4)            1.5
Min. memory load      4.9 (1.1)              67.0 (15.5)            1.5
Perceptual limitat.   5.9 (1.2)              75.6 (15.9)            1.1
User guidance         4.5 (1.1)              30.5 (27.9) 2)         1.5
Total                 4.9 (1.3)              65.0 (18.9)            1.2
Effectiveness scale: 1 = bad, 7 = good
1) Excluded not available
2) Expert 5 was excluded from the analysis because of too many "not applicable" values
71
PUTQ index
A direct way to assess the usability of a system
100 is the maximum – “item” is a question in the survey
Computed for each user (and each factor) then averaged
\[
\text{PUTQ Index} = \frac{\sum_i w_i \times (\text{Score}_i - \text{Penalty}_i)}{\sum_i w_i \times \text{Item}_i \times 7} \times 100
\]
where:
i = the ith item
Score_i = the rating score of item i
Penalty_i = 1, if the item i is applicable but not available (N/A)
          = 0, if the item i is not available
Item_i = 1, if the item i is applicable
       = 0, if the item i is not applicable
w_i = weighting of the importance of item i.
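A minimal sketch of how such an index could be computed is given below. It assumes the index is 100 times the weighted sum of (score minus penalty) over applicable items, divided by the weighted maximum score of 7 per applicable item; the minus sign and the skipping of non-applicable items are assumptions, and the function name and tuple layout are hypothetical.

```python
def putq_index(items):
    """Compute a PUTQ-style index from (score, penalty, applicable, weight) tuples.

    Assumption: non-applicable items (Item_i = 0) contribute nothing to
    either the numerator or the denominator.
    """
    numerator = 0.0
    denominator = 0.0
    for score, penalty, applicable, weight in items:
        if not applicable:
            continue
        numerator += weight * (score - penalty)
        denominator += weight * 7  # 7 is the best possible rating
    if denominator == 0:
        raise ValueError("no applicable items")
    return 100.0 * numerator / denominator


# Illustrative answers of one evaluator: (score, penalty, applicable, weight).
answers = [
    (7, 0, True, 1.0),
    (3.5, 0, True, 1.0),
    (0, 0, False, 1.0),  # not applicable, ignored
]
index = putq_index(answers)
print(index)
```

As the slide states, such an index would be computed per evaluator (and per factor) and then averaged.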
72
Results PUTQ
Compatibility: Expert evaluation indicated a good
compatibility (PUTQ Index = 73.5). Especially coding
and wording were compatible with familiar conventions
Consistency: the experts identified inconsistencies in
displayed symbols and data, feedback and the required
user actions.
–
some of the displayed symbols, data, feedback and
required user actions did not fit in user expectations
and were not clearly understandable
Flexibility: the PUTQ Index of 41.2 is low
73
Results PUTQ (2)
Learnability: the prototype was judged as being easy to learn
(PUTQ Index = 72.9)
Minimal action: (the number of actions required for the user
to perform a task is minimal) the experts suggested that
improvements are still necessary (PUTQ Index = 63.6)
Minimal (long-term) memory load: (assists the user in
learning an interface fast) overall, the minimal memory load
requirements were evaluated quite favorably (PUTQ Index =
67.0)
Perceptual limitations: (consider the limitations of human
perceptual organization capacities) best criterion (PUTQ Index
= 75.6)
User guidance: very low (PUTQ Index of 30.5 is the lowest of
all)
74
Detailed Expert Evaluation
Collected problems and remarks concerning
•
General problems / remarks
•
Start page
•
Navigation (user registration, left menu)
•
Layout and design
•
Travel planning process
•
Recommendation process
•
Results
•
Searching for inspiration
Many problems solved before experimental evaluation
Changes in interface and design (start page and menu bars)
Extension of explanations
Consistency checks
75
Experimental Evaluation by potential Users
Rigorous test of system value under experimental
conditions
Within- and between-subject design: 2 consecutive,
weakly structured travel planning tasks
Testing against a highly developed operative system in
the market (Tiscover)
Testing the performance across variants with different
levels of recommender functionality
76
Two interaction styles
Traditional query form
Single item
recommendation
Recommendation by proposing
Complete bundle recommendation
77
System Variants
DTR-A: Interactive Query Management only (i.e. empty
case base and no recommendation support via smart
sorting or through other means);
DTR-B: Single Item Recommendation with Interactive
Query Management and Ranking based on a
representative case base;
DTR-C: A variant with all the recommendation functions
enabled (SingleItemRecommendation,
BundleRecommendation, SeekingForInspiration).
TISCOVER: a fully operational system
78
Hypotheses
H1 - The recommendation-enhanced system is able to deliver useful
recommendations
–
For DTR-B, the selected item should be nearer to the top of the
displayed result list than for DTR-A
H2 - The recommendation-enhanced system is able to foster the
construction of good travel plans
–
Analyze the differences between the three systems (the Dietorecs
variants and TISCover) on the users’ ratings of the selected items
H3 - The recommendation-enhanced system allows a more efficient
search
–
Users should perform fewer queries, examine fewer pages and
should reduce the search and decision time
H4 - The recommendation-enhanced system heightens user
satisfaction
–
We should find significant differences between DTR-B and DTR-A
on the questionnaire.
79
Experimental procedure
Demographic questionnaire: 5 min
System 1:
– Familiarization: 5 min
– Training: 5 min
– Story + test phase: 30 min
– Satisfaction questionnaire: 5 min
System 2:
– Familiarization: 5 min
– Training: 5 min
– Story + test phase: 30 min
– Satisfaction questionnaire: 5 min
80
Training Task
Imagine you want to search for an accommodation for
two persons in the Zillertal in the price range of 30 to
70 Euro (per person and day)
Please take five minutes to perform this task using the
system
81
First test task
You won a trip to Tyrol, Austria. All transportation necessities will be
arranged according to your travel plans and will not debit your given
travel budget. This travel budget amounts to euro 150 per person per
day. You may allocate this budget to accommodation, events, sports,
cultural activities or anything else you may want to do during this
vacation trip. The budget you did not allocate in advance you will
receive as pocket money for other trip expenses. You may exceed the
total budget if you want to spend additional money on this trip.
Now, it is your task to plan your individual trip on the travel site to
which you were assigned by the tutor. The trip is only restricted to
last at least 7 days and is limited to a maximum of 4 persons
(including yourself) in your travel party. The trip can be taken any
time between May and October 2003. Please, avoid locations that you
have already selected in previous tasks.
Before you start looking for information on the system please
describe in a few sentences the specifics (travel wishes) of the trip
you are going to plan (when, how, travel group, destination,
accommodation, activities, etc.) with the help of this travel
recommender system taking the above-mentioned criteria into
account.
82
Second Planning Task
After having completed the first travel planning task, we would like to
invite you to repeat a quite similar trip preparation task on a second
travel web site. The following restrictions apply to this task:
–
The travel destination is Tyrol, Austria
–
You are already back home from the previously planned trip to
Tyrol
–
Budget handling and travel party conditions are the same as with
the first task
–
Please, avoid the locations that you have already selected in
previous tasks.
Before you start looking for information on the system please
describe in a few sentences the specifics (travel wishes) of the trip
you are going to plan (when, how, travel group, destination,
accommodation, activities, etc.) with the help of this travel
recommender system taking the above-mentioned criteria into
account.
83
Design and sample size
Sequence         Group 1    Group 2    Group 3    Group 4    Group 5    Group 6
First system     TISCover   DTR-A      TISCover   DTR-B      TISCover   DTR-C
Second system    DTR-A      TISCover   DTR-B      TISCover   DTR-C      TISCover
N = 47           10         11         10         10         2          4
84
Socio-demographics
How familiar are you with Tyrol?
Not familiar
30%
Quite familiar 51%
Very familiar
15%
No answer
4%
How often do you inform yourself or purchase travels via the Internet?
[Bar chart: percentage of test persons (axis from 0% to 60%) by frequency (never, once a year, several times a year, once a month, once a week, several times a week), with two series: Travel Information and Travel Purchase]
GENDER   AGE <25    AGE >25    TOTAL
F        26 (56%)   3 (7%)     29 (63%)
M        9 (20%)    8 (17%)    17 (37%)
TOTAL    35 (76%)   11 (24%)   46 (100%)
85
Travel wish specification
before using the system (DTR)
Needs specified before the trial

Essay on Travel Plan        Average   Finished plan: yes (30%)   Finished plan: no (70%)   p-value
Destination: yes            87%       79%                        91%                       n.s.
  detailed by attributes    11%
  number of attributes      1.2       2                          1                         n.s.
Accommodation: yes          94%       79%                        100%                      0.05
  detailed by attributes    89%
  number of attributes      1.8       2.6                        1.5                       0.10
Activities: one             49%       50%                        49%                       n.s.
  two                       40%       43%                        39%                       n.s.
  detailed by attributes    21%
  number of attributes      2.4       1.0                        2.8                       0.10
Those who finished a plan seem to have specified their needs
(destination and accommodation) in more detail before searching
86
Finished Travel Plans
Not found as intended, by element specified

Travel Plan Element     System      Average   Specified: yes   Specified: no   p-value
Destination             DieToRecs   78%       79%              67%             n.s.
                        TISCover    88%       93%              50%             n.s.
Accommodation           DieToRecs   31%       31%              0%              ---
                        TISCover    56%       56%              0%              ---
Activities specified    DieToRecs   49%       50%              48%             n.s.
                        TISCover    50%       50%              0%              n.s.

A further "n.s." is listed once for each element (destination, accommodation, activities).
Finished the travel planning process: 64% TISCover vs. 30% DieToRecs
A large percentage of users were not able to find the destination as
intended (especially if they had specified the elements)
The database (the true content) is very important, with or without
recommendations!
87
H1 - Average Position for Items in the Result List
by DieToRecs Variants
                       DTR-A                 DTR-B
                       Average   Std.Dev.    Average   Std.Dev.    t-test
Items in general       4.3       4.6         2.9       2.8         n.s.
Accommodation items    5.0       0.4         2.2       1.2         n.s.
Destination items      3.9       0.1         2.5       1.3         n.s.
Interest items         4.0       4.8         3.5       3.0         n.s.
Cautious confirmation of H1: Item ratings are substantially better
for DTR-B
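The H1 table gives only means and standard deviations for DTR-A and DTR-B, and every t-test is reported as not significant. As a rough illustration of how a two-sample t statistic follows from such summary figures, here is a minimal sketch of Welch's t. The slide does not say which t-test variant was used, and it does not give the number of items per variant, so `HYPOTHETICAL_N` is a placeholder and not a value from the study.

```python
import math

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and approximate degrees of freedom
    (Welch-Satterthwaite) computed from summary statistics."""
    var_a = sd_a ** 2 / n_a   # squared standard error of mean A
    var_b = sd_b ** 2 / n_b   # squared standard error of mean B
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df

# "Items in general" row: DTR-A 4.3 (SD 4.6) vs. DTR-B 2.9 (SD 2.8).
# The sample size is NOT given on the slide; 20 is a placeholder.
HYPOTHETICAL_N = 20
t, df = welch_t(4.3, 4.6, HYPOTHETICAL_N, 2.9, 2.8, HYPOTHETICAL_N)
print(f"t = {t:.2f}, df = {df:.1f}")
```

With this placeholder sample size the statistic stays well below the usual two-sided 5% critical value of about 2, which fits the picture of large standard deviations making the difference in average position hard to establish.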
88
H2 - Item ratings by DieToRecs variants
Travel Plan Element   Average   DTR-A   DTR-B   DTR-C   p-value
Finished plans        30%       10%     30%     100%    0.001
Ratings
  Destination         4.0       2.8     4.5     5.3     0.10
  Accommodation       4.1       4.1     3.6     5.9     0.15
  Activities          4.2       3.2     4.9     7.0     0.05

Significant differences marked on the slide (listed after each row):
  Destination: 0.10
  Accommodation: 0.01, 0.05
  Activities: 0.1, 0.01, 0.001
Note: “1”: very dissatisfied, “7”: very satisfied
=> Ratings on the selected products are better the
more recommendation functions the variants have
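The "Average" column for finished plans can be cross-checked against the per-variant figures. The sketch below assumes that the average is a participant-weighted mean and takes the group sizes from the "Design and sample size" slide; the slides do not state how the average was actually computed.

```python
# Participants per DieToRecs variant, summed from the "Design and sample size"
# slide (groups 1+2 used DTR-A, groups 3+4 DTR-B, groups 5+6 DTR-C).
participants = {"DTR-A": 10 + 11, "DTR-B": 10 + 10, "DTR-C": 2 + 4}

# Share of finished travel plans per variant, as reported in the H2 table.
finished_rate = {"DTR-A": 0.10, "DTR-B": 0.30, "DTR-C": 1.00}

def weighted_average(rates, weights):
    """Mean of the rates, weighted by the number of participants."""
    total = sum(weights.values())
    return sum(rates[k] * weights[k] for k in rates) / total

overall = weighted_average(finished_rate, participants)
print(f"Overall finished plans: {overall:.0%}")
```

Under this assumption the weighted mean agrees with the 30% in the Average column. It also shows that the 100% completion rate for DTR-C rests on only 6 participants.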
89
PSSUQ
Questions
1) I liked using the interface of the system.
2) The organization of information on the system's screen was clear.
3) The interface of this system was pleasant.
4) This system has all the functions and capabilities that I expect it to
5) The information retrieved by the system was effective in helping complete the tasks.
6) The products listed by the system as a reply to my request were
7) I found the “recommend travel” function useful. (Dietorecs GR only)
8) I found the “seeking for inspiration” function useful. (Dietorecs GR only)
9) It was simple to use this system.
10) It was easy to find the information I needed
11) The information (such as online-help, on-screen messages, and
12) Overall, this system was easy to use.
13) It was easy to learn to use the system.
14) There is too much information to read before I can use the system
15) The information provided for the system was easy to understand.
16) I felt comfortable using this system
17) I enjoyed constructing my travel plans through this system.
18) Overall, I am satisfied with this system.
19) I was able to complete the tasks quickly using this system.
20) I could not complete the tasks in the preset time frame.
21) I believe I could become productive quickly using this system.
22) The system was able to convince me about the goodness of the
23) From my current experience with the system, I think I would use it
24) Whenever I made a mistake using the system, I could recover
25) The system gave error messages that clearly told me how to fix

Column headings: Additional Questions, Design / Layout, Functionality, Satisfaction, Outcome / Future Use, Errors / System Reliability, Ease of Use, Learnability
[Each question is marked with an X under one heading; the column positions of the marks are not preserved]
90
Usability and Satisfaction Evaluation
[Figure: diagram relating Ease-of-use/Learnability, Effectiveness/Outcome and Reliability to User/System Satisfaction. Coefficients in the order listed on the slide: DTR 0.30, TIS 0.37; DTR 0.73, TIS 0.61; DTR n.s., TIS n.s.]
91
H4 - Average Usability and Satisfaction Scores
                        TISCover Ø   DTR Ø   DTR-A   DTR-B   DTR-C
User Satisfaction       3.2          4.6     5.2     4.5     3.3
Ease-of-use             2.8          3.6     3.9     3.5     3.1
Effectiveness/Outcome   3.4          4.6     4.9     4.6     3.4
Reliability             3.5          3.7     4.0     3.4     3.7
Note: “1”: strongly agree, “7”: strongly disagree
Smaller numbers are better
=> H4 confirmed: the more recommendation-enhanced the variant,
the better the user satisfaction
DTR Ø is the average over all DTR users (variants A, B and C)
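H4 predicts that scores improve as recommendation functions are added from DTR-A to DTR-C. A small sketch that checks this ordering using only the averages in the table; it compares raw means and says nothing about statistical significance, and the function name is illustrative.

```python
# Average scores from the H4 table (1 = strongly agree, 7 = strongly disagree,
# so smaller is better), ordered DTR-A, DTR-B, DTR-C.
scores = {
    "User Satisfaction": (5.2, 4.5, 3.3),
    "Ease-of-use": (3.9, 3.5, 3.1),
    "Effectiveness/Outcome": (4.9, 4.6, 3.4),
    "Reliability": (4.0, 3.4, 3.7),
}

def improves_with_each_variant(values):
    """True if the score strictly decreases (improves) from A to B to C."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))

trend = {dim: improves_with_each_variant(vals) for dim, vals in scores.items()}
for dim, ok in trend.items():
    print(f"{dim}: {'improves A -> B -> C' if ok else 'no consistent trend'}")
```

On these averages, satisfaction, ease-of-use and effectiveness improve with each variant, while reliability does not follow the pattern (DTR-B at 3.4 scores better than DTR-C at 3.7).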
92
Conclusion
Differences in subjective evaluations between a system
without ranking support (DTR-A) and with ranking (DTR-B)
are substantial
A comparison between the DTR-C variant (recommending full
travel plans) and the baseline system TISCover showed
almost no performance difference
Complex products like tourism destinations challenge the
evaluation procedures
Performance evaluations should be run within an environment
as realistic as possible
No adequate usability and satisfaction instruments are available
User satisfaction is expected to be higher once the GUI and
navigation facilities have been improved
93
Conclusion II
Evaluating recommender systems entails a higher level
of sophistication
– For experts
– In user modelling
– For experimental tasks
– For evaluation instruments
– For logging data procedures
94
Recommendation Evaluation
[Figure: diagram with the labels: recommendation, predicted rating, eval, pre-consumption user rating, accept, reject, post-consumption user rating]
95
Recommendation Evaluation
There are two goals of the recommender system:
1) to have a large acceptance rate:
– the user must accept the recommendation and buy the product
– the user must evaluate the suggested item as useful
– the user must trust the recommender
2) the post-consumption rating must be high:
– the user must be really satisfied with the product
96
Impact of the recommender
The recommender system (and the predicted rating) may
have an impact on
– the accept/reject decision
– the pre-consumption rating
The recommender system has NO impact on the post-consumption
rating
The system MUST correctly predict the post-consumption
rating
But at the same time it must convince the user to accept a
recommendation, i.e., it must raise the pre-consumption rating
These two goals may conflict (e.g. it is easy to convince
someone to buy a blockbuster movie, but it is not easy to
guess whether the user will really like it).
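The two goals can be measured separately. The sketch below uses a hypothetical interaction log to compute an acceptance rate and the error of the predicted rating against the post-consumption rating. The record layout, the 1 to 7 rating scale and the sample values are assumptions made for illustration; none of them come from the study.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Outcome:
    """One logged recommendation (hypothetical layout, ratings on a 1-7 scale)."""
    predicted: float               # rating predicted by the system
    accepted: bool                 # did the user accept the recommendation?
    post_rating: Optional[float]   # post-consumption rating, known only if accepted

def acceptance_rate(log):
    """Goal 1: fraction of recommendations the user accepted."""
    return sum(o.accepted for o in log) / len(log)

def post_consumption_mae(log):
    """Goal 2: mean absolute error of the predicted rating against the
    post-consumption rating, over accepted items that were rated."""
    rated = [o for o in log if o.accepted and o.post_rating is not None]
    return sum(abs(o.predicted - o.post_rating) for o in rated) / len(rated)

# Made-up log, for illustration only.
log = [
    Outcome(6.5, True, 4.0),
    Outcome(6.0, True, 6.0),
    Outcome(5.0, False, None),
    Outcome(6.8, True, 3.5),
]
print(f"acceptance rate: {acceptance_rate(log):.2f}")
print(f"post-consumption MAE: {post_consumption_mae(log):.2f}")
```

In this made-up log, high predicted ratings go together with a high acceptance rate but also with a large prediction error, which is the kind of conflict the slide describes.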
97
Conclusions
Recommender systems are more than a recommendation
algorithm
The success of a recommender system is due to HCI factors
Usability is a major issue
Explanation of the recommendations plays an important role
in user satisfaction
Recommender systems should support multiple user tasks
Recommender systems should support tasks with multiple
interaction styles (decision styles)
98
Questions
How can we define an evaluation metric that takes into account
the trust that the system may generate?
Think about this statement: “Recommending an item and
predicting the degree of liking is not the same”. How does
this impact a recommendation algorithm?
Is a “white model” approach feasible for a hybrid
recommender system?
In the design of a recommender system, is it better to focus
on acceptance of the recommendation or on the post-consumption
rating?