Clustering on this reinforcement learning approach?

July 15, 2015

I am trying to create an agent that, depending on the state, selects the action that gives the maximum reward.

To keep things simple, I am keeping it to 2 actions and 24 different states.

The states mimic the hours in a day, and the 2 actions are web pages displayed to the user.

I am still trying to figure out how the reward should be given and how the policy should depend on the reward. Is the following plausible?

Values between 0 and 1 represent the probability, with 1 being 100%. The action that should be taken is the one with the highest chance of reward.

A very simple example, all in the same state x:

If the user is shown page 1 (the action) and stays on it, the reward goes to page 1.

x = number of rewards given in this state for page 1 = 1
y = number of rewards given in this state for page 2 = 0

chance of page 1 + chance of page 2 = 1.0
chance of x = x / (x + y) = 1/1 = 1.0
chance of y = y / (x + y) = 0/1 = 0.0

1.0 chance that page 1 is the correct action for this state
0.0 chance that page 2 is the correct action for this state

The user is shown page 1 (the action) because the chance of reward is higher when displaying page 1 in this state. If the user instead navigates to page 2, page 2 gets the reward.

x = number of rewards given in this state for page 1 = 1
y = number of rewards given in this state for page 2 = 1

chance of page 1 + chance of page 2 = 1.0
chance of x = x / (x + y) = 1/2 = 0.5
chance of y = y / (x + y) = 1/2 = 0.5

0.5 chance that page 1 is the correct action for this state
0.5 chance that page 2 is the correct action for this state

If the user is shown page 1 (the action) and stays on page 1, page 1 gets the reward.

x = number of rewards given in this state for page 1 = 2
y = number of rewards given in this state for page 2 = 1

chance of page 1 + chance of page 2 = 1.0
chance of x = x / (x + y) = 2/3
chance of y = y / (x + y) = 1/3

2/3 chance that page 1 is the correct action for this state
1/3 chance that page 2 is the correct action for this state

As you can see, it updates and learns.
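The counting scheme above can be sketched in a few lines of Python. This is only a minimal illustration of the idea as I described it, not a finished agent: the class name `RewardCounter`, the action labels `page1`/`page2`, and the choice to fall back to equal chances when a state has no rewards yet are all assumptions made for the example.

```python
from collections import defaultdict


class RewardCounter:
    """Counts rewards per (state, action) and converts them to chances."""

    def __init__(self, actions):
        self.actions = list(actions)
        # counts[state][action] -> number of rewards observed so far
        self.counts = defaultdict(lambda: {a: 0 for a in self.actions})

    def reward(self, state, action):
        """Record one reward for `action` in `state`."""
        self.counts[state][action] += 1

    def chances(self, state):
        """Return chance = count / total for every action in `state`."""
        counts = self.counts[state]
        total = sum(counts.values())
        if total == 0:
            # Assumption: with no data yet, treat all actions as equally likely.
            return {a: 1.0 / len(self.actions) for a in self.actions}
        return {a: c / total for a, c in counts.items()}

    def best_action(self, state):
        """Pick the action with the highest chance of reward."""
        chances = self.chances(state)
        return max(self.actions, key=lambda a: chances[a])


# Replay the three steps of the example for a single state (hour 7).
counter = RewardCounter(["page1", "page2"])
state = 7
counter.reward(state, "page1")
first = counter.chances(state)    # page1: 1.0, page2: 0.0
counter.reward(state, "page2")
second = counter.chances(state)   # page1: 0.5, page2: 0.5
counter.reward(state, "page1")
third = counter.chances(state)    # page1: 2/3, page2: 1/3
```

Always picking the highest chance means a page that was never rewarded early on may never be shown again, so some form of exploration would probably be needed in practice.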

Clustering

This would work if all days were the same, but we know they aren't. A user might use page 1 in week 1, page 2 the next week, page 1 the week after, and so on. Finding a pattern is needed somehow.

What I am trying to achieve

I have the following input data (state):

{
    location: 'möllevångstorget, 21424, malmö',
    weekday: 'monday',
    time: '07:31'
}

Alternatively:

{
    lat: 55.591538,
    lon: 13.007153,
    timestamp: '2015-03-03 07:31'
}

Or:

{
    lat: 55.591538,
    lon: 13.007153,
    timestamp: 1427864271 // unix epoch time
}

As you can see, I can manipulate the inputs. It's important, though, to include the location and when it occurred.

As mentioned before, finding patterns is what I'm worried about. I wish to predict when the user is going to use the application (the displayed page); a state is created whenever the user uses the application.

Another problem, as you can see: let's say the user uses the application at 07:30 one week, at 07:35 the next, and at 07:32 the third week, all around the same location. The algorithm should be able to determine that the user (the environment) will pick a specific page (the action).

Basically, I am predicting the action the user is going to choose.
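One way to make 07:30, 07:35 and 07:32 count as the same situation is to bucket the timestamp into a coarser state, in line with the 24 hourly states described earlier. The sketch below assumes a hypothetical helper `hour_state`, assumes string timestamps use the `YYYY-MM-DD HH:MM` layout shown above, and assumes epoch values are interpreted as UTC; none of this is settled in my design.

```python
from datetime import datetime, timezone

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def hour_state(timestamp):
    """Map a timestamp to a coarse (weekday, hour) state.

    Accepts either a 'YYYY-MM-DD HH:MM' string or a Unix epoch number.
    """
    if isinstance(timestamp, (int, float)):
        # Assumption: epoch values are read as UTC.
        dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    else:
        dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
    return WEEKDAYS[dt.weekday()], dt.hour


# Three consecutive Mondays with slightly different times of use.
visits = ["2015-03-02 07:30", "2015-03-09 07:35", "2015-03-16 07:32"]
states = [hour_state(v) for v in visits]
same_state = len(set(states)) == 1  # all three land in one bucket
```

Hard hourly buckets still split 07:59 and 08:01 into different states, so this only handles part of the problem. Location could be bucketed in a similar way, for example by rounding the coordinates.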

I don't think you should use clustering. You need to implement function approximation. If you have geo coordinates, reverse geocode them to a country. Use the country and city as state inputs, e.g. your features might end up being: is_america, is_africa, is_middle_east, is_new_york, is_morning, is_afternoon, etc.

If the country and city list grows too big, consider doing this via relational reinforcement learning.
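As a rough sketch of what function approximation over binary features could look like, the Python below keeps one linear weight vector per page and nudges it toward the observed reward. It is just one simple way to do this, not the only one. The feature set, the hour ranges for morning and afternoon, the state keys `hour` and `city`, the class name `LinearValueEstimator`, and the simulated rewards (1 for the page the user ends up on, 0 for the other) are all assumptions chosen for the example.

```python
def features(state):
    """Turn a state dict into a binary feature vector (illustrative features)."""
    hour = state["hour"]
    return [
        1.0,                                        # bias term
        1.0 if 5 <= hour < 12 else 0.0,             # is_morning (assumed range)
        1.0 if 12 <= hour < 18 else 0.0,            # is_afternoon (assumed range)
        1.0 if state["city"] == "malmö" else 0.0,   # is_malmo
    ]


class LinearValueEstimator:
    """Estimates the expected reward of each action as weights . features."""

    def __init__(self, actions, n_features, learning_rate=0.1):
        self.weights = {a: [0.0] * n_features for a in actions}
        self.learning_rate = learning_rate

    def value(self, phi, action):
        return sum(w * x for w, x in zip(self.weights[action], phi))

    def update(self, phi, action, reward):
        """Move the estimate for `action` a small step toward `reward`."""
        error = reward - self.value(phi, action)
        self.weights[action] = [
            w + self.learning_rate * error * x
            for w, x in zip(self.weights[action], phi)
        ]

    def best_action(self, phi):
        return max(self.weights, key=lambda a: self.value(phi, a))


# Simulated user: prefers page1 in the morning and page2 in the afternoon.
estimator = LinearValueEstimator(["page1", "page2"], n_features=4)
morning = features({"hour": 7, "city": "malmö"})
afternoon = features({"hour": 14, "city": "malmö"})
for _ in range(100):
    estimator.update(morning, "page1", 1.0)
    estimator.update(morning, "page2", 0.0)
    estimator.update(afternoon, "page1", 0.0)
    estimator.update(afternoon, "page2", 1.0)

morning_choice = estimator.best_action(morning)
afternoon_choice = estimator.best_action(afternoon)
```

Because 07:30, 07:32 and 07:35 all switch on the same is_morning feature, what is learned at one time carries over to the others without a separate table entry for each.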
