Monday, 22 June 2015

YMBBT On Sale Postmortem

Firstly: apologies if you're reading this because you lost your place in the queue. Whilst we found and fixed the problem quickly, it wasn't quickly enough to stop a lot of sessions timing out. It was more of a scrum than a queue, and the difference between position 1 and 3000 might only have been a fraction of a second, but we appreciate it's frustrating and disheartening to see a nice low number turn into a horrible big one.

What happened to the queue?

There was a problem with the stored procedure that calculated where you were in the queue. It previously read

UPDATE UserSessions SET SessionLastActivity = GETDATE(), SessionExpires = DATEADD(mi, 5, GETDATE()) WHERE SessionId = @SessionId

SELECT 
SessionId, 
IsQueued, 
(SELECT COUNT(*) FROM UserSessions AS USother WHERE IsQueued = 1 AND USother.SessionStart < USthis.SessionStart) AS AheadInQueue, 
(SELECT SettingValue FROM SettingValues WHERE SettingGroupId = '00000000-0000-0000-0000-000000000000' AND SettingName = 'QueueMessage') AS Message
FROM UserSessions  AS USthis WHERE USthis.SessionId = @SessionId

and we were unable to generate sufficient load to make this go wrong in testing, but with over 10,000 real people hammering it, it started to generate deadlocks. That's when so many people are trying to read and write to the same bit of the database at the same time that it gets stuck. The fix was simple: it now reads

UPDATE UserSessions SET SessionLastActivity = GETDATE(), SessionExpires = DATEADD(mi, 5, GETDATE()) WHERE SessionId = @SessionId

SELECT 
SessionId, 
IsQueued, 
(SELECT COUNT(*) FROM UserSessions AS USother WITH (NOLOCK) WHERE IsQueued = 1 AND USother.SessionStart < USthis.SessionStart) AS AheadInQueue, 
(SELECT SettingValue FROM SettingValues WHERE SettingGroupId = '00000000-0000-0000-0000-000000000000' AND SettingName = 'QueueMessage') AS Message
FROM UserSessions  AS USthis WITH (NOLOCK) WHERE USthis.SessionId = @SessionId

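which stops the queue count from taking shared locks on the UserSessions table (or waiting on rows that other sessions are updating) whilst counting up how many people are ahead of you in the queue. The number it returns can be fractionally out of date as a result, which doesn't matter for a queue position.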

Unfortunately, diagnosing the problem took just over 5 minutes. And SessionExpires, which is updated every time the queue is checked, is set to 5 minutes from now - if someone joins the queue and then decides not to wait and closes their browser, we don't want them hanging around blocking everyone else for ages.

So if the problem had been resolved within 5 minutes, everyone's queue position would have been unchanged - which is what we expected when we told people to refresh. As it was, sessions which hadn't managed to contact the server, and so hadn't updated their expiry time for 5 minutes, expired, which means their place in the queue was lost (and refreshing or not would have made no difference). As soon as we noticed this happening we extended SessionExpires for all the remaining sessions whilst we worked on the problem, but by then we'd already lost a chunk of the first sessions.
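For the curious, an emergency extension of that sort needs only one statement. This is a sketch rather than the exact statement run on the night: the column names come from the procedure above, but the 30 minute figure and the filter on un-expired queued sessions are assumptions for illustration.

```sql
-- Illustrative only: push the expiry of every still-live queued session
-- well into the future so that nobody else drops out while the fix goes in.
-- The 30 minute extension is an assumed value.
UPDATE UserSessions
SET SessionExpires = DATEADD(mi, 30, GETDATE())
WHERE IsQueued = 1
  AND SessionExpires > GETDATE();  -- sessions that have already expired are left alone
```

Note that this can only protect sessions which are still alive when it runs; anything whose SessionExpires had already passed is beyond rescue, which is why the earliest sessions were the ones that lost out.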

Once the queue was fixed, the rest of the system bore the load reasonably well. We peaked at letting 500 simultaneous sessions onto the site - which was unnecessary. Even if the site could have supported everyone who was interested browsing tickets all at once, it would have been a terrible experience; every time you looked at a ticket someone would have snatched it out from under you, and actually getting something in the basket would be luck for a few, and frustration for everyone else. For future on sales with this type of product (lots of small distinct blocks) we think we'll limit the number of simultaneous users to 100 or so; we can handle more, but expect it would feel better if less crowded.

Sold Out?

One of the things that proved difficult on the night - and contentious afterwards - was deciding when to tell everyone the show had sold out. The problem is that some people get past the queue, get tickets in their basket, get as far as the credit card page, and then decide that perhaps they can't afford it after all, so they close their browser and walk away. We give people a bit longer before the session times out once people are on the credit card page, which means that there can be quite a long period during which not all tickets have been sold - so we aren't "sold out" - but no tickets are available, so there's very little point in people sticking around. It's hard to communicate this in 140 characters or less, though. "There are no more tickets available right now but some are stuck in baskets and might become available in an hour or two when they time out, but we can't say for sure" is an accurate but not a very punchy message.

What next?

The main problem - reading the queue position causing deadlocks - is fixed.

We'll leave the site at 100 simultaneous users, so that they can browse the performances and times without someone nabbing the tickets out from under them.

We're going to move the IsQueued status out of the main UserSessions table and into a separate table to join to at the earliest opportunity, and change how we calculate the queue position so that it's easier on the database server.
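To give a flavour of what that might look like, here is a rough sketch. Everything in it is an assumption rather than the final design: the QueuedSessions table, the QueueTicket column and the UNIQUEIDENTIFIER type for SessionId are all hypothetical names and types chosen for illustration.

```sql
-- Hypothetical queue table: one narrow row per waiting session.
-- Rows would be deleted when a session is let onto the site or expires.
CREATE TABLE QueuedSessions (
    QueueTicket BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- handed out in arrival order
    SessionId   UNIQUEIDENTIFIER NOT NULL UNIQUE           -- assumed to match UserSessions.SessionId
);

-- In the stored procedure this would be the existing @SessionId parameter.
DECLARE @SessionId UNIQUEIDENTIFIER = NEWID();

-- Position = number of waiting rows holding a lower ticket.
-- This is a range scan on the primary key and never touches UserSessions.
-- If the session is not in the queue the subquery is NULL and the count is 0.
SELECT COUNT(*) AS AheadInQueue
FROM QueuedSessions AS Qother
WHERE Qother.QueueTicket < (SELECT Qthis.QueueTicket
                            FROM QueuedSessions AS Qthis
                            WHERE Qthis.SessionId = @SessionId);
```

The idea is that the frequent SessionExpires updates and the queue count would then hit different tables, so they should have far less opportunity to trip over each other.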

We are considering not displaying the queue position at all. Not only is it relatively expensive to calculate, it makes people feel worse when things go wrong - or even when things go right. Just being told that you're in a queue, and not knowing whether an internal hiccup has shuffled your position, would be much less upsetting than seeing your number go from 600 to 9000. The very term "Queue" implies an orderly line, but when huge numbers of people arrive at the site at the same moment, it's not really about who got there first; getting a lower queue position is a matter of fractions of a second and pure luck.

Whether we display the queue position or not, we're going to work with YMBBT to rethink the communications, how we give people information on their chances of getting tickets, and perhaps define criteria to trigger messages ("Only N tickets left", "No more tickets are available - some are in baskets", "All tickets actually sold") so we don't have to decide on the fly.
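As a sketch of how such criteria could be pinned down in advance, the three example messages map onto two counts. The variable names, the sample values and the threshold of 50 below are all made up for illustration; the real criteria are still to be agreed.

```sql
-- Hypothetical inputs; in practice these would be counted from the ticket tables.
DECLARE @Available INT = 0;          -- tickets neither sold nor held in a basket
DECLARE @InBaskets INT = 12;         -- tickets held in baskets but not yet paid for
DECLARE @LowStockThreshold INT = 50; -- assumed cut-off for the "Only N tickets left" message

SELECT CASE
    WHEN @Available = 0 AND @InBaskets = 0
        THEN 'All tickets actually sold'
    WHEN @Available = 0
        THEN 'No more tickets are available - some are in baskets'
    WHEN @Available <= @LowStockThreshold
        THEN 'Only ' + CAST(@Available AS VARCHAR(10)) + ' tickets left'
    ELSE NULL  -- plenty left: no special message
END AS QueueMessage;
```

With the sample values above this selects the "some are in baskets" message, which is exactly the awkward in-between state that was so hard to describe on the night.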

We're also going to improve the layout of the queue page: moving the dynamic message up to the top of the page, moving some of the currently hard-coded text ("Hang in there") into the message so we can get rid of it when it's no longer appropriate, and adding the facility to redirect everyone en masse to an arbitrary URL once the tickets have sold out (or nearly sold out).

When an error occurred, the queue position was displayed as "999", which was just an error placeholder and not a real queue position. 999 is far too low for a placeholder, and it's been updated to 999999.

And finally, we will continue to suggest and encourage the use of ticket ballots for heavily over-subscribed shows, avoiding the problems and queues inherent in first-come-first-served on-sale dates.

8 comments:

  1. First if wall, well done for being so open. Much appreciated.
    Secondly, please keep the queue positioning numbers. Without them we have no way of knowing if the sysdterm has crashed, which would result in many more refreshes and reloiads, and fast more hanging sessions.

  2. 'First of all' etc.
    Bloody typos

  3. Would it not just be easier to do away with the tech, have a X minute window for people to register interest (1 or 2 tickets + name / email) and then ballot all entries after? That way it leaves no bad taste from UX (queue position), doesn't generate ridiculousness in pretty much DDOS'ing yourself and means it's fair on all? Not a sexy solution but a fair one.

    Replies
    1. "And finally, we will continue to suggest and encourage the use of ticket ballots for heavily over-subscribed shows, avoiding the problems and queues inherent in first-come-first-served on-sale dates."

  4. Ah interesting and yeah sorry missed that last part but interesting it was dismissed - tbh I think sometimes that adds to the mystique of a obvious sell out show and gathers more press but do wonder what the long term negative consequences are - cheers for the reply :)

  5. A balloting system is great for single-date shows with lots of tickets, but how would it work for something like this where you've got many dates, each with relatively low ticket quantities? You can't create more winners than there are tickets (bad experience), but you have to assume that the winners aren't going to fit neatly into the dates & times available, so you still end up having some kind of free-for-all to get the most popular days followed by a subset of 'winners' who end up not being able to make any of the dates that remain.

    Not trying to poo-poo the pursuit of a fairer system, just genuinely curious.

    Replies
    1. Sorry, didn't see this reply; notifications go to an email account I don't check much.

      Basically, the plan would be, everyone signs up to the list, and then if there are N tickets available, we select N/2 of the signups randomly, and send them an email saying "You have been selected in the ticket ballot; you will have 24 hours to buy tickets starting at midnight tomorrow. Log into the site then and you will be able to choose a performance and buy tickets".

      There won't be 100% uptake on this; some people will miss the window, which is sad for them, but after the 24 hours we can count up the remaining M tickets, and invite M/2 people to come for another round of purchase the day after. And repeat until they're all gone. There's plenty of time, after all.

      Yes, there will still be some competition for the "best" slots, but much less incentive for far more people than there are tickets for to all pile into the site at the same time.
