The plan is to further improve our queue (which stood up to the load once the stored procedure was fixed) rather than cut over to a third-party standalone solution, but here's a useful and much more detailed breakdown of the difficulty of announcing a show as "Sold Out":
https://www.linkedin.com/pulse/ticketing-onsales-can-you-sell-out-2-minutes-learn-why-sodemann
from the CEO of https://queue-it.com/
Tuesday, 23 June 2015
Monday, 22 June 2015
YMBBT On Sale Postmortem
Firstly: apologies if you're reading this because you lost your place in the queue. Whilst we found and fixed the problem quickly, it wasn't quickly enough to stop a lot of sessions timing out. It was more of a scrum than a queue, and the difference between position 1 and 3000 might only have been a fraction of a second, but we appreciate it's frustrating and disheartening to see a nice low number turn into a horrible big one.
What happened to the queue?
There was a problem with the stored procedure that calculated where you were in the queue. It previously read
UPDATE UserSessions SET SessionLastActivity = GETDATE(), SessionExpires = DATEADD(mi, 5, GETDATE()) WHERE SessionId = @SessionId
SELECT
SessionId,
IsQueued,
(SELECT COUNT(*) FROM UserSessions AS USother WHERE IsQueued = 1 AND USother.SessionStart < USthis.SessionStart) AS AheadInQueue,
(SELECT SettingValue FROM SettingValues WHERE SettingGroupId = '00000000-0000-0000-0000-000000000000' AND SettingName = 'QueueMessage') AS Message
FROM UserSessions AS USthis WHERE USthis.SessionId = @SessionId
and we were unable to generate sufficient load to make this go wrong in testing, but with over 10,000 real people hammering it, it started to generate deadlocks. That's when so many people are trying to read and write to the same bit of the database at the same time that it gets stuck. The fix was simple: it now reads
UPDATE UserSessions SET SessionLastActivity = GETDATE(), SessionExpires = DATEADD(mi, 5, GETDATE()) WHERE SessionId = @SessionId
SELECT
SessionId,
IsQueued,
(SELECT COUNT(*) FROM UserSessions AS USother WITH (NOLOCK) WHERE IsQueued = 1 AND USother.SessionStart < USthis.SessionStart) AS AheadInQueue,
(SELECT SettingValue FROM SettingValues WHERE SettingGroupId = '00000000-0000-0000-0000-000000000000' AND SettingName = 'QueueMessage') AS Message
FROM UserSessions AS USthis WITH (NOLOCK) WHERE USthis.SessionId = @SessionId
which avoids locking the UserSessions table whilst counting up how many people are ahead of you in the queue.
Unfortunately, diagnosing the problem took just over 5 minutes, and SessionExpires, which is updated every time the queue is checked, is set to 5 minutes from now: if someone joins the queue and then decides not to wait and closes their browser, we don't want them hanging around blocking everyone else for ages.
So if the problem had been resolved within 5 minutes, everyone's queue position would have been unchanged - which is what we expected when we told people to refresh. As it was, sessions which hadn't managed to contact the server, and so hadn't updated their expiry time for 5 minutes, expired, which means their place in the queue was lost (and refreshing or not would have made no difference). As soon as we noticed this happening we extended SessionExpires for all the remaining sessions whilst we worked on the problem, but by then we'd already lost a chunk of the first sessions.
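To make the timing concrete, here is a minimal Python sketch of the expiry rule described above. It is an illustration only, not the production code: the function name, the session ids and the on-sale time are all made up, and the only thing taken from the stored procedure is the 5-minute window.

```python
from datetime import datetime, timedelta

# Mirrors DATEADD(mi, 5, GETDATE()) in the stored procedure.
SESSION_TTL = timedelta(minutes=5)


def surviving_sessions(last_contact, now, ttl=SESSION_TTL):
    """Return ids of sessions whose expiry (last contact + ttl) is still in the future."""
    return [sid for sid, seen in last_contact.items() if seen + ttl > now]


# Hypothetical on-sale time, used only to anchor the example.
start = datetime(2015, 6, 22, 19, 0, 0)

# Session "a" last reached the server just as the deadlocks began;
# session "b" got one request through two minutes into the outage.
last_contact = {"a": start, "b": start + timedelta(minutes=2)}

after_4_min = surviving_sessions(last_contact, start + timedelta(minutes=4))
after_6_min = surviving_sessions(last_contact, start + timedelta(minutes=6))
```

With an outage shorter than the window both sessions keep their place; once it runs past 5 minutes, the session that could not reach the server at all is the one that drops out.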
Once the queue was fixed, the rest of the system bore the load reasonably well. We peaked at letting 500 simultaneous sessions onto the site - which was unnecessary. Even if the site could have supported everyone who was interested browsing tickets all at once, it would have been a terrible experience; every time you looked at a ticket someone would have snatched it out from under you, and actually getting something in the basket would be luck for a few, and frustration for everyone else. For future on sales with this type of product (lots of small distinct blocks) we think we'll limit the number of simultaneous users to 100 or so; we can handle more, but expect it would feel better if less crowded.
Sold Out?
One of the things that proved difficult on the night - and contentious afterwards - was deciding when to tell everyone the show had sold out. The problem is that some people get past the queue, get tickets in their basket, get as far as the credit card page, and then decide that perhaps they can't afford it after all, so they close their browser and walk away. We give people a bit longer before the session times out once people are on the credit card page, which means that there can be quite a long period during which not all tickets have been sold - so we aren't "sold out" - but no tickets are available, so there's very little point in people sticking around. It's hard to communicate this in 140 characters or less, though. "There are no more tickets available right now but some are stuck in baskets and might become available in an hour or two when they time out, but we can't say for sure" is an accurate but not a very punchy message.
What next?
The main problem - reading the queue position causing deadlocks - is fixed. We'll leave the site at 100 simultaneous users, so that they can browse the performances and times without someone nabbing the tickets out from under them.
At the earliest opportunity, we're going to move the IsQueued status out of the main UserSessions table and into a separate table that we join to, and change how we calculate the queue position so that it's easier on the database server.
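One possible shape for a cheaper position calculation, sketched in Python as an in-memory model rather than as the schema we will actually build: queued sessions live in their own structure, each gets an ever-increasing ticket number on joining, and the position is a binary search rather than a count over every session. The class and method names are hypothetical.

```python
import bisect
import itertools


class WaitingQueue:
    """Holds only queued sessions, kept apart from the main session store."""

    def __init__(self):
        self._next_ticket = itertools.count(1)
        self._ticket_by_session = {}
        self._tickets = []  # stays sorted because tickets only ever increase

    def join(self, session_id):
        """Give the session the next ticket number and add it to the queue."""
        ticket = next(self._next_ticket)
        self._ticket_by_session[session_id] = ticket
        self._tickets.append(ticket)
        return ticket

    def leave(self, session_id):
        """Remove a session that was admitted to the site or has expired."""
        ticket = self._ticket_by_session.pop(session_id)
        del self._tickets[bisect.bisect_left(self._tickets, ticket)]

    def ahead_of(self, session_id):
        """Number of queued sessions holding a lower ticket than this one."""
        ticket = self._ticket_by_session[session_id]
        return bisect.bisect_left(self._tickets, ticket)


queue = WaitingQueue()
for sid in ("a", "b", "c"):
    queue.join(sid)
ahead_before = queue.ahead_of("c")
queue.leave("a")
ahead_after = queue.ahead_of("c")
```

The point of the sketch is only that the lookup touches the queued sessions and nothing else; whether the real version ends up as a narrow table with an increasing key or something different is still to be decided.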
We are considering not displaying the queue position at all. Not only is it relatively expensive to calculate, it makes people feel worse when things go wrong - or even when things go right. Just being told that you're in a queue, and not knowing if an internal hiccup shuffles your position, would be much less upsetting than seeing your number go from 600 to 9000. The very term "Queue" implies an orderly line, but when huge numbers of people arrive at the site at the same moment, it's not really about who got there first; getting a lower queue position is a matter of fractions of a second and pure luck.
Whether we display the queue position or not, we're going to work with YMBBT to rethink the communications, how we give people information on their chances of getting tickets, and perhaps define criteria to trigger messages ("Only N tickets left", "No more tickets are available - some are in baskets", "All tickets actually sold") so we don't have to decide on the fly.
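As a sketch of what such criteria could look like, the function below picks one of the three messages quoted above from two counts: tickets still free and tickets sitting in baskets. The function name, its parameters and the low-stock threshold of 50 are assumptions for illustration; the real criteria have yet to be agreed.

```python
def queue_message(available, in_baskets, low_stock_threshold=50):
    """Pick a status message from ticket counts, or None to keep the default text."""
    if available > low_stock_threshold:
        return None  # plenty left: nothing special to announce
    if available > 0:
        noun = "ticket" if available == 1 else "tickets"
        return f"Only {available} {noun} left"
    if in_baskets > 0:
        return "No more tickets are available - some are in baskets"
    return "All tickets actually sold"


plenty = queue_message(available=400, in_baskets=120)
running_low = queue_message(available=12, in_baskets=80)
in_limbo = queue_message(available=0, in_baskets=35)
sold_out = queue_message(available=0, in_baskets=0)
```

Agreeing the thresholds ahead of time means the wording changes as the counts change, rather than someone having to make the call mid-sale.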
We're also going to improve the layout of the queue page: moving the dynamic message up to the top of the page, moving some of the currently hard-coded text ("Hang in there") into the message so we can get rid of it when it's no longer appropriate, and adding the facility to redirect everyone en masse to an arbitrary URL once the tickets have sold out (or nearly sold out).
When an error occurred, the queue position was displayed as "999", which was just an error placeholder and not a real queue position. 999 is far too low for a placeholder, and it's been updated to 999999.
And finally, we will continue to suggest and encourage the use of ticket ballots for heavily over-subscribed shows, avoiding the problems and queues inherent in first-come-first-served on-sale dates.