Delivery Delays and Lost Deliveries

IMPORTANT: Upcoming change that could cause some webhooks to fail because of timeouts.

Hi everyone,

Some users have recently reported unexplained delays in the triggering of their webhook events, or even lost events.

While digging into these issues, we found that the maximum time allowed for each trigger (currently 6 seconds) was not being enforced. This allowed webhook triggers to take much longer than expected to complete. Under load, this can slow down webhook deliveries on other sites.

Next Monday we will release a fix that restores the time limit check. This could cause some existing webhooks to fail on some events (if the webhook endpoint takes more than 6 seconds to return a response to Shotgun).
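
If your endpoint does slow work, one common way to stay under the limit is to acknowledge the delivery right away and do the heavy processing in the background. Here is a minimal sketch of that pattern. The names `handle_delivery`, `slow_processing`, and `work_queue` are hypothetical and not part of any Shotgun API, and the in-memory queue is only an illustration (a durable queue would be safer in production):

```python
import queue
import threading

# Hypothetical in-process queue; work held here is lost if the process restarts.
work_queue = queue.Queue()
processed = []  # stands in for whatever side effects the real processing has
errors = []     # failures are recorded instead of crashing the worker


def slow_processing(body: bytes) -> None:
    """Placeholder for work that might take longer than the time limit."""
    processed.append(body)


def handle_delivery(body: bytes) -> int:
    """Called by your HTTP layer for each delivery.

    Only enqueues the payload, so the status code can be returned
    well within the 6-second limit.
    """
    work_queue.put(body)
    return 200


def worker() -> None:
    """Drains the queue in the background, one delivery at a time."""
    while True:
        body = work_queue.get()
        try:
            slow_processing(body)
        except Exception as exc:  # keep the worker alive on failures
            errors.append(exc)
        finally:
            work_queue.task_done()


threading.Thread(target=worker, daemon=True).start()
```

The HTTP handler itself would just call `handle_delivery(request_body)` and return the resulting status code.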

We are also rolling out a fix for lost deliveries. So, starting today, no delivery should ever be lost.
Please do let us know if you see missing deliveries after this date.
(Note: We built the system so no delivery would ever be lost. The tradeoff for this is that it is possible, in extreme cases, that the same delivery is sent twice to a webhook.)

Regards,

Stéphane

PS: We are still investigating an issue which can also cause events to be delayed. We are working on getting this fixed asap and will provide an update here when the fix is ready.

5 Likes

Hi everyone, just a quick update here to let you know what’s going on…
First, we delayed the plan to enforce the maximum of 6 seconds allowed to process a webhook trigger. This is because we’re folding in a fix for deliveries tagged as “failed” even if they had not reached the webhook endpoint (because of network hiccups) and a fix for random delays that we noticed.
We’ve been testing this and things look pretty good, so the plan is now to roll out the new stack next Monday (July 6th).

Note that, although this should not occur very often, as part of this change you may see a few deliveries that get repeated. We had to make the tradeoff between deliveries possibly not reaching the webhook endpoint and repeated ones so we chose the latter.

Regards,

-Stéphane

2 Likes

Hi everyone!
This was rolled out a few minutes ago… The 6-second limit is now enforced on webhook triggers and there should not be any missing deliveries anymore.

Please do let us know if you see any issues!

Thank you!

3 Likes

Hi everyone!
New updates on delays and lost deliveries…
About a week ago we fixed an issue that could cause events to get stuck in the pipeline. This occurred a few weeks ago and caused some events to stall for more than a day (ouch!). This happened because the issue flew under our radar.
So, we did the following:

  1. Fixed the underlying issue.
  2. Added a “canary” system that tells us right away if events are stuck in the pipeline.
  3. Monitored the change.

From what we can tell, events do not get stuck in the pipeline anymore. Please do let us know if you encounter any issue like this in the future.
(Specifically, @Stephen I know you hit this one. Things should be back to normal now…)
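
For anyone who wants a similar check on their own side, here is a rough sketch of the general "canary" idea: flag anything that has been waiting longer than a threshold. This is only an illustration and not a description of our actual system. The `stuck_events` function, the `pending` mapping, and the 300-second default are all hypothetical:

```python
def stuck_events(pending: dict, now: float, max_age_seconds: float = 300.0) -> list:
    """Return the IDs of events that have been pending for too long.

    `pending` maps an event ID to the timestamp (in seconds) at which
    the event entered the pipeline.
    """
    return sorted(
        event_id
        for event_id, enqueued_at in pending.items()
        if now - enqueued_at > max_age_seconds
    )
```

Running a check like this on a schedule and alerting whenever it returns a non-empty list is one way to notice stalled events quickly.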

1 Like

Hi,

It seems our webhook environment outputs this error quite often. The webhook event triggers twice in most cases (only rarely does this not happen, and our other webhook functions do not output the errors).
This causes a real problem: the exact same Slack notification message is sent twice for a single webhook event. Can this be fixed?

Hi @kishikawa_takanori,
This should not happen very often. Definitely not on a regular basis.
Can you give me the name of your site and webhook, and also an example of a delivery that occurred twice?

Hi @daigles,

I sent a direct message. Please check the info.