Event Daemon Slowness?

Chris_Bennett · October 17, 2019, 11:42pm

I am not sure if we just have to many events registered or something else, but it is taking 40+min from when an event happens to when our event daemon finally triggers it.

I was curious if anyone had any issues with this or had best daemon practices for performance.

TIA

jing.liu · October 18, 2019, 1:17am

Hi Chris,

How many events are registered? What about the performance if you use API to access shotgun server?

Thanks

Loney

philip.scadding · October 18, 2019, 8:08am

Just to add my thoughts as well. It sounds as though your event daemon is running behind, and has a backlog of events to catch up on.

If you have many event plugins or your events are likely to get triggered frequently and respond to large volumes of events, you need to make sure they are written and efficiently as possible.

You should limit the number of calls you make to Shotgun, try to use batch calls where possible.

When not debugging, try to ensure they produce as little logging output as possible.

You can also enable a timer logging on your plugins to see which ones are taking the most time:

github.com

shotgunsoftware/shotgunEvents/blob/master/src/shotgunEventDaemon.conf.example#L46


# - 50 - Critical
logging: 20


# Enable Timing logging
# Timing logging is a separate log file that will log timing information regarding
# event dispatching and processing run time. This is to help diagnose which plugins
# are taking the most amount of time and where any potential queue processing
# delay might be coming from. Valid values are `on` or to enable or anything else
# to disable.
# timing_log: on
timing_log: off


# If the connection to shotgun fails, number of seconds to wait until we retry.
# This allows for occasional network hiccups, server restarts, application maintenance,
# etc.
conn_retry_sleep = 60


# Number of times to retry connection before logging an error level message (which
# sends an email in the default configuration)
max_conn_retries = 5

If you have any plugins that need to do heavy lifting, consider threading them out or sending them as jobs to a farm to process.

Shotgun API calls can be configured as to whether they generate events, by setting it on the Script entity you’re using to authenticate within Shotgun. If your plugins are making changes to the Shotgun site and they are further triggering plugins needlessly then consider perhaps using a script that doesn’t generate events.

This may not apply in your situation, but when I was a using Shotgun in production we had a tool that ingested a load of data into Shotgun, creating up wards of about 300 assets in one go. We had an event daemon plugin that we wanted to respond to some of the changes, but when ever we ingested a large amount of data this would create a backlog. So instead we made it so the tool didn’t generate events automatically, and at the end of ingestion, it manually created a custom event in Shotgun and provided a dictionary of the entities it created via the metadata field. Then our plugin would respond to the single custom event and read the assets out of the meta data and respond accordingly.
You could argue that why didn’t the ingest tool handle the logic rather than deferring it to the event daemon, but the reason was that we had other tools where similar operations could occur and so we opted to have these tools create custom events representing multiple changes and have it processed in one place.

Anyway enough of my rambling !

Chris_Bennett · October 23, 2019, 1:08am

Thanks Phillip and Loney,

Just to provide the info here:

We have about 23 callbacks registered. The shotgun API performance seems pretty quick.

Looking into Philip’s point, the evidence does seem to point to a lagging behind behavior.
Thank you very much for the helpful tips! Will try to implement and post back if we find any interesting notes.

François_Touvet · October 29, 2019, 2:58pm

Hi Philip,

I do not seem to have such a timing_log option in my current configuration, is it something that can be added manually or do I need to update my configuration?

Thanks a lot,
Francois

bouchep · October 29, 2019, 5:41pm

You might want to update from master.

François_Touvet · October 31, 2019, 3:35pm

Yep, I will have to

I would like to have some input about two parameters I noticed in shotgunEventDaemon.conf:
fetch_interval = 5 and
max_event_batch_size = 500

If I get it right, the _mainloop of shotgunEventDaemon :

fetches events with _get_NewEvents, which takes the last processed event as starting point and fetches all events with IDs above this reference,
processes these events with the plugins
checks the amount of processed events and sleeps for fetch_interval seconds if this amount is under max_event_batch_size

So now I’m wondering how changing max_event_batch_size to 300 (for ex) would impact the event handler. It would sleep less, so from my point of view there would be less risk of lag; but I guess there are some reasons not visible from the shotgunEventHandler code to choose these two values of 500 events and 5s. Would it remain stable? would the batch operations be still optimum?

bouchep · November 6, 2019, 3:39am

Hi François,

The max_event_batch_size is a default value for the number of events to fetch which is a compromise between the time it takes for the app to produce results and the time for them to be sent over the wire. You don’t want to fetch too many events that your request blocks one of the app’s threads too long but you don’t want to fetch too few that you’re spending your time waiting for tiny batches of data over your network link’s latency.

This being said, the key parameter will be the fetch_interval. If the last time the daemon loaded data, if it got fewer than max_event_batch_size events, it’ll wait the amount of time prescribed by fetch_interval so increasing this value will process things slightly faster and will increase the probability your next fetch gets fewer than max_event_batch_size.

There’s a third value you can tweak but isn’t exposed in the config as it is much more delicate.

There is a phenomenon by which there can be a gap in the event sequence and sometimes that gap can be filled in later. Gaps aren’t super frequent but are regular, gaps that get filled in are rarer. Even if this later phenomenon is rare, we need to ensure that events are processed linearly otherwise we can introduce bad data. To do this, each missing event is given a time window in which it can reappear. We’ll process all events up to a gap, wait there, if and when the gap fills up we continue until the next gap. This build up what we call a backlog. The amount of time we wait is defined by a constant you can find the source code called BACKLOG_TIMEOUT. If you decrease this, gaps won’t slow down your daemon as much but the probability you miss a gap that fills up will increase. Ultimately, tweaking this value depends on your level of risk aversion.

Sure hope this helps you tweak your daemon!

François_Touvet · November 7, 2019, 2:30pm

Hi Patrick, thanks for your help.

I will do my best not to reach the BACKLOG_TIMEOUT point

Just to be sure: you mean 'decreasing", right?

Francois

François_Touvet · November 19, 2019, 9:30am

Hello,

I need some update here. Sorry for being a bit stubborn.

Francois

bouchep · November 20, 2019, 5:58am

Hi François, apologies for the delay and no offence taken to your pinging this thread.

You’re right. I did make a mistake. My original intention was to write:

The last time the daemon loaded data, if it got fewer than max_event_batch_size events, it’ll wait the amount of time prescribed by fetch_interval so decreasing this value will process things slightly faster and will increase the probability your next fetch gets fewer than max_event_batch_size.

Thanks for noticing that and calling it out (I won’t edit the original post for posterity as the correction is in your message and this one as well).

For anyone curious about how tweaking the fetch_interval value would affect things, here’s an extremely simplified example with some ridiculous numbers for illustrative purposes.

If your site generates 100 events per minute and the max_even_batch_size is 500 with a fetch_interval of 1 minute here is how that would play out.

The site generates 100 events in one minute
The daemon asks for 500 events and gets 100
Because the daemon received less than 500 events it will now wait 1 minute
During this minute, 100 events are generated
And so on… By the end of the 5th minute, you would have processed 500 events.

Let’s have another example with all values being equal but your fetch_interval is 8 minutes.

The site generates 100 events in one minute
The daemon asks for 500 events and gets 100
Because the daemon received less than 500 events it will now wait 8 minutes
During this 8 minutes, 800 events are generated
The daemon asks for 500 events and gets 500.
After processing, the daemon immediately asks for another 500 events and gets 300.
At the end of the 9th minute, you would have processed 900 events.

In both scenarios, you’ll have processed, on average, the same 100 events per minute that are generated by your Shotgun site. In the second scenario with the longer fetch interval, there is a bigger gap where the daemon waits for events to be generated and it has to catch up afterwards by fetching two consecutive batches.

In the first scenario, the daemon waits for shorter periods and never needs to catch up as much as in the second scenario because it can request events and process them almost as fast as they are generated.

It’s a balancing act.

François_Touvet · November 20, 2019, 8:42am

Hi Patrick,

thanks for the detailed explanations, it makes things much more clear !

François