This is more technical than my previous posts. In terms of Castro itself, the app is doing great. Last week we again hit new user highs for 2024, presumably spurred by the pumpkin icon. The app is how users experience Castro, and it needs the most improvement, so that takes up 90% of our time, effort, and attention. We'll have more on the client soon, but that other 10% is also very important and is our subject here.
The Problem
You can broadly think of Castro’s backend as two parts: the endpoints Castro interacts with when you use the app, and the workers we use to update your podcast feeds. The workers actually do many things, but ~99% of their clock cycles are spent checking podcast feeds and updating them. This is not directly user facing, so in theory it's not a prime candidate for optimization. Most update jobs don't change anything at all, and when a feed does change, a few hundred milliseconds to update our database is hardly noticeable to the average user.
But in aggregate these jobs add up, and I've noticed that as we've tacked on various new checks and features to the worker jobs, their execution time has crept upward from ~1 second to ~1.3-1.4 seconds. Given my knowledge of what’s actually happening, even 1 second seems a bit too long. I want to improve this number, but I don’t want to spend much time on it, because the marginal worker is pretty cheap and we have many other things to focus on.
So what are these jobs doing and how can we make it better?
The Setup
I inherited the system more than I designed it, but as far as I understand the Castro backend is fairly typical. We run a Ruby on Rails app with a Postgres database, and we use Sidekiq to update our podcast feeds. There are many types of workers, but a podcast update job does 3 things:
- Look up some info about a podcast in the database
- Query the podcast URL
- Most of the time, that’s it; we’re done. Other times we have to record some data about the run, and less often we have to make some database writes for new episodes or updated podcast metadata.1 (A rough sketch of the job shape follows below.)
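For orientation, here is a minimal sketch of what a job with that shape might look like. None of these names (FeedUpdateJob, feed_changed?, upsert_episodes, and so on) are Castro's actual code; they're stand-ins for the steps above.

```ruby
# Hypothetical sketch of the job shape described above (not Castro's real code).
class FeedUpdateJob
  include Sidekiq::Job

  def perform(podcast_id)
    podcast = Podcast.find(podcast_id)        # 1. look up some info about the podcast
    response = Faraday.get(podcast.feed_url)  # 2. query the podcast URL

    return unless feed_changed?(podcast, response)  # 3. most of the time, we're done

    podcast.update!(checked_at: Time.current)       # sometimes: record data about the run
    upsert_episodes(podcast, response.body)         # less often: new episodes / metadata writes
  end
end
```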
So it’s a lazy afternoon and I decide I’m going to make this better. But I’m not going to refactor anything or make any large changes. Instead I’m going to give myself an hour or two, poke at things a little bit, and abandon my efforts if they prove fruitless.
I don’t know ruby2 or rails well. I’m much more comfortable with statically typed languages with meaningful function names, but I worked on performance optimizations for a living at one time so I have some relevant experience. The biggest thing I've learned working on performance is that 80% of the gain is going to come from <20% of the effort. For the vast majority of software systems, open source or otherwise, nobody is monitoring what every line of code in production is doing, so just looking carefully at things with fresh eyes will usually yield something that can be improved.
Debugging
The first thing I need to do is google how to profile ruby in production. Rbspy quickly comes up as the best tool for the job, and indeed it proves incredibly helpful. As I said, almost all the worker time is spent updating feeds, and we're doing things the lazy way. So I don't bother isolating a specific job or any other setup. I just run rbspy for a few minutes in production to see what those workers are actually doing:
Roughly speaking we can separate the resulting graph into four distinct parts:
- 18% sidekiq overhead / redis calls (far left)
- 8% rails / active record overhead
- 49% network request (large block in center)
- 25% parsing feeds, database updates (bottom right)
On the surface I’m not sure this is bad. I didn’t have strong priors. Redis overhead seems high, so we'll check that in a moment. Most of our network requests are very quick, but it's also the thing we're doing the most and all requests are to random third parties. 49% could be reasonable, let's check what's actually happening in there.
- ~18% in the request block itself (well yeah that checks out, gotta start the request and wait)
- ~10% here (shrug, idk, seems legit)
- ~13% in a function called set_default_paths. Hmmm. What is that doing?
If the reader clicked that last link, they know just as much as I do, but from a cursory glance at the code it seems like it's just setting up the trust store for the request. Mind you, this is 13% of all Castro's worker time. I took several traces to ensure it was representative (we are trying to be lazy but not stupid; going down a rabbit hole based on an outlier would waste even more time).
Trust stores should generally be fine to reuse across requests, at least for our purposes. I guess what is happening is that every part of the network stack is being torn down and reconstructed on every request. If you were just making an occasional network request, that might not matter very much. But since we’re doing this tens of millions of times per day, the setup cost adds up. From the perspective of a client engineer, this is an unexpected source of performance issues: when writing a client-side http library, reusing heavy resources that don’t change would be an obvious thing to do. (e.g. Here is OkHttp setting it up once for the whole client.)
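To make that concrete, here's a tiny illustration (mine, not lifted from Net::HTTP's internals) of the difference between building the trust store per request and building it once per process:

```ruby
require "openssl"

# Per request: a fresh store, and set_default_paths re-reads the system
# CA bundle from disk every single time (roughly what the trace suggests).
store = OpenSSL::X509::Store.new
store.set_default_paths

# Once per process: build it a single time and hand the same store to every
# connection, the way a client-side HTTP library would typically do it.
SHARED_CERT_STORE = OpenSSL::X509::Store.new.tap(&:set_default_paths)
```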
Improving the network request
We can do better.
I could set up a trust store in advance and reuse it. Net::HTTP looks like it’s checking for an existing file in that line of code, so there must be a config option. But do I really want to be creating trust stores? I do not want to start looking at OpenSSL API documentation. Think lazier.
Wait. Shouldn’t the networking stack just handle this? Maybe there's a simple way to turn on resource pooling. Luckily others have done actual work on this topic, so we can just breeze through some blog posts. We stand on the shoulders of giants, and late-2017 WeWork was extremely helpful, so I hope everything went well for them over the next couple of years.
Anyway the upshot after thinking about this:
- This is reasonably well known and documented behavior; it's just really not what we want in this case
- The persistent gem would probably solve our problem. Even though "persistent" refers to reusing connections to the same server, which is very much not what we want, presumably they're also pooling resources better.
- Swapping out the underlying http client sounds like a scary change, but we are lucky that all the worker code is written against Faraday and it actually is a fairly small implementation detail
- If we have to swap out the http client anyway... the more I read about this http stack the less I like it.
- Maybe we just take WeWork's advice above and use Typhoeus.
I add Typhoeus to the Gemfile and it’s just one line of code to swap out the adapter.
```ruby
Faraday.new do |builder|
  # ...
  builder.adapter Faraday.default_adapter
end
```

Becomes

```ruby
Faraday.new do |builder|
  # ...
  builder.adapter :typhoeus
end
```
I'm not kidding, that was the whole change. The tests pass. After deploying to a test worker and making sure everything works, I give it a production workload for an hour. While it’s running, I look into a few small issues in the flamegraph and add some better Redis connection pooling to cut down on some of that initial 18% above. I also disable the Typhoeus http cache (add cache: false to the above code snippet) as I notice cache setup showing up on new traces, and we have custom cache handling outside the http layer anyway. Test everything more, deploy all this to production, and let it sit overnight.
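For reference, the adapter line ends up looking roughly like this; treat the exact option placement as my assumption about how the adapter forwards options to Typhoeus.

```ruby
Faraday.new do |builder|
  # ...
  # Skip Typhoeus's cache setup; we handle caching outside the http layer.
  builder.adapter :typhoeus, cache: false
end
```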
That’s maybe 2 hours of work. Large improvement, we’re back to ~1 second, which was my whole goal anyway. I can go back to working on Castro’s client. Mission Accomplished.
Or is it?
The next day I just can’t resist taking a few more production traces.
The network request has gone from 49% of our time to 19%. But I’m pretty shocked to find that with the improved networking speed, 20% of worker time is now spent in the active record connection pool. That can’t be right, what is happening? Taking more traces reveals this was actually an outlier on the low side; most traces spend 25-30% of all time waiting on an active record connection.
I verify we're following the advice here. We have a very standard Sidekiq setup, with 8 threads per worker, and each of those threads should have a dedicated connection. What are these threads all waiting on? I could add more connections to the pool, but why would I need to? I feel like I'm missing something more fundamental.
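One cheap sanity check here (a debugging aid I'd reach for, not something from the original traces) is to ask the pool directly from a console on a worker:

```ruby
# ConnectionPool#stat reports the pool's current usage: size, number of
# connections, how many are busy/idle, how many threads are waiting, and
# the checkout timeout.
ActiveRecord::Base.connection_pool.stat
```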
I'm basically treating ActiveRecord as a black box, but of course it isn't. The right thing to do might be to read more blog posts, crawl through github issues, read the active record source code, and figure out why a connection wouldn't be freed. (Perhaps if you've worked with AR, you've already guessed the solution.) But let's just try a couple things first.
Maybe ActiveRecord is not very good at closing the connection after a job runs? What if we try to clear them proactively? Google for the right API (clear_all_active_connections!), make sure it only affects the current thread, and add it after each run.
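Concretely, idea #1 looked something like this sketch, reusing the hypothetical job from earlier; the exact API name differs a bit across Rails versions, so take the call shown here as approximate.

```ruby
class FeedUpdateJob
  include Sidekiq::Job

  def perform(podcast_id)
    update_feed(podcast_id)
  ensure
    # Idea #1: after every run, proactively return any connections held by
    # the current thread to the pool instead of waiting for them to be reaped.
    ActiveRecord::Base.connection_handler.clear_active_connections!
  end
end
```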
Run that on some representative data on a test worker.
Nope. If anything this is worse. Let's take a look at the job code above. Where would active record be holding an unused connection? ... Ah, when we're querying the 3rd party podcast server, we don't need to do anything with our database, and we don't know how long that will take. Seems like we might want to release the connection before querying. Acquiring a new connection afterward will have overhead but it's not going to outweigh 20-30% of all server time. (I asked ChatGPT if that's a good idea and it said no, but this still seems like a good idea to me.)
Idea #2
We try this:
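A sketch of the change, again with made-up helper names rather than Castro's real code; the key line is release_connection, which hands this thread's connection back to the pool before the slow third-party request, and Active Record lazily checks one out again on the next query.

```ruby
def perform(podcast_id)
  podcast = Podcast.find(podcast_id)  # everything we need from the database, up front

  # Idea #2: don't sit on a connection while waiting on a random third-party server.
  ActiveRecord::Base.connection_pool.release_connection

  response = Faraday.get(podcast.feed_url)  # the slow, unpredictable part

  return unless feed_changed?(podcast, response)

  # The next query transparently checks a connection back out of the pool.
  record_run_and_upsert_episodes(podcast, response)
end
```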
Run some tests locally and deploy to a worker for testing. I don’t even need a flamegraph to know this was a good change, as I notice the number of jobs we’re completing is significantly higher.
Jackpot. Acquiring a connection effectively disappears from the trace. (The impact was much larger than I’d anticipated. I should've tried optimizing this sooner.) It seems like the issue is that ActiveRecord doesn't always reuse the same connection within a thread and doesn't free connections very quickly, so releasing them proactively makes a huge difference. Of course, I am not the only person to notice this.
We've just cut Castro's entire backend workload in half in two sessions of debugging and what amounts to two lines of code. Further tweaking got the average down to ~0.5 seconds. In theory we can update every feed much more frequently, and indeed we're already doing that. You might have noticed over the past week or two. We freed up so much worker capacity that we can't use it all yet, as I'm not sure the rest of the system could handle all the load, meaning we'll run fewer workers going forward and save on server costs as well.
Can we do better?
- Almost certainly. I never even looked at the feed parsing or database write portions of the graph. Naively, those should take the most time, and I’m fairly sure there would be some low-hanging fruit if we went looking for it.
- Not all feeds or feed jobs are the same. If you dig into this data a little bit in a non-lazy way, we have only improved the easy case. But most jobs are quick, and freeing up the fast case just overwhelms everything else.
- Truly optimizing this would require isolating certain types of jobs and really digging into what individual runs are doing. We have enough throughput now that the juice isn’t worth the squeeze compared to everything else we have to do, but rest assured we'll continue improving this as needed.
Focus on Impact
The graphic shows less popular feeds that have only a few subscribers, so it's the worst-case scenario. (The bump was us slowing things down to ensure nothing broke before we ramped things up in production.)
Historically, there have been complaints about Castro feeds sometimes falling days or hours behind. This does not happen anymore3 and hasn't for a long time. Today, every active4 feed is updated on our server every 10-12 minutes, which is an improvement from ~20 minutes before last week. At peak times this number may slip a bit but honestly not much and we're getting better there as well.
Hope you enjoy the faster feed updates!