Congratulations on the launch! I've been following fly.io ever since I stumbled on it 2 years ago.
A few questions, if I may:
> We run a mesh WireGuard network for backhaul, so in-flight data is encrypted all the way into a user application. This is the same kind of network infrastructure the good content delivery networks use.
Does this mean the backhaul is private and doesn't tunnel through the public internet?
> fly.io is really a way to run Docker images on servers in different cities and a global router to connect users to the nearest available instance.
I use Cloudflare Workers, and I find that at times they load-balance traffic away from the nearest location [0][1] to a location halfway around the world, adding up to 8x the usual latency, which we'd rather avoid. I understand the point of not running an app in all locations, especially for low-traffic or cold apps, but do you also "load-balance" traffic away to data centers with higher capacity? If so, is there documentation on this? I'm asking because, for my use case, I'd rather have the app run in the next-nearest location than in the least-loaded location.
> The router terminates TLS when necessary and then hands the connection off to the best available Firecracker VM, which is frequently in a different city.
Frequently? Are these server-routers running in more locations than data centers that run apps?
Out of curiosity, are these server-routers eBPF-based or DPDK or...?
> Networking took us a lot of time to get right.
Interesting, and if you're okay sharing more-- is it the anycast setup and routing that took time, or figuring out networking for the apps/containers?
Hey, I'm the tech lead of Workers. I don't want to intrude too much on this thread, but just wanted to say: we don't do any special load-balancing for Workers requests; they are treated the same as any other Cloudflare request.

We use Anycast routing (where all our datacenters advertise the same IP addresses), which has a lot of benefits, but occasionally produces weird routes. Often this relates to specific ISPs having unusual routing logic that, for whatever reason, doesn't choose the shortest route. We put a lot of effort into tracking these down and fixing them (if the ISP is willing to cooperate).

We do sometimes re-route a fraction of traffic away from an overloaded datacenter by having it stop advertising some IPs, but if the internet is working as it should, that traffic should end up going to the next-closest datacenter, not around the world.

When you see requests going around the world, feel free to file a support request and tell us about your ISP so we can try to track down the problem and fix it.
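To make that last part concrete, here's a toy model of the failover behavior (datacenter names and distances are made up, and real BGP path selection is more involved than "closest wins"):

    # Toy model of anycast failover (illustrative only, not real BGP):
    # every datacenter advertises the same prefix; when one withdraws it,
    # traffic simply lands on the next-closest datacenter still advertising.
    datacenters = {
        "ORD": {"distance_km": 300, "advertising": True},
        "IAD": {"distance_km": 950, "advertising": True},
        "NRT": {"distance_km": 10100, "advertising": True},
    }

    def pick_datacenter(dcs):
        """Pick the closest datacenter still advertising the prefix."""
        candidates = [(v["distance_km"], k) for k, v in dcs.items() if v["advertising"]]
        return min(candidates)[1]

    print(pick_datacenter(datacenters))        # ORD (closest)
    datacenters["ORD"]["advertising"] = False  # ORD sheds load by withdrawing
    print(pick_datacenter(datacenters))        # IAD (next-closest), not NRT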
> Does this mean the backhaul is private and doesn't tunnel through the public internet?
Backhaul runs only through the encrypted tunnel. The WireGuard connection itself _can_ go over the public internet, but the data within the tunnel is encrypted and never exposed.
> I use Cloudflare Workers, and I find that at times they load-balance traffic away from the nearest location [0][1] to a location halfway around the world, adding up to 8x the usual latency, which we'd rather avoid. I understand the point of not running an app in all locations, especially for low-traffic or cold apps, but do you also "load-balance" traffic away to data centers with higher capacity?
This is actually a few different problems. Anycast can be confusing, and sometimes you'll see weird internet routes; we've seen people from Michigan get routed to Tokyo for some reason. This is especially bad when you have hundreds of locations announcing an IP block.
Server capacity is a slightly different issue. We put apps where we see the most "users" (based on connection volumes). If we get a spike that fills up a region and we can't put your app there, we'll put it in the next-nearest region, which I think is what you want!
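A rough sketch of that placement logic, with made-up region names, distances, and capacities (not our actual scheduler, which weighs more signals than this):

    # Hypothetical "next-nearest region with capacity" placement.
    regions = [
        # (name, distance from user in km, free VM slots)
        ("sea", 20, 0),      # nearest, but full after a traffic spike
        ("sjc", 1100, 12),
        ("ord", 2800, 40),
    ]

    def place_vm(regions):
        """Choose the closest region that still has capacity."""
        for name, _distance, free_slots in sorted(regions, key=lambda r: r[1]):
            if free_slots > 0:
                return name
        raise RuntimeError("no capacity anywhere")

    print(place_vm(regions))  # "sjc": next-nearest, not the least-loaded ("ord")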
CDNs are notorious for forcing traffic to their cheapest locations, which they can do because they're pretty opaque. We probably couldn't get away with that even if we wanted to.
> Frequently? Are these server-routers running in more locations than data centers that run apps?
We run routers + apps in all the regions we're in, but it's somewhat common to see apps with VMs in, say, 3 regions. This happens when they don't get enough traffic to run in every region (based on the scaling settings), or occasionally when they have _so much_ traffic in a few regions all their VMs get migrated there.
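As a sketch, the scaling decision looks something like this (region codes, connection counts, and the threshold are all made up):

    # Illustrative only: run VMs just in regions whose recent connection
    # volume clears the app's scaling threshold.
    connections_per_region = {"ord": 9500, "ams": 4200, "syd": 3100, "gru": 90, "jnb": 40}
    MIN_CONNECTIONS = 1000  # stand-in for a per-app scaling setting

    active = sorted(r for r, conns in connections_per_region.items()
                    if conns >= MIN_CONNECTIONS)
    print(active)  # ['ams', 'ord', 'syd'] -- an app with VMs in 3 regions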
> Interesting, and if you're okay sharing more-- is it the anycast setup and routing that took time, or figuring out networking for the apps/containers?
Anycast was a giant pain to get working right, then WireGuard + backhaul were tricky (we use a tool called autowire to maintain WireGuard settings across all the servers). The actual container networking was pretty simple since we started with IPv6. When you have more IP addresses than atoms in the universe you can be a little inefficient with them. :)
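To give a feel for how roomy IPv6 is, an illustrative snippet (the fdaa prefix and the /112-per-app split are made up for the example, not our real layout):

    # Illustrative only: hand every app its own /112 (65,536 addresses)
    # out of a private /48 and you still have 2**64 blocks to give away.
    import ipaddress

    network = ipaddress.ip_network("fdaa:0:1::/48")  # made-up private prefix
    per_app = network.subnets(new_prefix=112)

    for _ in range(3):
        print(next(per_app))
    # fdaa:0:1::/112
    # fdaa:0:1::1:0/112
    # fdaa:0:1::2:0/112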
(Also I owe you an email, I will absolutely respond to you and I'm sorry it's taken so long)
Any chance you have more details on GP's question about the tech basis of the router (eBPF, DPDK)? I didn't find this component among the OSS in the superfly org.
Doh, missed that. We're not doing eBPF; it's just userland TCP proxying right now. This will likely change. Right now it's fast enough, but as we get bigger I think we'll have more time to really tighten up some of this stuff.
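If you're curious what userland TCP proxying amounts to, here's a minimal asyncio sketch (toy port and backend address; a real edge proxy would also terminate TLS and pick a VM per connection, per the post):

    # Minimal userland TCP proxy sketch (illustrative only): accept a
    # connection, dial a backend, and pump bytes in both directions.
    import asyncio

    BACKEND = ("127.0.0.1", 8080)  # stand-in for "the best available VM"

    async def pump(reader, writer):
        try:
            while data := await reader.read(65536):
                writer.write(data)
                await writer.drain()
        finally:
            writer.close()

    async def handle(client_r, client_w):
        backend_r, backend_w = await asyncio.open_connection(*BACKEND)
        # Copy client->backend and backend->client concurrently.
        await asyncio.gather(pump(client_r, backend_w), pump(backend_r, client_w))

    async def main():
        server = await asyncio.start_server(handle, "0.0.0.0", 9000)
        async with server:
            await server.serve_forever()

    asyncio.run(main())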
Thanks a lot.
[0] https://community.cloudflare.com/t/caveat-emptor-code-runs-i...
[1] https://cloudflare-test.judge.sh/