How does Apercite work? Why should I use it instead of implementing my own service?

Managing browsers and screenshots at scale involves a set of challenges that, while not especially hard, can be tricky to get right. Apercite needs to handle a significant amount of traffic, and service abuse, with the smallest possible impact on external users.

Whether or not you wonder why it's not so easy to do, we have compiled a short overview of the services behind Apercite so you can understand what happens when you request a picture (or at least, you can if you wish to).

Architecture

Apercite is made of several different services, each with a single purpose and with redundancy in place. Here are the most important components that make up Apercite.

  • Gateway: A front reverse proxy receives and routes all traffic. This proxy is based on the battle-tested nginx web server and can handle very high loads of traffic, much higher than the average Apercite traffic. The gateway also blocks traffic when people send requests at a much faster rate than a regular user would (yes, there is a significant margin here to make sure we don't block legitimate traffic). If you get an HTTP 429 error, chances are it's just the gateway asking you to slow down (see the client sketch after this list). If you think your usage is legitimate but you still get 429 errors, please contact us with the details.

  • Website: This website is handled separately. Although it's important for our clients to be able to manage their accounts, we consider this part "not critical" (even though we make sure it's reliable), as having the website down would not cause the API to stop working.

  • API Server: The API Server is in charge of serving the pictures. This is one of the most critical parts of the service, and the part that must run the fastest (because thumbnails can't wait!). There are actually 4 replicas of the API Server, and the service uses no database and relies on asynchronous I/O to run lightning fast (a rough sketch of this pattern follows the list). All orders are passed through the Message Queue, and a fast and proven object storage service allows it to retrieve the pictures when it needs them.

  • Janitor: This service is not really critical, although new picture generation would be paused if the Janitor is not available (all pending miniatures will be sent on service restart). The Janitor is in charge of housekeeping, logging, and separating legitimate orders from fraudulent ones that passed our first barriers. We don't disclose much of what we do here, but this is the service that makes sure we keep a good quality of service whatever happens. Amongst other tasks, it implements quotas and rate limits at a finer level than the front gateway does (see the illustrative rate limiter after this list), evaluates URLs, deduplicates requests, ensures we do not spam-crawl websites, and performs a lot more security- and sanity-related tasks that we prefer not to reveal, at least for now.

  • Message Queue: All messages between services are passed through a message queue implemented with RabbitMQ (a minimal producer example follows the list). It ensures that we don't rely on hard dependencies between services, and it enhances the overall resilience of Apercite.

  • Spiders: The actual servers crawling the Web and taking screenshots, using a slightly modified Google Chrome browser (a sketch of this step is shown after the list). This is the CPU- and memory-intensive part, so we run the spiders on a separate server cluster. The actual count of spiders varies, but we currently have a capacity of ~200 screenshot workers at any given time. This number increases if we need to sustain a higher load than usual, and even the current number allows us to handle a load vastly higher than our average (200 workers can crawl more than 1 million pages per day, and they are only triggered if a picture is missing or an update is required). Once a picture is taken, the spider uploads it to our object storage.

  • Data storage: We use different storages for different purposes. Let's just say that all critical storage services are backed up daily by an external service provider to ensure we can handle disaster recovery in case something really bad happens. And of course, we encrypt the data at rest (on the disks) for all sensitive storage services.

  • Monitoring: We extensively monitor all the services described above, both for status (up, down) and system metrics (CPU, load, memory, disk, ...), but also for service metrics (request and response statuses, timings, queues, error rates, failure rates, ...). It's an essential part of providing a good quality of service, and we're dead serious about it. There are a lot of alerts set up that will ring our phones in case anything bad happens, and humans will take over whenever the system goes beyond acceptable limits.

  • External Monitoring: In addition to our internal monitoring tools, we also use two external service providers that hit our endpoints every minute, and you can see an external overview from one of them at https://status.apercite.fr/ (and we can't cheat: those are not our servers). We believe it's essential to rely on a third party for this, both for honesty and for the reliability of the data.
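
Here is what a polite client could look like when facing the gateway's rate limit: a minimal sketch (not an official client) that treats HTTP 429 as a signal to back off and retry. The endpoint URL and retry values are placeholders.

    import time
    import requests

    def fetch_with_backoff(endpoint, max_retries=5, base_delay=1.0):
        """Fetch a picture, backing off when the gateway answers HTTP 429."""
        delay = base_delay
        for attempt in range(max_retries):
            response = requests.get(endpoint, timeout=10)
            if response.status_code != 429:
                response.raise_for_status()
                return response.content  # the picture bytes
            # The gateway asked us to slow down; honour Retry-After when present.
            time.sleep(float(response.headers.get("Retry-After", delay)))
            delay *= 2  # exponential backoff between attempts
        raise RuntimeError(f"still rate-limited after {max_retries} attempts")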
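
For the curious, the pattern the API Server follows looks roughly like the sketch below. It assumes aiohttp; the route, the in-memory stand-ins for the object storage and the queue, and the 202 response are illustrative, not Apercite's real code.

    from aiohttp import web

    PICTURES = {}  # in-memory stand-in for the object storage service
    ORDERS = []    # in-memory stand-in for the message queue

    async def serve_picture(request):
        key = request.match_info["key"]
        picture = PICTURES.get(key)    # in reality: an asynchronous object-storage lookup
        if picture is None:
            ORDERS.append(key)         # in reality: publish a screenshot order on the queue
            return web.Response(status=202, text="picture ordered, retry shortly")
        return web.Response(body=picture, content_type="image/png")

    app = web.Application()
    app.add_routes([web.get("/{key}", serve_picture)])
    # web.run_app(app)  # uncomment to actually start the server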
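
As an illustration of finer-grained rate limiting (and only as an illustration, this is not the Janitor's undisclosed logic), a classic token bucket allows short bursts while capping the sustained request rate:

    import time

    class TokenBucket:
        """Allow short bursts while capping the sustained request rate."""

        def __init__(self, rate, capacity):
            self.rate = rate          # tokens refilled per second
            self.capacity = capacity  # maximum burst size
            self.tokens = capacity
            self.updated = time.monotonic()

        def allow(self):
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    # e.g. one bucket per API key: 5 requests/second sustained, bursts of 20 allowed
    bucket = TokenBucket(rate=5, capacity=20)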
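
Passing an order through RabbitMQ looks roughly like the following minimal producer, using the pika library; the queue name and message format are placeholders, not our internal contract.

    import json
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="orders", durable=True)  # placeholder queue name

    order = {"url": "https://example.com/"}  # placeholder message format
    channel.basic_publish(
        exchange="",
        routing_key="orders",
        body=json.dumps(order),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message to disk
    )
    connection.close()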
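
Finally, the core job of a spider can be sketched with Selenium and headless Chrome. Our actual pipeline uses a slightly modified Chrome and is more involved, but the general idea is: load the page, take the screenshot, hand the picture over.

    from selenium import webdriver

    def take_screenshot(url, output_path):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless=new")
        options.add_argument("--window-size=1280,800")
        driver = webdriver.Chrome(options=options)
        try:
            driver.set_page_load_timeout(30)
            driver.get(url)
            driver.save_screenshot(output_path)  # in production: upload to object storage
        finally:
            driver.quit()

    take_screenshot("https://example.com/", "example.png")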

Infrastructure and Service Scheduling

All services run in Linux containers, using a scheduler called Kubernetes. This means that for the simplest failures, whether during deployment (oops, small bug there...) or because something went wrong (a memory leak eating all the RAM, etc.), the system is able to self-heal by restarting faulty processes that no longer answer our health checks (a sketch of such a health check follows). As most services are redundant, the end user won't see much of this: the load balancers won't send traffic to faulty services until the system has repaired itself.
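
Such a health check is usually just a tiny HTTP endpoint that the scheduler probes; the /healthz path and the checks behind it below are common conventions shown for illustration, not our actual code.

    from aiohttp import web

    async def healthz(request):
        # Real checks would verify dependencies (queue connection, storage, ...).
        return web.Response(text="ok")

    app = web.Application()
    app.add_routes([web.get("/healthz", healthz)])
    # web.run_app(app, port=8080)  # liveness/readiness probes would hit this port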

More?

This is a pretty high-level overview, but it should convince you that it's not so easy to run this seriously (of course, not everybody needs serious systems, but we went for reliability).

Would you like to know more? Let us know what you'd like to know, and we'll enhance this FAQ entry with whatever we can disclose about the system.