title: Use Python to process 12mil events per minute and still keep it simple
author: Teodor Dima
company: East Vision Systems
url: http://eastvisionsystems.com
twitter: EVSystems
toknow: give thanks for every question; ask at the end if you have answered their question; never say you're having stage fright or that you can't see anybody in the public (nobody cares); if you're stuck, gather yourself, take a peek at the screen, say "what was I talking about? oh yes!" and continue;
class: center, middle
count: false
# Use Python to process 12mil events per minute and still keep it simple

### __Teodor Dima__
Ad tech, big data, video technologies, RTB. A LOT of _events_ processing!

???

- Hi, I'm Teodor Dima and I am a developer at East Vision Systems, a company from Romania.
- We do a lot of things there, but we mainly handle ad tech, big data, video technologies and RTB.
- For those who do not work in this field, this means that we must handle a lot of user events every second, which must be continuously persisted in a database somewhere.

---

- An _event_: a GET request made from a user's browser to the event processing system of the company that supplies the video content.
- The request contains data that must be saved as quickly as possible into a database, usually one that is highly _write_-performant.
- A single user who watches video content releases at least 10 events.
- A lot of users mean a lot of requests from a lot of concurrent clients.
- __c10k__! Yay!

???

- So what is an event? Simply put, an event is a simple GET request made from a user's browser, specifically by a video player, to the event processing system that must analyse the data and save it into a database.
- The amount of data generated by such a system can reach staggering levels.
- A single user generates at least 10 events by opening just one video.
- These events are not sent all at once, but continuously, like a stream of data, at different moments in time.
- So a lot of users mean a lot of requests being made from a lot of concurrent clients.
- This problem has been known long before this talk and it is called c10k: "how to handle 10,000 concurrent connections on the same machine".

---

- During my work at the company we had averages of about 12 million requests per minute, not including the peaks, which had to be recorded in the database.
- 12 million requests per minute means 200,000 requests per second supported by a cluster of servers.
- And this is worse in bigger companies!
- How can you sustain such traffic using as few resources as possible?

???
- During my time at the company we had averages of about 12 million requests per minute, that is, 200,000 requests per second that must be handled by a server cluster.
- The question is: how can you handle such traffic?
- Moreover, how can you handle it with as few resources as possible?

---

### Solutions

Naive: insert every event into the database, without aggregation.

- <2000 requests / second on a single _c4.xlarge_ commodity machine, with a MongoDB cluster;

Alternative: Apache Kafka + Apache Storm + Apache Zookeeper

- more difficult to configure, tune and maintain;
- uses a lot of moving parts (services) and resources (virtual machines);

???

- The naive solution would be to simply handle each event and insert every one of them into the database, without aggregation.
- This works for small amounts of data and various tweaks can be made to improve performance a bit, but it is ultimately unmaintainable and consumes a lot of resources.
- A solution often used is the Apache trio of Kafka, Storm and Zookeeper, or some similar alternative, to build a scalable, high-performance system.
- However, configuring and tuning these moving parts into a coherent whole takes time and a lot of resources, and is often non-Pythonic, although a lot of work has been done in this direction by the people at Parse.ly.

---

## Naive (ship it!)

- Initially, a quick implementation was made which used the simplest approach.
- In order to check data consistency, we built a Python tool which checked the web server access logs and compared the data from there with the database;
- An idea was born: why not use the access log as a queue for the events processing service?

???

- Initially, we had to ship a product which handled streams of events, and we made a simple solution which solved the problem.
- in order to check for data consistency and to make sure that no event that reached the server was dropped, we built a Python tool which checked the web server access logs (in our case, nginx) and compared the data from there with the database.
- this led to a simple idea: what would it be like if we could use the access logs as a queue for the event processing service?

---

## Thinking about architecture

Data flow:

1. Web server (nginx);
2. `access.log`;
3. Python daemon service which consumes events from logs and aggregates them - we affectionately call it "Logbunker";
4. Database (MongoDB);

???

- The idea was that when the requests reach the machine, they are received in our system by the nginx web server, which can handle a lot of concurrent connections and offers a solution to c10k.
- Between the access log and the persistence layer, the database, a service could be built which takes this data and simply pushes it forward.
- We began to think about the implementation details of such a project and whether it would be resilient and feasible enough.
- After some prototypes and new ideas, we came up with a clean structure that could be built quickly with the help of Python, and we named it Logbunker.

---

class: middle, center
#### Processing events - machine software architecture

???

Now, this is a data flow diagram which shows a simplified schema of the data flow in a single machine.

As I said, the HTTP requests are handled by the nginx server, which writes them into the access log.

In order to ensure that the data was easily swappable between the virtual machines, we ended up using the Amazon EBS (Elastic Block Store) service to store the access log data.

Inside the Logbunker component there are actually 3 processes that work at the same time.

---

class: center, middle
#### Processing events - internal flow of data

???

Two of the processes are the parser and the upserter.

The parser reads the events from the access log file while the file is being modified by the web server. It caches the data read for a fixed period of time and then pushes it into a queue - the multiprocessing queue from the Python standard library.

The second process, the upserter, pops the data from the queue and inserts it into the database.

Every time the upserter process inserts data upstream, it also keeps a log of all writes in a special file, specific to this service, which is used in case of a catastrophic failure. We ended up calling this file the 'binlog'.

---

### Admin process

- process manager for the other 2 processes;
- `fsync()` on data files to protect the data in case of a crash;
- status server;
- serves a socket on a configurable port with a simple protocol - a JSON message with status metrics;
- courtesy of the awesome `json` and `socket` libraries;
- status data is transferred between processes with the help of shared memory which is controlled by a `Lock`;

???

The third and final process is the main one.

- it periodically checks whether the other 2 processes are alive
- if they are not, it will shut down the whole service
- so, why does it not restart the process or take any other action that could save the service?
- the possibility of corrupted data
- it also calls fsync() on the data files - the access log and the binlog
- fsync is a function which synchronizes the data from the file's volatile memory cache with the disk - extremely important if you want data persistence
- the last important function is to serve status data on a configurable port; this was done with a simple protocol which responds with the status data;
- this data is collected from the other processes with the help of shared memory
- access to the shared memory is controlled by a `multiprocessing.Lock`

---

## Thinking about obstacles

Will Python be stable in reading from a continuously changing file?

- Yes, making a NIH `tail` is completely feasible in just a few lines of code.
- Offsets are hard to calculate efficiently.
- Using unbuffered text file objects, `file.tell()` is slow.
- It's actually quicker to open the file in byte mode, add the number of bytes read, and then convert the bytes to UTF-8 strings using `.decode("utf-8")`.

???

- one of the first issues we thought about: how stable would the service be in tailing a file, and how easy would it be to implement?
- it turns out it is very easy and completely stable;
- however, the offsets are hard to calculate efficiently, and we needed them because the binlog must contain the offsets of the events that we insert into the database in order to recover in case of a crash
- using buffered text files is almost useless because there is no way to get the offset of specific lines
- unbuffered text files are slow
- it's easier to actually open the file in byte mode, add the number of bytes read and then convert them to strings

---

## Thinking about obstacles

Will the queue between the `parser` and the `upserter` processes be stable enough?

- Yes, it works with a pipe in the background and it never corrupted data.
- The `multiprocessing.Queue` spawns a background thread - the `feeder` - though.
- It needs CPU to transfer the data between processes.
- If the parser is too hungry, it will starve the queue thread.

???

So between the parser and the upserter processes we have a `multiprocessing.Queue` which is used to transfer data one-way.

- the queue uses a pipe in the background and it never corrupted data;
- however, there is a problem with the data transfer speed;
- the queue needs a separate thread which transfers the data from the queue's internal buffer into the pipe to the other side
- if this thread does not hold the GIL, it will starve: even if it has data to transfer, it will not do so;
- there are ways to minimize the damage in this situation, as I will describe in a later slide

---

## Thinking about obstacles

How could a catastrophic crash be handled securely and efficiently?

- There are basically 2 sensitive points:
- data from `access.log` could be cached in memory and not fsync-ed to disk;
- the current position of reading (and upserting) is saved into a special file; this file could also be desynchronized with the disk data;
- Both are managed through periodical `fsync()` calls.
- Basically, the frequency of the fsync calls will affect the performance.

???

So how could a catastrophic crash be handled securely and efficiently?

- if a crash happens, then there are 2 files which can be corrupted or incomplete - the access log and the binlog file;
- in order to protect these files, the service makes periodical fsync() calls on them
- if the machine reboots and restarts the service, then it will be able to remember where it left off with the help of the binlog;

---

## Thinking about performance

Will Python ingest and store events fast enough?

- Yes! With some performance optimizations, CPython 3.4 could ingest about 20k events per second on a `c4.xlarge` AWS virtual machine.
- This included reading from file, request processing and storing in the internal cache.
- Rethinking the data model allowed the reimplementation of the whole request processing routines in __Cython__.
- Cython raised the performance by at least 100%, leading to _40k_ requests processed per second. This was done in a week, excluding testing, without prior Cython experience.
- Cython for the win!

???

So is Python fast enough to ingest and store events at the same pace as nginx?

- Yes, with some performance tweaks, CPython could ingest about 20,000 GET requests per second on a `c4.xlarge` AWS VM;
- this included reading from file, string parsing, data validation, regex matching and the business logic associated with this process, then caching it in RAM;
- reimplementing the request parsing submodule in Cython brought an increase in performance of about 100%;
- this was done in a week, without prior Cython experience, and it was a really pleasant experience, without strange bugs or non-Pythonic code;
- so Cython - way to go!

---

## Thinking about performance

How much will `fsync()` affect the stability of the system?

- Usually, an interval of 1 second between calls is almost unnoticeable.
- However... AWS has file systems that are mounted over the network.
- If the network drops or has lag, the filesystem calls will be delayed. That means the fsync call will block.
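
As a sketch, the periodic-fsync idea might look like this (the helper name and the 1-second interval are illustrative, not the actual Logbunker code):

```python
import os
import time

FSYNC_INTERVAL = 1.0  # seconds; illustrative value, tune per workload


def append_and_sync(path, lines):
    """Append byte lines to a data file, fsync-ing at most once per interval."""
    last_sync = time.monotonic()
    with open(path, "ab") as f:
        for line in lines:
            f.write(line)
            now = time.monotonic()
            if now - last_sync >= FSYNC_INTERVAL:
                f.flush()              # push Python's buffer to the OS
                os.fsync(f.fileno())   # force the OS cache to disk; may block
                last_sync = now
        # final sync so nothing is left only in volatile caches
        f.flush()
        os.fsync(f.fileno())
```

Batching the `fsync()` calls is the trade-off mentioned above: a longer interval means better throughput but a bigger window of data at risk if the machine dies.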
---

## Thinking about performance

How much will the queue between the first and the second process affect performance?

- The `multiprocessing.Queue` between the processes is mostly I/O-bound, while the parser process is mostly CPU-bound.
- The queue does not affect the parser, but the other way around.
- The parser must `sleep()` forcefully, periodically, even if it could gobble up data, to reserve CPU for the other thread.
- A surprising and nasty side-effect of the GIL.

???

- during one of the testing phases we observed strange behaviour with the upserter
- the parser was reading continuously from the access log file, it parsed data perfectly, the cache was full, the queue to the upserter was full, but the number of requests to the database was very low
- what happened was that the upserter did not get the data from the parser at the same throughput that it could have sent it to the database
- the feeder thread, whose job is to push data into the Unix pipe, did not obtain the GIL as fast as it could have because of the CPU-bound nature of the parser thread.
- this could not be fully fixed; a good enough solution was to force the parser to sleep for a fixed amount of time, periodically

---

## Maintenance

A single machine that answers requests contains:

1. the nginx service and its configuration file;
2. the Python daemon (Logbunker) with its own configuration file.

__15__ virtual machines with 4 cores and identical settings serve 12 million requests per minute.

There is no SPOF (Single Point Of Failure).

???

- if a commodity machine were to crash and burn, its negative impact on the whole system would be much smaller than that of, say, a gigantic, 32-core VM;

---

The machines are not throttled to 100%, but at a maximum of __50%__ in a normal situation and at most __75% on peaks__. This reduces the possibility of low availability and hardware failure.
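
The headroom can be sanity-checked with quick arithmetic, taking the ~40k requests/second per-machine figure from the Cython slide as the assumed ceiling:

```python
# Back-of-the-envelope capacity check using the figures from the slides.
total_per_minute = 12_000_000
machines = 15

per_machine_per_second = total_per_minute / 60 / machines
print(round(per_machine_per_second))  # -> 13333 requests/second per machine

# Assumed ceiling: ~40k requests/second per machine (the post-Cython figure).
utilization = per_machine_per_second / 40_000
print(f"{utilization:.0%}")           # -> 33%
```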
Even if the peaks reach 100% of system capacity, the `access.log` event queue ensures that the _events are not lost_.

- the service will continue to parse; it will fall behind the real-time data, but it will catch up once the heavy load is gone
- the same thing applies to the database connection.

---

## Further information

- https://eastvisionsystems.com/production-software-part-ii-good-coding-reduces-clients-bill/
- https://docs.python.org/3.4/library/multiprocessing.html#multiprocessing.Queue
- http://aws.amazon.com/ebs/details/#piops
- http://blog.gocept.com/2013/07/15/reliable-file-updates-with-python/
- http://docs.cython.org/src/tutorial/cython_tutorial.html
- http://www.dabeaz.com/GIL/
- http://www.slashroot.in/nginx-web-server-performance-tuning-how-to-do-it
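
---

## Appendix: tailing a file with byte offsets (sketch)

A minimal illustration of the "NIH `tail`" idea from the obstacles slide: open the log in byte mode, keep the offset as a running byte count, and decode to UTF-8 only afterwards. This is a simplified sketch, not the actual Logbunker parser.

```python
import time


def follow(path, offset=0):
    """Yield (offset_after_line, line) pairs from a growing log file.

    The file is opened in byte mode, so the offset is just a running
    byte count - much cheaper than calling file.tell() in text mode.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):  # partial line or EOF: wait for more
                f.seek(offset)            # rewind past the partial read
                time.sleep(0.1)
                continue
            offset += len(line)
            yield offset, line.decode("utf-8")
```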