Scalable SNMP polling
One of the major challenges when building a modern network monitoring
system is the need to constantly collect vast amounts of SNMP data for
statistical and alerting purposes. A scalable SNMP poller is one of the
key requirements for a monitoring system to be successful.
SNMP is a fairly simple protocol: the monitoring application sends a
request to a remote device for a specific piece of data (e.g. interface transmit
octets) and waits for a response. Because of factors such as network latency and the CPU
load of the remote device, it may take some time for the device to respond,
so getting through the poller workload in a timely fashion can be a
massive challenge in large network infrastructures.
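To make that request / response pattern concrete, here is a minimal sketch of a
single synchronous SNMP GET transaction, timed end to end. It assumes the Python
pysnmp library (the 4.x hlapi interface) is available; the device address and
community string are placeholders.

    import time
    from pysnmp.hlapi import (
        SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
        ObjectType, ObjectIdentity, getCmd,
    )

    start = time.monotonic()
    error_indication, error_status, error_index, var_binds = next(
        getCmd(
            SnmpEngine(),
            CommunityData('public', mpModel=1),                      # SNMPv2c
            UdpTransportTarget(('192.0.2.1', 161), timeout=2, retries=1),
            ContextData(),
            # ifHCOutOctets.1 (interface transmit octets) as a numeric OID
            ObjectType(ObjectIdentity('1.3.6.1.2.1.31.1.1.1.10.1')),
        )
    )
    elapsed = time.monotonic() - start

    if error_indication:
        print('request failed:', error_indication)
    else:
        for var_bind in var_binds:
            print(' = '.join(x.prettyPrint() for x in var_bind),
                  '(round trip %.1f ms)' % (elapsed * 1000))

The poller is blocked for the entire round trip of each transaction, which is
what the architectures below try to work around.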
Monolithic Poller Architecture
A monolithic poller typically sends its requests synchronously.
That is, it sends a request to device A, waits for a response, then
sends a request to device B. The major flaw with this architecture
is that network latency significantly restricts the number
of requests the poller can perform each minute.
For example, a round-trip latency of 100 milliseconds per request means the poller
would only be able to perform 600 requests per minute. At 30 MIB
objects per request, that is only 18,000 objects per minute, or data for only
1,800 interfaces (assuming a minimum of 10 MIB objects per
interface).
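The arithmetic behind that bound is easy to reproduce; the figures below are the
ones used in the example above.

    rtt_seconds = 0.100          # round-trip latency per request
    objects_per_request = 30     # MIB objects packed into each request
    objects_per_interface = 10   # minimum MIB objects per interface

    requests_per_minute = 60 / rtt_seconds                               # 600
    objects_per_minute = requests_per_minute * objects_per_request       # 18,000
    interfaces_per_minute = objects_per_minute / objects_per_interface   # 1,800

    print('%d requests/min -> %d objects/min -> %d interfaces'
          % (requests_per_minute, objects_per_minute, interfaces_per_minute))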
Multi-Process Poller Architecture
A multi-process poller architecture is a crude method of increasing
scalability. This is typically done by dividing the workload into
N lumps, where N is the number of poller processes fired off (see the
sketch after the list below). 10 poller processes would typically increase
the scale of the above example to 180,000 objects, or 18,000 interfaces.
Pros:
- Linear increase in scale
Cons:
- Many more processes to manage
- Complex poller configuration
- Increase in system memory usage
- Each poller process is still affected by network latency
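Below is a minimal sketch of the approach, with the real SNMP transaction
replaced by a sleep that stands in for network latency; the device names,
process count and timings are placeholders rather than the behaviour of any
particular product.

    import time
    from multiprocessing import Pool

    DEVICES = [f'device-{i}' for i in range(1000)]   # placeholder inventory
    NUM_PROCESSES = 10                               # N poller processes
    RTT = 0.100                                      # simulated round trip

    def poll_chunk(devices):
        # Each worker still polls its chunk synchronously, one request at a
        # time, so it remains bound by network latency.
        results = {}
        for device in devices:
            time.sleep(RTT)              # stands in for request / response
            results[device] = 'ok'
        return results

    if __name__ == '__main__':
        # Divide the workload into NUM_PROCESSES lumps, one per process.
        chunks = [DEVICES[i::NUM_PROCESSES] for i in range(NUM_PROCESSES)]
        with Pool(NUM_PROCESSES) as pool:
            for chunk_results in pool.map(poll_chunk, chunks):
                pass   # merge chunk_results into the time-series store here

The scale gain is roughly linear in the number of processes, but every worker
is still spending most of its time waiting on the network.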
Multi-Threaded Poller Architecture
A multi-threaded poller is another crude method of increasing
scalability. This is done by firing off many threads, each thread
handling a single request / response transaction (see the sketch after
the list below).
Threads consume less memory than processes because they share the
same virtual memory address space, file descriptors and process
state information. They are faster to start up and context switch.
Refer to the Wikipedia article comparing threads and processes.
Pros:
- Linear increase in scale
- Consumes less memory than a multi-process poller
Cons:
- Complex poller configuration
- Added complexity of managing many threads
- Threads have to perform locking so they don't clobber each other
- Each poller thread is still affected by network latency
- Significantly more complicated to implement, test and debug
- One rogue thread can crash all other threads because they all share
the same memory and process state information.
- Has no scale advantage over a multi-process poller
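Here is a minimal sketch of the thread-per-request idea, again with a sleep
standing in for the SNMP transaction; the shared results dictionary and its
lock illustrate the coordination the cons above refer to. The names and
numbers are placeholders.

    import time
    import threading
    from concurrent.futures import ThreadPoolExecutor

    DEVICES = [f'device-{i}' for i in range(1000)]   # placeholder inventory
    RTT = 0.100                                      # simulated round trip

    results = {}
    results_lock = threading.Lock()

    def poll_device(device):
        time.sleep(RTT)                  # stands in for request / response
        with results_lock:               # threads must not clobber shared state
            results[device] = 'ok'

    with ThreadPoolExecutor(max_workers=100) as pool:
        list(pool.map(poll_device, DEVICES))

    print('polled', len(results), 'devices')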
Distributed Poller Architecture
A distributed poller architecture is another way of increasing
scale. By deploying multiple pollers closer to the end devices, you
can significantly reduce the underlying network latency, so the
pollers are able to perform many more requests each minute.
Pros:
- Faster to get through its polling workload due to lower network latency
Cons:
- Significantly more expensive to deploy and maintain
- Requires deploying many remote physical or virtual servers
In Summary ...
None of the above architectures comes anywhere close to providing a
cost-effective, scalable solution.
- Single monolithic synchronous pollers do not scale.
- Multi-threaded pollers are grossly inefficient due to all the thread
management overheads.
- Distributed pollers are plainly too expensive to deploy and maintain.
The AKIPS Poller Architecture
- Monolithic architecture - one process, no threads
- Asynchronous polling algorithm - polls 1000s of devices at the same time.
- Synchronous device requests algorithm - at most N outstanding requests
to each device at any one time (defaults to 1). This significantly
reduces the chances of overrunning the CPU in the remote devices.
- In-flight window tuning on a per-device basis (see the sketch after this
list). This feature completely negates the issue of polling large remote
devices with high network latency.
- Direct database insertion - achieved by tightly coupling the configuration,
time-series and event databases into a single process.
- MIB database integrated directly into the SNMP engine for high speed
packet encoding and decoding.
- 60 second polling interval for every MIB object
- Scales to 20+ million MIB objects per minute
- 30 second polling window each minute
- Integrated Ping Poller (RTT measured in microseconds)
- 15 second Ping interval for every device
- Interleaving of Ping and SNMP requests for a good packet mix
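The combination of an asynchronous poller with a small per-device in-flight
window can be sketched as follows. This is an illustrative sketch only, not the
AKIPS implementation: the SNMP transaction is simulated with a sleep, and the
device names, window sizes and request counts are placeholders.

    import asyncio

    DEVICES = {f'device-{i}': 1 for i in range(1000)}   # device -> window size
    REQUESTS_PER_DEVICE = 5                             # GET requests per cycle
    RTT = 0.100                                         # simulated round trip

    async def send_request(device, request_id):
        await asyncio.sleep(RTT)         # request goes out, response comes back
        return device, request_id, 'ok'

    async def poll_device(device, window):
        # The semaphore caps the number of outstanding requests to this one
        # device (default 1), so the remote CPU is never overrun.
        limit = asyncio.Semaphore(window)

        async def limited(request_id):
            async with limit:
                return await send_request(device, request_id)

        return await asyncio.gather(
            *(limited(r) for r in range(REQUESTS_PER_DEVICE)))

    async def main():
        # Every device is polled concurrently; only the requests to the same
        # device are serialised by its in-flight window.
        await asyncio.gather(*(poll_device(d, w) for d, w in DEVICES.items()))

    asyncio.run(main())

With a window of 1, each device takes roughly REQUESTS_PER_DEVICE x RTT of wall
clock time, but because all devices proceed in parallel the total run time stays
close to that figure no matter how many devices are in the inventory.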
Example Deployment
Test System:
- ASUS® S2066 ATX TUF X299 MARK 2
- Intel® Core i7-7800X 3.5GHz Hex core + Hyperthreading
- 128 gigabytes of 2666 MHz DDR4 RAM
- 6 x 2 Terabyte Seagate ST2000DM006 7200 rpm SATA disks
(set up in a single ZFS striped pool)
- Intel PRO/10GbE PCI-Express
- Operating system - FreeBSD® 11.2-p2
- Total cost approximately $4000 AUD
Network Topology Monitored:
- 60,000 devices (23,856 SNMPv2, 36,144 SNMPv3)
- 240,000 pings per minute (4,000 per second)
- 1.56 million interfaces at 13 MIB objects each, plus CPU, memory, temperature, etc.
- 21.44 million polled MIB objects per minute (714,666 per second
in the 30 second polling window)
Poller CPU usage (1 second samples)
The poller has a 30 second SNMP polling window, starting at second 5 and ending
at second 35. This leaves the system mostly idle at the start and end of each
minute for other processes to do work with little CPU and I/O contention.
You'll see from the Poller CPU graph that there is still a lot of available
headroom on this system to at least double the number of polled MIB objects.
AES encryption to the 36,144 SNMPv3 devices has very little impact because the
encryption is offloaded to hardware. The poller constantly tracks the
Engine ID, BootTime and BootCount for each SNMPv3 device, so it rarely
needs to perform Engine Discovery.
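As a rough sketch of that caching idea (illustrative only; the field and
function names are assumptions, not the AKIPS internals), the engine
parameters can be kept per device and invalidated only when a device reports
a time-window error:

    from dataclasses import dataclass

    @dataclass
    class EngineInfo:
        engine_id: bytes
        boot_count: int
        boot_time: float          # local timestamp used to track engine time

    engine_cache: dict[str, EngineInfo] = {}

    def discover_engine(device: str) -> EngineInfo:
        # Placeholder for the real SNMPv3 discovery exchange (an empty request
        # answered by a REPORT carrying the engine ID, boots and time).
        return EngineInfo(engine_id=b'', boot_count=0, boot_time=0.0)

    def engine_for(device: str) -> EngineInfo:
        info = engine_cache.get(device)
        if info is None:
            info = discover_engine(device)       # only on a cache miss
            engine_cache[device] = info
        return info

    def on_not_in_time_window(device: str) -> None:
        # The device rebooted or clocks drifted: rediscover on the next poll.
        engine_cache.pop(device, None)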
Poller Context Switches (1 second samples)
Involuntary context switching
can significantly impact performance, which is one of the major drawbacks
of multi-threaded pollers. Since the AKIPS Ping/SNMP poller is a single
monolithic process, it does not experience high levels of involuntary context
switching when there are ample idle CPU cores available.
Some Observations ...
- A single AKIPS poller process easily scales to 20+ million MIB objects per
minute using commodity-grade hardware. There will always be an upper limit,
but we have the luxury of firing off a second and third poller process.
- A well engineered asynchronous poller absolutely blows away a poller running
hundreds of processes or threads. It's like turning up to a drag race: one car
with a well tuned supercharged V8, the other with hundreds of weed
whacker motors, each with its own fuel tank, spark plug and starter cord.
- How well does AKIPS scale against other commercial or open source products?
That's very difficult to determine because every vendor uses their own
"gobbledygook" terminology, so you are not comparing apples with
apples. AKIPS collects 13 MIB objects for every interface every minute,
while most other vendors collect a smaller subset at five minute intervals.
- With network security such an important issue these days, we highly recommend
moving entirely to authenticated and encrypted SNMPv3. There is very little
measurable impact on data collection.
- AKIPS runs solely on FreeBSD®,
whilst most other vendors use one of the many Linux® distributions.
- The AKIPS SNMP implementation has been engineered from scratch, whilst
most other commercial and open source software is based on legacy
implementations like Net-SNMP.