There are two types of AKIPS install images:
The "USB Install Memstick" is a file which can be burned to a USB stick for installing on physical hardware.
To upgrade an existing system:
Added the kind attribute to the alert site scripting parameters.
Added the show_headers parameter to http_result().
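A minimal sketch of how it might be called, following the style of the other http_result() examples in these notes (the value 1 to enable header output is an assumption, not taken from the product documentation):
http_result ({ ... show_headers => 1, ... });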
https://akips.example.com/cgi?...username=guest;password=secret
In v21.6 we changed the cbgpPeerAcceptedPrefixes type to match CISCO-BGP4-MIB. However, this was not the correct type, and we inadvertently broke this report.
Customers who upgraded to v21.6 will need to restore from a
prior backup to retrieve historic cbgpPeerAcceptedPrefixes data.
This fixes a regression in RADIUS authentication introduced in v21.4.
Added a button for downloading a summary of group membership statistics.
This will be used to help development of an upcoming redesign of the grouping database.
To help, go to Admin > System > System Log Viewer, click the Entity-Group Statistics button to download the report, and upload the result at akips.com/upload.
Fixed a critical performance issue when updating interface speed groups (ifspeed_XXX).
The Tune Interface State feature runs every 5 minutes. If any interfaces come up, the interface speed is collected and updated, which requires updating the ifspeed groups.
To update ifspeed groups, previously we took a "sledgehammer" approach of clearing all ifspeed groups, then running bulk commands to assign interfaces back into the appropriate groups. For large configurations, this could take many minutes to run.
This has been changed to only update interfaces which have recently changed speed.
It is not appropriate to be alerting on that OID as it will always return the "noTestDiagnostics" value.
Dual socket poller - a workaround for firewalls (Palo Alto, Check Point, Cisco ASA) which experience broken/stuck SNMP sessions.
For several years, customers have experienced issues where firewalls lose session state for a device or group of devices. The symptom shows up as an SNMP unreachable device, yet an SNMP walk will succeed. Packet traces always show the polling packets going out, but no response from the devices.
Palo Alto firewalls seem to get confused when there is a short routing change. The session displays an incorrect outgoing interface, which means the SNMP requests never reach the device, or the responses never reach the AKIPS server. Check Point has a similar issue, whereas we've only seen issues with Cisco ASAs when they fail over between primary/secondary nodes.
The reason an SNMP walk succeeds in this situation is that the walk uses a new UDP source port number, so the firewall creates a new session rather than using the broken one.
The poller now works around these firewall issues by opening two UDP sockets at start-up and switching between them every two minutes. This gives the firewalls enough time to expire those broken/stuck sessions.
Added the ability to run multiple poller processes. For very large configurations, a single polling process has hit its practical upper limits due to CPU requirements. Spreading the polling load across multiple CPUs significantly extends the poller's scalability.
Note that this is an expert-only feature for massive AKIPS installations (i.e. monitoring 1 million+ interfaces from a single server).
Added CSV export to the Event Reporter.
Note: For the "summary" report, only the third table is exported. (Device, Description, Count, Last Change, Current Status.)
Fixed nm-http not providing query string parameters in the URL. This broke Opsgenie integration and any calls to http_send()/http_result() with URLs containing the question mark character.
Fixed nm-http failing for some hosts. This fixes a regression introduced in v20.15.
Changed the nm-http client to HTTP 1.1 due to recent Opsgenie API changes.
nm-http client.
nm-flow-reporter, nm-flow-timeseries, and through the Web API.
nm-availability command line arguments.
nm-httpd web server.
!10.1.2.0/24
Removed the sched_1d_ Site Scripting functionality. Use sched_HHMM_ instead, as this gives you the flexibility to run daily at a specific time. sched_1d_ scripts will be automatically converted to run at midnight localtime.
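As an illustration (the script names here are hypothetical), a script that previously ran once per day is renamed to run daily at a specific local time:
sched_1d_config_report -> sched_0000_config_report (daily at midnight, the automatic conversion)
sched_1d_config_report -> sched_0630_config_report (daily at 6:30 am)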
NOTE: Exablaze does not support the standard IF-MIB.
http_send ({ method => 'DELETE', ... });
The UCS SNMP agent returns an unknownEngineID Report packet with its snmpEngineID filled in, but zero values for snmpEngineBoots and snmpEngineTime. In response to the next request packet from AKIPS, it sends a notInTimeWindow Report packet with all three values correctly filled in.
mget * * sys ip4addr value 10.132.0.0/16
10.1.1.1 10.1.2.1 10.1.3.1 ...
The fail early option tracks the number of failed requests to a device. Once the failed limit is reached, all queued and additional requests for that device are immediately deleted.
The rewalk loads a list of devices to rewalk configs and removes all devices which are currently SNMP unreachable. If a device becomes unreachable during the rewalk, then the snmp library continues to send/retry every queued request. This significantly increases the time the rewalk takes.
NOTE: This feature is turned off by default.
mget * * sys ip4addr value /10\.1\.50\./
mget * * sys ip4addr value 10.1.50.0/24
mget * * sys ip4addr value 10.1.50.0-50
mget * * sys ip4addr value 10.1.*.0/24
Note: The IP address of each monitored device is stored as text type in the AKIPS configuration. This is a design error as it would have been much more efficient to store them in their natural host order binary format to allow efficient IP range/mask queries. The storage type will be changed in a future release, and regex queries will be deprecated.
The auto grouping in large sites was taking a long time to complete (e.g. 20 minutes). This caused usability and event notification issues. A significant speedup, in the range of 100 times, has been achieved for commonly used grouping rules.
Speedups have been implemented for the following attributes:
Speedups have been implemented for the following interface grouping functions:
Cisco AP monitoring is an opt-in feature. To enable:
AKIPS uses a purpose-built compression algorithm to store much of its structured data (e.g. time-series, events, NetFlow records). The algorithm implementation has been re-engineered to achieve higher speed and a better compression ratio when running on modern CPU architectures. The re-engineered compression has so far only been implemented in the NetFlow rewrite. It will be introduced into other parts of the product in future releases.
The AKIPS NetFlow architecture has been completely re-engineered. The original thinking was that customers would point a limited number of centrally located flow exporters at AKIPS. In reality, the core routers are just too busy to do 1:1 flow exports - often a site requirement because sampled flow has limited practical use. Therefore, many customers have instead configured their regional and end point routers to send the flow exports.
In the previous architecture, a flow meter process was started for each flow exporter. Some customers need to configure 1000s of exporters, which resulted in thousands of running meter processes, causing CPU and memory contention. The new architecture reduces the number of flow meter processes to a default of half the number of CPU cores of the server (tunable). Each meter process handles many exporter flows. This significantly reduces context switching, CPU load and memory usage. The goal is to scale to somewhere around 5000+ exporters and 500k flows/sec on a single server.
Part of the reason for the rewrite was also to enhance the information being collected, stored and reported. This includes:
Note: The NetFlow report GUI will be enhanced in the next release to include interface filtering. Scale and functionality testing has been carried out using the AKIPS flow simulator, which can simulate 1000s of flow exporters.
A config crawler and diff tool has been added. The goal is to efficiently collect router/switch configurations using ssh and store them in a version control system. This "working prototype" feature includes the following functionality:
assign interface * * all group A B = C
http_result ({ ... proxy => 'proxy.example.com:3129', });
The poller performs engine discovery for each device at start-up, and reuses this information for each subsequent polling cycle so it does not constantly go through the engine discovery process.
SNMPv3 uses engine ID, boot count and boot time values in its authentication mechanism. When a device is rebooted, its boot count is incremented and its boot time reset. On the next poll request, there will be a mismatch between the poller and the agent, therefore the agent should return an SNMPv3 Report packet to force engine rediscovery.
There is a defect in APC devices where the agent never generates a report packet when it is rebooted. It just silently discards all requests which do not match its new time/count values. The poller has no way of knowing what has occurred.
Since the poller and nm-snmp programs share the new engine database, the next snmp walk will force a rediscovery of the APC engine information. AKIPS performs regular walks of every device to update various configuration information.
http://{server}/api-spm?password={pw};mac={MAC}
nm-http "http://www.example.com:8080/"
http_send ({ url => "http://www.example.com:8080/" });
http_result ({ ... headers => [ "X-First-Name: Joe", "X-Last-Name: Smith" ], });
add child Atlanta-ro asset
add text Atlanta-ro asset Asset_Tag = 1234
add text Atlanta-ro asset SSH = "<a href='ssh://10.1.2.3'>SSH</a>"
add text Atlanta-ro asset Wiki = "<a href='https://mywiki.example.com/device/Atlanta-ro.html'>link</a>"
add device group nexus
assign * * sys SNMPv2-MIB.sysDescr value /NX-OS/ = nexus
assign device * any group nexus = spm_exclude_arp_context
Previously, the Syslog and Trap Reporters would start a 30 second refresh timer no matter what filter options were selected. This made the reporters frustrating to use when running queries over long periods of time and/or with a lot of syslog/trap data. To start a 60 second refresh timer on a LastN report, click the "Reload" button in the graph.
Fixed missing IPv4/v6 ping poller configuration when an IPv6 address is set in the device editor. The editor now adds the ping6 child and all appropriate attributes for the ping poller to start polling it. Previously it was only possible to add IPv6 ping polling to a device via a full discover.
The algorithm calculates transmit and receive 95th percentile of five minute averages. Different telcos reportedly bill on either Sum or Max, so both columns are calculated and displayed in the report.
Added a new From/To datepicker because a Telco billing period typically starts on a given day of the month.
You cannot sort on the Sum or Max columns because the backend database does not support that functionality.
NOTE: 95th percentile is mainly used in the USA for billing purposes by telcos and ISPs. It is just a '95th median' value and has no useful purpose for any type of capacity analysis. Links with vastly different traffic profiles can end up having the same 95th percentile.
calc interval avg 300 median 95 time "last30d" ...
The 'wait' logic was intended for enumerations with good/bad states. If an enumeration flips good -> bad -> good within the wait time window, then no alert is generated. An uptime only has a 'reset' event, so the same logic cannot be applied; it therefore doesn't make sense to apply 'wait' rules to uptime objects.
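As an illustration (the 15 minute window and the interface status object are chosen purely for example purposes, and NetEng is an example notification group), a wait rule on an enumeration such as ifOperStatus only alerts when the bad state persists beyond the wait window:
wait 15m * * * IF-MIB.ifOperStatus = email NetEng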
The backup mechanism now keeps multiple historical copies of the data. This is achieved by applying a ZFS snapshot to the backup server at specific times or days. These times include:
NOTE: You must update your backup server to at least version 16.17 for the new backup mechanism to work. The upgrade needs to re-arrange the backed up data.
Added workaround for broken Check Point MIB and SNMP agents. The Check Point MIB has errors from top to bottom. Many objects that should be defined as counters, gauges, integers or enumerated types are incorrectly defined as a plain DisplayString. The SNMP agent often returns a different object type than what is defined in the MIB.
The SNMP agent also appears to get itself in a knot and loop when responding to an SNMP walk in the VSX section of the MIB if the VSX configuration has changed. The only solution is to restart the SNMP agent on the box. This was observed on version R77.30.
A performance issue with the incremental backup was reported when the backup server was located in a remote data centre and the data had to go through multiple firewalls. The latency of the network and firewalls caused the backup to run for a long period. This was because the backup script was invoking 1 to 3 ssh/scp commands for each file being backed up. The backup/restore transfer mechanism has been changed from ssh/scp to sftp in batch mode. This reduces the TCP socket connections to one, which has resulted in a significant speedup of both the backup and restore.
When a device was added using SNMPv2, and later discovered using SNMPv3, the duplicate entry was not being removed because SNMPv2 uses MAC table checks and SNMPv3 uses the engine ID. This was done due to a bug in IXIA devices which don't return valid MAC address entries. The algorithm was changed so devices with no valid MAC addresses are not checked for duplicate devices. SNMPv3 now also uses the MAC address fingerprint checks.
wait 5m * * ping4 PING.icmpState = email NetEng
* * ping4 PING.icmpState = stop
* * * * = email NetEng
Several customers have experienced database deadlock issues when discovering SNMPv3 devices. The AKIPS database comprises multiple underlying databases (e.g. time-series, events, configuration, etc). One of these is the SNMPv3 LUK (Localised User Key) database, which holds things like the SNMPv3 SHA/MD5 generated LUK, Engine ID, Engine Boot time, Engine Boot Count, etc. Refer to the Deploying SNMPv3 blog for a detailed description of SNMPv3.
The deadlock issues were able to be reliably reproduced in our lab network by running the discover against a large number of SNMPv3 devices. Several race conditions in the underlying database locking have been rectified.
Fixed handling of large jumps in time. Time is always meant to go forwards at a constant rate! The time-series database stores all data in 30 day blocks. If a server is shut down and rebooted with a wildly different time (e.g. out by several years), the time-series database processing immediately starts pruning data blocks which are not within the storage time period.
Additional time sanity checks have been added so if the software detects that time has jumped, either at boot time or while it is running, then the software drops into read only mode and an error message is displayed in the GUI to contact AKIPS tech support.
In version 16.2, the Postfix package was updated as part of the operating system upgrade. It appears the FreeBSD ports team made an unannounced change to the Postfix package. In their wisdom, they have suddenly decided not to update /etc/mail/mailer.conf to point to Postfix instead of Sendmail.
Due to this change, AKIPS servers running 16.2 will attempt to send mail using the default Sendmail program instead of Postfix. All of the mail settings in Admin->System->System Settings are reliant on Postfix, and will not work with Sendmail. For sites which are reliant on these settings (e.g. "Email Domain"), mail may not be delivered.
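For reference, a sketch of the mailwrapper entries in /etc/mail/mailer.conf when they point at Postfix (these are the usual FreeBSD Postfix port paths; the exact paths on a given server may differ):
# route the sendmail-compatible commands to the Postfix binaries
sendmail    /usr/local/sbin/sendmail
send-mail   /usr/local/sbin/sendmail
mailq       /usr/local/sbin/mailq
newaliases  /usr/local/sbin/newaliases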
For example, the following rule means the average multicast packet rate exceeds 1000 per minute over the last 5 minutes.
last5m avg above 1000 * * * IF-MIB.ifHCInMulticastPkts = email foo@bar.com
Whereas the following rule means the total multicast packets exceed 5000 over the last 5 minutes.
last5m total above 5000 * * * IF-MIB.ifHCInMulticastPkts = email foo@bar.com
Previously the ping sweep option would use the ping range rules from the discover. Those range rules would typically be used to find network devices, not edge devices (e.g. PCs, printers, etc). The new ping range rules in the Switch Port Mapper Settings allow you to ping sweep the edge devices, therefore priming the IP ARP tables of routers and the bridge forwarding tables of switches.
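For example (the address ranges are purely illustrative), the Switch Port Mapper ping ranges might cover the edge/user subnets while excluding ranges you do not want swept:
10.20.0.0/16
10.30.0.0/16
!10.30.99.0/24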
/-ro/
/!-ro/
Syslog and SNMP Trap Reporter speedup. Some customers are constantly sending up to 500 syslog messages per second to their AKIPS server. The syslog collector and forwarder (i.e. fanout) easily handled that amount of workload. In comparison, the reporting tool in prior versions was a bit sluggish.
The syslog and trap messages are stored in 10 megabyte lumps, which are compressed and appended to daily data files. Due to a design error, the nm-msg-reporter tool was retrieving and uncompressing significantly more data than it actually needed to. Testing has shown an approximate 4 times speedup in reporting performance.
HPET clock issues on various older platforms.
It would appear that some platforms have a broken HPET (High Precision Event Timer). We have observed a breakage with the HPET timer in this simple bit of pseudo code in our test lab:
while (....) {
    gettimeofday (....);
    usleep (10000);
    ....
}
The usleep(10000) is only supposed to be 10ms long, but very intermittently it gets stuck there for 128 seconds, which is fairly fatal for a real-time monitoring system.
Several customers have reported false device outages where every device is reported down at the same time. The cause appears to be due to a very intermittent HPET failure. Apparently the HPET design is prone to intermittent race conditions. Refer to page 9 of this VMware document: Time Keeping in a Virtual Machine
The following workarounds have been applied:
Performance speedup to the Event and Device dashboards when processing large volumes of Syslog and SNMP Trap data.
Some customers are constantly pushing 300+ syslog/trap packets per second to their AKIPS server. This data is stored in 10MB compressed lumps. Each time the Event or Device dashboard refreshed, it had to uncompress and filter large amounts of raw data to generate the syslog/trap graphs. For a last 30 minute report, that would be in excess of 500,000 syslog/trap messages to uncompress/process.
A new message database has been added that stores the number of syslog/trap messages received each minute for each IP address. The Event and Device dashboards have switched to use this new database to generate the syslog/trap graphs.
Fixed an issue where "rename device router1 Router1" did not work.
nm-discover-device {ip address or range} [version 2 community {community}]
nm-discover-device {ip address or range} [version 3 user {snmp user} md5|sha {authpasswd} des|3des|aes128|aes192|aes256 {cryptpasswd}]
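For example (the address range and community string are illustrative):
nm-discover-device 10.1.2.0/24 version 2 community public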
In AKIPS we have the concept that all MIB objects are polled at a single rate of 60 seconds. This significantly simplified the design of the poller and time-series database. Two techniques are used to significantly reduce the overall SNMP traffic:
The adaptive polling rate algorithm detects either:
When either of these conditions occurs, the poller backs off to 120 seconds, and then to 180 seconds. If the gauge or counter changes, then the polling rate is immediately ramped back up to 60 seconds. This algorithm is applied to every monitored gauge and counter. Every 60 seconds, the Interface State Tuning checks the ifOperStatus of every interface. It turns polling on/off for all counters/gauges for each interface depending on its current state.
A problem was discovered where the Interface State Tuning was disrupting the Adaptive Polling algorithm. This disruption would randomly cause objects which were being polled at 180 seconds to have their responses ignored. For example, interfaces which run zero discards for long periods and then have a burst of discards in a single minute.
A problem with Cisco Nexus switches was reported where utilisation graphs would display large positive and negative spikes. Additional validation checks and logging have been added to the poller to detect when devices return unexpected responses. Some devices return MIB objects that were not requested, or in a different order than they were requested in.
Changed the behaviour of the discover for detecting duplicate devices. The discover previously pruned invalid MAC addresses, then checked if two devices contained the same MAC, and discarded duplicates. The problem with this is many programmers at <insert your network vendor here> do silly things like use the same MAC -
NOTE: The discover will print out a list of duplicate MAC addresses found on your network. Please report these to AKIPS tech support. We still need to prune these MACs from the switch port mapper data.
skipping rule a.b.c.d/mask because its calculated runtime of XXXX seconds will exceed the limit of X seconds
The problem appeared when using a mask that is not a multiple of 8. /24 works fine, but /25, /26, and /27 may sometimes fail with a bogus message.
For example, in the Tools -> Command Console, try the following:
tf dump "last1w; mon to fri 8:00 to 17:00; sat 8:00 to 12:00;"
span 2014-11-02 00:00:00 to 2014-11-08 23:59:59
include 2014-11-03 08:00:00 to 2014-11-03 17:00:00
include 2014-11-04 08:00:00 to 2014-11-04 17:00:00
include 2014-11-05 08:00:00 to 2014-11-05 17:00:00
include 2014-11-06 08:00:00 to 2014-11-06 17:00:00
include 2014-11-07 08:00:00 to 2014-11-07 17:00:00
include 2014-11-08 08:00:00 to 2014-11-08 12:00:00
tf dump "last1w; not mon to fri 8:00 to 17:00; not sat 8:00 to 12:00;"
span 2014-11-02 00:00:00 to 2014-11-08 23:59:59
include 2014-11-02 00:00:00 to 2014-11-03 08:00:00
include 2014-11-03 17:00:00 to 2014-11-04 08:00:00
include 2014-11-04 17:00:00 to 2014-11-05 08:00:00
include 2014-11-05 17:00:00 to 2014-11-06 08:00:00
include 2014-11-06 17:00:00 to 2014-11-07 08:00:00
include 2014-11-07 17:00:00 to 2014-11-08 08:00:00
include 2014-11-08 12:00:00 to 2014-11-08 23:59:59
Added functionality to the discover so it creates two interface virtual
objects of IF-MIB.ifInSpeed and IF-MIB.ifOutSpeed. The values of these
objects are updated using the ADSL line rate speeds. The percent
utilisation values in the interface reports are calculated using these
two new virtual objects for ADSL interfaces only. You will need to select
the 'adsl' ifType in the discover configuration and perform a rewalk for
ADSL interfaces to be added and their speeds set correctly.
NOTE: The ADSL discovery has only been tested on Cisco devices.
Wireless Controller (ESSID)
Wireless Access Point
Wireless AP Radio
NOTE: To automatically group all Aruba Access Points, add the following rules to your Auto Grouping:
add device group Aruba-AP
assign * * /^ap/ WLSX-WLAN-MIB.wlanAPName = Aruba-AP
include SNMPv2-MIB.sysObjectID PowerNet-MIB
add device group APC
assign * * sys SNMPv2-MIB.sysObjectID value /PowerNet/ = APC
include SNMPv2-MIB.sysDescr NetScaler
add device group Citrix
assign * * sys SNMPv2-MIB.sysDescr value /NetScaler/ = Citrix
‘We spun up a VM and installed AKIPS in 4¼ minutes. We put in some IP ranges and discovered our whole network in less than 15 minutes. It gave us better visibility of our network than we'd ever had before.’
Please direct all product related technical questions and issues to support@akips.com
NOTE: AKIPS does not provide consulting services on how to address network issues that our software has highlighted.
AKIPS is the sole creator and distributor of AKIPS Network Monitor.
Specifying a VM or bare metal platform is difficult because every network is different (i.e. number of users, devices, polled MIB objects, syslog/trap/NetFlow rates). AKIPS recommends starting with a VM installation to determine a resource baseline required for monitoring your infrastructure and then increase the CPU/RAM/Storage resources as needed.
As a general rule, we recommend:
NOTE: Before purchasing physical hardware, contact AKIPS support with your intended vendor/model/spec so we can confirm the operating system has the appropriate disk and Ethernet controller driver support.
AKIPS is known to work on the following virtual machine platforms:
Network size | Minimum platform |
---|---|
Small (50,000 interfaces) | |
Medium (100,000 interfaces) | |
Large (250,000 interfaces) | |