LTM Route Advertisement

Advertising Healthy F5 LTM VIPs using BGP

Reading time: 8 minutes

What’s the point?

I have come across some environments where combining dynamic routing with the F5 LTM’s monitoring capabilities provides a worthwhile benefit. Of course, in many (perhaps most) cases, having an LTM advertise only healthy and reachable resources via dynamic routing may not be needed, since it may be the only place the resources exist. In that case, if the one set of resources goes down, there’s nowhere else for the traffic to go, so there’s no point in taking advantage of this.

There are a couple of cases where this would be a useful tool. Without going too deeply into it in this post, advertising the same IP addressing from two different LTMs in different locations to mimic an “anycast” IP concept on an enterprise WAN benefits from this. If both LTMs are advertising the same resources (with the same IPs or different IPs), it’s beneficial to have the network automatically adjust should resources become unavailable. Otherwise, network traffic may never adjust to the failure and will continue trying to reach unreachable resources.

I won’t spend more time on the anycast concept, but it is an interesting one. The primary topic of this post is how to actually configure the F5 BIG-IP to perform dynamic routing based on virtual server health. This post will focus on BGP, but the same steps could be followed for any other BIG-IP-supported routing protocol by substituting the appropriate routing configuration.

The configuration

Enable BGP on the route domain

The first thing we will need to do is enable BGP on the appropriate route domain on the F5 LTM. This is done in the LTM GUI. On the Main tab, click Network > Route Domains. This will bring you to the configured route domains (which may be only the default of 0). Under Dynamic Routing Protocols, you will see a list of Available and Enabled protocols. Move the desired routing protocol from Available to Enabled and click Update. In this example, we’re using BGP. This will enable the overall routing daemon on the LTM, as well as the individual BGP routing protocol daemon.

Note: If you have Port Lockdown configured on the Self IP you will be using to develop neighbor relationships, you will need to allow the needed port (e.g. TCP/179 for BGP).
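If you prefer the CLI, the same change can be made from TMSH. A minimal sketch, assuming the default route domain of 0:

modify /net route-domain 0 routing-protocol add { BGP }
save /sys config

This should produce the same result as the GUI steps above; substitute your route domain ID if you are not using the default.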

Verify the routing daemon is now running and BGP is enabled

In order to perform these steps, you will need to access the TMSH CLI of your LTM.

The following output confirms that the BGP protocol was enabled for the route domain.

Note: If using a non-default route domain, replace the 0 in the following command with your route domain ID.

root@(bigip)(cfg-sync Standalone)(Active)(/Common)(tmos)# list /net route-domain 0 
net route-domain 0 {
    id 0
    routing-protocol {
        BGP
    }
    vlans {
        internal
        VL191-INSIDE
        VL193-OUTSIDE
        http-tunnel
        socks-tunnel
    }
}

Let’s now confirm that the routing daemon is running.

root@(bigip)(cfg-sync Standalone)(Active)(/Common)(tmos)# show /sys service tmrouted
tmrouted     run (pid 7309) 33 days

Configure BGP on the LTM

Now the LTM is running the two core routing processes: imi and nsm. It has also started a process for BGP. The IMI shell is the integrated management interface for the routing processes, and we will use it to configure the LTM’s BGP process. This shell is remarkably similar to Cisco IOS; however, it is unsurprisingly stripped down a bit in capability.

In order to access the shell, run the following command.

run /util imish

Note: add “-r [ID]” if using a non-default route domain.

Once in the shell, we will use IOS-style syntax to configure BGP. This example will keep things quite simple.

bigip[0]>en
bigip[0]#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
bigip[0](config)#router bgp 65001
bigip[0](config-router)#neighbor 10.202.191.1 remote-as 65001
bigip[0](config-router)#end

Configure BGP on the upstream router

We will also need to ensure that BGP is configured on the upstream router. In this case, I already had it configured.

RTR#show run | s bgp
router bgp 65001
 bgp router-id 10.202.191.1
 neighbor 10.202.191.22 remote-as 65001

Verify BGP neighbors

We can now check to see if the BGP neighbor adjacency was established. Let’s check the router first:

RTR(config-router)#do show ip bgp summary
BGP router identifier 10.202.191.1, local AS number 65001
BGP table version is 1, main routing table version 1

Neighbor        V           AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.202.191.22   4        65001       2       2        1    0    0 00:00:04        0

Looks good. Let’s check the LTM.

bigip[0]#show ip bgp summary
BGP router identifier 10.202.193.22, local AS number 65001
BGP table version is 1
0 BGP AS-PATH entries
0 BGP community entries

Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.202.191.1    4 65001       4       3        1    0    0 00:00:33        0

Total number of neighbors 1

Also looks good! We see the neighbor is up and messages are being exchanged. However, we do see that 0 prefixes have been received. Let’s go deeper.

Configure BGP route advertisement for virtual addresses

If we look at the BGP routing table on the router, we see:

RTR(config-router)#do show ip route bgp 
Codes: L - local, C - connected, S - static, R - RIP, M - mobile, B - BGP
       D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area 
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
       ia - IS-IS inter area, * - candidate default, U - per-user static route
       o - ODR, P - periodic downloaded static route, H - NHRP, l - LISP
       + - replicated route, % - next hop override

Gateway of last resort is 10.202.1.33 to network 0.0.0.0

So, no routes are being learned. Our LTM is not advertising any routes. Let’s check out the LTM routing table (in this case, our VIP is 10.202.193.100/32):

bigip[0]>show ip route
Codes: K - kernel, C - connected, S - static, R - RIP, B - BGP
       O - OSPF, IA - OSPF inter area
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, L1 - IS-IS level-1, L2 - IS-IS level-2, ia - IS-IS inter area
       * - candidate default

Gateway of last resort is 10.202.193.1 to network 0.0.0.0

K*      0.0.0.0/0 via 10.202.193.1, VL193-OUTSIDE
C       10.0.0.0/29 is directly connected, internal
K       10.202.30.0/24 via 10.202.191.1, VL191-INSIDE
C       10.202.191.0/24 is directly connected, VL191-INSIDE
C       10.202.193.0/24 is directly connected, VL193-OUTSIDE
C       127.0.0.1/32 is directly connected, lo
C       127.1.1.0/24 is directly connected, tmm0

We don’t see the actual VIP advertisement; however, we do see the connected routes resulting from the VLAN interfaces configured on the LTM.

This is because we haven’t configured any routes to be advertised. F5 injects virtual server routes into tmrouted as “kernel” routes, indicated by the K flag. However, this only happens after enabling “Route Advertisement” on the relevant virtual addresses. Go to Local Traffic > Virtual Servers > Virtual Addresses and select the virtual addresses that you want to advertise.

Note: virtual addresses are automatically created when creating a virtual server object. Multiple virtual servers using the same IP address (but different ports) only result in one virtual address. The virtual address is also automatically deleted when you remove the relevant virtual servers (by default).

[Screenshot: Enabling/Disabling Route Advertisement on a Virtual Address]

After selecting the virtual address, check the Route Advertisement box. You can see the check box in the last field of the screenshot above. You can also see some other configurations relevant to route advertisement, such as whether to advertise the route when any virtual server associated with this virtual address is available, or only when all of them are available. This is obviously dependent on your specific scenario.
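The same setting can also be toggled from TMSH. A minimal sketch, assuming the 10.202.193.100 virtual address used in this example:

modify /ltm virtual-address 10.202.193.100 route-advertisement enabled
list /ltm virtual-address 10.202.193.100 route-advertisement

The list command lets you confirm the new setting took effect.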

After that is enabled, you can now see virtual server addresses in the IMI shell routing table.

bigip[0]>show ip route
Codes: K - kernel, C - connected, S - static, R - RIP, B - BGP
       O - OSPF, IA - OSPF inter area
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, L1 - IS-IS level-1, L2 - IS-IS level-2, ia - IS-IS inter area
       * - candidate default

Gateway of last resort is 10.202.193.1 to network 0.0.0.0

K*      0.0.0.0/0 via 10.202.193.1, VL193-OUTSIDE
C       10.0.0.0/29 is directly connected, internal
K       10.202.30.0/24 via 10.202.191.1, VL191-INSIDE
C       10.202.191.0/24 is directly connected, VL191-INSIDE
C       10.202.193.0/24 is directly connected, VL193-OUTSIDE
K       10.202.193.100/32 is directly connected, tmm0
C       127.0.0.1/32 is directly connected, lo
C       127.1.1.0/24 is directly connected, tmm0

Great! We now see it in our LTM routing table, and we can see it is injected as a “kernel” route. Unfortunately, we will not yet see it in our router’s routing table.

Note: This feature is also called Route Health Injection (RHI).

Configure kernel route redistribution

Now that the routing table shows our virtual IP addresses as kernel routes, we need to redistribute them into BGP. From the IMI shell (accessed via TMSH as before), configure the following:

bigip[0](config)#router bgp 65001
bigip[0](config-router)#redistribute kernel

Verify virtual server route on router

Let’s verify that we see it in our router’s routing table now.

RTR#show ip route bgp
Codes: L - local, C - connected, S - static, R - RIP, M - mobile, B - BGP
       D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area 
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
       ia - IS-IS inter area, * - candidate default, U - per-user static route
       o - ODR, P - periodic downloaded static route, H - NHRP, l - LISP
       + - replicated route, % - next hop override

Gateway of last resort is 10.202.1.33 to network 0.0.0.0

      10.0.0.0/8 is variably subnetted, 46 subnets, 7 masks
B        10.202.193.100/32 [200/0] via 10.202.191.22, 00:03:50

Excellent! We are now successfully advertising our virtual server via BGP. Now let’s see what happens if the VIP becomes unavailable.

Verify route availability if virtual server fails

Note: For testing purposes, this must be simulated by forcing the relevant pool members offline. Disabling the virtual server itself or the pool members does not trigger the virtual server to actually “fail” and pull the route advertisement.

“Force offline” the pool members of your testing virtual server. You could also actually fail the server itself and let the monitor check fail to test this more realistically.

[Screenshot: Pool Members Forced Offline (Black Diamond)]

Verify that the virtual server is now marked as failed. Our testing VS is slitaz_http_pool; forgive the counter-intuitive naming. 🙂

[Screenshot: Virtual Server is Failed]

Now let’s check that the route is no longer present on our router as a result of the failed virtual server.

RTR#show ip route bgp
Codes: L - local, C - connected, S - static, R - RIP, M - mobile, B - BGP
       D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area 
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
       ia - IS-IS inter area, * - candidate default, U - per-user static route
       o - ODR, P - periodic downloaded static route, H - NHRP, l - LISP
       + - replicated route, % - next hop override

Gateway of last resort is 10.202.1.33 to network 0.0.0.0

RTR#

Note that no BGP routes are now present.

By default, this will take up to 10 seconds to be reflected in tmrouted. The interval can be viewed in TMSH as the query period. Note that the value is only displayed if it has been changed from the default.

list /sys db tmrouted.queryperiod

In order to modify this timer for faster convergence on a failed virtual server, perform the following:

modify /sys db tmrouted.queryperiod value [period_in_seconds]

You can also delay the withdrawal of routes resulting from a virtual server status change (to prevent flapping from affecting route advertisement) using:

modify /sys db tmrouted.rhifailoverdelay value [delay_in_seconds]

Obviously, the BGP config used was quite rudimentary as an example, but route-maps and other functions can be configured, as well.
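For example, here is a minimal sketch of constraining the redistribution with a prefix-list and route-map in the IMI shell. The prefix-list and route-map names are my own, and the prefix is the example VIP from above:

bigip[0](config)#ip prefix-list VIP-PREFIXES seq 5 permit 10.202.193.100/32
bigip[0](config)#route-map RHI-OUT permit 10
bigip[0](config-route-map)#match ip address prefix-list VIP-PREFIXES
bigip[0](config-route-map)#exit
bigip[0](config)#router bgp 65001
bigip[0](config-router)#redistribute kernel route-map RHI-OUT

This keeps stray kernel routes out of BGP and only advertises the VIPs you explicitly list.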

References

This article was used as reference. I recommend reading through it for caveats specific to your situation, as well as confirming your BIG-IP software version has no specific considerations.

https://support.f5.com/kb/en-us/products/big-ip_ltm/manuals/product/tmos-ip-routing-administration-11-2-0/4.html

EAP-TLS with a Broken Private Key

When going through a standard EAP-TLS deployment recently, a seemingly new problem reared its head. We had tested it out and validated EAP-TLS on a couple of laptops using the Microsoft Native Supplicant. All of the ISE policies were good to go, and everything was working as expected.

Then came along a new laptop. This one had other plans. We plug it in and get to authenticatin’.

Troubleshooting

Well, it didn’t seem to want to send many 802.1X requests, and the port kept falling back to MAB authentication. We validated the client’s native supplicant configuration to ensure that the GPO got everything correct. All seemed to be in good order. We couldn’t modify any of the administrator-level settings at this point, so we decided to move on instead of playing with some of the settings.

We disabled MAB to see the problem more clearly. Most of the requests were seen on ISE using the host/anonymous identity. This clued us in a bit to the issue since that identity is the default unprotected identity string for EAP-TLS. Something was happening prior to gleaning the actual identity from the machine’s certificate.

We check the machine’s certificate store and see the lonely certificate. It’s purposed for machine authentication, as expected. We then check the trust of the ISE server certificate. That didn’t seem to be the issue given how it was presenting; however, it never hurts to check. The chain was present and the certificate deemed valid.

Checking out the ISE live authentications, I start to see a pattern of failures between the “host/anonymous” attempts. Actual certificate identities were coming through very sparingly, though they were still failing. The error was descriptive. Unfortunately, that doesn’t necessarily mean helpful.

Supplicant stopped responding to ISE after sending it the first EAP-TLS message

Okay, so that means something is happening after the client sends its initial EAP-TLS message. That’s interesting. What would make that occur? The client had a good certificate, server certificate was valid, and supplicant settings were solid. Nothing seemed out of order. On to the next level of troubleshooting. Debugs.

Luckily, there was only one port being authenticated on this test switch for the time being. This made debugging a bit more digestible given the chattiness of the RADIUS, EAP, and AAA debugs. I started with the RADIUS debugs. We knew RADIUS requests were getting to ISE, so this meant that at least some level of EAP communication was happening between the client and the switch. We could see a RADIUS request being sent to ISE for the client, and ISE responding. Then we saw retransmissions of the same response from ISE.

Moving on to enabling EAP debugs, the above ISE error was only confirmed. We saw one inbound EAP frame and then a response after ISE returned the RADIUS response. The client never responded. We did see one uncommon (read as: I’d never seen it before) EAP debug message.

EAP-TLS-ERROR: Invalid Directive (7)

Some google-fu didn’t exactly tell me what this meant, so I couldn’t rule it out as a clue, but it didn’t point me anywhere useful, either. Perhaps ISE was responding with something that the particular client didn’t like? Cue Wireshark. Comparing the troubled workstation to a working one revealed the same EAP traffic up until the point the troubled workstation stopped responding. The contents of the EAP frames were identical with the exception of the retransmissions once the client stopped working. Theory defeated.

Okay, so we knew that the supplicant received an EAP response from ISE and was not sending its own reply. Let’s check out the Wired AutoConfig log in the Event Viewer. Nothing out of the ordinary. We see the communication and failures, but nothing pointing at a problem. Perhaps something in the certificate processing was having an issue. Maybe this client had some crypto algorithm differences and wasn’t able to process the server’s certificate?

Heading over to the CAPI2 log in the Windows Event Viewer would allow us to check out the certificate operations on the machine and maybe identify something. Unfortunately, this log is not enabled by default, and we weren’t able to turn it on with our user permissions. Perhaps this was the universe telling us this may be where the problem lies. Perhaps not.
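For reference, if you do have administrator rights, the CAPI2 operational log can be enabled from an elevated command prompt:

rem enable the CAPI2 operational log (requires admin)
wevtutil sl Microsoft-Windows-CAPI2/Operational /e:true

We simply didn’t have the permissions to do so at the time.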

Fiddling for a while and getting no more enlightened, we decide to install the AnyConnect NAM. It often has a way of doing things the native supplicant fails to do. Not a very good solution for the widespread deployment, but, hopefully, it would help us find the issue. We haggle a systems administrator for her credentials and get things going. NAM installed, basic EAP-TLS profile configured. No dice. Still failing. Still seeing the same ISE error stating EAP-TLS stopped after the first packet. We hadn’t installed the DART bundle (for whatever reason) and the system admin had already scurried away.

Back to the Event Viewer it is. Cisco AnyConnect adds some logs of its own to the Event Viewer, and that’s where we started. We see some error messages and jump right to those. They all refer to some sort of “Internal Error 4” or “Internal Error 204” when calling a function called acCertLoadPrivateKey().

acCertLoadPrivateKey(), Internal Error 4 & 204

What dreaded words — Internal Error. This didn’t really tell us what exactly was happening. But, something was apparently going wrong loading the certificate’s private key. Could we have finally found the issue?!

Resolution

We decided to delete the machine’s certificate and re-issue it to test this out. After rustling up the system administrator again, we were able to get it removed and a gpupdate / reboot completed. A new certificate was enrolled by the client, and the client was now successfully authenticating!

The failure was terribly non-graceful to say the least. It didn’t give any idea what was actually going on. You would think that Windows handling a certificate with a private key that is somehow inaccessible — corrupt, missing, who knows — would throw some sort of error worth investigating. Or, at least, give you a red “X” over the certificate in the store.
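In hindsight, certutil can expose this kind of private key breakage without any special logging. A quick sketch, assuming the machine certificate lives in the local computer’s MY store (run elevated; the serial number placeholder is yours to fill in):

rem list the store contents, including private key association results
certutil -store MY
rem attempt to re-associate a certificate with its private key
certutil -repairstore MY <SerialNumber>

If the private key is missing or corrupt, the -store output makes it far more obvious than the supplicant ever will.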

It turned out that this was a widespread problem on their existing 802.1X deployment with Cisco ACS that we were migrating from. As they were upgrading clients from Windows 7 to Windows 10, a certain model of laptop started failing authentication. It seems something in the upgrade process was causing the private key to become broken in some fashion.

So, the ultimate fix was to re-enroll all the client authentication certificates via GPO.

Hope this helps some other poor soul out there one day.

Setting up an iPerf3 Server on a Raspberry Pi

Reading time: 3 minutes

I was recently battling for server time when doing some internet-based performance testing against one of the publicly listed iperf3 servers. Unfortunately, iperf3 only supports one test at a time. This makes sense in order to provide full resources to an individual test for reliable results. In any case, I didn’t want to wait while smack in the middle of some heart-throbbing troubleshooting. So, I decided to set up my own at home and let internet traffic in to use it.

Updating the Raspberry Pi

First things first: I had barely touched my RPi in a while, so I needed to update Raspbian to the most recent distro, Pixel. This was easy enough, though time-consuming; the RPi package servers apparently don’t download all that quickly, and it took around a couple of hours all in all. This is a straightforward guide for it. But, it essentially comes down to:

sudo apt-get update
sudo apt-get dist-upgrade

The article also suggests these for some beautification and common use packages:

sudo apt-get install -y rpi-chromium-mods
sudo apt-get install -y python-sense-emu python3-sense-emu
sudo apt-get install -y python-sense-emu-doc realvnc-vnc-viewer

Install iperf3

Next, let’s get to iperf3. This is pretty simple to install, as most linux packages are.

sudo apt-get update
sudo apt-get install iperf3 -y

Now, let’s set it to boot at system start:

sudo nano /etc/rc.local

Add this line somewhere before the exit 0.

/usr/bin/iperf3 -s &

Since this file may already have been customized on your RPi, it could look like anything, so add the line where appropriate. Mine was still default, so pretty much anywhere would do except after the exit is called. If you need to add any additional flags to your server instance, do it here.

reboot

ps -A | grep iperf
561 ? 00:00:00 iperf3
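Alternatively, on systemd-based Raspbian releases, a unit file handles restarts a bit more gracefully than rc.local. A minimal sketch (the unit name is my own), saved as /etc/systemd/system/iperf3.service:

[Unit]
Description=iperf3 server
After=network.target

[Service]
# add any extra server flags to this line
ExecStart=/usr/bin/iperf3 -s
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then enable and start it with sudo systemctl enable iperf3 followed by sudo systemctl start iperf3.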

Network Changes

If you are running behind a firewall or an ISP-provided modem, you will have to set up some NATs and security policy to allow the needed iperf flows. In my case, it was inbound TCP/5201 and UDP/5201; however, you can customize these ports if needed by specifying additional flags on both the server and the client, as sketched below.
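For instance, a quick sketch of moving the server off the default port; both ends must agree on the port:

# on the server: listen on TCP/UDP 5202 instead of 5201
iperf3 -s -p 5202
# on the client: target the same port
iperf3 -c [your-server] -p 5202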

The only thing that may be non-standard is UDP flows coming out from the iperf server after a client initiates a UDP test. The server seems to send a separate UDP flow on some ephemeral ports. This seems similar to the behavior of active mode FTP, but the firewall doesn’t inspect and compensate for it. So, you may need rules to allow the outbound UDP flows separately from the rest. For what it’s worth, I’m using a Palo Alto Networks firewall and it doesn’t seem to match the iperf App-ID on the initial outbound UDP flow, but it does for everything else. So, consider that if you’re having issues with your App-based rule approach.

Test

Of course, the final step is to test the thing out. So, get iperf3 on a client. This can be Windows, macOS, Linux, etc.; builds are available here for all OS types.

After that, you’ll want to run the following command to test it out using TCP. This example is on Windows.

iperf3.exe -c [your-server]
Connecting to host [your-server], port 5201
[  4] local 10.202.192.100 port 61586 connected to [your-server-ip] port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec  8.00 MBytes  67.1 Mbits/sec
[  4]   1.00-2.00   sec  8.00 MBytes  67.1 Mbits/sec
[  4]   2.00-3.00   sec  8.62 MBytes  72.4 Mbits/sec
[  4]   3.00-4.00   sec  9.12 MBytes  76.5 Mbits/sec
[  4]   4.00-5.00   sec  9.25 MBytes  77.3 Mbits/sec
[  4]   5.00-6.00   sec  9.50 MBytes  79.8 Mbits/sec
[  4]   6.00-7.00   sec  9.38 MBytes  78.5 Mbits/sec
[  4]   7.00-8.01   sec  9.50 MBytes  79.5 Mbits/sec
[  4]   8.01-9.00   sec  9.38 MBytes  79.0 Mbits/sec
[  4]   9.00-10.00  sec  9.50 MBytes  79.5 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-10.00  sec  90.2 MBytes  75.7 Mbits/sec                  sender
[  4]   0.00-10.00  sec  90.1 MBytes  75.6 Mbits/sec                  receiver
 
iperf Done.

Add a flag for UDP, and another to set the target bandwidth above 1 Mbps, which is the default in iperf3.

iperf3.exe -c [your-server] -u -b 500000000 -f m
Connecting to host [your-server], port 5201
[  4] local 10.202.192.100 port 65527 connected to [your-server-ip] port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec  55.4 MBytes   464 Mbits/sec  7095
[  4]   1.00-2.00   sec  59.9 MBytes   503 Mbits/sec  7670
[  4]   2.00-3.00   sec  59.7 MBytes   500 Mbits/sec  7636
[  4]   3.00-4.00   sec  59.1 MBytes   495 Mbits/sec  7567
[  4]   4.00-5.00   sec  58.7 MBytes   495 Mbits/sec  7510
[  4]   5.00-6.00   sec  60.7 MBytes   509 Mbits/sec  7774
[  4]   6.00-7.00   sec  60.3 MBytes   506 Mbits/sec  7717
[  4]   7.00-8.00   sec  58.1 MBytes   487 Mbits/sec  7436
[  4]   8.00-9.00   sec  62.9 MBytes   527 Mbits/sec  8048
[  4]   9.00-10.00  sec  57.0 MBytes   478 Mbits/sec  7291
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   592 MBytes   496 Mbits/sec  2.294 ms  75210/75740 (99%)
[  4] Sent 75740 datagrams
 
iperf Done.

Customization

I recommend checking out the iperf documentation for customization options. Additionally, it shouldn’t be too hard to cater this to any other Linux distro. Just tweak the package installation and startup config as needed per your distro flavor.

Issues?

Check out open issues on the GitHub project or do some google-fu if you run into any problems. Also, make sure your network/security is configured to support the needed flows.

I ran into one problem with UDP testing that came down to having multiple NICs available on the Pi. In my case, it was the eth0 and wlan0 interfaces. I originally had requests coming into the wlan0 interface (whoops) and the outgoing UDP flow would go out of the eth0 interface. This seemed to break the test in its own right, but would also have implications on firewall policy and NAT configurations if you’re behind a firewall.

I flipped my setup to direct requests to the eth0 IP address, and the issue didn’t present itself again. So, there must be some built-in NIC preferences that were messing with things. I did add some info to an existing bug addressing this on the iperf GitHub project, so hopefully it will get resolved at some point.

Allowing Mobile App Stores through a WLC Redirect ACL

Cisco ISE provides the ability to redirect users through an MDM workflow to assist in the on-boarding of mobile devices. Using integration of MDMs like MobileIron or AirWatch, you can allow registered and compliant devices onto your network, while automatically facilitating MDM enrollment for other devices. While the authorization policy for these workflows is relatively straightforward, the specification of traffic flows for redirection to the MDM portal can be somewhat challenging.

This challenge is primarily due to the need to access the Apple App Store or the Google Play Store to download the required MDM application(s) during on-boarding. While we may be okay with allowing people out to the internet, we still need to make sure we are capturing and redirecting web requests to the MDM enrollment portal. This is ultimately very similar to a central web authentication redirect, but with more access requirements. If users can’t download the needed app(s) during the on-boarding process, they will likely not be able to get fully on-boarded.

As this pretty much always takes place on wireless, the redirect ACL is limited to the feature sets of the Cisco Wireless LAN Controller platform. If we were stuck with IP-based filtering, it would be a full time job to hunt down all possible IP addresses. To make matters more difficult, we can’t just restrict internet usage to certain ports because the app stores rely on TCP 80 and 443, as well as other ports, for access to their servers. Enter DNS-based Access Control Lists on the Cisco WLC.

DNS-based ACLs allow us to specify URLs in our standard access control lists as additional permit statements (aka whitelisted URLs) in addition to standard IP-based filtering. Admittedly, the feature took a bit to figure out. Initially, I was testing by applying an ACL to the WLAN using the Advanced tab settings and testing access, which proved to be pretty much useless. Success finally became an option when applying it specifically as a redirect ACL via ISE authorization permissions. At that point, anything specified as a permit in the IP-based filter or as a URL on the list would bypass the redirection.

This is a good point to differentiate between a DNS-based ACL and a URL ACL (which seems to be available beginning with AireOS 8.3):

  • A URL ACL is a list of URLs that can be configured to act as a Whitelist or a Blacklist, and then be applied to an interface, WLAN, or Local Policy configuration. More about this feature can be found in this document. This feature seems to mostly be for enforcement of permitting or denying access to a list of URLs. This is implemented in the Security Tab > Access Control Lists > URL ACLs.
  • DNS-based ACL is an ACL that also uses DNS lookups to permit traffic to dynamic IP addresses based on FQDN. This is implemented in the Security Tab > Access Control Lists > Access Control Lists using the Add-Remove URL setting in the blue drop-down arrow.

The DNS-based ACL feature seems to have been implemented specifically for use with the redirect functionality according to the feature information and restrictions in the documentation, which states that it can only be used with the URL Redirect feature.

With DNS-based ACLs, a client in the registration phase is allowed to connect to the configured URLs. The Cisco WLC is configured with the ACL name, which is returned by the AAA server as the pre-authentication ACL to be applied. If the ACL name is returned by the AAA server, then the ACL is applied to the client for web redirection.

At the client authentication phase, the ISE server returns the pre-authentication ACL (url-redirect-acl). The DNS snooping is performed on the AP for each client until the registration is complete and the client is in SUPPLICANT PROVISIONING state. When the ACL configured with the URLs is received on the Cisco WLC, the CAPWAP payload is sent to the AP enabling DNS snooping on the client and the URLs to be snooped.

This DNS-based ACL feature was introduced in AireOS 8.0, and restrictions do exist which should be reviewed for your code version in the administration guide (link for 8.3 above). For instance, the documentation for 8.3 states that only up to 10 URLs can be configured; however, I found this to be true in 8.0, but not in 8.2+, which allowed me to enter up to 20 URLs. This is good because, for the Apple App Store and Google Play to work, we end up with 19 URLs. This may just be a documentation bug, but I would generally keep the URL list as short as possible.

URL ACLs, on the other hand, seem to have been introduced in 8.3. This feature may also be usable for redirection; however, I did not test it, as I found a solution using DNS-based ACLs instead.

Required ACL Filters

Apple App Store Filters

Continuing with the DNS-based ACL for redirect during MDM on-boarding, I found the following filters allowed full access to the Apple App Store:

  • Permitted IPs:
    • 23.0.0.0/8
    • 65.158.0.0/16
  • Permitted URLs:
    • albert.apple.com
    • gs.apple.com
    • itunes.apple.com
    • ax.itunes.apple.com
    • www.apple.com

The IPs warrant a bit more explanation. I started with just the URLs and had partial success (apple.com would load with stripped styling, and the app store still didn’t work). So, I did some digging into the DNS of apple.com to find that it was served by an Akamai Edge CDN. I tried being a bit more restrictive, but ended up using the /8 that all of the IPs belonged to: 23.0.0.0/8. I did identify that this allowed some other sites that are served by the Akamai CDN; I noticed www.cisco.com and usaa.com were also permitted. I will look to use more restrictive ranges, if possible, in the future. But, for now, this suits my requirements as the WLAN itself is still authenticated. It is restrictive enough to prevent users from using the network for bandwidth-intensive purposes (music, video, etc.). The 65.158.0.0/16 was added because I saw requests being dropped in the network while the App Store was suffering from crippled functionality. I attempted to find an FQDN that could be permitted, but the reverse lookups gave nothing, so I was stuck with this range for now.

Google Play Store Filters

The following filters were needed to provide access to the Google Play Store:

  • Permitted URLs:
    • android.clients.google.com
    • google.com
    • ggpht.com
    • play.googleapis.com
    • gvt1.com
    • www.googleapis.com
    • accounts.youtube.com
    • gstatic.com
    • .googleapis.com
    • .appspot.com
    • gggpht.com
    • android.pool.ntp.org
    • market.android.com
    • .google.co

No additional IPs were required.

Some of these may not be required, but they all got the job done for my testing. And, with all of the different flavors of Android out there, I figured: the more permissive, the better.
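For what it’s worth, the URL entries can also be pushed from the AireOS CLI rather than clicking through the GUI. A sketch, assuming a redirect ACL named MDM-REDIRECT (the ACL name is my own):

(Cisco Controller) >config acl url-domain add itunes.apple.com MDM-REDIRECT
(Cisco Controller) >config acl url-domain add play.googleapis.com MDM-REDIRECT
(Cisco Controller) >show acl detailed MDM-REDIRECT

Repeat the add command for each URL in the lists above.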

A few considerations

With the above filters in place for the redirect ACL, I did some testing on an iPhone 7 Plus running iOS 11.1 Developer Beta 2 and a Samsung Galaxy S7 running Android 7.0. During my testing I found the following:

  • Apps that were permitted:
    • iMessage
    • Google Maps
    • Google Hangouts
    • Gmail (Partially)
      • Native Mail App to Gmail Account successfully sent mail
      • Google Inbox on iOS could not send mail.
    • Google Drive file access
  • Apps that were not permitted:
    • YouTube (a video list was retrieved, but no ads, thumbnails, or actual videos were shown or played).
    • Google Duo
    • Apple FaceTime

Another consideration is that the documentation states that this functionality does not

Last but not least

Of course, beyond the app store filters, you will need to permit access to your MDM resources, DHCP, DNS, and other required services to facilitate device on-boarding. Below is my final Redirect ACL, where 10.202.192.0/24 is my server subnet representing access to my ISE servers, MDM servers, etc.

This post also does not cover the configuration of MDM integration in ISE, as documentation exists (and is referenced earlier in this post). I also don’t have an MDM instance in my lab to go through the full configuration outside of a production test environment. The above ACL would be applied as the Redirect ACL in a CWA MDM authorization profile.

Further Updates

When I inevitably continue to fiddle with and further refine this, I will post updates to this post and announce via Twitter (@somewolfe).

Oh, you wanted to save that NAT?

I recently stumbled across an “undocumented feature” on the Cisco Firepower Threat Defense managed by Firepower Management Center (FMC) that caused quite the frustration. When entering certain parts of the FMC, the “Save” and “Cancel” buttons won’t show up in the top right corner. The downside of this, of course, is that I can’t save whatever I was working on.

The most consistent occurrence of this was the NAT Policy. When managing the NAT Policy, about 9/10 times, probably more, the “Save” option just wouldn’t show up. If you inspect the HTML, you can find it; however, I couldn’t forcibly get it to show up. The only resolution to this, as far as I could find, is to leave the page and delete your browser’s cached files (sometimes multiple times). Afterwards, go back into the policy and ensure that the buttons are present before proceeding. Hopefully, you can catch this before getting too far into your changes.

You can see the area I’m talking about below enclosed in red:

[Screenshot: NAT Policy Buttons]

I, personally, saw this in 6.2.1 and 6.2.2. I’ve also found this bug which suggests it was present on some pages in 6.1.0, as well. I’ve also seen some Cisco presentations recently that call this out as something to be aware of. So, they are aware, but it is not yet fixed. Unfortunately, I am not sure when it will be fixed. Hopefully soon!

The Tale of the Eternal Packet

Reading time: 5 minutes

All of us went through that early networking training and heard about loops and how they can cripple a network. Depending on when you got your start in networking, you were then told that you’re not likely to see those types of problems due to Spanning Tree or other network improvements since back in the day. Well, every once in a while, if you try really hard, you can still come across a good ole loop to give you a run for your money. This wasn’t my first instance, but hopefully it will be my last. So, this is the tale of the eternal packet, or should I say an eternal packet. May it help some weary Googler in the future.

The year was 2017 and it was a rainy Tuesday morning working on setting up a new Disaster Recovery datacenter. There had been some intermittent issues on the VPN gateways that didn’t seem consistent at all. We identified that the CPU was running near 100% at all times unless it was shortly after a failover or reload. But, without fail, it would creep back up to 100% utilization as we did VPN testing. The VPN firewalls were a pair of ASA 5585 SSP-10s and we were testing with 1-3 users at any given time, so it was quite amazing that this was taking such a serious toll.

------------------show cpu usage ------------------
CPU utilization for 5 seconds = 99%; 1 minute: 99%; 5 minutes: 99%
------------------ show cpu detailed ------------------
Break down of per-core data path versus control point cpu usage:
Core         5 sec              1 min              5 min
Core 0       98.8 (93.6 + 5.2)  98.7 (94.1 + 4.6)  98.7 (94.2 + 4.5)
Core 1       99.8 (99.8 + 0.0)  99.7 (99.7 + 0.0)  99.8 (99.8 + 0.0)
Core 2       98.6 (93.4 + 5.2)  98.7 (94.1 + 4.5)  98.7 (94.2 + 4.4)
Core 3       99.6 (99.6 + 0.0)  99.8 (99.8 + 0.0)  99.8 (99.8 + 0.0)
------------------ show process cpu-usage sorted non-zero ------------------
PC         Thread       5Sec     1Min     5Min   Process
-          -        25.0%    25.0%    25.0%   DATAPATH-1-1822
-          -        25.0%    25.0%    25.0%   DATAPATH-3-1824
-          -        23.4%    23.6%    23.6%   DATAPATH-0-1821
-          -        23.4%    23.6%    23.6%   DATAPATH-2-1823
0x00007f6d80d38359   0x00007f6d60296fc0     2.1%     2.1%     2.1%   CP Processing
0x00007f6d822e400b   0x00007f6d601d7500     0.4%     0.1%     0.0%   ssh
0x00007f6d82e05e2f   0x00007f6d6029b840     0.1%     0.1%     0.1%   bcmCNTR.

As you can see, the DATAPATH-#-#### threads are each taking up nearly 25% of total CPU resources. Given it is a quad core CPU, that means they’re using nearly all CPU cycles in total. There are actually a handful of bugs regarding those threads taking up a lot of the CPU or causing a crash, so I worked through some of those at first. I downgraded to another code version, which seemed to take care of the problem; but given it had only been tested by a single person, it wasn’t really tested thoroughly enough. Once this came back as a problem, it presented the same way: users connecting to VPN would connect slowly, and oftentimes authentications would time out, preventing successful logons. This was all a symptom of the hogged CPU.

Upon further inspection, the inside interface was showing an incredible amount of traffic throughput. Keep in mind that at this time, there were only a few of us testing on this firewall.

------------------ show traffic ------------------
inside:
 received (in 590906.900 secs):
  194253969761 packets 15095372560634 bytes
  328004 pkts/sec 25546000 bytes/sec
 transmitted (in 590906.900 secs):
  194262351340 packets 15104470809003 bytes
  328004 pkts/sec 25561005 bytes/sec
 1 minute input rate 361141 pkts/sec, 24074337 bytes/sec
 1 minute output rate 361140 pkts/sec, 24074091 bytes/sec
 1 minute drop rate, 0 pkts/sec
 5 minute input rate 355476 pkts/sec, 23687711 bytes/sec
 5 minute output rate 355476 pkts/sec, 23687331 bytes/sec
 5 minute drop rate, 0 pkts/sec

Yep, that’s about 24 MB per second over both the one minute and five minute sampling windows (as well as the 590906 second window, about 6.8 days). That’s a whole lot of data for a firewall that’s not in any real traffic path besides a couple of VPN users who weren’t using it for any extended period of time. I took a look at the connections that the firewall had established, and there was a ton of traffic going to or from IP addresses that were part of our VPN address pool ranges. (Unfortunately, I don’t have the show conn output saved.) Here’s the kicker, though: they weren’t actually connected anymore. So, here we are looking at a bunch of ghost addresses generating an enormous amount of traffic somehow.

The key to answering this question lies in the design of the network surrounding the firewall having the issues. We have what I would call a “VPN Enclave” in this scenario. We have the requirement for a few firewalls to terminate site-to-site VPNs and then a pair for remote access VPNs (which is the pair having CPU issues). Behind these firewalls sits yet another firewall that acts as an aggregator for the VPN firewalls in order to simplify management; the VPN firewalls could be managed externally, while the aggregate provides a single, internally owned point of management. This may be overkill, but let’s not talk about that here, as this is a “the powers that be have spoken” kind of situation.

The problem ultimately was due to a routing loop that occurred when traffic was destined for the IP address of a VPN client that was no longer connected. When a VPN client connects, the assigned IP address gets injected into the routing table as a static /32 route. The following output shows this (these are not configured static routes):

VPNFW/act/pri# show route | i 10.100
S        10.100.0.10 255.255.255.255 [1/0] via 22.22.22.1, outside
S        10.100.0.11 255.255.255.255 [1/0] via 22.22.22.1, outside

When the clients were connected, the routing would look like this:

The VPN pools were not explicitly specified to route outside via a static route. While clients are connected, this isn’t a problem, as the longest prefix match (the injected /32) is used. Otherwise, the 10.0.0.0/8 inside route is preferred, as it is more specific than the default route. With no VPN client routes present, routing would look like this:

Of course, you’re probably thinking, “What’s the big deal? Disconnected clients obviously don’t send traffic.” Well, right you are; however, these clients would apparently fire off a steady stream of NetBIOS, amongst other things, right before disconnecting. So, the RAVPN firewall would route the traffic inside towards datacenter/campus resources. The aggregate firewall would then route the return traffic outside to the RAVPN firewall for the VPN subnets. But by then the client was disconnected, so the RAVPN firewall, no longer holding the /32 injected by the VPN process, would match the inside supernet and route the traffic back to the aggregate firewall. It would, of course, route it right back.

Now, this normally may not be a problem, because the packet would bounce around the network until the TTL expired, limiting the impact. This case, however, was a great reminder of the fact that firewalls do not decrement the TTL of a packet when passing it through. Given that the devices bouncing the packet back and forth were two firewalls connected via an L2 segment, the TTL never decremented, resulting in an eternal packet.

The lesson here, long story short, is to hard-code static routes for your VPN pool subnets pointing out of the outside interface. This will prevent any inclusive supernets routing out of other interfaces from wreaking havoc like in my situation. The concluding routing would look like this, with a “floating static route” of sorts to catch the lingering VPN traffic when it returns.
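In configuration terms, that boils down to something like this on the RAVPN ASA. A sketch, assuming a 10.100.0.0/24 pool and the 22.22.22.1 outside next hop from the earlier output:

! pin the whole VPN pool out the outside interface
! the injected /32s for connected clients still win on longest prefix match
route outside 10.100.0.0 255.255.255.0 22.22.22.1 2

The administrative distance of 2 is optional here; the important part is that the pool prefix always points outside, so returning traffic can never fall back to the inside supernet.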

You may be using VPN pools that aren’t from your inside network’s ranges, and that would prevent something like this from happening as long as the default outside route catches the traffic. As a rule of thumb, though, always remember to hard-code that static route, as it is generally a good way to prevent unforeseen issues, especially in a situation like this.

The Importance of “I Don’t Know”

Reading time: 2 minutes

When interviewing people, one of the biggest things I’m looking for is someone to say the words “I don’t know.” It doesn’t have to be that exact phrase, but it should be something similar: “I’d have to do some more research,” or, “I’m not sure, but I’ll look further into it.” It doesn’t necessarily have to be the person being interviewed, either; it could even be the person actually conducting the interview.

When the interview starts, I will ask the candidate what 3-5 things they would call their biggest strengths from a technical perspective. From those, I typically choose topics that I have mastery in. I want to be able to dig as deep as possible in these areas. I’ll ask them to discuss a time that they ran into a problem with that technology. Then I’ll let the conversation go back and forth as much as possible until one of us says I don’t know or something similar.

Ideally, I’m looking for the candidate to say I don’t know. If they are able to do so, it shows me that they are comfortable recognizing and acknowledging to others that they don’t know something, particularly in an area that they see as one of their biggest strengths; that their knowledge does have limitations and that they don’t know it all. On the job, and in general, this will allow them to identify limits and gaps in their knowledge and be comfortable with having them. They can then go and bridge the gap or expand the limit through research or practice. In contrast, people may get defensive by making things up or suggesting alternative methods and approaches. Making things up is obviously bad and can cause long-term pain should a project move forward with incorrect assumptions. Suggesting alternatives can be fine; however, if they’ve identified their limits or a gap, they shouldn’t be afraid to say so and look further into it to make sure their suggestions consider all the information and are rock solid. Don’t allow your ego to make your life more difficult or lead you to spout factually incorrect information. As a technical consultant, that should be considered a cardinal sin.

Depending on the level of position I’m hiring for, when they say that they don’t know could be good or bad. If I’m looking for someone to fill a senior position and they’re saying “I’m not sure” early on in the conversation about something they consider one of their strengths, then they may not be the person I’m looking for. However, if I’m hiring for a more junior position, they could be the perfect fit. I’ll typically run through a few of the candidate’s stated strengths using this technique and measure where those skills lie in proficiency.

To close, I’ll often take a topic that wasn’t listed as a strength, but that they may know something about, or that they know nothing about but that has parallels to things they are familiar with. This is to see how exactly they approach talking about and analyzing things that they don’t have mastery over. In technology, a lot of things are related or share fundamental concepts. I want to watch how they process this new information, and how they approach understanding things on a whim when they don’t have pre-existing expertise. I also want to see how they act and perform when their confidence may be lower.

Of course, the person could always know more than I do and push me to the limits of my knowledge and expertise. In that case, it showcases a level of mastery that would obviously be of use to me and to the team. It also has the added benefit of me learning something new.

Cisco ISE and Client Certificate Chain with Any Purpose EKU

Reading time: 3 minutes

I recently came across quite an interesting issue during a Cisco ISE implementation using EAP-TLS. The deployment used EAP-FAST to perform EAP Chaining with the Cisco AnyConnect NAM module; however, the inner method was EAP-TLS, and that’s where the problem resided. My authentication was failing due to “unsupported certificate in client certificate chain.”

Ultimately, the problem was that the client was sending its full certificate chain along with its authentication request, and in that chain, one of the Certificate Authority certificates contained an Extended Key Usage (EKU) attribute of “Any Purpose”. Note that the EKU field is also known as the Enhanced Key Usage or Application Policies field, mostly in Microsoft lingo, and most references use the terms interchangeably. Whenever this attribute was present in the client’s certificate chain, the authentication would fail with “unsupported certificate in client certificate chain.”

The only resolution that I could find is to use a different CA without the “Any Purpose” EKU value. If you have a multiple-tier PKI, you can simply issue another Subordinate without the value specifically for client authentication certificates (assuming the Root is not the one with the problematic EKU). I would expect that this is an uncommon occurrence given I’ve only run into it once over the years. In fact, the default Subordinate CA template on a Microsoft PKI implementation does not contain EKUs, so you would have to go out of your way to include that for a specific purpose.

Use of EKUs in a CA certificate doesn’t even really seem to be valid practice. RFC 5280, which covers X.509 v3 digital certificates, specifies that EKU attributes should only be used on “end entity” certificates.

4.2.1.12. Extended Key Usage

This extension indicates one or more purposes for which the certified public key may be used, in addition to or in place of the basic purposes indicated in the key usage extension. In general, this extension will appear only in end entity certificates.

End entities are the systems that certificates are issued to for use in certificate-enabled functions such as SSL/TLS, authentication, digital signatures, etc.; they are not used for issuing certificates. Certificate Authorities issue end entity certificates and are therefore not end entities themselves. The EKU field specifies for what purposes those end entity certificates should be used. For example, “Client Authentication” and “Server Authentication” are EKU values commonly used for EAP-TLS. All a CA certificate really needs are the standard Key Usage values such as Certificate Sign, CRL Sign, and Digital Signature.
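A quick way to spot an offending EKU is to dump each CA certificate in the chain and look for the extension. A sketch, assuming a PEM-encoded certificate saved as ca.pem:

# print the certificate details and flag any EKU values present
openssl x509 -in ca.pem -noout -text | grep -A1 "Extended Key Usage"

No output means the certificate carries no EKU extension, which is what you want to see on a CA.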

The only scenario I can see where “Any Purpose” would make its way into a CA certificate is if whoever stood it up intended to use the issuing certificate as an end entity certificate, as well. For instance, if I am hosting a web site on my CA server to provide my CRL and CA chain via HTTPS, I could add the “Server Authentication” purpose to my issuing certificate; however, that shouldn’t really be the way it’s done. To best protect the integrity of your issuing certificate, it should only be used for issuing. If you need an SSL certificate for that site, it would be best to issue a separate server certificate from the issuing CA specifically for that purpose rather than sharing the CA certificate across multiple purposes.

To validate this, I created a new three-tier PKI. The offline Root CA was used to issue an Issuing CA; neither of those certificates had any EKU, and both used the standard templates in a Microsoft PKI. I then issued a Test Subordinate CA from the Issuing CA that contained the “Any Purpose” EKU. This chain can be seen below:

[Screenshot: Test PKI Path]

You can see the EKU of the Test Subordinate CA contains the “Any Purpose” EKU:

[Screenshot: Test Sub EKU]

I attempted an EAP-TLS authentication using a certificate from the Test Subordinate CA and received a failure due to the unsupported certificate. I then re-issued the certificate from the standard Issuing CA, and the authentication succeeded. The only difference between the two scenarios was the presence of the “Any Purpose” EKU.

There is an ongoing TAC case exploring whether this should be a bug or is expected behavior. It is currently pending the Cisco ISE Business Unit’s input. I will update this post when more information is provided.

Update on 10/11/2017: A bug has been opened for this as a “documentation” bug: CSCvg10726. The developers provided input to the TAC case stating that this is expected behavior. This suggests there will be no change to this in the future, so keep an eye out for this EKU in CAs being used for client authentication, or when building new CAs for client authentication.

Failed Machine Authentication with AnyConnect NAM on Windows 8+

Reading time: 2 minutes

Starting with Windows 8, Microsoft changed a default security setting so that third party software can only access the machine’s domain password in an encrypted format. This results in a third party supplicant sending an encrypted password string to the domain, which then compares it against an unencrypted password string. The authentication then fails with the error “Machine authentication against Active Directory failed due to wrong password.” If you look in the authentication steps, you will see the 24344 RPC Logon request failed - STATUS_WRONG_PASSWORD, ERROR_INVALID_PASSWORD,pclthp10156.domain.com message. This traditionally indicates that the machine account has expired, or that the machine password has otherwise fallen out of synchronization with the domain, and could easily be fixed by rejoining the machine to the domain.

In this case, however, it is due to the fact that ISE sends the provided password string to the domain unchanged, and when the domain compares the encrypted password string to its unencrypted copy of the password, it does not match. The string you provided does not match up with the string I have? That’s an invalid password! Well, that may technically be true, but it’s also misleading.

The issue is that the AnyConnect Network Access Manager (NAM) is a third party agent, so it’s using that encrypted password format. A bug covering this behavior can be found here. The only workaround for this right now is a registry edit that allows third party agents to access the password in an unencrypted format. The following registry edit will make the change:

  1. Open regedit.exe.
  2. Go to HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Lsa.
  3. Add a new DWORD(32-bit) Value with the name LsaAllowReturningUnencryptedSecrets.
  4. Modify the new key to set it to 1.
  5. Click OK.
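If you need to push this to many machines, the same change can be scripted or deployed via a GPO preference. The one-liner equivalent of the steps above:

rem allow third party software to retrieve the machine password (requires admin)
reg add HKLM\System\CurrentControlSet\Control\Lsa /v LsaAllowReturningUnencryptedSecrets /t REG_DWORD /d 1 /f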

Unfortunately, the bug is listed as “Fixed” despite it still being a problem. It may not be Cisco’s problem, but what’s the real resolution besides allowing all third party applications to access this value in its native hash format? This remains unclear. As more and more enterprises move to Windows 10, this will likely cause more problems. Another resolution is to use the Windows native supplicant wherever possible, unless your use case explicitly requires the AnyConnect NAM as a supplicant, e.g. EAP Chaining, advanced management of wired and wireless networks, etc. The native supplicant will correctly handle the password without a registry change.

That being said, if you are using the NAM to perform machine-based authentication in any way, such as PEAP or a PEAP inner method for EAP-FAST, you are going to want a plan for this moving forward. At this time you only have two options: the native supplicant, or the registry change. Hopefully, a behavior change to the NAM will fix this in the future, but with this existing since AnyConnect 3.x, and with 4.5 just having been released, I don’t see it as a near-term fix. Though, to be fair, this could very well be because the change is something that the NAM can’t work around and Cisco is stuck in this situation.

VMware NSX Distributed Firewall (DFW) Viewer

Reading time: 2 minutes

I’ve spoken to a few people who use VMware NSX with the Distributed Firewall (DFW). Most of them, myself included, had some gripes about the NSX interface. While its web-driven interface is an improvement on many firewall managers, it leaves something to be desired. At many points in its use, it’s easy to find yourself falling deep down a menu-clicking hole while trying to check the contents of a security group or modify some object that you’re using in the policy. And, of course, it all takes time to load, navigate, and, ultimately, end up in the right place.

For some reason, this just isn’t as readily accessible as one would hope. In general, the viewing and use of the policy is pretty manageable, though it is oftentimes a bit clunky and quite slow. I can certainly see it becoming easy to quickly get to a point where the size of the policy is completely unmanageable, even with the use of sections. Ultimately, I decided that with the NSX API, I could put something together that was snappier, more usable, and solved some specific issues colleagues were running into.

I, quite creatively, called it the NSX DFW Viewer. Okay, maybe it’s not all that creative. However, it’s a responsive web application that allows for viewing the DFW policy and filtering it using a simple search field. The GitHub repository has some good information to get you started.

[Screenshot: NSX DFW Viewer]

Ultimately, however, you just put the repository on a web server running PHP and you’re off to the races. The only thing you need to update is the $nsx_host variable in the ajax/init.php file to reflect your NSX Manager’s IP address. After that, you will be asked for credentials, on a per-session basis, when you load the page. This can be seen below, along with a couple of attempts to provide unauthorized credentials.

As you can see, it simply uses the browser-based authentication prompt using PHP’s $_SERVER['PHP_AUTH_USER'] and $_SERVER['PHP_AUTH_PASS'] variables. It then performs a REST API call using those credentials, looking for an HTTP 200 OK before granting access. If it doesn’t get one, access is denied. I had previously used a variable in init.php to store the username/password and then base64 encoded it to perform HTTP Basic Auth; however, this was inherently insecure, as the credentials were stored locally. Ultimately, HTTPS should be used on the server to secure the transport of the credentials. The REST API calls themselves are performed over HTTPS.
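Under the hood, the credential check is nothing exotic, and you can reproduce it with curl. A sketch, assuming NSX-v’s DFW config endpoint and a placeholder manager address (swap in your own NSX Manager and credentials):

# returns the DFW policy XML and a 200 status code if the credentials are valid
curl -k -u 'admin:password' -o dfw.xml -w '%{http_code}\n' \
  https://nsx-manager.example.com/api/4.0/firewall/globalroot-0/config

This is presumably similar to what the viewer calls to pull the policy once you’re authenticated.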

Once you’re in the app, you can see your firewall policy broken down by section (specifically, Layer 3 Section). By default, the sections are collapsed to allow an at-a-glimpse view of longer policies. If you have empty sections, they can be hidden using the available checkbox. Here’s a simple demo:

I had some internal debate regarding the search capabilities. I wanted it to be simply implemented but still effective. I landed on it taking whatever text is entered in the search field and filtering out any section or rule information that does not contain the query. In other words, if I were to enter “quarantine,” only a section containing “quarantine” or a rule containing “quarantine” in ANY of its fields would be displayed. This includes any fields that are not actually displayed such as rule ID, section ID, or object ID. This, ultimately, allows you to filter by a unique identifier, if needed. At this time, I found this to be the most useful way to implement this.

The project page is here: https://github.com/rnwolfe/nsx-dfw-viewer. I’m open to any feedback on the search or any part of this! I would really like for people to try it out, and let me know if I can customize it to address any specific issues you would like. Feel free to contribute via GitHub!