Living in the underlay

Mainly Networking, SDN, Automation, Datacenter and OpenStack as an overlay for my life

Thursday, May 10, 2018

SD-WAN: Come on guys, let's use that data!

7:06 PM
I had the opportunity to participate in the latest MPLS SDN NFV World @Paris 2018, and I have to say I saw a lot of SD-WAN gear around and got to talk with some technical experts about their solutions.

The main goal of this post is to point out some considerations I came across during this week. If you feel that your company is the exception to this list, please feel free to ping me and let's have a talk :)

That being said, these are the missing points I found and want to summarize in order to help those companies with their roadmap development:

- Big Data, really? - A lot of vendors claim that their SD-WAN solution has built-in analytics and a big data backend that collects plenty of information useful for advanced reporting and metrics. That is not a lie, for sure, but the issue I'm still finding is that no single vendor is using that data to feed back into the system. What does this mean? Imagine that you have a lot of information about the BW usage of customer links and (after a while) you can basically do two simple things:

  1. Detect high usage, packet drops, service degradation, unmet SLAs, etc. in real time
  2. Predict or even prepare the network for future loads
Well, that's great, but if you're only showing information there is no feedback in the cycle and your system can't adapt; even if you're showing us a degradation, NO action is taken. Think about self-service customer systems that show end users that their links are degraded and offer an action to remediate the issue: that is real usage and integration of the monitoring/telemetry of your system, and it provides genuinely useful features. That being said, we can tackle our second point, prediction, and why not start thinking about AI. The information these systems generate should not be used only for fancy reporting; the most important value of the data is understanding it in context, which for us basically means being able to use it to predict network patterns and take actions based on the behavior (yes... if you're thinking about intent-driven networks, this goes down that path; in the end we want the network to behave in a particular way). Unfortunately this is not happening in **any** SD-WAN solution I saw.

- Analytics and controller disaggregation - More on the topic I've just described: most solutions present a controller UI and platform plus a completely different platform for analytics and monitoring, and in some cases they are not even integrated into a common portal (really). But the key point here is not that they aren't under the same UI; the important missing feature is communication between those two systems (this is surely one of the causes of the missing feedback we just mentioned). Note that only some of them fail on this point; others have the analytics module already connected to the controller but still make no valuable use of that data. Maybe a key point to mention here is that analytics systems for networks are still being modeled in a legacy way, and there is no way to ask about behavior or to pull data about policy compliance (some of them have solved this by creating service policies that are met, or not, depending on whether some defined thresholds are reached; this is certainly not a meaningful API but, well, at least it's a start).

- Interop - Last but definitely not least is controller interop, and here I want to point out two completely different issues related to interoperability:
  1. SDN vs SD-WAN controllers: Let's imagine you're living without any worries relying on your SDN controller; now the little brother comes into play and guess what... that new SD-WAN controller doesn't have a clearly defined way to talk to your running SDN-C. Moreover, you realize that some vendors use their own specific way to do it, some haven't even considered the use case, and others rely on the same old BGP to interact with each other (however, I was not able to find any doc supporting this: #Versa or Juniper SD-WAN)
  2. Tie yourself to a vendor and live with that - Guess what, you have chosen a specific vendor, you have their hardware running and their SD-WAN controller software, but now you realize that your hardware should not be tied to your software solution (tightly coupled solutions were being avoided long before object-oriented programming). Well... good luck there: once you have chosen your vendor, they are not pushing or developing a white-box roadmap. At this early stage they are closing the doors to develop their solution and let it grow until the market demands open-source solutions (we have lived this in the past, right?)
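To make the feedback cycle from the first point a bit more concrete, here is a minimal Python sketch of analytics data driving an action instead of just a report. Everything in it is hypothetical: the `LinkSample` fields, the SLA thresholds and the action names are assumptions made up for illustration, not any vendor's API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LinkSample:
    """One telemetry sample for a customer link (hypothetical fields)."""
    link: str
    utilization: float  # fraction of bandwidth in use, 0.0 to 1.0
    loss_pct: float     # packet loss in percent


@dataclass
class SlaPolicy:
    """Thresholds that define 'SLA met' (assumed values, tune per service)."""
    max_utilization: float = 0.85
    max_loss_pct: float = 1.0


def evaluate(sample: LinkSample, policy: SlaPolicy) -> Optional[str]:
    """Return a (made-up) remediation action name, or None if healthy."""
    if sample.loss_pct > policy.max_loss_pct:
        return "steer-traffic-to-backup"
    if sample.utilization > policy.max_utilization:
        return "request-bandwidth-upgrade"
    return None


def feedback_cycle(samples: List[LinkSample], policy: SlaPolicy) -> Dict[str, str]:
    """Close the loop: map each degraded link to the action to trigger."""
    actions = {}
    for sample in samples:
        action = evaluate(sample, policy)
        if action is not None:
            actions[sample.link] = action
    return actions


if __name__ == "__main__":
    telemetry = [
        LinkSample("branch-1", utilization=0.40, loss_pct=0.0),
        LinkSample("branch-2", utilization=0.95, loss_pct=0.2),
        LinkSample("branch-3", utilization=0.50, loss_pct=3.5),
    ]
    print(feedback_cycle(telemetry, SlaPolicy()))
```

The point is not the thresholds themselves but that the output of the analytics side is an action the controller (or a self-service portal) can execute, which is exactly the link that is missing today.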
Well, those are pretty much my thoughts on this topic. Feel free to think otherwise and prove the opposite; I'm open to hearing back from you and discussing this :)


Wednesday, April 11, 2018

Embrace API not SDK, but don't reinvent the wheel

10:20 AM
I'm having a lot of discussions with my students and colleagues regarding this topic, and I want to start by clarifying a question that someone asked a while ago:

"Why do you teach Cobra SDK for CCIE DC curricula instead of pure raw REST API?"
The answer was quite simple: "It is in the official curriculum/blueprint and you need to master it", but it has some drawbacks and I feel that some other things need to be considered here:


  • Embrace API: The freest way to operate is to use the pure REST API and automate the service as you wish (in the end we want to cover some use case that serves some purpose and that can be considered a service for a specific end user: engineer, customer, etc.). If you understand how the API is structured and how it works, it will be quite easy to automate that specific gear; moreover, you can even start thinking like that box/system ;) [1]
  • SDK is a huge toolbox, use it wisely: The SDK just gives you a toolbox to quickly start coding without messing around with a **huge** API set, but even if it seems the easy way there are some caveats to consider. The SDK packages common operations into functional boxes (methods) for you to consume; however, that doesn't mean that all your pretty weird use cases will be covered by the SDK [even if they say that the SDK reflects the API, I've not met a single SDK from a network vendor that covers their full set of API operations, and let's not even start talking about documentation...].
  • Don't reinvent the wheel: If you're planning to move towards a pure REST API model I'm happy for you (really, not sarcastic), but consider: are you going to encapsulate the specific device in your own package/model? If the answer is yes, for each specific device, my first response would be "why not use the SDK in the first place?", since you are building the same thing with less coding power than the vendor. If the answer is yes, not for a specific device but for a specific role in the network, we are starting to talk about good design choices ;) Don't tie your code to a specific gear, make it independent, and more on this...
  • Don't repeat the past: If you're planning to use the pure REST API to talk to a network device and you're thinking the legacy way, you will end up with multiple API calls (as many as the lines of CLI commands you would enter when configuring the device by hand). So basically you will be doing old-school networking in a fancy way. Why not start asking devices for a desired state? (intent)
  • Code a service not a function: Creating a [script/program/api] to do a specific task such as creating a VLAN, trunking it or enabling a protocol, or even to do a correlated series of tasks, is not the same as creating a service. The aim of your code should be service automation, since network automation is well covered by many sources but service automation is specific to your business/use case (please keep in mind that even if there is a lot of PowerPoint $#$%$# around, the probability that your use case is not a standard, covered one is close to 99.99999%)
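As a rough sketch of the last three points (role instead of device, desired state instead of a list of commands, service instead of function), the Python below describes a service as intent and computes what has to change. The `ServiceIntent` model, the role names and the shape of the `current` state are all assumptions invented for this example; a real implementation would read the current state from each device's API.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ServiceIntent:
    """Desired state of a service, expressed per network role (hypothetical)."""
    name: str
    vlans_by_role: Dict[str, Set[int]] = field(default_factory=dict)


def plan(intent: ServiceIntent, current: Dict[str, Set[int]]) -> Dict[str, dict]:
    """Diff desired vs. current VLANs per role and return the changes needed."""
    changes = {}
    for role, wanted in intent.vlans_by_role.items():
        have = current.get(role, set())
        to_add = wanted - have
        to_remove = have - wanted
        if to_add or to_remove:
            changes[role] = {"add": sorted(to_add), "remove": sorted(to_remove)}
    return changes


if __name__ == "__main__":
    # The service says what it wants; it never says "enter these CLI lines".
    intent = ServiceIntent("web-tier", {"leaf": {10, 20}, "border": {20}})
    current = {"leaf": {10, 30}, "border": {20}}
    print(plan(intent, current))
```

Only the last step, turning the computed changes into REST calls or SDK methods, is vendor specific, so that is the only part that would need rewriting when the gear changes.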
All that being said, I really think that more needs to be done to correct the direction some organizations are taking towards network automation. Moreover, intent-based thinking is definitely needed so that we don't repeat history and do legacy network automation in the new era.



By the way, for those who still ask me whether I provide a network automation course for DC or generic network programming: the answer is yes, and it is not based on any SDK. It's not part of the CCIE DC training and is covered in the Network Programmability course (for more info ping me directly, on LinkedIn or through ie-bootcamps :) )


[1] Some vendors, and I want to remark that shamefully it is only some of the full network vendor ecosystem, build their UI or end-user systems purely on top of their own boxes' REST APIs.

Thursday, July 6, 2017

Juniper DC Reading list

11:41 AM
One of my colleagues just asked me about the recommended reading list for the Juniper DC track (in particular what I used to clear JNCIP-DC a few weeks ago). Here is a complete list of free resources that you can use to prepare for the exam. I also recommend (if you don't have any real/lab experience with QFX/EX for VXLAN setups, and especially with VCF) doing some labs with vQFX (you can try out EVE-NG, which is *highly recommended*)

Here it is:

Juniper Networks EVPN Implementation for Next-Generation Data Center Architectures - https://www.juniper.net/assets/us/en/local/pdf/whitepapers/2000606-en.pdf

Virtual Chassis Fabric Feature Guide - http://www.juniper.net/documentation/en_US/junos/information-products/pathway-pages/qfx-series/virtual-chassis-fabric.pdf

Comparing Layer 3 Gateway & Virtual Machine Traffic Optimization (VMTO) For EVPN/VXLAN And EVPN/MPLS - https://www.juniper.net/documentation/en_US/release-independent/solutions/information-products/pathway-pages/solutions/l3gw-vmto-evpn-vxlan-mpls.pdf

Clos IP Fabrics with QFX5100 Switches - https://www.juniper.net/assets/cn/zh/local/pdf/whitepapers/2000565-en.pdf

Virtual Chassis Fabric Best Practices Guide - http://www.juniper.net/documentation/en_US/release-independent/vcf/information-products/pathway-pages/vcf-best-practices-guide.pdf

EVPN Control Plane and VXLAN Data Plane Feature Guide for QFX Series Switches - https://www.juniper.net/documentation/en_US/junos/information-products/pathway-pages/junos-sdn/evpn-vxlan.pdf

Understanding Zero Touch Provisioning - https://www.juniper.net/documentation/en_US/junos/topics/concept/software-image-and-configuration-automatic-provisioning-understanding.html

Configuring Zero Touch Provisioning - https://www.juniper.net/documentation/en_US/junos12.3/topics/task/configuration/software-image-and-configuration-automatic-provisioning-confguring.html

Configuring Zero Touch Provisioning in Branch Networks - https://www.juniper.net/documentation/en_US/release-independent/nce/information-products/pathway-pages/nce/nce-151-zero-touch-provisioning.pdf


Another great book which I just finished reading is "Building Data Centers with VXLAN BGP EVPN"; it is Cisco NX-OS oriented but provides an amazing background on how VXLAN BGP EVPN fabrics work.

HTH,

Sunday, May 21, 2017

TSHOOT Tips: ELAM Usage on Cisco ACI

9:11 PM
I have been using this quite a lot over the past weeks and I think it is a good resource to share with everyone playing around with Cisco ACI. When it comes to troubleshooting and understanding packet flow inside the fabric, ELAM is a great tool.

So, what is it?

ELAM stands for Embedded Logic Analyzer Module. It is logic present in the ASICs that allows us to capture and view one or more packets that match a defined rule, out of all the packets traversing the ASIC. ELAM is not new at all; some of you may remember it from the CAT6500, and that's OK, it is the same logic, and the same again on the N7K (for the youngest?).

And... what's new?

Essentially the concept is still the same, and we just need to focus on understanding the architecture inside the ASICs on leafs and spines to fully apply it.

The Cisco ASIC data path is divided into ingress and egress pipelines, where two ELAMs are present (see figure) at the beginning of the lookup block.



As we can see in the picture, before we can use ELAM to capture a packet we must be sure that the packet is sent from the BCM ASIC to the Northstar ASIC. ELAM operates only in the Northstar (on leafs; on spines it takes place on Alpine), so any packets that are locally switched in the BCM ASIC will not trigger the ELAM. This is important since in some scenarios the packet will not reach Northstar and will not trigger an ELAM event (we can cover this in a future post about PL-to-PL traffic on the ACI fabric :) )

So, assuming that our traffic will be processed by Northstar, we need to configure our ELAM instance. First of all it is good to know which kinds of rules we can configure for each pipeline; these are also referred to as "select lines", and the following are available:

Input Select Lines Supported 
3 - Outerl2-outerl3-outerl4
4 - Innerl2-innerl3-innerl4
5 - Outerl2-innerl2 
6 - Outerl3-innerl3
7 - Outerl4-innerl4 

Output Select Lines Supported 
0 - Pktrw 
5 - Sideband

With this in mind we can configure our ELAM instance. First of all, it is always good to have an image to understand the whole process of what we need to do:


On INIT we choose the ASIC and pipeline in which the capture should take place; CONFIG refers to configuring the rule that matches the packets; ARM is like arming the bomb :) but in this case we arm our packet capture so that it triggers once the rule defined in the CONFIG step has a match; after this we READ the captured data and RESET to start over :)

Now let's dig into the packet capture; we will refer to this topology for the capture.

ELAM Example


This image is taken from a Cisco Live presentation on ELAM, but we will focus on LEAF4 only. Traffic will travel from VM1 to the EP on the right, going toward Northstar (at 1), and this example is also useful to show how this behaves on Alpine.

We will arm the ELAM on Leaf 4 to capture a packet coming from EP1 (the one on the left side, directly connected to Leaf1). In this example we use in-select 3, which means the fields we can match on are outer L2, L3, or L4. We also use out-select 0.



This will work for a basic ELAM packet capture. As we mentioned, we need to configure (the CONFIG step of the ELAM) one aspect of the trigger to match on. For this example we will use the SMAC of the locally attached endpoint:





To see the ELAM state, the status command can be used. Essentially three different states can be found:
- Triggered: a packet has been detected as matching the trigger, and that packet is available for analysis.
- Armed: no packet has been detected as matching the trigger yet, and ELAM is actively looking at packets for a match.
- Initialized: the ELAM is available for triggers to be configured, or to be armed with the start command. It is not currently attempting to capture a matching packet.

Once ELAM is triggered, the packet can be viewed for analysis with the report command. The report shows the relevant header fields of the packet (note that it will not show the complete payload). Once this is done we can restart the process with the reset command.
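The INIT / CONFIG / ARM / READ / RESET cycle and the three states reported by the status command can be modelled as a small state machine. This is only a Python sketch to reason about the workflow: the class and method names are invented, it does not talk to a switch and it does not reproduce the real ELAM CLI. It also assumes a single captured packet per arm and that reset clears the trigger.

```python
from enum import Enum
from typing import Dict, Optional


class ElamState(Enum):
    INITIALIZED = "Initialized"
    ARMED = "Armed"
    TRIGGERED = "Triggered"


class ElamModel:
    """Toy model of one ELAM instance (hypothetical, not the real CLI)."""

    def __init__(self, in_select: int, out_select: int) -> None:
        # INIT: pick the select lines; valid values come from the lists above.
        if in_select not in (3, 4, 5, 6, 7):
            raise ValueError("unsupported input select line")
        if out_select not in (0, 5):
            raise ValueError("unsupported output select line")
        self.in_select = in_select
        self.out_select = out_select
        self.trigger: Dict[str, str] = {}
        self.captured: Optional[Dict[str, str]] = None
        self.state = ElamState.INITIALIZED

    def config(self, **fields: str) -> None:
        """CONFIG: set the header fields the trigger must match."""
        self.trigger = dict(fields)

    def arm(self) -> None:
        """ARM: start looking at packets for a match."""
        if not self.trigger:
            raise RuntimeError("configure at least one trigger field first")
        self.state = ElamState.ARMED

    def see_packet(self, headers: Dict[str, str]) -> None:
        """Only an armed ELAM captures; this model keeps the first match."""
        if self.state is not ElamState.ARMED:
            return
        if all(headers.get(k) == v for k, v in self.trigger.items()):
            self.captured = dict(headers)
            self.state = ElamState.TRIGGERED

    def report(self) -> Dict[str, str]:
        """READ: return the captured header fields (never the payload)."""
        if self.state is not ElamState.TRIGGERED:
            raise RuntimeError("nothing captured yet")
        return dict(self.captured or {})

    def reset(self) -> None:
        """RESET: drop capture and trigger (assumed), back to Initialized."""
        self.trigger = {}
        self.captured = None
        self.state = ElamState.INITIALIZED


if __name__ == "__main__":
    elam = ElamModel(in_select=3, out_select=0)
    elam.config(src_mac="00:00:00:00:00:01")  # made-up SMAC
    elam.arm()
    elam.see_packet({"src_mac": "00:00:00:00:00:02"})  # no match, stays Armed
    elam.see_packet({"src_mac": "00:00:00:00:00:01"})  # match
    print(elam.state.value, elam.report())
```

The useful takeaway is the ordering: a packet that goes by before the ELAM is armed, or that never reaches the ASIC where the ELAM lives, leaves the state at Armed forever.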



This is pretty much all you need for a good start on ELAM usage for ACI. More info is available in the N9K config guide, and another good resource is the Cisco Live session BRKACI-2102, from which I took some of the images for this post.

Hope you enjoyed it, and next time maybe I will find some time to start the amazing post on PL-to-PL traffic on the ACI fabric.











Thursday, May 18, 2017

Stretched DC, really?... ok, for L3, BGP conditional forwarding

4:03 PM
A long time ago (I think it was years back) I was reviewing a DR solution for an internal customer who had two datacenters and a DCI between them (dark fiber). They had initially moved to a stretched design, extending VLANs from each site and using the L3 gateway on only one side at a time, since as a business requirement traffic should always leave from the primary DC. However, they were expecting some kind of solution able to automatically switch over to the secondary DC in case of a failure on DC1.

For these cases it's always a pleasure to read Ivan and see how he predicts the design issues that I will face in the future (Stretched DCI); luckily no stateful firewalls were involved here.

The main issue was not only to detect which side is alive (which is not easy without a witness, and we don't have one at all) but also how to decide which traffic should be served and from where.

So here is a big stop. Before going any further we need to make some assumptions and business decisions:

  • If the DC1 site fails but the DCI and the DC2 site are alive, traffic will enter from the DC2 side and traverse the DCI.
  • If the DCI fails, traffic will continue to be served from DC1 for the stretched VLAN subnets; this implies moving those servers to the surviving side by some other method, or at least shutting them down.
  • If the DC2 site fails but the DCI and the DC1 site are alive, traffic will enter on the DC1 side and traverse the DCI to reach the DC2-side servers.
  • Traffic should leave and enter through DC1 whenever possible, and the DC2 site should not be used unless strictly necessary (this was imposed by the customer)

So after reviewing a lot of options, and accepting that we may eventually fail and will have to work around that (and the fact that we need a stretched cluster after all), we came across a nice BGP feature called conditional forwarding (Cisco documents it as conditional advertisement).
Just for your reference, BGP conditional forwarding allows us to advertise a given network based on the routing information that we hold (the condition is evaluated against the BGP table). This can be really useful for this scenario: we define a witness network on each side and advertise it to the other. This should be a dummy network like 1.1.1.0/30 for DC1 and 1.1.2.0/30 for DC2; the match statement verifies whether we are receiving the other side's witness advertisement and, based on that, either withdraws our own advertisement or lets it flow.
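The decision DC2 makes can be sketched in a few lines of Python. This is only a model of the logic, not router configuration: the function name is invented and the stretched prefix `10.10.0.0/24` is a made-up example; only the witness network 1.1.1.0/30 comes from the text above.

```python
import ipaddress
from typing import Iterable, List

# Dummy witness network advertised by DC1 (value taken from the post).
WITNESS_DC1 = ipaddress.ip_network("1.1.1.0/30")


def prefixes_to_advertise(
    bgp_table: Iterable[str],
    witness: ipaddress.IPv4Network,
    conditional_prefixes: List[str],
) -> List[str]:
    """Advertise the conditional prefixes only while the witness is absent."""
    learned = {ipaddress.ip_network(p) for p in bgp_table}
    if witness in learned:
        return []  # DC1 is alive: keep our advertisement withdrawn
    return list(conditional_prefixes)  # witness gone: start advertising


if __name__ == "__main__":
    stretched = ["10.10.0.0/24"]  # hypothetical stretched VLAN subnet

    # Normal operation: DC2 still learns the DC1 witness, so it stays quiet.
    print(prefixes_to_advertise(["1.1.1.0/30", "192.0.2.0/24"], WITNESS_DC1, stretched))

    # DC1 failure: the witness disappears and DC2 starts advertising.
    print(prefixes_to_advertise(["192.0.2.0/24"], WITNESS_DC1, stretched))
```

Note that the model has the same blind spot as the real feature: DC2 cannot tell "DC1 is down" from "I simply stopped hearing the witness", which is why the assumptions about a DCI failure listed above matter.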

OK, enough reading; let's have a quick look at the configuration (on NX-OS) and behaviour:



Here is the config for the eBGP side of DC2:


Based on that, normal behaviour looks like this (routes will be withdrawn):


Now if we have a failure on the DC1 side, the condition will trigger and DC2 will start advertising.





Is this all we need? Definitely not... There are still a lot of things to resolve and we don't have an optimal design (we can discuss that: if we are meeting the business requirements, is there anything else to do?). Apart from that, notice that stretching a VLAN is not a good choice. Guess why? You're extending your fault domain, which doesn't simplify things and also makes isolation and detection more complex. So let's start wondering why we made such poor decisions and why we can't start talking about application-level aware resiliency, making our life better by using different subnets/networks at each site and handling traffic in and out more flexibly by leveraging existing methods (a long talk about BGP attributes and policy control enters here).


Some references:

Cisco. (August 2010). Cisco IP Routing. http://www.cisco.com/en/US/tech/tk365/technologies_configuration_example09186a0080094309.shtml











Sunday, May 14, 2017

CCIE DC v2 - bootcamp - outline

12:45 PM
For those attending my CCIE DC v2 bootcamp next week, here is the updated outline. I will be posting an updated diagram shortly (remember this course is not based on any rack rental, so interface numbering depends on that :) )

Introduction

Exam Considerations / Overview / Strategy


Section 1 – Cisco Data Center Layer 2/Layer 3 Technologies

1.1 – Configure VDC Resources
1.2 – Configure NXOS multicast
1.3 – Understanding VxLAN
1.4 – Configure vPC & Deployment options
1.5 – Configure FEX & Deployment options
1.6 – Configure VxLAN L2/L3 GW (EVPN | F&L)
1.7 – Configure NXOS Security
1.8 – Configure & Troubleshoot Spanning Tree Protocol
1.9 – Configure & Troubleshoot OTV


Section 2 – Cisco Data Center Network Services

2.1 – ACI Service Graph
2.2 – RISE
2.3 – Unmanaged devices in ACI
2.4 – Configure Shared L3 Services


Section 3 – Data Center Storage Networking and Compute

3.1 – Configure FCoE
3.2 – Cisco UCS Connectivity
3.3 – UCS QoS
3.4 – Service Profiles
3.5 – Configure advanced policies
3.6 – Configure Cisco UCS Authentication
3.7 – Configure Call Home Monitoring
3.8 – Troubleshoot SAN Boot
3.9 – UCS Central Basics
3.10 – UCS Central Advanced configuration & tshoot


Section 4 – Data Center Automation and Orchestration

4.1 – Introduction to scripting in Python / cobra SDK
4.2 – Python Programming with ACI Advanced
4.3 – UCS Director Basics
4.4 – UCSD Advanced Workflows Design


Section 5 – ACI

5.1 – Understanding ACI Fabric Policies
5.2 – Understanding ACI Access policies
5.3 – ACI external L3 connectivity in shared resources
5.4 – ACI L2 bridge / L2out
5.5 – ACI VMM integration

















Saturday, April 29, 2017

Multicast redundancy: Phantom RP

11:08 AM
Over the past two weeks a colleague and also a student asked me about Phantom RP and how it works; it all came out of a discussion we had around the VXLAN Part 2 post and about the supported multicast configurations for VXLAN in NX-OS.

First of all, and in order to avoid further confusion, I will summarize the currently supported methods for the VXLAN underlay on Cisco NX-OS/ASR devices:

Source: Cisco doc

With that clarified, we can continue with the original purpose of this post.
So, in our previous post we configured our Nexus 5K / 7K underlay to run multicast in order to support the Flood and Learn configuration. At that time we chose Bidir PIM since it is the only supported method on the N5K. So let's get some background on bidir and how we can make it redundant (can we?)


BiDir PIM


PIM bidirectional mode enables multicast groups to route traffic over a single shared tree rooted at the RP, instead of using separate unidirectional or source trees. Since the RP (well, its IP address :) ) is the root, it is a good idea not to place it on a router but on an unused IP address in the network that is reachable from the PIM domain (we will see this later in the Phantom RP configuration).
Explicit join messages are used to establish group membership. Traffic from sources is unconditionally sent up the shared tree toward the RP and passed down the tree toward the receivers on each branch of the tree (note: the tree is bidirectional, so traffic is not sent only in one direction toward the RP).

Bidir-PIM shares many mechanisms with PIM-SM; it forwards source traffic unconditionally toward the RP, upstream on the shared tree, but without the register process that PIM-SM uses for sources (https://tools.ietf.org/html/rfc7761#section-4.2). Because of that, forwarding can take place based on (*,G) entries alone, removing the need for any source-specific state and therefore expanding scaling capabilities. This image, taken from a Cisco white paper, shows the differences in the upstream process towards the RP in SM vs BiDir:


Source: http://www.cisco.com/c/en/us/td/docs/ios/12_0s/feature/guide/fsbidir.html#wp1023176

"PIM-SM cannot forward traffic in the upstream direction of a tree, because it only accepts traffic from one Reverse Path Forwarding (RPF) interface. This interface (for the shared tree) points toward the RP, therefore allowing only downstream traffic flow. In this case, upstream traffic is first encapsulated into unicast register messages, which are passed from the designated router (DR) of the source toward the RP. In a second step, the RP joins an SPT that is rooted at the source. Therefore, in PIM-SM, traffic from sources traveling toward the RP does not flow upstream in the shared tree, but downstream along the SPT of the source until it reaches the RP. From the RP, traffic flows along the shared tree toward all receivers."

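The scaling argument can be put into numbers with a back-of-the-envelope count in Python. The model is deliberately pessimistic and is an assumption for illustration, not a figure from Cisco: it supposes a router sitting on the path of every source, so with PIM-SM it holds one (*,G) entry per group plus one (S,G) per source, while with Bidir-PIM it holds only the (*,G) entries. The group and source counts are made up.

```python
def pim_sm_entries(groups: int, sources_per_group: int) -> int:
    """PIM-SM worst case: one (*,G) per group plus one (S,G) per source."""
    return groups * (1 + sources_per_group)


def bidir_entries(groups: int, sources_per_group: int) -> int:
    """Bidir-PIM: only (*,G) entries, regardless of the number of sources."""
    return groups


if __name__ == "__main__":
    # Example: 100 multicast groups, each with 50 VTEPs acting as sources.
    groups, sources = 100, 50
    print("PIM-SM   :", pim_sm_entries(groups, sources))
    print("Bidir-PIM:", bidir_entries(groups, sources))
```

In a flood-and-learn VXLAN underlay every VTEP is both a source and a receiver for its groups, which is the many-to-many case where this difference shows up the most.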

Need redundancy? Let's do it

We mentioned that our shared tree is rooted at the RP address, so in order to make it redundant we need a way to duplicate it or use a virtual IP. With bidir PIM no traffic is targeted at the RP itself (it has no control plane functions), so our solution is easier: instead of actually assigning the same IP in a sort of anycast, we can just advertise it through our IGP. The only issue foreseen is that there should be only one shared tree at any given time (we don't want our RPF interface to change every time), so in order to avoid that we can leave the path decision to the most specific match in the RIB (by advertising the same subnet with a longer mask from one of the redundant points).
Well, that was a lot of talk; I think a code/config snippet is worth more than a million words:


Primary


Secondary (hmm.. if you don't see any difference here is a hint: look at the mask)




Now it's done, you can run your set of favourite verification commands to see if this is working:



You can also shut down the active interface (lo1) and see how this changes and that our redundancy is working.
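The longest-match selection that makes Phantom RP work, including the failover, can also be tried offline in Python. This is a sketch of the RIB lookup only; the RP address, the prefixes and the router labels are hypothetical values chosen for the example and are not taken from the (missing) configuration above.

```python
import ipaddress
from typing import Dict, Optional

# Hypothetical phantom RP: an address inside the subnet, assigned to no device.
RP_ADDRESS = ipaddress.ip_address("10.254.254.1")


def best_path(rp: ipaddress.IPv4Address, routes: Dict[str, str]) -> Optional[str]:
    """Return the router whose advertised prefix is the longest match for rp."""
    best = None
    for prefix, router in routes.items():
        net = ipaddress.ip_network(prefix)
        if rp in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, router)
    return best[1] if best else None


if __name__ == "__main__":
    # Same subnet advertised twice; only the mask length differs.
    rib = {
        "10.254.254.0/30": "primary",
        "10.254.254.0/29": "secondary",
    }
    print(best_path(RP_ADDRESS, rib))  # the /30 wins while it is present

    # Shutting the primary loopback removes the more specific route.
    del rib["10.254.254.0/30"]
    print(best_path(RP_ADDRESS, rib))  # the /29 takes over
```

Since the failover is nothing more than a route withdrawal, how fast the RP "moves" is entirely a matter of IGP convergence.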

For CCIE / CCDE students:
- What is the convergence time of the RP in case of a failure on the primary?
- Can we get sub-second convergence?
- In a flood and learn configuration for VxLAN, what would you recommend: ASM or bidir PIM?
- If you choose ASM, how is your redundancy going to be solved?
- Why are we using "ip ospf network point-to-point"?

More on the Multicast ASM/SSM/Bidir comparison: http://lostintransit.se/2015/08/09/many-to-many-multicast-pim-bidir/