Author: Hany Michael

Release: NSX SDN-LAB v3.5 with advanced designs, vRealize Log + Network Insight and more.

Refresher: NSX SDN-LAB is a fully virtualized, nested lab running on VMware’s internal cloud. It is architected and developed by Hany Michael, Senior Staff Architect in the Networking & Security Business Unit @ VMware. If you are a VMware employee, you have instant access to this lab as a virtual pod that can be deployed as an independent, dedicated instance from the OneCloud portal. If you are a VMware customer or partner and would like access to the lab, you can contact your account team for further guidance. The lab is vendor neutral, and any of the third-party vendors listed can be replaced if required. The architecture can also be used to illustrate some of VMware’s networking and security solutions and capabilities. All the information in this architecture reflects the exact design and configuration of the NSX SDN-LAB, including but not limited to designs, product releases, hostnames and IP addresses. The lab is also designed to be modular and can be scaled to include more sites, networks, servers and/or storage resources. Future development of this lab will be based on “add-ons” that introduce other networking technologies (like MPLS), topologies (like Service-Provider models) or external clouds (like Amazon AWS, Microsoft Azure and Google Cloud, to name a few).

 

It’s been a while.

As a VMworld 2016 special, I would like to publicly announce the new NSX SDN-LAB (formerly NSX vLAB) release v3.5. There have been multiple internal releases of this lab on OneCloud, but this one represents the latest and greatest from NSX. I have also developed a brand new architecture for the lab that gives a holistic and detailed view of all the components together in a whopping A0-scale diagram. This is literally the largest diagram I have ever architected in Visio.

The Lab Architecture

NSX SDN-LAB v3.5

What’s New

Here is a quick list of what is new since the last release:

Upgrades:

  • vSphere upgraded to 6.0U2
  • NSX upgraded to v6.2.4
  • vRealize Automation upgraded to v7.1 with NSX vRO Plugin 1.0.4
  • vCloud Director upgraded to 8.1

New Products:

  • vRealize Operations 6.2 with NSX Management Pack.
  • vRealize Log Insight 3.6 with NSX Content Pack.
  • vRealize Network Insight 3.0 (this is officially the new NSX monitoring and operations tool – more on this in future posts).

New Vendors:

  • Arista vEOS joins the club in a Top-of-Rack leaf architecture.

New Designs:

  • A new multi-uplink / multi-rack Edge design is introduced in this lab. This has been one of the most difficult subjects to understand from the NSX design guides and presentations, so I included it in a (hopefully) clear and practical way. As usual, the design in the above diagram reflects the actual and exact configuration in the lab, including hostnames, IP addresses, OSPF areas, etc.
  • The lab also introduces another UDLR with no Local-Egress configuration to show an Active/Passive design. All the VMware management products are still abstracted over universal VXLANs as part of the other UDLR with Local-Egress.

New Integrations

  • With the inclusion of Arista vEOS, you can now integrate NSX with Arista as a ToR hardware VTEP.
  • NSX now leverages the new vRA 7.1 ReservationPolicyID to allow end users to provision workloads on universal logical switches in the datacenter of their choice using a single Converged-Blueprint.
  • vRealize Network Insight runs in full-feature mode and collects data from Cisco CSRs, Arista vEOS (SSH + SNMP), NSX Managers / Controllers (SSH + Central CLI), and all vDSs (with NetFlow).

Lab Configurations on GitHub:

 

 


Release: NSX vLAB 2.5

This is going to be a short blog post to announce the availability of NSX vLAB 2.5. In this release I have introduced the following NSX features:

  • Site-to-Site L2VPN between the two datacenters. In the Cairo site we have the L2VPN “server” while in the Alexandria site we have the L2VPN “client”.
  • I’ve configured the above tunnel to stretch a VXLAN to a VXLAN across the two sites. This showcases the ability to stretch your L2 networks between sites whose RTT exceeds 150ms, the maximum supported by the native Cross-vCenter NSX universal logical switches.
  • In addition to the above site-to-site L2VPN using the full NSX Edge Services Gateways, I have also deployed a new NSX Standalone L2VPN Edge in the remote site as a client connecting to the first L2VPN server in the Cairo site. This forms a hub-spoke topology and also demonstrates that you can stretch your networks to remote offices (or external clouds) without needing a full-blown NSX infrastructure running there.
  • Last but definitely not least, I have configured a VPN Gateway at the Alexandria site to enable SSL-VPN for remote users. In the diagram below you can see that the user in the remote office can initiate a dial-in VPN connection to one or more secured networks in the Alexandria site. The user is also authenticated with his/her own Active Directory domain account and can download the VPN client from a self-service portal. The latter is provided as part of the services running on the NSX VPN Gateway (aka ESG).
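The choice between the three L2 extension options in this release can be summarized in a small decision helper. This is only a sketch: the function name, its parameters and the returned labels are made up for illustration and are not NSX APIs; the only value taken from the post is the 150ms RTT limit.

```python
# Illustrative only: names and return labels are assumptions, not NSX APIs.
MAX_UNIVERSAL_LS_RTT_MS = 150  # RTT limit quoted in the post for Cross-vCenter NSX


def choose_l2_extension(rtt_ms: float, remote_has_nsx: bool) -> str:
    """Suggest how to stretch an L2 network between two sites."""
    if not remote_has_nsx:
        # No NSX infrastructure remotely: use the standalone L2VPN Edge as a client.
        return "standalone-l2vpn-client"
    if rtt_ms <= MAX_UNIVERSAL_LS_RTT_MS:
        # Within the RTT limit, a universal logical switch can be used natively.
        return "universal-logical-switch"
    # Beyond the RTT limit, fall back to an ESG-to-ESG L2VPN tunnel.
    return "esg-l2vpn"


if __name__ == "__main__":
    print(choose_l2_extension(rtt_ms=20, remote_has_nsx=True))
    print(choose_l2_extension(rtt_ms=220, remote_has_nsx=True))
    print(choose_l2_extension(rtt_ms=40, remote_has_nsx=False))
```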

NSX vLAB 2.5


SDDC Architectures: Workload Mobility & Recovery with NSX 6.2, vSphere 6.x & SRM 6.x

In this blog post I am going to talk about one of the subjects that I have always been passionate about at VMware. Application continuity is gaining incredible traction and interest from customers today, and with all the ground-breaking technologies that VMware introduced with the release of vSphere 6.0 and NSX 6.2, it has become a reality in the software-defined datacenter era.

I have everything you would expect to come with such a rich topic, from a business case, to a real customer deployment, to a detailed architecture, to even a recorded video demo’ing parts of the technologies explained here. So, without further ado, let’s get to the details.

The Business Case: Workload mobility in a Telco-over-cloud environment

Right before my recent transition to the VMware SDDC R&D organization, my last project in the Professional Services Org was probably the most interesting one throughout my 5 years in my role as a field architect. I was tasked to design a state-of-the-art Telco Over Cloud platform for one of the largest Telcos in the region. As you would expect also from a carrier-grade design, application and service continuity was one of the top priorities in that project.

Apart from the traditional disaster recovery requirements for this Telco, there was also a requirement for disaster avoidance at the datacenter/location level: to perform rapid workload mobility across two sites should a foreseeable outage strike one of the two active/active sites. Network Functions Virtualization (NFV) management workloads were the first aspect that we looked at here. The Telco had multiple Network Equipment Providers (NEPs) involved in the project, and the customer wanted a unified platform to enable continuity of the management applications. This was required regardless of each NEP-specific application’s ability to tolerate failures. In fact, the two NEPs involved in this project had a clustering mechanism built into their Virtualized Network Functions Manager (VNFM) application. This, however, was a traditional Active/Standby clustering mechanism that would require downtime for switchover. In carrier-grade environments, even one second of downtime would have a big impact on business.

Like the VNFMs, the Cloud Management Platform (CMP) used here, vCloud Director – Service Provider Edition (vCD-SP), was no exception to the availability requirement, since it provides the gateway (through APIs) between the VNFMs and the actual VNFs being provisioned. Although vCD-SP is an Active/Active stateless application, it still requires a database tier that has to be abstracted. The same holds true for other operational workloads like the NEPs’ Element Management Systems (EMS) as well as VMware’s applications like vRealize Operations, Log Insight and so forth. Those workloads play a critical role in monitoring and managing an environment that is operating in real time.

So how did we achieve application continuity in such a mission-critical environment? The short answer is through the combination of vSphere’s Long Distance vMotion and the Cross-vCenter Network & Security in NSX 6.2. The long answer is what we will discuss and examine in detail in the rest of the article.

The Architecture


Breaking the physical and logical boundaries

In vSphere 6.0, VMware introduced a ground-breaking new feature called Long Distance vMotion (LDvMotion). I was more excited about this feature than any other technology since the invention of the original vMotion itself. As a matter of fact, I created the very first prototype inside VMware using early builds of vSphere + NSX to showcase how you can live migrate a VM across two datacenters using LDvMotion and NSX’s L2VPN network extension. All in software, with no network or storage extension across sites.

That previous design with the NSX Edge Services Gateway (ESG) & L2VPN is still valid and can be implemented in many customer use cases. However, with the introduction of NSX 6.2, VMware has taken this to a whole new level. The new Cross-vCenter NSX feature in the 6.2 release basically allows you to stretch your Logical Switches (aka VXLANs) across sites for L2 networking and also to create a brand new Universal Distributed Logical Router (UDLR) for L3 routing. Sounds cool? Not as cool as adding the Local Egress capability into the mix. But more on that in a bit.

Stretching your networking and security constructs

If you try to speak to a network engineer about Layer 2 extensions across sites, he/she would probably push back and explain why that is a very bad idea, and how a single misconfiguration in the infamous Spanning Tree Protocol could bring down both of your sites. That’s probably right, except that we are not doing any of that in this architecture. We are actually building a software overlay on top of existing, traditional Layer 3 physical networks. Think of it the same way VMware first introduced VXLANs to abstract L2 networks across racks over existing Layer 3 networks within one datacenter. What is different here is that this capability is now possible across vCenter Servers and, consequently, in the context of our current design, across datacenters. VMware has basically introduced a new technique to peer two (or more) NSX Managers together and to unify the controllers into one universal cluster. The outcome is simply amazing. You can now create universal transport zones across compute clusters that are managed by different and independent vCenter Servers sitting in different datacenters/locations, as long as you have very basic L3 connectivity between them, which is the case in the vast majority of today’s customer environments. You are probably already doing that today with vCenter Enhanced Linked Mode to manage your two sites from a single Web Client. If you are, congratulations, you already have most of this architecture in place in your environment!

Cross-vCenter NSX

NSX Peering and vCenter Server Enhanced Linked-Mode

Local Egress Optimization

So now that we’ve explained (at a very high level) how the L2 networks are extended across two sites with the new Cross-vCenter NSX 6.2 capabilities, let’s have a look at the L3 aspect, or north/south traffic. The first question that customers ask at this point, regardless of the great L2 application adjacency benefits they recognize, is this: what about my L3 traffic, do my applications need to traverse datacenters to exit to the DC core, enterprise edge or the internet? The answer is simply no, but it really depends on how you architect your logical L2 & L3 networking. If you want all your traffic to exit from one site, you can. If you don’t, you enable Local Egress on your universal DLRs. You do this when you deploy your UDLR for the first time, by enabling the “Local Egress” option in the configuration wizard. Once you do, your UDLR will be deployed and extended across the two datacenters. The next steps are as follows:
1) You deploy two different Control VMs, one in each datacenter. This is not mandatory in general, but it is required in our current design because you need to establish an OSPF routing adjacency between your UDLR and the upstream ESGs.
2) You then create two uplinks (yes, that’s possible with the UDLR), each going to a universal logical switch that is, in turn, linked to the ESGs sitting at each site. In the diagram you can see two ESGs with internal interfaces going to those VXLANs, while their uplinks go upstream to a unique VLAN that is relevant to each site. Through the latter, you establish another OSPF adjacency with your datacenter L3 physical device. We are also leveraging equal-cost multi-path (ECMP) here for better performance (more N/S bandwidth) and better availability (faster convergence within a cluster of ECMP nodes). In the next section I will explain how your E/W and N/S traffic will flow.
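The ECMP behaviour mentioned in step 2 can be pictured with a small sketch. It assumes a simple per-flow hash over the source and destination addresses, which is a common way to spread flows over equal-cost next hops; the post does not say which hash NSX actually uses, and the ESG names and IP addresses below are hypothetical.

```python
import zlib
from typing import Sequence

# Hypothetical names for the two ECMP ESGs at one site (not taken from the lab).
SITE_A_ESGS = ("esg-a-01", "esg-a-02")


def ecmp_next_hop(src_ip: str, dst_ip: str, next_hops: Sequence[str]) -> str:
    """Pick one of several equal-cost next hops for a flow.

    Hashing the flow key keeps every packet of a flow on the same path,
    while different flows may land on different next hops.
    """
    if not next_hops:
        raise ValueError("at least one next hop is required")
    flow_key = f"{src_ip}->{dst_ip}".encode()
    return next_hops[zlib.crc32(flow_key) % len(next_hops)]


if __name__ == "__main__":
    # Example addresses only; the first is inside the 172.16.10.0 app subnet.
    print(ecmp_next_hop("172.16.10.11", "10.0.0.5", SITE_A_ESGS))
    # If one ESG fails, the remaining next hop carries every flow.
    print(ecmp_next_hop("172.16.10.11", "10.0.0.5", ["esg-a-02"]))
```

With more ESGs in the list, more flows can be forwarded in parallel, which is where the extra N/S bandwidth comes from.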

Your East/West and North/South application traffic

In our current design, I used a practical example of multi-tier applications to make things realistic. As you can see, we have the vRealize Suite (vRealize Automation, Log Insight, Operations, etc). If we take vRA as an example, it’s a typical three-tier application that consists of a Web tier (appliances and IIS), an application tier (Manager Service) and a database tier (MS-SQL). There are also some service dependencies like single sign-on, which is represented here by the vRealize Identity Appliance. All your east/west traffic across these nodes happens over L2 since they share the same logical switch. The north/south traffic that needs to go out/in passes through the Edges that the UDLR is uplinked to. In our case, the UDLR has one internal interface that acts as the default gateway for those application nodes. Keep in mind that if one node is sitting in site A, then its default gateway is local to that datacenter. The same holds true for another node that might be sitting in site B: its default gateway (in our case 172.16.10.1) is also local to site B. Now what about the routed traffic all the way to the DC core? It happens exactly the same way, thanks to the Local Egress optimization that we enabled on the UDLR. The first node in site A will be routed to the upstream ESGs in the same site and from there to the DC core, which is the Layer 3 switch (192.168.110.1). The exact same thing happens for site B: the node sitting there will be routed to the ESGs at that site and then to the Layer 3 switch (192.168.210.1). If you are interested to know in detail how this happens, you can read about the locale-id concept in NSX. This is largely out of scope for this article, but I am planning to do a technical deep dive into it in a future blog post.
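The effect of Local Egress described above can be modelled in a few lines. This is a deliberately simplified sketch in the spirit of the locale-id concept, not how NSX implements it internally: the `Route` class, the locale labels and the ESG names are assumptions for illustration, while the gateway and DC core addresses are the ones quoted in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str   # hypothetical ESG name
    locale_id: str  # hypothetical site label


# Same default gateway IP on the UDLR at both sites (from the post).
DEFAULT_GATEWAY = "172.16.10.1"

# One default route per site, each tagged with the locale it belongs to.
ROUTES = [
    Route("0.0.0.0/0", "esg-site-a", "site-a"),
    Route("0.0.0.0/0", "esg-site-b", "site-b"),
]

# DC core L3 switch behind each site's ESGs (addresses from the post).
DC_CORE = {"esg-site-a": "192.168.110.1", "esg-site-b": "192.168.210.1"}


def egress_path(host_locale: str) -> List[str]:
    """Return the hops a north/south packet takes from a host in a given site."""
    local_routes = [r for r in ROUTES if r.locale_id == host_locale]
    if not local_routes:
        raise LookupError(f"no route for locale {host_locale!r}")
    esg = local_routes[0].next_hop
    return [DEFAULT_GATEWAY, esg, DC_CORE[esg]]


if __name__ == "__main__":
    print(egress_path("site-a"))
    print(egress_path("site-b"))
```

Both sites share the same gateway address, but each host only ever sees the route tagged with its own locale, so traffic never crosses datacenters just to exit.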

The Dynamic Routing – upstream & downstream

I’ve briefly mentioned earlier that we are establishing an OSPF adjacency between the ESGs and the upstream switches. Let’s have a closer look at that.
If we take a look starting from the virtualized application itself (vRA in our case), we said that its default gateway is the internal UDLR logical interface 172.16.10.1.

That UDLR has an OSPF peering happening with the ESGs in an ECMP topology. The ESGs, in turn, have another OSPF peering happening with their upstream physical L3 switches.

NSX Routing

NSX upstream and downstream routing adjacencies

vCenter Enhanced-Linked Mode and PSCs

One of the important subjects I don’t want to miss here is vCenter Enhanced Linked Mode. You have everything to gain by enabling it in your vSphere virtual infrastructure, from the way you manage your inventory and licensing all the way to performing cross-datacenter migrations (vMotion & Storage vMotion) right from the comfort of your Web Client. Although you can perform those using APIs, who would want that when you can simply drag and drop objects in the UI? In the following demo, we will examine that.

Your First Long Distance vMotion

So this is it. This is where we test the beauty of the new vSphere 6.0 + NSX 6.2 capabilities to live migrate your VMs across datacenters. In the past, that was exclusive to physically stretched L2 networks and storage. Today, we do all that in software thanks to the universal logical objects. With that said, when you right-click on your VM and choose Migrate, the usual wizard asks you to choose between three options: compute only, storage only, or both. Since we are in the brave new world of all things software, we choose the third option to migrate both compute and storage to a completely different and independent datacenter. Next, you choose your compute cluster and storage destinations, both of which are in the second datacenter. When you reach the point of choosing your network, you will find that vCenter is already displaying the associated VXLAN on the other side with the same segment ID. That’s simply because this very VXLAN segment is a universal object. Now let’s play the demo.

Demo: VM migration across three datacenters

To make things even more interesting, I’m not going to demo a VM migration across two datacenters. That was probably cool prior to VMworld 2015 when I first prototyped this with NSX. Instead, we are going to demo two consecutive VM migrations across three datacenters. This also shows how flexible the design is and how far you can scale it.

Security

Last but definitely not least is security. You may already be wondering at this point: what about the VM security constructs? Do I lose them when I migrate the VM? Of course the answer is no. You keep and maintain those settings via the universal DFW. Since we are using vRA as our example application, let me talk in a bit more detail about how you would secure this application in your datacenter with this design.

1) L2 micro-segmentation: Firstly, you can restrict the traffic between the vRA nodes within their own L2 network. For example, the DB node speaks only with the Manager Service and Web tier over the required ports. No other ingress or egress access is allowed.
2) L3 end-user access: vRA, like any application with a Web tier, requires end-user access to the portal. Since we already know that this happens over HTTPS and VMRC, we simply open only those ports to the end-user networks. Those networks are blocked from accessing anything else in the vRA VXLAN network.
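The two rule sets above amount to a short allow-list followed by a default block. The sketch below models that logic with first-match evaluation; it is not the NSX DFW API, the group names are hypothetical, and the ports are assumptions (443 for HTTPS and 1433 as the default MS-SQL port). The VMRC rule is left out because the post does not give its port.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    src: str             # source group, or "any"
    dst: str             # destination group, or "any"
    port: Optional[int]  # None means any port
    action: str          # "allow" or "block"


# Hypothetical group names; ports are assumptions (see the lead-in).
RULES = [
    Rule("vra-web", "vra-db", 1433, "allow"),      # Web tier to database
    Rule("vra-manager", "vra-db", 1433, "allow"),  # Manager Service to database
    Rule("end-users", "vra-web", 443, "allow"),    # portal access over HTTPS
    Rule("any", "any", None, "block"),             # everything else is blocked
]


def evaluate(src: str, dst: str, port: int) -> str:
    """Return the action of the first rule that matches the flow."""
    for rule in RULES:
        if (
            rule.src in (src, "any")
            and rule.dst in (dst, "any")
            and rule.port in (port, None)
        ):
            return rule.action
    return "block"  # defensive default if no catch-all rule is present


if __name__ == "__main__":
    print(evaluate("end-users", "vra-web", 443))
    print(evaluate("end-users", "vra-db", 1433))
```

Because the rules are expressed against groups rather than site-specific addresses, the same table applies wherever the VM happens to run.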

One would ask here: why do I apply this security enforcement on the NSX DFW rather than on my traditional ACLs or DC firewalls? The answer is simple. While you can still secure your applications (virtual or physical) the same old way, you would want to do this on the NSX DFW so that those security settings migrate with the VM across datacenters (or after a failover with SRM, as I will show in part two). NSX removes any need to reconfigure (or preconfigure) those security rules. This is a one-time configuration that will be maintained throughout your application lifecycle.

Of course there are even more advanced techniques you can follow here, assuming you are in a highly secured or multi-tenant environment, such as deep packet inspection. That can be done simply by redirecting your traffic to any of VMware’s security partners (like Palo Alto Networks or Check Point, to name a few) as long as they are certified with NSX.

NSX Universal DFW Security Rules

Conclusion

Let me get back to the original business use case. In our design here, we enabled the Telco to provide a unified platform, regardless of the applications and their vendors, to seamlessly live migrate those applications across datacenters. Those migrations are neutral to the location or network connectivity as long as the Telco can meet the LDvMotion requirement of 150ms. No stretched networking, storage or compute clusters were required here. No vendor-specific hardware solutions were used either. This is all software-defined and configurable in a matter of minutes (not even hours) through your vSphere Web Client. This design will also work in the vast majority of today’s datacenters since we require no special cross-datacenter solutions whatsoever.

What’s Next?

Next is disaster recovery. We have seen here how disaster avoidance can be achieved; next I will show you how NSX and SRM work beautifully together and are in fact a match made in heaven. Forget everything you know about application failover, IP addressing/DNS changes, scripting or manual routing convergence. I am going to demonstrate (with this very same design) how you can recover your applications with a zero-touch infrastructure right after a major disaster takes out your entire datacenter. Stay tuned for part two.

Postscript:
– The content of this blog post (writeup, architecture and video) was produced last year in my previous role in the ISBU. I did not have the chance to complete and publish it until this month, but everything mentioned here is still up to date and the customer reference is already running in production.
– This solution was prototyped using my NSX vLab from a proof-of-concept all the way to a customer deployment. I would highly recommend that you check it out. It can help you, whether you are an architect, consultant or admin, in your planning, design, deployment and validation phases.


Introducing the VMware NSX vLab 2.0

If you are a VMware employee, you have instant access to this lab on VMware’s OneCloud. If you are a VMware customer and would like a demo of any of the topics mentioned here, you can reach out to your local SE/AM and see if they can arrange a remote demo for you. That’s the whole beauty of these cloud labs!

I’ll start this post by saying something that might sound a bit shocking to many: I have never owned a home lab in my life or even had access to a physically dedicated lab in my career. I have always been an avid fan of virtually nested labs and have always used them as the only way to develop, test and validate solutions. This became especially obvious when I joined VMware in 2010, where I used our internal “OneCloud” (called vSEL at that time) quite heavily to run all my labs.

Ever since, I have architected and developed so many labs that I cannot even count. The one I am sharing here in this blog post is not the biggest but I can fairly say that it’s the one I am proud of the most. Apart from the incredible flexibility that it gives me in testing almost anything I want in the NSX world, it also allowed me to learn new topics, validate my solutions and last but not least demonstrate them to colleagues or customers. Take the NSX 6.2 Local Egress as an example. This is one of the most powerful features in NSX yet one of the most difficult subjects to understand for me when I first read about it. With this lab, it was quite easy to design and implement it in no time and to learn all about its powerful capabilities along the process (future blog posts coming soon on this).

Granted, there are some things that cannot be done (yet!) in this vLab like the use of VLANs, but that is not a show-stopper at all to achieve what you want to architect/test/validate in your NSX labs. As you will see, for example, I have substituted the use of VLANs by dedicating interfaces on the core router at each datacenter to have its own subnet + default gateway. That pretty much allows you to simulate a typical enterprise environment with different networks for management, production, campus access, etc. The use of VLANs may come later by leveraging for example the Arista vEOS or the Cisco L2vIOS (depending on the licensing terms & conditions for eval/lab use).

File: NetSkyX-OneCloud-NSX-vLAB-v2-0-W09

Although this lab can be used for many purposes, here are my favorite topics that can be fully demonstrated:
1) NSX 6.2 Local Egress: as mentioned above, I leveraged this lab mainly to test and validate this powerful feature in NSX when 6.2 came out. You can combine that with some workload mobility using Long Distance vMotion to showcase how you can live migrate your application across datacenters with zero downtime. I will talk about this in detail in my very next blog post.
2) Disaster Recovery: this is a very hot topic now in the NSX world. Forget about what you used to do in the old days with SRM and the painful process of changing your applications’ IP addressing after a failover. With SRM + NSX, you can demonstrate how they form an unmatched solution for a fast and efficient recovery of your apps. Not just that, you can also test and validate the actual NSX system recovery across sites when you lose, for example, a complete datacenter: how you can recover your Net/Sec environment in the recovery site, and how easily you can do that.
3) Routing: this is one of my favorite topics. You can leverage this lab to configure and test your OSPF/BGP routing adjacencies between your NSX environment and your physical network (simulated here with Cisco CSR1000V). That includes all the ECMP goodness as well.
4) Integrations with CMPs: whether you have vRA, vCD-SP or even VIO, this lab will be your best bet to integrate and test NSX with those CMPs. Be it a single site with a single cluster or multiple sites with multiple clusters, you name it. All you need is to configure your favorite CMP with NSX and start all the fun of automating network and app provisioning.
5) Micro-segmentation: not only can you test app-to-app micro-segmentation, but you can also combine that with security enforcement for apps accessed by external (campus, remote or internet) users. This is a great way to explain to someone the powerful DFW capabilities in NSX and how you can leverage them in the real world to secure and harden your apps.
6) Pen-Testing: to build on the previous point, you can take this further and start performing some penetration testing from within your datacenter network, from your campus network or from your remote locations to try to exploit vulnerabilities in your applications. Combine that with some of the NSX partners’ NGFW solutions like Palo Alto Networks, McAfee, Trend Micro, etc, and you’ve got a very powerful platform to do your testing (with your favorite tools like Metasploit, for example).

This list could go on and on. These are just some of my favorite topics. In future blog posts I will keep developing this lab to introduce new NSX features or solutions. Just as an example, I am already working on setting up an L2VPN between sites as well as a standalone NSX Edge in the remote site in a hub-spoke topology. I will keep updating this lab and blogging about its development here.

And in case you are still wondering why I prefer to use this nested lab over a physical home lab, here are also my top reasons:
– Resources: no matter how rich I am, I am not going to be able to match the resources required to build such a large environment of two DCs + a remote site in physical form. It can also scale pretty much as I fancy by adding more datacenters or remote locations. Also, why would I want to buy a Cisco router to run MPLS in my core network if I can just use their CSR? Take that example and apply it to VMware for physical ESXi hosts, or better yet, to storage vendors with the dying Fibre Channel arrays or even NFS filers.
– Flexibility: obviously it is easier to deal with digital files than with metal hardware. I can do pretty much whatever I want, like snapshotting an ESXi host before upgrading it or a router before applying a newly written configuration, etc.
– Sharing: obviously I am all about knowledge sharing, so it would be a bit tricky if I wanted to share my home lab with you (my colleague or customer). Here, I can either share access to the lab with you (with RDP access or with direct vCD access), or I can even publish it to our internal catalog for anyone to deploy and have full control over.
– Tear & Deploy: sometimes I like to do quite disruptive tasks like simulating an actual disaster striking a complete datacenter (routers, links, DC, etc) and testing how a recovery can be achieved. This is normally a disruptive task in the physical world that requires some effort to return to the original state. In this type of nested lab, it has never been easier. All you need to do is delete your lab when you are done and then deploy a clean one for a fresh start.

Lastly, I will leave you with some bullet points listing what the lab consists of, but the attached blueprints/diagram still speak a thousand words.

Product Release:
– NSX 6.2.1 (upgraded from 6.2.0)
– vCenter Server 6.0 U1 (upgraded from 6.0 GA)
– ESXi 6.0 U1 (upgraded from 6.0 GA using Embedded Host Client)
– vRA 7.0 GA
– vCD-SP 8.0 GA

Routing Design:
– Two site-independent ESGs peered upstream with the core router (CSR1000V) in an ECMP configuration and downstream with the UDLR control VM (which is again unique at each site).
– Local Egress routing is enabled on the UDLR.
– One core area 0 connecting the edge routers at each site (primary, secondary and branch) and the SP Router.
– Two unique NSSA OSPF areas at the primary and secondary sites (10 and 20), configured on each site router as totally NSSA to avoid exchanging the core routes with the ESGs.
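The area layout above can be captured as a small model showing what the site router (acting as ABR) passes into each area. This is a sketch based on general OSPF behaviour rather than on the lab’s router configuration: in a totally NSSA, the ABR blocks inter-area (Type 3), ASBR-summary (Type 4) and external (Type 5) LSAs and injects only a default route as a Type 3 LSA. The dictionary and function names are made up for illustration.

```python
# Area numbers come from the post; the dictionary and function are illustrative.
AREAS = {0: "backbone", 10: "totally-nssa", 20: "totally-nssa"}


def abr_passes_into_area(area_id: int, lsa_type: int, is_default_route: bool = False) -> bool:
    """Say whether the ABR lets an LSA from outside the area into it.

    Only LSA types 3, 4 and 5 are considered, since those carry routes
    learned outside the area.
    """
    if lsa_type not in (3, 4, 5):
        raise ValueError("only LSA types 3, 4 and 5 are modelled")
    kind = AREAS[area_id]
    if kind == "backbone":
        # The backbone receives summaries and externals as usual.
        return True
    # Totally NSSA: only a Type 3 default route is injected by the ABR.
    return lsa_type == 3 and is_default_route


if __name__ == "__main__":
    # A core prefix advertised as a Type 3 summary is kept out of area 10 ...
    print(abr_passes_into_area(10, 3))
    # ... while the default route still reaches the ESGs in that area.
    print(abr_passes_into_area(10, 3, is_default_route=True))
```

This is why the ESGs in areas 10 and 20 only need a default route towards the core instead of the full set of core routes.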

CMP – Service Provider Model:
– vCD-SP 8.0 is configured with two vCenter Servers and NSX Managers for both sites.
– Two Provider-vDCs each pointing to the resource cluster of each site.
– Two Organization-vDCs carved up from the previous PvDCs in PAYG model.
– Edge Gateway configured on the first Org-vDC and setup with External-Direct, Private and NAT-Routed OrgNets.

CMP – Enterprise Model:
– vRA 7.0 installed and configured in simplified mode.
– Two Endpoints pointing to vCenter Servers at the Primary and Secondary sites.
– NSX 6.2 configured with the previous vCenter Servers.
– Network Profiles created for External, Routed and NAT’ed networking.
– Blueprints created to reflect VMs with the previous network topologies.

BCDR:
– vCenter SRM 6.1 installed and configured across the primary and secondary sites in a two-way protection.
– vSphere Replication 6.1 is configured and replicating various applications (like vRA and vCD) across sites.
– NSX 6.2 is fully integrated with SRM and the applications are abstracted over universal VXLANs.
– Recovery Plans already tested many times for fail-over and fail-back across sites.

Access:
– Management: RDP into the AD01 (of DC1) or AD02 (for DC2) to have local access to the environment.
– Campus: vCD console access to either DC1 or DC2.
– Remote: vCD console access to the remote office client.


New Year, New Blog and New Career Chapter @ VMware!

 

[This blog post was written on Dec 30th 2015 but I couldn’t publish in time due to PTO]

I am very excited to announce that I am joining the NSBU @ VMware as a Senior Staff Architect effective Jan 1st 2016. This is actually my second major move inside VMware in 6 months, as I had transitioned before that from PSO (professional services organization) to the ISBU (integrated systems business unit). The latter move was a big career transition for me because I have always been customer-facing throughout my career. The ISBU role was purely R&D, which allowed me to learn an incredible amount of new things inside the SDDC org and to interact with diverse teams like Engineering, QE, PMs and so forth. What I actually did in the ISBU was mostly what I used to do in my spare time during the PSO days, like coming up with new ideas, prototyping them into tangible solutions like nested labs, or documenting them in different forms. During those six months I also came to realize how passionate I am about networking and security virtualization. This is something that was not very obvious to me during the past few years but, surprisingly, when I look back at what I did at VMware or even how I started my career and developed it, I realize that networking has always been my biggest passion. You can even tell that from my blog posts on hypervizor.com. They have always been centered around NSX and its predecessor vCNS.

Which takes me to the new role that I am moving to. It is still part of the same R&D organization at VMware, which will allow me to continue to do what I enjoy the most, prototyping new ideas and solutions, but now with an even greater focus on NSX. What I am even more excited about is that I will be able to engage with customers (in limited scopes though) to help them adopt NSX in their environments, be it enterprises, service providers or even telcos. Speaking of the latter sector, my last project in PSO was around NSX & NFV, which was one of the very first worldwide VMware NFV projects done purely with NSX & vCD-SP. This has not just turned into a huge success story but also opened up a whole new set of opportunities for NSX in the telco world. This is exactly what I am looking forward to continuing in my new role in the NSBU. There are no restrictions whatsoever on what I can work on or what solutions I want to develop, from a proof of concept inside R&D all the way to the field (with PSO, SE, customers) to implement it. It’s like designing the next F-16 and having the opportunity to be the first one to fly it!

2015 has been an awesome year in my career and I would like to take this chance to thank my previous manager, Phil Weiss, for his incredible support during my short time in the ISBU and in my decision to transition to the NSBU. He is hands-down the best manager I have had in my career, not just at VMware. He is also a brilliant architect, and in fact I was inspired for a long time by the Mega vPods that he used to build in his previous R&D days.

I am not writing this blog post just to announce my move (who cares anyway), but also to showcase that sometimes you might be passionate about something for a very long time without necessarily realizing it. I have always found similar blog posts from VMware rock stars like Duncan Epping or Scott Lowe to be really inspiring. I hope one day I will be able to do the same for others in their careers.

And just for the record, this time I know exactly what I want to do next, where I want to be inside VMware and how to get there! All I can say for now is that specialization is everything in our industry and once you know what you are really passionate about, don’t think twice about focusing your career in that direction.


SDDC Architectures: Faster and Reliable BCDR for your cloud using NSX, SRM and vRealize Suite

Nearly six years ago, when VMware first introduced Site Recovery Manager (SRM), it was quite a big hit in the world of enterprise customers. Being a VMware customer myself at that time, I remember how this represented a significant improvement in how we handle our DR scenarios. Specifically, we shifted from long and exhausting runbooks full of manual instructions to simply constructed recovery plans. In fact, I was so grateful for this solution that I immediately started evangelizing about it and created a full series of videos, covering everything from installation all the way to configuration.

Fast-forward 6 years, and the conversations that I have with my customers are completely different now. Disaster recovery for them is already a given. SRM is a part of almost every environment that I have seen over my past 5 years at VMware, and customers use it on a regular basis for planned or unplanned VM recovery. What has changed since then, and what customers require as we stand today, are hopefully the questions that this blog post and the associated reference architecture will answer.

Business Drivers

The greatest challenge that I see in customer environments today is around the operational part after a disaster strikes. Recovering virtual machines from one site to another could not be simpler today with the great enhancements of SRM, especially when you combine it with vSphere Replication (VR). However, the problems that almost all my customers face today are primarily around IP address changes and how that breaks, to a large extent, how the applications work together.

Another challenge faced by customers with private cloud models is the day-two operations through their cloud management platform (CMP). If you have a CMP like vRealize Automation (formerly vCloud Automation Center), and your end users – like developers, QEs, or application owners – access it on a daily basis, how do you bring that platform up and running after a disaster recovery, and most importantly, what is your recovery time objective (RTO) for that?

Think of it this way: If you are a large bank, service provider, or telco that has already adopted cloud computing and achieved all the great benefits of it, how do you ensure a fast RTO for both your applications and the cloud platform itself to continue your day-to-day business operations after a disaster? What is the use of your BCDR plan and architecture if you cannot resume your normal business at the recovery site exactly as you used to do in your protected site? Let’s take a practical example here to put things into perspective. You have a development department in your business and your developers use vRealize Automation (vRA) on a daily basis to provision applications (not just VMs). If this department is of great importance to your business, how will you be able to provide them with the same platform to resume their work the very next day after a disaster? Are you going to recover the same environment on your DR site? Or are you going to build another vRA instance? If it’s the former, how fast can you recover your vRA platform, and if it is the latter, how will you be able to maintain the same configuration across the two sites? Now take that example and try to apply it to different departments in your organization, like the IT Ops responding to business requests for new apps through vRA, or your NOC that has already established specific monitoring dashboards through vRealize Operations and so forth.

Objectives and Outcomes

If you haven’t already done so, I would highly recommend checking out VMware’s IT Outcomes at this link. In a nutshell, VMware has grouped a number of IT outcomes and mapped them to how organizations can deliver more business value. Our solution here, in turn, maps to the following three IT outcomes:

– High Availability and Resilient Infrastructure

– Streamlined and Automated Data Center Operations

– Application and Infrastructure Delivery Automation

To translate that into clear and concise objectives, and to set the stage for the following sections in the article, here is what we are trying to achieve the very same day, or the next day at the latest, after a disaster recovery is complete:
1) Enable our application owners to access the same CMP (vRA portal in our case) to resume their day-to-day operations like accessing their VM consoles, creating snapshots, adding more disk space, etc.
2) Enable our developers to continue their test/dev VM provisioning the exact same way they used to before the failover. That includes, but is not limited to, using the very same Blueprints, Multi-machine templates, NSX automated network provisioning, etc.
3) Enable our IT-Ops to provision VMs (IaaS) or applications (PaaS) through the vRA portal without altering any configurations like changing IP addresses or hostnames, etc.
4) Enable our NOC/Helpdesk to monitor the applications, services and servers through vRealize Operations Manager (vROM) the same way they did in the original site. No changes in accessing the portal or the default or custom dashboards previously created, and no loss of historical data.
5) Enable our higher management to access vRealize Business to see financial reports on the environment and cost analysis the same way they did before the disaster and failover took place.

If you are motivated at this point but you feel that it’s all marketing talk, it’s time to dig deeper into the proposed architecture, because from this point onwards, it cannot get more technical.

The Architecture


First and foremost, I always like to stress the fact that all of my reference architectures are validated before I publish them. As a matter of fact, in order to create this architecture, I spent nearly two weeks on testing and validation on our internal VMware cloud, and the end result was 40+ VMs. These ranged from simulated physical devices (like edge routers) to management components (like vCenter Servers and SRM), all the way to nested VMs that simulate the actual applications being failed over and failed back across sites. Now, let’s examine this architecture in detail.

Starting from the top to bottom, the following are the major components:

1) Datacenters: This represents a typical two-DC environment in remote locations. The first is the protected site in Cairo, and the second is the recovery site in Luxor. This could still be two data centers in a metro/campus location, but that would be just too easy to architect a solution for. It is worth mentioning also that, in my current cloud lab, I have this environment in a three-datacenter architecture, which works pretty much the same. I just didn’t want to overcomplicate the architecture or the blog post, but know that this solution works perfectly well with 2+ datacenters.

2) The vRealize Suite Infrastructure: These are the vRS virtual appliances/machines that should be typically running in your management cluster. That includes your vRA front-end appliances, your vRA SSO, the IaaS components, and the database, plus the other components in the Suite like vR-Operations and Business. What is different here is that we are connecting these VMs to a logical switch created by NSX, represented in the diagram by (VXLAN 5060). You will know in a minute why this platform is abstracted from the physical network. Another important note here is that this vRS environment could still be distributed and highly available; in fact, I have the vRA appliances load-balanced in my lab with a one-arm Edge Services Gateway (ESG). For detailed information about architecting a distributed vRA solution, you can check my blog post here. For simplicity, I included the vRA nodes in a standalone mode, but the distributed architecture will work just fine in our scenario here.

3) The Virtual & Network Infrastructure: In the third layer of this architecture come the management components of the virtual infrastructure, such as vCenter Server, SRM, NSX Manager, and so forth. Everything you see in this layer is relevant to each datacenter independently. For example, we will never fail over the vCenter Server from one site to another. The same thing holds true for the infrastructure services like DNS or Active Directory domain controllers. I will be talking in detail about the networking and routing subjects in another section, but for now, I would like to point out that we have a traditional IP fabric for the management workloads, represented here by VLAN 110 and subnet 192.168.110.0/24 in the Cairo datacenter, while we have VLAN 210 and subnet 192.168.210.0/24 in the Luxor datacenter. The two networks are routed using a traditional MPLS connection or any L3 WAN cloud, depending on how your environments are designed.

4) Resources: In this layer, you can see our vSphere clusters. We have here a management cluster for running your VMware-related or infrastructure services workloads. The second cluster is your production workloads cluster, which runs your business applications. The third and last cluster here is your test/dev environment. This three-cluster architecture doesn’t have to be exactly the same for you. For example, some of my customers run their management workloads along with the production applications. Other customers separate the management from production clusters but run their UAT environment along with their production workloads. These are all valid design scenarios, with pros and cons for each choice that are beyond our scope here. What I just want to point out is that this solution will work just fine with any of those three architecture choices.

5) SRM: This is just an illustration layer showing the SRM constructs in terms of Protection Groups and Recovery plans and their association with the operations layer beneath.

6) Operations: This layer details the various operations related to this architecture, from application owner provisioning all the way to the SRM admin recovery of workloads. We will come to these operational subjects in detail later in the article.

Now that we have had an overview of the architecture, it is time to discuss these subjects in detail.

It’s all about ‘Abstraction’

To build on what I have mentioned at the beginning of this article, our main goal here is to abstract as much infrastructure as possible in order to achieve the required flexibility in our design.

What you see here is that the entire vRealize Suite is abstracted from the traditional management network/portgroup (VLAN-backed) to an NSX Logical Switch (VXLAN-backed). This includes vRealize Automation, Operations, Business and Log Insight.

Traditionally, you would connect these components to a VLAN-backed portgroup which is most likely the same as your vCenter Server network. We are not doing this here because we want to maintain the same IP addressing of all these components to avoid, in turn, the requirement to change them when we failover to the second site.

If you look closely at the architecture, you will see that the NSX Logical Switch (or VXLAN) on each site has the same IP subnet, which is 172.16.0.0/24. The internal interface (LIF) of the DLRs at each site also has the same IP address, which is 172.16.0.1. This is the default gateway of all the vRealize Suite components. When you fail over those VMs from one site to another, firstly, they will maintain the same IP addresses, and secondly, they will still have the same default gateway (the DLR LIF interface, that is). Furthermore, they will be already configured with two DNS servers: 192.168.110.10 and 192.168.210.11. These are the two existing Active Directory domain controllers sitting at each site. In case of a disaster in the first site, 192.168.110.10 will be gone but 192.168.210.11 will still be alive on the second site that the vRS VMs are being failed over to. And guess what, all the DNS records are already the same since they are replicated across the two domain controllers. That’s basic AD functionality. For example, vra-portal.hypervizor.com resolves to 172.16.0.12, which is the vRA appliance. As another example, vra-sso.hypervizor.com resolves to 172.16.0.11, and that entry already exists on both DNS servers.
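
As a rough illustration of the point above, the following Python sketch encodes the addressing from the architecture and checks that nothing inside the guests would need to change after a failover. The addresses and hostnames are the ones quoted in this article; the function name and data layout are assumptions made for the example.

```python
import ipaddress

# Values taken from the architecture described above.
VRS_SUBNET = ipaddress.ip_network("172.16.0.0/24")
DLR_LIF = {"cairo": "172.16.0.1", "luxor": "172.16.0.1"}
DNS_SERVERS = {"cairo": "192.168.110.10", "luxor": "192.168.210.11"}
VRS_NODES = {
    "vra-sso.hypervizor.com": "172.16.0.11",
    "vra-portal.hypervizor.com": "172.16.0.12",
}


def changes_needed_after_failover(surviving_site: str) -> list[str]:
    """List guest-level settings that would have to change after failing over."""
    changes = []
    # Every node must already live in the subnet that exists on both sites.
    for name, ip in VRS_NODES.items():
        if ipaddress.ip_address(ip) not in VRS_SUBNET:
            changes.append(f"{name}: re-IP required")
    # The default gateway (DLR LIF) must be identical on both sites.
    gateway = ipaddress.ip_address(DLR_LIF[surviving_site])
    if gateway not in VRS_SUBNET or len(set(DLR_LIF.values())) != 1:
        changes.append("default gateway differs between sites")
    # At least one configured DNS server must sit at the surviving site.
    if surviving_site not in DNS_SERVERS:
        changes.append("no DNS server left at the surviving site")
    return changes


print(changes_needed_after_failover("luxor"))
```

An empty list means the vRS VMs can come up in Luxor with their existing IP, gateway and DNS settings.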

With that, we are achieving the maximum flexibility and the very least amount of changes required to be done after a failover takes place. We are maintaining the same IP addresses of the vRS nodes, we are maintaining the same default gateway, and we are maintaining also the name server configuration along with all the DNS records between sites.

Routing and Switch-over

So how does the routing work here and how do you switch-over from one site to another after a disaster? Let’s have a look.

You will see in the architecture that the external (uplink) interface of the DLR is connected to the traditional management VLAN in each site. In the case of the Cairo datacenter, it is 192.168.110.5. The default gateway of the latter is the actual L3 router/switch in that site (like a Cisco Nexus 7K in a typical core/aggregation/access design, or a Nexus 9K in a spine/leaf architecture). That DLR needs no further routing to be configured to go out to the physical network, but it needs a route back from that physical network to the abstracted virtual networks (172.16.0.0/24 in our case). This is exactly why we need to have a static route on that L3 device to say: in order to reach the 172.16.0.0/24 network, you have to go through the next hop router which is the DLR external interface (192.168.110.5). Easy enough, that’s basic networking. Of course you can configure and use dynamic routing (like OSPF) to exchange the routing information dynamically between the DLR and its upstream L3 device. We do not really need that here since it’s just one network that is static in nature and does not change.

Now, as long as the vRS is “living” in the Cairo datacenter, the static route on your L3 device will be active there. But what happens when we fail over to the second site? The answer is easy. At that point, the Cairo datacenter will be out of the picture, so you can adjust the routing on your DR site, which is Luxor in our case here. This is done simply by adding a static route entry like the one above: in order to reach the 172.16.0.0/24 network, your next hop router is 192.168.210.5. This is the DLR external interface that is already sitting idle there in the Luxor site. That very step could quite easily be part of your physical network team’s switch-over procedures when a disaster recovery is declared.
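
The switch-over logic can be sketched in a few lines of Python. This is only a toy model of a static route table with longest-prefix matching, not the configuration of any real L3 device; the class and variable names are hypothetical, while the prefixes and next hops are the ones used in this architecture.

```python
import ipaddress


class RouteTable:
    """Tiny static route table with longest-prefix-match lookup."""

    def __init__(self):
        self.routes = {}  # ip_network -> next-hop address (string)

    def add(self, prefix: str, next_hop: str) -> None:
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def lookup(self, destination: str):
        addr = ipaddress.ip_address(destination)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        # The most specific (longest) matching prefix wins.
        return self.routes[max(matches, key=lambda net: net.prefixlen)]


# Normal operations: the Cairo L3 device points at the Cairo DLR uplink.
cairo_core = RouteTable()
cairo_core.add("172.16.0.0/24", "192.168.110.5")

# Disaster declared: the network team adds the equivalent route in Luxor.
luxor_core = RouteTable()
luxor_core.add("172.16.0.0/24", "192.168.210.5")

print(cairo_core.lookup("172.16.0.12"))
print(luxor_core.lookup("172.16.0.12"))
```

The destination (the vRA appliance at 172.16.0.12) never changes; only the next hop that leads to it does.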

Replication, Protection and Recovery

We are using here, of course, vCenter Site Recovery Manager as the engine for orchestrating and automating our workload failover. There are a few things to examine here:

1) Replication:
I am leveraging vSphere Replication 5.8 here, mainly for the incredible flexibility it gives us, not to mention the great enhancements and performance improvements in the latest 5.8 release. We basically need to set up the replication of those vRS VMs only once, to replicate from Cairo to Luxor, and then set up our Protection Groups. If you already have array-based replication, and you are quite comfortable with it, then by all means you can still use it here. A typical configuration in this case would be to gather all your vRS VMs into one LUN and set your replication to the secondary site. The same Protection Group configuration can follow that.

2) Infrastructure Mapping:
It is important to set your infrastructure mapping between the two sites before you proceed to the Protection Groups configuration. Failing to do so could lead to some generic errors that you might not troubleshoot easily. For example, if you do not map your NSX Logical Switches together, the Protection Group configuration (in the next step) will fail with a generic error to check your settings.
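
A simple pre-flight check along these lines can save you from that generic error. The sketch below is plain Python working on hand-written data, not a call into the SRM API, and all the logical switch names are hypothetical.

```python
def unmapped_networks(protected_networks, network_mappings):
    """Return protected-site networks that have no recovery-site counterpart."""
    return sorted(net for net in protected_networks if net not in network_mappings)


# Hypothetical inventory names, used for illustration only.
protected = ["LS-vRS-Cairo", "LS-Prod-Web-Cairo"]
mappings = {"LS-Prod-Web-Cairo": "LS-Prod-Web-Luxor"}

missing = unmapped_networks(protected, mappings)
print(missing)
```

Anything the check returns should be mapped before you move on to the Protection Groups.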

3) Protection Groups and Recovery Plans:
After you set your replication for the VMs and map your infrastructure items, the next step is to set up your Protection Groups. In our case here, we are configuring two protection groups. The first is for the vRA nodes, and it will consist of the vRA Virtual Appliance, vRA SSO Appliance, Windows IaaS VM and the vRA Database. We are including the DB VM since it is a vital component of the vRA instance and it has to move along with the other VMs. You have the option here to do DB-based replication (like SQL Log Shipping) if you feel more comfortable with that. Needless to say, by replicating the entire VM, you guarantee a faster and automated recovery of the vRA instance.

The second Protection Group will contain the other vRealize Suite components like vRealize Operations, Business and Log Insight. You could combine those virtual appliances with the previous protection group, but for better and easier management, we are segregating them into two.

Next come the Recovery Plans. The configuration is fairly simple here: you point your recovery plan to the protection group. You could have one Recovery Plan here containing both the vRA and vRS Protection Groups mentioned above.
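
To make the relationship between these constructs concrete, here is a small Python model of the two Protection Groups and the single Recovery Plan. It is a sketch of the grouping only, not SRM code; the class names and the group and plan names are assumptions, while the VM lists come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class ProtectionGroup:
    name: str
    vms: list[str]


@dataclass
class RecoveryPlan:
    name: str
    groups: list[ProtectionGroup] = field(default_factory=list)

    def protected_vms(self) -> list[str]:
        """Flatten the plan into the VMs it covers, group by group."""
        return [vm for group in self.groups for vm in group.vms]


# Group and plan names are hypothetical; membership follows the article.
vra_pg = ProtectionGroup(
    "PG-vRA",
    ["vRA Virtual Appliance", "vRA SSO Appliance", "Windows IaaS VM", "vRA Database"],
)
vrs_pg = ProtectionGroup(
    "PG-vRS",
    ["vRealize Operations", "vRealize Business", "Log Insight"],
)
plan = RecoveryPlan("RP-Cloud-Platform", [vra_pg, vrs_pg])

print(plan.protected_vms())
```

Keeping the two groups separate while pointing one plan at both gives you the easier management mentioned above without splitting the recovery itself.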

All the above is relevant to the replication, protection and recovery of the vRS infrastructure. With the production workloads, we will be adopting a different configuration mechanism that will be, to some extent, automated via the vRA end-user portal. I will explain that in the next part of this blog post.

Static vs. Dynamic net/sec configurations

You may have already noticed that we have two types of clusters: one is designated as “NSX: Automated” and the other is not. To explain what that means, we have to look at the function of each cluster first.

The Production cluster in this architecture is designated to host workloads that are dynamically provisioned through vRA; however, we do not need to automate the underlying networking for it. In other words, if we have a typical multi-tier application with Web, DB and App VMs, those tiers will already have pre-provisioned VXLANs (or VLANs). In the case of VXLANs, or Logical Switches as we call them in the context of NSX, you can simply pre-configure them on the other side as well. This is pretty much what we did for the vRA infrastructure itself. If you look closely at that vRA app, you can more or less consider it a typical enterprise application with Web (vRA Appliance), DB (MS-SQL) and Application (Manager Service) tiers. These are static components that do not and will not change in nature. With that said, pre-configuring your networking on both sides is done only one time and that is it.

On the other hand, in the world of test/dev dynamic provisioning, you would need to dynamically provision the networking and services around your applications. Let’s say you are developing a SharePoint application. You would need not just to provision multiple instances of this app with the same networking requirements (e.g. the same IP addressing in isolated networks), but also to provision NSX Edge devices to load balance its web tier, as an example.

Now, since we cannot auto-sync the NSX configurations across sites (yet!), the test/dev workloads will not be failed over to the second site. Yes, you are losing those VMs in case of a disaster, but how many customers currently do VM recovery for their test/dev workloads? At the same time, we still have an advantage here when you look at the overall architecture. Your developers will still be able to provision test/dev workloads in the very same way they always do after a disaster recovery. The reason is that you will always have the spare capacity sitting in the DR site and your blueprints ready to provision workloads from the vRA portal that has shifted to that site after the recovery.

Nevertheless, there is an internal effort at VMware currently in the works to allow this type of net/sec configuration synchronization across the sites. If this is a hard requirement for you, you will still be able to do it once this mechanism (which will be driven by vRealize Orchestrator) is available. The key point here is that you do not have to change anything in this architecture; it will always be your foundation and can then support the net/sec sync across sites later (should you require that).

Interoperability Matrix

Before I conclude this first article of two, I would like to go through the interoperability matrix of the products in this architecture.

We have vSphere 5.5 as the foundation of everything, which translates to vCenter Server 5.5 and ESXi 5.5. We have vCenter Site Recovery Manager 5.8 along with vSphere Replication 5.8 for the DR automation, orchestration and replication. We then have the vRealize Suite 6.0, which consists of vRealize Automation 6.2, vRealize Operations Manager 6.0 and vRealize Business. Everything just mentioned is part of the vCloud Suite 5.8. The last component, and the most important I would say, is NSX 6.1 for vSphere.

One of the common questions I have received internally at VMware when I showcased this solution is whether it will work with vSphere 6.0 or not. The answer is absolutely yes! In fact, with vSphere 6.0 you would be able to take this to the next level and start live-migrating the entire vRealize Suite across the sites without any service interruption. Think about situations like datacenter-level maintenance or DC migrations/consolidation and how efficient that would be in terms of uptime and business continuance.

Conclusion

In this article I’ve explained the great advantage of abstracting your vRealize Suite components into NSX-driven virtual networks. With that, we have demonstrated how fast and reliable it is to recover your entire cloud infrastructure and the operational model around it in a matter of hours rather than days or weeks. We have done that without the need to change any settings, execute runbooks or stand up a new stack of software.

In the next part of this article, I will go into detail on the vRA configuration and the recovery operations of the cloud workloads. To complete the picture for you, I will also list the frequently asked questions that I have been receiving from colleagues or customers when presenting this solution to them.



Reference Architecture: Building your Hybrid Cloud with vCAC, vCHS and NSX

Introduction:

In this blog post, I will try to provide a practical approach to building your first hybrid cloud using various VMware solutions and services. In a nutshell, we will use vSphere for the private/on-prem cloud while we will be leveraging the vCloud Hybrid Service (vCHS) for the public/off-prem cloud. We will then bridge both clouds with a VPN tunnel using NSX (or vCNS, whichever is applicable to you). Last but definitely not least, we will stand up vCloud Automation Center (vCAC) as an abstracted layer above these two clouds for automation and policy-based provisioning of workloads in a true hybrid cloud model.

This is not going to be a detailed how-to guide, but rather an architecture discussion with some good practices that I have seen from my real-world experience in the field. The important thing to note also is that everything you will read in this post has been tested and proven to work.

The Architecture:


 

As you see in the architecture diagram above, we have four main areas to look at. Starting from the bottom up, we have NSX for vSphere running a site-to-site VPN tunnel with an Edge Gateway on vCHS. This is to bridge the communication between the two clouds. I will explain why in just a bit. Next, we have on the left the traditional vSphere infrastructure that NSX is connected to. On the right side we have our Virtual Private Cloud (VPC) on vCHS. The latter could also be a Dedicated Cloud on vCHS, depending on your business requirements. Lastly, at the top of the architecture, you can find vCAC as a single entry point to this hybrid cloud with policy-based configurations mapped to your business groups. Now let’s examine all that in detail.

The Business Driver:

Before we take a technical deep dive into all these components, let us stop first for a moment to understand the business driver behind this architecture. It is important to note here that I did not base this solution on a fictitious scenario. This is a real-world requirement from one of my large customers in the financial sector. Their main challenge is that they cannot predict the timing and the size of new projects that require IT to provide compute resources. Another challenge is that these requirements are so dynamic in nature that their Operations team cannot handle them using the traditional VM provisioning on vSphere. These projects run through three main phases: Test/Dev, UAT and Production. The second and third phases could be handled internally since the Ops team will have the time and exact requirements to plan for. It is the first phase that is unpredictable and requires the greatest amount of agility to respond to. These Test/Dev VMs are not an isolated type of workload that can be just spun up on any public cloud. The customer needs to have communication back and forth with internal infrastructure services (like AD, DNS, DB, Antivirus, etc.) or even corp apps/databases that cannot be hosted off-prem. Lastly, the templates used to spin up these workloads in phase one must be based on internally maintained corp images, not those that are available on the public cloud.

The Networking Infrastructure:

Now to the interesting part. We will start here with the underlying network infrastructure required for cross-site communications. As I have mentioned in the previous section, the workloads that are provisioned on the cloud (be it internal or external) must be able to communicate with the on-prem infrastructure services. To achieve that, we are setting up a site-to-site VPN tunnel between two gateways. Those gateways, in our case here, are NSX 6.0 for vSphere sitting on-prem, while on the other end we have an Edge Gateway running on vCHS. You can find many tutorials on the internet that describe this part in detail. Please note that you can utilize vCNS (part of the vCloud Suite) or any hardware-based VPN gateway instead of NSX. The choice is really yours. With NSX (or vCNS), you will need to have at least two interfaces, one internal to the corp network fabric and the other one external to the public Internet. The latter needs to be set with a public IP or NAT’ed in a DMZ, provided that it can receive the VPN connection requests from the internet (in our case here, from the Edge Gateway in vCHS). I wouldn’t personally hesitate to leverage NSX here due to the incredible flexibility that it has when it comes to the design and deployment. If you haven’t already done so, please take the time to review my blog posts on NSX as an SDN gateway on the internet. http://www.hypervizor.com/nsx/

On the other end, we will need to configure the VPN information on the vCHS Edge Gateway. At the time of this writing, these configuration parameters are not exposed on the vCHS UI itself, so you will have to jump into your vCloud Director UI and manage the Edge Gateway from there. Nevertheless, it is quite straightforward and it is almost identical to what you configured on your NSX end.

The Compute Resources

Now that we have set up our network piece and our traffic is flowing back and forth between the two clouds, it is time to construct our compute resources. On the private cloud part, this would typically be a vSphere cluster which we will designate here as “Production” to run the workloads in phase three. This could also host the UAT workloads, or we can simply allocate another cluster for that. The important part here is that the NSX Edge must have a network interface/route to the VM Networks on that cluster. This is shown on the architecture diagram as 192.168.110.0/24.

On the other side, and assuming that we have a VPC subscription on vCHS, we already get an Org-vDC with 5 GHz CPU (burstable to 10 GHz) and 20 GB Memory to start with. That will be our compute resource on the public cloud side. This, of course, could be scaled up on demand. As with our on-prem vSphere cluster, this Org-vDC must have a Routed Organization Network that acts as the internal interface for the Edge Device. This is illustrated on the architecture diagram as 172.16.1.0/24.
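
Since these two networks sit at either end of the site-to-site VPN tunnel, it is worth confirming that they do not overlap before configuring the tunnel; a routed tunnel cannot tell the two sides apart otherwise. Here is a minimal Python check using the two subnets from the diagram. The function name is hypothetical and the non-overlap rule is general VPN practice rather than something specific to NSX or vCHS.

```python
import ipaddress

on_prem = ipaddress.ip_network("192.168.110.0/24")  # VM network behind the NSX Edge
vchs = ipaddress.ip_network("172.16.1.0/24")        # Routed Organization Network on vCHS


def vpn_subnets_ok(local, peer) -> bool:
    """Routed site-to-site VPNs need distinct, non-overlapping local and peer subnets."""
    return not local.overlaps(peer)


print(vpn_subnets_ok(on_prem, vchs))
```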

Content Synchronization

Before we go to the provisioning part, let us stop for a minute to address the requirement of content synchronization across the clouds. As I mentioned earlier in the business requirements section, we need to maintain the core templates on premise and have these templates synchronized (or pushed out) to the public cloud for consumption through the blueprints, which we will talk about in just a bit. To do that, we just need to set up vCloud Connector (vCC) on our private cloud side only. vCC comes in two components – Server and Node. The vCC Server is a one-time setup that will typically sit in your management cluster. It acts as a hub (if you will) to coordinate the workload migrations and template sync across your private and public clouds. The Node is what you need to deploy for each internal endpoint (e.g. vSphere) or public cloud (e.g. vCHS). In the case of the latter, it is already set up for you. You just need to change the port number from 443 to 8443 on your vCD Organization URL when connecting to it and you will be all set. Again, there are some resources out there on the Internet that explain these configurations in detail.
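To make the port change concrete, here is a small Python sketch that rewrites an Organization URL to use port 8443. The function name and the example URL are hypothetical; substitute the Organization URL of your own vCHS subscription.

```python
from urllib.parse import urlsplit, urlunsplit


def vcc_node_url(org_url: str, node_port: int = 8443) -> str:
    """Return the same URL with its port replaced by node_port.

    An implicit port (plain https://host/...) and an explicit :443 are
    handled the same way. IPv6 literal hosts are not handled in this sketch.
    """
    parts = urlsplit(org_url)
    if parts.hostname is None:
        raise ValueError(f"no host in URL: {org_url!r}")
    netloc = f"{parts.hostname}:{node_port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Hypothetical Organization URL, for illustration only.
print(vcc_node_url("https://vchs.example.com/cloud/org/my-org/"))
```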

Once you have your vCC Server and Node components set up on-prem, you need to “Subscribe” to the template location to sync that content out to your public cloud endpoint. It is as simple as that.

vCloud Automation Center

This is where everything comes together to form our first and true hybrid cloud. In vCAC, we first need to add two endpoints: the first is the local vCenter Server that is managing the Production Cluster, and the second is the vCloud Director of our VPC subscription on vCHS. I will not go through the exact configuration steps, as there are already some excellent and detailed videos on YouTube by VMware Technical Marketing. I will list below some guidelines specific to our architecture here:
– Blueprints: You will need to create two different Blueprints, one for the local/prod VMs and the other for the remote/dev VMs. Each one should point to the relevant Reservation Policy (see the Reservation Policies guideline in this list).
– Templates: The Templates you set in the blueprints above will be selected respectively from vSphere and vCHS. Both will be identical since they are synced by vCloud Connector as explained in an earlier section.
– Customization: On the private cloud, you can leverage any vCenter Customization Specification to customize your template while provisioning. On the vCHS side, you do not have to specify one, as vCD takes care of this for you.
– Network Profiles: You will maintain both the external and internal IP addressing through the Network Profile configurations. You do not have to work around IP address conflicts even on your public cloud, since both vCAC and vCD will keep track of the consumed IPs.
– Reservation Policies: You will have to create two Reservation Policies here, one for the local compute cluster running on vSphere, and the second for the remote compute resources running on vCHS.
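The relationships in the guidelines above can be sketched as a simple lookup chain: a blueprint points to a reservation policy, which in turn points to an endpoint. The Python below is only a conceptual model. All the names in it (`prod-vm`, `local-prod`, `vchs-vpc` and so on) are made up for illustration, and this is not how vCAC stores or exposes these objects.

```python
# Endpoint name -> platform behind it (two endpoints, as described above).
ENDPOINTS = {
    "vcenter-prod": "vSphere",
    "vchs-vpc": "vCloud Director",
}

# Reservation policy -> endpoint whose compute resources it reserves.
RESERVATION_POLICIES = {
    "local-prod": "vcenter-prod",
    "remote-dev": "vchs-vpc",
}

# Blueprint -> reservation policy it points to.
BLUEPRINTS = {
    "prod-vm": "local-prod",
    "dev-vm": "remote-dev",
}


def placement(blueprint: str) -> tuple:
    """Follow blueprint -> reservation policy -> endpoint and return (endpoint, platform)."""
    policy = BLUEPRINTS[blueprint]
    endpoint = RESERVATION_POLICIES[policy]
    return endpoint, ENDPOINTS[endpoint]


for name in BLUEPRINTS:
    print(name, "->", placement(name))
```

The point of the model is that the requester only ever picks a blueprint; where the VM lands is decided by the policy the blueprint references.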

Putting it all together

The last piece in this architecture is to create the entitlements for each Blueprint/Service and the relevant approval processes. What is even more interesting is what you can do to delegate specific and precise capabilities to your internal users, or even to external contractors or consultants working on your projects. Imagine, as an example, that a new SharePoint consultant is on your site, ready to start the three-phase process of Test/Dev, UAT and Prod. You simply assign him an account on AD, provide him access to the vCAC portal, and give him entitlement to create VMs only on the public side of your hybrid cloud. This, of course, is still controlled by your approval process. Now that the consultant has his VM(s), he can connect to your on-prem infrastructure services per the security policies that you defined on NSX. Suppose that, during his work, the consultant wants to create a snapshot before he upgrades a specific component of his software. No problem, he has that permission right from his vCAC portal, so there is no need to bother your Ops team. But let’s even say that he messed something up so badly that he just wants to start fresh. Still no issue, he can “reprovision” his VM and have it prepared from scratch in a matter of minutes. In this case, he doesn’t really need another approval process (unless you define one) since he already got that the first time. Imagine how much time you have saved him and your team in operations tasks like this. And have I mentioned that this very consultant can even access the console of his VM(s) on vCHS through the VMRC without even touching vCD or knowing anything about it?

We have just scratched the surface here. There is so much that you can do with such a flexible architecture and agile infrastructure that is defined and driven by your very own business requirements.


Solution Architecture – VMware NSX 6.0 with vCloud Director – Part 2 of 2 – Remote VPN Access

In this second part of the NSX 6.0 solution architectures series, we are building on the previous blog post. If you haven’t already done so, please take a few minutes to read that post at this link.

In this solution, we are changing the way the end-users access the cloud services. In the last part it was direct web access that was NAT’ed through the External Edge. Here, we are adopting a different approach by enabling the SSL VPN-Plus service on the very same External Edge to allow the end-users to connect directly to the external perimeter network. Let’s examine that in detail.

The Architecture

NSX6-vCD-VPN-

As you can see from the diagram above, we now have one public IP address assigned to the “Uplink” interface of the external Edge. This IP address will be used as the VPN gateway that will allow our end-user to pass through to our External Perimeter network. Once this is configured, the end-user can point his/her browser to the URL https://vpn.hypervizor.com and from there log in with the username and password assigned by the service provider. The web interface/portal that the end-user logs in from is actually part of the SSL VPN-Plus service of the Edge. It is highly customizable as well, to reflect your corp identity.

From that portal, the end-user can either launch a published web application (like a traffic monitoring solution), or download the full VPN client to get connected to the external perimeter network as I’ve mentioned earlier. This VPN client has different versions for Windows, Linux or even Mac. Once the client is installed on the end-user machine, he/she can then authenticate (with the same user/pass) and get an IP address from the VPN pool (172.16.20.100 in our case here). At this point, the end-user can access either the vCD portal or other applications/services that are running on the external perimeter network, like vCenter Operations for monitoring the VMs or perhaps vCenter Chargeback for monitoring the usage and charges and so forth.

Note here that we can still load-balance the vCD portal (https service) and the VMRC like what we did in the last post. The difference here is that we are only load-balancing (but not NAT’ing) the 172.16.10.xx IPs directly. For example, we pick a new IP address, say, 172.16.10.4 and we designate it as the virtual IP of the https service so we configure it to LB to the vCD cell IPs 172.16.10.11 & 13. Same thing holds true for the VMRC service.
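As a rough illustration of what the virtual IP does, the Python sketch below hands out the two vCD cell addresses in turn for connections arriving at 172.16.10.4. Round-robin is an assumption made for the example; the post does not say which balancing algorithm is configured on the Edge, and "172.16.10.11 & 13" is read here as 172.16.10.11 and 172.16.10.13.

```python
from itertools import cycle

VIP = "172.16.10.4"                       # virtual IP of the https service
POOL = ["172.16.10.11", "172.16.10.13"]   # vCD cell IPs behind the VIP

# cycle() repeats the pool members endlessly, which is all round-robin needs.
picker = cycle(POOL)

# Simulate four consecutive connections to the VIP.
first_four = [next(picker) for _ in range(4)]
for backend in first_four:
    print(f"{VIP}:443 -> {backend}:443")
```

A real load balancer also health-checks the members and skips the ones that are down, which this sketch leaves out.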

Configuring the Parameters

Now let’s take a closer look into some important configuration parameters related to the SSL VPN-Plus service:

Screen Shot 2013-11-01 at 5.13.21 PM

In the IP Pool as you see from the screenshot above, you (as the service provider) can set here the IP address pool(s) you want your end-customers to consume their IP configuration from (e.g 172.16.20.100-200). Note here that the IP address you give as a default gateway will be assigned automatically to the Edge. You may also want to setup a dedicated DNS on the external perimeter network for name resolution. The hostnames here would be, for example, anything.ext.hypervizor.com. And the IP addresses being resolved are basically the ones on the 172.16.10.x subnet.

Screen Shot 2013-11-01 at 5.14.15 PM

The Private Networks here are the networks that you want your end-user to connect to. In our case here it is the external perimeter network (172.16.10.0/24). Of course, the External Edge will automatically take care of the routing between the 172.16.20.0/24 and 172.16.10.0/24 networks.
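The IP Pool and Private Networks settings boil down to two questions: which address may a VPN client receive, and which destinations does it reach through the tunnel? Here is a minimal Python sketch using the values from this post. The function names are hypothetical, and the sketch only models the address arithmetic, not the Edge itself.

```python
from ipaddress import ip_address, ip_network

# IP Pool handed out to VPN clients (172.16.20.100-200 in this post).
POOL_START = ip_address("172.16.20.100")
POOL_END = ip_address("172.16.20.200")

# Private Networks published to VPN clients.
PRIVATE_NETWORKS = [ip_network("172.16.10.0/24")]


def pool_size() -> int:
    """Number of addresses in the pool, both ends included."""
    return int(POOL_END) - int(POOL_START) + 1


def in_pool(addr: str) -> bool:
    """True if addr could have been handed out from the IP Pool."""
    return POOL_START <= ip_address(addr) <= POOL_END


def reachable_over_vpn(dest: str) -> bool:
    """True if dest falls inside one of the published Private Networks."""
    return any(ip_address(dest) in net for net in PRIVATE_NETWORKS)


print(pool_size())
print(in_pool("172.16.20.100"), reachable_over_vpn("172.16.10.4"))
```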

Screen Shot 2013-11-01 at 5.16.13 PM

The last part I want to show you here is the Installation Package. Here, you can add the external hostname and IP address of the Edge Gateway (or VPN Service if you will). In our case, the hostname is vpn.hypervizor.com and it is resolving to the public IP: 66.147.44.221. You can see also that this package will be available in Windows, Linux or Mac executables.

A look from an End-User perspective

When our end-user connects to the vpn.hypervizor.com from the web, he would get this screen (which again, is customizable).

Screen Shot 2013-11-01 at 5.25.15 PM

Once logged in for the first time, the user can download the VPN package (titled Hyper-VPN in our case here).

Screen Shot 2013-11-01 at 5.27.33 PM

After the installation is done, the end-user can fire up the VPN program and connect to the network.

Screen Shot 2013-11-01 at 5.29.04 PM

Conclusion

As you have seen in these two blog posts, the NSX 6.0 for vSphere product can play a major role in improving, securing and simplifying the way you used to architect and configure your L2-L7 services. We’ve just scratched the surface here to show some of the cool things that can be done with NSX. Imagine the amount of innovation and creativity you (as an architect or an administrator) can have now. What really stands out for me here is how fast you can get all this from sketching the ideas on a piece of paper to real services up and running in a matter of minutes or hours at the most!


Solution Architecture – VMware NSX 6.0 with vCloud Director – Part 1 of 2 – External Portal Access

This is a two-part blog post to show you two different solution architectures for VMware NSX 6.0 and vCloud Director. These architectures are focused primarily on securing, load-balancing and publishing the vCD portal through NSX. Although this blog post is typically targeted at Service Providers, I do not see why it wouldn’t fit the bill for an enterprise as well that would be interested in publishing its cloud service for external consumption (by partners, subsidiaries, contractors, etc.).

 

Before we go any further into the details, please note that this solution is applicable to *any* vCD release since we are not touching here specific interoperability features between the two products. In fact, the very same concept I am presenting here could be applied to almost any application that needs to be exposed to the internet. I’ve chosen vCD here as a case in point. I have been involved in so many public cloud projects and I know first hand how challenging it is, using traditional solutions, to achieve the same results. I can’t remember how many hours, days (and sometimes even weeks!) I had to wait for customers to get a simple load-balancing configuration done correctly, not to mention the IP allocations from the network teams or, even worse, the security part and what needs to be opened, closed or monitored. In this blog post, you will see how all that can now be done in a matter of minutes with minimal network/security intervention. Welcome to the NSX world.
NSX-vCD

Preparing the layout components

We will need to have our two vCD cells ready and configured with three network cards. I blogged about this a long time ago in a similar solution here (using the traditional and hard way). To recap, the first two NICs will be assigned to the upstream HTTP and VMRC services. The third NIC will be assigned to interface with the downstream management services (e.g. vCenter Server, DB, DNS, etc.).

 

From the NSX part, we will need to create two Logical Switches identified as “External Perimeter” and “Internal Perimeter”. The former will be connected to the first two NICs of the cell, and the latter will be connected to the third NIC. Next, we will need to provision a couple of NSX Edges. The first one, again, will be identified as an “External Edge” and the second one will be an “Internal Edge”. This latter Edge could be either an NSX Edge Services GW or an NSX Logical Router/Bridge. This depends on your use case and how you require to route/firewall your internal perimeter zone. More on that in a future post.

 

The NSX Logical Switches
Screen Shot 2013-10-30 at 11.43.58 AM

 

The NSX Edges
Screen Shot 2013-10-30 at 11.45.54 AM

The Internal Edge

In this Edge, we will create two interfaces. The first is the “Uplink”, to be connected to the traditional management network (D-Portgroup) that you already have in your environment (the one that typically has the vCenter Server, ESXi hosts, etc.). The second is the “Internal” interface, to be connected to the “Internal Perimeter” Logical Switch. See the screenshot below.
Screen Shot 2013-10-30 at 11.50.38 AM
Note that the Internal interface has an IP address of “10.20.30.1” which will be the default gateway for the third interface on the cells. On the other side, the Uplink interface of the Edge has the IP address 192.168.110.5 and the default gateway for that Edge should be your existing physical router/switch on the network. Confused? Have a look into the diagram as it reflects all these configurations with the same exact IP addresses and NSX components.
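One quick way to double-check this addressing is to confirm that the default gateway of a cell’s third NIC sits on the same subnet as that NIC. The Python sketch below does that with the two Edge addresses from this post. The /24 prefix lengths and the cell NIC address 10.20.30.11 are assumptions for illustration, since the post only gives the Edge interface IPs.

```python
from ipaddress import ip_address, ip_interface

# Internal Edge interfaces from this post; the /24 masks are assumed.
INTERNAL_IF = ip_interface("10.20.30.1/24")
UPLINK_IF = ip_interface("192.168.110.5/24")


def gateway_is_on_link(nic: str, gateway: str) -> bool:
    """True if the gateway address lies inside the NIC's own subnet."""
    return ip_address(gateway) in ip_interface(nic).network


# Hypothetical address for the third NIC of a vCD cell.
cell_nic3 = "10.20.30.11/24"

print(gateway_is_on_link(cell_nic3, str(INTERNAL_IF.ip)))  # Edge internal interface
print(gateway_is_on_link(cell_nic3, str(UPLINK_IF.ip)))    # Edge uplink, different subnet
```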

The External Edge

Unlike the one above, this Edge has to be provisioned as an “Edge Services Gateway”. We do not have the option here of making it a “Logical Router/Bridge” since we need different services like Load-balancing, NAT’ing, Firewall’ing (and in the next solution, a VPN Gateway). Now let’s examine this Edge in detail.
Screen Shot 2013-10-30 at 12.11.30 PM
As you see from the screenshot above, we still have two interfaces for this Edge. One is an “Internal” interface, connected to the “External Perimeter” logical switch. The second is an “Uplink” interface, connected to the Internet router. This is typically a port group with dedicated uplink NICs to the DMZ switches in your Management Cluster. Note here that the Uplink has two IP addresses. In the screenshot you can see them as (192.168.225 & 226), while on the diagram you will see them as (66.147.244.221 & 222). As you may have guessed, I do not have public IP addresses in my lab; instead, I am using a different external network to simulate Internet connectivity. Now, why do we have two IP addresses set on the Uplink? The answer is that we will have one IP address dedicated to the HTTP service, and the second dedicated to the VMRC service. Both will be NAT’ed and load-balanced to the first two IPs/interfaces of the vCD Cells. Again, have a look at the diagram, as it will save you a thousand words of explanation.

The Load-Balancing

Since this post is intended to show the architecture of the solution rather than the how-to, I will not go through the details of configuring the load-balancing (I will probably do that in a future post/video). The thing to note here, though, is that NSX will take care of creating the NAT rules. This point had me confused at the beginning, when I thought that I had to do the NAT’ing first. In our case here, as soon as the load-balancing configuration is done, NSX will automatically publish the NAT rules as shown in the screenshot below.
Screen Shot 2013-10-30 at 1.02.38 PM
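Conceptually, the NAT rules can be derived from the load-balancing configuration, which is why you do not have to write them yourself. The Python sketch below models that idea only. It is not the rule format NSX generates, and the class and function names are hypothetical. The public IPs are the ones from the diagram and the HTTP cell IPs come from the second part of this series; the VMRC cell IPs (172.16.10.12 and 172.16.10.14) and the use of port 443 for both services are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class VirtualServer:
    """A load-balanced service published on one uplink IP (illustrative model)."""
    name: str
    public_ip: str
    port: int
    members: tuple  # back-end IPs on the External Perimeter network


def derive_nat_rules(servers: List[VirtualServer]) -> List[Dict[str, object]]:
    """Build one destination-NAT entry per virtual server."""
    return [
        {
            "service": vs.name,
            "original": f"{vs.public_ip}:{vs.port}",
            "translated": [f"{m}:{vs.port}" for m in vs.members],
        }
        for vs in servers
    ]


servers = [
    VirtualServer("HTTP", "66.147.244.221", 443, ("172.16.10.11", "172.16.10.13")),
    # VMRC member addresses are assumed for this example.
    VirtualServer("VMRC", "66.147.244.222", 443, ("172.16.10.12", "172.16.10.14")),
]

nat_rules = derive_nat_rules(servers)
for rule in nat_rules:
    print(rule)
```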

Configuring the relevant Firewall rules

I recommend that you keep this as the last step, after configuring and testing your environment fully. After that, you can start enabling the relevant firewall rules to open/close specific ports. On the External Edge IPs, you typically want to open only ports 80 (http) and 443 (https) since all the vCD communication happens over SSL. For the Internal Edge, you will need to open the ports that are required by vCD to communicate with your management servers/services like vCenter Server, ESXi hosts, DNS, NTP, etc. I’ve produced a (very old!) diagram showing a sample of these ports here. Make sure to get the up-to-date list for the vCD version you are running in your environment.
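The External Edge policy described above amounts to "allow 80 and 443, drop everything else". Here is a minimal Python sketch of first-match rule evaluation with a default deny. The `Rule` class and `evaluate` function are hypothetical and only illustrate the logic; they do not reflect how the NSX Edge firewall is actually configured or implemented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    action: str           # "allow" or "deny"
    port: Optional[int]   # None matches any destination port


# Rules for the External Edge IPs, evaluated top to bottom.
EXTERNAL_EDGE_RULES = [
    Rule("allow", 80),    # http
    Rule("allow", 443),   # https
    Rule("deny", None),   # explicit catch-all
]


def evaluate(rules: List[Rule], port: int) -> str:
    """Return the action of the first rule that matches the destination port."""
    for rule in rules:
        if rule.port is None or rule.port == port:
            return rule.action
    return "deny"  # default deny if no rule matched


for p in (80, 443, 22):
    print(p, evaluate(EXTERNAL_EDGE_RULES, p))
```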

Conclusion

As you have seen, provisioning network and security services has never been easier. With NSX, we did all the L2-L7 provisioning and configuration right from one console and yet had minimal dependencies on the physical network. Looking closer into this architecture, you can see how we are securing the vCD cells from both the upstream and downstream traffic. If a hacker were to break into the cells through the first/external firewall, he/she would still need to go through another firewall to touch your network. Things can get even more interesting when we look at the NSX extensibility. For example, we can hook up a virtual IPS to the external Edge from any VMware security partner (e.g. Symantec) and have our traffic deeply inspected against exploits and vulnerabilities targeting the Linux or vCD software. The possibilities are really endless here.

In my next post, I will show you a different approach to this by enabling the SSL VPN-Plus feature on the external Edge and how that will change the external access to the vCD cells.

P.S. If you are a VMware employee, I have this lab running on the VMware OneCloud if you want to examine the configurations, functionalities or architecture. Reach out to me over email to give you access.
