
Networking configurations for Hyper-V over SMB in Windows Server 2012 and Windows Server 2012 R2


One of the questions regarding Hyper-V over SMB that I get the most relates to how the network should be configured. Networking is key to several aspects of the scenario, including performance, availability and scalability.

The main challenge is to provide a fault-tolerant and high-performance network for the two clusters typically involved: the Hyper-V cluster (also referred to as the Compute Cluster) and the Scale-out File Server Cluster (also referred to as the Storage Cluster).

Not too long ago, the typical configuration for virtualization deployments would call for up to 6 distinct networks for these two clusters:

  • Client (traffic between the outside and VMs running in the Compute Cluster)
  • Storage (main communications between the Compute and Storage clusters)
  • Cluster (communication between nodes in both clusters, including heartbeat)
  • Migration (used for moving VMs between nodes in the Compute Cluster)
  • Replication (used by Hyper-V replica to send changes to another site)
  • Management (used for configuring and monitoring the systems, typically also including DC and DNS traffic)

These days, it’s common to consolidate these different types of traffic, with the proper fault tolerance and Quality of Service (QoS) guarantees.

There are certainly many different ways to configure the network for your Hyper-V over SMB, but this blog post will focus on two of them:

  • A basic fault-tolerant solution using just two physical network ports per node
  • A high-end solution using RDMA networking for the highest throughput, highest density, lowest latency and low CPU utilization.

Both configurations presented here work with Windows Server 2012 and Windows Server 2012 R2, the two versions of Windows Server that support the Hyper-V over SMB scenario.

 

Configuration 1 – Basic fault-tolerant Hyper-V over SMB configuration with two non-RDMA ports

 

The solution below uses two network ports for each node of both the Compute Cluster and the Storage Cluster. NIC teaming is the main technology used for fault tolerance and load balancing.

[Diagram: Configuration 1]

Notes:

  • A single dual-port network adapter per host can be used. Network failures are usually related to cables and switches, not the NIC itself. If the NIC does fail, failover clustering on the Hyper-V or Storage side would kick in. Two single-port network adapters are also an option.
  • The 2 VNICs on the Hyper-V host are used to provide additional throughput for the SMB client via SMB Multichannel, since a VNIC does not support RSS (Receive Side Scaling, which helps spread the CPU load of networking activity across multiple cores). Depending on the configuration, increasing this to up to 4 VNICs per Hyper-V host might further improve throughput.
  • You can use additional VNICs that are dedicated for other kinds of traffic like migration, replication, cluster and management. In that case, you can optionally configure SMB Multichannel constraints to limit the SMB client to a specific subset of the VNICs. More details can be found in item 7 of the following article: The basics of SMB Multichannel, a feature of Windows Server 2012 and SMB 3.0
  • If RDMA NICs are used in this configuration, their RDMA capability will not be leveraged, since the physical port capabilities are hidden behind NIC teaming and the virtual switch.
  • Network QoS should be used to tame each individual type of traffic on the Hyper-V host. In this configuration, it’s recommended to implement the network QoS at the virtual switch level. See http://technet.microsoft.com/en-us/library/jj735302.aspx for details (the above configuration matches the second one described in the linked article).
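As a rough illustration, here is a minimal PowerShell sketch of how these pieces could be wired together on a Hyper-V host. The adapter, team, switch and VNIC names, as well as the QoS weights and the file server name in the last line, are just example values to adjust to your environment; the last line shows the optional Multichannel constraint mentioned above.

# Team the two physical ports and create a QoS-capable virtual switch on top of the team
New-NetLbfoTeam -Name "Team1" -TeamMembers "NIC1","NIC2" -TeamingMode SwitchIndependent
New-VMSwitch -Name "VSwitch1" -NetAdapterName "Team1" -MinimumBandwidthMode Weight -AllowManagementOS $false

# Host VNICs: two for SMB (to allow SMB Multichannel) and one for management
Add-VMNetworkAdapter -ManagementOS -Name "SMB1" -SwitchName "VSwitch1"
Add-VMNetworkAdapter -ManagementOS -Name "SMB2" -SwitchName "VSwitch1"
Add-VMNetworkAdapter -ManagementOS -Name "MGMT" -SwitchName "VSwitch1"

# Network QoS at the virtual switch level (weights are examples)
Set-VMNetworkAdapter -ManagementOS -Name "SMB1" -MinimumBandwidthWeight 30
Set-VMNetworkAdapter -ManagementOS -Name "SMB2" -MinimumBandwidthWeight 30
Set-VMNetworkAdapter -ManagementOS -Name "MGMT" -MinimumBandwidthWeight 10
Set-VMSwitch -Name "VSwitch1" -DefaultFlowMinimumBandwidthWeight 30

# Optional: constrain the SMB client to the SMB VNICs only (server and interface names are examples)
New-SmbMultichannelConstraint -ServerName "FS1" -InterfaceAlias "vEthernet (SMB1)","vEthernet (SMB2)"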

 

Configuration 2 – High-performance fault-tolerant Hyper-V over SMB configuration with two RDMA ports and two non-RDMA ports

 

The solution below requires four network ports for each node of both the Compute Cluster and the Storage Cluster, two of them being RDMA-capable. NIC teaming is the main technology used for fault tolerance and load balancing on the two non-RDMA ports, while SMB Multichannel covers those capabilities for the two RDMA ports.

[Diagram: Configuration 2]

Notes:

  • Two dual-port network adapters per host can be used, one RDMA and one non-RDMA.
  • In this configuration, Storage, Migration and Clustering traffic should leverage the RDMA path. The client, replication and management traffic should use the teamed NIC path.
  • In this configuration, if using Windows Server 2012 R2, Hyper-V should be configured to use SMB for Live Migration. This is not the default setting.
  • The SMB client will naturally prefer the RDMA paths, so there is no need to specifically configure that preference via SMB Multichannel constraints.
  • There are three different types of RDMA NICs that can be used: iWARP, RoCE and InfiniBand. Below are links to step-by-step configuration instructions for each one:
  • Network QoS should be used to tame traffic flowing through the virtual switch on the Hyper-V host. If your NIC and switch support Data Center Bridging (DCB) and Priority Flow Control (PFC), there are additional options available as well. See http://technet.microsoft.com/en-us/library/jj735302.aspx for details (the above configuration matches the fourth one described in the linked article).
  • In most environments, RDMA provides enough bandwidth without the need for any traffic shaping. If using Windows Server 2012 R2, SMB Bandwidth Limits can optionally be used to shape the Storage and Live Migration traffic. More details can be found in item 4 of the following article: What’s new in SMB PowerShell in Windows Server 2012 R2. SMB Bandwidth Limits can also be used for Configuration 1, but it's more common here.
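For Windows Server 2012 R2, here is a hedged sketch of the two settings mentioned above: switching Live Migration to SMB and optionally capping Live Migration traffic with an SMB bandwidth limit. The 2GB/sec figure is just an example, and the SMB Bandwidth Limit feature (FS-SMBBW) must be installed before the limit cmdlets become available.

# Use SMB (and therefore SMB Direct / SMB Multichannel) for Live Migration
Set-VMHost -VirtualMachineMigrationPerformanceOption SMB

# Optional: cap Live Migration traffic so it does not crowd out Storage traffic (example value)
Add-WindowsFeature FS-SMBBW
Set-SmbBandwidthLimit -Category LiveMigration -BytesPerSecond 2GB
Get-SmbBandwidthLimit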

 

I hope this blog post helps with the network planning for your Private Cloud deployment. Feel free to ask questions via the comments below.

 

 


Automatic SMB Scale-Out Rebalancing in Windows Server 2012 R2


Introduction

 

This blog post focuses on the new SMB Scale-Out Rebalancing introduced in Windows Server 2012 R2. If you haven’t seen it yet, it delivers a new way of balancing file clients accessing a Scale-Out File Server.

In Windows Server 2012, each client would be randomly directed via DNS Round Robin to a node of the cluster and would stick with that node for all shares and all traffic going to that Scale-Out File Server. If necessary, some server-side redirection of individual IO requests could happen in order to fulfill the client request.

In Windows Server 2012 R2, a single client might be directed to a different node for each file share. The idea here is that the client will connect to the best node for each individual file share in the Scale-Out File Server Cluster, avoiding any kind of server-side redirection.

Now there are some details about when redirection can happen and when the new behavior will apply. Let’s look into the 3 types of scenarios you might encounter.

 

Hyper-V over SMB with Windows Server 2012 and a SAN back-end (symmetric)

 

When we first introduced the SMB Scale-Out File Server in Windows Server 2012, as mentioned in the introduction, the client would be randomly directed to one and only one node for all shares in that cluster.

If the storage is equally accessible from every node (what we call symmetric cluster storage), then you can do reads and writes from every file server cluster node, even if it’s not the owner node for that Cluster Shared Volume (CSV). We refer to this as Direct IO.

However, metadata operations (like creating a new file, renaming a file or locking a byte range on a file) must be orchestrated across the cluster and will be executed on a single node called the coordinator node or the owner node. Any other node will simply redirect these metadata operations to the coordinator node.

The diagram below illustrates these behaviors:

 


Figure 1: Windows Server 2012 Scale-Out File Server on symmetric storage

 

The most common example of symmetric storage is when the Scale-Out File Server is put in front of a SAN. The common setup is to have every file server node connected to the SAN.

Another common example is when the Scale-Out File Server is using a clustered Storage Spaces solution with a shared SAS JBOD using Simple Spaces (no resiliency).

 

Hyper-V over SMB with Windows Server 2012 and Mirrored Storage Spaces (asymmetric)

 

When using Mirrored Storage Spaces, the CSV operates in a block-level redirected IO mode. This means that every read and write to the volume must be performed through the coordinator node of that CSV.

This configuration, where not every node has the ability to read/write to the storage, is generically called asymmetric storage. In those cases, every data and metadata request must be redirected to the coordinator node.
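If you want to see this behavior for yourself, the cluster exposes the current CSV IO mode per node. A quick check, run on any file server cluster node (cmdlet available in Windows Server 2012 and later; output will vary with your configuration):

# Shows, per node, whether each CSV is doing Direct IO or redirected IO, and why
Get-ClusterSharedVolumeState | Format-Table Name, Node, StateInfo, FileSystemRedirectedIOReason, BlockRedirectedIOReason -AutoSize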

In Windows Server 2012, the SMB client chooses one of the nodes of the Scale-Out File Server cluster using DNS Round Robin and that may not necessarily be the coordinator node that owns the CSV that contains the file share it wants to access.

In fact, if using multiple file shares in a well-balanced cluster, it’s likely that the node will own some but not all of the CSVs required.

That means some SMB requests (for data or metadata) are handled by the node and some are redirected via the cluster back-end network to the right owner node. This redirection, commonly referred to as “double-hop”, is a very common occurrence in Windows Server 2012 when using the Scale-Out File Server combined with Mirrored Storage Spaces.

It’s important to mention that this cluster-side redirection is something that is implemented by CSV and it can be very efficient, especially if your cluster network uses RDMA-capable interfaces.

The diagram below illustrates these behaviors:

 


Figure 2: Windows Server 2012 Scale-Out File Server on asymmetric storage

 

The most common example of asymmetric storage is when the Scale-Out File Server is using a Clustered Storage Spaces solution with a Shared SAS JBOD using Mirrored Spaces.

Another common example is when only a subset of the file server nodes is directly connected to a portion of the backend storage, be it Storage Spaces or a SAN.

A possible asymmetric setup would be a 4-node cluster where 2 nodes are connected to one SAN and the other 2 nodes are connected to a different SAN.

 

Hyper-V over SMB with Windows Server 2012 R2 and Mirrored Storage Spaces (asymmetric)

 

If you’re following my train of thought here, you probably noticed that the previous configuration has a potential for further optimization and that’s exactly what we did in Windows Server 2012 R2.

In this new release, the SMB client gained the flexibility to connect to different Scale-Out File Server cluster nodes for each independent share that it needs to access.

The SMB server also gained the ability to tell its clients (using the existing Witness protocol) what is the ideal node to access the storage, in case it happens to be asymmetric.

With the combination of these two behavior changes, a Windows Server 2012 R2 SMB client and server are able to optimize the traffic, so that no redirection is required even for asymmetric configurations.

The diagram below illustrates these behaviors:

 


Figure 3: Windows Server 2012 R2 Scale-Out File Server on asymmetric storage

 

Note that the SMB client now always talks to the Scale-Out File Server node that is the coordinator of the CSV where the share is.

Note also that the CSV ownership is spread across the nodes in the example. That is not a coincidence. The cluster now includes the ability to spread CSV ownership across the nodes uniformly.

If you add or remove nodes or CSVs in the Scale-Out File Server cluster, the CSVs will be rebalanced. The SMB clients will then also be rebalanced to follow the CSV owner nodes for their shares.

 

Key configuration requirements for asymmetric storage in Windows Server 2012 R2

 

Because of this new automatic rebalancing, there are key new considerations when designing asymmetric (Mirrored or Parity Storage Spaces) storage when using Windows Server 2012 R2.

First of all, you should have at least as many CSVs as you have file server cluster nodes. For instance, for a 3-node Scale-Out File Server, you should have at least 3 CSVs. Having 6 CSVs is also a valid configuration, which will help with rebalancing when one of the nodes is down for maintenance.

To be clear, if you have a single CSV in such an asymmetric configuration in a Windows Server 2012 R2 Scale-Out File Server cluster, only one node will be actively accessed by SMB clients.

You should also try, as much as possible, to have your file shares and workloads evenly spread across the multiple CSVs. This way you won’t have some nodes working much harder than others.
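To check how things landed after rebalancing, a couple of read-only cmdlets run on one of the file server cluster nodes are enough (output will obviously vary with your configuration):

# Which node currently owns each CSV (ownership should be spread evenly across the nodes)
Get-ClusterSharedVolume | Format-Table Name, OwnerNode -AutoSize

# Which file server node each SMB client has been directed to (Witness registrations)
Get-SmbWitnessClient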

 

Forcing per-share redirection for symmetric storage in Windows Server 2012 R2

 

The new per-share redirection does not happen by default in the Scale-Out File Server if the back-end storage is found to be symmetric.

For instance, if every node of your file server is connected to a SAN back-end, you will continue to have the behavior described on Figure 1 (Direct IO from every node plus metadata redirection).

The CSVs will automatically be balanced across file server cluster nodes even in symmetric storage configurations. You can turn that behavior off using the cmdlet below, although I'm hard pressed to find any good reason to do it.

(Get-Cluster).CsvBalancer = 0

However, when using symmetric storage, the SMB clients will continue to each connect to a single file server cluster node for all shares. We opted for this behavior by default because Direct IO tends to be efficient in these configurations and the amount of metadata redirection should be fairly small.

You can override this setting and make the symmetric cluster use the same rebalancing behavior as an asymmetric cluster by using the following PowerShell cmdlet:

Set-ItemProperty HKLM:\System\CurrentControlSet\Services\LanmanServer\Parameters -Name AsymmetryMode -Type DWord -Value 2 -Force

You must apply the setting above to every file server cluster node. The new behavior won’t apply to existing client sessions.
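Since the setting has to land on every node, one way to push it out from a single machine is a sketch like the one below, assuming PowerShell remoting is enabled between the nodes (the registry path and value are the same ones shown above):

# Apply the AsymmetryMode override on every file server cluster node
Get-ClusterNode | ForEach-Object {
    Invoke-Command -ComputerName $_.Name -ScriptBlock {
        Set-ItemProperty HKLM:\System\CurrentControlSet\Services\LanmanServer\Parameters -Name AsymmetryMode -Type DWord -Value 2 -Force
    }
}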

If you switch to this configuration, you must apply the same planning rules outlined previously (at least one CSV per file server node, ideally two).

 

Conclusion

 

I hope this clarifies the behavior changes introduced with SMB Scale-Out Automatic Rebalancing in Windows Server 2012 R2.

While most of it is designed to just work, I do get some questions about it from those interested in understanding what happens behind the scenes.

Let me know if you find these details useful or if you have any additional questions.

Selecting the number of nodes for your Scale-Out File Server Cluster


I recently got a stream of e-mails and questions about the maximum number of cluster nodes you can have in a Scale-Out File Server cluster. For the record, we test and support up to 8 nodes per file server cluster. This is the case for both Windows Server 2012 (which introduced the Scale-Out File Server cluster feature) and Windows Server 2012 R2.

 


 

However, the real question is usually "How many file server nodes do I need for my Scale-Out File Server cluster?" The most common scenarios we see involve the deployment of 2 to 4 file server nodes per cluster, with just a few people considering 8 nodes. Here are some arguments for each cluster size:

  • 2 is the bare minimum for achieving continuous availability (transparent failover). We see a lot of these two-node clusters out there.
  • 3 is a good idea to allow you to still have continuous availability even when you're doing maintenance on a node.
  • 4 allows you to upgrade the cluster without extra hardware by evicting two nodes, installing the new operating system and using the Copy Cluster Roles Wizard.
  • 8 will allow you to combine the network throughput and computing power of the many nodes to create an amazing file sharing infrastructure.

On the topic of performance, keep in mind that the Scale-Out concept means that you have the ability to linearly scale the cluster by adding more nodes to achieve higher IOPS and throughput. We have proved it in our test labs. However, we have also shown that a single file server cluster node can deliver over 200,000 IOPS at 8KB each, or over 2GB/sec of throughput, with a fairly standard server configuration (two modern Intel CPUs with a few cores each, dual 10GbE RDMA network interfaces, and two SAS ports, each with four lanes of 6Gbps SAS shared storage with an SSD tier). We have also shown high-end file server configurations achieving over a million IOPS and over 16.5GB/sec from a single node. So, there are only a few scenarios that would require anything close to 8 nodes solely to achieve specific performance goals.

Also, keep in mind that the reasons for adopting a Scale-Out File Server usually include a combination of availability, performance and scalability of the resulting file service. While the number of nodes is an important ingredient to achieve that, you should never overlook related configuration decisions like the number and model of CPU you're using, the number and type of network interfaces, the number and speed of the SAS ports, the proper deployment of tiered storage with the right class of SSDs and even the generation and number of lanes of the PCIe slots you use. Two well sized and properly fitted file server nodes can easily beat eight nodes that were poorly put together.

Here are a few additional references for those looking to dig deeper:

I'm not sure what's driving this wave of questions about the Scale-Out File Server cluster limits, but I hope this helps clarify the topic. Have you ever encountered a scenario where you would need more than 8 nodes? We would love to hear it...

Is accessing files via a loopback share the same as using a local path?


Question from a user (paraphrased): When we access a local file via a loopback UNC path, is this the same as accessing it via the local path? I mean, is "C:\myfolder\a.txt" equal to "\\myserver\myshare\a.txt", or will I be using TCP/IP in any way?

Answer from SMB developer: When accessing files over loopback, the initial connect and the metadata operations (open, query info, query directory, etc.) are sent over the loopback connection. However, once a file is open we detect it and forward reads/writes directly to the file system such that TCP/IP is not used. Thus there is some difference for metadata operations, but data operations (where the majority of the data is transferred) behave just like local access.

How much traffic needs to pass between the SMB Client and Server before Multichannel actually starts?


One smart MVP was doing some testing and noticed that SMB Multichannel did not trigger immediately after an SMB session was established. So, he asked: How much traffic needs to pass between the SMB Client and Server before Multichannel actually starts?

Well... SMB Multichannel works slightly differently in that regard depending on whether the client is running Windows 8 or Windows Server 2012.

On Windows Server 2012, SMB Multichannel starts whenever an SMB read or SMB write is issued on the session (but not other operations). For servers, network fault tolerance is a key priority and sessions are typically long lasting, so we set up the extra channels as soon as we detect any read or write.

SMB Multichannel in Windows 8 will only engage if there are a few IOs in flight at the same time (technically, when the SMB window size gets to a certain point). The default for this WindowSizeThreshold setting is 8 (meaning that there are at least 8 packets asynchronously in flight). That requires some level of activity on the SMB client, so a single small file copy won't trigger it. We wanted to avoid starting Multichannel for every connection from a client, especially if it's just doing a small amount of work. You can query this client setting via "Get-SmbClientConfiguration". You can set it with "Set-SmbClientConfiguration -WindowSizeThreshold n". You can set it to 1, for instance, to get a behavior similar to Windows Server 2012.
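For reference, the client-side setting mentioned above can be inspected and changed like this (setting it to 1 is just an example, to mimic the Windows Server 2012 behavior):

# Check the current SMB client Multichannel trigger threshold
Get-SmbClientConfiguration | Select-Object WindowSizeThreshold

# Lower the threshold so Multichannel engages with fewer IOs in flight (example value)
Set-SmbClientConfiguration -WindowSizeThreshold 1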

Even after SMB Multichannel kicks in, the extra connections might take a few seconds to actually get established. This is because the process involves querying the server for interface information, there's some thinking involved about which paths to use, and SMB does this as a low-priority activity. However, SMB traffic continues to use the initial connection and does not wait for additional connections to be established. Once the extra connections are set up, they won't be torn down even if the activity level drops. If the session ends and is later restarted, though, the whole process will start again from scratch.

You can learn more about SMB Multichannel at http://blogs.technet.com/b/josebda/archive/2012/06/28/the-basics-of-smb-multichannel-a-feature-of-windows-server-2012-and-smb-3-0.aspx

Minimum version of Mellanox firmware required for running SMB Direct in Windows Server 2012


There are two blog posts explaining in great detail what you need to do to use Mellanox ConnectX-2 or ConnectX-3 cards to implement RDMA networking for SMB 3.0 (using SMB Direct). You can find them at:

However, I commonly get questions where the SMB cmdlets report a Mellanox NIC as not being RDMA-capable. Over time, I learned that the most common cause for this is outdated firmware. Windows Server 2012 now comes with an inbox driver for these Mellanox adapters, but it is possible that the firmware on the adapter itself is old. This will cause the NIC to not use RDMA.

To be clear, your Mellanox NIC must have firmware version 2.9.8350 or higher for RDMA to work with SMB Direct. The driver actually checks the firmware version on start up and logs a message if the firmware does not meet this requirement: "The firmware version that is burned on the device <device name> does not support Network Direct functionality. This may affect the File Transfer (SMB) performance. The current firmware version is <current version> while we recommend using firmware version 2.9.8350 or higher. Please burn a newer firmware and restart the Mellanox ConnectX device. For more details about firmware burning process please refer to Support information on http://mellanox.com".

However, since the NIC actually works fine without RDMA (at reduced performance and higher CPU utilization), some administrators might fail to identify this issue. If they are following the steps outlined in the links above, the verification steps will point to the fact that RDMA is actually not being used and the NIC is running only with TCP/IP.
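A quick way to confirm whether RDMA is actually in play, along the lines of the verification steps in the posts referenced above:

# Is the NIC reporting RDMA capability to Windows?
Get-NetAdapterRdma

# Does the SMB client see any RDMA-capable interfaces?
Get-SmbClientNetworkInterface | Where-Object RdmaCapable

# Are the active SMB connections using the expected interfaces?
Get-SmbMultichannelConnection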

The solution is obviously to download the firmware update tools from the Mellanox site and fix it. It will also come with the latest driver version, which is newer than the inbox driver. The direct link to that Mellanox page is http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=32&menu_section=34. You need to select the “Windows Server 2012” tab at the bottom of that page and download the "MLNX WinOF VPI for x64 platforms" package, shown in the picture below.

[Screenshot: MLNX WinOF VPI for x64 platforms download package on the Mellanox site]

Sample PowerShell Scripts for Storage Spaces, standalone Hyper-V over SMB and SQLIO testing


These are some PowerShell snippets to configure a specific set of systems for Hyper-V over SMB testing.
Posting it here mainly for my own reference, but maybe someone else out there is configuring a server with 48 disks split into 6 pools of 8 disks.
These systems do not support SES (SCSI Enclosure Services) so I could not use slot numbers.
This setup includes only two computers: a file server (ES-FS2) and a Hyper-V host (ES-HV1).
Obviously a standalone setup. I'm using those mainly for some SMB Direct performance testing.

Storage Spaces - Create pools

$s = Get-StorageSubSystem -FriendlyName *Spaces*

$d = Get-PhysicalDisk | ? { ";1;2;3;4;5;6;7;8;".Contains(";"+$_.DeviceID+";") }
New-StoragePool -FriendlyName Pool1 -StorageSubSystemFriendlyName $s.FriendlyName -PhysicalDisks $d

$d = Get-PhysicalDisk | ? { ";9;10;11;12;13;14;15;16;".Contains(";"+$_.DeviceID+";") }
New-StoragePool -FriendlyName Pool2 -StorageSubSystemFriendlyName $s.FriendlyName -PhysicalDisks $d

$d = Get-PhysicalDisk | ? { ";17;18;19;20;21;22;23;24;".Contains(";"+$_.DeviceID+";") }
New-StoragePool -FriendlyName Pool3 -StorageSubSystemFriendlyName $s.FriendlyName -PhysicalDisks $d

$d = Get-PhysicalDisk | ? { ";25;26;27;28;29;30;31;32;".Contains(";"+$_.DeviceID+";") }
New-StoragePool -FriendlyName Pool4 -StorageSubSystemFriendlyName $s.FriendlyName -PhysicalDisks $d

$d = Get-PhysicalDisk | ? { ";33;34;35;36;37;38;39;40;".Contains(";"+$_.DeviceID+";") }
New-StoragePool -FriendlyName Pool5 -StorageSubSystemFriendlyName $s.FriendlyName -PhysicalDisks $d

$d = Get-PhysicalDisk | ? { ";41;42;43;44;45;46;47;48;".Contains(";"+$_.DeviceID+";") }
New-StoragePool -FriendlyName Pool6 -StorageSubSystemFriendlyName $s.FriendlyName -PhysicalDisks $d
 

Storage Spaces - Create Spaces (Virtual Disks)

1..6 | % {
Set-ResiliencySetting -Name Mirror -NumberofColumnsDefault 4 -StoragePool  ( Get-StoragePool -FriendlyName Pool$_ )
New-VirtualDisk -FriendlyName Space$_ -StoragePoolFriendlyName Pool$_ -ResiliencySettingName Mirror -UseMaximumSize
}

Initialize disks, partitions and volumes

1..6 | % {
$c = Get-VirtualDisk -FriendlyName Space$_ | Get-Disk
Set-Disk -Number $c.Number -IsReadOnly 0
Set-Disk -Number $c.Number -IsOffline 0
Initialize-Disk -Number $c.Number -PartitionStyle GPT
$L = "EFGHIJ"[$_-1]   # drive letters E: through J:, matching the SQLIO runs below
New-Partition -DiskNumber $c.Number -DriveLetter $L -UseMaximumSize
Initialize-Volume -DriveLetter $L -FileSystem NTFS -Confirm:$false
}

Confirm everything is OK

Get-StoragePool Pool* | sort FriendlyName | % { $_ ; ($_ | Get-PhysicalDisk).Count }
Get-VirtualDisk | Sort FriendlyName
Get-VirtualDisk Space* | % { $_ | FT FriendlyName, Size, OperationalStatus, HealthStatus ; $_ | Get-PhysicalDisk | FT DeviceId, Usage, BusType, Model }
Get-Disk
Get-Volume | Sort DriveLetter

Verify SMB Multichannel configuration

Get-SmbServerNetworkInterface -CimSession ES-FS2
Get-SmbClientNetworkInterface -CimSession ES-HV1 | ? LinkSpeed -gt 1
Get-SmbMultichannelConnection -CimSession ES-HV1

On the local system, create files and run SQLIO

1..6 | % {
   $d = "EFGHIJ"[$_-1]
   $f = $d + ":\testfile.dat"
   fsutil file createnew $f (256GB)
   fsutil file setvaliddata $f (256GB)
}

c:\sqlio\sqlio2.exe -s10 -T100 -t4 -o16 -b512 -BN -LS -fsequential -dEFGHIJ testfile.dat
c:\sqlio\sqlio2.exe -s10 -T100 -t16 -o16 -b8 -BN -LS -frandom -dEFGHIJ testfile.dat

Create the SMB Shares for use by ES-HV1

1..6 | % {
   $p = "EFGHIJ"[$_-1] + ":\"
   $s = "Share"+$_
   New-SmbShare -Name $s -Path $p -FullAccess ES\User, ES\ES-HV1$
   (Get-SmbShare -Name $s).PresetPathAcl | Set-Acl
}

On remote system, map the drives and run SQLIO again:

1..6 | % {
   $l = "EFGHIJ"[$_-1] + ":"
   $r = "\\ES-FS2\Share"+$_
   New-SmbMapping -LocalPath $l -RemotePath $r
}

c:\sqlio\sqlio2.exe -s10 -T100 -t4 -o16 -b512 -BN -LS -fsequential -dEFGHIJ testfile.dat
c:\sqlio\sqlio2.exe -s10 -T100 -t16 -o16 -b32 -BN -LS -frandom -dEFGHIJ testfile.dat

Creating the VM BASE

New-VM -Name VMBASE -VHDPath C:\VMS\BASE.VHDX -Memory 8GB
Start-VM VMBASE
Remove-VM VMBASE

Set up VMs - Option 1 - from a BASE and empty shares

1..6 | % {
   Copy C:\VMS\Base.VHDX \\ES-FS2\Share$_\VM$_.VHDX
   New-VHD -Path \\ES-FS2\Share$_\Data$_.VHDX -Fixed -Size 256GB
   New-VM -Name VM$_ -VHDPath \\ES-FS2\Share$_\VM$_.VHDX -Path \\ES-FS2\Share$_ -Memory 8GB
   Set-VM -Name VM$_ -ProcessorCount 8
   Add-VMHardDiskDrive -VMName VM$_ -Path \\ES-FS2\Share$_\Data$_.VHDX
   Add-VMNetworkAdapter -VMName VM$_ -SwitchName Internal
}

Set up VMS - Option 2 - when files are already in place

1..6 | % {
   New-VM -Name VM$_ -VHDPath \\ES-FS2\Share$_\VM$_.VHDX -Path \\ES-FS2\Share$_ -Memory 8GB
   Set-VM -Name VM$_ -ProcessorCount 8
   Add-VMHardDiskDrive -VMName VM$_ -Path \\ES-FS2\Share$_\Data$_.VHDX
   Add-VMNetworkAdapter -VMName VM$_ -SwitchName Internal
}

Setting up E: data disk inside each VM

Set-Disk -Number 1 -IsReadOnly 0
Set-Disk -Number 1 -IsOffline 0
Initialize-Disk -Number 1 -PartitionStyle GPT
New-Partition -DiskNumber 1 -DriveLetter E -UseMaximumSize
Initialize-Volume -DriveLetter E -FileSystem NTFS -Confirm:$false

fsutil file createnew E:\testfile.dat (250GB)
fsutil file setvaliddata E:\testfile.dat (250GB)

Script to run inside the VMs

PowerShell script sqlioloop.ps1 sits on a shared X: drive (a mapped SMB share) and is run from each VM:

  • Each node is identified by the last byte of its IPv4 address.
  • An empty file named go.go works as a flag to start running the workload on several VMs at once.
  • SQLIO output is saved to a text file on the shared X: drive.
  • Separate batch files run SQLIO itself, so they can be easily tuned even while the script is running on all VMs.

CD X:\SQLIO
$node = (Get-NetIPAddress | ? IPaddress -like 192.168.99.*).IPAddress.Split(".")[3]
while ($true)
{
  if ((dir x:\sqlio\go.go).count -gt 0)
  {
     "Starting large IO..."
     .\sqliolarge.bat >large$node.txt
     "Pausing 10 seconds..."
      start-sleep 10
     "Starting small IO..."
      .\sqliosmall.bat >small$node.txt
     "Pausing 10 seconds..."
      start-sleep 10
   }
   "node "+$node+" is waiting..."
   start-sleep 1
}

PowerShell script above uses file sqliolarge.bat

.\sqlio2.exe -s20 -T100 -t2 -o16 -b512 -BN -LS -fsequential -dE testfile.dat

Also uses sqliosmall.bat

.\sqlio2.exe -s20 -T100 -t4 -o16 -b32 -BN -LS -frandom -dE testfile.dat

 

P.S.: The results obtained from a configuration similar to this one were published at http://blogs.technet.com/b/josebda/archive/2013/02/11/demo-hyper-v-over-smb-at-high-throughput-with-smb-direct-and-smb-multichannel.aspx

Hyper-V over SMB – Sample Configurations


This post describes a few different Hyper-V over SMB sample configurations with increasing levels of availability. Not all configurations are recommended for production deployment, since some do not provide continuous availability. The goal of the post is to show how one can add redundancy, Storage Spaces and Failover Clustering in different ways to provide additional fault tolerance to the configuration.

 

1 – All Standalone

 


 

Hyper-V

  • Standalone, shares used for VHD storage

File Server

  • Standalone, Local Storage

Configuration highlights

  • Flexibility (Migration, shared storage)
  • Simplicity (File Shares, permissions)
  • Low acquisition and operations cost

Configuration lowlights

  • Storage not fault tolerant
  • File server not continuously available
  • Hyper-V VMs not highly available
  • Hardware setup and OS install by IT Pro

 

2 – All Standalone + Storage Spaces

 


 

Hyper-V

  • Standalone, shares used for VHD storage

File Server

  • Standalone, Storage Spaces

Configuration highlights

  • Flexibility (Migration, shared storage)
  • Simplicity (File Shares, permissions)
  • Low acquisition and operations cost
  • Storage is Fault Tolerant

Configuration lowlights

  • File server not continuously available
  • Hyper-V VMs not highly available
  • Hardware setup and OS install by IT Pro

 

3 – Standalone File Server, Clustered Hyper-V

 


 

Hyper-V

  • Clustered, shares used for VHD storage

File Server

  • Standalone, Storage Spaces

Configuration highlights

  • Flexibility (Migration, shared storage)
  • Simplicity (File Shares, permissions)
  • Low acquisition and operations cost
  • Storage is Fault Tolerant
  • Hyper-V VMs are highly available

Configuration lowlights

  • File server not continuously available
  • Hardware setup and OS install by IT Pro

 

4 – Clustered File Server, Standalone Hyper-V

 


 

Hyper-V

  • Standalone, shares used for VHD storage

File Server

  • Clustered, Storage Spaces

Configuration highlights

  • Flexibility (Migration, shared storage)
  • Simplicity (File Shares, permissions)
  • Low acquisition and operations cost
  • Storage is Fault Tolerant
  • File Server is Continuously Available

Configuration lowlights

  • Hyper-V VMs not highly available
  • Hardware setup and OS install by IT Pro

 

5 – All Clustered

 


 

Hyper-V

  • Clustered, shares used for VHD storage

File Server

  • Clustered, Storage Spaces

Configuration highlights

  • Flexibility (Migration, shared storage)
  • Simplicity (File Shares, permissions)
  • Low acquisition and operations cost
  • Storage is Fault Tolerant
  • Hyper-V VMs are highly available
  • File Server is Continuously Available

Configuration lowlights

  • Hardware setup and OS install by IT Pro

 

6 – Cluster-in-a-box

 


 

Hyper-V

  • Clustered, shares used for VHD storage

File Server

  • Cluster-in-a-box

Configuration highlights

  • Flexibility (Migration, shared storage)
  • Simplicity (File Shares, permissions)
  • Low acquisition and operations cost
  • Storage is Fault Tolerant
  • File Server is Continuously Available
  • Hardware and OS pre-configured by the OEM

 

More details

 

You can find additional details on these configurations in this TechNet Radio show: http://channel9.msdn.com/Shows/TechNet+Radio/TechNet-Radio-SMB-30-Deployment-Scenarios

You can also find more information about the Hyper-V over SMB scenario in this TechEd video recording: http://channel9.msdn.com/Events/TechEd/NorthAmerica/2012/VIR306


Hyper-V over SMB – Performance considerations


1. Introduction

 

If you follow this blog, you probably already had a chance to review the “Hyper-V over SMB” overview talk that I delivered at TechEd 2012 and other conferences. I am now working on a new version of that talk that still covers the basics, but adds brand new segments focused on end-to-end performance and detailed sample configurations. This post looks at this new end-to-end performance portion.

 

2. Typical Hyper-V over SMB configuration

 

End-to-end performance starts by drawing an end-to-end configuration. The diagram below shows a typical Hyper-V over SMB configuration including:

  • Clients that access virtual machines
  • Nodes in a Hyper-V Cluster
  • Nodes in a File Server Cluster
  • SAS JBODs acting as shared storage for the File Server Cluster

 

[Diagram: Typical Hyper-V over SMB configuration]

 

The main highlights of the diagram above include the redundancy in all layers and the different types of network connecting the layers.

 

3. Performance considerations

 

With the above configuration in mind, you can then start to consider the many different options at each layer that can affect the end-to-end performance of the solution. The diagram below highlights a few of the items, in the different layers, that would have a significant impact.

 

[Diagram: Performance-related items at each layer]

 

These items include, in each layer:

  • Clients
    • Number of clients
    • Speed of the client NICs
  • Virtual Machines
    • VMs per host
    • Virtual processors and RAM per VM
  • Hyper-V Hosts
    • Number of Hyper-V hosts
    • Cores and RAM per Hyper-V host
    • NICs per Hyper-V host (connecting to clients) and the speed of those NICs
    • RDMA NICs (R-NICs) per Hyper-V host (connecting to file servers) and the speed of those NICs
  • File Servers
    • Number of File Servers (typically 2)
    • RAM per File Server, plus how much is used for CSV caching
    • Storage Spaces configuration, including number of spaces, resiliency settings and number of columns per space
    • RDMA NICs (R-NICs) per File Server (connecting to Hyper-V hosts) and the speed of those NICs
    • SAS HBAs per File Server (connecting to the JBODs) and speed of those HBAs
  • JBODs
    • SAS ports per module and the speed of those ports
    • Disks per JBOD, plus the speed of the disks and of their SAS connections

It’s also important to note that the goal is not to achieve the highest performance possible, but to find a balanced configuration that delivers the performance required by the workload at the best possible cost.

 

4. Sample configuration

 

To make things a bit more concrete, you can look at a sample VDI workload. Suppose you need to create a solution to host 500 VDI VMs. Here is an outline of the thought process you would have to go through:

  • Workload, disks, JBODs, hosts
    • Start with the stated workload requirements: 500 VDI VMs, 2GB RAM, 1 virtual processor, ~50GB per VM, ~30 IOPS per VM, ~64KB per IO
    • Next, select the type of disk to use: 900 GB HDD at 10,000 rpm, around 140 IOPS
    • Also, the type of JBOD to use: SAS JBOD with dual SAS modules, two 4-lane 6Gbps port per module, up to 60 disks per JBOD
    • Finally, this is the agreed upon spec for the Hyper-V host: 16 cores, 128GB RAM
  • Storage
    • Number of disks required based on IOPS: 30 * 500 /140 = ~107 disks
    • Number of disks required based on capacity: 50GB * 2 * 500 / 900 = ~56 disks.
    • Some additional capacity is required for snapshots and backups.
    • It seems like we need 107 disks, since the number required for IOPS fulfills both the IOPS and capacity requirements
    • We can then conclude we need 2 JBODs with 60 disks each (that would give us 120 disks, including some spares)
  • Hyper-V hosts
    • 128GB RAM / 2GB per VM = up to 64 VMs, so ~50 VMs per host leaves some RAM for the host
    • 50 VMs * 1 virtual processor / 16 cores = ~3:1 ratio between virtual and physical processors
    • 500 VMs / 50 VMs per host = ~10 hosts – we could use 11 hosts, filling all the requirements plus one as a spare
  • Networking
    • 500 VMs * 30 IOPS * 64KB = ~937 MBps required – this works well with a single 10GbE, which can deliver ~1100 MBps. Use 2 for fault tolerance.
    • A single 4-lane SAS connection at 6Gbps delivers ~2200 MBps. Use 2 for fault tolerance. You could actually use 3Gbps SAS HBAs here if you wanted.
  • File Server
    • 500 VMs * 30 IOPS = 15,000 IOPS. A single file server can deliver that without any problem. Use 2 for fault tolerance.
    • RAM = 64GB, good size that allows for some CSV caching (up to 20% of RAM)
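As a sanity check, the arithmetic above is simple enough to script. Here is a sketch using the same assumed inputs (change them to match your own workload; the 50 VMs per host figure deliberately leaves RAM headroom for the parent partition):

# Assumed workload inputs from the example above
$vms = 500; $iopsPerVM = 30; $ioSizeKB = 64
$vmCapacityGB = 50; $diskIops = 140; $diskCapacityGB = 900
$vmsPerHost = 50   # ~64 fit by RAM alone (128GB / 2GB), 50 leaves headroom for the host

# Storage: disks needed by IOPS vs. capacity (mirroring doubles the capacity requirement)
$disksByIops     = [math]::Round($vms * $iopsPerVM / $diskIops)               # ~107
$disksByCapacity = [math]::Round($vms * $vmCapacityGB * 2 / $diskCapacityGB)  # ~56

# Hosts and required storage throughput
$hosts = [math]::Ceiling($vms / $vmsPerHost)             # 10 (plus one as a spare)
$throughputMBps = $vms * $iopsPerVM * $ioSizeKB / 1024   # ~937 MB/sec

"$disksByIops disks (IOPS), $disksByCapacity disks (capacity), $hosts hosts, $throughputMBps MB/sec"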

Please note that this is simply an example, since your specific workload requirements may vary. There’s no general industry agreement on exactly what a VDI workload looks like, which kind of disks should be used with it or how much RAM would work best for the Hyper-V hosts in this scenario. So, take this example with a grain of salt :-)

Obviously you could have decided to go with a different type of disk, JBOD or host. In general, higher-end equipment will handle more load, but will be more expensive. For disks, deciding factors will include price, performance, capacity and endurance. Comparing SSDs and HDDs, for instance, is an interesting exercise and that equation changes constantly as new models become available and prices fluctuate. You might need to repeat the above exercise a few times with different options to find the ideal solution for your specific workload. You might want to calculate your cost per VM for each specific iteration.

Assuming you did all that and liked the results, let’s draw it out:

 

[Diagram: Sample configuration for 500 VDI VMs]

 

Now it’s up to you to work out the specific details of your own workload and hardware options.

 

5. Configuration Variations

 

It’s also important to notice that there are several potential configuration variations for the Hyper-V over SMB scenario, including:

  • Using regular Ethernet NICs instead of RDMA NICs between the Hyper-V hosts and the File Servers
  • Using a third-party SMB 3.0 NAS instead of a Windows File Server
  • Using Fibre Channel or iSCSI instead of SAS, along with a traditional SAN instead of JBODs and Storage Spaces

 

6. Speeds and feeds

 

In order to make some of the calculations, you might need to understand the maximum theoretical throughput of the interfaces involved. For instance, it helps to know that a 10GbE NIC cannot deliver more than 1.1 GBytes per second, or that a single SAS HBA sitting on an 8-lane PCIe Gen2 slot cannot deliver more than 3.4 GBytes per second. Here are some tables to help out with that portion:
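The rule of thumb behind those numbers is just a bits-to-bytes conversion minus protocol overhead. A quick illustration (the roughly 12% overhead figure is an assumption for the example, not an official number):

# 10GbE: 10 Gbits/sec converted to GBytes/sec, then reduced by rough protocol overhead
$rawGBps    = 10e9 / 8 / 1e9      # 1.25 GB/sec raw
$usableGBps = $rawGBps * 0.88     # ~1.1 GB/sec usable, matching the table below
$usableGBps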

 

NIC Throughput
  • 1Gb Ethernet: ~0.1 GB/sec
  • 10Gb Ethernet: ~1.1 GB/sec
  • 40Gb Ethernet: ~4.5 GB/sec
  • 32Gb InfiniBand (QDR): ~3.8 GB/sec
  • 56Gb InfiniBand (FDR): ~6.5 GB/sec

 

HBA Throughput
  • 3Gb SAS x4: ~1.1 GB/sec
  • 6Gb SAS x4: ~2.2 GB/sec
  • 4Gb FC: ~0.4 GB/sec
  • 8Gb FC: ~0.8 GB/sec
  • 16Gb FC: ~1.5 GB/sec

 

Bus Slot Throughput
  • PCIe Gen2 x4: ~1.7 GB/sec
  • PCIe Gen2 x8: ~3.4 GB/sec
  • PCIe Gen2 x16: ~6.8 GB/sec
  • PCIe Gen3 x4: ~3.3 GB/sec
  • PCIe Gen3 x8: ~6.7 GB/sec
  • PCIe Gen3 x16: ~13.5 GB/sec

 

Intel QPI Throughput
  • 4.8 GT/s: ~9.8 GB/sec
  • 5.86 GT/s: ~12.0 GB/sec
  • 6.4 GT/s: ~13.0 GB/sec
  • 7.2 GT/s: ~14.7 GB/sec
  • 8.0 GT/s: ~16.4 GB/sec

 

Memory Throughput
  • DDR2-400 (PC2-3200): ~3.4 GB/sec
  • DDR2-667 (PC2-5300): ~5.7 GB/sec
  • DDR2-1066 (PC2-8500): ~9.1 GB/sec
  • DDR3-800 (PC3-6400): ~6.8 GB/sec
  • DDR3-1333 (PC3-10600): ~11.4 GB/sec
  • DDR3-1600 (PC3-12800): ~13.7 GB/sec
  • DDR3-2133 (PC3-17000): ~18.3 GB/sec

 

Also, here is some fine print on those tables:

  • Only a few common configurations listed.
  • All numbers are rough approximations.
  • Actual throughput in real life will be lower than these theoretical maximums.
  • Numbers provided are for one way traffic only (you should double for full duplex).
  • Numbers are for one interface and one port only.
  • Numbers use base 10 (1 GB/sec = 1,000,000,000 bytes per second)

 

7. Conclusion

 

I’m still working out the details of this new Hyper-V over SMB presentation, but this post summarizes the portion related to end-to-end performance.

I plan to deliver this talk to an internal Microsoft audience this week and also during the MVP Summit later this month. I am also considering submissions for MMS 2013 and TechEd 2013.

 

P.S.: You can get a preview of this portion of the talk by watching this recent TechNet Radio show I recorded with Bob Hunt: Hyper-V over SMB 3.0 Performance Considerations.

P.P.S: This content was delivered as part of my TechEd 2013 North America talk: Understanding the Hyper-V over SMB Scenario, Configurations, and End-to-End Performance

Hardware options for highly available Windows Server 2012 systems using shared, directly-attached storage


Highly available Windows Server 2012 systems using shared, directly-attached storage can be built using either Storage Spaces or a validated clustered RAID controller.

 

Option 1 – Storage Spaces

You can build a highly available shared SAS system today using Storage Spaces.

Storage Spaces works well in a standalone PC, but it is also capable of working in a Windows Server Failover Clustering environment. 

For implementing Clustered Storage Spaces, you will need the following Windows Server 2012 certified hardware:

  • Any SAS Host Bus Adapter or HBA (as long as it’s SAS and not a RAID controller, you should be fine)
  • SAS JBODs or disk enclosures (listed under the “Storage Spaces” category on the Server catalog)
  • SAS disks (there’s a wide variety of those, including capacity HDDs, performance HDDs and SSDs)
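Before following the configuration instructions linked below, a quick check that the hardware meets the requirements above can be run on any of the prospective cluster nodes (read-only cmdlets; output will vary with your hardware):

# Disks should show BusType SAS and CanPool True to be usable by Clustered Storage Spaces
Get-PhysicalDisk | Sort-Object FriendlyName | Format-Table FriendlyName, BusType, CanPool, Size, MediaType -AutoSize

# The Storage Spaces subsystem should be visible
Get-StorageSubSystem -FriendlyName *Spaces*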

You can find instructions on how to configure a Clustered Storage Space in Windows Server 2012 at http://blogs.msdn.com/b/clustering/archive/2012/06/02/10314262.aspx.

A good overview of Storage Spaces and its capabilities can be found at http://social.technet.microsoft.com/wiki/contents/articles/15198.storage-spaces-overview.aspx

There's also an excellent presentation from TechEd that covers Storage Spaces at http://channel9.msdn.com/Events/TechEd/NorthAmerica/2012/WSV315

 

Option 2 – Clustered RAID Controllers

The second option is to build a highly available shared storage system using RAID Controllers that are designed to work in a Windows Server Failover Cluster configuration.

The main distinction between these RAID controllers and the ones we used before is that they work in sets (typically a pair) and coordinate their actions against the shared disks.

Here are some examples:

  • The HP StoreEasy 5000 cluster-in-a-box uses Clustered RAID controllers that HP sources and certifies. You can find details at the HP StoreEasy product page.
  • LSI is working on a Clustered RAID controller with Windows Server 2012 support. This new line of SAS RAID Controllers is scheduled for later this year. You can get details on availability dates from LSI.

 

Both options work great for all kinds of Windows Server 2012 Clusters, including Hyper-V Clusters, SQL Server Clusters, Classic File Server Clusters and Scale-Out File Servers.

You can learn more about these solutions in this TechEd presentation: http://channel9.msdn.com/Events/TechEd/Europe/2012/WSV310

Increasing Availability – The REAP Principles (Redundancy, Entanglement, Awareness and Persistence)


Introduction

 

Increasing availability is a key concern with computer systems. With all the consolidation and virtualization efforts under way, you need to make sure your services are always up and running, even when some components fail. However, it’s usually hard to understand the details of what it takes to make systems highly available (or continuously available). And there are so many options…

In this blog post, I will describe four principles that cover the different requirements for Availability: Redundancy, Entanglement, Awareness and Persistence. They apply to different types of services and I’ll provide some examples related to the most common server roles, including DHCP, DNS, Active Directory, Hyper-V, IIS, Remote Desktop Services, SQL Server, Exchange Server, and obviously File Services (I am in the “File Server and Clustering” team, after all). Every service employs different strategies to implement these “REAP Principles” but they all must implement them in some fashion to increase availability.

Note: A certain familiarity with common Windows Server roles and services is assumed here. If you are not familiar with the meaning of DHCP, DNS or Active Directory, this post is not intended for you. If that’s the case, you might want to do some reading on those topics before moving forward here.

 

Redundancy – There is more than one of everything

 

Availability starts with redundancy. In order to provide the ability to survive failures, you must have multiple instances of everything that can possibly fail in that system. That means multiple servers, multiple networks, multiple power supplies, multiple storage devices. You should be seeing everything (at least) doubled in your configuration. Whatever is not redundant is commonly labeled a “Single Point of Failure”.

Redundancy is not cheap, though. By definition, it will increase the cost of your infrastructure. So it’s an investment that can only be justified when there is understanding of the risks and needs associated with service disruption, which should be balanced with the cost of higher availability. Sadly, that understanding sometimes only comes after a catastrophic event (such as data loss or an extended outage).

Ideally, you would have a redundant instance that is as capable as your primary one. That would make your system work as well after the failure as it did before. It might be acceptable, though, to have a redundant component that is less capable. In that case, you’ll be in a degraded (although functional) state after a failure, while the original part is being replaced. Also keep in mind that, these days, redundancy in the cloud might be a viable option.

For this principle, there’s really not much variance per type of Windows Server role. You basically need to make sure that you have multiple servers providing the service, and make sure the other principles are applied.

 

Entanglement – Achieving shared state via spooky action at a distance


Having redundant equipment is required but certainly not sufficient to provide increased availability. Once any meaningful computer system is up and running, it is constantly gathering information and keeping track of it. If you have multiple instances running, they must be “entangled” somehow. That means that the current state of the system should be shared across the multiple instances so it can survive the loss of any individual component without losing that state. It will typically include some complex “spooky action at a distance”, as Einstein famously said of Quantum Mechanics.

A common way to do it is using a database (like SQL Server) to store your state. Every transaction performed by a set of web servers, for instance, could be stored in a common database and any web server can be quickly reprovisioned and connected to the database again. In a similar fashion, you can use Active Directory as a data store, as it’s done by services like DFS Namespaces and Exchange Server (for user mailbox information). Even a file server could serve a similar purpose, providing a location to store files that can be changed at any time and accessed by a set of web servers. If you lose a web server, you can quickly reprovision it and point it to the shared file server.

If using SQL Server to store the shared state, you must also abide by the Redundancy principle by using multiple SQL Servers, which must be entangled as well. One common way to do it is using shared storage. You can wire these servers to a Fibre Channel SAN or an iSCSI SAN or even a file server to store the data. Failover clustering in Windows Server (used by certain deployments of Hyper-V, File Servers and SQL Server, just to name a few) leverages shared storage as a common mechanism for entanglement.

Peeling the onion further, you will need multiple heads of those storage systems and they must also be entangled. Redundancy at the storage layer is commonly achieved by sharing physical disks and writing the data to multiple places. Most SANs have the option of using dual controllers that are connected to a shared set of disks. Every piece of data is stored synchronously to at least two disks (sometimes more). These SANs can tolerate the failure of individual controllers or disks, preserving their shared state without any disruption. In Windows Server 2012, Clustered Storage Spaces provides a simple solution for shared storage for a set of Windows Servers using only Shared SAS disks, without the need for a SAN.

There are other strategies for Entanglement that do not require shared storage, depending on how much and how frequently the state changes. If you have a web site with only static files, you could maintain shared state by simply provisioning multiple IIS servers with the exact same files. Whenever you lose one, simply replace it. For instance, Windows Azure and Virtual Machine Manager provide mechanisms to quickly add/remove instances of web servers in this fashion through the use of a service template.

If the shared state changes, which is often the case for most web sites, you could go up a notch by regularly copying updated files to the servers. You could have a central location with the current version of the shared state (a remote file server, for instance) plus a process to regularly send full updates to any of the nodes every day (either pushed from the central store or pulled by the servers). This is not very efficient for large amounts of data updated frequently, but could be enough if the total amount of data is small or it changes very infrequently. Examples of this strategy include SQL Server Snapshot Replication, DNS full zone transfers or a simple script using ROBOCOPY to copy files on a daily schedule.

In most cases, however, it’s best to employ a mechanism that can cope with more frequently changing state. Going up the scale you could have a system that sends data to its peers every hour or every few minutes, being careful to send only the data that has changed instead of the full set. That is the case for DNS incremental zone transfers, Active Directory Replication, many types of SQL Server Replication, SQL Server Log Shipping, Asynchronous SQL Server Mirroring (High-Performance Mode), SQL Server AlwaysOn Availability Groups (asynchronous-commit mode), DFS Replication and Hyper-V Replica. These models provide systems that are loosely converging, but do not achieve up-to-the-second coherent shared state. However, that is good enough for some scenarios.

At the high end of replication and right before actual shared storage, you have synchronous replication. This provides the ability to update the information on every entangled system before considering the shared state actually changed. This might slow down the overall performance of the system, especially when the connectivity between the peers suffers from latency. However, there’s something to be said of just having a set of nodes with local storage that achieve a coherent shared state using only software. Common examples here include a few types of SAN replication, Exchange Server (Database Availability Groups), Synchronous SQL Mirroring (High Safety Mode) and SQL Server AlwaysOn Availability Groups (synchronous-commit mode).

As you can see, the Entanglement principle can be addressed in a number of different ways depending on the service. Many services, like File Server and SQL Server, provide multiple mechanisms to deal with it, with varying degrees of cost, complexity, performance and coherence.

 

Awareness – Telling if Schrödinger's servers are alive or not


Your work is not done after you have a redundant entangled system. In order to provide clients with seamless access to your service, you must implement some method to find one of the many sources for the service. The Awareness principle refers to how your clients will discover the location of the access points for your service, ideally with a mechanism to do it quickly while avoiding any failed instances. There are a few different ways to achieve it, including manual configuration, broadcast, DNS, load balancers, or a service-specific method.

One simple method is to statically configure each client with the name or IP address of two or more instances of the service. This method is effective if the configuration of the service is not expected to change. If it ever does change, you would need to reconfigure each client. A common example here is how static DNS is configured: you simply specify the IP address of your preferred DNS server and also the IP address of an alternate DNS server in case the preferred one fails.

Another common mechanism is to broadcast a request for the service and wait for a response. This mechanism works only if there’s someone in your local network capable of providing an answer. There’s also a concern about the legitimacy of the response, since a rogue system on the network might be used to provide a malicious version of the service. Common examples here include DHCP service requests and Wireless Access Point discovery. It is fairly common to use one service to provide awareness for others. For instance, once you access your Wireless Access Point, you get DHCP service. Once you get DHCP service, you get your DNS configuration from it.

As you know, the most common use for a DNS server is to map a network name to an IP address (using an A, AAAA or CNAME DNS record). That in itself implements a certain level of this awareness principle. DNS can also associate multiple IP addresses with a single name, effectively providing a mechanism to give you a list of servers that provide a specific service. That list is provided by the DNS server in a round robin fashion, so it even includes a certain level of load balancing as part of it. Clients looking for Web Servers and File Servers commonly use this mechanism alone for finding the many devices providing a service.
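As a small illustration of that round-robin behavior, registering the same name twice with different addresses on a Windows DNS server looks like the sketch below (the zone name, host name and addresses are made up for the example):

# Two A records for the same name: the DNS server will rotate the order of the answers
Add-DnsServerResourceRecordA -ZoneName "contoso.com" -Name "files" -IPv4Address 192.168.1.21
Add-DnsServerResourceRecordA -ZoneName "contoso.com" -Name "files" -IPv4Address 192.168.1.22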

DNS also provides a different type of record specifically designed for providing service awareness. This is implemented as SRV (Service) records, which not only offer the name and IP address of a host providing a service, but can decorate it with information about priority, weight and port number where the service is provided. This is a simple but remarkably effective way to provide service awareness through DNS, which is effectively a mandatory infrastructure service these days. Active Directory is the best example of using SRV records, using DNS to allow clients to learn information about the location of Domain Controllers and services provided by them, including details about Active Directory site topology.

Windows Server failover clustering includes the ability to perform dynamic DNS registrations when creating clustered services. Each cluster role (formerly known as a cluster group) can include a Network Name resource which is registered with DNS when the service is started. Multiple IP addresses can be registered for a given cluster Network Name if the server has multiple interfaces. In Windows Server 2012, a single cluster role can be active on multiple nodes (that’s the case of a Scale-Out File Server) and the new Distributed Network Name implements this as a DNS name with multiple IP addresses (at least one from each node).

DNS does have a few limitations. The main one is the fact that the clients will cache the name/IP information for some time, as specified in the TTL (time to live) for the record. If the service is reconfigured and new addresses or service records are published, DNS clients might take some time to become aware of the change. You can reduce the TTL, but that has a performance impact, causing DNS clients to query the server more frequently. There is no mechanism in DNS to have a server proactively tell a client that a published record has changed. Another issue with DNS is that it provides no method to tell if the service is actually being provided at the moment or even if the server ever functioned properly. It is up to the client to attempt communication and handle failures. Last but not least, DNS cannot help with intelligently balancing clients based on the current load of a server.

Load balancers are the next step in providing awareness. These are network devices that function as an intelligent router of traffic based on a set of rules. If you point your clients to the IP address of the load balancer, that device can intelligently forward the requests to a set of servers. As the name implies, load balancers typically distribute the clients across the servers and can even detect if a certain server is unresponsive, dynamically taking it out of the list. Another concern here is affinity, which is an optimization that consistently forwards a given client to the same server. Since these devices can become a single point of failure, the redundancy principle must be applied here. The most common solution is to have two load balancers in combination with two records in DNS.

SQL Server again uses multiple mechanisms for implementing this principle. DNS name resolution is common, both statically or dynamically using failover clustering Network Name resources. That name is then used as part of the client configuration known as a “Connection String”. Typically, this string will provide the name of a single server providing the SQL Service, along with the database name and authentication details. For instance, a typical connection string would be: "Server=SQLSERV1A; Database=DB301; Integrated Security=True;". For SQL Mirroring, there is a mechanism to provide a second server name in the connection string itself. Here’s an example: "Server=SQLSERV1A; Failover_Partner=SQLSRV1B; Database=DB301; Integrated Security=True;".

Other services provide a specific layer of Awareness, implementing a broker service or client access layer. This is the case of DFS (Distributed File System), which simplifies access to multiple file servers using a unified namespace mechanism. In a similar way, SharePoint web front end servers will abstract the fact that multiple content databases live behind a specific SharePoint farm or site collection. SharePoint Server 2013 goes one step further by implementing a Request Manager service that can even be configured as a Web Server farm placed in front of the main SharePoint web front end farm, with the purpose of routing and throttling incoming requests to improve both performance and availability.

Exchange Server Client Access Servers will query Active Directory to find which Mailbox Server or Database Availability Group (DAG) contains the mailbox for an incoming client. Remote Desktop Connection Broker (formerly known as Terminal Services Session Broker) is used to provide users with access to Remote Desktop services across a set of servers. All these broker services can typically handle a fair amount of load balancing and be aware of the state of the services behind them. Since these can become single points of failure, they are typically placed behind DNS round robin and/or load balancers.

 

Persistence – The one that is the most adaptable to change will survive

 image

Now that you have redundant entangled services and clients are aware of them, here comes the greatest challenge in availability: persisting the service in the event of a failure. There are three basic steps to make it happen: server failure detection, failing over to a surviving server (if required) and client reconnection (if required).

Detecting the failure is the first step. It requires a mechanism for aliveness checks, which can be performed by the servers themselves, by a witness service, by the clients accessing the services or a combination of these. For instance, Windows Server failover clustering makes cluster nodes check each other (through network checks), in an effort to determine when a node becomes unresponsive.
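
For example, on a Windows Server failover cluster you can see the outcome of those aliveness checks by listing the state of each node. This is a minimal sketch using the FailoverClusters PowerShell module, run from any cluster node:

# Shows which nodes the cluster currently considers Up, Down or Paused
Get-ClusterNode | Format-Table Name, State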

Once a failure is detected, for services that work in an active/passive fashion (only one server provides the service and the other remains on standby), a failover is required. This can only be safely achieved automatically if the entanglement is done via Shared Storage or Synchronous Replication, which means that the data from the server that is lost is properly persisted. If using other entanglement methods (like backups or asynchronous replication), an IT Administrator typically has to manually intervene to make sure the proper state is restored before failing over the service. For all active/active solutions, with multiple servers providing the same service all the time, a failover is not required.

Finally, the client might need to reconnect to the service. If the server being used by the client has failed, many services will lose their connections and require intervention. In an ideal scenario, the client will automatically detect (or be notified of) the server failure. Then, because it is aware of other instances of the service, it will automatically connect to a surviving instance, restoring the exact same client state before the failure. This is how Windows Server 2012 implements failover of file servers through a process called SMB 3.0 Continuous Availability, available for both Classic and Scale-Out file server clusters. The file server cluster goes one step further, providing a Witness Service that will proactively notify SMB 3.0 clients of a server failure and point them to an alternate server, even before current pending requests to the failed server time out.
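
If you want to peek at the Witness Service in action, you can run the cmdlet below on a node of the file server cluster. This is just a quick sketch, assuming the file server cluster role is online and at least one SMB 3.0 client is connected:

# Lists SMB clients registered for witness notifications on this file server cluster
Get-SmbWitnessClient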

File servers might also leverage a combination of DFS Namespaces and DFS Replication that will automatically recover from a failed server situation, with some potential side effects. While the file client will find an alternative file server via DFS Namespaces, the connection state will be lost and need to be reestablished. Another persistence mechanism in the file server is the Offline Files option (also known as Client Side Caching) commonly used with the Folder Redirection feature. This allows you to keep working on local storage while your file server is unavailable, synchronizing again when the server comes back.

For other services, like SQL Server, the client will surface an error to the application indicating that a failover has occurred and the connection has been lost. If the application is properly coded to handle that situation, the end user will be shielded from error messages because the application will simply reconnect to the SQL Server using either the same name (in the case of another server taking over that name) or a Failover Partner name (in case of SQL Server Mirroring) or another instance of SQL Server (in case of more complex log shipping or replication scenarios).

Clients of Web Servers and other load balanced workloads without any persistent state might be able to simply retry an operation in case of a failure. This might happen automatically or require the end user to retry the operation manually. This might also be the case of a web front end layer that communicates with a web services layer. Again, a savvy programmer could code that front end server to automatically retry web service requests, as long as they are idempotent.

Another interesting example of client persistence is provided by an Outlook client connecting to an Exchange Server. As we mentioned, Exchange Servers implement a sophisticated method of synchronous replication of mailbox databases between servers, plus a Client Access layer that brokers connections to the right set of mailbox servers. On top of that, the Outlook client will simply continue to work from its cache (using only local storage) if for any reason the server becomes unavailable. Whenever the server comes back online, the client will transparently reconnect and synchronize. The entire process is automated, without any action required during or after the failure from either end users or IT Administrators.

 

Samples of how services implement the REAP principles

 

Now that you have the principles down, let’s look at how the main services we mentioned implement them.

Service | Redundancy | Entanglement | Awareness | Persistence
--- | --- | --- | --- | ---
DHCP, using split scopes | Multiple standalone DHCP Servers | Each server uses its own set of scopes, no replication | Active/Active. Clients find DHCP servers via broadcast (whichever responds first) | DHCP responses are cached. Upon failure, only surviving servers will respond to the broadcast
DHCP, using failover cluster | Multiple DHCP Servers in a failover cluster | Shared block storage (FC, iSCSI, SAS) | Active/Passive. Clients find DHCP servers via broadcast | DHCP responses are cached. Upon failure, failover occurs and a new server responds to broadcasts
DNS, using zone transfers | Multiple standalone DNS Servers | Zone Transfers between DNS Servers at regular intervals | Active/Active. Clients configured with IP addresses of Primary and Alternate servers (static or via DHCP) | DNS responses are cached. If query to primary DNS server fails, alternate DNS server is used
DNS, using Active Directory integration | Multiple DNS Servers in a Domain | Active Directory Replication | Active/Active. Clients configured with IP addresses of Primary and Alternate servers (static or via DHCP) | DNS responses are cached. If query to primary DNS server fails, alternate DNS server is used
Active Directory | Multiple Domain Controllers in a Domain | Active Directory Replication | Active/Active. DC Locator service finds closest Domain Controller using DNS service records | Upon failure, DC Locator service finds a new Domain Controller
File Server, using DFS (Distributed File System) | Multiple file servers, linked through DFS. Multiple DFS servers. | DFS Replication maintains file server data consistency. DFS Namespace links stored in Active Directory. | Active/Active. DFS Namespace used to translate namespace targets into closest file server | Upon failure of file server, client uses alternate file server target. Upon DFS Namespace failure, alternate is used
File Server for general use, using failover cluster | Multiple File Servers in a failover cluster | Shared Storage (FC, iSCSI, SAS) | Active/Passive. Name and IP address resources, published to DNS | Failover, SMB Continuous Availability, Witness Service
File Server, using Scale-Out Cluster | Multiple File Servers in a failover cluster | Shared Storage, Cluster Shared Volume (FC, iSCSI, SAS) | Active/Active. Name resource published to DNS (Distributed Network Name) | SMB Continuous Availability, Witness Service
Web Server, static content | Multiple Web Servers | Initial copy only | Active/Active. DNS round robin, load balancer or combination | Client retry
Web Server, file server back-end | Multiple Web Servers | Shared File Server back end | Active/Active. DNS round robin, load balancer or combination | Client retry
Web Server, SQL Server back-end | Multiple Web Servers | SQL Server database | Active/Active. DNS round robin, load balancer or combination | Client retry
Hyper-V, failover cluster | Multiple servers in a cluster | Shared Storage (FC, iSCSI, SAS, SMB File Share) | Active/Passive. Clients connect to IP exposed by the VM | VM restarted upon failure
Hyper-V, Replica | Multiple servers | Replication, per VM | Active/Passive. Clients connect to IP exposed by the VM | Manual failover (test option available)
SQL Server, Replication | Multiple servers | Replication, per database (several methods) | Active/Active. Clients connect by server name | Application may detect failures and switch servers
SQL Server, Log Shipping | Multiple servers | Log shipping, per database | Active/Passive. Clients connect by server name | Manual failover
SQL Server, Mirroring | Multiple servers, optional witness | Mirroring, per database | Active/Passive. Failover Partner specified in connection string | Automatic failover if synchronous, with witness. Application needs to reconnect
SQL Server, AlwaysOn Failover Cluster Instances | Multiple servers in a cluster | Shared Storage (FC, iSCSI, SAS, SMB File Share) | Active/Passive. Name and IP address resources, published to DNS | Automatic failover. Application needs to reconnect
SQL Server, AlwaysOn Availability Groups | Multiple servers in a cluster | Mirroring, per availability group | Active/Passive. Availability Group listener with a Name and IP address, published to DNS | Automatic failover if using synchronous-commit mode. Application needs to reconnect
SharePoint Server (web front end) | Multiple Servers | SQL Server storage | Active/Active. DNS round robin, load balancer or combination | Client retry
SharePoint Server (request manager) | Multiple Servers | SQL Server storage | Active/Active. Request manager combined with a load balancer | Client retry
Exchange Server (DAG) with Outlook | Multiple Servers in a Cluster | Database Availability Groups (Synchronous Replication) | Active/Active. Client Access Point (uses AD for Mailbox/DAG information). Names published to DNS | Outlook client goes into cached mode, reconnects

 

Conclusion

 

I hope this post helped you understand the principles behind increasing server availability.

As a final note, please take into consideration that not all services require the highest possible level of availability. This might be an easier decision for certain services like DHCP, DNS and Active Directory, where the additional cost is relatively small and the benefits are sizable. You might want to think twice when increasing the availability of a large backup server, where some hours of down time might be acceptable and the cost of duplicating the infrastructure is significantly higher.

Depending on how much availability your service level agreement states, you might need different types of solutions. We generally measure availability in “nines”, as described in the table below:

Nines | % Availability | Downtime per year | Downtime per week
--- | --- | --- | ---
1 | 90% | ~ 36 days | ~ 16 hours
2 | 99% | ~ 3.6 days | ~ 90 minutes
3 | 99.9% | ~ 8 hours | ~ 10 minutes
4 | 99.99% | ~ 52 minutes | ~ 1 minute
5 | 99.999% | ~ 5 minutes | ~ 6 seconds
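
If you want to translate an availability target into allowed downtime yourself, the math is simple. Here is a minimal sketch in PowerShell, using the “three nines” target as an example:

# Allowed downtime for a 99.9% availability target
$availability = 0.999
"{0:N1} hours of downtime per year" -f ((1 - $availability) * 365.25 * 24)
"{0:N1} minutes of downtime per week" -f ((1 - $availability) * 7 * 24 * 60)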

You should consider your overall requirements and the related infrastructure investments that would give you the most “nines” per dollar.

SQLIO, PowerShell and storage performance: measuring IOPs, throughput and latency for both local disks and SMB file shares


1. Introduction

 

I have been doing storage-related demos and publishing blogs with some storage performance numbers for a while, and I commonly get questions such as “How do you run these tests?” or “What tools do you use to generate IOs for your demos?”. While it’s always best to use a real workload to test storage, sometimes that is not convenient. So, I very frequently use a free tool from Microsoft to simulate IOs called SQLIO. It’s a small and simple tool that simulates several types of workloads, including common SQL Server ones. And you can apply it to several configurations, from a physical host or virtual machine, using a local disk, a LUN on a SAN, a Storage Space or an SMB file share.

 

2. Download the tool

 

To get started, you need to download and install the SQLIO tool. You can get it from http://www.microsoft.com/en-us/download/details.aspx?id=20163. The setup will install the tool in a folder of your choice. In the end, you really only need one file: SQLIO.EXE. You can copy it to any folder and it runs in pretty much every Windows version, client or server. In this blog post, I assume that you installed SQLIO on the C:\SQLIO folder.

 

3. Prepare a test file

 

Next, you need to create a file in the disk or file share that you will be using for your demo or test.

Ideally, you should create a file as big as possible, so that you can exercise the entire disk. For hard disks, creating a small file causes the head movement to be restricted to a portion of the disk. Unless you’re willing to use only a fraction of the hard disk capacity in production, these numbers will show unrealistically high random IO performance. Storage professionals call this technique “short stroking”. For SANs, small files might end up being entirely cached in the controller RAM, again giving you great numbers that won’t hold true for real deployments. You can actually use SQLIO to measure the difference between using a large file and a small file for your specific configuration.

To create a large file for your test, the easiest way is using the FSUTIL.EXE tool, which is included with all versions of Windows.

For instance, to create a 1TB file on the X: drive, use the following commands from a PowerShell prompt:

FSUTIL.EXE file createnew X:\testfile.dat (1TB)
FSUTIL.EXE file setvaliddata X:\testfile.dat (1TB)

Note 1: You must do this from PowerShell, in order to use the convenient (1TB) notation. If you run this from an old command prompt, you need to calculate 1 terabyte in bytes, which is 1099511627776 (2^40). Before you Storage professionals rush to correct me, I know this is technically incorrect. One terabyte is 10^12 (1000000000000) and 2^40 is actually one Tebibyte (1TiB). However, since both PowerShell and SQLIO use the TB/GB/MB/KB notation when referring to powers of 2, I will ask you to give me a pass here.

Note 2: The “set valid data” command lets you move the “end of file” marker, avoiding a lengthy initialization of the file. This is much faster than writing over the entire file. However, there are security implications for “set valid data” (it might expose leftover data on the disk if you don’t properly initialize the file) and you must be an administrator on the machine to use it.

Here’s another example, with output, using a smaller file size:

PS C:\> FSUTIL.EXE File CreateNew X:\TestFile.DAT (40GB)
File X:\TestFile.DAT is created
PS C:\> FSUTIL.EXE File SetValidData X:\TestFile.DAT (40GB)
Valid data length is changed

 

4. Run the tool

 

With the tool installed and the test file created, you can start running SQLIO.

You also want to make sure there’s nothing else running on the computer, so that other running processes don’t interfere with your results by putting additional load on the CPU, network or storage. If the disk you are using is shared in any way (like a LUN on a SAN), you want to make sure that nothing else is competing with your testing. If you’re using any form of IP storage (iSCSI LUN, SMB file share), you want to make sure that you’re not running on a network congested with other kinds of traffic.

WARNING: You could be generating a whole lot of disk IO, network traffic and/or CPU load when you run SQLIO. If you’re in a shared environment, you might want to talk to your administrator and ask permission. This could generate a whole lot of load and disturb anyone else using other VMs in the same host, other LUNs on the same SAN or other traffic on the same network.

From an old command prompt or a PowerShell prompt, issue a single command line to start getting some performance results. Here is your first example, with output, generating random 8KB reads on that file we just created:

PS C:\> C:\SQLIO\SQLIO.EXE -s10 -kR -frandom -b8 -t8 -o16 -LS -BN X:\TestFile.DAT
sqlio v1.5.SG
using system counter for latency timings, 2337894 counts per second
8 threads reading for 10 secs from file X:\TestFile.DAT
        using 8KB random IOs
        enabling multiple I/Os per thread with 16 outstanding
        buffering set to not use file nor disk caches (as is SQL Server)
using current size: 40960 MB for file: X:\TestFile.DAT
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec: 36096.60
MBs/sec:   282.00
latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 3
Max_Latency(ms): 55
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 30 19 12  8  6  5  4  3  3  2  2  2  1  1  1  1  0  0  0  0  0  0  0  0  0

So, for this specific disk (a simple Storage Space created from a pool of 3 SSDs), I am getting over 36,000 IOPs of 8KB each with an average of 3 milliseconds of latency (time it takes for the operation to complete, from start to finish). Not bad in terms of IOPS, but the latency for 8KB IOs seems a little high for SSD-based storage. We’ll investigate that later.

Let’s try now another command using sequential 512KB reads on that same file:

PS C:\> C:\SQLIO\SQLIO.EXE -s10 -kR -fsequential -b512 -t2 -o16 -LS -BN X:\TestFile.DAT
sqlio v1.5.SG
using system counter for latency timings, 2337894 counts per second
2 threads reading for 10 secs from file X:\TestFile.DAT
        using 512KB sequential IOs
        enabling multiple I/Os per thread with 16 outstanding
        buffering set to not use file nor disk caches (as is SQL Server)
using current size: 40960 MB for file: X:\TestFile.DAT
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec:  1376.09
MBs/sec:   688.04
latency metrics:
Min_Latency(ms): 6
Avg_Latency(ms): 22
Max_Latency(ms): 23
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%:  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 33 67  0

I got about 688 MB/sec with an average latency of 22 milliseconds per IO. Again, good throughput, but the latency looks high for SSDs. We’ll dig deeper in a moment.

 

5. Understand the parameters used

 

Now let’s inspect the parameters on those SQLIO command lines. I know it’s a bit overwhelming at first, so we’ll go slow. And keep in mind that, for SQLIO parameters, lowercase and uppercase mean different things, so be careful.

Here is the explanation for the parameters used above:

 

Parameter | Description | Notes
--- | --- | ---
-s | The duration of the test, in seconds | You can use 10 seconds for a quick test. For any serious work, use at least 60 seconds.
-k | R=Read, W=Write | Be careful with using writes on SSDs for a long time. They can wear out the drive.
-f | Random or Sequential | Random is common for OLTP workloads. Sequential is common for Reporting, Data Warehousing.
-b | Size of the IO in KB | 8KB is the typical IO for OLTP workloads. 512KB is common for Reporting, Data Warehousing.
-t | Threads | For large IOs, just a couple is enough. Sometimes just one. For small IOs, you could need as many as the number of CPU cores.
-o | Outstanding IOs or queue depth | In RAID, SAN or Storage Spaces setups, a single disk can be made up of multiple physical disks. You can start with twice the number of physical disks used by the volume where the file sits. Using a higher number will increase your latency, but can get you more IOPs and throughput.
-LS | Capture latency information | Always important to know the average time to complete an IO, end-to-end.
-BN | Do not buffer | This asks for no hardware or software buffering. Buffering plus a small file size will give you the performance of the memory, not the disks.

 

For OLTP workloads, I commonly start with 8KB random IOs, 8 threads, 16 outstanding. 8KB is the size of the page used by SQL Server for its data files. In parameter form, that would be: -frandom -b8 -t8 -o16. For reporting or OLAP workloads with large IO, I commonly start with 512KB IOs, 2 threads and 16 outstanding. 512KB is a common IO size when SQL Server loads a batch of 64 data pages when using the read ahead technique for a table scan. In parameter form, that would be: -fsequential -b512 -t2 -o16. These numbers will need to be adjusted if your machine has many cores and/or if your volume is backed by a large number of physical disks.

If you’re curious, here are more details about parameters for SQLIO, coming from the tool’s help itself:

Usage: D:\sqlio\sqlio.exe [options] [<filename>...]
        [options] may include any of the following:
        -k<R|W>                 kind of IO (R=reads, W=writes)
        -t<threads>             number of threads
        -s<secs>                number of seconds to run
        -d<drv_A><drv_B>..      use same filename on each drive letter given
        -R<drv_A/0>,<drv_B/1>.. raw drive letters/number for I/O
        -f<stripe factor>       stripe size in blocks, random, or sequential
        -p[I]<cpu affinity>     cpu number for affinity (0 based)(I=ideal)
        -a[R[I]]<cpu mask>      cpu mask for (R=roundrobin (I=ideal)) affinity
        -o<#outstanding>        depth to use for completion routines
        -b<io size(KB)>         IO block size in KB
        -i<#IOs/run>            number of IOs per IO run
        -m<[C|S]><#sub-blks>    do multi blk IO (C=copy, S=scatter/gather)
        -L<[S|P][i|]>           latencies from (S=system, P=processor) timer
        -B<[N|Y|H|S]>           set buffering (N=none, Y=all, H=hdwr, S=sfwr)
        -S<#blocks>             start I/Os #blocks into file
        -v1.1.1                 I/Os runs use same blocks, as in version 1.1.1
        -F<paramfile>           read parameters from <paramfile>
Defaults:
        -kR -t1 -s30 -f64 -b2 -i64 -BN testfile.dat
Maximums:
        -t (threads):                   256
        no. of files, includes -d & -R: 256
        filename length:                256

 

6. Tune the parameters for large IO

 

Now that you have the basics down, we can spend some time looking at how you can refine your number of threads and queue depth for your specific configuration. This might help us figure out why we had those higher than expected latency numbers in the initial runs. You basically need to experiment with the -t and the -o parameters until you find the combination that gives you the best results.

Let’s start with queue depth. You first want to find out the latency for a given system with a small queue depth, like 1 or 2. For 512KB IOs, here’s what I get from my test disk with a queue depth of 1 and a thread count of 1:

PS C:\> C:\SQLIO\SQLIO.EXE -s10 -kR -fsequential -b512 -t1 -o1 -LS -BN X:\TestFile.DAT
sqlio v1.5.SG
using system counter for latency timings, 2337894 counts per second
1 thread reading for 10 secs from file X:\TestFile.DAT
        using 512KB sequential IOs
        enabling multiple I/Os per thread with 1 outstanding
        buffering set to not use file nor disk caches (as is SQL Server)
using current size: 40960 MB for file: X:\TestFile.DAT
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec:   871.00
MBs/sec:   435.50
latency metrics:
Min_Latency(ms): 1
Avg_Latency(ms): 1
Max_Latency(ms): 1
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%:  0 100  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

For large IOs, we typically look at the throughput (in MB/sec). With 1 outstanding IO, we are at 435 MB/sec with just 1 millisecond of latency per IO. However, if we don’t queue up some IO, we’re not extracting the full throughput of the disk, since we’ll be processing data while the disk sits idle waiting for more work. Let’s see what happens if we queue up more IOs:

PS C:\> C:\SQLIO\SQLIO.EXE -s10 -kR -fsequential -b512 -t1 -o2 -LS -BN X:\TestFile.DAT
sqlio v1.5.SG
using system counter for latency timings, 2337894 counts per second
1 thread reading for 10 secs from file X:\TestFile.DAT
        using 512KB sequential IOs
        enabling multiple I/Os per thread with 2 outstanding
        buffering set to not use file nor disk caches (as is SQL Server)
using current size: 40960 MB for file: X:\TestFile.DAT
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec:  1377.70
MBs/sec:   688.85
latency metrics:
Min_Latency(ms): 1
Avg_Latency(ms): 1
Max_Latency(ms): 2
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%:  0 100  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

OK. We are now up to 688 MB/sec with 2 outstanding IOs, and our average latency is still at the same 1 millisecond per IO. You can also see that we now have a max latency of 2 milliseconds, although the histogram shows that most IOs are still taking 1ms. Let’s double it up again to see what happens:

PS C:\> C:\SQLIO\SQLIO.EXE -s10 -kR -fsequential -b512 -t1 -o4 -LS -BN X:\TestFile.DAT
sqlio v1.5.SG
using system counter for latency timings, 2337894 counts per second
1 thread reading for 10 secs from file X:\TestFile.DAT
        using 512KB sequential IOs
        enabling multiple I/Os per thread with 4 outstanding
        buffering set to not use file nor disk caches (as is SQL Server)
using current size: 40960 MB for file: X:\TestFile.DAT
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec:  1376.70
MBs/sec:   688.35
latency metrics:
Min_Latency(ms): 1
Avg_Latency(ms): 2
Max_Latency(ms): 3
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%:  0  0 67 33  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Well, at a queue depth of 4, we gained nothing (we are still at 688 MB/sec), but our latency is now solid at 2 milliseconds, with 33% of the IOs taking 3 milliseconds. Let’s give it one more bump to see what happens. Trying now 8 outstanding IOs:

PS C:\> C:\SQLIO\SQLIO.EXE -s10 -kR -fsequential -b512 -t1 -o8 -LS -BN X:\TestFile.DAT
sqlio v1.5.SG
using system counter for latency timings, 2337894 counts per second
1 thread reading for 10 secs from file X:\TestFile.DAT
        using 512KB sequential IOs
        enabling multiple I/Os per thread with 8 outstanding
        buffering set to not use file nor disk caches (as is SQL Server)
using current size: 40960 MB for file: X:\TestFile.DAT
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec:  1376.50
MBs/sec:   688.25
latency metrics:
Min_Latency(ms): 2
Avg_Latency(ms): 5
Max_Latency(ms): 6
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%:  0  0  0  0  0 68 32  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

As you can see, increasing the -o parameter is not helping. After we doubled the queue depth from 4 to 8, there was no improvement in throughput. All we did was more than double our latency to an average of 5 milliseconds, with many IOs taking 6 milliseconds. That’s when you know you’re queueing up too much IO.

So, it seems like 2 outstanding IOs is a reasonable number for this disk. Now we can see if we can gain by spreading this across multiple threads. What we want to avoid here is bottlenecking on a single CPU core, which is very common when doing lots and lots of IO. A simple experiment is to double the number of threads while halving the queue depth. Let’s now try 2 threads with 1 outstanding IO each. This will give us the same 2 outstanding IOs total:

PS C:\> C:\SQLIO\SQLIO.EXE -s10 -kR -fsequential -b512 -t2 -o1 -LS -BN X:\TestFile.DAT
sqlio v1.5.SG
using system counter for latency timings, 2337894 counts per second
2 threads reading for 10 secs from file X:\TestFile.DAT
        using 512KB sequential IOs
        enabling multiple I/Os per thread with 1 outstanding
        buffering set to not use file nor disk caches (as is SQL Server)
using current size: 40960 MB for file: X:\TestFile.DAT
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec:  1377.90
MBs/sec:   688.95
latency metrics:
Min_Latency(ms): 1
Avg_Latency(ms): 1
Max_Latency(ms): 2
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%:  0 100  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Well, it seems like using two threads here did not buy us anything. We’re still at about the same throughput and latency. That pretty much proves that 1 thread was enough for this kind of configuration and workload. This is not surprising for large IO. However, for smaller IO size, the CPU is more taxed and we might hit a single core bottleneck. Just in case, I looked at the CPU via Task Manager and confirmed we were only using 7% of the CPU and obviously none of the 4 cores were too busy.

 

7. Tune queue depth for small IO

 

Performing the same tuning exercise for small IO is typically more interesting. For this one, we’ll automate things a bit using a little PowerShell scripting to run SQLIO in a loop and parse its output. This way we can try a lot of different options and see which one works best. This might take a while to run, though… Here’s a script that you can run from a PowerShell prompt, trying out many different queue depths:

1..64 | % { 
   $o = "-o $_"; 
   $r = C:\SQLIO\SQLIO.EXE -s10 -kR -frandom -b8 $o -t1 -LS -BN X:\testfile.dat
   $i = $r.Split("`n")[10].Split(":")[1].Trim()
   $m = $r.Split("`n")[11].Split(":")[1].Trim()
   $l = $r.Split("`n")[14].Split(":")[1].Trim()
   $o + ", " + $i + " iops, " + $m + " MB/sec, " + $l + " ms"
}

The script basically runs SQLIO 64 times, each time using a different queue depth, from 1 to 64. The results from SQLIO are stored in the $r variable and parsed to show IOPs, throughput and latency on a single line. There is some fun string parsing there, leveraging the Split() function to break the output by line and then again to break each line in half to get the actual numbers. Here’s the sample output from my system:

-o 1, 9446.79 iops, 73.80 MB/sec, 0 ms
-o 2, 15901.80 iops, 124.23 MB/sec, 0 ms
-o 3, 20758.20 iops, 162.17 MB/sec, 0 ms
-o 4, 24021.20 iops, 187.66 MB/sec, 0 ms
-o 5, 26047.90 iops, 203.49 MB/sec, 0 ms
-o 6, 27559.10 iops, 215.30 MB/sec, 0 ms
-o 7, 28666.40 iops, 223.95 MB/sec, 0 ms
-o 8, 29320.90 iops, 229.06 MB/sec, 0 ms
-o 9, 29733.70 iops, 232.29 MB/sec, 0 ms
-o 10, 30337.00 iops, 237.00 MB/sec, 0 ms
-o 11, 30407.50 iops, 237.55 MB/sec, 0 ms
-o 12, 30609.78 iops, 239.13 MB/sec, 0 ms
-o 13, 30843.40 iops, 240.96 MB/sec, 0 ms
-o 14, 31548.50 iops, 246.47 MB/sec, 0 ms
-o 15, 30692.10 iops, 239.78 MB/sec, 0 ms
-o 16, 30810.40 iops, 240.70 MB/sec, 0 ms
-o 17, 31815.00 iops, 248.55 MB/sec, 0 ms
-o 18, 33115.19 iops, 258.71 MB/sec, 0 ms
-o 19, 31290.40 iops, 244.45 MB/sec, 0 ms
-o 20, 32430.40 iops, 253.36 MB/sec, 0 ms
-o 21, 33345.60 iops, 260.51 MB/sec, 0 ms
-o 22, 31634.80 iops, 247.14 MB/sec, 0 ms
-o 23, 31330.50 iops, 244.76 MB/sec, 0 ms
-o 24, 32769.40 iops, 256.01 MB/sec, 0 ms
-o 25, 34264.30 iops, 267.68 MB/sec, 0 ms
-o 26, 31679.00 iops, 247.49 MB/sec, 0 ms
-o 27, 31501.60 iops, 246.10 MB/sec, 0 ms
-o 28, 33259.40 iops, 259.83 MB/sec, 0 ms
-o 29, 33882.30 iops, 264.70 MB/sec, 0 ms
-o 30, 32009.40 iops, 250.07 MB/sec, 0 ms
-o 31, 31518.10 iops, 246.23 MB/sec, 0 ms
-o 32, 33548.30 iops, 262.09 MB/sec, 0 ms
-o 33, 33912.19 iops, 264.93 MB/sec, 0 ms
-o 34, 32640.00 iops, 255.00 MB/sec, 0 ms
-o 35, 31529.30 iops, 246.32 MB/sec, 0 ms
-o 36, 33973.50 iops, 265.41 MB/sec, 0 ms
-o 37, 34174.62 iops, 266.98 MB/sec, 0 ms
-o 38, 32556.50 iops, 254.34 MB/sec, 0 ms
-o 39, 31521.00 iops, 246.25 MB/sec, 0 ms
-o 40, 34337.60 iops, 268.26 MB/sec, 0 ms
-o 41, 34455.00 iops, 269.17 MB/sec, 0 ms
-o 42, 32265.00 iops, 252.07 MB/sec, 0 ms
-o 43, 31681.80 iops, 247.51 MB/sec, 0 ms
-o 44, 34017.69 iops, 265.76 MB/sec, 0 ms
-o 45, 34433.80 iops, 269.01 MB/sec, 0 ms
-o 46, 33213.19 iops, 259.47 MB/sec, 0 ms
-o 47, 31475.20 iops, 245.90 MB/sec, 0 ms
-o 48, 34467.50 iops, 269.27 MB/sec, 0 ms
-o 49, 34529.69 iops, 269.76 MB/sec, 0 ms
-o 50, 33086.19 iops, 258.48 MB/sec, 0 ms
-o 51, 31157.90 iops, 243.42 MB/sec, 1 ms
-o 52, 34075.30 iops, 266.21 MB/sec, 1 ms
-o 53, 34475.90 iops, 269.34 MB/sec, 1 ms
-o 54, 33333.10 iops, 260.41 MB/sec, 1 ms
-o 55, 31437.60 iops, 245.60 MB/sec, 1 ms
-o 56, 34072.69 iops, 266.19 MB/sec, 1 ms
-o 57, 34352.80 iops, 268.38 MB/sec, 1 ms
-o 58, 33524.21 iops, 261.90 MB/sec, 1 ms
-o 59, 31426.10 iops, 245.51 MB/sec, 1 ms
-o 60, 34763.19 iops, 271.58 MB/sec, 1 ms
-o 61, 34418.10 iops, 268.89 MB/sec, 1 ms
-o 62, 33223.19 iops, 259.55 MB/sec, 1 ms
-o 63, 31959.30 iops, 249.68 MB/sec, 1 ms
-o 64, 34760.90 iops, 271.56 MB/sec, 1 ms

As you can see, for small IOs, we got consistently better performance as we increased the queue depth for the first few runs. After 14 outstanding IOs, adding more started giving us very little improvement until things flattened out completely. As we kept adding more queue depth, all we had was more latency with no additional benefit in IOPS or throughput. Here’s that same data on a chart:

clip_image001

So, in this setup, we seem to start losing steam at around 10 outstanding IOs. However, I noticed in Task Manager that one core was really busy and our overall CPU utilization was at 40%.

clip_image002

In this quad-core system, any overall utilization above 25% could mean there was a core bottleneck when using a single thread. Maybe we can do better with multiple threads. Let’s try increasing the number of threads with a matching reduction of queue depth so we end up with the same number of total outstanding IOs.

$o = 32
$t = 1
While ($o -ge 1) { 
   $pt = "-t $t"; 
   $po = "-o $o"; 
   $r = C:\SQLIO\SQLIO.EXE -s10 -kR -frandom -b8 $po $pt -LS -BN X:\testfile.dat
   $i = $r.Split("`n")[10].Split(":")[1].Trim()
   $m = $r.Split("`n")[11].Split(":")[1].Trim()
   $l = $r.Split("`n")[14].Split(":")[1].Trim()
   $pt + " " + $po + ", " + $i + " iops, " + $m + " MB/sec, " + $l + " ms"
   $o = $o / 2
   $t = $t * 2
}

Here’s the output:

-t 1 -o 32, 32859.30 iops, 256.71 MB/sec, 0 ms
-t 2 -o 16, 35946.30 iops, 280.83 MB/sec, 0 ms
-t 4 -o 8, 35734.80 iops, 279.17 MB/sec, 0 ms
-t 8 -o 4, 35470.69 iops, 277.11 MB/sec, 0 ms
-t 16 -o 2, 35418.60 iops, 276.70 MB/sec, 0 ms
-t 32 -o 1, 35273.60 iops, 275.57 MB/sec, 0 ms

As you can see, in my system, adding a second thread improved things by about 10%, reaching nearly 36,000 IOPS. It seems like we were a bit limited by the performance of a single core. We call that being “core bound”. See below the more even per-core CPU utilization when using 2 threads.

clip_image003

However, 4 threads did not help and the overall CPU utilization was below 50% the whole time. Here’s the full SQLIO.EXE output with my final selected parameters for 8KB random IO in this configuration:

PS C:\> C:\SQLIO\SQLIO.EXE -s10 -kR -frandom -b8 -t2 -o16 -LS -BN X:\TestFile.DAT
sqlio v1.5.SG
using system counter for latency timings, 2337894 counts per second
2 threads reading for 10 secs from file X:\TestFile.DAT
        using 8KB random IOs
        enabling multiple I/Os per thread with 16 outstanding
        buffering set to not use file nor disk caches (as is SQL Server)
using current size: 40960 MB for file: X:\TestFile.DAT
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec: 35917.90
MBs/sec:   280.60
latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 0
Max_Latency(ms): 4
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 66 26  7  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

For systems with more capable storage, it’s easier to get “core bound” and adding more threads can make a much more significant difference. As I mentioned, it’s important to monitor the per-core CPU utilization via Task Manager or Performance Monitor to look out for these bottlenecks.

 

8. Multiple runs are better than one

 

One thing you might have noticed with SQLIO (or any other tool like it) is that the results are not always the same given the same parameters. For instance, one of our “-b8 -t2 -o16” runs yielded 35,946 IOPs while another gave us 35,917 IOPs. How can you tell which one is right? Ideally, once you settle on a specific set of parameters, you should run SQLIO a few times and average out the results. Here’s a sample PowerShell script to do it, using the last set of parameters we used for the 8KB IOs:

$ti=0
$tm=0
$tl=0
$tr=10
1..$tr | % {
   $r = C:\SQLIO\SQLIO.EXE -s10 -kR -frandom -b8 -t2 -o16 -LS -BN X:\TestFile.DAT
   $i = $r.Split("`n")[10].Split(":")[1].Trim()
   $m = $r.Split("`n")[11].Split(":")[1].Trim()
   $l = $r.Split("`n")[14].Split(":")[1].Trim()
   "Run " + $_ + " = " + $i + " IOPs, " + $m + " MB/sec, " + $l + " ms"
   $ti = $ti + $i
   $tm = $tm + $m
   $tl = $tl + $l
}
$ai = $ti / $tr
$am = $tm / $tr
$al = $tl / $tr
"Average = " + $ai + " IOPs, " + $am + " MB/sec, " + $al + " ms"

The script essentially runs SQLIO that number of times, totalling the numbers for IOPs, throughput and latency, so it can show an average at the end. The $tr variable represents the total number of runs desired. Variables starting with $t hold the totals. Variables starting with $a hold averages. Here’s a sample output:

Run 1 = 36027.40 IOPs, 281.46 MB/sec, 0 ms
Run 2 = 35929.80 IOPs, 280.70 MB/sec, 0 ms
Run 3 = 35955.90 IOPs, 280.90 MB/sec, 0 ms
Run 4 = 35963.30 IOPs, 280.96 MB/sec, 0 ms
Run 5 = 35944.19 IOPs, 280.81 MB/sec, 0 ms
Run 6 = 35903.60 IOPs, 280.49 MB/sec, 0 ms
Run 7 = 35922.60 IOPs, 280.64 MB/sec, 0 ms
Run 8 = 35949.19 IOPs, 280.85 MB/sec, 0 ms
Run 9 = 35979.30 IOPs, 281.08 MB/sec, 0 ms
Run 10 = 35921.60 IOPs, 280.63 MB/sec, 0 ms
Average = 35949.688 IOPs, 280.852 MB/sec, 0 ms

As you can see, there’s a bit of variance there and it’s always a good idea to capture multiple runs. You might want to run each iteration for a longer time, like 60 seconds each.

 

9. Performance Monitor

 

Performance Monitor is a tool built into Windows (client and server) that shows specific performance information for several components of the system. For local storage, you can look into details about the performance of physical disks, logical disks and Hyper-V virtual disks. For remote storage you can inspect networking, SMB file shares and much more. In any case, you want to keep an eye on your processors, as a whole and per core.

Here are a few counters we can inspect, for instance, while running that random 8KB IO workload we just finished investigating:

 

Counter Set | Counter | Instance | Notes
--- | --- | --- | ---
Logical Disk | Avg. Disk Bytes/Transfer | Specific disk and/or Total | Average IO size
Logical Disk | Avg. Disk Queue Length | Specific disk and/or Total | Average queue depth
Logical Disk | Avg. Disk sec/Transfer | Specific disk and/or Total | Average latency
Logical Disk | Disk Bytes/sec | Specific disk and/or Total | Throughput
Logical Disk | Disk Transfers/sec | Specific disk and/or Total | IOPs
Processor | % Processor Time | Specific core and/or Total | Total CPU utilization
Processor | % Privileged Time | Specific core and/or Total | CPU used by privileged system services
Processor | % Interrupt Time | Specific core and/or Total | CPU used to handle hardware interrupts
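
If you prefer to capture these counters from a PowerShell prompt instead of the Performance Monitor UI, the Get-Counter cmdlet can sample the same data. Here is a minimal sketch (X: is the test disk used in this post; adjust the instance names to your own setup):

# Sample the key disk and CPU counters once per second, 10 times
$counters = "\LogicalDisk(X:)\Disk Transfers/sec",
            "\LogicalDisk(X:)\Disk Bytes/sec",
            "\LogicalDisk(X:)\Avg. Disk sec/Transfer",
            "\Processor(_Total)\% Processor Time"
Get-Counter -Counter $counters -SampleInterval 1 -MaxSamples 10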

 

Performance Monitor defaults to a line graph view, but I personally prefer to use the report view (you can get to it from the line chart view by pressing CTRL-G twice). Here’s an example of what I see for my test system while running “C:\SQLIO\SQLIO.EXE -s10 -kR -frandom -b8 -t2 -o16 -LS -BN X:\TestFile.DAT”.

clip_image004

Note 1: Disk counters here are shown in plain bytes, in base 10. That means that what SQLIO defines as 8KB shows here as 8,192 bytes and the 282.49 MB/sec shows as 296,207,602 bytes/sec. So, for those concerned with the difference between a megabyte (MB) and a mebibyte (MiB), there’s some more food for thought and debate.

Note 2: Performance Monitor, by default, updates the information once every second and you will sometimes see numbers that are slightly higher or slightly lower than the SQLIO full run average.

 

10. SQLIO and SMB file shares

 

You can use SQLIO to get the same type of performance information for SMB file shares. It is as simple as mapping the file share to a drive letter using the old “NET USE” command or the new PowerShell cmdlet “New-SmbMapping”. You can also use a UNC path directly instead of using drive letters. Here are a couple of examples:
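
For example, to map the share used in the examples below to the S: drive (S: is an arbitrary drive letter and \\FSC5-D\X$ is just the share from this post), either of these would work:

NET USE S: \\FSC5-D\X$
New-SmbMapping -LocalPath S: -RemotePath \\FSC5-D\X$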

PS C:\> C:\SQLIO\SQLIO.EXE -s10 -kR -fsequential -b512 -t1 -o3 -LS -BN \\FSC5-D\X$\TestFile.DAT
sqlio v1.5.SG
using system counter for latency timings, 2337892 counts per second
1 thread reading for 10 secs from file \\FSC5-D\X$\TestFile.DAT
        using 512KB sequential IOs
        enabling multiple I/Os per thread with 3 outstanding
        buffering set to not use file nor disk caches (as is SQL Server)
using current size: 40960 MB for file: \\FSC5-D\X$\TestFile.DAT
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec:  1376.40
MBs/sec:   688.20
latency metrics:
Min_Latency(ms): 2
Avg_Latency(ms): 2
Max_Latency(ms): 3
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%:  0  0 100  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Notice I bumped up the queue depth a bit to get the same throughput as we were getting on the local disk. We’re at 2 milliseconds of latency here. As you can probably tell, this SMB configuration is using an RDMA network interface.

PS C:\> C:\SQLIO\SQLIO.EXE -s10 -kR -frandom -b8 -t2 -o24 -LS -BN \\FSC5-D\X$\TestFile.DAT
sqlio v1.5.SG
using system counter for latency timings, 2337892 counts per second
2 threads reading for 10 secs from file \\FSC5-D\X$\TestFile.DAT
        using 8KB random IOs
        enabling multiple I/Os per thread with 24 outstanding
        buffering set to not use file nor disk caches (as is SQL Server)
using current size: 40960 MB for file: \\FSC5-D\X$\TestFile.DAT
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec: 34020.69
MBs/sec:   265.78
latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 0
Max_Latency(ms): 6
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 44 33 15  6  2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Again I increased the queue depth a bit to get the best IOPs in this configuration. This is also close to the local performance and average latency is still under 1 millisecond.

 

11. Performance Monitor and SMB shares

 

When using Performance Monitor to look at SMB Shares, you should use the “SMB Client Shares” set of performance counters. Here are the main counters to watch:

 

Counter Set | Counter | Instance | Notes
--- | --- | --- | ---
SMB Client Shares | Avg. Data Bytes/Request | Specific share and/or Total | Average IO size
SMB Client Shares | Avg. Data Queue Length | Specific share and/or Total | Average queue depth
SMB Client Shares | Avg. Sec/Data Request | Specific share and/or Total | Average latency
SMB Client Shares | Data Bytes/sec | Specific share and/or Total | Throughput
SMB Client Shares | Data Requests/sec | Specific share and/or Total | IOPs
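
As with the local disk counters, you can also sample these from PowerShell using Get-Counter. A quick sketch (the * wildcard should return one instance per mapped share):

Get-Counter -Counter "\SMB Client Shares(*)\Data Requests/sec",
                     "\SMB Client Shares(*)\Data Bytes/sec",
                     "\SMB Client Shares(*)\Avg. Sec/Data Request" -SampleInterval 1 -MaxSamples 10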

 

Also, here is a view of performance monitor while running the random 8KB workload shown above:

clip_image005

 

12. Conclusion

 

I hope you have learned how to use SQLIO to perform some storage testing of your own. I encourage you to use it to look at the performance of the new features in Windows Server 2012, like Storage Spaces and SMB 3.0. Let me know if you were able to try it out and feel free to share some of your experiments via blog comments.

Q and A: Is it possible to run SMB Direct from within a VM?


Question received via blog mail:

Jose-

I picked up a couple ConnectX-2 adapters and a cable off of ebay for cheap (about $300 for everything) to test out SMB Direct.  I followed your blog "Deploying Windows Server 2012 with SMB Direct (SMB over RDMA) and the Mellanox ConnectX-2/ConnectX-3 using InfiniBand – Step by Step" and got it working.  It sure is fast and easy to setup! 

Another technology I was looking to explore was SR-IOV in Hyper-V.  When I created the virtual switch using the HCA and enabled single-root IO on it, SMB Direct no longer worked from the host.  Are these two technologies (SMB Direct and SR-IOV) mutually exclusive?  The Get-SmbServerNetworkInterface does not show an RDMA capable interface after enabling the virtual switch. 

I was hoping SMB Direct would work from within a virtual machine.  More specifically I was hoping I'd be able to utilize network direct/NDKPI from within the VM and I was using SMB Direct to verify if this was possible or not. 

Long story short, is it possible to run SMB Direct from within a VM?

 

Answer:

Those are known limitations of RDMA and SMB Direct. If you enable SR-IOV for the NIC, you lose the RDMA capabilities. If you team the RDMA NICs, you lose the RDMA capabilities. If you connect the RDMA NIC to the virtual switch, you lose the RDMA capabilities.

Essentially SMB needs to have a direct line of sight to the RDMA hardware to do its magic. If you include any additional layers in between, we can no longer program the NIC for RDMA.

If you want to use RDMA in your Hyper-V over SMB configuration, you need to have a NIC (or two for fault tolerance) used for RDMA and a NIC (or two for fault tolerance) that you connect to the virtual switch.
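
To check what SMB actually sees on each side, you can verify RDMA capability at both the NDIS layer and the SMB layer. A quick sketch (run the first two on the Hyper-V host and the last one on the file server):

Get-NetAdapterRdma                # NICs with RDMA enabled at the NDIS layer
Get-SmbClientNetworkInterface     # interfaces the SMB client sees, with an RDMA Capable column
Get-SmbServerNetworkInterface     # interfaces the SMB server sees, with an RDMA Capable column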

More details at http://blogs.technet.com/b/josebda/archive/2013/02/04/hyper-v-over-smb-performance-considerations.aspx

Q and A: Can I use SMB3 storage without RDMA?


Question received via e-mail:

Is it practical use SMB3 storage without RDMA or do we have a use case for production rather than development or test?
I thought RDMA would be essential for production deployment of Hyper-V SMB storage.

Answer:

RDMA is not a requirement for the Hyper-V over SMB scenario.
The most important things that RDMA can give you are lower latency and lower CPU utilization.

To give you an idea, without RDMA, I was able to keep a single 10GbE port busy in a 16-core/2-socket Romley system using a little over 10% of the CPU.
For many, using 10% of the CPU is OK in this case. With RDMA, it dropped to less than 5% of the CPU.
Those become much more important if you are using very high bandwidth, like multiple 10GbE, 40GbE (Ethernet) or 54GbIB (InfiniBand).
In those cases, without RDMA, you could end up using much more of your CPU just to do network IO. Not good.

To make a better estimate of your requirements, you need to consider:

  • Number of VMs per host
  • Number of virtual processors per VM
  • Average number of IOs per VM
  • Average size of the IOs from the VM
  • Number of physical cores and sockets per host
  • Physical network configuration (type/speed/count of ports)

With that we can think of the expected load on the CPU and on the network, and how important using RDMA would be.
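
As a back-of-the-envelope illustration of that estimate, here is a minimal sketch with made-up numbers (replace them with your own):

# Hypothetical load: 20 VMs per host, 100 IOs per second per VM, 32KB average IO size
$vmsPerHost = 20
$iopsPerVM  = 100
$ioSizeKB   = 32
$kbPerSec   = $vmsPerHost * $iopsPerVM * $ioSizeKB
"Expected storage traffic per host: about {0} MB/sec" -f ($kbPerSec / 1024)

That works out to roughly 62.5 MB/sec, which a single 10GbE port can handle with plenty of room to spare, so in that case RDMA would be mostly about saving CPU and reducing latency rather than about raw bandwidth.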
 

Q and A: I only have two NICs on my Hyper-V host. Should I team them or not?


Question via e-mail:

I am using blade servers for my Hyper-V cluster and I can only have two NICs per blade in this configuration.

I am considering two options on how to configure the NICs:

1)      Use one NIC for internal network and one NIC for external network, connected to the virtual switch
2)      Team the two NICs together and use the same path for all kinds of traffic

What would you recommend?

Answer

If you're using clusters, I assume you're concerned with high availability and network fault tolerance. In this case using one NIC for each kind of traffic creates two single points of failure. You should avoid that.

I would recommend that you team the two NICs, connect the team to the virtual switch and add a few virtual NICs to the parent partition for your storage, migration, cluster and management traffic. You can then use QoS policies to manage your quality of service.

If you're using SMB for storage, be sure to have multiple vNICs (one for each physical NIC behind the team), so you can properly leverage SMB Multichannel in combination with NIC teaming. By the way, SMB Direct (RDMA) won't work with this scenario.

The first thing you want to do is create a team out of the two NICs and connect the team to a Hyper-V virtual switch. For instance:

New-NetLbfoTeam Team1 -TeamMembers NIC1, NIC2 -TeamNicName TeamNIC1
New-VMSwitch TeamSwitch -NetAdapterName TeamNIC1 -MinimumBandwidthMode Weight -AllowManagementOS $false

Next, you want to create multiple vNICs on the parent partition, one for each kind of traffic (two for SMB). Here's an example:

Add-VMNetworkAdapter -ManagementOS -Name SMB1 -SwitchName TeamSwitch
Add-VMNetworkAdapter -ManagementOS -Name SMB2 -SwitchName TeamSwitch
Add-VMNetworkAdapter -ManagementOS -Name Migration -SwitchName TeamSwitch
Add-VMNetworkAdapter -ManagementOS -Name Cluster -SwitchName TeamSwitch
Add-VMNetworkAdapter -ManagementOS -Name Management -SwitchName TeamSwitch

After this, you want to configure these vNICs properly. This includes setting IP addresses and creating separate subnets for each kind of traffic. You can optionally put each of them on a different VLAN.
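
For example, assuming a hypothetical 192.168.101.0/24 subnet and VLAN 101 for the first SMB vNIC, the per-vNIC configuration could look like this (repeat with different subnets and VLAN IDs for the other vNICs):

New-NetIPAddress -InterfaceAlias "vEthernet (SMB1)" -IPAddress 192.168.101.11 -PrefixLength 24
Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName SMB1 -Access -VlanId 101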

Since you have lots of NICs now and you're already in manual configuration territory anyway, you might want to help SMB Multichannel by pointing it to the NICs that should be used by SMB. You can do this by configuring SMB Multichannel constraints instead of letting SMB try all different paths. For instance, assuming that your Scale-Out File Server name is SOFS, you could use:

New-SmbMultichannelConstraint -ServerName SOFS -InterfaceAlias SMB1, SMB2

Last but not least, you might also want to set QoS for each kind of traffic, using the facilities provided by the Hyper-V virtual switch. One way to do it is:

Set-VMNetworkAdapter -ManagementOS -Name SMB1 -MinimumBandwidthWeight 20
Set-VMNetworkAdapter -ManagementOS -Name SMB2 -MinimumBandwidthWeight 20
Set-VMNetworkAdapter -ManagementOS -Name Migration -MinimumBandwidthWeight 20
Set-VMNetworkAdapter -ManagementOS -Name Cluster -MinimumBandwidthWeight 5
Set-VMNetworkAdapter -ManagementOS -Name Management -MinimumBandwidthWeight 5
Set-VMNetworkAdapter -VMName * -MinimumBandwidthWeight 1

There is a great TechNet page with details on this and other network configurations at http://technet.microsoft.com/en-us/library/jj735302.aspx


Slides for the Instructor-Led Lab on Windows Server 2012 Storage from MMS 2013


I delivered an instructor-led lab on Windows Server 2012 Storage at MMS 2013 yesterday (4/8), with a repeat scheduled for tomorrow (4/10) at 2:45 PM. You can find the details on this lab at http://www.2013mms.com/topic/details/WS-IL303

I used a few slides to comment on the contents of the lab, with an overview on the capabilities covered at the lab, including:

  • Storage Spaces
  • Data Deduplication
  • SMB 3.0 and application storage support
  • SMB Multichannel
  • SMB Transparent Failover
  • Cluster-Aware Updating (CAU)

A few people in the lab asked for the slides, so I am attaching it to this blog post in PDF format. If you're at MMS, you can run this also as a self-paced lab. Look under the "Windows Server Labs" and find the lab called "Building a Windows Server 2012 Storage Infrastructure".

Also, if you're here at MMS, don't miss my talk tomorrow (4/10) at 8:30 AM on "File Storage Strategies for Private Cloud" at Ballroom F. The session code is WS-B309 and you can see the details at http://www.2013mms.com/topic/details/WS-B309

MMS 2013 Demo: Hyper-V over SMB at high throughput with SMB Direct and SMB Multichannel


Overview

 

I delivered a demo of Hyper-V over SMB this week at MMS 2013 that’s an evolution of a demo I did back at the Windows Server 2012 launch and also via a TechNet Radio session.

Back then I showed two physical servers running a SQLIO simulation. One played the role of the File Server and the other worked as a SQL Server.

This time around I used 12 VMs accessing a File Server at the same time. So this is a SQL-in-a-VM, Hyper-V over SMB demo, instead of showing SQL Server directly over SMB.

 

Hardware

 

The diagram below shows the details of the configuration.

You have an EchoStreams FlacheSAN2 working as the File Server, with 2 Intel CPUs at 2.40 GHz and 64GB of RAM. It includes 6 LSI SAS adapters and 48 Intel SSDs attached directly to the server. This is an impressively packed 2U unit.

The Hyper-V Server is a Dell PowerEdge R720 with 2 Intel CPUs at 2.70 GHz and 128GB of RAM. There are 12 VMs configured in the Hyper-V host, each with 4 virtual processors and 8GB of RAM.

Both the File Server and the Hyper-V host use three 54 Gbps Mellanox ConnectX-3 network interfaces sitting on PCIe Gen3 x8 slots.

 

image

Results

 

The demo showcases two workloads: SQLIO with 512KB IOs and SQLIO with 32KB IOs. For each one, results are shown for a physical host (a single instance of SQLIO running over SMB, but without Hyper-V) and with virtualization (12 Hyper-V VMs running simultaneously over SMB). See the details below.

 

image

 

The first workload (using 512KB IOs) shows very high throughput from the VMs (around 16.8 GBytes/sec combined from all 12 VMs). That’s roughly the equivalent of fifteen 10Gbps Ethernet ports combined or around twenty 8Gbps Fibre Channel ports. And look at that low CPU utilization...

The second workload shows high IOPS (over 300,000 IOPs of 32KB each). That IO size is definitely larger than most high IOPs demos you’ve seen before. This also delivers throughput of over 10 GBytes/sec. It’s important to note that this demo accomplishes this on 2-socket/16-core servers, even though this specific workload is fairly CPU-intensive.

Notes:

  • The screenshots above show an instant snapshot of a running workload using Performance Monitor. I also ran each workload for only 20 seconds. Ideally you would run the workload multiple times with a longer duration and average things out.
  • Some of the 6 SAS HBAs on the File Server are sitting on an x4 PCIe slot, since not every one of the 9 slots on the server is x8. For this reason, some of the HBAs perform better than others.
  • Using 4 virtual processors for each of the 12 VMs appears to be less than ideal. I'm planning to experiment with using more virtual processors per VM to potentially improve the performance a bit (see the one-liner below).
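
If you want to try that experiment yourself, changing the virtual processor count for every VM on a host is a one-liner. This is just a sketch: it assumes the VMs are shut down first and that the host has enough logical processors to back the new count.

# Example only: give every VM on this host 8 virtual processors (the VMs must be off)
Set-VMProcessor -VMName * -Count 8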

 

Conclusion

 

This is yet another example of how SMB Direct and SMB Multichannel can be combined to produce a high performance File Server for Hyper-V Storage.

This specific configuration pushes the limits of this box with 9 PCIe Gen3 slots in use (six for SAS HBAs and three for RDMA NICs).

I am planning to showcase this setup again in a presentation later this week at MMS 2013. If you're planning to attend, I look forward to seeing you there.

 

P.S.: Some of the steps used for setting up a configuration similar to this one using PowerShell are available at http://blogs.technet.com/b/josebda/archive/2013/01/26/sample-scripts-for-storage-spaces-standalone-hyper-v-over-smb-and-sqlio-testing.aspx

P.P.S.: For further details on how to run SQLIO, you might want to read this post: http://blogs.technet.com/b/josebda/archive/2013/03/28/sqlio-powershell-and-storage-performance-measuring-iops-throughput-and-latency-for-both-local-disks-and-smb-file-shares.aspx
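
P.P.P.S.: To give a sense of what drives workloads like these, here is the general shape of the SQLIO command lines for the two IO sizes shown above. The parameters, IO patterns and target path below are illustrative placeholders, not the exact ones used in the demo:

# 512KB IOs (shown here as sequential reads), 20-second run, unbuffered, with latency stats
sqlio.exe -kR -fsequential -b512 -t4 -o8 -s20 -LS -BN \\FileServer\Share\testfile.dat

# 32KB IOs (shown here as random reads)
sqlio.exe -kR -frandom -b32 -t8 -o8 -s20 -LS -BN \\FileServer\Share\testfile.dat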

TechNet Radio series covers Windows Server 2012 File Server and SMB 3.0 scenarios


I have been working with Bob Hunt at the TechNet Radio team to provide a series of webcasts with information about SMB 3.0 and the File Server role in Windows Server 2012.

These are fairly informal conversations, but Bob is really good at posing interesting questions, clarifying the different scenarios and teasing out relevant details on SMB 3.0.

By the way, don’t be fooled by the “Radio” in the name. These are available as both video and audio, typically including at least one demo for each episode.

Here is a list of the TechNet Radio episodes Bob and I recorded (including one with Claus Joergensen, another PM in the SMB team), in the order they were published:

1) Windows Server 2012 Hyper-V over SMB (August 31st)

Summary: Bob Hunt and Jose Barreto join us for today’s show as they discuss Windows Server 2012 Hyper-V support for remote file storage using SMB 3.0. Tune in as they discuss the basic requirements for Hyper-V over SMB, as well as its latest enhancements and why this solution is an easy and affordable file storage alternative.

Link: http://channel9.msdn.com/Shows/TechNet+Radio/TechNet-Radio-Windows-Server-2012-Hyper-V-over-SMB

2) Windows Server 2012 - How to Scale-Out a File Server and use it for Hyper-V (September 5th)

Summary: Bob Hunt and Jose Barreto are back for today’s episode where they show us how to scale-out a file server in Windows Server 2012 and how to use it with Hyper-V. Tune in as they discuss the advancements made to file servers in terms of scale, storage, virtual processors, support for VMs per host and per cluster as well as demoing a classic vs. scaled out file server.

Link: http://channel9.msdn.com/Shows/TechNet+Radio/TechNet-Radio-Windows-Server-2012-How-to-Scale-Out-a-File-Server-using-Hyper-V

3) Hyper-V over SMB: Step-by-Step Installation using PowerShell (September 16th)

Summary: Bob Hunt and Jose Barreto continue their Hyper-V over SMB for Windows Server 2012 series, and in today’s episode they discuss how you can configure this installation using PowerShell. Tune in as they take a deep dive into how you can leverage all the features of SMB 3.0 as they go through this extensive step-by-step walkthrough.

Link: http://channel9.msdn.com/Shows/TechNet+Radio/TechNet-Radio-Hyper-V-over-SMB-Step-by-Step-Installation-using-PowerShell

4) SMB Multi-channel Basics for Windows Server 2012 and SMB 3.0 (October 8th)

Summary: Bob Hunt and Jose Barreto continue their SMB 3.0 for Windows Server 2012 series, and in today’s episode they discuss the recent improvements made around networking capabilities found within SMB Multichannel which can help increase network performance and availability for File Servers.

Link: http://channel9.msdn.com/Shows/TechNet+Radio/TechNet-Radio-SMB-Multi-channel-Basics-for-Windows-Server-2012-and-SMB-30

5) SQL Server over SMB 3.0 Overview (October 23rd)

Summary: Principal Program Manager from the Windows File Server team, Claus Joergensen joins Bob Hunt and Jose Barreto as they discuss how and why you would want to implement SQL Server 2012 over SMB 3.0. Tune in as they chat about the benefits, how to set it up as well as address any potential concerns you may have such as performance issues.

Link: http://channel9.msdn.com/Shows/TechNet+Radio/TechNet-Radio-SQL-Server-over-SMB-30-Overview

6) SMB 3.0 Encryption Overview (November 26th)

Summary: Bob Hunt and Jose Barreto continue their SMB series for Windows Server 2012 and in today’s episode they chat about SMB Encryption. Tune in as they discuss what this security component is, why it’s important, and how it is implemented and configured in your environment, with a quick demo on how to do this via both the GUI and PowerShell.

Link: http://channel9.msdn.com/Shows/TechNet+Radio/TechNet-Radio-SMB-30-Encryption-Overview

7) SMB 3.0 Deployment Scenarios (December 6th)

Summary: Bob Hunt and Jose Barreto continue their SMB series for Windows Server 2012 and in today’s episode they chat about deployment scenarios and ways in which you can implement all of the new features in SMB 3.0. Tune in as they dive deep into various deployment strategies for SMB.

Link: http://channel9.msdn.com/Shows/TechNet+Radio/TechNet-Radio-SMB-30-Deployment-Scenarios

8) Hyper-V over SMB 3.0 Performance Considerations (January 14th)

Summary: Bob Hunt and Jose Barreto continue their Hyper-V over SMB series and in today’s episode they discuss performance considerations for a sample Enterprise configuration and what you may want to think about during your deployment.

Link: http://channel9.msdn.com/Shows/TechNet+Radio/TechNet-Radio-Hyper-V-over-SMB-30-Performance-Considerations

9) Hyper-V Local vs. Hyper-V SMB Performance (February 7th)

Summary: Bob Hunt and Jose Barreto continue their Hyper-V over SMB series and in today’s episode they discuss a recently released independent study that compares the performance between Hyper-V local and Hyper-V over SMB. Tune in as they chat about the results as they take a deep dive to see how Windows Server 2012 performs in terms of storage, networking performance, efficiency and scalability.

Link: http://channel9.msdn.com/Shows/TechNet+Radio/TechNet-Radio-Hyper-V-Local-vs-Hyper-V-SMB-Performance

10) Windows Server 2012 File Server Tips and Tricks (March 5th)

Summary: Bob Hunt and Jose Barreto continue their Windows Server 2012 File Server and SMB 3.0 series and in today’s episode they lay out their top Tips and Tricks. Tune in as they disclose a number of useful bits of information, from how to use multiple subnets when deploying SMB Multichannel in a cluster to how to avoid loopback configurations for Hyper-V over SMB.

Link: http://channel9.msdn.com/Shows/TechNet+Radio/TechNet-Radio-Windows-Server-2012-File-Server-Tips-and-Tricks

11) Windows Server 2012 File Server and SMB 3.0 – Simpler By Design (April 11th)

Summary: Just how easy is it to use SMB 3.0? That’s the topic for today’s show as Bob Hunt and Jose Barreto discuss the ins and outs of Windows Server 2012 File Storage for Virtualization. Tune in as they chat about SMB Transparent Failover, Scale-Out, Multichannel, SMB Direct and Encryption and much, much more!

Link: http://channel9.msdn.com/Shows/TechNet+Radio/TechNet-Radio-Windows-Server-2012-File-Server-and-SMB-30--Simpler-By-Design

 

More episodes are coming. Check back soon…

Iron Networks shows a complete private cloud pod at MMS 2013 with Windows Server 2012 (Storage Spaces, SMB3, Hyper-V) and System Center 2012 SP1


I was visiting the expo area at MMS 2013 earlier today and saw that Iron Networks was showing a private cloud pod there, complete with a set of well-matched layers of compute, storage and networking.

They were demonstrating several of the latest capabilities in Windows Server 2012 and System Center 2012 SP1, including:

  • Shared SAS Storage with Windows Server 2012 Storage Spaces, with multiple JBODs and different tiers of storage (SSDs, performance HDDs, capacity HDDs)
  • Windows Server 2012 Scale-Out File Server Clusters using SMB 3.0, including SMB Multichannel and SMB Direct.
  • Dual 10GbE RDMA networks for storage, dual 10GbE networks for tenants. Plus dual 40GbE aggregate switches and a management switch.
  • Cluster-in-a-box solutions implementing several roles, including a Network Virtualization Gateway, System Center 2012 SP1 and File Server Clusters.
  • Dense set of compute nodes running Windows Server 2012 Hyper-V hosts with lots of cores and memory

Here is some additional information from their booth signage:

image

They had a live system in the booth that was actually used in several demos at MMS 2013, including some of the keynote demos.
I took a picture and put some labels on it. Compare the physical systems below with their diagram above.

image

Loved this hardware! Makes our software solution really come to life...

File Server Tip: How to rebalance a Scale-Out File Server using a little PowerShell


A Scale-Out File Server is a cluster role that offers the same SMB file share (or set of shares) on every node of the cluster. As clients come in, they are spread across the nodes using a round-robin mechanism based on DNS. In the common case of having many clients and just a few file server cluster nodes, things even out quite nicely.

You can check which clients are accessing which nodes using the SMB Witness facility in SMB 3.0:

Get-SmbWitnessClient

This cmdlet will show which clients are currently connected to the cluster along with the physical nodes they selected for data access and the other node that acts as an SMB Witness.
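
If you just want a quick view of how clients are currently spread across the nodes, you can trim that output down to the two properties the rest of this post relies on (ClientName and FileServerNodeName), for example:

Get-SmbWitnessClient | Sort-Object FileServerNodeName | Format-Table ClientName, FileServerNodeName -AutoSize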

There are situations that might cause a cluster to become unbalanced. For instance, if you have a two-node Scale-Out File Server and one node fails, all clients will end up connected to the surviving node.

It's great that the SMB Transparent Failover feature makes that event cause no disruption to any of the clients. However, after the second node recovers, the clients won't automatically move back to the original node.

To remedy this situation, you can use a simple cmdlet to move the SMB client between Scale-Out File Server nodes. Here's how you do it:

Move-SmbWitnessClient -ClientName Client -DestinationNode Node 

You can also combine the two cmdlets mentioned above with the Get-ClusterNode cmdlet to create a PowerShell script to automatically spread all clients evenly across the cluster nodes:

# Enumerate clients that have a witness registration and the cluster nodes that are up
$Clients = Get-SmbWitnessClient | Sort ClientName
$Nodes   = Get-ClusterNode | ? State -eq "Up" | Sort Name
$MaxNode = $Nodes.Count
$Node    = 0

# Assign clients to nodes in a simple round-robin fashion
foreach ($Client in $Clients) {

    $ClientName = $Client.ClientName
    $NodeName   = $Nodes[$Node].Name

    "Moving client " + $ClientName + " to node " + $NodeName
    Move-SmbWitnessClient -ClientName $ClientName -DestinationNode $NodeName -Confirm:$false
    Start-Sleep -Seconds 5

    # Wrap around to the first node after reaching the last one
    $Node++
    if ($Node -ge $MaxNode) {$Node=0}

}

Notes about the script:

  • Only clients with an SMB witness connection will be enumerated and able to move. If you just brought up a failed node in a two-node cluster, it might take a few minutes for the clients to re-establish the witness connections.
  • It will take some time for each actual move to occur. Each SMB client will be notified that it should move, and it will then perform the move lazily. Give it a minute or so before checking whether the moves happened via Get-SmbWitnessClient.
  • The move is transparent and does not cause any disruption to clients.
  • Make sure to validate the script in a test environment to confirm it's behaving as you expected.

This is the simplest form of this script and there are at least a few better ways to do it. For instance, you could take into account the current node for every client and do the minimum number of moves needed to achieve balance. You could also look at the number of open files per client (using Get-SmbOpenFile) and balance the number of open files per node instead of the number of clients. Those would obviously be a bit more complicated to write. If you invest the time to create a better version of the script, share it in the comments...

 

-----------

 

P.S.: Jeromy Statia from Microsoft IT shared the following script, which takes into consideration the number of open files per node and moves the minimum number of clients needed to provide a balanced SMB Scale-Out File Server Cluster.

He plans to have this running as a scheduled task that executes once every hour or so (a sample of how to set that up follows the script). This is definitely a more advanced solution than my script above. Thanks for sharing, Jeromy…

 

$clusterNodes = Get-ClusterNode | ? State -eq "Up" | Sort-Object Name | Select-Object -ExpandProperty Name
Write-Host "Grabbing all witness client information..."

$witnessClientObject = @(Get-SmbWitnessClient | %{
    $clientObj = @{};
    $clientObj['WitnessClient'] = $_;
    $clientObj['OpenFileCount'] = @(Get-SmbOpenFile -ClientUserName "*$($_.ClientName)*").Count;
    New-Object PSObject -Property $clientObj
    } | sort-object OpenFileCount -Descending)

if($witnessClientObject.count -gt 0)
{
    Write-Host "Found $($witnessClientObject.Count) objects"
    $witnessClientObject | ft {$_.witnessclient.ClientName}, {$_.OpenFileCount} -a
    Write-Host "Getting node distribution"
    $distributionOfFiles = @($witnessClientObject | Group-Object {$_.WitnessClient.FileServerNodeName})
    $distributionObjects = @()

    foreach($distribution in $distributionOfFiles)
    {
        $distributionObject = @{}
        $distributionObject['FileServerNodeName'] = $distribution.Name
        $distributionObject['OpenFileCount'] = ($distribution.Group | Measure-Object OpenFileCount -Sum).Sum
        $distributionObject['Clients'] = $distribution.Group
        $distributionObjects += New-Object PSObject -Property $distributionObject
    }

    #add in any cluster nodes that have 0 witness connections

    foreach($unusedClusterNode in ($clusterNodes |? { $name = $_; -not($distributionOfFiles |?{ $_.Name -match $name}) }))
    {
        $distributionObject = @{}
        $distributionObject['FileServerNodeName'] = $unusedClusterNode
        $distributionObject['OpenFileCount'] = 0
        $distributionObject['Clients'] = @()
        $distributionObjects += New-Object PSObject -Property $distributionObject
    }

    #sort by the number of open files per server node

    $sortedDistribution = $distributionObjects | Sort-Object OpenFileCount -Descending
    $sortedDistribution |%{ Write-Host "$($_.FileServerNodeName) - $($_.OpenFileCount)"}
    Write-Host ""
    Write-host "Distribution OpenFileCounts:"
    Write-Host ""

    #Balance where needed

    for($step = 0; $step -lt $sortedDistribution.Count/2; ++$step)
    {
        #Get the difference between the largest and smallest file counts for this step
        #divide by two so we don't flop a single connection back and forth on each run
        $currentFileOpenVariance = [Math]::Ceiling(($sortedDistribution[$step].OpenFileCount - $sortedDistribution[-1 - $step].OpenFileCount)/2)
        Write-Host "Variance for step $($step): $($currentFileOpenVariance)"
        $moveTargets = @()
        $moveOpenFiles = 0

        foreach($client in $sortedDistribution[$step].Clients)
        {
            if($client.OpenFileCount -gt 0)
            {
                $varianceAfterMove = ($moveOpenFiles + $client.OpenFileCount)
                Write-Host "Checking $($varianceAfterMove) to be less than or equal to $($currentFileOpenVariance) to be a move target"
                if($varianceAfterMove -le $currentFileOpenVariance)
                {
                    Write-Host "Client $($client.WitnessClient.ClientName) is a target for move"
                    $moveTargets += $client.WitnessClient
                    $moveOpenFiles += $client.OpenFileCount
                }
            }
        }

        if($moveTargets.Count -gt 0)
        {
            foreach($moveTarget in $moveTargets)
            {
                Write-Host "Moving witness client $($moveTarget.ClientName) to SMB file server node $($sortedDistribution[-1 - $step].FileServerNodeName)"
                Move-SmbWitnessClient -ClientName $moveTarget.ClientName `
                                      -DestinationNode $sortedDistribution[-1 - $step].FileServerNodeName `
                                      -Confirm:$false `
                                      -ErrorAction Continue | Out-Null
            }
        }
        else
        {
            Write-Host "No move targets available"
        }
    }
}

Write-Host "SMB Witness client connections should now be as balanced as possible"
