iSCSI performance

cfreak
Posts: 2
Joined: Tue Dec 09, 2014 9:26 pm

iSCSI performance

Post by cfreak »

Hi,
I have a new 4-node 7450 (3.2.1 MU1) connected to 32 blades running vSphere 5.5 U2 over iSCSI, attached via 4x 6120XG switches. Each vSphere host has 8 paths to each datastore, round robin with IOPS=1.
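
For reference, this is roughly how a round-robin policy with IOPS=1 would be applied per device from the ESXi shell (the naa ID below is a placeholder, and there are other ways to do it, e.g. via a SATP claim rule):

Code: Select all

# set the path selection policy for the device to round robin (placeholder naa ID)
esxcli storage nmp device set --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_RR

# switch paths after every single I/O instead of the default 1000
esxcli storage nmp psp roundrobin deviceconfig set --device naa.xxxxxxxxxxxxxxxx --type iops --iops 1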

When I migrate VMs from our P4900 to the new 3PAR (1 TB thin VV), the average I/O latency inside the Linux guest increases by quite some margin.

As I've spent the last two weeks debugging and have run out of ideas (besides ordering FC hardware), perhaps you guys can help me.

Benchmarks with fio also show the increased latency for single I/O requests:

Code: Select all

root@3par:/tmp# fio --rw=randwrite --refill_buffers --name=test --size=100M --direct=1 --bs=4k --ioengine=libaio --iodepth=1
test: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
2.0.8
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/3136K /s] [0 /784  iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=5096
  write: io=102400KB, bw=3151.7KB/s, iops=787 , runt= 32497msec
    slat (usec): min=28 , max=2825 , avg=43.49, stdev=28.71
    clat (usec): min=967 , max=6845 , avg=1219.50, stdev=156.39
     lat (usec): min=1004 , max=6892 , avg=1263.63, stdev=160.23
    clat percentiles (usec):
     |  1.00th=[ 1012],  5.00th=[ 1048], 10.00th=[ 1064], 20.00th=[ 1112],
     | 30.00th=[ 1160], 40.00th=[ 1192], 50.00th=[ 1224], 60.00th=[ 1240],
     | 70.00th=[ 1272], 80.00th=[ 1288], 90.00th=[ 1336], 95.00th=[ 1384],
     | 99.00th=[ 1608], 99.50th=[ 1816], 99.90th=[ 3184], 99.95th=[ 3376],
     | 99.99th=[ 5792]
    bw (KB/s)  : min= 3009, max= 3272, per=100.00%, avg=3153.97, stdev=51.20
    lat (usec) : 1000=0.42%
    lat (msec) : 2=99.21%, 4=0.34%, 10=0.02%
  cpu          : usr=0.82%, sys=3.95%, ctx=25627, majf=0, minf=21
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=25600/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=102400KB, aggrb=3151KB/s, minb=3151KB/s, maxb=3151KB/s, mint=32497msec, maxt=32497msec

Disk stats (read/write):
    dm-0: ios=0/26098, merge=0/0, ticks=0/49716, in_queue=49716, util=92.15%, aggrios=0/25652, aggrmerge=0/504, aggrticks=0/33932, aggrin_queue=33888, aggrutil=91.79%
  sda: ios=0/25652, merge=0/504, ticks=0/33932, in_queue=33888, util=91.79%


Code: Select all

root@p4900:/tmp# fio --rw=randwrite --refill_buffers --name=test --size=100M --direct=1 --bs=4k --ioengine=libaio --iodepth=1
test: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
2.0.8
Starting 1 process
test: Laying out IO file(s) (1 file(s) / 100MB)
Jobs: 1 (f=1): [w] [100.0% done] [0K/7892K /s] [0 /1973  iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3193
  write: io=102400KB, bw=7464.2KB/s, iops=1866 , runt= 13719msec
    slat (usec): min=21 , max=1721 , avg=29.71, stdev=14.23
    clat (usec): min=359 , max=22106 , avg=502.07, stdev=222.14
     lat (usec): min=388 , max=22138 , avg=532.20, stdev=222.72
    clat percentiles (usec):
     |  1.00th=[  386],  5.00th=[  402], 10.00th=[  410], 20.00th=[  426],
     | 30.00th=[  438], 40.00th=[  454], 50.00th=[  470], 60.00th=[  490],
     | 70.00th=[  516], 80.00th=[  548], 90.00th=[  596], 95.00th=[  660],
     | 99.00th=[ 1032], 99.50th=[ 1192], 99.90th=[ 2672], 99.95th=[ 4192],
     | 99.99th=[ 8032]
    bw (KB/s)  : min= 6784, max= 8008, per=100.00%, avg=7464.89, stdev=339.23
    lat (usec) : 500=64.13%, 750=32.82%, 1000=1.85%
    lat (msec) : 2=1.04%, 4=0.10%, 10=0.05%, 50=0.01%
  cpu          : usr=2.01%, sys=5.86%, ctx=25635, majf=0, minf=20
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=25600/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=102400KB, aggrb=7464KB/s, minb=7464KB/s, maxb=7464KB/s, mint=13719msec, maxt=13719msec

Disk stats (read/write):
    dm-0: ios=0/25647, merge=0/0, ticks=0/12712, in_queue=12712, util=87.10%, aggrios=0/25617, aggrmerge=0/166, aggrticks=0/12748, aggrin_queue=12736, aggrutil=86.71%
  sda: ios=0/25617, merge=0/166, ticks=0/12748, in_queue=12736, util=86.71%
afidel
Posts: 216
Joined: Tue May 07, 2013 1:45 pm

Re: iSCSI performance

Post by afidel »

Is that during the move, or after the VM is done moving?
cfreak
Posts: 2
Joined: Tue Dec 09, 2014 9:26 pm

Re: iSCSI performance

Post by cfreak »

This benchmark is from one old and one new test VM.
JohnMH
Posts: 505
Joined: Wed Nov 19, 2014 5:14 am

Re: iSCSI performance

Post by JohnMH »

Since you have the front-end host view, I would start by looking at the back-end storage view, e.g. see what the 3PAR VLUN is doing in the IMC under Reporting > Charts. You'll probably find it's sitting idle waiting for data, with the occasional spike.
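
If you prefer the CLI to the IMC, something along these lines should give a similar picture of per-VLUN and per-port service times (exact option names may vary between CLI releases):

Code: Select all

# non-idle VLUN performance, split by read/write, a few iterations
statvlun -ni -rw -iter 5

# host-facing port statistics on the iSCSI ports
statport -host -rw -iter 5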

3PAR uses interrupt coalescing at the controller's host HBA to reduce CPU load in a multi-tenant environment, so if you only have a single-threaded app, or you only test with a very low queue depth in a benchmark, you'll see higher latencies. For write coalescing, the reason is that the HBA will hold these I/Os until its buffer fills before sending an interrupt to the controller CPU to process the I/O, so you incur a wait state.

If you really do have a single-threaded app, then turn off "intcoal" on the HBA port and it will issue an interrupt for every I/O posted; you will probably also want to adjust the host HBA queue depth.
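
From memory, turning interrupt coalescing off on a host port looks something like the following on the 3PAR CLI (the N:S:P position is a placeholder; check the controlport help on your release before running it):

Code: Select all

# disable interrupt coalescing on host port 0:2:1 (placeholder position)
controlport intcoal disable 0:2:1

# confirm the port parameters afterwards
showport -par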

If you don't have a single-threaded app, then you would be better off testing with a much higher queue depth to simulate multiple hosts or multi-threaded apps on the same HBA port. This in turn will fill the buffer quickly and get the system moving, so you won't have to wait before the interrupt kicks in. Never test with a low queue depth, as you just aren't stressing the system.
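
As a rough illustration only (the numbers are examples, not a sizing recommendation), a deeper, multi-job fio run against the same volume would look like this:

Code: Select all

fio --rw=randwrite --refill_buffers --name=test --size=10G --direct=1 \
    --bs=4k --ioengine=libaio --iodepth=32 --numjobs=4 --group_reporting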

See this post viewtopic.php?f=18&t=883&p=4246&hilit=interrupt#p4246
afidel
Posts: 216
Joined: Tue May 07, 2013 1:45 pm

Re: iSCSI performance

Post by afidel »

Oh, and a 100 MB test file isn't going to tell you anything; you need to be several times the size of the storage cache to get any value out of a synthetic benchmark.
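
If you're not sure how much cache you're up against, the per-node control and data memory shows up in the node listing on the 3PAR CLI (column names may differ slightly by release); size the fio test file at several multiples of the total data cache:

Code: Select all

# list the nodes with their control and data memory (cache) sizes
shownode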
Schmoog
Posts: 242
Joined: Wed Oct 30, 2013 2:30 pm

Re: iSCSI performance

Post by Schmoog »

+100 to what John and afidel said. I see this particular issue coming up relatively often here. These systems are designed to function under load in a multi-threaded, multi-tenant environment. If your benchmarking isn't stressing the system, the performance numbers will be lackluster.

It's a little counterintuitive, because conventional wisdom would say lower load = higher performance. But with extremely small workloads that is not necessarily the case. A workload that isn't big enough to hit the buffer queues, write cache, etc., won't get very high performance numbers on a system that is otherwise idle.