Skip to main content

performance - Different linux page cache behavior for servers doing the same work

I have two sets of servers w/128G memory, distinguished by when they were provisioned, that are behaving very differently while running the exact same daemon (elasticsearch). I am using elasticsearch for full text search, not log storage, so this baically a read heavy operation with minimal writes (less than 1MB/s). This daemon mmap's the full dataset of ~350GB into its virtual memory and then accesses certain portions of it to serve requests. These servers have no swap space configured.



The problem is one set of servers behaves well, issues ~50 major faults per second and need on average 10MB/s of disk io to satisfy that demand. The poorly performing servers see 500 major faults per second and need on average ~200MB/s of disk io to satisfy that. The increase in disk io leads to poor p95 response latencies and occasional overloads as it hits the disk limits of ~550MB/s.




They all sit behind the same load balancer, and are part of the same cluster. I could see if one server was behaving badly it might be differences in load, but to have such a clear distinction with 16 servers behaving poorly and 20 servers behaving well, with them being purhased+provisioned at different times, that something at the kernel/configuration level must be causing the issues.



To get to the question, how can i make these poorly behaving servers act like the ones that are well behaved? Where should the debugging efforts be focused?



Below is some data I've collected to look at what the system is doing from the sar and page-types tools in each of the three states.



Software:
- debian jessie
- linux 4.9.25
- elasticsearch 5.3.2

- openjdk 1.8.0_141



First some page fault data from a well behaved server (from sar -B):



07:55:01 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
08:05:01 PM 3105.89 811.60 2084.40 48.16 3385.30 0.00 0.00 810.35 0.00
08:15:01 PM 4317.65 665.28 910.29 57.03 3160.14 0.00 0.00 1047.14 0.00
08:25:01 PM 3132.84 648.48 868.41 50.00 2899.14 0.00 0.00 801.27 0.00
08:35:01 PM 2940.02 1026.47 2031.69 45.26 3418.65 0.00 0.00 764.05 0.00
08:45:01 PM 2985.72 1011.27 744.34 45.78 2796.54 0.00 0.00 817.23 0.00

08:55:01 PM 2970.14 588.34 858.08 47.65 2852.33 0.00 0.00 749.17 0.00
09:05:01 PM 3489.23 2364.78 2121.48 47.13 3945.93 0.00 0.00 1145.02 0.00
09:15:01 PM 2695.48 594.57 858.56 44.57 2616.84 0.00 0.00 624.13 0.00


And here is one of the poorly performing servers:



07:55:01 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
08:05:01 PM 268547.64 1223.73 5266.73 503.12 71836.18 0.00 0.00 67170.50 0.00
08:15:01 PM 279865.31 944.04 3995.52 520.17 74231.90 0.00 0.00 70000.23 0.00

08:25:01 PM 265806.75 966.55 3747.43 499.45 70443.49 0.00 0.00 66407.62 0.00
08:35:01 PM 251820.93 1831.04 4689.62 475.43 67445.29 0.00 0.00 63056.35 0.00
08:45:01 PM 236055.04 1022.32 3498.37 449.37 62955.37 0.00 0.00 59042.16 0.00
08:55:01 PM 239478.40 971.98 3523.61 451.76 63856.04 0.00 0.00 59953.38 0.00
09:05:01 PM 232213.81 1058.04 4436.75 437.09 62194.61 0.00 0.00 58100.47 0.00
09:15:01 PM 216433.72 911.94 3192.28 413.23 57737.62 0.00 0.00 54126.78 0.00


I suspect this is due to poor performance of the LRU portion of page reclamation. If i run while true; do echo 1 > /proc/sys/vm/drop_caches; sleep 30; done, which drops all non-mmaped pages, there will be an initial spike of disk io, but after about 30 minutes it will settle down . I ran this for ~48 hours on all servers to verify it would show the same reduction in IO both under peak daily load and the low points. It did. Sar now reports the following:




12:55:01 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
01:05:01 PM 121327.14 1482.09 2277.40 140.25 32781.26 0.00 0.00 1764.61 0.00
01:15:01 PM 109911.39 1334.51 1057.51 130.53 31095.68 0.00 0.00 1121.39 0.00
01:25:01 PM 126500.69 1652.51 1216.76 143.07 35830.38 0.00 0.00 2801.84 0.00
01:35:01 PM 132669.45 1857.62 2309.86 148.47 36735.79 0.00 0.00 3181.19 0.00
01:45:01 PM 126872.04 1451.94 1062.94 145.68 34678.26 0.00 0.00 992.60 0.00
01:55:01 PM 121002.21 1818.32 1142.16 139.40 34168.53 0.00 0.00 1640.18 0.00
02:05:01 PM 121824.18 1260.22 2319.56 142.80 33254.67 0.00 0.00 1738.25 0.00
02:15:02 PM 120768.12 1100.87 1143.36 140.20 34195.15 0.00 0.00 1702.83 0.00



Major page faults have been cut to 1/3 the prior value. Pages brought in from disk have cut in half. This reduces disk IO from ~200MB/s to ~100MB/s, but the well behaved server is still outperforming them all significantly with only 50 major faults/s, and only needing to do ~10MB/s of disk io.



To get a look at what the LRU algorithm has to work I use the page-types tool from the kernel. This is a well behaved server (from page-types | awk '$3 > 1000 { print $0 }' | sort -nk3):



             flags      page-count       MB  symbolic-flags                     long-symbolic-flags
0x0000000000000828 257715 1006 ___U_l_____M______________________________ uptodate,lru,mmap
0x0000000000000080 259789 1014 _______S__________________________________ slab
0x000000000000006c 279344 1091 __RU_lA___________________________________ referenced,uptodate,lru,active
0x0000000000000268 305744 1194 ___U_lA__I________________________________ uptodate,lru,active,reclaim

0x0000000000100000 524288 2048 ____________________n_____________________ nopage
0x000000000000082c 632704 2471 __RU_l_____M______________________________ referenced,uptodate,lru,mmap
0x0000000000000000 763312 2981 __________________________________________
0x0000000000000068 2108618 8236 ___U_lA___________________________________ uptodate,lru,active
0x000000000000086c 6987343 27294 __RU_lA____M______________________________ referenced,uptodate,lru,active,mmap
0x0000000000005868 8589411 33552 ___U_lA____Ma_b___________________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x0000000000000868 12513737 48881 ___U_lA____M______________________________ uptodate,lru,active,mmap
total 34078720 133120



This is a poorly performing server:



             flags      page-count       MB  symbolic-flags                     long-symbolic-flags
0x0000000000100000 262144 1024 ____________________n_____________________ nopage
0x0000000000000828 340276 1329 ___U_l_____M______________________________ uptodate,lru,mmap
0x000000000000086c 515691 2014 __RU_lA____M______________________________ referenced,uptodate,lru,active,mmap
0x0000000000000028 687263 2684 ___U_l____________________________________ uptodate,lru
0x0000000000000000 785662 3068 __________________________________________
0x0000000000000868 7946840 31042 ___U_lA____M______________________________ uptodate,lru,active,mmap
0x0000000000005868 8588629 33549 ___U_lA____Ma_b___________________________ uptodate,lru,active,mmap,anonymous,swapbacked

0x0000000000000068 14133541 55209 ___U_lA___________________________________ uptodate,lru,active
total 33816576 132096


And here's what it looks like when looping the drop_caches command:



             flags      page-count       MB  symbolic-flags                     long-symbolic-flags
0x0000000000100000 262144 1024 ____________________n_____________________ nopage
0x0000000000000400 394790 1542 __________B_______________________________ buddy
0x0000000000000000 761557 2974 __________________________________________

0x0000000000000868 1451890 5671 ___U_lA____M______________________________ uptodate,lru,active,mmap
0x000000000000082c 3123142 12199 __RU_l_____M______________________________ referenced,uptodate,lru,mmap
0x0000000000000828 5278755 20620 ___U_l_____M______________________________ uptodate,lru,mmap
0x0000000000005868 8622864 33683 ___U_lA____Ma_b___________________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x000000000000086c 13630124 53242 __RU_lA____M______________________________ referenced,uptodate,lru,active,mmap
total 33816576 132096


Things tried:





  • increase /proc/sys/vm/vfs_cache_pressure to various values between
    150 and 10000. This makes no difference in IO or the data reported by
    page-types, which makes sense because this balances kernel
    structures vs user pages, and my problem is with different user pages

  • increase /proc/sys/vm/swappiness. Didn't expect this to do anything, and it doesn't, but didn't hurt to check.

  • Disable mmap (instead using java's nio which is based on epoll). This immediatly making the server's IO usage look just like the well behaved ones in terms of IO usage. The downside here is that system cpu % is tied to how much IO is happening, with 10MB/s taking ~1.5% and the occasional load up to ~100MB/s of disk io saw system cpu % of 5 to 10%. On a 32 core server that's 1.5 to 3 cpus used entirely to handle epoll. Latency was also worse with nio (vs well behaved mmap'd servers). This is a plausible solution but its a cop-out, without understanding what actually is going wrong.

  • countless hours with the perf tool poking around at stack traces, flame graphs, and looking for differences in kernel behaviour. Little if any insight gained.

  • Checked disk read-ahead settings are the same across servers. raid0 on poorly performing servers defaulted to 2048 blocks, well performing servers raid0 defaulted to 256 block. Updating poorly performing servers to 256 with blockdev --setra shows no effect on IO behavior.

  • Start the jvm with numactl --interleave=all to ensure the problem isn't related to balancing between the two numa nodes. Made no difference.


  • verified with vmtouch, which basically uses mincore(2) to ask the kernel if files are cached, that 99%+ of the buffered memory is being used for the filesystem elasticsearch stores it's data on. This is true in all 3 cases from above.

  • verified with fuser -m that elasticsearch is the only process using files on the filesystem elasticsearch stores it's data on.



To test soon:




  • I'll be re-provisioning one of the misbehaving servers next week, but I'm not optimistic this will have much effect. During this provisioning i'm also updating the raid array to put LVM in front of it. Not expecting anything different from LVM but removing a variable seems worthwhile.

Comments

Popular posts from this blog

linux - iDRAC6 Virtual Media native library cannot be loaded

When attempting to mount Virtual Media on a iDRAC6 IP KVM session I get the following error: I'm using Ubuntu 9.04 and: $ javaws -version Java(TM) Web Start 1.6.0_16 $ uname -a Linux aud22419-linux 2.6.28-15-generic #51-Ubuntu SMP Mon Aug 31 13:39:06 UTC 2009 x86_64 GNU/Linux $ firefox -version Mozilla Firefox 3.0.14, Copyright (c) 1998 - 2009 mozilla.org On Windows + IE it (unsurprisingly) works. I've just gotten off the phone with the Dell tech support and I was told it is known to work on Linux + Firefox, albeit Ubuntu is not supported (by Dell, that is). Has anyone out there managed to mount virtual media in the same scenario?

ubuntu - Monitoring CPU, Mem, disk, on a single server

I've been looking for a simple starter solution for monitoring my [currently] single server hosted solution. Other than Nagios and similar, are there other good (simple) solutions people are using? Answer Everything depends on what you want. For example Munin is very simple, you can install and configure it in less then 10 minutes (on one server), it can sends alarms, make graphs from monitoring cpu, mem. apache connections, eaccellerator, disk io and many many more (it has many plugins). But if you are planning in future get some more machines, munin may not be enough. For example in munin you cant monitor state of individual processes, can't monitor changes in files (for security purpose). So if you wanna only see what is the utilization of basics parameters on your server and don't plan to buy some more servers Munin is what you are looking for, but if you wanna be alarmed when some of your service is down, take more control on what is happeninig on...

hp proliant - Smart Array P822 with HBA Mode?

We get an HP DL360 G8 with an Smart Array P822 controller. On that controller will come a HP StorageWorks D2700 . Does anybody know, that it is possible to run the Smart Array P822 in HBA mode? I found only information about the P410i, who can run HBA. If this is not supported, what you think about the LSI 9207-8e controller? Will this fit good in that setup? The Hardware we get is used but all original from HP. The StorageWorks has 25 x 900 GB SAS 10K disks. Because the disks are not new I would like to use only 22 for raid6, and the rest for spare (I need to see if the disk count is optimal or not for zfs). It would be nice if I'm not stick to SAS in future. As OS I would like to install debian stretch with zfs 0.71 as file system and software raid. I have see that hp has an page for debian to. I would like to use hba mode because it is recommend, that zfs know at most as possible about the disk, and I'm independent from the raid controller. For us zfs have many benefits, ...