Discussion:
[patch 0/3] mm: memcontrol: performance fixlets for 3.18
Johannes Weiner
2014-09-24 15:08:55 UTC
Hi Andrew,

here are two memcg performance fixlets for 3.18. One improves uncharge
batching to reduce expensive res_counter operations and IRQ toggling;
the other allows THP charges to succeed under cache pressure.

Thanks!

include/linux/swap.h | 6 ++-
mm/memcontrol.c | 116 +++++++++++++------------------------------------
mm/swap.c | 27 +++++++++---
mm/swap_state.c | 14 ++----
mm/vmscan.c | 7 +--
5 files changed, 63 insertions(+), 107 deletions(-)

Johannes Weiner
2014-09-24 15:08:56 UTC
From: Michal Hocko <***@suse.cz>

free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
This is not a big deal for the normal release path, but it completely
kills memcg uncharge batching, which is there to reduce res_counter
spin_lock contention. Dave noticed this with his page fault scalability
test case on a large machine, where the lock was basically dominating
on all CPUs:
80.18% 80.18% [kernel] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--66.59%-- res_counter_uncharge_until
| res_counter_uncharge
| uncharge_batch
| uncharge_list
| mem_cgroup_uncharge_list
| release_pages
| free_pages_and_swap_cache
| tlb_flush_mmu_free
| |
| |--90.12%-- unmap_single_vma
| | unmap_vmas
| | unmap_region
| | do_munmap
| | vm_munmap
| | sys_munmap
| | system_call_fastpath
| | __GI___munmap
| |
| --9.88%-- tlb_flush_mmu
| tlb_finish_mmu
| unmap_region
| do_munmap
| vm_munmap
| sys_munmap
| system_call_fastpath
| __GI___munmap

In his case the load was running in the root memcg and that part
has been handled by reverting 05b843012335 ("mm: memcontrol: use
root_mem_cgroup res_counter") because this is a clear regression,
but the problem remains inside dedicated memcgs.

There is no reason to limit release_pages to PAGEVEC_SIZE batches other
than lru_lock hold times. This logic, however, can be moved inside the
function. mem_cgroup_uncharge_list and free_hot_cold_page_list do not
hold any lock for the whole pages_to_free list so it is safe to call
them in a single run.

Page reference count and LRU handling are moved to release_lru_pages,
which is run in PAGEVEC_SIZE batches.

Reported-by: Dave Hansen <***@sr71.net>
Signed-off-by: Michal Hocko <***@suse.cz>
Signed-off-by: Johannes Weiner <***@cmpxchg.org>
---
mm/swap.c | 27 +++++++++++++++++++++------
mm/swap_state.c | 14 ++++----------
2 files changed, 25 insertions(+), 16 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 6b2dc3897cd5..8af99dd68dd2 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -888,9 +888,9 @@ void lru_add_drain_all(void)
}

/*
- * Batched page_cache_release(). Decrement the reference count on all the
- * passed pages. If it fell to zero then remove the page from the LRU and
- * free it.
+ * Batched lru release. Decrement the reference count on all the passed pages.
+ * If it fell to zero then remove the page from the LRU and add it to the given
+ * list to be freed by the caller.
*
* Avoid taking zone->lru_lock if possible, but if it is taken, retain it
* for the remainder of the operation.
@@ -900,10 +900,10 @@ void lru_add_drain_all(void)
* grabbed the page via the LRU. If it did, give up: shrink_inactive_list()
* will free it.
*/
-void release_pages(struct page **pages, int nr, bool cold)
+static void release_lru_pages(struct page **pages, int nr,
+ struct list_head *pages_to_free)
{
int i;
- LIST_HEAD(pages_to_free);
struct zone *zone = NULL;
struct lruvec *lruvec;
unsigned long uninitialized_var(flags);
@@ -943,11 +943,26 @@ void release_pages(struct page **pages, int nr, bool cold)
/* Clear Active bit in case of parallel mark_page_accessed */
__ClearPageActive(page);

- list_add(&page->lru, &pages_to_free);
+ list_add(&page->lru, pages_to_free);
}
if (zone)
spin_unlock_irqrestore(&zone->lru_lock, flags);
+}
+/*
+ * Batched page_cache_release(). Frees and uncharges all given pages
+ * for which the reference count drops to 0.
+ */
+void release_pages(struct page **pages, int nr, bool cold)
+{
+ LIST_HEAD(pages_to_free);

+ while (nr) {
+ int batch = min(nr, PAGEVEC_SIZE);
+
+ release_lru_pages(pages, batch, &pages_to_free);
+ pages += batch;
+ nr -= batch;
+ }
mem_cgroup_uncharge_list(&pages_to_free);
free_hot_cold_page_list(&pages_to_free, cold);
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ef1f39139b71..154444918685 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -265,18 +265,12 @@ void free_page_and_swap_cache(struct page *page)
void free_pages_and_swap_cache(struct page **pages, int nr)
{
struct page **pagep = pages;
+ int i;

lru_add_drain();
- while (nr) {
- int todo = min(nr, PAGEVEC_SIZE);
- int i;
-
- for (i = 0; i < todo; i++)
- free_swap_cache(pagep[i]);
- release_pages(pagep, todo, false);
- pagep += todo;
- nr -= todo;
- }
+ for (i = 0; i < nr; i++)
+ free_swap_cache(pagep[i]);
+ release_pages(pagep, nr, false);
}

/*
--
2.1.0

Andrew Morton
2014-09-24 19:42:34 UTC
Post by Johannes Weiner
free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
This is not a big deal for the normal release path but it completely
kills memcg uncharge batching which reduces res_counter spin_lock
contention. Dave has noticed this with his page fault scalability test
case on a large machine when the lock was basically dominating on all
...
In his case the load was running in the root memcg and that part
has been handled by reverting 05b843012335 ("mm: memcontrol: use
root_mem_cgroup res_counter") because this is a clear regression,
but the problem remains inside dedicated memcgs.
There is no reason to limit release_pages to PAGEVEC_SIZE batches other
than lru_lock held times. This logic, however, can be moved inside the
function. mem_cgroup_uncharge_list and free_hot_cold_page_list do not
hold any lock for the whole pages_to_free list so it is safe to call
them in a single run.
Page reference count and LRU handling is moved to release_lru_pages and
that is run in PAGEVEC_SIZE batches.
Looks OK.
Post by Johannes Weiner
--- a/mm/swap.c
+++ b/mm/swap.c
...
+}
+/*
+ * Batched page_cache_release(). Frees and uncharges all given pages
+ * for which the reference count drops to 0.
+ */
+void release_pages(struct page **pages, int nr, bool cold)
+{
+ LIST_HEAD(pages_to_free);
+ while (nr) {
+ int batch = min(nr, PAGEVEC_SIZE);
+
+ release_lru_pages(pages, batch, &pages_to_free);
+ pages += batch;
+ nr -= batch;
+ }
The use of PAGEVEC_SIZE here is pretty misleading - there are no
pagevecs in sight. SWAP_CLUSTER_MAX would be more appropriate.
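
For reference, the two constants being compared here, with their values
at the time (header locations given from memory):

        /* include/linux/pagevec.h */
        #define PAGEVEC_SIZE    14

        /* include/linux/swap.h */
        #define SWAP_CLUSTER_MAX        32UL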



afaict the only reason for this loop is to limit the hold duration for
lru_lock. And it does a suboptimal job of that because it treats all
lru_locks as one: if release_lru_pages() were to hold zoneA's lru_lock
for 8 pages and then were to drop that and hold zoneB's lru_lock for 8
pages, the logic would then force release_lru_pages() to drop the lock
and return to release_pages() even though it doesn't need to.

So I'm thinking it would be better to move the lock-busting logic into
release_lru_pages() itself. With a suitable comment, natch ;) Only
bust the lock in the case where we really did hold a particular lru_lock
for 16 consecutive pages. Then s/release_lru_pages/release_pages/ and
zap the old release_pages().

Obviously it's not very important - presumably the common case is that
the LRU contains lengthy sequences of pages from the same zone. Maybe.


Johannes Weiner
2014-09-24 21:03:22 UTC
Post by Andrew Morton
Post by Johannes Weiner
+}
+/*
+ * Batched page_cache_release(). Frees and uncharges all given pages
+ * for which the reference count drops to 0.
+ */
+void release_pages(struct page **pages, int nr, bool cold)
+{
+ LIST_HEAD(pages_to_free);
+ while (nr) {
+ int batch = min(nr, PAGEVEC_SIZE);
+
+ release_lru_pages(pages, batch, &pages_to_free);
+ pages += batch;
+ nr -= batch;
+ }
The use of PAGEVEC_SIZE here is pretty misleading - there are no
pagevecs in sight. SWAP_CLUSTER_MAX would be more appropriate.
afaict the only reason for this loop is to limit the hold duration for
lru_lock. And it does a suboptimal job of that because it treats all
lru_locks as one: if release_lru_pages() were to hold zoneA's lru_lock
for 8 pages and then were to drop that and hold zoneB's lru_lock for 8
pages, the logic would then force release_lru_pages() to drop the lock
and return to release_pages() even though it doesn't need to.
So I'm thinking it would be better to move the lock-busting logic into
release_lru_pages() itself. With a suitable comment, natch ;) Only
bust the lock in the case where we really did hold a particular lru_lock
for 16 consecutive pages. Then s/release_lru_pages/release_pages/ and
zap the old release_pages().
Yeah, that's better.
Post by Andrew Morton
Obviously it's not very important - presumably the common case is that
the LRU contains lengthy sequences of pages from the same zone. Maybe.
Even then, the end result is more concise and busts the lock where
it's actually taken, making the whole thing a bit more obvious:

---
From 367f3dcf25141ff6c30400c00ec09cc3c5486f94 Mon Sep 17 00:00:00 2001
From: Michal Hocko <***@suse.cz>
Date: Fri, 5 Sep 2014 11:16:17 +0200
Subject: [patch] mm: memcontrol: do not kill uncharge batching in
free_pages_and_swap_cache

free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
This is not a big deal for the normal release path, but it completely
kills memcg uncharge batching, which is there to reduce res_counter
spin_lock contention. Dave noticed this with his page fault scalability
test case on a large machine, where the lock was basically dominating
on all CPUs:
80.18% 80.18% [kernel] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--66.59%-- res_counter_uncharge_until
| res_counter_uncharge
| uncharge_batch
| uncharge_list
| mem_cgroup_uncharge_list
| release_pages
| free_pages_and_swap_cache
| tlb_flush_mmu_free
| |
| |--90.12%-- unmap_single_vma
| | unmap_vmas
| | unmap_region
| | do_munmap
| | vm_munmap
| | sys_munmap
| | system_call_fastpath
| | __GI___munmap
| |
| --9.88%-- tlb_flush_mmu
| tlb_finish_mmu
| unmap_region
| do_munmap
| vm_munmap
| sys_munmap
| system_call_fastpath
| __GI___munmap

In his case the load was running in the root memcg and that part
has been handled by reverting 05b843012335 ("mm: memcontrol: use
root_mem_cgroup res_counter") because this is a clear regression,
but the problem remains inside dedicated memcgs.

There is no reason to limit release_pages to PAGEVEC_SIZE batches other
than lru_lock hold times. This logic, however, can be moved inside the
function. mem_cgroup_uncharge_list and free_hot_cold_page_list do not
hold any lock for the whole pages_to_free list so it is safe to call
them in a single run.

In release_pages, break the lock at least every SWAP_CLUSTER_MAX (32)
pages, then remove the batching from free_pages_and_swap_cache.

Also update the grossly out-of-date release_pages documentation.

Reported-by: Dave Hansen <***@sr71.net>
Signed-off-by: Michal Hocko <***@suse.cz>
Signed-off-by: Johannes Weiner <***@cmpxchg.org>
---
mm/swap.c | 31 ++++++++++++++++++++-----------
mm/swap_state.c | 14 ++++----------
2 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 6b2dc3897cd5..39affa1932ce 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -887,18 +887,14 @@ void lru_add_drain_all(void)
mutex_unlock(&lock);
}

-/*
- * Batched page_cache_release(). Decrement the reference count on all the
- * passed pages. If it fell to zero then remove the page from the LRU and
- * free it.
- *
- * Avoid taking zone->lru_lock if possible, but if it is taken, retain it
- * for the remainder of the operation.
+/**
+ * release_pages - batched page_cache_release()
+ * @pages: array of pages to release
+ * @nr: number of pages
+ * @cold: whether the pages are cache cold
*
- * The locking in this function is against shrink_inactive_list(): we recheck
- * the page count inside the lock to see whether shrink_inactive_list()
- * grabbed the page via the LRU. If it did, give up: shrink_inactive_list()
- * will free it.
+ * Decrement the reference count on all the pages in @pages. If it
+ * fell to zero, remove the page from the LRU and free it.
*/
void release_pages(struct page **pages, int nr, bool cold)
{
@@ -907,6 +903,7 @@ void release_pages(struct page **pages, int nr, bool cold)
struct zone *zone = NULL;
struct lruvec *lruvec;
unsigned long uninitialized_var(flags);
+ unsigned int uninitialized_var(lock_batch);

for (i = 0; i < nr; i++) {
struct page *page = pages[i];
@@ -914,6 +911,7 @@ void release_pages(struct page **pages, int nr, bool cold)
if (unlikely(PageCompound(page))) {
if (zone) {
spin_unlock_irqrestore(&zone->lru_lock, flags);
+ lock_batch = 0;
zone = NULL;
}
put_compound_page(page);
@@ -930,6 +928,7 @@ void release_pages(struct page **pages, int nr, bool cold)
if (zone)
spin_unlock_irqrestore(&zone->lru_lock,
flags);
+ lock_batch = 0;
zone = pagezone;
spin_lock_irqsave(&zone->lru_lock, flags);
}
@@ -938,6 +937,16 @@ void release_pages(struct page **pages, int nr, bool cold)
VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
+
+ /*
+ * Make sure the IRQ-safe lock-holding time
+ * does not get excessive with a continuous
+ * string of pages from the same zone.
+ */
+ if (++lock_batch == SWAP_CLUSTER_MAX) {
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
+ zone = NULL;
+ }
}

/* Clear Active bit in case of parallel mark_page_accessed */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ef1f39139b71..154444918685 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -265,18 +265,12 @@ void free_page_and_swap_cache(struct page *page)
void free_pages_and_swap_cache(struct page **pages, int nr)
{
struct page **pagep = pages;
+ int i;

lru_add_drain();
- while (nr) {
- int todo = min(nr, PAGEVEC_SIZE);
- int i;
-
- for (i = 0; i < todo; i++)
- free_swap_cache(pagep[i]);
- release_pages(pagep, todo, false);
- pagep += todo;
- nr -= todo;
- }
+ for (i = 0; i < nr; i++)
+ free_swap_cache(pagep[i]);
+ release_pages(pagep, nr, false);
}

/*
--
2.1.0

Andrew Morton
2014-09-24 21:15:44 UTC
Post by Johannes Weiner
Post by Andrew Morton
Obviously it's not very important - presumably the common case is that
the LRU contains lengthy sequences of pages from the same zone. Maybe.
Even then, the end result is more concise and busts the lock where
Yes, that did come out better.
Post by Johannes Weiner
Date: Fri, 5 Sep 2014 11:16:17 +0200
Subject: [patch] mm: memcontrol: do not kill uncharge batching in
free_pages_and_swap_cache
free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
This is not a big deal for the normal release path but it completely
kills memcg uncharge batching which reduces res_counter spin_lock
contention. Dave has noticed this with his page fault scalability test
case on a large machine when the lock was basically dominating on all
80.18% 80.18% [kernel] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--66.59%-- res_counter_uncharge_until
| res_counter_uncharge
| uncharge_batch
| uncharge_list
| mem_cgroup_uncharge_list
| release_pages
| free_pages_and_swap_cache
| tlb_flush_mmu_free
| |
| |--90.12%-- unmap_single_vma
| | unmap_vmas
| | unmap_region
| | do_munmap
| | vm_munmap
| | sys_munmap
| | system_call_fastpath
| | __GI___munmap
| |
| --9.88%-- tlb_flush_mmu
| tlb_finish_mmu
| unmap_region
| do_munmap
| vm_munmap
| sys_munmap
| system_call_fastpath
| __GI___munmap
In his case the load was running in the root memcg and that part
has been handled by reverting 05b843012335 ("mm: memcontrol: use
root_mem_cgroup res_counter") because this is a clear regression,
but the problem remains inside dedicated memcgs.
There is no reason to limit release_pages to PAGEVEC_SIZE batches other
than lru_lock held times. This logic, however, can be moved inside the
function. mem_cgroup_uncharge_list and free_hot_cold_page_list do not
hold any lock for the whole pages_to_free list so it is safe to call
them in a single run.
In release_pages, break the lock at least every SWAP_CLUSTER_MAX (32)
pages, then remove the batching from free_pages_and_swap_cache.
I beefed this paragraph up a bit:

: The release_pages() code was previously breaking the lru_lock each
: PAGEVEC_SIZE pages (ie, 14 pages). However this code has no usage of
: pagevecs so switch to breaking the lock at least every SWAP_CLUSTER_MAX
: (32) pages. This means that the lock acquisition frequency is
: approximately halved and the max hold times are approximately doubled.
:
: The now unneeded batching is removed from free_pages_and_swap_cache().

I doubt if the increased irq-off time will hurt anyone, but who knows...


Michal Hocko
2014-09-25 13:44:03 UTC
On Wed 24-09-14 17:03:22, Johannes Weiner wrote:
[...]
Post by Johannes Weiner
In release_pages, break the lock at least every SWAP_CLUSTER_MAX (32)
pages, then remove the batching from free_pages_and_swap_cache.
Actually I had something like that originally but then decided not to
change the break-out logic, to prevent strange and subtle regressions.
I focused only on the memcg batching POV and left the rest untouched.

I do agree that lru_lock batching can be improved as well. Your change
looks almost correct, but you should count all the pages while the lock
is held; otherwise you might end up holding the lock for too long just
because most pages are off the LRU already for some reason. At least
that is what my original attempt was doing. Something like the
following on top of the current patch:
---
diff --git a/mm/swap.c b/mm/swap.c
index 39affa1932ce..8a12b33936b4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -911,13 +911,22 @@ void release_pages(struct page **pages, int nr, bool cold)
if (unlikely(PageCompound(page))) {
if (zone) {
spin_unlock_irqrestore(&zone->lru_lock, flags);
- lock_batch = 0;
zone = NULL;
}
put_compound_page(page);
continue;
}

+ /*
+ * Make sure the IRQ-safe lock-holding time does not get
+ * excessive with a continuous string of pages from the
+ * same zone. The lock is held only if zone != NULL.
+ */
+ if (zone && ++lock_batch == SWAP_CLUSTER_MAX) {
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
+ zone = NULL;
+ }
+
if (!put_page_testzero(page))
continue;

@@ -937,16 +946,6 @@ void release_pages(struct page **pages, int nr, bool cold)
VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
-
- /*
- * Make sure the IRQ-safe lock-holding time
- * does not get excessive with a continuous
- * string of pages from the same zone.
- */
- if (++lock_batch == SWAP_CLUSTER_MAX) {
- spin_unlock_irqrestore(&zone->lru_lock, flags);
- zone = NULL;
- }
}

/* Clear Active bit in case of parallel mark_page_accessed */
[...]
--
Michal Hocko
SUSE Labs

Johannes Weiner
2014-10-02 15:57:50 UTC
Post by Michal Hocko
[...]
Post by Johannes Weiner
In release_pages, break the lock at least every SWAP_CLUSTER_MAX (32)
pages, then remove the batching from free_pages_and_swap_cache.
Actually I had something like that originally but then decided to
not change the break out logic to prevent from strange and subtle
regressions. I have focused only on the memcg batching POV and led the
rest untouched.
I do agree that lru_lock batching can be improved as well. Your change
looks almost correct but you should count all the pages while the lock
is held otherwise you might happen to hold the lock for too long just
because most pages are off the LRU already for some reason. At least
that is what my original attempt was doing. Something like the following
Yep, that makes sense.

Would you care to send it in so that Andrew can pick it up? Thanks!

Johannes Weiner
2014-09-24 15:08:58 UTC
In a memcg with even just moderate cache pressure, success rates for
transparent huge page allocations drop to zero, wasting a lot of
effort that the allocator puts into assembling these pages.

The reason for this is that the memcg reclaim code was never designed
for higher-order charges. It reclaims in small batches until there is
room for at least one page. Huge page charges only succeed when these
batches add up over a series of huge faults, which is unlikely under
any significant load involving order-0 allocations in the group.

Remove that loop on the memcg side in favor of passing the actual
reclaim goal to direct reclaim, which is already set up and optimized
to meet higher-order goals efficiently.

This brings memcg's THP policy in line with the system policy: if the
allocator painstakingly assembles a hugepage, memcg will at least make
an honest effort to charge it. As a result, transparent hugepage
allocation rates amid cache activity are drastically improved:

                                 vanilla                patched
pgalloc            4717530.80 ( +0.00%)   4451376.40 ( -5.64%)
pgfault             491370.60 ( +0.00%)    225477.40 ( -54.11%)
pgmajfault               2.00 ( +0.00%)         1.80 ( -6.67%)
thp_fault_alloc          0.00 ( +0.00%)       531.60 (+100.00%)
thp_fault_fallback     749.00 ( +0.00%)       217.40 ( -70.88%)

[ Note: this may in turn increase memory consumption from internal
fragmentation, which is an inherent risk of transparent hugepages.
Some setups may have to adjust the memcg limits accordingly to
accommodate this - or, if the machine is already packed to capacity,
disable the transparent huge page feature. ]

Signed-off-by: Johannes Weiner <***@cmpxchg.org>
Reviewed-by: Vladimir Davydov <***@parallels.com>
---
include/linux/swap.h | 6 +++--
mm/memcontrol.c | 69 +++++++++++++---------------------------------------
mm/vmscan.c | 7 +++---
3 files changed, 25 insertions(+), 57 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index ea4f926e6b9b..37a585beef5c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -327,8 +327,10 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
-extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap);
+extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
+ unsigned long nr_pages,
+ gfp_t gfp_mask,
+ bool may_swap);
extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
struct zone *zone,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 89c920156c2a..c2c75262a209 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -478,14 +478,6 @@ enum res_type {
#define OOM_CONTROL (0)

/*
- * Reclaim flags for mem_cgroup_hierarchical_reclaim
- */
-#define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0
-#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
-#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1
-#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
-
-/*
* The memcg_create_mutex will be held whenever a new cgroup is created.
* As a consequence, any change that needs to protect against new child cgroups
* appearing has to hold it as well.
@@ -1791,40 +1783,6 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
NULL, "Memory cgroup out of memory");
}

-static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
- gfp_t gfp_mask,
- unsigned long flags)
-{
- unsigned long total = 0;
- bool noswap = false;
- int loop;
-
- if (flags & MEM_CGROUP_RECLAIM_NOSWAP)
- noswap = true;
-
- for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
- if (loop)
- drain_all_stock_async(memcg);
- total += try_to_free_mem_cgroup_pages(memcg, gfp_mask, noswap);
- /*
- * Allow limit shrinkers, which are triggered directly
- * by userspace, to catch signals and stop reclaim
- * after minimal progress, regardless of the margin.
- */
- if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK))
- break;
- if (mem_cgroup_margin(memcg))
- break;
- /*
- * If nothing was reclaimed after two attempts, there
- * may be no reclaimable pages in this hierarchy.
- */
- if (loop && !total)
- break;
- }
- return total;
-}
-
/**
* test_mem_cgroup_node_reclaimable
* @memcg: the target memcg
@@ -2527,8 +2485,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
struct mem_cgroup *mem_over_limit;
struct res_counter *fail_res;
unsigned long nr_reclaimed;
- unsigned long flags = 0;
unsigned long long size;
+ bool may_swap = true;
+ bool drained = false;
int ret = 0;

if (mem_cgroup_is_root(memcg))
@@ -2547,7 +2506,7 @@ retry:
mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
} else {
mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
- flags |= MEM_CGROUP_RECLAIM_NOSWAP;
+ may_swap = false;
}

if (batch > nr_pages) {
@@ -2572,11 +2531,18 @@ retry:
if (!(gfp_mask & __GFP_WAIT))
goto nomem;

- nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
+ nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
+ gfp_mask, may_swap);

if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
goto retry;

+ if (!drained) {
+ drain_all_stock_async(mem_over_limit);
+ drained = true;
+ goto retry;
+ }
+
if (gfp_mask & __GFP_NORETRY)
goto nomem;
/*
@@ -3652,8 +3618,8 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
if (!ret)
break;

- mem_cgroup_reclaim(memcg, GFP_KERNEL,
- MEM_CGROUP_RECLAIM_SHRINK);
+ try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true);
+
curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -3703,9 +3669,8 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
if (!ret)
break;

- mem_cgroup_reclaim(memcg, GFP_KERNEL,
- MEM_CGROUP_RECLAIM_NOSWAP |
- MEM_CGROUP_RECLAIM_SHRINK);
+ try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false);
+
curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -3954,8 +3919,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
if (signal_pending(current))
return -EINTR;

- progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL,
- false);
+ progress = try_to_free_mem_cgroup_pages(memcg, 1,
+ GFP_KERNEL, true);
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 06123f20a326..dcb47074ae03 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2759,21 +2759,22 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
}

unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
+ unsigned long nr_pages,
gfp_t gfp_mask,
- bool noswap)
+ bool may_swap)
{
struct zonelist *zonelist;
unsigned long nr_reclaimed;
int nid;
struct scan_control sc = {
- .nr_to_reclaim = SWAP_CLUSTER_MAX,
+ .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
.target_mem_cgroup = memcg,
.priority = DEF_PRIORITY,
.may_writepage = !laptop_mode,
.may_unmap = 1,
- .may_swap = !noswap,
+ .may_swap = may_swap,
};

/*
--
2.1.0

Michal Hocko
2014-09-29 13:57:07 UTC
Post by Johannes Weiner
In a memcg with even just moderate cache pressure, success rates for
transparent huge page allocations drop to zero, wasting a lot of
effort that the allocator puts into assembling these pages.
The reason for this is that the memcg reclaim code was never designed
for higher-order charges. It reclaims in small batches until there is
room for at least one page. Huge page charges only succeed when these
batches add up over a series of huge faults, which is unlikely under
any significant load involving order-0 allocations in the group.
Remove that loop on the memcg side in favor of passing the actual
reclaim goal to direct reclaim, which is already set up and optimized
to meet higher-order goals efficiently.
I had concerns the last time you posted a similar patch
(http://marc.info/?l=linux-mm&m=140803277013080&w=2), but I do not see
any of them mentioned or addressed here - especially the unexpected
long stalls and excessive swapout. A 512-page target for direct
reclaim is too much, especially for smaller memcgs where we would need
to drop the priority considerably to even scan that many pages.

THP charges close to the limit are definitely a problem, but this is
way too risky a fix for that problem IMO. Maybe a better approach
would be to limit the reclaim to (clean) page cache (aka
MEM_CGROUP_RECLAIM_NOSWAP). The risk of long stalls should be much
lower and excessive swapouts shouldn't happen at all. What is the
point of swapping out a large number of pages just to allow a THP
charge which might be used very sparsely?
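
A minimal sketch of that idea on top of this series' try_charge()
changes (the costly-order check here is only illustrative, it is not
from any posted patch):

        /*
         * For costly charges like THP, reclaim page cache only; do not
         * swap out a large amount of anon memory just to make room for
         * a single huge page that might end up being used sparsely.
         */
        if (nr_pages > SWAP_CLUSTER_MAX)
                may_swap = false;

        nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
                                                    gfp_mask, may_swap);
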
Post by Johannes Weiner
This brings memcg's THP policy in line with the system policy: if the
allocator painstakingly assembles a hugepage, memcg will at least make
an honest effort to charge it. As a result, transparent hugepage
vanilla patched
pgalloc 4717530.80 ( +0.00%) 4451376.40 ( -5.64%)
pgfault 491370.60 ( +0.00%) 225477.40 ( -54.11%)
pgmajfault 2.00 ( +0.00%) 1.80 ( -6.67%)
thp_fault_alloc 0.00 ( +0.00%) 531.60 (+100.00%)
thp_fault_fallback 749.00 ( +0.00%) 217.40 ( -70.88%)
What is the load and configuration that you have measured?
Post by Johannes Weiner
[ Note: this may in turn increase memory consumption from internal
fragmentation, which is an inherent risk of transparent hugepages.
Some setups may have to adjust the memcg limits accordingly to
accomodate this - or, if the machine is already packed to capacity,
disable the transparent huge page feature. ]
---
include/linux/swap.h | 6 +++--
mm/memcontrol.c | 69 +++++++++++++---------------------------------------
mm/vmscan.c | 7 +++---
3 files changed, 25 insertions(+), 57 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ea4f926e6b9b..37a585beef5c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -327,8 +327,10 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
-extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap);
+extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
+ unsigned long nr_pages,
+ gfp_t gfp_mask,
+ bool may_swap);
extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
struct zone *zone,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 89c920156c2a..c2c75262a209 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -478,14 +478,6 @@ enum res_type {
#define OOM_CONTROL (0)
/*
- * Reclaim flags for mem_cgroup_hierarchical_reclaim
- */
-#define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0
-#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
-#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1
-#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
-
-/*
* The memcg_create_mutex will be held whenever a new cgroup is created.
* As a consequence, any change that needs to protect against new child cgroups
* appearing has to hold it as well.
@@ -1791,40 +1783,6 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
NULL, "Memory cgroup out of memory");
}
-static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
- gfp_t gfp_mask,
- unsigned long flags)
-{
- unsigned long total = 0;
- bool noswap = false;
- int loop;
-
- if (flags & MEM_CGROUP_RECLAIM_NOSWAP)
- noswap = true;
-
- for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
- if (loop)
- drain_all_stock_async(memcg);
- total += try_to_free_mem_cgroup_pages(memcg, gfp_mask, noswap);
- /*
- * Allow limit shrinkers, which are triggered directly
- * by userspace, to catch signals and stop reclaim
- * after minimal progress, regardless of the margin.
- */
- if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK))
- break;
- if (mem_cgroup_margin(memcg))
- break;
- /*
- * If nothing was reclaimed after two attempts, there
- * may be no reclaimable pages in this hierarchy.
- */
- if (loop && !total)
- break;
- }
- return total;
-}
-
/**
* test_mem_cgroup_node_reclaimable
@@ -2527,8 +2485,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
struct mem_cgroup *mem_over_limit;
struct res_counter *fail_res;
unsigned long nr_reclaimed;
- unsigned long flags = 0;
unsigned long long size;
+ bool may_swap = true;
+ bool drained = false;
int ret = 0;
if (mem_cgroup_is_root(memcg))
mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
} else {
mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
- flags |= MEM_CGROUP_RECLAIM_NOSWAP;
+ may_swap = false;
}
if (batch > nr_pages) {
if (!(gfp_mask & __GFP_WAIT))
goto nomem;
- nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
+ nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
+ gfp_mask, may_swap);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
goto retry;
+ if (!drained) {
+ drain_all_stock_async(mem_over_limit);
+ drained = true;
+ goto retry;
+ }
+
if (gfp_mask & __GFP_NORETRY)
goto nomem;
/*
@@ -3652,8 +3618,8 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
if (!ret)
break;
- mem_cgroup_reclaim(memcg, GFP_KERNEL,
- MEM_CGROUP_RECLAIM_SHRINK);
+ try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true);
+
curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -3703,9 +3669,8 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
if (!ret)
break;
- mem_cgroup_reclaim(memcg, GFP_KERNEL,
- MEM_CGROUP_RECLAIM_NOSWAP |
- MEM_CGROUP_RECLAIM_SHRINK);
+ try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false);
+
curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -3954,8 +3919,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
if (signal_pending(current))
return -EINTR;
- progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL,
- false);
+ progress = try_to_free_mem_cgroup_pages(memcg, 1,
+ GFP_KERNEL, true);
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 06123f20a326..dcb47074ae03 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2759,21 +2759,22 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
}
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
+ unsigned long nr_pages,
gfp_t gfp_mask,
- bool noswap)
+ bool may_swap)
{
struct zonelist *zonelist;
unsigned long nr_reclaimed;
int nid;
struct scan_control sc = {
- .nr_to_reclaim = SWAP_CLUSTER_MAX,
+ .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
.target_mem_cgroup = memcg,
.priority = DEF_PRIORITY,
.may_writepage = !laptop_mode,
.may_unmap = 1,
- .may_swap = !noswap,
+ .may_swap = may_swap,
};
/*
--
2.1.0
--
Michal Hocko
SUSE Labs

Johannes Weiner
2014-09-29 17:57:00 UTC
Hi Michal,
Post by Michal Hocko
Post by Johannes Weiner
In a memcg with even just moderate cache pressure, success rates for
transparent huge page allocations drop to zero, wasting a lot of
effort that the allocator puts into assembling these pages.
The reason for this is that the memcg reclaim code was never designed
for higher-order charges. It reclaims in small batches until there is
room for at least one page. Huge page charges only succeed when these
batches add up over a series of huge faults, which is unlikely under
any significant load involving order-0 allocations in the group.
Remove that loop on the memcg side in favor of passing the actual
reclaim goal to direct reclaim, which is already set up and optimized
to meet higher-order goals efficiently.
I had concerns the last time you were posting similar patch
(http://marc.info/?l=linux-mm&m=140803277013080&w=2) but I do not see
any of them neither mentioned nor addressed here.
I actually made several attempts to address your concerns, but it's
Post by Michal Hocko
Especially unexpected long stalls and excessive swapout. 512 pages
target for direct reclaim is too much. Especially for smaller memcgs
where we would need to drop the priority considerably to even scan
that many pages. THP charges close to the limit are definitely a
problem but this is way too risky to fix this problem IMO.
Every change we make is a trade-off and bears a certain risk. THP is
a trade-off, it's pretty pointless to ignore the upsides and ride
around on the downsides. Of course there are downsides. This patch
makes THP work properly inside memcg, which invites both the upsides
as well as the downsides of THP into memcg. But they are well known
and we can deal with them. Why is THP inside memcg special?

Sure, the smaller the memcg, the bigger the THP fault and scan target
in comparison. We don't have control over the THP size, but the user
can always increase the size of the memcg, iff THP leads to increased
fragmentation and memory consumption for the given workload.

[ Until he can't, at which point he has to decide whether the cost of
THP outweighs the benefits on a system-wide level. For now, we could
certainly consider making a THP knob available per memcg; I'm just
extrapolating from the fact that we don't have a per-process knob to
the conclusion that we are unlikely to need one per memcg. ]
Post by Michal Hocko
Maybe a better approach
would be to limit the reclaim to (clean) page cache (aka
MEM_CGROUP_RECLAIM_NOSWAP). The risk of long stalls should be much
lower and excessive swapouts shouldn't happen at all. What is the point
to swap out a large number of pages just to allow THP charge which can
be used very sparsely?
THP can lead to thrashing, that's not news. Preventing THP faults
from swapping is a reasonable proposal, but again has nothing to do
with memcg.

As for this patch, we don't have sufficient data on existing
configurations to know if this will lead to noticeable regressions. It
might, but everything we do might cause a regression, that's just our
reality. That alone can't be grounds for rejecting a patch. However,
in this particular case a regression is trivial to pinpoint (comparing
vmstat, profiles), and trivial to rectify in the field by changing the
memcg limits or disabling THP.

What we DO know is that there are very good use cases for THP, but THP
inside memcg is broken: THP does worse inside a memcg than in a bare
metal environment of the same size, both in success rate and in fault
latency due to wasted page allocator work.
Plus, the code is illogical, redundant, and full of magic numbers.

Based on this, this patch seems like a net improvement.
Post by Michal Hocko
Post by Johannes Weiner
This brings memcg's THP policy in line with the system policy: if the
allocator painstakingly assembles a hugepage, memcg will at least make
an honest effort to charge it. As a result, transparent hugepage
vanilla patched
pgalloc 4717530.80 ( +0.00%) 4451376.40 ( -5.64%)
pgfault 491370.60 ( +0.00%) 225477.40 ( -54.11%)
pgmajfault 2.00 ( +0.00%) 1.80 ( -6.67%)
thp_fault_alloc 0.00 ( +0.00%) 531.60 (+100.00%)
thp_fault_fallback 749.00 ( +0.00%) 217.40 ( -70.88%)
What is the load and configuration that you have measured?
It's just a single linear disk writer and another thread that faults
in an anonymous range in 4k steps.
Post by Michal Hocko
Post by Johannes Weiner
[ Note: this may in turn increase memory consumption from internal
fragmentation, which is an inherent risk of transparent hugepages.
Some setups may have to adjust the memcg limits accordingly to
accomodate this - or, if the machine is already packed to capacity,
disable the transparent huge page feature. ]
Michal Hocko
2014-10-07 13:59:50 UTC
Post by Johannes Weiner
Hi Michal,
Post by Michal Hocko
Post by Johannes Weiner
In a memcg with even just moderate cache pressure, success rates for
transparent huge page allocations drop to zero, wasting a lot of
effort that the allocator puts into assembling these pages.
The reason for this is that the memcg reclaim code was never designed
for higher-order charges. It reclaims in small batches until there is
room for at least one page. Huge page charges only succeed when these
batches add up over a series of huge faults, which is unlikely under
any significant load involving order-0 allocations in the group.
Remove that loop on the memcg side in favor of passing the actual
reclaim goal to direct reclaim, which is already set up and optimized
to meet higher-order goals efficiently.
I had concerns the last time you were posting similar patch
(http://marc.info/?l=linux-mm&m=140803277013080&w=2) but I do not see
any of them neither mentioned nor addressed here.
I actually made several attempts to address your concerns, but it's
Post by Michal Hocko
Especially unexpected long stalls and excessive swapout. 512 pages
target for direct reclaim is too much. Especially for smaller memcgs
where we would need to drop the priority considerably to even scan
that many pages. THP charges close to the limit are definitely a
problem but this is way too risky to fix this problem IMO.
Every change we make is a trade-off and bears a certain risk. THP is
a trade-off, it's pretty pointless to ignore the upsides and ride
around on the downsides. Of course there are downsides. This patch
makes THP work properly inside memcg, which invites both the upsides
as well as the downsides of THP into memcg. But they are well known
and we can deal with them.
I do not see any evaluation or discussion of the upsides and downsides
in the changelog. You are selling this as a net win, which I cannot
agree with. I am completely missing any notes about potential excessive
swapouts or longer reclaim stalls, which are a natural side effect of
direct reclaim with a larger target (or is this something we do not
agree on?). What is an admin/user supposed to do when one of the above
happens? Disable THP globally?

I still remember when THP was introduced and we saw a boatload of
reclaim-related bugs - exactly these long stalls, excessive swapouts
and excessive reclaim.
Post by Johannes Weiner
Why is THP inside memcg special?
For one thing, the global case hits its limit (watermarks) much more
slowly and gracefully because it has kswapd working in the background
before we get into trouble. A memcg will just hit the wall and rely
solely on direct reclaim, so everything we do ends up latency
sensitive.

Moreover, THP allocations have self-regulatory mechanisms to prevent
excessive stalls, which makes THP allocations less probable under heavy
memory pressure. On the other hand, a memcg might be under serious
memory pressure when a THP charge comes in. The only back-off
mechanism we use in memcg is GFP_NORETRY, and that kicks in after one
round of reclaim, so we should make sure that the first round of
reclaim doesn't take terribly long.
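
For context, this is the relevant slice of try_charge() as posted in
this series (condensed, labels and OOM handling trimmed); the
__GFP_NORETRY check is the only back-off after the reclaim rounds:

        if (!(gfp_mask & __GFP_WAIT))
                goto nomem;

        nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
                                                    gfp_mask, may_swap);

        if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
                goto retry;

        if (!drained) {
                drain_all_stock_async(mem_over_limit);
                drained = true;
                goto retry;
        }

        if (gfp_mask & __GFP_NORETRY)
                goto nomem;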

Another part that matters is the size. Memcgs might be really small,
and that changes the math: a large reclaim target will drop to
low-priority reclaim and thus lead to excessive reclaim. The small
size also makes any potential problem much more probable, because the
limit is hit much more often than extremely low memory conditions occur
globally.

Also, the reclaim decisions are subtly different for memcg because
per-memcg dirty throttling and flushing are missing. So we can stall
on pages under writeback or get stuck in the writeout path, which is
not the case for direct reclaim during THP allocation. A large reclaim
target is more likely to run into dirty or writeback pages.
Post by Johannes Weiner
Sure, the smaller the memcg, the bigger the THP fault and scan target
in comparison. We don't have control over the THP size, but the user
can always increase the size of the memcg, iff THP leads to increased
fragmentation and memory consumption for the given workload.
[ Until he can't, at which point he has to decide whether the cost of
THP outweighs the benefits on a system-wide level. For now - we
could certainly consider making a THP-knob available per memcg, I'm
just extrapolating from the fact that we don't have a per-process
knob that it's unlikely that we need one per memcg. ]
There actually were proposals for per-process THP configuration
already. I haven't tracked them since, so I don't know the current
status. A per-process knob sounds like a better fit than a memcg knob
because the requirement basically depends on the usage pattern, which
might differ among processes living in the same memcg.
Post by Johannes Weiner
Post by Michal Hocko
Maybe a better approach
would be to limit the reclaim to (clean) page cache (aka
MEM_CGROUP_RECLAIM_NOSWAP). The risk of long stalls should be much
lower and excessive swapouts shouldn't happen at all. What is the point
to swap out a large number of pages just to allow THP charge which can
be used very sparsely?
THP can lead to thrashing, that's not news.
It shouldn't because the optimization doesn't make much sense
otherwise. Any thrashing is simply a bug.
Post by Johannes Weiner
Preventing THP faults from swapping is a reasonable proposal, but
again has nothing to do with memcg.
If we can do this inside the direct reclaim path then I am all for it
because this means less trickery in the memcg code.

I am still not sure this is sufficient because memcg might still stall
on IO, so the safest approach would be ~GFP_IO reclaim for the memcg
reclaim path.

I feel strongly about the first one (.may_swap = 0) and would be OK
with your patch if this is added (to the memcg or common path).
GFP_IO is an extra safety step. Smaller groups would be more likely to
fail to reclaim enough, so the THP success rate would be lower, but
that doesn't sound terribly wrong to me. I am not insisting on it,
though.
Post by Johannes Weiner
As for this patch, we don't have sufficient data on existing
configurations to know if this will lead to noticable regressions. It
might, but everything we do might cause a regression, that's just our
reality. That alone can't be grounds for rejecting a patch.
That alone certainly cannot, but then we have to evaluate the risk and
consider other possible approaches with a smaller risk.
Post by Johannes Weiner
However, in this particular case a regression is trivial to pinpoint
(comparing vmstat, profiles), and trivial to rectify in the field by
changing the memcg limits or disabling THP.
What we DO know is that there are very good use cases for THP, but THP
All those usecases rely on amortizing THP's initial costs through
fewer faults (assuming the memory range is not used too sparsely) and
through reduced TLB pressure. Once we start hitting swap or excessive
reclaim, all bets are off and THP is no longer beneficial.
Post by Johannes Weiner
THP does worse inside a memcg when compared to
bare metal environments of the same size, both in terms of success
rate, as well as in fault latency due to wasted page allocator work.
Because a memcg is not equivalent to bare metal with the same amount
of memory - if for nothing else, then because background reclaim is
missing.
Post by Johannes Weiner
Plus, the code is illogical, redundant, and full of magic numbers.
I am not objecting to the removal of the magic numbers or to getting
rid of the retry loops outside of the direct reclaim path (aka
mem_cgroup_reclaim). I would be willing to take a risk and get rid of
them just to make the code saner, because they were never justified
properly and look more or less random. This would be a separate patch,
of course.
Post by Johannes Weiner
Based on this, this patch seems like a net improvement.
Sigh, yes, if we ignore all the downsides everything will look like a
net improvement :/
Post by Johannes Weiner
Post by Michal Hocko
Post by Johannes Weiner
This brings memcg's THP policy in line with the system policy: if the
allocator painstakingly assembles a hugepage, memcg will at least make
an honest effort to charge it. As a result, transparent hugepage
vanilla patched
pgalloc 4717530.80 ( +0.00%) 4451376.40 ( -5.64%)
pgfault 491370.60 ( +0.00%) 225477.40 ( -54.11%)
pgmajfault 2.00 ( +0.00%) 1.80 ( -6.67%)
thp_fault_alloc 0.00 ( +0.00%) 531.60 (+100.00%)
thp_fault_fallback 749.00 ( +0.00%) 217.40 ( -70.88%)
What is the load and configuration that you have measured?
It's just a single linear disk writer and another thread that faults
in an anonymous range in 4k steps.
This is a really vague description...
Which portion of the limit is the anon consumer, what is the memcg
limit size, the IO size, etc.? I find it really interesting that _all_
THP charges failed, so the memcg must have been almost fully populated
by page cache already when the thread tried to fault in the first huge
page.

Also, 4k steps are basically the best case for THP because the full
THP block gets populated. The question is how the system behaves when
THP ranges are populated sparsely (which is often the case).

Have you checked any anon mostly load?

Finally, what are the (average, highest) page fault latencies, and how
much memory was swapped out?

I would expect this kind of information from testing a patch like this.
--
Michal Hocko
SUSE Labs

Johannes Weiner
2014-10-08 01:11:06 UTC
Post by Michal Hocko
Post by Johannes Weiner
Every change we make is a trade-off and bears a certain risk. THP is
a trade-off, it's pretty pointless to ignore the upsides and ride
around on the downsides. Of course there are downsides. This patch
makes THP work properly inside memcg, which invites both the upsides
as well as the downsides of THP into memcg. But they are well known
and we can deal with them.
I do not see any evaluation nor discussion of the upsides and downsides
in the changelog. You are selling this as a net win which I cannot
agree with.
I'm not sure why you want me to regurgitate the pros and cons of
transparent huge pages here. They have been a well-established
default feature for a while now, and they are currently not working
properly inside memcg, which this patch addresses.

The only valid argument against merging this patch at this point can
be that THP inside memcg will lead to distinct issues that do not
exist on the global level. So let's find out if there are any, okay?
Post by Michal Hocko
I am completely missing any notes about potential excessive
swapouts or longer reclaim stalls which are a natural side effect of direct
reclaim with a larger target (or is this something we do not agree on?).
Yes, we disagree here. Why is reclaiming 2MB once worse than entering
reclaim 16 times to reclaim SWAP_CLUSTER_MAX each (16 x 32 pages = 512
pages, i.e. one 2MB THP)? There is no inherent difference between
reclaiming one big chunk and reclaiming many small chunks that add up
to the same size.

It's only different if you don't actually use the full 2MB pages, but
then the issue simply boils down to increased memory consumption. But
that is easy to deal with and I offered two solutions in my changelog.
Post by Michal Hocko
What is an admin/user supposed to do when one of the above happens?
Disable THP globally?
I already wrote all that. It would be easier if you read my entire
line of reasoning instead of attacking fragments in isolation, so that
we can make forward progress on this.
Post by Michal Hocko
I still remember when THP was introduced and we have seen boatload of
reclaim related bugs. These were exactly those long stalls, excessive
swapouts and reclaim.
THP certainly had a bumpy introduction, and I can understand that you
want to prevent the same happening to memcg.

But it's important to evaluate which THP costs actually translate to
memcg. The worst problems we had came *not* from faulting in bigger
steps, but from creating physically contiguous pages: aggressive lumpy
reclaim, (synchronous) migration, reclaim beyond the allocation size
to create breathing room for compaction etc. This is a massive amount
of work ON TOP of the bigger fault granularity.

Memcg only has to reclaim the allocation size in individual 4k pages.
Our only risk from THP is internal fragmentation from users not fully
utilizing the entire 2MB regions, but the work we MIGHT waste is
negligible compared to the work we are DEFINITELY wasting right now by
failing to charge already allocated THPs.
Post by Michal Hocko
Post by Johannes Weiner
Why is THP inside memcg special?
For one thing the global case is hitting its limit (watermarks) much
more slowly and gracefully because it has kswapd working on the
background before we are getting into troubles. Memcg will just hit the
wall and rely solely on the direct reclaim so everything we do will end
up latency sensitive.
THP allocations do not wake up kswapd, they go into direct reclaim.
It's likely that kswapd balancing triggered by concurrent order-0
allocations will help THP somewhat, but because the global level needs
contiguous pages, it will still likely enter direct reclaim and direct
compaction. For example, on my 16G desktop, there are 12MB between
the high and low watermark in the Normal zone; compaction needs double
the allocation size to work, so kswapd can cache reclaim work for up
to 3 THP in the best case (which those concurrent order-0 allocations
that woke kswapd in the first place will likely eat into), and direct
compaction will still have to run.

So AFAICS the synchroneous work required to fit a THP inside memcg is
much less. And again, under pressure all this global work is already
expensed at that point anyway.
Post by Michal Hocko
Moreover, THP allocations have self regulatory mechanisms to prevent
from excessive stalls. This means that THP allocations are less probable
under heavy memory pressure.
These mechanisms exist for migration/compaction, but direct reclaim is
still fairly persistent - again, see should_continue_reclaim().

Am I missing something?
Post by Michal Hocko
On the other hand, memcg might be under serious memory pressure when
THP charge comes. The only back off mechanism we use in memcg is
GFP_NORETRY and that happens after one round of the reclaim. So we
should make sure that the first round of the reclaim doesn't take
terribly long.
The same applies globally. ANY allocation under serious memory
pressure will have a high latency, but nobody forces you to use THP in
an already underprovisioned environment.
Post by Michal Hocko
Another part that matters is the size. Memcgs might be really small and
that changes the math. Large reclaim target will get to low prio reclaim
and thus the excessive reclaim.
I already addressed page size vs. memcg size before.

However, low priority reclaim does not result in excessive reclaim.
The reclaim goal is checked every time it scanned SWAP_CLUSTER_MAX
pages, and it exits if the goal has been met. See shrink_lruvec(),
shrink_zone() etc.
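
For reference, that bail-out structure looks roughly like this (a
condensed paraphrase of the shrink_lruvec() loop in this era's
mm/vmscan.c, not the verbatim source):

	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
	       nr[LRU_INACTIVE_FILE]) {
		for_each_evictable_lru(lru) {
			if (!nr[lru])
				continue;
			nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
			nr[lru] -= nr_to_scan;
			nr_reclaimed += shrink_list(lru, nr_to_scan,
						    lruvec, sc);
		}
		/* goal not met yet, or already adjusted once: keep scanning */
		if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
			continue;
		/* goal met: trim the remaining scan targets (see below) */
		...
	}
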
Post by Michal Hocko
The size also makes any potential problem much more probable because the
limit would be hit much more often than extremely low memory conditions
globally.
Also the reclaim decisions are subtly different for memcg because of the
missing per-memcg dirty throttling and flushing. So we can stall on
pages under writeback or get stuck in the writeout path, which is not
the case for direct reclaim during THP allocation. A large reclaim
target is more likely to run into dirty or writeback pages.
These things again boil down to potential internal fragmentation and
higher memory consumption, as 16 128k reclaims are just as likely to
hit both problems as one 2MB reclaim.
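(For scale: SWAP_CLUSTER_MAX is 32 pages, i.e. 128k with 4k pages, so
16 such reclaim rounds add up to exactly one 2MB THP.)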
Post by Michal Hocko
Post by Johannes Weiner
Preventing THP faults from swapping is a reasonable proposal, but
again has nothing to do with memcg.
If we can do this inside the direct reclaim path then I am all for it
because this means less trickery in the memcg code.
I am still not sure this is sufficient because memcg still might stall
on IO, so the safest approach would be ~GFP_IO reclaim for the memcg
reclaim path.
I feel strongly about the first one (.may_swap = 0) and would be OK with
your patch if this is added (to the memcg or common path).
GFP_IO is an extra safety step. Smaller groups would be more likely to
fail to reclaim enough and so the THP success rate will be lower, but
that doesn't sound terribly wrong to me. I am not insisting on it, though.
Would you like to propose the no-swapping patch for the generic
reclaim code? I'm certainly not against it, but I think the reason
nobody has proposed this yet is that the VM is heavily tuned to prefer
cache reclaim anyway and it's rare that environments run out of cache
and actually swap. It usually means that memory is underprovisioned.

So I wouldn't be opposed to it as a fail-safe, in case worst comes to
worst, but I think it's a lot less important than you do.
Post by Michal Hocko
Post by Johannes Weiner
However, in this particular case a regression is trivial to pinpoint
(comparing vmstat, profiles), and trivial to rectify in the field by
changing the memcg limits or disabling THP.
What we DO know is that there are very good use cases for THP, but THP
All those usecases rely on amortizing the initial THP costs through
fewer faults (assuming the memory range is not used too sparsely) and
through reduced TLB pressure. Once we are hitting swap or excessive
reclaim, all bets are off and THP is no longer beneficial.
Yes, we agree on this, just disagree on the importance of that case.
And both problem and solution would be unrelated to this patch.
Post by Michal Hocko
Post by Johannes Weiner
THP does worse inside a memcg when compared to
bare metal environments of the same size, both in terms of success
rate, as well as in fault latency due to wasted page allocator work.
Because memcg is not equivalent to bare metal with the same amount
of memory. If for nothing else, then because background reclaim is
missing.
Which THP explicitly does not use globally.
Post by Michal Hocko
Post by Johannes Weiner
Plus, the code is illogical, redundant, and full of magic numbers.
I am not objecting to the removal of magic numbers and to getting rid of
retry loops outside of direct reclaim path (aka mem_cgroup_reclaim). I
would be willing to take a risk and get rid of them just to make the
code saner. Because those were never justified properly and look more or
less random. This would be a separate patch of course.
Post by Johannes Weiner
Based on this, this patch seems like a net improvement.
Sigh, yes, if we ignore all the downsides everything will look like a
net improvement :/
I don't think you honestly read my email.
Post by Michal Hocko
Post by Johannes Weiner
Post by Michal Hocko
Post by Johannes Weiner
This brings memcg's THP policy in line with the system policy: if the
allocator painstakingly assembles a hugepage, memcg will at least make
an honest effort to charge it. As a result, transparent hugepage
vanilla patched
pgalloc 4717530.80 ( +0.00%) 4451376.40 ( -5.64%)
pgfault 491370.60 ( +0.00%) 225477.40 ( -54.11%)
pgmajfault 2.00 ( +0.00%) 1.80 ( -6.67%)
thp_fault_alloc 0.00 ( +0.00%) 531.60 (+100.00%)
thp_fault_fallback 749.00 ( +0.00%) 217.40 ( -70.88%)
What is the load and configuration that you have measured?
It's just a single linear disk writer and another thread that faults
in an anonymous range in 4k steps.
This is a really vague description...
Which portion of the limit is the anon consumer, what is the memcg limit
size, IO size, etc...? I find it really interesting that _all_ THP
charges failed, so the memcg had to be almost fully populated by the page
cache already when the thread tries to fault in the first huge page.
Also 4k steps is basically the best case for THP because the full THP
block is populated. The question is how the system behaves when THP
ranges are populated sparsely (because this is often the case).
You are missing the point :(

Sure there are cases that don't benefit from THP, this test just shows
that THP inside memcg can be trivially broken - which harms cases that
WOULD benefit.

Michal Hocko
2014-10-08 15:33:29 UTC
Permalink
[I do not have time to get over all points here and will be offline
until Monday - will get back to the rest then]
[...]
Post by Johannes Weiner
Post by Michal Hocko
I am completely missing any notes about potential excessive
swapouts or longer reclaim stalls which are a natural side effect of direct
reclaim with a larger target (or is this something we do not agree on?).
Yes, we disagree here. Why is reclaiming 2MB once worse than entering
reclaim 16 times to reclaim SWAP_CLUSTER_MAX?
You can enter DEF_PRIORITY reclaim 16 times and reclaim your target but
you need at least 512<<DEF_PRIORITY pages on your LRUs to do it in a
single run on that priority. So especially small groups will pay more
and would be subject to mentioned problems (e.g. over-reclaim).
Post by Johannes Weiner
There is no inherent difference in reclaiming a big chunk and
reclaiming many small chunks that add up to the same size.
[...]
Post by Johannes Weiner
Post by Michal Hocko
Another part that matters is the size. Memcgs might be really small and
that changes the math. Large reclaim target will get to low prio reclaim
and thus the excessive reclaim.
I already addressed page size vs. memcg size before.
However, low priority reclaim does not result in excessive reclaim.
The reclaim goal is checked every time it scanned SWAP_CLUSTER_MAX
pages, and it exits if the goal has been met. See shrink_lruvec(),
shrink_zone() etc.
Now I am confused. shrink_zone will bail out but shrink_lruvec will loop
over nr[...] until they are empty and only updates the numbers to be
roughly proportional once:

if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
continue;

/*
* For kswapd and memcg, reclaim at least the number of pages
* requested. Ensure that the anon and file LRUs are scanned
* proportionally what was requested by get_scan_count(). We
* stop reclaiming one LRU and reduce the amount scanning
* proportional to the original scan target.
*/
[...]
scan_adjusted = true;

Or do you rely on
/*
* It's just vindictive to attack the larger once the smaller
* has gone to zero. And given the way we stop scanning the
* smaller below, this makes sure that we only make one nudge
* towards proportionality once we've got nr_to_reclaim.
*/
if (!nr_file || !nr_anon)
break;

and SCAN_FILE because !inactive_file_is_low?

[...]
--
Michal Hocko
SUSE Labs

Johannes Weiner
2014-10-08 17:47:25 UTC
Permalink
Post by Michal Hocko
[I do not have time to get over all points here and will be offline
until Monday - will get back to the rest then]
[...]
Post by Johannes Weiner
Post by Michal Hocko
I am completely missing any notes about potential excessive
swapouts or longer reclaim stalls which are a natural side effect of direct
reclaim with a larger target (or is this something we do not agree on?).
Yes, we disagree here. Why is reclaiming 2MB once worse than entering
reclaim 16 times to reclaim SWAP_CLUSTER_MAX?
You can enter DEF_PRIORITY reclaim 16 times and reclaim your target but
you need at least 512<<DEF_PRIORITY pages on your LRUs to do it in a
single run on that priority. So especially small groups will pay more
and would be subject to mentioned problems (e.g. over-reclaim).
No, even low priority scans bail out shortly after nr_to_reclaim.
Post by Michal Hocko
Post by Johannes Weiner
There is no inherent difference in reclaiming a big chunk and
reclaiming many small chunks that add up to the same size.
[...]
Post by Johannes Weiner
Post by Michal Hocko
Another part that matters is the size. Memcgs might be really small and
that changes the math. Large reclaim target will get to low prio reclaim
and thus the excessive reclaim.
I already addressed page size vs. memcg size before.
However, low priority reclaim does not result in excessive reclaim.
The reclaim goal is checked every time it scanned SWAP_CLUSTER_MAX
pages, and it exits if the goal has been met. See shrink_lruvec(),
shrink_zone() etc.
Now I am confused. shrink_zone will bail out but shrink_lruvec will loop
over nr[...] until they are empty and only updates the numbers to be
roughly proportional once:
if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
continue;
/*
* For kswapd and memcg, reclaim at least the number of pages
* requested. Ensure that the anon and file LRUs are scanned
* proportionally what was requested by get_scan_count(). We
* stop reclaiming one LRU and reduce the amount scanning
* proportional to the original scan target.
*/
[...]
scan_adjusted = true;
Or do you rely on
/*
* It's just vindictive to attack the larger once the smaller
* has gone to zero. And given the way we stop scanning the
* smaller below, this makes sure that we only make one nudge
* towards proportionality once we've got nr_to_reclaim.
*/
if (!nr_file || !nr_anon)
break;
and SCAN_FILE because !inactive_file_is_low?
That function is indeed quite confusing.

Once nr_to_reclaim has been met, it looks at both LRUs and decides
which one has the smaller scan target left, sets it to 0, and then
scales the bigger target in proportion - that means if it scanned 10%
of nr[file], it sets nr[anon] to 10% of its original size, minus the
anon pages it already scanned. Based on that alone, overscanning is
limited to twice the requested size, i.e. 4MB for a 2MB THP page,
regardless of how low the priority drops.

In practice, the VM is heavily biased to avoid swapping. The forceful
SCAN_FILE you point out is one condition that avoids the proportional
scan most of the time. But even the proportional scan is heavily
biased towards cache - every cache insertion decreases the file
recent_rotated/recent_scanned ratio, whereas anon faults do not.

On top of that, anon pages start out on the active list, whereas cache
starts on the inactive, which means that the majority of the anon scan
target - should there be one - usually translates to deactivation.

So in most cases, I'd expect the scanner to bail after nr_to_reclaim
cache pages, and in low cache situations it might scan up to 2MB worth
of anon pages, a small amount of which it might swap.

I don't particularly like the decisions the current code makes, but it
should work. We have put in a lot of effort to prevent overreclaim in
the last few years, and a big part of this was decoupling the priority
level from the actual reclaim results. Nowadays, the priority level
should merely dictate the scan window and have no impact on the number
of pages actually reclaimed. I don't expect that it will, but if it
does, that's a reclaim bug that needs to be addressed. If we ask for
N pages, it should never reclaim significantly more than that,
regardless of how aggressively it has to scan to accomplish that.
Johannes Weiner
2014-10-11 23:27:59 UTC
Permalink
Post by Johannes Weiner
Post by Michal Hocko
Post by Johannes Weiner
Post by Michal Hocko
Another part that matters is the size. Memcgs might be really small and
that changes the math. Large reclaim target will get to low prio reclaim
and thus the excessive reclaim.
I already addressed page size vs. memcg size before.
However, low priority reclaim does not result in excessive reclaim.
The reclaim goal is checked every time it scanned SWAP_CLUSTER_MAX
pages, and it exits if the goal has been met. See shrink_lruvec(),
shrink_zone() etc.
Now I am confused. shrink_zone will bail out but shrink_lruvec will loop
over nr[...] until they are empty and only updates the numbers to be
roughly proportional once:
if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
continue;
/*
* For kswapd and memcg, reclaim at least the number of pages
* requested. Ensure that the anon and file LRUs are scanned
* proportionally what was requested by get_scan_count(). We
* stop reclaiming one LRU and reduce the amount scanning
* proportional to the original scan target.
*/
[...]
scan_adjusted = true;
Or do you rely on
/*
* It's just vindictive to attack the larger once the smaller
* has gone to zero. And given the way we stop scanning the
* smaller below, this makes sure that we only make one nudge
* towards proportionality once we've got nr_to_reclaim.
*/
if (!nr_file || !nr_anon)
break;
and SCAN_FILE because !inactive_file_is_low?
That function is indeed quite confusing.
Once nr_to_reclaim has been met, it looks at both LRUs and decides
which one has the smaller scan target left, sets it to 0, and then
scales the bigger target in proportion - that means if it scanned 10%
of nr[file], it sets nr[anon] to 10% of its original size, minus the
anon pages it already scanned. Based on that alone, overscanning is
limited to twice the requested size, i.e. 4MB for a 2MB THP page,
regardless of how low the priority drops.
Sorry, this conclusion is incorrect. The proportionality code can
indeed lead to more overreclaim than that, although I think this is
actually not intended: the comment says "this makes sure we only make
one nudge towards proportionality once we've got nr_to_reclaim," but
once scan_adjusted is set we never actually check anymore. We can end
up making a lot more nudges toward proportionality.
Post by Johannes Weiner
In practice, the VM is heavily biased to avoid swapping. The forceful
SCAN_FILE you point out is one condition that avoids the proportional
scan most of the time. But even the proportional scan is heavily
biased towards cache - every cache insertion decreases the file
recent_rotated/recent_scanned ratio, whereas anon faults do not.
On top of that, anon pages start out on the active list, whereas cache
starts on the inactive, which means that the majority of the anon scan
target - should there be one - usually translates to deactivation.
So in most cases, I'd expect the scanner to bail after nr_to_reclaim
cache pages, and in low cache situations it might scan up to 2MB worth
of anon pages, a small amount of which it might swap.
I don't particularly like the decisions the current code makes, but it
should work. We have put in a lot of effort to prevent overreclaim in
the last few years, and a big part of this was decoupling the priority
level from the actual reclaim results. Nowadays, the priority level
should merely dictate the scan window and have no impact on the number
of pages actually reclaimed. I don't expect that it will, but if it
does, that's a reclaim bug that needs to be addressed. If we ask for
N pages, it should never reclaim significantly more than that,
regardless of how aggressively it has to scan to accomplish that.
That being said, I don't get the rationale behind the proportionality
code in shrink_lruvec(). The patch that introduced it - e82e0561dae9
("mm: vmscan: obey proportional scanning requirements for kswapd") -
mentions respecting swappiness, but as per above we ignore swappiness
anyway until we run low on cache and into actual pressure. And under
pressure, once we struggle to reclaim nr_to_reclaim, proportionality
enforces itself when one LRU type target hits zero and we continue to
scan the one for which more pressure was allocated. But as long as we
scan both lists at the same SWAP_CLUSTER_MAX rate and have no problem
getting nr_to_reclaim pages with left-over todo for *both* LRU types,
what's the point of going on?

The justification for enforcing proportionality in direct reclaim is
particularly puzzling:

---

commit 1a501907bbea8e6ebb0b16cf6db9e9cbf1d2c813
Author: Mel Gorman <***@suse.de>
Date: Wed Jun 4 16:10:49 2014 -0700

mm: vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY

[...]

3.15.0-rc5 3.15.0-rc5
shrinker proportion
Unit lru-file-readonce elapsed 5.3500 ( 0.00%) 5.4200 ( -1.31%)
Unit lru-file-readonce time_range 0.2700 ( 0.00%) 0.1400 ( 48.15%)
Unit lru-file-readonce time_stddv 0.1148 ( 0.00%) 0.0536 ( 53.33%)
Unit lru-file-readtwice elapsed 8.1700 ( 0.00%) 8.1700 ( 0.00%)
Unit lru-file-readtwice time_range 0.4300 ( 0.00%) 0.2300 ( 46.51%)
Unit lru-file-readtwice time_stddv 0.1650 ( 0.00%) 0.0971 ( 41.16%)

The test cases are running multiple dd instances reading sparse files. The results are within
the noise for the small test machine. The impact of the patch is more noticable from the vmstats

3.15.0-rc5 3.15.0-rc5
shrinker proportion
Minor Faults 35154 36784
Major Faults 611 1305
Swap Ins 394 1651
Swap Outs 4394 5891
Allocation stalls 118616 44781
Direct pages scanned 4935171 4602313
Kswapd pages scanned 15921292 16258483
Kswapd pages reclaimed 15913301 16248305
Direct pages reclaimed 4933368 4601133
Kswapd efficiency 99% 99%
Kswapd velocity 670088.047 682555.961
Direct efficiency 99% 99%
Direct velocity 207709.217 193212.133
Percentage direct scans 23% 22%
Page writes by reclaim 4858.000 6232.000
Page writes file 464 341
Page writes anon 4394 5891

Note that there are fewer allocation stalls even though the amount
of direct reclaim scanning is very approximately the same.

---

The timings show nothing useful, but the statistics strongly speak
*against* this patch. Sure, direct reclaim invocations are reduced,
but everything else worsened: minor faults increased, major faults
doubled(!), swapins quadrupled(!!), swapouts increased, pages scanned
increased, pages reclaimed increased, reclaim page writes increased.

If direct reclaim is invoked at that rate, kswapd is failing at its
job, and the solution shouldn't be to overscan in direct reclaim. On
the other hand, multi-threaded sparse readers are kind of expected to
overwhelm a single kswapd worker, so I'm not sure we should be tuning
allocation latency to such a workload in the first place.

Mel, do you maybe remember details that are not in the changelogs?
Because based on them alone, I think we should look at other ways to
ensure we scan with the right amount of vigor...

Mel Gorman
2014-10-17 09:37:02 UTC
Permalink
Post by Johannes Weiner
Post by Johannes Weiner
I don't particularly like the decisions the current code makes, but it
should work. We have put in a lot of effort to prevent overreclaim in
the last few years, and a big part of this was decoupling the priority
level from the actual reclaim results. Nowadays, the priority level
should merely dictate the scan window and have no impact on the number
of pages actually reclaimed. I don't expect that it will, but if it
does, that's a reclaim bug that needs to be addressed. If we ask for
N pages, it should never reclaim significantly more than that,
regardless of how aggressively it has to scan to accomplish that.
That being said, I don't get the rationale behind the proportionality
code in shrink_lruvec(). The patch that introduced it - e82e0561dae9
("mm: vmscan: obey proportional scanning requirements for kswapd") -
mentions respecting swappiness, but as per above we ignore swappiness
anyway until we run low on cache and into actual pressure.
Swappiness is ignored until active < inactive, which is not necessarily
indicative of pressure. Filesystems may mark pages active and move them
to the active list in the absence of pressure, for example.

The changelog does not mention it, but a big motivator for the patch was
this result https://lkml.org/lkml/2014/5/22/420 and the opening paragraph
mentions "The intent was to minimise direct reclaim latency but Yuanhan
Liu pointed out that it substitutes one long stall for many small stalls
and distorts aging for normal workloads like streaming readers/writers."
Post by Johannes Weiner
And under
pressure, once we struggle to reclaim nr_to_reclaim, proportionality
enforces itself when one LRU type target hits zero and we continue to
scan the one for which more pressure was allocated. But as long as we
scan both lists at the same SWAP_CLUSTER_MAX rate and have no problem
getting nr_to_reclaim pages with left-over todo for *both* LRU types,
what's the point of going on?
If we always scanned evenly then it would risk reintroducing the
long-lived problem where starting a large amount of IO in the background
pushed everything into swap.
Post by Johannes Weiner
The justification for enforcing proportionality in direct reclaim is
particularly puzzling:
---
commit 1a501907bbea8e6ebb0b16cf6db9e9cbf1d2c813
Date: Wed Jun 4 16:10:49 2014 -0700
mm: vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY
[...]
3.15.0-rc5 3.15.0-rc5
shrinker proportion
Unit lru-file-readonce elapsed 5.3500 ( 0.00%) 5.4200 ( -1.31%)
Unit lru-file-readonce time_range 0.2700 ( 0.00%) 0.1400 ( 48.15%)
Unit lru-file-readonce time_stddv 0.1148 ( 0.00%) 0.0536 ( 53.33%)
Unit lru-file-readtwice elapsed 8.1700 ( 0.00%) 8.1700 ( 0.00%)
Unit lru-file-readtwice time_range 0.4300 ( 0.00%) 0.2300 ( 46.51%)
Unit lru-file-readtwice time_stddv 0.1650 ( 0.00%) 0.0971 ( 41.16%)
The test cases are running multiple dd instances reading sparse files. The results are within
the noise for the small test machine. The impact of the patch is more noticable from the vmstats
3.15.0-rc5 3.15.0-rc5
shrinker proportion
Minor Faults 35154 36784
Major Faults 611 1305
Swap Ins 394 1651
Swap Outs 4394 5891
Allocation stalls 118616 44781
Direct pages scanned 4935171 4602313
Kswapd pages scanned 15921292 16258483
Kswapd pages reclaimed 15913301 16248305
Direct pages reclaimed 4933368 4601133
Kswapd efficiency 99% 99%
Kswapd velocity 670088.047 682555.961
Direct efficiency 99% 99%
Direct velocity 207709.217 193212.133
Percentage direct scans 23% 22%
Page writes by reclaim 4858.000 6232.000
Page writes file 464 341
Page writes anon 4394 5891
Note that there are fewer allocation stalls even though the amount
of direct reclaim scanning is very approximately the same.
---
The timings show nothing useful, but the statistics strongly speak
*against* this patch. Sure, direct reclaim invocations are reduced,
but everything else worsened: minor faults increased, major faults
doubled(!), swapins quadrupled(!!), swapouts increased, pages scanned
increased, pages reclaimed increased, reclaim page writes increased.
The reduction in direct reclaim invocations was in line with the intent
to prevent many small stalls.

The increase in minor faults is marginal in absolute terms and very
likely within the noise.

Major faults might have doubled but in absolute terms the increase is
about the size of a THP allocation, or roughly a 25ms stall (depending
on the disk obviously) overall in a long-lived test.

Differences in swap figures are also large in relative terms but very
small in absolute terms. Again, I suspected it was within the noise.

Same rationale for the others. The changes in reclaim stats in absolute
terms are small considering the type of test being executed. The reclaim
figure changes look terrifying but if you look at the sum of direct and
kswapd reclaimed pages you'll see that the difference is marginal and all
that changed was who did the work. Same for scanning. Total scanning and
reclaim work was approximately similar, all that changed in this test is
what process did the work.

What was more important in this case was Yuanhan Liu's report that the patch
addressed a major regression.
Post by Johannes Weiner
Mel, do you maybe remember details that are not in the changelogs?
The link to Yuanhan Liu's report was missing but that happened after the
changelog was written so that is hardly a surprise. Nothing else springs
to mind.
Post by Johannes Weiner
Because based on them alone, I think we should look at other ways to
ensure we scan with the right amount of vigor...
I'm not against it per se, but avoid over-reacting to changes in stats
that are this small in absolute terms. If you do change this area, I
strongly suggest you also test with the parallelio-memcachetest configs
from mmtests and watch for IO causing excessive swap.
--
Mel Gorman
SUSE Labs

Johannes Weiner
2014-09-24 15:08:57 UTC
Permalink
When attempting to charge pages, we first charge the memory counter
and then the memory+swap counter. If one of the counters is at its
limit, we enter reclaim, but if it's the memory+swap counter, reclaim
shouldn't swap because that wouldn't change the situation. However,
if the counters have the same limits, we never get to the memory+swap
limit. To know whether reclaim should swap or not, there is a state
flag that indicates whether the limits are equal and whether hitting
the memory limit implies hitting the memory+swap limit.

Just try the memory+swap counter first.
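
In other words, with swap accounting enabled the charge order becomes
(a simplified summary of the resulting control flow; the diff below is
authoritative):

	memsw charge fails          -> reclaim with MEM_CGROUP_RECLAIM_NOSWAP,
	                               since swapping cannot lower memory+swap usage
	memsw ok, res charge fails  -> roll back the memsw charge and reclaim
	                               normally, since swapping does lower memory usage
	both charges succeed        -> done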

Signed-off-by: Johannes Weiner <***@cmpxchg.org>
---
mm/memcontrol.c | 47 +++++++++++++----------------------------------
1 file changed, 13 insertions(+), 34 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ec22bf380d0..89c920156c2a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -315,9 +315,6 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;

- /* set when res.limit == memsw.limit */
- bool memsw_is_minimum;
-
/* protect arrays of thresholds */
struct mutex thresholds_lock;

@@ -1804,8 +1801,6 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,

if (flags & MEM_CGROUP_RECLAIM_NOSWAP)
noswap = true;
- if (!(flags & MEM_CGROUP_RECLAIM_SHRINK) && memcg->memsw_is_minimum)
- noswap = true;

for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
if (loop)
@@ -2543,16 +2538,17 @@ retry:
goto done;

size = batch * PAGE_SIZE;
- if (!res_counter_charge(&memcg->res, size, &fail_res)) {
- if (!do_swap_account)
+ if (!do_swap_account ||
+ !res_counter_charge(&memcg->memsw, size, &fail_res)) {
+ if (!res_counter_charge(&memcg->res, size, &fail_res))
goto done_restock;
- if (!res_counter_charge(&memcg->memsw, size, &fail_res))
- goto done_restock;
- res_counter_uncharge(&memcg->res, size);
+ if (do_swap_account)
+ res_counter_uncharge(&memcg->memsw, size);
+ mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
+ } else {
mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
flags |= MEM_CGROUP_RECLAIM_NOSWAP;
- } else
- mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
+ }

if (batch > nr_pages) {
batch = nr_pages;
@@ -3615,7 +3611,6 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
unsigned long long val)
{
int retry_count;
- u64 memswlimit, memlimit;
int ret = 0;
int children = mem_cgroup_count_children(memcg);
u64 curusage, oldusage;
@@ -3642,24 +3637,16 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
* We have to guarantee memcg->res.limit <= memcg->memsw.limit.
*/
mutex_lock(&set_limit_mutex);
- memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
- if (memswlimit < val) {
+ if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val) {
ret = -EINVAL;
mutex_unlock(&set_limit_mutex);
break;
}

- memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
- if (memlimit < val)
+ if (res_counter_read_u64(&memcg->res, RES_LIMIT) < val)
enlarge = 1;

ret = res_counter_set_limit(&memcg->res, val);
- if (!ret) {
- if (memswlimit == val)
- memcg->memsw_is_minimum = true;
- else
- memcg->memsw_is_minimum = false;
- }
mutex_unlock(&set_limit_mutex);

if (!ret)
@@ -3684,7 +3671,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
unsigned long long val)
{
int retry_count;
- u64 memlimit, memswlimit, oldusage, curusage;
+ u64 oldusage, curusage;
int children = mem_cgroup_count_children(memcg);
int ret = -EBUSY;
int enlarge = 0;
@@ -3703,22 +3690,14 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
* We have to guarantee memcg->res.limit <= memcg->memsw.limit.
*/
mutex_lock(&set_limit_mutex);
- memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
- if (memlimit > val) {
+ if (res_counter_read_u64(&memcg->res, RES_LIMIT) > val) {
ret = -EINVAL;
mutex_unlock(&set_limit_mutex);
break;
}
- memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
- if (memswlimit < val)
+ if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val)
enlarge = 1;
ret = res_counter_set_limit(&memcg->memsw, val);
- if (!ret) {
- if (memlimit == val)
- memcg->memsw_is_minimum = true;
- else
- memcg->memsw_is_minimum = false;
- }
mutex_unlock(&set_limit_mutex);

if (!ret)
--
2.1.0

Vladimir Davydov
2014-09-24 15:14:21 UTC
Permalink
Post by Johannes Weiner
When attempting to charge pages, we first charge the memory counter
and then the memory+swap counter. If one of the counters is at its
limit, we enter reclaim, but if it's the memory+swap counter, reclaim
shouldn't swap because that wouldn't change the situation. However,
if the counters have the same limits, we never get to the memory+swap
limit. To know whether reclaim should swap or not, there is a state
flag that indicates whether the limits are equal and whether hitting
the memory limit implies hitting the memory+swap limit.
Just try the memory+swap counter first.
Reviewed-by: Vladimir Davydov <***@parallels.com>

Michal Hocko
2014-09-25 15:27:54 UTC
Permalink
Post by Johannes Weiner
When attempting to charge pages, we first charge the memory counter
and then the memory+swap counter. If one of the counters is at its
limit, we enter reclaim, but if it's the memory+swap counter, reclaim
shouldn't swap because that wouldn't change the situation. However,
if the counters have the same limits, we never get to the memory+swap
limit. To know whether reclaim should swap or not, there is a state
flag that indicates whether the limits are equal and whether hitting
the memory limit implies hitting the memory+swap limit.
Just try the memory+swap counter first.
OK, this makes sense and makes the reclaim code a little bit more
readable. I would just add that the patch shouldn't have any visible
effects, because that is not apparent from the description.
Post by Johannes Weiner
---
mm/memcontrol.c | 47 +++++++++++++----------------------------------
1 file changed, 13 insertions(+), 34 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ec22bf380d0..89c920156c2a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -315,9 +315,6 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
- /* set when res.limit == memsw.limit */
- bool memsw_is_minimum;
-
/* protect arrays of thresholds */
struct mutex thresholds_lock;
@@ -1804,8 +1801,6 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
if (flags & MEM_CGROUP_RECLAIM_NOSWAP)
noswap = true;
- if (!(flags & MEM_CGROUP_RECLAIM_SHRINK) && memcg->memsw_is_minimum)
- noswap = true;
for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
if (loop)
goto done;
size = batch * PAGE_SIZE;
- if (!res_counter_charge(&memcg->res, size, &fail_res)) {
- if (!do_swap_account)
+ if (!do_swap_account ||
+ !res_counter_charge(&memcg->memsw, size, &fail_res)) {
+ if (!res_counter_charge(&memcg->res, size, &fail_res))
goto done_restock;
- if (!res_counter_charge(&memcg->memsw, size, &fail_res))
- goto done_restock;
- res_counter_uncharge(&memcg->res, size);
+ if (do_swap_account)
+ res_counter_uncharge(&memcg->memsw, size);
+ mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
+ } else {
mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
flags |= MEM_CGROUP_RECLAIM_NOSWAP;
- } else
- mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
+ }
if (batch > nr_pages) {
batch = nr_pages;
@@ -3615,7 +3611,6 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
unsigned long long val)
{
int retry_count;
- u64 memswlimit, memlimit;
int ret = 0;
int children = mem_cgroup_count_children(memcg);
u64 curusage, oldusage;
@@ -3642,24 +3637,16 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
* We have to guarantee memcg->res.limit <= memcg->memsw.limit.
*/
mutex_lock(&set_limit_mutex);
- memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
- if (memswlimit < val) {
+ if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val) {
ret = -EINVAL;
mutex_unlock(&set_limit_mutex);
break;
}
- memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
- if (memlimit < val)
+ if (res_counter_read_u64(&memcg->res, RES_LIMIT) < val)
enlarge = 1;
ret = res_counter_set_limit(&memcg->res, val);
- if (!ret) {
- if (memswlimit == val)
- memcg->memsw_is_minimum = true;
- else
- memcg->memsw_is_minimum = false;
- }
mutex_unlock(&set_limit_mutex);
if (!ret)
@@ -3684,7 +3671,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
unsigned long long val)
{
int retry_count;
- u64 memlimit, memswlimit, oldusage, curusage;
+ u64 oldusage, curusage;
int children = mem_cgroup_count_children(memcg);
int ret = -EBUSY;
int enlarge = 0;
@@ -3703,22 +3690,14 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
* We have to guarantee memcg->res.limit <= memcg->memsw.limit.
*/
mutex_lock(&set_limit_mutex);
- memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
- if (memlimit > val) {
+ if (res_counter_read_u64(&memcg->res, RES_LIMIT) > val) {
ret = -EINVAL;
mutex_unlock(&set_limit_mutex);
break;
}
- memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
- if (memswlimit < val)
+ if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val)
enlarge = 1;
ret = res_counter_set_limit(&memcg->memsw, val);
- if (!ret) {
- if (memlimit == val)
- memcg->memsw_is_minimum = true;
- else
- memcg->memsw_is_minimum = false;
- }
mutex_unlock(&set_limit_mutex);
if (!ret)
--
2.1.0
--
Michal Hocko
SUSE Labs