Discussion:
[patch 0/5] mm: memcontrol: eliminate charge reparenting v2
Johannes Weiner
2014-10-14 16:20:32 UTC
Permalink
Hi,

this series gets rid of charge reparenting at cgroup deletion, which
is possible now that the css can outlive the user-visible cgroup. Any
cache charges left after cgroup deletion simply remain with their css,
where they continue to get reclaimed during pressure on the parent.

Version 2:

- remove memcg->dead_count [vladimir]
- restore iterator generations [vladimir]
- restore memcg-initialized test [michal]
- document shared walk lockless magic [michal]
- split out sync stock draining removal [michal]

include/linux/cgroup.h | 26 ++
include/linux/page_counter.h | 4 +-
include/linux/percpu-refcount.h | 47 ++-
mm/memcontrol.c | 593 ++++++--------------------------------
mm/page_counter.c | 23 +-
5 files changed, 163 insertions(+), 530 deletions(-)
Johannes Weiner
2014-10-14 16:20:34 UTC
Permalink
Charges currently pin the css indirectly by playing tricks during
css_offline(): user pages stall the offlining process until all of
them have been reparented, whereas kmemcg acquires a keep-alive
reference if outstanding kernel pages are detected at that point.

In preparation for removing all this complexity, make the pinning
explicit and acquire a css references for every charged page.

Signed-off-by: Johannes Weiner <***@cmpxchg.org>
Reviewed-by: Vladimir Davydov <***@parallels.com>
---
include/linux/cgroup.h | 26 +++++++++++++++++++++++
include/linux/percpu-refcount.h | 47 +++++++++++++++++++++++++++++++++--------
mm/memcontrol.c | 21 ++++++++++++++----
3 files changed, 81 insertions(+), 13 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 1d5196889048..9f96b25965c2 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -113,6 +113,19 @@ static inline void css_get(struct cgroup_subsys_state *css)
}

/**
+ * css_get_many - obtain references on the specified css
+ * @css: target css
+ * @n: number of references to get
+ *
+ * The caller must already have a reference.
+ */
+static inline void css_get_many(struct cgroup_subsys_state *css, unsigned int n)
+{
+ if (!(css->flags & CSS_NO_REF))
+ percpu_ref_get_many(&css->refcnt, n);
+}
+
+/**
* css_tryget - try to obtain a reference on the specified css
* @css: target css
*
@@ -159,6 +172,19 @@ static inline void css_put(struct cgroup_subsys_state *css)
percpu_ref_put(&css->refcnt);
}

+/**
+ * css_put_many - put css references
+ * @css: target css
+ * @n: number of references to put
+ *
+ * Put references obtained via css_get() and css_tryget_online().
+ */
+static inline void css_put_many(struct cgroup_subsys_state *css, unsigned int n)
+{
+ if (!(css->flags & CSS_NO_REF))
+ percpu_ref_put_many(&css->refcnt, n);
+}
+
/* bits in struct cgroup flags field */
enum {
/* Control Group requires release notifications to userspace */
diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index d5c89e0dd0e6..494ab0588b65 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -141,28 +141,42 @@ static inline bool __ref_is_percpu(struct percpu_ref *ref,
}

/**
- * percpu_ref_get - increment a percpu refcount
+ * percpu_ref_get_many - increment a percpu refcount
* @ref: percpu_ref to get
+ * @nr: number of references to get
*
- * Analagous to atomic_long_inc().
+ * Analogous to atomic_long_add().
*
* This function is safe to call as long as @ref is between init and exit.
*/
-static inline void percpu_ref_get(struct percpu_ref *ref)
+static inline void percpu_ref_get_many(struct percpu_ref *ref, unsigned long nr)
{
unsigned long __percpu *percpu_count;

rcu_read_lock_sched();

if (__ref_is_percpu(ref, &percpu_count))
- this_cpu_inc(*percpu_count);
+ this_cpu_add(*percpu_count, nr);
else
- atomic_long_inc(&ref->count);
+ atomic_long_add(nr, &ref->count);

rcu_read_unlock_sched();
}

/**
+ * percpu_ref_get - increment a percpu refcount
+ * @ref: percpu_ref to get
+ *
+ * Analagous to atomic_long_inc().
+ *
+ * This function is safe to call as long as @ref is between init and exit.
+ */
+static inline void percpu_ref_get(struct percpu_ref *ref)
+{
+ percpu_ref_get_many(ref, 1);
+}
+
+/**
* percpu_ref_tryget - try to increment a percpu refcount
* @ref: percpu_ref to try-get
*
@@ -225,29 +239,44 @@ static inline bool percpu_ref_tryget_live(struct percpu_ref *ref)
}

/**
- * percpu_ref_put - decrement a percpu refcount
+ * percpu_ref_put_many - decrement a percpu refcount
* @ref: percpu_ref to put
+ * @nr: number of references to put
*
* Decrement the refcount, and if 0, call the release function (which was passed
* to percpu_ref_init())
*
* This function is safe to call as long as @ref is between init and exit.
*/
-static inline void percpu_ref_put(struct percpu_ref *ref)
+static inline void percpu_ref_put_many(struct percpu_ref *ref, unsigned long nr)
{
unsigned long __percpu *percpu_count;

rcu_read_lock_sched();

if (__ref_is_percpu(ref, &percpu_count))
- this_cpu_dec(*percpu_count);
- else if (unlikely(atomic_long_dec_and_test(&ref->count)))
+ this_cpu_sub(*percpu_count, nr);
+ else if (unlikely(atomic_long_sub_and_test(nr, &ref->count)))
ref->release(ref);

rcu_read_unlock_sched();
}

/**
+ * percpu_ref_put - decrement a percpu refcount
+ * @ref: percpu_ref to put
+ *
+ * Decrement the refcount, and if 0, call the release function (which was passed
+ * to percpu_ref_init())
+ *
+ * This function is safe to call as long as @ref is between init and exit.
+ */
+static inline void percpu_ref_put(struct percpu_ref *ref)
+{
+ percpu_ref_put_many(ref, 1);
+}
+
+/**
* percpu_ref_is_zero - test whether a percpu refcount reached zero
* @ref: percpu_ref to test
*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 67dabe8b0aa6..a3feead6be15 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2256,6 +2256,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
page_counter_uncharge(&old->memory, stock->nr_pages);
if (do_swap_account)
page_counter_uncharge(&old->memsw, stock->nr_pages);
+ css_put_many(&old->css, stock->nr_pages);
stock->nr_pages = 0;
}
stock->cached = NULL;
@@ -2513,6 +2514,7 @@ bypass:
return -EINTR;

done_restock:
+ css_get_many(&memcg->css, batch);
if (batch > nr_pages)
refill_stock(memcg, batch - nr_pages);
done:
@@ -2527,6 +2529,8 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
page_counter_uncharge(&memcg->memory, nr_pages);
if (do_swap_account)
page_counter_uncharge(&memcg->memsw, nr_pages);
+
+ css_put_many(&memcg->css, nr_pages);
}

/*
@@ -2722,6 +2726,7 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp,
page_counter_charge(&memcg->memory, nr_pages);
if (do_swap_account)
page_counter_charge(&memcg->memsw, nr_pages);
+ css_get_many(&memcg->css, nr_pages);
ret = 0;
} else if (ret)
page_counter_uncharge(&memcg->kmem, nr_pages);
@@ -2737,8 +2742,10 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
page_counter_uncharge(&memcg->memsw, nr_pages);

/* Not down to 0 */
- if (page_counter_uncharge(&memcg->kmem, nr_pages))
+ if (page_counter_uncharge(&memcg->kmem, nr_pages)) {
+ css_put_many(&memcg->css, nr_pages);
return;
+ }

/*
* Releases a reference taken in kmem_cgroup_css_offline in case
@@ -2750,6 +2757,8 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
*/
if (memcg_kmem_test_and_clear_dead(memcg))
css_put(&memcg->css);
+
+ css_put_many(&memcg->css, nr_pages);
}

/*
@@ -3377,10 +3386,13 @@ static int mem_cgroup_move_parent(struct page *page,
ret = mem_cgroup_move_account(page, nr_pages,
pc, child, parent);
if (!ret) {
+ if (!mem_cgroup_is_root(parent))
+ css_get_many(&parent->css, nr_pages);
/* Take charge off the local counters */
page_counter_cancel(&child->memory, nr_pages);
if (do_swap_account)
page_counter_cancel(&child->memsw, nr_pages);
+ css_put_many(&child->css, nr_pages);
}

if (nr_pages > 1)
@@ -5750,7 +5762,6 @@ static void __mem_cgroup_clear_mc(void)
{
struct mem_cgroup *from = mc.from;
struct mem_cgroup *to = mc.to;
- int i;

/* we must uncharge all the leftover precharges from mc.to */
if (mc.precharge) {
@@ -5778,8 +5789,7 @@ static void __mem_cgroup_clear_mc(void)
if (!mem_cgroup_is_root(mc.to))
page_counter_uncharge(&mc.to->memory, mc.moved_swap);

- for (i = 0; i < mc.moved_swap; i++)
- css_put(&mc.from->css);
+ css_put_many(&mc.from->css, mc.moved_swap);

/* we've already done css_get(mc.to) */
mc.moved_swap = 0;
@@ -6326,6 +6336,9 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
__this_cpu_add(memcg->stat->nr_page_events, nr_anon + nr_file);
memcg_check_events(memcg, dummy_page);
local_irq_restore(flags);
+
+ if (!mem_cgroup_is_root(memcg))
+ css_put_many(&memcg->css, max(nr_mem, nr_memsw));
}

static void uncharge_list(struct list_head *page_list)
--
2.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Michal Hocko
2014-10-15 15:18:36 UTC
Permalink
Post by Johannes Weiner
Charges currently pin the css indirectly by playing tricks during
css_offline(): user pages stall the offlining process until all of
them have been reparented, whereas kmemcg acquires a keep-alive
reference if outstanding kernel pages are detected at that point.
In preparation for removing all this complexity, make the pinning
explicit and acquire a css references for every charged page.
---
include/linux/cgroup.h | 26 +++++++++++++++++++++++
include/linux/percpu-refcount.h | 47 +++++++++++++++++++++++++++++++++--------
mm/memcontrol.c | 21 ++++++++++++++----
3 files changed, 81 insertions(+), 13 deletions(-)
[...]
Post by Johannes Weiner
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 67dabe8b0aa6..a3feead6be15 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2256,6 +2256,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
page_counter_uncharge(&old->memory, stock->nr_pages);
if (do_swap_account)
page_counter_uncharge(&old->memsw, stock->nr_pages);
+ css_put_many(&old->css, stock->nr_pages);
I have suggested to add a comment about pairing css_get here because the
corresponding refill_stock doesn't take any reference which might be
little bit confusing. Nothing earth shattering of course...
Post by Johannes Weiner
stock->nr_pages = 0;
}
stock->cached = NULL;
return -EINTR;
+ css_get_many(&memcg->css, batch);
if (batch > nr_pages)
refill_stock(memcg, batch - nr_pages);
[...]
--
Michal Hocko
SUSE Labs
Johannes Weiner
2014-10-15 18:44:10 UTC
Permalink
Post by Michal Hocko
Post by Johannes Weiner
Charges currently pin the css indirectly by playing tricks during
css_offline(): user pages stall the offlining process until all of
them have been reparented, whereas kmemcg acquires a keep-alive
reference if outstanding kernel pages are detected at that point.
In preparation for removing all this complexity, make the pinning
explicit and acquire a css references for every charged page.
---
include/linux/cgroup.h | 26 +++++++++++++++++++++++
include/linux/percpu-refcount.h | 47 +++++++++++++++++++++++++++++++++--------
mm/memcontrol.c | 21 ++++++++++++++----
3 files changed, 81 insertions(+), 13 deletions(-)
[...]
Post by Johannes Weiner
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 67dabe8b0aa6..a3feead6be15 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2256,6 +2256,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
page_counter_uncharge(&old->memory, stock->nr_pages);
if (do_swap_account)
page_counter_uncharge(&old->memsw, stock->nr_pages);
+ css_put_many(&old->css, stock->nr_pages);
I have suggested to add a comment about pairing css_get here because the
corresponding refill_stock doesn't take any reference which might be
little bit confusing. Nothing earth shattering of course...
Ah, but this isn't the counter-part to refill_stock(), consume_stock()
is. The references are taken for the charges, and those two functions
do not change the account. css get/put pair exactly like page counter
charge/uncharge.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Johannes Weiner
2014-10-14 16:20:33 UTC
Permalink
The memcg reclaim iterators use a complicated weak reference scheme to
prevent pinning cgroups indefinitely in the absence of memory pressure.

However, during the ongoing cgroup core rework, css lifetime has been
decoupled such that a pinned css no longer interferes with removal of
the user-visible cgroup, and all this complexity is now unnecessary.

Signed-off-by: Johannes Weiner <***@cmpxchg.org>
---
mm/memcontrol.c | 250 +++++++++++++++++---------------------------------------
1 file changed, 76 insertions(+), 174 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b62972c80055..67dabe8b0aa6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -143,14 +143,8 @@ struct mem_cgroup_stat_cpu {
unsigned long targets[MEM_CGROUP_NTARGETS];
};

-struct mem_cgroup_reclaim_iter {
- /*
- * last scanned hierarchy member. Valid only if last_dead_count
- * matches memcg->dead_count of the hierarchy root group.
- */
- struct mem_cgroup *last_visited;
- int last_dead_count;
-
+struct reclaim_iter {
+ struct mem_cgroup *position;
/* scan generation, increased every round-trip */
unsigned int generation;
};
@@ -162,7 +156,7 @@ struct mem_cgroup_per_zone {
struct lruvec lruvec;
unsigned long lru_size[NR_LRU_LISTS];

- struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];
+ struct reclaim_iter iter[DEF_PRIORITY + 1];

struct rb_node tree_node; /* RB tree node */
unsigned long usage_in_excess;/* Set to the value by which */
@@ -346,7 +340,6 @@ struct mem_cgroup {
struct mem_cgroup_stat_cpu nocpu_base;
spinlock_t pcp_counter_lock;

- atomic_t dead_count;
#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
struct cg_proto tcp_mem;
#endif
@@ -1067,122 +1060,6 @@ static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
return memcg;
}

-/*
- * Returns a next (in a pre-order walk) alive memcg (with elevated css
- * ref. count) or NULL if the whole root's subtree has been visited.
- *
- * helper function to be used by mem_cgroup_iter
- */
-static struct mem_cgroup *__mem_cgroup_iter_next(struct mem_cgroup *root,
- struct mem_cgroup *last_visited)
-{
- struct cgroup_subsys_state *prev_css, *next_css;
-
- prev_css = last_visited ? &last_visited->css : NULL;
-skip_node:
- next_css = css_next_descendant_pre(prev_css, &root->css);
-
- /*
- * Even if we found a group we have to make sure it is
- * alive. css && !memcg means that the groups should be
- * skipped and we should continue the tree walk.
- * last_visited css is safe to use because it is
- * protected by css_get and the tree walk is rcu safe.
- *
- * We do not take a reference on the root of the tree walk
- * because we might race with the root removal when it would
- * be the only node in the iterated hierarchy and mem_cgroup_iter
- * would end up in an endless loop because it expects that at
- * least one valid node will be returned. Root cannot disappear
- * because caller of the iterator should hold it already so
- * skipping css reference should be safe.
- */
- if (next_css) {
- struct mem_cgroup *memcg = mem_cgroup_from_css(next_css);
-
- if (next_css == &root->css)
- return memcg;
-
- if (css_tryget_online(next_css)) {
- /*
- * Make sure the memcg is initialized:
- * mem_cgroup_css_online() orders the the
- * initialization against setting the flag.
- */
- if (smp_load_acquire(&memcg->initialized))
- return memcg;
- css_put(next_css);
- }
-
- prev_css = next_css;
- goto skip_node;
- }
-
- return NULL;
-}
-
-static void mem_cgroup_iter_invalidate(struct mem_cgroup *root)
-{
- /*
- * When a group in the hierarchy below root is destroyed, the
- * hierarchy iterator can no longer be trusted since it might
- * have pointed to the destroyed group. Invalidate it.
- */
- atomic_inc(&root->dead_count);
-}
-
-static struct mem_cgroup *
-mem_cgroup_iter_load(struct mem_cgroup_reclaim_iter *iter,
- struct mem_cgroup *root,
- int *sequence)
-{
- struct mem_cgroup *position = NULL;
- /*
- * A cgroup destruction happens in two stages: offlining and
- * release. They are separated by a RCU grace period.
- *
- * If the iterator is valid, we may still race with an
- * offlining. The RCU lock ensures the object won't be
- * released, tryget will fail if we lost the race.
- */
- *sequence = atomic_read(&root->dead_count);
- if (iter->last_dead_count == *sequence) {
- smp_rmb();
- position = iter->last_visited;
-
- /*
- * We cannot take a reference to root because we might race
- * with root removal and returning NULL would end up in
- * an endless loop on the iterator user level when root
- * would be returned all the time.
- */
- if (position && position != root &&
- !css_tryget_online(&position->css))
- position = NULL;
- }
- return position;
-}
-
-static void mem_cgroup_iter_update(struct mem_cgroup_reclaim_iter *iter,
- struct mem_cgroup *last_visited,
- struct mem_cgroup *new_position,
- struct mem_cgroup *root,
- int sequence)
-{
- /* root reference counting symmetric to mem_cgroup_iter_load */
- if (last_visited && last_visited != root)
- css_put(&last_visited->css);
- /*
- * We store the sequence count from the time @last_visited was
- * loaded successfully instead of rereading it here so that we
- * don't lose destruction events in between. We could have
- * raced with the destruction of @new_position after all.
- */
- iter->last_visited = new_position;
- smp_wmb();
- iter->last_dead_count = sequence;
-}
-
/**
* mem_cgroup_iter - iterate over memory cgroup hierarchy
* @root: hierarchy root
@@ -1204,8 +1081,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
struct mem_cgroup_reclaim_cookie *reclaim)
{
+ struct reclaim_iter *uninitialized_var(iter);
+ struct cgroup_subsys_state *css = NULL;
struct mem_cgroup *memcg = NULL;
- struct mem_cgroup *last_visited = NULL;
+ struct mem_cgroup *pos = NULL;

if (mem_cgroup_disabled())
return NULL;
@@ -1214,50 +1093,93 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
root = root_mem_cgroup;

if (prev && !reclaim)
- last_visited = prev;
+ pos = prev;

if (!root->use_hierarchy && root != root_mem_cgroup) {
if (prev)
- goto out_css_put;
+ goto out;
return root;
}

rcu_read_lock();
- while (!memcg) {
- struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
- int uninitialized_var(seq);
-
- if (reclaim) {
- struct mem_cgroup_per_zone *mz;
-
- mz = mem_cgroup_zone_zoneinfo(root, reclaim->zone);
- iter = &mz->reclaim_iter[reclaim->priority];
- if (prev && reclaim->generation != iter->generation) {
- iter->last_visited = NULL;
- goto out_unlock;
- }

- last_visited = mem_cgroup_iter_load(iter, root, &seq);
+ if (reclaim) {
+ struct mem_cgroup_per_zone *mz;
+
+ mz = mem_cgroup_zone_zoneinfo(root, reclaim->zone);
+ iter = &mz->iter[reclaim->priority];
+
+ if (prev && reclaim->generation != iter->generation)
+ goto out_unlock;
+
+ do {
+ pos = ACCESS_ONCE(iter->position);
+ /*
+ * A racing update may change the position and
+ * put the last reference, hence css_tryget(),
+ * or retry to see the updated position.
+ */
+ } while (pos && !css_tryget(&pos->css));
+ }
+
+ if (pos)
+ css = &pos->css;
+
+ for (;;) {
+ css = css_next_descendant_pre(css, &root->css);
+ if (!css) {
+ /*
+ * Reclaimers share the hierarchy walk, and a
+ * new one might jump in right at the end of
+ * the hierarchy - make sure they see at least
+ * one group and restart from the beginning.
+ */
+ if (!prev)
+ continue;
+ break;
}

- memcg = __mem_cgroup_iter_next(root, last_visited);
+ /*
+ * Verify the css and acquire a reference. The root
+ * is provided by the caller, so we know it's alive
+ * and kicking, and don't take an extra reference.
+ */
+ memcg = mem_cgroup_from_css(css);

- if (reclaim) {
- mem_cgroup_iter_update(iter, last_visited, memcg, root,
- seq);
+ if (css == &root->css)
+ break;
+
+ if (css_tryget_online(css)) {
+ /*
+ * Make sure the memcg is initialized:
+ * mem_cgroup_css_online() orders the the
+ * initialization against setting the flag.
+ */
+ if (smp_load_acquire(&memcg->initialized))
+ break;

- if (!memcg)
- iter->generation++;
- else if (!prev && memcg)
- reclaim->generation = iter->generation;
+ css_put(css);
}

- if (prev && !memcg)
- goto out_unlock;
+ memcg = NULL;
+ }
+
+ if (reclaim) {
+ if (cmpxchg(&iter->position, pos, memcg) == pos && memcg)
+ css_get(&memcg->css);
+
+ if (pos)
+ css_put(&pos->css);
+
+ if (!memcg)
+ iter->generation++;
+ else if (!prev)
+ reclaim->generation = iter->generation;
}
+
out_unlock:
rcu_read_unlock();
-out_css_put:
+out:
if (prev && prev != root)
css_put(&prev->css);

@@ -5438,24 +5360,6 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
return 0;
}

-/*
- * Announce all parents that a group from their hierarchy is gone.
- */
-static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg)
-{
- struct mem_cgroup *parent = memcg;
-
- while ((parent = parent_mem_cgroup(parent)))
- mem_cgroup_iter_invalidate(parent);
-
- /*
- * if the root memcg is not hierarchical we have to check it
- * explicitely.
- */
- if (!root_mem_cgroup->use_hierarchy)
- mem_cgroup_iter_invalidate(root_mem_cgroup);
-}
-
static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
@@ -5476,8 +5380,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)

kmem_cgroup_css_offline(memcg);

- mem_cgroup_invalidate_reclaim_iterators(memcg);
-
/*
* This requires that offlining is serialized. Right now that is
* guaranteed because css_killed_work_fn() holds the cgroup_mutex.
--
2.1.2
Michal Hocko
2014-10-15 15:02:56 UTC
Permalink
Post by Johannes Weiner
The memcg reclaim iterators use a complicated weak reference scheme to
prevent pinning cgroups indefinitely in the absence of memory pressure.
However, during the ongoing cgroup core rework, css lifetime has been
decoupled such that a pinned css no longer interferes with removal of
the user-visible cgroup, and all this complexity is now unnecessary.
---
mm/memcontrol.c | 250 +++++++++++++++++---------------------------------------
1 file changed, 76 insertions(+), 174 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b62972c80055..67dabe8b0aa6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
[...]
Post by Johannes Weiner
+ do {
+ pos = ACCESS_ONCE(iter->position);
+ /*
+ * A racing update may change the position and
+ * put the last reference, hence css_tryget(),
+ * or retry to see the updated position.
+ */
+ } while (pos && !css_tryget(&pos->css));
+ }
[...]
Post by Johannes Weiner
+ if (reclaim) {
+ if (cmpxchg(&iter->position, pos, memcg) == pos && memcg)
+ css_get(&memcg->css);
+
+ if (pos)
+ css_put(&pos->css);
This looks like a reference leak. css_put pairs with the above
css_tryget but no css_put pairs with css_get for the cached one. We
need:
---
From 2810937ec6c16afc0bf924e761ff8305bd478a42 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko-***@public.gmane.org>
Date: Wed, 15 Oct 2014 16:26:22 +0200
Subject: [PATCH]
mm-memcontrol-convert-reclaim-iterator-to-simple-css-refcounting-fix

Make sure that the cached reference is always released.

Signed-off-by: Michal Hocko <mhocko-***@public.gmane.org>
---
mm/memcontrol.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bdcd0416a017..42842a47b4c1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1160,9 +1160,17 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
}

if (reclaim) {
- if (cmpxchg(&iter->position, pos, memcg) == pos && memcg)
- css_get(&memcg->css);
+ if (cmpxchg(&iter->position, pos, memcg) == pos) {
+ if (memcg)
+ css_get(&memcg->css);
+ if (pos)
+ css_put(&pos->css);
+ }

+ /*
+ * pairs with css_tryget when dereferencing iter->position
+ * above.
+ */
if (pos)
css_put(&pos->css);
--
2.1.1
--
Michal Hocko
SUSE Labs
Johannes Weiner
2014-10-15 18:31:37 UTC
Permalink
Post by Michal Hocko
Post by Johannes Weiner
The memcg reclaim iterators use a complicated weak reference scheme to
prevent pinning cgroups indefinitely in the absence of memory pressure.
However, during the ongoing cgroup core rework, css lifetime has been
decoupled such that a pinned css no longer interferes with removal of
the user-visible cgroup, and all this complexity is now unnecessary.
---
mm/memcontrol.c | 250 +++++++++++++++++---------------------------------------
1 file changed, 76 insertions(+), 174 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b62972c80055..67dabe8b0aa6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
[...]
Post by Johannes Weiner
+ do {
+ pos = ACCESS_ONCE(iter->position);
+ /*
+ * A racing update may change the position and
+ * put the last reference, hence css_tryget(),
+ * or retry to see the updated position.
+ */
+ } while (pos && !css_tryget(&pos->css));
+ }
[...]
Post by Johannes Weiner
+ if (reclaim) {
+ if (cmpxchg(&iter->position, pos, memcg) == pos && memcg)
+ css_get(&memcg->css);
+
+ if (pos)
+ css_put(&pos->css);
This looks like a reference leak. css_put pairs with the above
css_tryget but no css_put pairs with css_get for the cached one. We
---
From 2810937ec6c16afc0bf924e761ff8305bd478a42 Mon Sep 17 00:00:00 2001
Date: Wed, 15 Oct 2014 16:26:22 +0200
Subject: [PATCH]
mm-memcontrol-convert-reclaim-iterator-to-simple-css-refcounting-fix
Make sure that the cached reference is always released.
You're right, thanks for catching that.

Acked-by: Johannes Weiner <***@cmpxchg.org>
Johannes Weiner
2014-10-14 16:20:35 UTC
Permalink
As charges now pin the css explicitely, there is no more need for
kmemcg to acquire a proxy reference for outstanding pages during
offlining, or maintain state to identify such "dead" groups.

This was the last user of the uncharge functions' return values, so
remove them as well.

Signed-off-by: Johannes Weiner <hannes-***@public.gmane.org>
Reviewed-by: Vladimir Davydov <vdavydov-***@public.gmane.org>
---
include/linux/page_counter.h | 4 +--
mm/memcontrol.c | 74 +-------------------------------------------
mm/page_counter.c | 23 +++-----------
3 files changed, 7 insertions(+), 94 deletions(-)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index d92d18949474..a878ef61d073 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -32,12 +32,12 @@ static inline unsigned long page_counter_read(struct page_counter *counter)
return atomic_long_read(&counter->count);
}

-int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
+void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
int page_counter_try_charge(struct page_counter *counter,
unsigned long nr_pages,
struct page_counter **fail);
-int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
+void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
int page_counter_limit(struct page_counter *counter, unsigned long limit);
int page_counter_memparse(const char *buf, unsigned long *nr_pages);

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a3feead6be15..7551e12f8ff7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -369,7 +369,6 @@ struct mem_cgroup {
/* internal only representation about the status of kmem accounting. */
enum {
KMEM_ACCOUNTED_ACTIVE, /* accounted by this cgroup itself */
- KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */
};

#ifdef CONFIG_MEMCG_KMEM
@@ -383,22 +382,6 @@ static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
}

-static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
-{
- /*
- * Our caller must use css_get() first, because memcg_uncharge_kmem()
- * will call css_put() if it sees the memcg is dead.
- */
- smp_wmb();
- if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags))
- set_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_account_flags);
-}
-
-static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
-{
- return test_and_clear_bit(KMEM_ACCOUNTED_DEAD,
- &memcg->kmem_account_flags);
-}
#endif

/* Stuffs for move charges at task migration. */
@@ -2741,22 +2724,7 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
if (do_swap_account)
page_counter_uncharge(&memcg->memsw, nr_pages);

- /* Not down to 0 */
- if (page_counter_uncharge(&memcg->kmem, nr_pages)) {
- css_put_many(&memcg->css, nr_pages);
- return;
- }
-
- /*
- * Releases a reference taken in kmem_cgroup_css_offline in case
- * this last uncharge is racing with the offlining code or it is
- * outliving the memcg existence.
- *
- * The memory barrier imposed by test&clear is paired with the
- * explicit one in memcg_kmem_mark_dead().
- */
- if (memcg_kmem_test_and_clear_dead(memcg))
- css_put(&memcg->css);
+ page_counter_uncharge(&memcg->kmem, nr_pages);

css_put_many(&memcg->css, nr_pages);
}
@@ -4740,40 +4708,6 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
{
mem_cgroup_sockets_destroy(memcg);
}
-
-static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-{
- if (!memcg_kmem_is_active(memcg))
- return;
-
- /*
- * kmem charges can outlive the cgroup. In the case of slab
- * pages, for instance, a page contain objects from various
- * processes. As we prevent from taking a reference for every
- * such allocation we have to be careful when doing uncharge
- * (see memcg_uncharge_kmem) and here during offlining.
- *
- * The idea is that that only the _last_ uncharge which sees
- * the dead memcg will drop the last reference. An additional
- * reference is taken here before the group is marked dead
- * which is then paired with css_put during uncharge resp. here.
- *
- * Although this might sound strange as this path is called from
- * css_offline() when the referencemight have dropped down to 0 and
- * shouldn't be incremented anymore (css_tryget_online() would
- * fail) we do not have other options because of the kmem
- * allocations lifetime.
- */
- css_get(&memcg->css);
-
- memcg_kmem_mark_dead(memcg);
-
- if (page_counter_read(&memcg->kmem))
- return;
-
- if (memcg_kmem_test_and_clear_dead(memcg))
- css_put(&memcg->css);
-}
#else
static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
@@ -4783,10 +4717,6 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
static void memcg_destroy_kmem(struct mem_cgroup *memcg)
{
}
-
-static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-{
-}
#endif

/*
@@ -5390,8 +5320,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
}
spin_unlock(&memcg->event_list_lock);

- kmem_cgroup_css_offline(memcg);
-
/*
* This requires that offlining is serialized. Right now that is
* guaranteed because css_killed_work_fn() holds the cgroup_mutex.
diff --git a/mm/page_counter.c b/mm/page_counter.c
index fc4990c6bb5b..71a0e92e7051 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -12,19 +12,14 @@
* page_counter_cancel - take pages out of the local counter
* @counter: counter
* @nr_pages: number of pages to cancel
- *
- * Returns whether there are remaining pages in the counter.
*/
-int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
+void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
{
long new;

new = atomic_long_sub_return(nr_pages, &counter->count);
-
/* More uncharges than charges? */
WARN_ON_ONCE(new < 0);
-
- return new > 0;
}

/**
@@ -113,23 +108,13 @@ failed:
* page_counter_uncharge - hierarchically uncharge pages
* @counter: counter
* @nr_pages: number of pages to uncharge
- *
- * Returns whether there are remaining charges in @counter.
*/
-int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
+void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
{
struct page_counter *c;
- int ret = 1;

- for (c = counter; c; c = c->parent) {
- int remainder;
-
- remainder = page_counter_cancel(c, nr_pages);
- if (c == counter && !remainder)
- ret = 0;
- }
-
- return ret;
+ for (c = counter; c; c = c->parent)
+ page_counter_cancel(c, nr_pages);
}

/**
--
2.1.2
Michal Hocko
2014-10-15 15:21:28 UTC
Permalink
Post by Johannes Weiner
As charges now pin the css explicitely, there is no more need for
kmemcg to acquire a proxy reference for outstanding pages during
offlining, or maintain state to identify such "dead" groups.
This was the last user of the uncharge functions' return values, so
remove them as well.
---
include/linux/page_counter.h | 4 +--
mm/memcontrol.c | 74 +-------------------------------------------
mm/page_counter.c | 23 +++-----------
3 files changed, 7 insertions(+), 94 deletions(-)
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index d92d18949474..a878ef61d073 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -32,12 +32,12 @@ static inline unsigned long page_counter_read(struct page_counter *counter)
return atomic_long_read(&counter->count);
}
-int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
+void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
int page_counter_try_charge(struct page_counter *counter,
unsigned long nr_pages,
struct page_counter **fail);
-int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
+void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
int page_counter_limit(struct page_counter *counter, unsigned long limit);
int page_counter_memparse(const char *buf, unsigned long *nr_pages);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a3feead6be15..7551e12f8ff7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -369,7 +369,6 @@ struct mem_cgroup {
/* internal only representation about the status of kmem accounting. */
enum {
KMEM_ACCOUNTED_ACTIVE, /* accounted by this cgroup itself */
- KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */
};
#ifdef CONFIG_MEMCG_KMEM
@@ -383,22 +382,6 @@ static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
}
-static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
-{
- /*
- * Our caller must use css_get() first, because memcg_uncharge_kmem()
- * will call css_put() if it sees the memcg is dead.
- */
- smp_wmb();
- if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags))
- set_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_account_flags);
-}
-
-static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
-{
- return test_and_clear_bit(KMEM_ACCOUNTED_DEAD,
- &memcg->kmem_account_flags);
-}
#endif
/* Stuffs for move charges at task migration. */
@@ -2741,22 +2724,7 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
if (do_swap_account)
page_counter_uncharge(&memcg->memsw, nr_pages);
- /* Not down to 0 */
- if (page_counter_uncharge(&memcg->kmem, nr_pages)) {
- css_put_many(&memcg->css, nr_pages);
- return;
- }
-
- /*
- * Releases a reference taken in kmem_cgroup_css_offline in case
- * this last uncharge is racing with the offlining code or it is
- * outliving the memcg existence.
- *
- * The memory barrier imposed by test&clear is paired with the
- * explicit one in memcg_kmem_mark_dead().
- */
- if (memcg_kmem_test_and_clear_dead(memcg))
- css_put(&memcg->css);
+ page_counter_uncharge(&memcg->kmem, nr_pages);
css_put_many(&memcg->css, nr_pages);
}
@@ -4740,40 +4708,6 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
{
mem_cgroup_sockets_destroy(memcg);
}
-
-static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-{
- if (!memcg_kmem_is_active(memcg))
- return;
-
- /*
- * kmem charges can outlive the cgroup. In the case of slab
- * pages, for instance, a page contain objects from various
- * processes. As we prevent from taking a reference for every
- * such allocation we have to be careful when doing uncharge
- * (see memcg_uncharge_kmem) and here during offlining.
- *
- * The idea is that that only the _last_ uncharge which sees
- * the dead memcg will drop the last reference. An additional
- * reference is taken here before the group is marked dead
- * which is then paired with css_put during uncharge resp. here.
- *
- * Although this might sound strange as this path is called from
- * css_offline() when the referencemight have dropped down to 0 and
- * shouldn't be incremented anymore (css_tryget_online() would
- * fail) we do not have other options because of the kmem
- * allocations lifetime.
- */
- css_get(&memcg->css);
-
- memcg_kmem_mark_dead(memcg);
-
- if (page_counter_read(&memcg->kmem))
- return;
-
- if (memcg_kmem_test_and_clear_dead(memcg))
- css_put(&memcg->css);
-}
#else
static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
@@ -4783,10 +4717,6 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
static void memcg_destroy_kmem(struct mem_cgroup *memcg)
{
}
-
-static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-{
-}
#endif
/*
@@ -5390,8 +5320,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
}
spin_unlock(&memcg->event_list_lock);
- kmem_cgroup_css_offline(memcg);
-
/*
* This requires that offlining is serialized. Right now that is
* guaranteed because css_killed_work_fn() holds the cgroup_mutex.
diff --git a/mm/page_counter.c b/mm/page_counter.c
index fc4990c6bb5b..71a0e92e7051 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -12,19 +12,14 @@
* page_counter_cancel - take pages out of the local counter
- *
- * Returns whether there are remaining pages in the counter.
*/
-int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
+void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
{
long new;
new = atomic_long_sub_return(nr_pages, &counter->count);
-
/* More uncharges than charges? */
WARN_ON_ONCE(new < 0);
-
- return new > 0;
}
/**
* page_counter_uncharge - hierarchically uncharge pages
- *
*/
-int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
+void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
{
struct page_counter *c;
- int ret = 1;
- for (c = counter; c; c = c->parent) {
- int remainder;
-
- remainder = page_counter_cancel(c, nr_pages);
- if (c == counter && !remainder)
- ret = 0;
- }
-
- return ret;
+ for (c = counter; c; c = c->parent)
+ page_counter_cancel(c, nr_pages);
}
/**
--
2.1.2
--
Michal Hocko
SUSE Labs
Johannes Weiner
2014-10-14 16:20:37 UTC
Permalink
With charge reparenting, the last synchroneous stock drainer left.

Signed-off-by: Johannes Weiner <***@cmpxchg.org>
---
mm/memcontrol.c | 46 ++++++----------------------------------------
1 file changed, 6 insertions(+), 40 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ce3ed7cc5c30..ac7d6cefcc63 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -634,8 +634,6 @@ static void disarm_static_keys(struct mem_cgroup *memcg)
disarm_kmem_keys(memcg);
}

-static void drain_all_stock_async(struct mem_cgroup *memcg);
-
static struct mem_cgroup_per_zone *
mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
{
@@ -2285,13 +2283,15 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)

/*
* Drains all per-CPU charge caches for given root_memcg resp. subtree
- * of the hierarchy under it. sync flag says whether we should block
- * until the work is done.
+ * of the hierarchy under it.
*/
-static void drain_all_stock(struct mem_cgroup *root_memcg, bool sync)
+static void drain_all_stock(struct mem_cgroup *root_memcg)
{
int cpu, curcpu;

+ /* If someone's already draining, avoid adding running more workers. */
+ if (!mutex_trylock(&percpu_charge_mutex))
+ return;
/* Notify other cpus that system-wide "drain" is running */
get_online_cpus();
curcpu = get_cpu();
@@ -2312,41 +2312,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg, bool sync)
}
}
put_cpu();
-
- if (!sync)
- goto out;
-
- for_each_online_cpu(cpu) {
- struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
- if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
- flush_work(&stock->work);
- }
-out:
put_online_cpus();
-}
-
-/*
- * Tries to drain stocked charges in other cpus. This function is asynchronous
- * and just put a work per cpu for draining localy on each cpu. Caller can
- * expects some charges will be back later but cannot wait for it.
- */
-static void drain_all_stock_async(struct mem_cgroup *root_memcg)
-{
- /*
- * If someone calls draining, avoid adding more kworker runs.
- */
- if (!mutex_trylock(&percpu_charge_mutex))
- return;
- drain_all_stock(root_memcg, false);
- mutex_unlock(&percpu_charge_mutex);
-}
-
-/* This is a synchronous drain interface. */
-static void drain_all_stock_sync(struct mem_cgroup *root_memcg)
-{
- /* called when force_empty is called */
- mutex_lock(&percpu_charge_mutex);
- drain_all_stock(root_memcg, true);
mutex_unlock(&percpu_charge_mutex);
}

@@ -2455,7 +2421,7 @@ retry:
goto retry;

if (!drained) {
- drain_all_stock_async(mem_over_limit);
+ drain_all_stock(mem_over_limit);
drained = true;
goto retry;
}
--
2.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Michal Hocko
2014-10-15 15:28:30 UTC
Permalink
Post by Johannes Weiner
With charge reparenting, the last synchroneous stock drainer left.
---
mm/memcontrol.c | 46 ++++++----------------------------------------
1 file changed, 6 insertions(+), 40 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ce3ed7cc5c30..ac7d6cefcc63 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -634,8 +634,6 @@ static void disarm_static_keys(struct mem_cgroup *memcg)
disarm_kmem_keys(memcg);
}
-static void drain_all_stock_async(struct mem_cgroup *memcg);
-
static struct mem_cgroup_per_zone *
mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
{
@@ -2285,13 +2283,15 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
/*
* Drains all per-CPU charge caches for given root_memcg resp. subtree
- * of the hierarchy under it. sync flag says whether we should block
- * until the work is done.
+ * of the hierarchy under it.
*/
-static void drain_all_stock(struct mem_cgroup *root_memcg, bool sync)
+static void drain_all_stock(struct mem_cgroup *root_memcg)
{
int cpu, curcpu;
+ /* If someone's already draining, avoid adding running more workers. */
+ if (!mutex_trylock(&percpu_charge_mutex))
+ return;
/* Notify other cpus that system-wide "drain" is running */
get_online_cpus();
curcpu = get_cpu();
@@ -2312,41 +2312,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg, bool sync)
}
}
put_cpu();
-
- if (!sync)
- goto out;
-
- for_each_online_cpu(cpu) {
- struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
- if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
- flush_work(&stock->work);
- }
put_online_cpus();
-}
-
-/*
- * Tries to drain stocked charges in other cpus. This function is asynchronous
- * and just put a work per cpu for draining localy on each cpu. Caller can
- * expects some charges will be back later but cannot wait for it.
- */
-static void drain_all_stock_async(struct mem_cgroup *root_memcg)
-{
- /*
- * If someone calls draining, avoid adding more kworker runs.
- */
- if (!mutex_trylock(&percpu_charge_mutex))
- return;
- drain_all_stock(root_memcg, false);
- mutex_unlock(&percpu_charge_mutex);
-}
-
-/* This is a synchronous drain interface. */
-static void drain_all_stock_sync(struct mem_cgroup *root_memcg)
-{
- /* called when force_empty is called */
- mutex_lock(&percpu_charge_mutex);
- drain_all_stock(root_memcg, true);
mutex_unlock(&percpu_charge_mutex);
}
goto retry;
if (!drained) {
- drain_all_stock_async(mem_over_limit);
+ drain_all_stock(mem_over_limit);
drained = true;
goto retry;
}
--
2.1.2
--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Vladimir Davydov
2014-10-17 08:48:41 UTC
Permalink
Post by Johannes Weiner
With charge reparenting, the last synchroneous stock drainer left.
Reviewed-by: Vladimir Davydov <***@parallels.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Johannes Weiner
2014-10-14 16:20:36 UTC
Permalink
On cgroup deletion, outstanding page cache charges are moved to the
parent group so that they're not lost and can be reclaimed during
pressure on/inside said parent. But this reparenting is fairly tricky
and its synchroneous nature has led to several lock-ups in the past.

Since css iterators now also include offlined css, memcg iterators can
be changed to include offlined children during reclaim of a group, and
leftover cache can just stay put.

There is a slight change of behavior in that charges of deleted groups
no longer show up as local charges in the parent. But they are still
included in the parent's hierarchical statistics.

Signed-off-by: Johannes Weiner <***@cmpxchg.org>
---
mm/memcontrol.c | 218 +-------------------------------------------------------
1 file changed, 1 insertion(+), 217 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7551e12f8ff7..ce3ed7cc5c30 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1132,7 +1132,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
if (css == &root->css)
break;

- if (css_tryget_online(css)) {
+ if (css_tryget(css)) {
/*
* Make sure the memcg is initialized:
* mem_cgroup_css_online() orders the the
@@ -3299,79 +3299,6 @@ out:
return ret;
}

-/**
- * mem_cgroup_move_parent - moves page to the parent group
- * @page: the page to move
- * @pc: page_cgroup of the page
- * @child: page's cgroup
- *
- * move charges to its parent or the root cgroup if the group has no
- * parent (aka use_hierarchy==0).
- * Although this might fail (get_page_unless_zero, isolate_lru_page or
- * mem_cgroup_move_account fails) the failure is always temporary and
- * it signals a race with a page removal/uncharge or migration. In the
- * first case the page is on the way out and it will vanish from the LRU
- * on the next attempt and the call should be retried later.
- * Isolation from the LRU fails only if page has been isolated from
- * the LRU since we looked at it and that usually means either global
- * reclaim or migration going on. The page will either get back to the
- * LRU or vanish.
- * Finaly mem_cgroup_move_account fails only if the page got uncharged
- * (!PageCgroupUsed) or moved to a different group. The page will
- * disappear in the next attempt.
- */
-static int mem_cgroup_move_parent(struct page *page,
- struct page_cgroup *pc,
- struct mem_cgroup *child)
-{
- struct mem_cgroup *parent;
- unsigned int nr_pages;
- unsigned long uninitialized_var(flags);
- int ret;
-
- VM_BUG_ON(mem_cgroup_is_root(child));
-
- ret = -EBUSY;
- if (!get_page_unless_zero(page))
- goto out;
- if (isolate_lru_page(page))
- goto put;
-
- nr_pages = hpage_nr_pages(page);
-
- parent = parent_mem_cgroup(child);
- /*
- * If no parent, move charges to root cgroup.
- */
- if (!parent)
- parent = root_mem_cgroup;
-
- if (nr_pages > 1) {
- VM_BUG_ON_PAGE(!PageTransHuge(page), page);
- flags = compound_lock_irqsave(page);
- }
-
- ret = mem_cgroup_move_account(page, nr_pages,
- pc, child, parent);
- if (!ret) {
- if (!mem_cgroup_is_root(parent))
- css_get_many(&parent->css, nr_pages);
- /* Take charge off the local counters */
- page_counter_cancel(&child->memory, nr_pages);
- if (do_swap_account)
- page_counter_cancel(&child->memsw, nr_pages);
- css_put_many(&child->css, nr_pages);
- }
-
- if (nr_pages > 1)
- compound_unlock_irqrestore(page, flags);
- putback_lru_page(page);
-put:
- put_page(page);
-out:
- return ret;
-}
-
#ifdef CONFIG_MEMCG_SWAP
static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
bool charge)
@@ -3665,105 +3592,6 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
return nr_reclaimed;
}

-/**
- * mem_cgroup_force_empty_list - clears LRU of a group
- * @memcg: group to clear
- * @node: NUMA node
- * @zid: zone id
- * @lru: lru to to clear
- *
- * Traverse a specified page_cgroup list and try to drop them all. This doesn't
- * reclaim the pages page themselves - pages are moved to the parent (or root)
- * group.
- */
-static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
- int node, int zid, enum lru_list lru)
-{
- struct lruvec *lruvec;
- unsigned long flags;
- struct list_head *list;
- struct page *busy;
- struct zone *zone;
-
- zone = &NODE_DATA(node)->node_zones[zid];
- lruvec = mem_cgroup_zone_lruvec(zone, memcg);
- list = &lruvec->lists[lru];
-
- busy = NULL;
- do {
- struct page_cgroup *pc;
- struct page *page;
-
- spin_lock_irqsave(&zone->lru_lock, flags);
- if (list_empty(list)) {
- spin_unlock_irqrestore(&zone->lru_lock, flags);
- break;
- }
- page = list_entry(list->prev, struct page, lru);
- if (busy == page) {
- list_move(&page->lru, list);
- busy = NULL;
- spin_unlock_irqrestore(&zone->lru_lock, flags);
- continue;
- }
- spin_unlock_irqrestore(&zone->lru_lock, flags);
-
- pc = lookup_page_cgroup(page);
-
- if (mem_cgroup_move_parent(page, pc, memcg)) {
- /* found lock contention or "pc" is obsolete. */
- busy = page;
- } else
- busy = NULL;
- cond_resched();
- } while (!list_empty(list));
-}
-
-/*
- * make mem_cgroup's charge to be 0 if there is no task by moving
- * all the charges and pages to the parent.
- * This enables deleting this mem_cgroup.
- *
- * Caller is responsible for holding css reference on the memcg.
- */
-static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
-{
- int node, zid;
-
- do {
- /* This is for making all *used* pages to be on LRU. */
- lru_add_drain_all();
- drain_all_stock_sync(memcg);
- mem_cgroup_start_move(memcg);
- for_each_node_state(node, N_MEMORY) {
- for (zid = 0; zid < MAX_NR_ZONES; zid++) {
- enum lru_list lru;
- for_each_lru(lru) {
- mem_cgroup_force_empty_list(memcg,
- node, zid, lru);
- }
- }
- }
- mem_cgroup_end_move(memcg);
- memcg_oom_recover(memcg);
- cond_resched();
-
- /*
- * Kernel memory may not necessarily be trackable to a specific
- * process. So they are not migrated, and therefore we can't
- * expect their value to drop to 0 here.
- * Having res filled up with kmem only is enough.
- *
- * This is a safety check because mem_cgroup_force_empty_list
- * could have raced with mem_cgroup_replace_page_cache callers
- * so the lru seemed empty but the page could have been added
- * right after the check. RES_USAGE should be safe as we always
- * charge before adding to the LRU.
- */
- } while (page_counter_read(&memcg->memory) -
- page_counter_read(&memcg->kmem) > 0);
-}
-
/*
* Test whether @memcg has children, dead or alive. Note that this
* function doesn't care whether @memcg has use_hierarchy enabled and
@@ -5306,7 +5134,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
struct mem_cgroup_event *event, *tmp;
- struct cgroup_subsys_state *iter;

/*
* Unregister events and notify userspace.
@@ -5320,13 +5147,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
}
spin_unlock(&memcg->event_list_lock);

- /*
- * This requires that offlining is serialized. Right now that is
- * guaranteed because css_killed_work_fn() holds the cgroup_mutex.
- */
- css_for_each_descendant_post(iter, css)
- mem_cgroup_reparent_charges(mem_cgroup_from_css(iter));
-
memcg_unregister_all_caches(memcg);
vmpressure_cleanup(&memcg->vmpressure);
}
@@ -5334,42 +5154,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
- /*
- * XXX: css_offline() would be where we should reparent all
- * memory to prepare the cgroup for destruction. However,
- * memcg does not do css_tryget_online() and page_counter charging
- * under the same RCU lock region, which means that charging
- * could race with offlining. Offlining only happens to
- * cgroups with no tasks in them but charges can show up
- * without any tasks from the swapin path when the target
- * memcg is looked up from the swapout record and not from the
- * current task as it usually is. A race like this can leak
- * charges and put pages with stale cgroup pointers into
- * circulation:
- *
- * #0 #1
- * lookup_swap_cgroup_id()
- * rcu_read_lock()
- * mem_cgroup_lookup()
- * css_tryget_online()
- * rcu_read_unlock()
- * disable css_tryget_online()
- * call_rcu()
- * offline_css()
- * reparent_charges()
- * page_counter_try_charge()
- * css_put()
- * css_free()
- * pc->mem_cgroup = dead memcg
- * add page to lru
- *
- * The bulk of the charges are still moved in offline_css() to
- * avoid pinning a lot of pages in case a long-term reference
- * like a swapout record is deferring the css_free() to long
- * after offlining. But this makes sure we catch any charges
- * made after offlining:
- */
- mem_cgroup_reparent_charges(memcg);

memcg_destroy_kmem(memcg);
__mem_cgroup_free(memcg);
--
2.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Michal Hocko
2014-10-15 15:25:55 UTC
Permalink
Post by Johannes Weiner
On cgroup deletion, outstanding page cache charges are moved to the
parent group so that they're not lost and can be reclaimed during
pressure on/inside said parent. But this reparenting is fairly tricky
and its synchroneous nature has led to several lock-ups in the past.
Since css iterators now also include offlined css, memcg iterators can
be changed to include offlined children during reclaim of a group, and
leftover cache can just stay put.
I think it would be nice to mention c2931b70a32c (cgroup: iterate
cgroup_subsys_states directly) here to have a full context about the
tryget vs tryget_online.
Post by Johannes Weiner
There is a slight change of behavior in that charges of deleted groups
no longer show up as local charges in the parent. But they are still
included in the parent's hierarchical statistics.
Thank you for pulling drain_stock cleanup out. This made the patch so
much easier to review.
Post by Johannes Weiner
---
mm/memcontrol.c | 218 +-------------------------------------------------------
1 file changed, 1 insertion(+), 217 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7551e12f8ff7..ce3ed7cc5c30 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1132,7 +1132,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
if (css == &root->css)
break;
- if (css_tryget_online(css)) {
+ if (css_tryget(css)) {
/*
* mem_cgroup_css_online() orders the the
return ret;
}
-/**
- * mem_cgroup_move_parent - moves page to the parent group
- *
- * move charges to its parent or the root cgroup if the group has no
- * parent (aka use_hierarchy==0).
- * Although this might fail (get_page_unless_zero, isolate_lru_page or
- * mem_cgroup_move_account fails) the failure is always temporary and
- * it signals a race with a page removal/uncharge or migration. In the
- * first case the page is on the way out and it will vanish from the LRU
- * on the next attempt and the call should be retried later.
- * Isolation from the LRU fails only if page has been isolated from
- * the LRU since we looked at it and that usually means either global
- * reclaim or migration going on. The page will either get back to the
- * LRU or vanish.
- * Finaly mem_cgroup_move_account fails only if the page got uncharged
- * (!PageCgroupUsed) or moved to a different group. The page will
- * disappear in the next attempt.
- */
-static int mem_cgroup_move_parent(struct page *page,
- struct page_cgroup *pc,
- struct mem_cgroup *child)
-{
- struct mem_cgroup *parent;
- unsigned int nr_pages;
- unsigned long uninitialized_var(flags);
- int ret;
-
- VM_BUG_ON(mem_cgroup_is_root(child));
-
- ret = -EBUSY;
- if (!get_page_unless_zero(page))
- goto out;
- if (isolate_lru_page(page))
- goto put;
-
- nr_pages = hpage_nr_pages(page);
-
- parent = parent_mem_cgroup(child);
- /*
- * If no parent, move charges to root cgroup.
- */
- if (!parent)
- parent = root_mem_cgroup;
-
- if (nr_pages > 1) {
- VM_BUG_ON_PAGE(!PageTransHuge(page), page);
- flags = compound_lock_irqsave(page);
- }
-
- ret = mem_cgroup_move_account(page, nr_pages,
- pc, child, parent);
- if (!ret) {
- if (!mem_cgroup_is_root(parent))
- css_get_many(&parent->css, nr_pages);
- /* Take charge off the local counters */
- page_counter_cancel(&child->memory, nr_pages);
- if (do_swap_account)
- page_counter_cancel(&child->memsw, nr_pages);
- css_put_many(&child->css, nr_pages);
- }
-
- if (nr_pages > 1)
- compound_unlock_irqrestore(page, flags);
- putback_lru_page(page);
- put_page(page);
- return ret;
-}
-
#ifdef CONFIG_MEMCG_SWAP
static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
bool charge)
@@ -3665,105 +3592,6 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
return nr_reclaimed;
}
-/**
- * mem_cgroup_force_empty_list - clears LRU of a group
- *
- * Traverse a specified page_cgroup list and try to drop them all. This doesn't
- * reclaim the pages page themselves - pages are moved to the parent (or root)
- * group.
- */
-static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
- int node, int zid, enum lru_list lru)
-{
- struct lruvec *lruvec;
- unsigned long flags;
- struct list_head *list;
- struct page *busy;
- struct zone *zone;
-
- zone = &NODE_DATA(node)->node_zones[zid];
- lruvec = mem_cgroup_zone_lruvec(zone, memcg);
- list = &lruvec->lists[lru];
-
- busy = NULL;
- do {
- struct page_cgroup *pc;
- struct page *page;
-
- spin_lock_irqsave(&zone->lru_lock, flags);
- if (list_empty(list)) {
- spin_unlock_irqrestore(&zone->lru_lock, flags);
- break;
- }
- page = list_entry(list->prev, struct page, lru);
- if (busy == page) {
- list_move(&page->lru, list);
- busy = NULL;
- spin_unlock_irqrestore(&zone->lru_lock, flags);
- continue;
- }
- spin_unlock_irqrestore(&zone->lru_lock, flags);
-
- pc = lookup_page_cgroup(page);
-
- if (mem_cgroup_move_parent(page, pc, memcg)) {
- /* found lock contention or "pc" is obsolete. */
- busy = page;
- } else
- busy = NULL;
- cond_resched();
- } while (!list_empty(list));
-}
-
-/*
- * make mem_cgroup's charge to be 0 if there is no task by moving
- * all the charges and pages to the parent.
- * This enables deleting this mem_cgroup.
- *
- * Caller is responsible for holding css reference on the memcg.
- */
-static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
-{
- int node, zid;
-
- do {
- /* This is for making all *used* pages to be on LRU. */
- lru_add_drain_all();
- drain_all_stock_sync(memcg);
- mem_cgroup_start_move(memcg);
- for_each_node_state(node, N_MEMORY) {
- for (zid = 0; zid < MAX_NR_ZONES; zid++) {
- enum lru_list lru;
- for_each_lru(lru) {
- mem_cgroup_force_empty_list(memcg,
- node, zid, lru);
- }
- }
- }
- mem_cgroup_end_move(memcg);
- memcg_oom_recover(memcg);
- cond_resched();
-
- /*
- * Kernel memory may not necessarily be trackable to a specific
- * process. So they are not migrated, and therefore we can't
- * expect their value to drop to 0 here.
- * Having res filled up with kmem only is enough.
- *
- * This is a safety check because mem_cgroup_force_empty_list
- * could have raced with mem_cgroup_replace_page_cache callers
- * so the lru seemed empty but the page could have been added
- * right after the check. RES_USAGE should be safe as we always
- * charge before adding to the LRU.
- */
- } while (page_counter_read(&memcg->memory) -
- page_counter_read(&memcg->kmem) > 0);
-}
-
/*
@@ -5306,7 +5134,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
struct mem_cgroup_event *event, *tmp;
- struct cgroup_subsys_state *iter;
/*
* Unregister events and notify userspace.
@@ -5320,13 +5147,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
}
spin_unlock(&memcg->event_list_lock);
- /*
- * This requires that offlining is serialized. Right now that is
- * guaranteed because css_killed_work_fn() holds the cgroup_mutex.
- */
- css_for_each_descendant_post(iter, css)
- mem_cgroup_reparent_charges(mem_cgroup_from_css(iter));
-
memcg_unregister_all_caches(memcg);
vmpressure_cleanup(&memcg->vmpressure);
}
@@ -5334,42 +5154,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
- /*
- * XXX: css_offline() would be where we should reparent all
- * memory to prepare the cgroup for destruction. However,
- * memcg does not do css_tryget_online() and page_counter charging
- * under the same RCU lock region, which means that charging
- * could race with offlining. Offlining only happens to
- * cgroups with no tasks in them but charges can show up
- * without any tasks from the swapin path when the target
- * memcg is looked up from the swapout record and not from the
- * current task as it usually is. A race like this can leak
- * charges and put pages with stale cgroup pointers into
- *
- * #0 #1
- * lookup_swap_cgroup_id()
- * rcu_read_lock()
- * mem_cgroup_lookup()
- * css_tryget_online()
- * rcu_read_unlock()
- * disable css_tryget_online()
- * call_rcu()
- * offline_css()
- * reparent_charges()
- * page_counter_try_charge()
- * css_put()
- * css_free()
- * pc->mem_cgroup = dead memcg
- * add page to lru
- *
- * The bulk of the charges are still moved in offline_css() to
- * avoid pinning a lot of pages in case a long-term reference
- * like a swapout record is deferring the css_free() to long
- * after offlining. But this makes sure we catch any charges
- */
- mem_cgroup_reparent_charges(memcg);
memcg_destroy_kmem(memcg);
__mem_cgroup_free(memcg);
--
2.1.2
--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Johannes Weiner
2014-10-17 03:02:21 UTC
Permalink
Post by Michal Hocko
Post by Johannes Weiner
On cgroup deletion, outstanding page cache charges are moved to the
parent group so that they're not lost and can be reclaimed during
pressure on/inside said parent. But this reparenting is fairly tricky
and its synchroneous nature has led to several lock-ups in the past.
Since css iterators now also include offlined css, memcg iterators can
be changed to include offlined children during reclaim of a group, and
leftover cache can just stay put.
I think it would be nice to mention c2931b70a32c (cgroup: iterate
cgroup_subsys_states directly) here to have a full context about the
tryget vs tryget_online.
Yes, that commit is probably the most direct dependency.

Andrew, could you update the changelog in place to have that paragraph
read

Since c2931b70a32c ("cgroup: iterate cgroup_subsys_states directly")
css iterators now also include offlined css, so memcg iterators can be
changed to include offlined children during reclaim of a group, and
leftover cache can just stay put.

please? Thanks!
Post by Michal Hocko
Post by Johannes Weiner
There is a slight change of behavior in that charges of deleted groups
no longer show up as local charges in the parent. But they are still
included in the parent's hierarchical statistics.
Thank you for pulling drain_stock cleanup out. This made the patch so
much easier to review.
Thanks you!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Andrew Morton
2014-10-17 04:42:40 UTC
Permalink
Post by Johannes Weiner
Andrew, could you update the changelog in place to have that paragraph
read
Since c2931b70a32c ("cgroup: iterate cgroup_subsys_states directly")
css iterators now also include offlined css, so memcg iterators can be
changed to include offlined children during reclaim of a group, and
leftover cache can just stay put.
Done.
Vladimir Davydov
2014-10-17 08:40:11 UTC
Permalink
Post by Johannes Weiner
On cgroup deletion, outstanding page cache charges are moved to the
parent group so that they're not lost and can be reclaimed during
pressure on/inside said parent. But this reparenting is fairly tricky
and its synchroneous nature has led to several lock-ups in the past.
Since css iterators now also include offlined css, memcg iterators can
be changed to include offlined children during reclaim of a group, and
leftover cache can just stay put.
There is a slight change of behavior in that charges of deleted groups
no longer show up as local charges in the parent. But they are still
included in the parent's hierarchical statistics.
---
mm/memcontrol.c | 218 +-------------------------------------------------------
1 file changed, 1 insertion(+), 217 deletions(-)
I do like the stats :-) However, as I've already mentioned, on big
machines we can end up with hundred of thousands of dead css's.
Iterating over all of them during reclaim may result in noticeable lags.
One day we'll have to do something about that I guess.

Another issue is that AFAICT currently we can't have more than 64K
cgroups due to the MEM_CGROUP_ID_MAX limit. The limit exists, because we
use css ids for tagging swap entries and we don't want to spend too much
memory on this. May be, we should simply use the mem_cgroup pointer
instead of the css id?

OTOH, the reparenting code looks really ugly. And we can't easily
reparent swap and kmem. So I think it's a reasonable change.

Acked-by: Vladimir Davydov <***@parallels.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Michal Hocko
2014-10-17 14:00:22 UTC
Permalink
Post by Vladimir Davydov
Post by Johannes Weiner
On cgroup deletion, outstanding page cache charges are moved to the
parent group so that they're not lost and can be reclaimed during
pressure on/inside said parent. But this reparenting is fairly tricky
and its synchroneous nature has led to several lock-ups in the past.
Since css iterators now also include offlined css, memcg iterators can
be changed to include offlined children during reclaim of a group, and
leftover cache can just stay put.
There is a slight change of behavior in that charges of deleted groups
no longer show up as local charges in the parent. But they are still
included in the parent's hierarchical statistics.
---
mm/memcontrol.c | 218 +-------------------------------------------------------
1 file changed, 1 insertion(+), 217 deletions(-)
I do like the stats :-) However, as I've already mentioned, on big
machines we can end up with hundred of thousands of dead css's.
css->id is bound to the css life so this is bound to the maximum number
of allowed cgroups AFAIR. It is true that dead memcgs might block
creation of new. This is a good point. It would be a problem either when
there is no reclaim (global or memcg) or when groups are very short
lived. One possible way out would be counting dead memcgs and kick
background mem_cgroup_force_empty loop over those that are dead once we
hit a threshold. This should be pretty trivial to implement.
Post by Vladimir Davydov
Iterating over all of them during reclaim may result in noticeable lags.
One day we'll have to do something about that I guess.
Another issue is that AFAICT currently we can't have more than 64K
cgroups due to the MEM_CGROUP_ID_MAX limit.The limit exists, because we
use css ids for tagging swap entries and we don't want to spend too much
memory on this. May be, we should simply use the mem_cgroup pointer
instead of the css id?
We are using the id to reduce the memory footprint. We cannot effort 8B
per each swappage (we can have GBs of swap space in the system).
Post by Vladimir Davydov
OTOH, the reparenting code looks really ugly. And we can't easily
reparent swap and kmem. So I think it's a reasonable change.
At least swap shouldn't be a big deal. Hugh already had a patch for
that. You would simply have to go over all swap entries and change the
id. kmem should be doable as well as you have already shown in your
patches. The main question is. Do we really need it? I think we are good
now and should make the code more complicated once this starts being a
practical problem.
Post by Vladimir Davydov
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
see: http://www.linux-mm.org/ .
--
Michal Hocko
SUSE Labs
Vladimir Davydov
2014-10-17 15:20:07 UTC
Permalink
Post by Michal Hocko
Post by Vladimir Davydov
Post by Johannes Weiner
On cgroup deletion, outstanding page cache charges are moved to the
parent group so that they're not lost and can be reclaimed during
pressure on/inside said parent. But this reparenting is fairly tricky
and its synchroneous nature has led to several lock-ups in the past.
Since css iterators now also include offlined css, memcg iterators can
be changed to include offlined children during reclaim of a group, and
leftover cache can just stay put.
There is a slight change of behavior in that charges of deleted groups
no longer show up as local charges in the parent. But they are still
included in the parent's hierarchical statistics.
---
mm/memcontrol.c | 218 +-------------------------------------------------------
1 file changed, 1 insertion(+), 217 deletions(-)
I do like the stats :-) However, as I've already mentioned, on big
machines we can end up with hundred of thousands of dead css's.
css->id is bound to the css life so this is bound to the maximum number
of allowed cgroups AFAIR. It is true that dead memcgs might block
creation of new. This is a good point. It would be a problem either when
there is no reclaim (global or memcg) or when groups are very short
lived. One possible way out would be counting dead memcgs and kick
background mem_cgroup_force_empty loop over those that are dead once we
hit a threshold. This should be pretty trivial to implement.
Right, this shouldn't be a problem.
Post by Michal Hocko
Post by Vladimir Davydov
Iterating over all of them during reclaim may result in noticeable lags.
One day we'll have to do something about that I guess.
Another issue is that AFAICT currently we can't have more than 64K
cgroups due to the MEM_CGROUP_ID_MAX limit.The limit exists, because we
use css ids for tagging swap entries and we don't want to spend too much
memory on this. May be, we should simply use the mem_cgroup pointer
instead of the css id?
We are using the id to reduce the memory footprint. We cannot effort 8B
per each swappage (we can have GBs of swap space in the system).
From my experience nowadays most servers have much more RAM than swap
space, that's why I'm wondering if it really makes any difference
to have 2 bytes allocated per swap entry vs 8 bytes.
Post by Michal Hocko
Post by Vladimir Davydov
OTOH, the reparenting code looks really ugly. And we can't easily
reparent swap and kmem. So I think it's a reasonable change.
At least swap shouldn't be a big deal. Hugh already had a patch for
that. You would simply have to go over all swap entries and change the
id. kmem should be doable as well as you have already shown in your
patches. The main question is. Do we really need it? I think we are good
now and should make the code more complicated once this starts being a
practical problem.
For kmem it isn't that simple. We can't merge slab caches due to their
high performance nature, so we have no choice rather having them hanging
around after css death attached either to the dead css or some other
intermediate structure (a kind of kmem context). But I agree there's no
point bothering until nobody complains.

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Loading...