Discussion:
[patch 0/3] mm: memcontrol: lockless page counters v3
Johannes Weiner
2014-10-14 01:46:00 UTC
Permalink
Hi,

this series replaces the spinlock_irq-protected 64-bit res_counters
with lockless word-sized page counters. This improves memory cgroup
scalability on the higher end and gets rid of 64-bit math on 32-bit
machines in the core accounting, but also generally simplifies the
counter code and interface.

Version 3 has a few more touch-ups based on reviews from Michal and
Vladimir:

- documented ordering between limit() and try_charge() [vladimir]
- removed charge revert during uncharge underflow [michal]
- reworked limit update code flow [michal]
- removed rounding-up in limit setting [michal]
- fixed mem_cgroup_margin() memsw calculation [michal]
- updated page_counter raciness comments [vladimir]

Version 2:

cleanups:
- moved new page_counter API to its own file [vladimir, michal]
- documented page counter API [vladimir]
- documented acceptable race conditions [vladimir]
- split out res_counter removal to reduce patch size [vladimir]
- split out hugetlb controller conversion to reduce patch size
- split page_counter_charge and page_counter_try_charge [vladimir]
- wrapped signed-to-unsigned read in page_counter_read() [vladimir]
- wrapped watermark reset in page_counter_reset_watermark() [vladimir]
- reverted counter->limited back to counter->failcnt [vladimir]
- changed underflow to WARN_ON_ONCE and counter revert [kame, vladimir]

fixes:
- fixed kmem's value for unlimited [vladimir]
- fixed page_counter_cancel() return value [vladimir]
- based page counter range on atomic_long_t's max [vladimir]
- fixed tcp memcontrol's usage reporting [vladimir]
- fixed hugepage limit page alignment [vladimir]
- fixed page_counter_limit() serialization [vladimir]

optimizations:
- converted page_counter_try_charge() from CAS to FAA [vladimir]

Documentation/cgroups/hugetlb.txt | 2 +-
Documentation/cgroups/memory.txt | 4 +-
Documentation/cgroups/resource_counter.txt | 197 ---------
include/linux/hugetlb_cgroup.h | 1 -
include/linux/memcontrol.h | 5 +-
include/linux/page_counter.h | 49 +++
include/linux/res_counter.h | 223 ----------
include/net/sock.h | 26 +-
init/Kconfig | 12 +-
kernel/Makefile | 1 -
kernel/res_counter.c | 211 ---------
mm/Makefile | 1 +
mm/hugetlb_cgroup.c | 103 +++--
mm/memcontrol.c | 633 +++++++++++++--------------
mm/page_counter.c | 203 +++++++++
net/ipv4/tcp_memcontrol.c | 87 ++--
16 files changed, 669 insertions(+), 1089 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Johannes Weiner
2014-10-14 01:46:01 UTC
Permalink
Memory is internally accounted in bytes, using spinlock-protected
64-bit counters, even though the smallest accounting delta is a page.
The counter interface is also convoluted and does too many things.

Introduce a new lockless word-sized page counter API, then change all
memory accounting over to it. The translation from and to bytes then
only happens when interfacing with userspace.

The removed locking overhead is noticable when scaling beyond the
per-cpu charge caches - on a 4-socket machine with 144-threads, the
following test shows the performance differences of 288 memcgs
concurrently running a page fault benchmark:

vanilla:

18631648.500498 task-clock (msec) # 140.643 CPUs utilized ( +- 0.33% )
1,380,638 context-switches # 0.074 K/sec ( +- 0.75% )
24,390 cpu-migrations # 0.001 K/sec ( +- 8.44% )
1,843,305,768 page-faults # 0.099 M/sec ( +- 0.00% )
50,134,994,088,218 cycles # 2.691 GHz ( +- 0.33% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
8,049,712,224,651 instructions # 0.16 insns per cycle ( +- 0.04% )
1,586,970,584,979 branches # 85.176 M/sec ( +- 0.05% )
1,724,989,949 branch-misses # 0.11% of all branches ( +- 0.48% )

132.474343877 seconds time elapsed ( +- 0.21% )

lockless:

12195979.037525 task-clock (msec) # 133.480 CPUs utilized ( +- 0.18% )
832,850 context-switches # 0.068 K/sec ( +- 0.54% )
15,624 cpu-migrations # 0.001 K/sec ( +- 10.17% )
1,843,304,774 page-faults # 0.151 M/sec ( +- 0.00% )
32,811,216,801,141 cycles # 2.690 GHz ( +- 0.18% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
9,999,265,091,727 instructions # 0.30 insns per cycle ( +- 0.10% )
2,076,759,325,203 branches # 170.282 M/sec ( +- 0.12% )
1,656,917,214 branch-misses # 0.08% of all branches ( +- 0.55% )

91.369330729 seconds time elapsed ( +- 0.45% )

On top of improved scalability, this also gets rid of the icky long
long types in the very heart of memcg, which is great for 32 bit and
also makes the code a lot more readable.

Notable differences between the old and new API:

- res_counter_charge() and res_counter_charge_nofail() become
page_counter_try_charge() and page_counter_charge() resp. to match
the more common kernel naming scheme of try_do()/do()

- res_counter_uncharge_until() is only ever used to cancel a local
counter and never to uncharge bigger segments of a hierarchy, so
it's replaced by the simpler page_counter_cancel()

- res_counter_set_limit() is replaced by page_counter_limit(), which
expects its callers to serialize against themselves

- res_counter_memparse_write_strategy() is replaced by
page_counter_limit(), which rounds down to the nearest page size -
rather than up. This is more reasonable for explicitely requested
hard upper limits.

- to keep charging light-weight, page_counter_try_charge() charges
speculatively, only to roll back if the result exceeds the limit.
Because of this, a failing bigger charge can temporarily lock out
smaller charges that would otherwise succeed. The error is bounded
to the difference between the smallest and the biggest possible
charge size, so for memcg, this means that a failing THP charge can
send base page charges into reclaim upto 2MB (4MB) before the limit
would have been reached. This should be acceptable.

Signed-off-by: Johannes Weiner <***@cmpxchg.org>
---
Documentation/cgroups/memory.txt | 4 +-
include/linux/memcontrol.h | 5 +-
include/linux/page_counter.h | 49 +++
include/net/sock.h | 26 +-
init/Kconfig | 5 +-
mm/Makefile | 1 +
mm/memcontrol.c | 633 ++++++++++++++++++---------------------
mm/page_counter.c | 203 +++++++++++++
net/ipv4/tcp_memcontrol.c | 87 +++---
9 files changed, 609 insertions(+), 404 deletions(-)
create mode 100644 include/linux/page_counter.h
create mode 100644 mm/page_counter.c

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 02ab997a1ed2..f624727ab404 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -52,9 +52,9 @@ Brief summary of control files.
tasks # attach a task(thread) and show list of threads
cgroup.procs # show list of processes
cgroup.event_control # an interface for event_fd()
- memory.usage_in_bytes # show current res_counter usage for memory
+ memory.usage_in_bytes # show current usage for memory
(See 5.5 for details)
- memory.memsw.usage_in_bytes # show current res_counter usage for memory+Swap
+ memory.memsw.usage_in_bytes # show current usage for memory+Swap
(See 5.5 for details)
memory.limit_in_bytes # set/show limit of memory usage
memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 19df5d857411..0daf383f8f1c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -471,9 +471,8 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
/*
* __GFP_NOFAIL allocations will move on even if charging is not
* possible. Therefore we don't even try, and have this allocation
- * unaccounted. We could in theory charge it with
- * res_counter_charge_nofail, but we hope those allocations are rare,
- * and won't be worth the trouble.
+ * unaccounted. We could in theory charge it forcibly, but we hope
+ * those allocations are rare, and won't be worth the trouble.
*/
if (gfp & __GFP_NOFAIL)
return true;
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
new file mode 100644
index 000000000000..d92d18949474
--- /dev/null
+++ b/include/linux/page_counter.h
@@ -0,0 +1,49 @@
+#ifndef _LINUX_PAGE_COUNTER_H
+#define _LINUX_PAGE_COUNTER_H
+
+#include <linux/atomic.h>
+
+struct page_counter {
+ atomic_long_t count;
+ unsigned long limit;
+ struct page_counter *parent;
+
+ /* legacy */
+ unsigned long watermark;
+ unsigned long failcnt;
+};
+
+#if BITS_PER_LONG == 32
+#define PAGE_COUNTER_MAX LONG_MAX
+#else
+#define PAGE_COUNTER_MAX (LONG_MAX / PAGE_SIZE)
+#endif
+
+static inline void page_counter_init(struct page_counter *counter,
+ struct page_counter *parent)
+{
+ atomic_long_set(&counter->count, 0);
+ counter->limit = PAGE_COUNTER_MAX;
+ counter->parent = parent;
+}
+
+static inline unsigned long page_counter_read(struct page_counter *counter)
+{
+ return atomic_long_read(&counter->count);
+}
+
+int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
+void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
+int page_counter_try_charge(struct page_counter *counter,
+ unsigned long nr_pages,
+ struct page_counter **fail);
+int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
+int page_counter_limit(struct page_counter *counter, unsigned long limit);
+int page_counter_memparse(const char *buf, unsigned long *nr_pages);
+
+static inline void page_counter_reset_watermark(struct page_counter *counter)
+{
+ counter->watermark = page_counter_read(counter);
+}
+
+#endif /* _LINUX_PAGE_COUNTER_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index 7db3db112baa..0fa8cf908b65 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -54,8 +54,8 @@
#include <linux/security.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
+#include <linux/page_counter.h>
#include <linux/memcontrol.h>
-#include <linux/res_counter.h>
#include <linux/static_key.h>
#include <linux/aio.h>
#include <linux/sched.h>
@@ -1061,7 +1061,7 @@ enum cg_proto_flags {
};

struct cg_proto {
- struct res_counter memory_allocated; /* Current allocated memory. */
+ struct page_counter memory_allocated; /* Current allocated memory. */
struct percpu_counter sockets_allocated; /* Current number of sockets. */
int memory_pressure;
long sysctl_mem[3];
@@ -1213,34 +1213,26 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
unsigned long amt,
int *parent_status)
{
- struct res_counter *fail;
- int ret;
+ page_counter_charge(&prot->memory_allocated, amt);

- ret = res_counter_charge_nofail(&prot->memory_allocated,
- amt << PAGE_SHIFT, &fail);
- if (ret < 0)
+ if (page_counter_read(&prot->memory_allocated) >
+ prot->memory_allocated.limit)
*parent_status = OVER_LIMIT;
}

static inline void memcg_memory_allocated_sub(struct cg_proto *prot,
unsigned long amt)
{
- res_counter_uncharge(&prot->memory_allocated, amt << PAGE_SHIFT);
-}
-
-static inline u64 memcg_memory_allocated_read(struct cg_proto *prot)
-{
- u64 ret;
- ret = res_counter_read_u64(&prot->memory_allocated, RES_USAGE);
- return ret >> PAGE_SHIFT;
+ page_counter_uncharge(&prot->memory_allocated, amt);
}

static inline long
sk_memory_allocated(const struct sock *sk)
{
struct proto *prot = sk->sk_prot;
+
if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
- return memcg_memory_allocated_read(sk->sk_cgrp);
+ return page_counter_read(&sk->sk_cgrp->memory_allocated);

return atomic_long_read(prot->memory_allocated);
}
@@ -1254,7 +1246,7 @@ sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status);
/* update the root cgroup regardless */
atomic_long_add_return(amt, prot->memory_allocated);
- return memcg_memory_allocated_read(sk->sk_cgrp);
+ return page_counter_read(&sk->sk_cgrp->memory_allocated);
}

return atomic_long_add_return(amt, prot->memory_allocated);
diff --git a/init/Kconfig b/init/Kconfig
index b454b162d2df..3fc577e9de75 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -985,9 +985,12 @@ config RESOURCE_COUNTERS
This option enables controller independent resource accounting
infrastructure that works with cgroups.

+config PAGE_COUNTER
+ bool
+
config MEMCG
bool "Memory Resource Controller for Control Groups"
- depends on RESOURCE_COUNTERS
+ select PAGE_COUNTER
select EVENTFD
help
Provides a memory resource controller that manages both anonymous
diff --git a/mm/Makefile b/mm/Makefile
index ba3ec4e1a66e..27ddb80403a9 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -55,6 +55,7 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
+obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o vmpressure.o
obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 23976fd885fd..b62972c80055 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -25,7 +25,7 @@
* GNU General Public License for more details.
*/

-#include <linux/res_counter.h>
+#include <linux/page_counter.h>
#include <linux/memcontrol.h>
#include <linux/cgroup.h>
#include <linux/mm.h>
@@ -165,7 +165,7 @@ struct mem_cgroup_per_zone {
struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];

struct rb_node tree_node; /* RB tree node */
- unsigned long long usage_in_excess;/* Set to the value by which */
+ unsigned long usage_in_excess;/* Set to the value by which */
/* the soft limit is exceeded*/
bool on_tree;
struct mem_cgroup *memcg; /* Back pointer, we cannot */
@@ -198,7 +198,7 @@ static struct mem_cgroup_tree soft_limit_tree __read_mostly;

struct mem_cgroup_threshold {
struct eventfd_ctx *eventfd;
- u64 threshold;
+ unsigned long threshold;
};

/* For threshold */
@@ -284,10 +284,13 @@ static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
*/
struct mem_cgroup {
struct cgroup_subsys_state css;
- /*
- * the counter to account for memory usage
- */
- struct res_counter res;
+
+ /* Accounted resources */
+ struct page_counter memory;
+ struct page_counter memsw;
+ struct page_counter kmem;
+
+ unsigned long soft_limit;

/* vmpressure notifications */
struct vmpressure vmpressure;
@@ -296,15 +299,6 @@ struct mem_cgroup {
int initialized;

/*
- * the counter to account for mem+swap usage.
- */
- struct res_counter memsw;
-
- /*
- * the counter to account for kernel memory usage.
- */
- struct res_counter kmem;
- /*
* Should the accounting and control be hierarchical, per subtree?
*/
bool use_hierarchy;
@@ -650,7 +644,7 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
* This check can't live in kmem destruction function,
* since the charges will outlive the cgroup
*/
- WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
+ WARN_ON(page_counter_read(&memcg->kmem));
}
#else
static void disarm_kmem_keys(struct mem_cgroup *memcg)
@@ -706,7 +700,7 @@ soft_limit_tree_from_page(struct page *page)

static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_zone *mz,
struct mem_cgroup_tree_per_zone *mctz,
- unsigned long long new_usage_in_excess)
+ unsigned long new_usage_in_excess)
{
struct rb_node **p = &mctz->rb_root.rb_node;
struct rb_node *parent = NULL;
@@ -755,10 +749,21 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
spin_unlock_irqrestore(&mctz->lock, flags);
}

+static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
+{
+ unsigned long nr_pages = page_counter_read(&memcg->memory);
+ unsigned long soft_limit = ACCESS_ONCE(memcg->soft_limit);
+ unsigned long excess = 0;
+
+ if (nr_pages > soft_limit)
+ excess = nr_pages - soft_limit;
+
+ return excess;
+}

static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
{
- unsigned long long excess;
+ unsigned long excess;
struct mem_cgroup_per_zone *mz;
struct mem_cgroup_tree_per_zone *mctz;

@@ -769,7 +774,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
*/
for (; memcg; memcg = parent_mem_cgroup(memcg)) {
mz = mem_cgroup_page_zoneinfo(memcg, page);
- excess = res_counter_soft_limit_excess(&memcg->res);
+ excess = soft_limit_excess(memcg);
/*
* We have to update the tree if mz is on RB-tree or
* mem is over its softlimit.
@@ -825,7 +830,7 @@ retry:
* position in the tree.
*/
__mem_cgroup_remove_exceeded(mz, mctz);
- if (!res_counter_soft_limit_excess(&mz->memcg->res) ||
+ if (!soft_limit_excess(mz->memcg) ||
!css_tryget_online(&mz->memcg->css))
goto retry;
done:
@@ -1492,7 +1497,7 @@ int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
return inactive * inactive_ratio < active;
}

-#define mem_cgroup_from_res_counter(counter, member) \
+#define mem_cgroup_from_counter(counter, member) \
container_of(counter, struct mem_cgroup, member)

/**
@@ -1504,12 +1509,23 @@ int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
*/
static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
{
- unsigned long long margin;
+ unsigned long margin = 0;
+ unsigned long count;
+ unsigned long limit;

- margin = res_counter_margin(&memcg->res);
- if (do_swap_account)
- margin = min(margin, res_counter_margin(&memcg->memsw));
- return margin >> PAGE_SHIFT;
+ count = page_counter_read(&memcg->memory);
+ limit = ACCESS_ONCE(memcg->memory.limit);
+ if (count < limit)
+ margin = limit - count;
+
+ if (do_swap_account) {
+ count = page_counter_read(&memcg->memsw);
+ limit = ACCESS_ONCE(memcg->memsw.limit);
+ if (count <= limit)
+ margin = min(margin, limit - count);
+ }
+
+ return margin;
}

int mem_cgroup_swappiness(struct mem_cgroup *memcg)
@@ -1650,18 +1666,15 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)

rcu_read_unlock();

- pr_info("memory: usage %llukB, limit %llukB, failcnt %llu\n",
- res_counter_read_u64(&memcg->res, RES_USAGE) >> 10,
- res_counter_read_u64(&memcg->res, RES_LIMIT) >> 10,
- res_counter_read_u64(&memcg->res, RES_FAILCNT));
- pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %llu\n",
- res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
- res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
- res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
- pr_info("kmem: usage %llukB, limit %llukB, failcnt %llu\n",
- res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
- res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
- res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
+ pr_info("memory: usage %llukB, limit %llukB, failcnt %lu\n",
+ K((u64)page_counter_read(&memcg->memory)),
+ K((u64)memcg->memory.limit), memcg->memory.failcnt);
+ pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %lu\n",
+ K((u64)page_counter_read(&memcg->memsw)),
+ K((u64)memcg->memsw.limit), memcg->memsw.failcnt);
+ pr_info("kmem: usage %llukB, limit %llukB, failcnt %lu\n",
+ K((u64)page_counter_read(&memcg->kmem)),
+ K((u64)memcg->kmem.limit), memcg->kmem.failcnt);

for_each_mem_cgroup_tree(iter, memcg) {
pr_info("Memory cgroup stats for ");
@@ -1701,28 +1714,17 @@ static int mem_cgroup_count_children(struct mem_cgroup *memcg)
/*
* Return the memory (and swap, if configured) limit for a memcg.
*/
-static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
+static unsigned long mem_cgroup_get_limit(struct mem_cgroup *memcg)
{
- u64 limit;
+ unsigned long limit;

- limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
-
- /*
- * Do not consider swap space if we cannot swap due to swappiness
- */
+ limit = memcg->memory.limit;
if (mem_cgroup_swappiness(memcg)) {
- u64 memsw;
+ unsigned long memsw_limit;

- limit += total_swap_pages << PAGE_SHIFT;
- memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
-
- /*
- * If memsw is finite and limits the amount of swap space
- * available to this memcg, return that limit.
- */
- limit = min(limit, memsw);
+ memsw_limit = memcg->memsw.limit;
+ limit = min(limit + total_swap_pages, memsw_limit);
}
-
return limit;
}

@@ -1746,7 +1748,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
}

check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
- totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
+ totalpages = mem_cgroup_get_limit(memcg) ? : 1;
for_each_mem_cgroup_tree(iter, memcg) {
struct css_task_iter it;
struct task_struct *task;
@@ -1949,7 +1951,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
.priority = 0,
};

- excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
+ excess = soft_limit_excess(root_memcg);

while (1) {
victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
@@ -1980,7 +1982,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
zone, &nr_scanned);
*total_scanned += nr_scanned;
- if (!res_counter_soft_limit_excess(&root_memcg->res))
+ if (!soft_limit_excess(root_memcg))
break;
}
mem_cgroup_iter_break(root_memcg, victim);
@@ -2307,33 +2309,31 @@ static DEFINE_MUTEX(percpu_charge_mutex);
static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
struct memcg_stock_pcp *stock;
- bool ret = true;
+ bool ret = false;

if (nr_pages > CHARGE_BATCH)
- return false;
+ return ret;

stock = &get_cpu_var(memcg_stock);
- if (memcg == stock->cached && stock->nr_pages >= nr_pages)
+ if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
stock->nr_pages -= nr_pages;
- else /* need to call res_counter_charge */
- ret = false;
+ ret = true;
+ }
put_cpu_var(memcg_stock);
return ret;
}

/*
- * Returns stocks cached in percpu to res_counter and reset cached information.
+ * Returns stocks cached in percpu and reset cached information.
*/
static void drain_stock(struct memcg_stock_pcp *stock)
{
struct mem_cgroup *old = stock->cached;

if (stock->nr_pages) {
- unsigned long bytes = stock->nr_pages * PAGE_SIZE;
-
- res_counter_uncharge(&old->res, bytes);
+ page_counter_uncharge(&old->memory, stock->nr_pages);
if (do_swap_account)
- res_counter_uncharge(&old->memsw, bytes);
+ page_counter_uncharge(&old->memsw, stock->nr_pages);
stock->nr_pages = 0;
}
stock->cached = NULL;
@@ -2362,7 +2362,7 @@ static void __init memcg_stock_init(void)
}

/*
- * Cache charges(val) which is from res_counter, to local per_cpu area.
+ * Cache charges(val) to local per_cpu area.
* This will be consumed by consume_stock() function, later.
*/
static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
@@ -2422,8 +2422,7 @@ out:
/*
* Tries to drain stocked charges in other cpus. This function is asynchronous
* and just put a work per cpu for draining localy on each cpu. Caller can
- * expects some charges will be back to res_counter later but cannot wait for
- * it.
+ * expects some charges will be back later but cannot wait for it.
*/
static void drain_all_stock_async(struct mem_cgroup *root_memcg)
{
@@ -2497,9 +2496,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int batch = max(CHARGE_BATCH, nr_pages);
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct mem_cgroup *mem_over_limit;
- struct res_counter *fail_res;
+ struct page_counter *counter;
unsigned long nr_reclaimed;
- unsigned long long size;
bool may_swap = true;
bool drained = false;
int ret = 0;
@@ -2510,16 +2508,15 @@ retry:
if (consume_stock(memcg, nr_pages))
goto done;

- size = batch * PAGE_SIZE;
if (!do_swap_account ||
- !res_counter_charge(&memcg->memsw, size, &fail_res)) {
- if (!res_counter_charge(&memcg->res, size, &fail_res))
+ !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
+ if (!page_counter_try_charge(&memcg->memory, batch, &counter))
goto done_restock;
if (do_swap_account)
- res_counter_uncharge(&memcg->memsw, size);
- mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
+ page_counter_uncharge(&memcg->memsw, batch);
+ mem_over_limit = mem_cgroup_from_counter(counter, memory);
} else {
- mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
+ mem_over_limit = mem_cgroup_from_counter(counter, memsw);
may_swap = false;
}

@@ -2602,32 +2599,12 @@ done:

static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
{
- unsigned long bytes = nr_pages * PAGE_SIZE;
-
if (mem_cgroup_is_root(memcg))
return;

- res_counter_uncharge(&memcg->res, bytes);
+ page_counter_uncharge(&memcg->memory, nr_pages);
if (do_swap_account)
- res_counter_uncharge(&memcg->memsw, bytes);
-}
-
-/*
- * Cancel chrages in this cgroup....doesn't propagate to parent cgroup.
- * This is useful when moving usage to parent cgroup.
- */
-static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
- unsigned int nr_pages)
-{
- unsigned long bytes = nr_pages * PAGE_SIZE;
-
- if (mem_cgroup_is_root(memcg))
- return;
-
- res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes);
- if (do_swap_account)
- res_counter_uncharge_until(&memcg->memsw,
- memcg->memsw.parent, bytes);
+ page_counter_uncharge(&memcg->memsw, nr_pages);
}

/*
@@ -2751,8 +2728,6 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
unlock_page_lru(page, isolated);
}

-static DEFINE_MUTEX(set_limit_mutex);
-
#ifdef CONFIG_MEMCG_KMEM
/*
* The memcg_slab_mutex is held whenever a per memcg kmem cache is created or
@@ -2795,16 +2770,17 @@ static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v)
}
#endif

-static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
+static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp,
+ unsigned long nr_pages)
{
- struct res_counter *fail_res;
+ struct page_counter *counter;
int ret = 0;

- ret = res_counter_charge(&memcg->kmem, size, &fail_res);
- if (ret)
+ ret = page_counter_try_charge(&memcg->kmem, nr_pages, &counter);
+ if (ret < 0)
return ret;

- ret = try_charge(memcg, gfp, size >> PAGE_SHIFT);
+ ret = try_charge(memcg, gfp, nr_pages);
if (ret == -EINTR) {
/*
* try_charge() chose to bypass to root due to OOM kill or
@@ -2821,25 +2797,25 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
* when the allocation triggers should have been already
* directed to the root cgroup in memcontrol.h
*/
- res_counter_charge_nofail(&memcg->res, size, &fail_res);
+ page_counter_charge(&memcg->memory, nr_pages);
if (do_swap_account)
- res_counter_charge_nofail(&memcg->memsw, size,
- &fail_res);
+ page_counter_charge(&memcg->memsw, nr_pages);
ret = 0;
} else if (ret)
- res_counter_uncharge(&memcg->kmem, size);
+ page_counter_uncharge(&memcg->kmem, nr_pages);

return ret;
}

-static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
+static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
+ unsigned long nr_pages)
{
- res_counter_uncharge(&memcg->res, size);
+ page_counter_uncharge(&memcg->memory, nr_pages);
if (do_swap_account)
- res_counter_uncharge(&memcg->memsw, size);
+ page_counter_uncharge(&memcg->memsw, nr_pages);

/* Not down to 0 */
- if (res_counter_uncharge(&memcg->kmem, size))
+ if (page_counter_uncharge(&memcg->kmem, nr_pages))
return;

/*
@@ -3115,19 +3091,21 @@ static void memcg_schedule_register_cache(struct mem_cgroup *memcg,

int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order)
{
+ unsigned int nr_pages = 1 << order;
int res;

- res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp,
- PAGE_SIZE << order);
+ res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp, nr_pages);
if (!res)
- atomic_add(1 << order, &cachep->memcg_params->nr_pages);
+ atomic_add(nr_pages, &cachep->memcg_params->nr_pages);
return res;
}

void __memcg_uncharge_slab(struct kmem_cache *cachep, int order)
{
- memcg_uncharge_kmem(cachep->memcg_params->memcg, PAGE_SIZE << order);
- atomic_sub(1 << order, &cachep->memcg_params->nr_pages);
+ unsigned int nr_pages = 1 << order;
+
+ memcg_uncharge_kmem(cachep->memcg_params->memcg, nr_pages);
+ atomic_sub(nr_pages, &cachep->memcg_params->nr_pages);
}

/*
@@ -3248,7 +3226,7 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
return true;
}

- ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order);
+ ret = memcg_charge_kmem(memcg, gfp, 1 << order);
if (!ret)
*_memcg = memcg;

@@ -3265,7 +3243,7 @@ void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,

/* The page allocation failed. Revert */
if (!page) {
- memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
+ memcg_uncharge_kmem(memcg, 1 << order);
return;
}
/*
@@ -3298,7 +3276,7 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
return;

VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
- memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
+ memcg_uncharge_kmem(memcg, 1 << order);
}
#else
static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg)
@@ -3476,8 +3454,12 @@ static int mem_cgroup_move_parent(struct page *page,

ret = mem_cgroup_move_account(page, nr_pages,
pc, child, parent);
- if (!ret)
- __mem_cgroup_cancel_local_charge(child, nr_pages);
+ if (!ret) {
+ /* Take charge off the local counters */
+ page_counter_cancel(&child->memory, nr_pages);
+ if (do_swap_account)
+ page_counter_cancel(&child->memsw, nr_pages);
+ }

if (nr_pages > 1)
compound_unlock_irqrestore(page, flags);
@@ -3507,7 +3489,7 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
*
* Returns 0 on success, -EINVAL on failure.
*
- * The caller must have charged to @to, IOW, called res_counter_charge() about
+ * The caller must have charged to @to, IOW, called page_counter_charge() about
* both res and memsw, and called css_get().
*/
static int mem_cgroup_move_swap_account(swp_entry_t entry,
@@ -3523,7 +3505,7 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
mem_cgroup_swap_statistics(to, true);
/*
* This function is only called from task migration context now.
- * It postpones res_counter and refcount handling till the end
+ * It postpones page_counter and refcount handling till the end
* of task migration(mem_cgroup_clear_mc()) for performance
* improvement. But we cannot postpone css_get(to) because if
* the process that has been moved to @to does swap-in, the
@@ -3581,60 +3563,57 @@ void mem_cgroup_print_bad_page(struct page *page)
}
#endif

+static DEFINE_MUTEX(memcg_limit_mutex);
+
static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
- unsigned long long val)
+ unsigned long limit)
{
+ unsigned long curusage;
+ unsigned long oldusage;
+ bool enlarge = false;
int retry_count;
- int ret = 0;
- int children = mem_cgroup_count_children(memcg);
- u64 curusage, oldusage;
- int enlarge;
+ int ret;

/*
* For keeping hierarchical_reclaim simple, how long we should retry
* is depends on callers. We set our retry-count to be function
* of # of children which we should visit in this loop.
*/
- retry_count = MEM_CGROUP_RECLAIM_RETRIES * children;
+ retry_count = MEM_CGROUP_RECLAIM_RETRIES *
+ mem_cgroup_count_children(memcg);

- oldusage = res_counter_read_u64(&memcg->res, RES_USAGE);
+ oldusage = page_counter_read(&memcg->memory);

- enlarge = 0;
- while (retry_count) {
+ do {
if (signal_pending(current)) {
ret = -EINTR;
break;
}
- /*
- * Rather than hide all in some function, I do this in
- * open coded manner. You see what this really does.
- * We have to guarantee memcg->res.limit <= memcg->memsw.limit.
- */
- mutex_lock(&set_limit_mutex);
- if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val) {
+
+ mutex_lock(&memcg_limit_mutex);
+ if (limit > memcg->memsw.limit) {
+ mutex_unlock(&memcg_limit_mutex);
ret = -EINVAL;
- mutex_unlock(&set_limit_mutex);
break;
}
-
- if (res_counter_read_u64(&memcg->res, RES_LIMIT) < val)
- enlarge = 1;
-
- ret = res_counter_set_limit(&memcg->res, val);
- mutex_unlock(&set_limit_mutex);
+ if (limit > memcg->memory.limit)
+ enlarge = true;
+ ret = page_counter_limit(&memcg->memory, limit);
+ mutex_unlock(&memcg_limit_mutex);

if (!ret)
break;

try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true);

- curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
+ curusage = page_counter_read(&memcg->memory);
/* Usage is reduced ? */
if (curusage >= oldusage)
retry_count--;
else
oldusage = curusage;
- }
+ } while (retry_count);
+
if (!ret && enlarge)
memcg_oom_recover(memcg);

@@ -3642,52 +3621,53 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
}

static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
- unsigned long long val)
+ unsigned long limit)
{
+ unsigned long curusage;
+ unsigned long oldusage;
+ bool enlarge = false;
int retry_count;
- u64 oldusage, curusage;
- int children = mem_cgroup_count_children(memcg);
- int ret = -EBUSY;
- int enlarge = 0;
+ int ret;

/* see mem_cgroup_resize_res_limit */
- retry_count = children * MEM_CGROUP_RECLAIM_RETRIES;
- oldusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
- while (retry_count) {
+ retry_count = MEM_CGROUP_RECLAIM_RETRIES *
+ mem_cgroup_count_children(memcg);
+
+ oldusage = page_counter_read(&memcg->memsw);
+
+ do {
if (signal_pending(current)) {
ret = -EINTR;
break;
}
- /*
- * Rather than hide all in some function, I do this in
- * open coded manner. You see what this really does.
- * We have to guarantee memcg->res.limit <= memcg->memsw.limit.
- */
- mutex_lock(&set_limit_mutex);
- if (res_counter_read_u64(&memcg->res, RES_LIMIT) > val) {
+
+ mutex_lock(&memcg_limit_mutex);
+ if (limit < memcg->memory.limit) {
+ mutex_unlock(&memcg_limit_mutex);
ret = -EINVAL;
- mutex_unlock(&set_limit_mutex);
break;
}
- if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val)
- enlarge = 1;
- ret = res_counter_set_limit(&memcg->memsw, val);
- mutex_unlock(&set_limit_mutex);
+ if (limit > memcg->memsw.limit)
+ enlarge = true;
+ ret = page_counter_limit(&memcg->memsw, limit);
+ mutex_unlock(&memcg_limit_mutex);

if (!ret)
break;

try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false);

- curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
+ curusage = page_counter_read(&memcg->memsw);
/* Usage is reduced ? */
if (curusage >= oldusage)
retry_count--;
else
oldusage = curusage;
- }
+ } while (retry_count);
+
if (!ret && enlarge)
memcg_oom_recover(memcg);
+
return ret;
}

@@ -3700,7 +3680,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
unsigned long reclaimed;
int loop = 0;
struct mem_cgroup_tree_per_zone *mctz;
- unsigned long long excess;
+ unsigned long excess;
unsigned long nr_scanned;

if (order > 0)
@@ -3754,7 +3734,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
} while (1);
}
__mem_cgroup_remove_exceeded(mz, mctz);
- excess = res_counter_soft_limit_excess(&mz->memcg->res);
+ excess = soft_limit_excess(mz->memcg);
/*
* One school of thought says that we should not add
* back the node to the tree if reclaim returns 0.
@@ -3847,7 +3827,6 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
{
int node, zid;
- u64 usage;

do {
/* This is for making all *used* pages to be on LRU. */
@@ -3879,9 +3858,8 @@ static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
* right after the check. RES_USAGE should be safe as we always
* charge before adding to the LRU.
*/
- usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
- res_counter_read_u64(&memcg->kmem, RES_USAGE);
- } while (usage > 0);
+ } while (page_counter_read(&memcg->memory) -
+ page_counter_read(&memcg->kmem) > 0);
}

/*
@@ -3921,7 +3899,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
/* we call try-to-free pages for make this cgroup empty */
lru_add_drain_all();
/* try to free all pages in this cgroup */
- while (nr_retries && res_counter_read_u64(&memcg->res, RES_USAGE) > 0) {
+ while (nr_retries && page_counter_read(&memcg->memory)) {
int progress;

if (signal_pending(current))
@@ -3992,8 +3970,8 @@ out:
return retval;
}

-static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *memcg,
- enum mem_cgroup_stat_index idx)
+static unsigned long tree_stat(struct mem_cgroup *memcg,
+ enum mem_cgroup_stat_index idx)
{
struct mem_cgroup *iter;
long val = 0;
@@ -4011,55 +3989,72 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
{
u64 val;

- if (!mem_cgroup_is_root(memcg)) {
+ if (mem_cgroup_is_root(memcg)) {
+ val = tree_stat(memcg, MEM_CGROUP_STAT_CACHE);
+ val += tree_stat(memcg, MEM_CGROUP_STAT_RSS);
+ if (swap)
+ val += tree_stat(memcg, MEM_CGROUP_STAT_SWAP);
+ } else {
if (!swap)
- return res_counter_read_u64(&memcg->res, RES_USAGE);
+ val = page_counter_read(&memcg->memory);
else
- return res_counter_read_u64(&memcg->memsw, RES_USAGE);
+ val = page_counter_read(&memcg->memsw);
}
-
- /*
- * Transparent hugepages are still accounted for in MEM_CGROUP_STAT_RSS
- * as well as in MEM_CGROUP_STAT_RSS_HUGE.
- */
- val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE);
- val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS);
-
- if (swap)
- val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SWAP);
-
return val << PAGE_SHIFT;
}

+enum {
+ RES_USAGE,
+ RES_LIMIT,
+ RES_MAX_USAGE,
+ RES_FAILCNT,
+ RES_SOFT_LIMIT,
+};

static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
- enum res_type type = MEMFILE_TYPE(cft->private);
- int name = MEMFILE_ATTR(cft->private);
+ struct page_counter *counter;

- switch (type) {
+ switch (MEMFILE_TYPE(cft->private)) {
case _MEM:
- if (name == RES_USAGE)
- return mem_cgroup_usage(memcg, false);
- return res_counter_read_u64(&memcg->res, name);
+ counter = &memcg->memory;
+ break;
case _MEMSWAP:
- if (name == RES_USAGE)
- return mem_cgroup_usage(memcg, true);
- return res_counter_read_u64(&memcg->memsw, name);
+ counter = &memcg->memsw;
+ break;
case _KMEM:
- return res_counter_read_u64(&memcg->kmem, name);
+ counter = &memcg->kmem;
break;
default:
BUG();
}
+
+ switch (MEMFILE_ATTR(cft->private)) {
+ case RES_USAGE:
+ if (counter == &memcg->memory)
+ return mem_cgroup_usage(memcg, false);
+ if (counter == &memcg->memsw)
+ return mem_cgroup_usage(memcg, true);
+ return (u64)page_counter_read(counter) * PAGE_SIZE;
+ case RES_LIMIT:
+ return (u64)counter->limit * PAGE_SIZE;
+ case RES_MAX_USAGE:
+ return (u64)counter->watermark * PAGE_SIZE;
+ case RES_FAILCNT:
+ return counter->failcnt;
+ case RES_SOFT_LIMIT:
+ return (u64)memcg->soft_limit * PAGE_SIZE;
+ default:
+ BUG();
+ }
}

#ifdef CONFIG_MEMCG_KMEM
/* should be called with activate_kmem_mutex held */
static int __memcg_activate_kmem(struct mem_cgroup *memcg,
- unsigned long long limit)
+ unsigned long nr_pages)
{
int err = 0;
int memcg_id;
@@ -4106,7 +4101,7 @@ static int __memcg_activate_kmem(struct mem_cgroup *memcg,
* We couldn't have accounted to this cgroup, because it hasn't got the
* active bit set yet, so this should succeed.
*/
- err = res_counter_set_limit(&memcg->kmem, limit);
+ err = page_counter_limit(&memcg->kmem, nr_pages);
VM_BUG_ON(err);

static_key_slow_inc(&memcg_kmem_enabled_key);
@@ -4122,25 +4117,27 @@ out:
}

static int memcg_activate_kmem(struct mem_cgroup *memcg,
- unsigned long long limit)
+ unsigned long nr_pages)
{
int ret;

mutex_lock(&activate_kmem_mutex);
- ret = __memcg_activate_kmem(memcg, limit);
+ ret = __memcg_activate_kmem(memcg, nr_pages);
mutex_unlock(&activate_kmem_mutex);
return ret;
}

static int memcg_update_kmem_limit(struct mem_cgroup *memcg,
- unsigned long long val)
+ unsigned long limit)
{
int ret;

+ mutex_lock(&memcg_limit_mutex);
if (!memcg_kmem_is_active(memcg))
- ret = memcg_activate_kmem(memcg, val);
+ ret = memcg_activate_kmem(memcg, limit);
else
- ret = res_counter_set_limit(&memcg->kmem, val);
+ ret = page_counter_limit(&memcg->kmem, limit);
+ mutex_unlock(&memcg_limit_mutex);
return ret;
}

@@ -4158,13 +4155,13 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
* after this point, because it has at least one child already.
*/
if (memcg_kmem_is_active(parent))
- ret = __memcg_activate_kmem(memcg, RES_COUNTER_MAX);
+ ret = __memcg_activate_kmem(memcg, PAGE_COUNTER_MAX);
mutex_unlock(&activate_kmem_mutex);
return ret;
}
#else
static int memcg_update_kmem_limit(struct mem_cgroup *memcg,
- unsigned long long val)
+ unsigned long limit)
{
return -EINVAL;
}
@@ -4178,110 +4175,69 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
- enum res_type type;
- int name;
- unsigned long long val;
+ unsigned long nr_pages;
int ret;

buf = strstrip(buf);
- type = MEMFILE_TYPE(of_cft(of)->private);
- name = MEMFILE_ATTR(of_cft(of)->private);
+ ret = page_counter_memparse(buf, &nr_pages);
+ if (ret)
+ return ret;

- switch (name) {
+ switch (MEMFILE_ATTR(of_cft(of)->private)) {
case RES_LIMIT:
if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */
ret = -EINVAL;
break;
}
- /* This function does all necessary parse...reuse it */
- ret = res_counter_memparse_write_strategy(buf, &val);
- if (ret)
+ switch (MEMFILE_TYPE(of_cft(of)->private)) {
+ case _MEM:
+ ret = mem_cgroup_resize_limit(memcg, nr_pages);
break;
- if (type == _MEM)
- ret = mem_cgroup_resize_limit(memcg, val);
- else if (type == _MEMSWAP)
- ret = mem_cgroup_resize_memsw_limit(memcg, val);
- else if (type == _KMEM)
- ret = memcg_update_kmem_limit(memcg, val);
- else
- return -EINVAL;
- break;
- case RES_SOFT_LIMIT:
- ret = res_counter_memparse_write_strategy(buf, &val);
- if (ret)
+ case _MEMSWAP:
+ ret = mem_cgroup_resize_memsw_limit(memcg, nr_pages);
break;
- /*
- * For memsw, soft limits are hard to implement in terms
- * of semantics, for now, we support soft limits for
- * control without swap
- */
- if (type == _MEM)
- ret = res_counter_set_soft_limit(&memcg->res, val);
- else
- ret = -EINVAL;
+ case _KMEM:
+ ret = memcg_update_kmem_limit(memcg, nr_pages);
+ break;
+ }
break;
- default:
- ret = -EINVAL; /* should be BUG() ? */
+ case RES_SOFT_LIMIT:
+ memcg->soft_limit = nr_pages;
+ ret = 0;
break;
}
return ret ?: nbytes;
}

-static void memcg_get_hierarchical_limit(struct mem_cgroup *memcg,
- unsigned long long *mem_limit, unsigned long long *memsw_limit)
-{
- unsigned long long min_limit, min_memsw_limit, tmp;
-
- min_limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
- min_memsw_limit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
- if (!memcg->use_hierarchy)
- goto out;
-
- while (memcg->css.parent) {
- memcg = mem_cgroup_from_css(memcg->css.parent);
- if (!memcg->use_hierarchy)
- break;
- tmp = res_counter_read_u64(&memcg->res, RES_LIMIT);
- min_limit = min(min_limit, tmp);
- tmp = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
- min_memsw_limit = min(min_memsw_limit, tmp);
- }
-out:
- *mem_limit = min_limit;
- *memsw_limit = min_memsw_limit;
-}
-
static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
- int name;
- enum res_type type;
+ struct page_counter *counter;

- type = MEMFILE_TYPE(of_cft(of)->private);
- name = MEMFILE_ATTR(of_cft(of)->private);
+ switch (MEMFILE_TYPE(of_cft(of)->private)) {
+ case _MEM:
+ counter = &memcg->memory;
+ break;
+ case _MEMSWAP:
+ counter = &memcg->memsw;
+ break;
+ case _KMEM:
+ counter = &memcg->kmem;
+ break;
+ default:
+ BUG();
+ }

- switch (name) {
+ switch (MEMFILE_ATTR(of_cft(of)->private)) {
case RES_MAX_USAGE:
- if (type == _MEM)
- res_counter_reset_max(&memcg->res);
- else if (type == _MEMSWAP)
- res_counter_reset_max(&memcg->memsw);
- else if (type == _KMEM)
- res_counter_reset_max(&memcg->kmem);
- else
- return -EINVAL;
+ page_counter_reset_watermark(counter);
break;
case RES_FAILCNT:
- if (type == _MEM)
- res_counter_reset_failcnt(&memcg->res);
- else if (type == _MEMSWAP)
- res_counter_reset_failcnt(&memcg->memsw);
- else if (type == _KMEM)
- res_counter_reset_failcnt(&memcg->kmem);
- else
- return -EINVAL;
+ counter->failcnt = 0;
break;
+ default:
+ BUG();
}

return nbytes;
@@ -4378,6 +4334,7 @@ static inline void mem_cgroup_lru_names_not_uptodate(void)
static int memcg_stat_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+ unsigned long memory, memsw;
struct mem_cgroup *mi;
unsigned int i;

@@ -4397,14 +4354,16 @@ static int memcg_stat_show(struct seq_file *m, void *v)
mem_cgroup_nr_lru_pages(memcg, BIT(i)) * PAGE_SIZE);

/* Hierarchical information */
- {
- unsigned long long limit, memsw_limit;
- memcg_get_hierarchical_limit(memcg, &limit, &memsw_limit);
- seq_printf(m, "hierarchical_memory_limit %llu\n", limit);
- if (do_swap_account)
- seq_printf(m, "hierarchical_memsw_limit %llu\n",
- memsw_limit);
+ memory = memsw = PAGE_COUNTER_MAX;
+ for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) {
+ memory = min(memory, mi->memory.limit);
+ memsw = min(memsw, mi->memsw.limit);
}
+ seq_printf(m, "hierarchical_memory_limit %llu\n",
+ (u64)memory * PAGE_SIZE);
+ if (do_swap_account)
+ seq_printf(m, "hierarchical_memsw_limit %llu\n",
+ (u64)memsw * PAGE_SIZE);

for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
long long val = 0;
@@ -4488,7 +4447,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
{
struct mem_cgroup_threshold_ary *t;
- u64 usage;
+ unsigned long usage;
int i;

rcu_read_lock();
@@ -4587,10 +4546,11 @@ static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
{
struct mem_cgroup_thresholds *thresholds;
struct mem_cgroup_threshold_ary *new;
- u64 threshold, usage;
+ unsigned long threshold;
+ unsigned long usage;
int i, size, ret;

- ret = res_counter_memparse_write_strategy(args, &threshold);
+ ret = page_counter_memparse(args, &threshold);
if (ret)
return ret;

@@ -4680,7 +4640,7 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
{
struct mem_cgroup_thresholds *thresholds;
struct mem_cgroup_threshold_ary *new;
- u64 usage;
+ unsigned long usage;
int i, j, size;

mutex_lock(&memcg->thresholds_lock);
@@ -4874,7 +4834,7 @@ static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)

memcg_kmem_mark_dead(memcg);

- if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
+ if (page_counter_read(&memcg->kmem))
return;

if (memcg_kmem_test_and_clear_dead(memcg))
@@ -5354,9 +5314,9 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
*/
struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
{
- if (!memcg->res.parent)
+ if (!memcg->memory.parent)
return NULL;
- return mem_cgroup_from_res_counter(memcg->res.parent, res);
+ return mem_cgroup_from_counter(memcg->memory.parent, memory);
}
EXPORT_SYMBOL(parent_mem_cgroup);

@@ -5401,9 +5361,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
/* root ? */
if (parent_css == NULL) {
root_mem_cgroup = memcg;
- res_counter_init(&memcg->res, NULL);
- res_counter_init(&memcg->memsw, NULL);
- res_counter_init(&memcg->kmem, NULL);
+ page_counter_init(&memcg->memory, NULL);
+ page_counter_init(&memcg->memsw, NULL);
+ page_counter_init(&memcg->kmem, NULL);
}

memcg->last_scanned_node = MAX_NUMNODES;
@@ -5442,18 +5402,18 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
memcg->swappiness = mem_cgroup_swappiness(parent);

if (parent->use_hierarchy) {
- res_counter_init(&memcg->res, &parent->res);
- res_counter_init(&memcg->memsw, &parent->memsw);
- res_counter_init(&memcg->kmem, &parent->kmem);
+ page_counter_init(&memcg->memory, &parent->memory);
+ page_counter_init(&memcg->memsw, &parent->memsw);
+ page_counter_init(&memcg->kmem, &parent->kmem);

/*
* No need to take a reference to the parent because cgroup
* core guarantees its existence.
*/
} else {
- res_counter_init(&memcg->res, NULL);
- res_counter_init(&memcg->memsw, NULL);
- res_counter_init(&memcg->kmem, NULL);
+ page_counter_init(&memcg->memory, NULL);
+ page_counter_init(&memcg->memsw, NULL);
+ page_counter_init(&memcg->kmem, NULL);
/*
* Deeper hierachy with use_hierarchy == false doesn't make
* much sense so let cgroup subsystem know about this
@@ -5535,7 +5495,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
/*
* XXX: css_offline() would be where we should reparent all
* memory to prepare the cgroup for destruction. However,
- * memcg does not do css_tryget_online() and res_counter charging
+ * memcg does not do css_tryget_online() and page_counter charging
* under the same RCU lock region, which means that charging
* could race with offlining. Offlining only happens to
* cgroups with no tasks in them but charges can show up
@@ -5555,7 +5515,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
* call_rcu()
* offline_css()
* reparent_charges()
- * res_counter_charge()
+ * page_counter_try_charge()
* css_put()
* css_free()
* pc->mem_cgroup = dead memcg
@@ -5590,10 +5550,10 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);

- mem_cgroup_resize_limit(memcg, ULLONG_MAX);
- mem_cgroup_resize_memsw_limit(memcg, ULLONG_MAX);
- memcg_update_kmem_limit(memcg, ULLONG_MAX);
- res_counter_set_soft_limit(&memcg->res, ULLONG_MAX);
+ mem_cgroup_resize_limit(memcg, PAGE_COUNTER_MAX);
+ mem_cgroup_resize_memsw_limit(memcg, PAGE_COUNTER_MAX);
+ memcg_update_kmem_limit(memcg, PAGE_COUNTER_MAX);
+ memcg->soft_limit = 0;
}

#ifdef CONFIG_MMU
@@ -5907,19 +5867,18 @@ static void __mem_cgroup_clear_mc(void)
if (mc.moved_swap) {
/* uncharge swap account from the old cgroup */
if (!mem_cgroup_is_root(mc.from))
- res_counter_uncharge(&mc.from->memsw,
- PAGE_SIZE * mc.moved_swap);
-
- for (i = 0; i < mc.moved_swap; i++)
- css_put(&mc.from->css);
+ page_counter_uncharge(&mc.from->memsw, mc.moved_swap);

/*
- * we charged both to->res and to->memsw, so we should
- * uncharge to->res.
+ * we charged both to->memory and to->memsw, so we
+ * should uncharge to->memory.
*/
if (!mem_cgroup_is_root(mc.to))
- res_counter_uncharge(&mc.to->res,
- PAGE_SIZE * mc.moved_swap);
+ page_counter_uncharge(&mc.to->memory, mc.moved_swap);
+
+ for (i = 0; i < mc.moved_swap; i++)
+ css_put(&mc.from->css);
+
/* we've already done css_get(mc.to) */
mc.moved_swap = 0;
}
@@ -6285,7 +6244,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry)
memcg = mem_cgroup_lookup(id);
if (memcg) {
if (!mem_cgroup_is_root(memcg))
- res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+ page_counter_uncharge(&memcg->memsw, 1);
mem_cgroup_swap_statistics(memcg, false);
css_put(&memcg->css);
}
@@ -6451,11 +6410,9 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,

if (!mem_cgroup_is_root(memcg)) {
if (nr_mem)
- res_counter_uncharge(&memcg->res,
- nr_mem * PAGE_SIZE);
+ page_counter_uncharge(&memcg->memory, nr_mem);
if (nr_memsw)
- res_counter_uncharge(&memcg->memsw,
- nr_memsw * PAGE_SIZE);
+ page_counter_uncharge(&memcg->memsw, nr_memsw);
memcg_oom_recover(memcg);
}

diff --git a/mm/page_counter.c b/mm/page_counter.c
new file mode 100644
index 000000000000..fc4990c6bb5b
--- /dev/null
+++ b/mm/page_counter.c
@@ -0,0 +1,203 @@
+/*
+ * Lockless hierarchical page accounting & limiting
+ *
+ * Copyright (C) 2014 Red Hat, Inc., Johannes Weiner
+ */
+
+#include <linux/page_counter.h>
+#include <linux/atomic.h>
+#include <linux/sched.h>
+
+/**
+ * page_counter_cancel - take pages out of the local counter
+ * @counter: counter
+ * @nr_pages: number of pages to cancel
+ *
+ * Returns whether there are remaining pages in the counter.
+ */
+int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
+{
+ long new;
+
+ new = atomic_long_sub_return(nr_pages, &counter->count);
+
+ /* More uncharges than charges? */
+ WARN_ON_ONCE(new < 0);
+
+ return new > 0;
+}
+
+/**
+ * page_counter_charge - hierarchically charge pages
+ * @counter: counter
+ * @nr_pages: number of pages to charge
+ *
+ * NOTE: This does not consider any configured counter limits.
+ */
+void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
+{
+ struct page_counter *c;
+
+ for (c = counter; c; c = c->parent) {
+ long new;
+
+ new = atomic_long_add_return(nr_pages, &c->count);
+ /*
+ * This is indeed racy, but we can live with some
+ * inaccuracy in the watermark.
+ */
+ if (new > c->watermark)
+ c->watermark = new;
+ }
+}
+
+/**
+ * page_counter_try_charge - try to hierarchically charge pages
+ * @counter: counter
+ * @nr_pages: number of pages to charge
+ * @fail: points first counter to hit its limit, if any
+ *
+ * Returns 0 on success, or -ENOMEM and @fail if the counter or one of
+ * its ancestors has hit its configured limit.
+ */
+int page_counter_try_charge(struct page_counter *counter,
+ unsigned long nr_pages,
+ struct page_counter **fail)
+{
+ struct page_counter *c;
+
+ for (c = counter; c; c = c->parent) {
+ long new;
+ /*
+ * Charge speculatively to avoid an expensive CAS. If
+ * a bigger charge fails, it might falsely lock out a
+ * racing smaller charge and send it into reclaim
+ * early, but the error is limited to the difference
+ * between the two sizes, which is less than 2M/4M in
+ * case of a THP locking out a regular page charge.
+ *
+ * The atomic_long_add_return() implies a full memory
+ * barrier between incrementing the count and reading
+ * the limit. When racing with page_counter_limit(),
+ * we either see the new limit or the setter sees the
+ * counter has changed and retries.
+ */
+ new = atomic_long_add_return(nr_pages, &c->count);
+ if (new > c->limit) {
+ atomic_long_sub(nr_pages, &c->count);
+ /*
+ * This is racy, but we can live with some
+ * inaccuracy in the failcnt.
+ */
+ c->failcnt++;
+ *fail = c;
+ goto failed;
+ }
+ /*
+ * Just like with failcnt, we can live with some
+ * inaccuracy in the watermark.
+ */
+ if (new > c->watermark)
+ c->watermark = new;
+ }
+ return 0;
+
+failed:
+ for (c = counter; c != *fail; c = c->parent)
+ page_counter_cancel(c, nr_pages);
+
+ return -ENOMEM;
+}
+
+/**
+ * page_counter_uncharge - hierarchically uncharge pages
+ * @counter: counter
+ * @nr_pages: number of pages to uncharge
+ *
+ * Returns whether there are remaining charges in @counter.
+ */
+int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
+{
+ struct page_counter *c;
+ int ret = 1;
+
+ for (c = counter; c; c = c->parent) {
+ int remainder;
+
+ remainder = page_counter_cancel(c, nr_pages);
+ if (c == counter && !remainder)
+ ret = 0;
+ }
+
+ return ret;
+}
+
+/**
+ * page_counter_limit - limit the number of pages allowed
+ * @counter: counter
+ * @limit: limit to set
+ *
+ * Returns 0 on success, -EBUSY if the current number of pages on the
+ * counter already exceeds the specified limit.
+ *
+ * The caller must serialize invocations on the same counter.
+ */
+int page_counter_limit(struct page_counter *counter, unsigned long limit)
+{
+ for (;;) {
+ unsigned long old;
+ long count;
+
+ /*
+ * Update the limit while making sure that it's not
+ * below the concurrently-changing counter value.
+ *
+ * The xchg implies two full memory barriers before
+ * and after, so the read-swap-read is ordered and
+ * ensures coherency with page_counter_try_charge():
+ * that function modifies the count before checking
+ * the limit, so if it sees the old limit, we see the
+ * modified counter and retry.
+ */
+ count = atomic_long_read(&counter->count);
+
+ if (count > limit)
+ return -EBUSY;
+
+ old = xchg(&counter->limit, limit);
+
+ if (atomic_long_read(&counter->count) <= count)
+ return 0;
+
+ counter->limit = old;
+ cond_resched();
+ }
+}
+
+/**
+ * page_counter_memparse - memparse() for page counter limits
+ * @buf: string to parse
+ * @nr_pages: returns the result in number of pages
+ *
+ * Returns -EINVAL, or 0 and @nr_pages on success. @nr_pages will be
+ * limited to %PAGE_COUNTER_MAX.
+ */
+int page_counter_memparse(const char *buf, unsigned long *nr_pages)
+{
+ char unlimited[] = "-1";
+ char *end;
+ u64 bytes;
+
+ if (!strncmp(buf, unlimited, sizeof(unlimited))) {
+ *nr_pages = PAGE_COUNTER_MAX;
+ return 0;
+ }
+
+ bytes = memparse(buf, &end);
+ if (*end != '\0')
+ return -EINVAL;
+
+ *nr_pages = min(bytes / PAGE_SIZE, (u64)PAGE_COUNTER_MAX);
+
+ return 0;
+}
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 1d191357bf88..272327134a1b 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -9,13 +9,13 @@
int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
/*
- * The root cgroup does not use res_counters, but rather,
+ * The root cgroup does not use page_counters, but rather,
* rely on the data already collected by the network
* subsystem
*/
- struct res_counter *res_parent = NULL;
- struct cg_proto *cg_proto, *parent_cg;
struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+ struct page_counter *counter_parent = NULL;
+ struct cg_proto *cg_proto, *parent_cg;

cg_proto = tcp_prot.proto_cgroup(memcg);
if (!cg_proto)
@@ -29,9 +29,9 @@ int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)

parent_cg = tcp_prot.proto_cgroup(parent);
if (parent_cg)
- res_parent = &parent_cg->memory_allocated;
+ counter_parent = &parent_cg->memory_allocated;

- res_counter_init(&cg_proto->memory_allocated, res_parent);
+ page_counter_init(&cg_proto->memory_allocated, counter_parent);
percpu_counter_init(&cg_proto->sockets_allocated, 0, GFP_KERNEL);

return 0;
@@ -50,7 +50,7 @@ void tcp_destroy_cgroup(struct mem_cgroup *memcg)
}
EXPORT_SYMBOL(tcp_destroy_cgroup);

-static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
+static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
{
struct cg_proto *cg_proto;
int i;
@@ -60,20 +60,17 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
if (!cg_proto)
return -EINVAL;

- if (val > RES_COUNTER_MAX)
- val = RES_COUNTER_MAX;
-
- ret = res_counter_set_limit(&cg_proto->memory_allocated, val);
+ ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages);
if (ret)
return ret;

for (i = 0; i < 3; i++)
- cg_proto->sysctl_mem[i] = min_t(long, val >> PAGE_SHIFT,
+ cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
sysctl_tcp_mem[i]);

- if (val == RES_COUNTER_MAX)
+ if (nr_pages == PAGE_COUNTER_MAX)
clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
- else if (val != RES_COUNTER_MAX) {
+ else {
/*
* The active bit needs to be written after the static_key
* update. This is what guarantees that the socket activation
@@ -102,11 +99,20 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
return 0;
}

+enum {
+ RES_USAGE,
+ RES_LIMIT,
+ RES_MAX_USAGE,
+ RES_FAILCNT,
+};
+
+static DEFINE_MUTEX(tcp_limit_mutex);
+
static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
- unsigned long long val;
+ unsigned long nr_pages;
int ret = 0;

buf = strstrip(buf);
@@ -114,10 +120,12 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
switch (of_cft(of)->private) {
case RES_LIMIT:
/* see memcontrol.c */
- ret = res_counter_memparse_write_strategy(buf, &val);
+ ret = page_counter_memparse(buf, &nr_pages);
if (ret)
break;
- ret = tcp_update_limit(memcg, val);
+ mutex_lock(&tcp_limit_mutex);
+ ret = tcp_update_limit(memcg, nr_pages);
+ mutex_unlock(&tcp_limit_mutex);
break;
default:
ret = -EINVAL;
@@ -126,43 +134,36 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
return ret ?: nbytes;
}

-static u64 tcp_read_stat(struct mem_cgroup *memcg, int type, u64 default_val)
-{
- struct cg_proto *cg_proto;
-
- cg_proto = tcp_prot.proto_cgroup(memcg);
- if (!cg_proto)
- return default_val;
-
- return res_counter_read_u64(&cg_proto->memory_allocated, type);
-}
-
-static u64 tcp_read_usage(struct mem_cgroup *memcg)
-{
- struct cg_proto *cg_proto;
-
- cg_proto = tcp_prot.proto_cgroup(memcg);
- if (!cg_proto)
- return atomic_long_read(&tcp_memory_allocated) << PAGE_SHIFT;
-
- return res_counter_read_u64(&cg_proto->memory_allocated, RES_USAGE);
-}
-
static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+ struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg);
u64 val;

switch (cft->private) {
case RES_LIMIT:
- val = tcp_read_stat(memcg, RES_LIMIT, RES_COUNTER_MAX);
+ if (!cg_proto)
+ return PAGE_COUNTER_MAX;
+ val = cg_proto->memory_allocated.limit;
+ val *= PAGE_SIZE;
break;
case RES_USAGE:
- val = tcp_read_usage(memcg);
+ if (!cg_proto)
+ val = atomic_long_read(&tcp_memory_allocated);
+ else
+ val = page_counter_read(&cg_proto->memory_allocated);
+ val *= PAGE_SIZE;
break;
case RES_FAILCNT:
+ if (!cg_proto)
+ return 0;
+ val = cg_proto->memory_allocated.failcnt;
+ break;
case RES_MAX_USAGE:
- val = tcp_read_stat(memcg, cft->private, 0);
+ if (!cg_proto)
+ return 0;
+ val = cg_proto->memory_allocated.watermark;
+ val *= PAGE_SIZE;
break;
default:
BUG();
@@ -183,10 +184,10 @@ static ssize_t tcp_cgroup_reset(struct kernfs_open_file *of,

switch (of_cft(of)->private) {
case RES_MAX_USAGE:
- res_counter_reset_max(&cg_proto->memory_allocated);
+ page_counter_reset_watermark(&cg_proto->memory_allocated);
break;
case RES_FAILCNT:
- res_counter_reset_failcnt(&cg_proto->memory_allocated);
+ cg_proto->memory_allocated.failcnt = 0;
break;
}
--
2.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Michal Hocko
2014-10-14 15:56:47 UTC
Permalink
Post by Johannes Weiner
Memory is internally accounted in bytes, using spinlock-protected
64-bit counters, even though the smallest accounting delta is a page.
The counter interface is also convoluted and does too many things.
Introduce a new lockless word-sized page counter API, then change all
memory accounting over to it. The translation from and to bytes then
only happens when interfacing with userspace.
The removed locking overhead is noticable when scaling beyond the
per-cpu charge caches - on a 4-socket machine with 144-threads, the
following test shows the performance differences of 288 memcgs
I assume you had root.use_hierarchy = 1, right? Processes wouldn't bounce
on the same lock otherwise.
Post by Johannes Weiner
18631648.500498 task-clock (msec) # 140.643 CPUs utilized ( +- 0.33% )
1,380,638 context-switches # 0.074 K/sec ( +- 0.75% )
24,390 cpu-migrations # 0.001 K/sec ( +- 8.44% )
1,843,305,768 page-faults # 0.099 M/sec ( +- 0.00% )
50,134,994,088,218 cycles # 2.691 GHz ( +- 0.33% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
8,049,712,224,651 instructions # 0.16 insns per cycle ( +- 0.04% )
1,586,970,584,979 branches # 85.176 M/sec ( +- 0.05% )
1,724,989,949 branch-misses # 0.11% of all branches ( +- 0.48% )
132.474343877 seconds time elapsed ( +- 0.21% )
12195979.037525 task-clock (msec) # 133.480 CPUs utilized ( +- 0.18% )
832,850 context-switches # 0.068 K/sec ( +- 0.54% )
15,624 cpu-migrations # 0.001 K/sec ( +- 10.17% )
1,843,304,774 page-faults # 0.151 M/sec ( +- 0.00% )
32,811,216,801,141 cycles # 2.690 GHz ( +- 0.18% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
9,999,265,091,727 instructions # 0.30 insns per cycle ( +- 0.10% )
2,076,759,325,203 branches # 170.282 M/sec ( +- 0.12% )
1,656,917,214 branch-misses # 0.08% of all branches ( +- 0.55% )
91.369330729 seconds time elapsed ( +- 0.45% )
So this is ~30% improvement which is more than I expected. Nice!
It is true that this test case is never hitting any limit so this is
always a hot-path but once we hit the limit the res_counter is the last
thing we care about.
Post by Johannes Weiner
On top of improved scalability, this also gets rid of the icky long
long types in the very heart of memcg, which is great for 32 bit and
also makes the code a lot more readable.
- res_counter_charge() and res_counter_charge_nofail() become
page_counter_try_charge() and page_counter_charge() resp. to match
the more common kernel naming scheme of try_do()/do()
- res_counter_uncharge_until() is only ever used to cancel a local
counter and never to uncharge bigger segments of a hierarchy, so
it's replaced by the simpler page_counter_cancel()
- res_counter_set_limit() is replaced by page_counter_limit(), which
expects its callers to serialize against themselves
- res_counter_memparse_write_strategy() is replaced by
page_counter_limit(), which rounds down to the nearest page size -
rather than up. This is more reasonable for explicitely requested
hard upper limits.
- to keep charging light-weight, page_counter_try_charge() charges
speculatively, only to roll back if the result exceeds the limit.
Because of this, a failing bigger charge can temporarily lock out
smaller charges that would otherwise succeed. The error is bounded
to the difference between the smallest and the biggest possible
charge size, so for memcg, this means that a failing THP charge can
send base page charges into reclaim upto 2MB (4MB) before the limit
would have been reached. This should be acceptable.
Thanks!.
You have only missed MAINTAINERS...
Post by Johannes Weiner
---
Documentation/cgroups/memory.txt | 4 +-
include/linux/memcontrol.h | 5 +-
include/linux/page_counter.h | 49 +++
include/net/sock.h | 26 +-
init/Kconfig | 5 +-
mm/Makefile | 1 +
mm/memcontrol.c | 633 ++++++++++++++++++---------------------
mm/page_counter.c | 203 +++++++++++++
net/ipv4/tcp_memcontrol.c | 87 +++---
9 files changed, 609 insertions(+), 404 deletions(-)
create mode 100644 include/linux/page_counter.h
create mode 100644 mm/page_counter.c
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 02ab997a1ed2..f624727ab404 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -52,9 +52,9 @@ Brief summary of control files.
tasks # attach a task(thread) and show list of threads
cgroup.procs # show list of processes
cgroup.event_control # an interface for event_fd()
- memory.usage_in_bytes # show current res_counter usage for memory
+ memory.usage_in_bytes # show current usage for memory
(See 5.5 for details)
- memory.memsw.usage_in_bytes # show current res_counter usage for memory+Swap
+ memory.memsw.usage_in_bytes # show current usage for memory+Swap
(See 5.5 for details)
memory.limit_in_bytes # set/show limit of memory usage
memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 19df5d857411..0daf383f8f1c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -471,9 +471,8 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
/*
* __GFP_NOFAIL allocations will move on even if charging is not
* possible. Therefore we don't even try, and have this allocation
- * unaccounted. We could in theory charge it with
- * res_counter_charge_nofail, but we hope those allocations are rare,
- * and won't be worth the trouble.
+ * unaccounted. We could in theory charge it forcibly, but we hope
+ * those allocations are rare, and won't be worth the trouble.
*/
if (gfp & __GFP_NOFAIL)
return true;
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
new file mode 100644
index 000000000000..d92d18949474
--- /dev/null
+++ b/include/linux/page_counter.h
@@ -0,0 +1,49 @@
+#ifndef _LINUX_PAGE_COUNTER_H
+#define _LINUX_PAGE_COUNTER_H
+
+#include <linux/atomic.h>
+
+struct page_counter {
+ atomic_long_t count;
+ unsigned long limit;
+ struct page_counter *parent;
+
+ /* legacy */
+ unsigned long watermark;
+ unsigned long failcnt;
+};
+
+#if BITS_PER_LONG == 32
+#define PAGE_COUNTER_MAX LONG_MAX
+#else
+#define PAGE_COUNTER_MAX (LONG_MAX / PAGE_SIZE)
+#endif
+
+static inline void page_counter_init(struct page_counter *counter,
+ struct page_counter *parent)
+{
+ atomic_long_set(&counter->count, 0);
+ counter->limit = PAGE_COUNTER_MAX;
+ counter->parent = parent;
+}
+
+static inline unsigned long page_counter_read(struct page_counter *counter)
+{
+ return atomic_long_read(&counter->count);
+}
+
+int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
+void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
+int page_counter_try_charge(struct page_counter *counter,
+ unsigned long nr_pages,
+ struct page_counter **fail);
+int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
+int page_counter_limit(struct page_counter *counter, unsigned long limit);
+int page_counter_memparse(const char *buf, unsigned long *nr_pages);
+
+static inline void page_counter_reset_watermark(struct page_counter *counter)
+{
+ counter->watermark = page_counter_read(counter);
+}
+
+#endif /* _LINUX_PAGE_COUNTER_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index 7db3db112baa..0fa8cf908b65 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -54,8 +54,8 @@
#include <linux/security.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
+#include <linux/page_counter.h>
#include <linux/memcontrol.h>
-#include <linux/res_counter.h>
#include <linux/static_key.h>
#include <linux/aio.h>
#include <linux/sched.h>
@@ -1061,7 +1061,7 @@ enum cg_proto_flags {
};
struct cg_proto {
- struct res_counter memory_allocated; /* Current allocated memory. */
+ struct page_counter memory_allocated; /* Current allocated memory. */
struct percpu_counter sockets_allocated; /* Current number of sockets. */
int memory_pressure;
long sysctl_mem[3];
@@ -1213,34 +1213,26 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
unsigned long amt,
int *parent_status)
{
- struct res_counter *fail;
- int ret;
+ page_counter_charge(&prot->memory_allocated, amt);
- ret = res_counter_charge_nofail(&prot->memory_allocated,
- amt << PAGE_SHIFT, &fail);
- if (ret < 0)
+ if (page_counter_read(&prot->memory_allocated) >
+ prot->memory_allocated.limit)
*parent_status = OVER_LIMIT;
}
static inline void memcg_memory_allocated_sub(struct cg_proto *prot,
unsigned long amt)
{
- res_counter_uncharge(&prot->memory_allocated, amt << PAGE_SHIFT);
-}
-
-static inline u64 memcg_memory_allocated_read(struct cg_proto *prot)
-{
- u64 ret;
- ret = res_counter_read_u64(&prot->memory_allocated, RES_USAGE);
- return ret >> PAGE_SHIFT;
+ page_counter_uncharge(&prot->memory_allocated, amt);
}
static inline long
sk_memory_allocated(const struct sock *sk)
{
struct proto *prot = sk->sk_prot;
+
if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
- return memcg_memory_allocated_read(sk->sk_cgrp);
+ return page_counter_read(&sk->sk_cgrp->memory_allocated);
return atomic_long_read(prot->memory_allocated);
}
@@ -1254,7 +1246,7 @@ sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status);
/* update the root cgroup regardless */
atomic_long_add_return(amt, prot->memory_allocated);
- return memcg_memory_allocated_read(sk->sk_cgrp);
+ return page_counter_read(&sk->sk_cgrp->memory_allocated);
}
return atomic_long_add_return(amt, prot->memory_allocated);
diff --git a/init/Kconfig b/init/Kconfig
index b454b162d2df..3fc577e9de75 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -985,9 +985,12 @@ config RESOURCE_COUNTERS
This option enables controller independent resource accounting
infrastructure that works with cgroups.
+config PAGE_COUNTER
+ bool
+
config MEMCG
bool "Memory Resource Controller for Control Groups"
- depends on RESOURCE_COUNTERS
+ select PAGE_COUNTER
select EVENTFD
help
Provides a memory resource controller that manages both anonymous
diff --git a/mm/Makefile b/mm/Makefile
index ba3ec4e1a66e..27ddb80403a9 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -55,6 +55,7 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
+obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o vmpressure.o
obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 23976fd885fd..b62972c80055 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -25,7 +25,7 @@
* GNU General Public License for more details.
*/
-#include <linux/res_counter.h>
+#include <linux/page_counter.h>
#include <linux/memcontrol.h>
#include <linux/cgroup.h>
#include <linux/mm.h>
@@ -165,7 +165,7 @@ struct mem_cgroup_per_zone {
struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];
struct rb_node tree_node; /* RB tree node */
- unsigned long long usage_in_excess;/* Set to the value by which */
+ unsigned long usage_in_excess;/* Set to the value by which */
/* the soft limit is exceeded*/
bool on_tree;
struct mem_cgroup *memcg; /* Back pointer, we cannot */
@@ -198,7 +198,7 @@ static struct mem_cgroup_tree soft_limit_tree __read_mostly;
struct mem_cgroup_threshold {
struct eventfd_ctx *eventfd;
- u64 threshold;
+ unsigned long threshold;
};
/* For threshold */
@@ -284,10 +284,13 @@ static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
*/
struct mem_cgroup {
struct cgroup_subsys_state css;
- /*
- * the counter to account for memory usage
- */
- struct res_counter res;
+
+ /* Accounted resources */
+ struct page_counter memory;
+ struct page_counter memsw;
+ struct page_counter kmem;
+
+ unsigned long soft_limit;
/* vmpressure notifications */
struct vmpressure vmpressure;
@@ -296,15 +299,6 @@ struct mem_cgroup {
int initialized;
/*
- * the counter to account for mem+swap usage.
- */
- struct res_counter memsw;
-
- /*
- * the counter to account for kernel memory usage.
- */
- struct res_counter kmem;
- /*
* Should the accounting and control be hierarchical, per subtree?
*/
bool use_hierarchy;
@@ -650,7 +644,7 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
* This check can't live in kmem destruction function,
* since the charges will outlive the cgroup
*/
- WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
+ WARN_ON(page_counter_read(&memcg->kmem));
}
#else
static void disarm_kmem_keys(struct mem_cgroup *memcg)
@@ -706,7 +700,7 @@ soft_limit_tree_from_page(struct page *page)
static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_zone *mz,
struct mem_cgroup_tree_per_zone *mctz,
- unsigned long long new_usage_in_excess)
+ unsigned long new_usage_in_excess)
{
struct rb_node **p = &mctz->rb_root.rb_node;
struct rb_node *parent = NULL;
@@ -755,10 +749,21 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
spin_unlock_irqrestore(&mctz->lock, flags);
}
+static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
+{
+ unsigned long nr_pages = page_counter_read(&memcg->memory);
+ unsigned long soft_limit = ACCESS_ONCE(memcg->soft_limit);
+ unsigned long excess = 0;
+
+ if (nr_pages > soft_limit)
+ excess = nr_pages - soft_limit;
+
+ return excess;
+}
static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
{
- unsigned long long excess;
+ unsigned long excess;
struct mem_cgroup_per_zone *mz;
struct mem_cgroup_tree_per_zone *mctz;
@@ -769,7 +774,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
*/
for (; memcg; memcg = parent_mem_cgroup(memcg)) {
mz = mem_cgroup_page_zoneinfo(memcg, page);
- excess = res_counter_soft_limit_excess(&memcg->res);
+ excess = soft_limit_excess(memcg);
/*
* We have to update the tree if mz is on RB-tree or
* mem is over its softlimit.
* position in the tree.
*/
__mem_cgroup_remove_exceeded(mz, mctz);
- if (!res_counter_soft_limit_excess(&mz->memcg->res) ||
+ if (!soft_limit_excess(mz->memcg) ||
!css_tryget_online(&mz->memcg->css))
goto retry;
@@ -1492,7 +1497,7 @@ int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
return inactive * inactive_ratio < active;
}
-#define mem_cgroup_from_res_counter(counter, member) \
+#define mem_cgroup_from_counter(counter, member) \
container_of(counter, struct mem_cgroup, member)
/**
@@ -1504,12 +1509,23 @@ int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
*/
static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
{
- unsigned long long margin;
+ unsigned long margin = 0;
+ unsigned long count;
+ unsigned long limit;
- margin = res_counter_margin(&memcg->res);
- if (do_swap_account)
- margin = min(margin, res_counter_margin(&memcg->memsw));
- return margin >> PAGE_SHIFT;
+ count = page_counter_read(&memcg->memory);
+ limit = ACCESS_ONCE(memcg->memory.limit);
+ if (count < limit)
+ margin = limit - count;
+
+ if (do_swap_account) {
+ count = page_counter_read(&memcg->memsw);
+ limit = ACCESS_ONCE(memcg->memsw.limit);
+ if (count <= limit)
+ margin = min(margin, limit - count);
+ }
+
+ return margin;
}
int mem_cgroup_swappiness(struct mem_cgroup *memcg)
@@ -1650,18 +1666,15 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
rcu_read_unlock();
- pr_info("memory: usage %llukB, limit %llukB, failcnt %llu\n",
- res_counter_read_u64(&memcg->res, RES_USAGE) >> 10,
- res_counter_read_u64(&memcg->res, RES_LIMIT) >> 10,
- res_counter_read_u64(&memcg->res, RES_FAILCNT));
- pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %llu\n",
- res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
- res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
- res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
- pr_info("kmem: usage %llukB, limit %llukB, failcnt %llu\n",
- res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
- res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
- res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
+ pr_info("memory: usage %llukB, limit %llukB, failcnt %lu\n",
+ K((u64)page_counter_read(&memcg->memory)),
+ K((u64)memcg->memory.limit), memcg->memory.failcnt);
+ pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %lu\n",
+ K((u64)page_counter_read(&memcg->memsw)),
+ K((u64)memcg->memsw.limit), memcg->memsw.failcnt);
+ pr_info("kmem: usage %llukB, limit %llukB, failcnt %lu\n",
+ K((u64)page_counter_read(&memcg->kmem)),
+ K((u64)memcg->kmem.limit), memcg->kmem.failcnt);
for_each_mem_cgroup_tree(iter, memcg) {
pr_info("Memory cgroup stats for ");
@@ -1701,28 +1714,17 @@ static int mem_cgroup_count_children(struct mem_cgroup *memcg)
/*
* Return the memory (and swap, if configured) limit for a memcg.
*/
-static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
+static unsigned long mem_cgroup_get_limit(struct mem_cgroup *memcg)
{
- u64 limit;
+ unsigned long limit;
- limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
-
- /*
- * Do not consider swap space if we cannot swap due to swappiness
- */
+ limit = memcg->memory.limit;
if (mem_cgroup_swappiness(memcg)) {
- u64 memsw;
+ unsigned long memsw_limit;
- limit += total_swap_pages << PAGE_SHIFT;
- memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
-
- /*
- * If memsw is finite and limits the amount of swap space
- * available to this memcg, return that limit.
- */
- limit = min(limit, memsw);
+ memsw_limit = memcg->memsw.limit;
+ limit = min(limit + total_swap_pages, memsw_limit);
}
-
return limit;
}
@@ -1746,7 +1748,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
}
check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
- totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
+ totalpages = mem_cgroup_get_limit(memcg) ? : 1;
for_each_mem_cgroup_tree(iter, memcg) {
struct css_task_iter it;
struct task_struct *task;
@@ -1949,7 +1951,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
.priority = 0,
};
- excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
+ excess = soft_limit_excess(root_memcg);
while (1) {
victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
@@ -1980,7 +1982,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
zone, &nr_scanned);
*total_scanned += nr_scanned;
- if (!res_counter_soft_limit_excess(&root_memcg->res))
+ if (!soft_limit_excess(root_memcg))
break;
}
mem_cgroup_iter_break(root_memcg, victim);
@@ -2307,33 +2309,31 @@ static DEFINE_MUTEX(percpu_charge_mutex);
static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
struct memcg_stock_pcp *stock;
- bool ret = true;
+ bool ret = false;
if (nr_pages > CHARGE_BATCH)
- return false;
+ return ret;
stock = &get_cpu_var(memcg_stock);
- if (memcg == stock->cached && stock->nr_pages >= nr_pages)
+ if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
stock->nr_pages -= nr_pages;
- else /* need to call res_counter_charge */
- ret = false;
+ ret = true;
+ }
put_cpu_var(memcg_stock);
return ret;
}
/*
- * Returns stocks cached in percpu to res_counter and reset cached information.
+ * Returns stocks cached in percpu and reset cached information.
*/
static void drain_stock(struct memcg_stock_pcp *stock)
{
struct mem_cgroup *old = stock->cached;
if (stock->nr_pages) {
- unsigned long bytes = stock->nr_pages * PAGE_SIZE;
-
- res_counter_uncharge(&old->res, bytes);
+ page_counter_uncharge(&old->memory, stock->nr_pages);
if (do_swap_account)
- res_counter_uncharge(&old->memsw, bytes);
+ page_counter_uncharge(&old->memsw, stock->nr_pages);
stock->nr_pages = 0;
}
stock->cached = NULL;
@@ -2362,7 +2362,7 @@ static void __init memcg_stock_init(void)
}
/*
- * Cache charges(val) which is from res_counter, to local per_cpu area.
+ * Cache charges(val) to local per_cpu area.
* This will be consumed by consume_stock() function, later.
*/
static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
/*
* Tries to drain stocked charges in other cpus. This function is asynchronous
* and just put a work per cpu for draining localy on each cpu. Caller can
- * expects some charges will be back to res_counter later but cannot wait for
- * it.
+ * expects some charges will be back later but cannot wait for it.
*/
static void drain_all_stock_async(struct mem_cgroup *root_memcg)
{
@@ -2497,9 +2496,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int batch = max(CHARGE_BATCH, nr_pages);
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct mem_cgroup *mem_over_limit;
- struct res_counter *fail_res;
+ struct page_counter *counter;
unsigned long nr_reclaimed;
- unsigned long long size;
bool may_swap = true;
bool drained = false;
int ret = 0;
if (consume_stock(memcg, nr_pages))
goto done;
- size = batch * PAGE_SIZE;
if (!do_swap_account ||
- !res_counter_charge(&memcg->memsw, size, &fail_res)) {
- if (!res_counter_charge(&memcg->res, size, &fail_res))
+ !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
+ if (!page_counter_try_charge(&memcg->memory, batch, &counter))
goto done_restock;
if (do_swap_account)
- res_counter_uncharge(&memcg->memsw, size);
- mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
+ page_counter_uncharge(&memcg->memsw, batch);
+ mem_over_limit = mem_cgroup_from_counter(counter, memory);
} else {
- mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
+ mem_over_limit = mem_cgroup_from_counter(counter, memsw);
may_swap = false;
}
static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
{
- unsigned long bytes = nr_pages * PAGE_SIZE;
-
if (mem_cgroup_is_root(memcg))
return;
- res_counter_uncharge(&memcg->res, bytes);
+ page_counter_uncharge(&memcg->memory, nr_pages);
if (do_swap_account)
- res_counter_uncharge(&memcg->memsw, bytes);
-}
-
-/*
- * Cancel chrages in this cgroup....doesn't propagate to parent cgroup.
- * This is useful when moving usage to parent cgroup.
- */
-static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
- unsigned int nr_pages)
-{
- unsigned long bytes = nr_pages * PAGE_SIZE;
-
- if (mem_cgroup_is_root(memcg))
- return;
-
- res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes);
- if (do_swap_account)
- res_counter_uncharge_until(&memcg->memsw,
- memcg->memsw.parent, bytes);
+ page_counter_uncharge(&memcg->memsw, nr_pages);
}
/*
@@ -2751,8 +2728,6 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
unlock_page_lru(page, isolated);
}
-static DEFINE_MUTEX(set_limit_mutex);
-
#ifdef CONFIG_MEMCG_KMEM
/*
* The memcg_slab_mutex is held whenever a per memcg kmem cache is created or
@@ -2795,16 +2770,17 @@ static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v)
}
#endif
-static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
+static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp,
+ unsigned long nr_pages)
{
- struct res_counter *fail_res;
+ struct page_counter *counter;
int ret = 0;
- ret = res_counter_charge(&memcg->kmem, size, &fail_res);
- if (ret)
+ ret = page_counter_try_charge(&memcg->kmem, nr_pages, &counter);
+ if (ret < 0)
return ret;
- ret = try_charge(memcg, gfp, size >> PAGE_SHIFT);
+ ret = try_charge(memcg, gfp, nr_pages);
if (ret == -EINTR) {
/*
* try_charge() chose to bypass to root due to OOM kill or
@@ -2821,25 +2797,25 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
* when the allocation triggers should have been already
* directed to the root cgroup in memcontrol.h
*/
- res_counter_charge_nofail(&memcg->res, size, &fail_res);
+ page_counter_charge(&memcg->memory, nr_pages);
if (do_swap_account)
- res_counter_charge_nofail(&memcg->memsw, size,
- &fail_res);
+ page_counter_charge(&memcg->memsw, nr_pages);
ret = 0;
} else if (ret)
- res_counter_uncharge(&memcg->kmem, size);
+ page_counter_uncharge(&memcg->kmem, nr_pages);
return ret;
}
-static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
+static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
+ unsigned long nr_pages)
{
- res_counter_uncharge(&memcg->res, size);
+ page_counter_uncharge(&memcg->memory, nr_pages);
if (do_swap_account)
- res_counter_uncharge(&memcg->memsw, size);
+ page_counter_uncharge(&memcg->memsw, nr_pages);
/* Not down to 0 */
- if (res_counter_uncharge(&memcg->kmem, size))
+ if (page_counter_uncharge(&memcg->kmem, nr_pages))
return;
/*
@@ -3115,19 +3091,21 @@ static void memcg_schedule_register_cache(struct mem_cgroup *memcg,
int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order)
{
+ unsigned int nr_pages = 1 << order;
int res;
- res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp,
- PAGE_SIZE << order);
+ res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp, nr_pages);
if (!res)
- atomic_add(1 << order, &cachep->memcg_params->nr_pages);
+ atomic_add(nr_pages, &cachep->memcg_params->nr_pages);
return res;
}
void __memcg_uncharge_slab(struct kmem_cache *cachep, int order)
{
- memcg_uncharge_kmem(cachep->memcg_params->memcg, PAGE_SIZE << order);
- atomic_sub(1 << order, &cachep->memcg_params->nr_pages);
+ unsigned int nr_pages = 1 << order;
+
+ memcg_uncharge_kmem(cachep->memcg_params->memcg, nr_pages);
+ atomic_sub(nr_pages, &cachep->memcg_params->nr_pages);
}
/*
@@ -3248,7 +3226,7 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
return true;
}
- ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order);
+ ret = memcg_charge_kmem(memcg, gfp, 1 << order);
if (!ret)
*_memcg = memcg;
@@ -3265,7 +3243,7 @@ void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
/* The page allocation failed. Revert */
if (!page) {
- memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
+ memcg_uncharge_kmem(memcg, 1 << order);
return;
}
/*
@@ -3298,7 +3276,7 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
return;
VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
- memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
+ memcg_uncharge_kmem(memcg, 1 << order);
}
#else
static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg)
@@ -3476,8 +3454,12 @@ static int mem_cgroup_move_parent(struct page *page,
ret = mem_cgroup_move_account(page, nr_pages,
pc, child, parent);
- if (!ret)
- __mem_cgroup_cancel_local_charge(child, nr_pages);
+ if (!ret) {
+ /* Take charge off the local counters */
+ page_counter_cancel(&child->memory, nr_pages);
+ if (do_swap_account)
+ page_counter_cancel(&child->memsw, nr_pages);
+ }
if (nr_pages > 1)
compound_unlock_irqrestore(page, flags);
@@ -3507,7 +3489,7 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
*
* Returns 0 on success, -EINVAL on failure.
*
* both res and memsw, and called css_get().
*/
static int mem_cgroup_move_swap_account(swp_entry_t entry,
@@ -3523,7 +3505,7 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
mem_cgroup_swap_statistics(to, true);
/*
* This function is only called from task migration context now.
- * It postpones res_counter and refcount handling till the end
+ * It postpones page_counter and refcount handling till the end
* of task migration(mem_cgroup_clear_mc()) for performance
* improvement. But we cannot postpone css_get(to) because if
@@ -3581,60 +3563,57 @@ void mem_cgroup_print_bad_page(struct page *page)
}
#endif
+static DEFINE_MUTEX(memcg_limit_mutex);
+
static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
- unsigned long long val)
+ unsigned long limit)
{
+ unsigned long curusage;
+ unsigned long oldusage;
+ bool enlarge = false;
int retry_count;
- int ret = 0;
- int children = mem_cgroup_count_children(memcg);
- u64 curusage, oldusage;
- int enlarge;
+ int ret;
/*
* For keeping hierarchical_reclaim simple, how long we should retry
* is depends on callers. We set our retry-count to be function
* of # of children which we should visit in this loop.
*/
- retry_count = MEM_CGROUP_RECLAIM_RETRIES * children;
+ retry_count = MEM_CGROUP_RECLAIM_RETRIES *
+ mem_cgroup_count_children(memcg);
- oldusage = res_counter_read_u64(&memcg->res, RES_USAGE);
+ oldusage = page_counter_read(&memcg->memory);
- enlarge = 0;
- while (retry_count) {
+ do {
if (signal_pending(current)) {
ret = -EINTR;
break;
}
- /*
- * Rather than hide all in some function, I do this in
- * open coded manner. You see what this really does.
- * We have to guarantee memcg->res.limit <= memcg->memsw.limit.
- */
- mutex_lock(&set_limit_mutex);
- if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val) {
+
+ mutex_lock(&memcg_limit_mutex);
+ if (limit > memcg->memsw.limit) {
+ mutex_unlock(&memcg_limit_mutex);
ret = -EINVAL;
- mutex_unlock(&set_limit_mutex);
break;
}
-
- if (res_counter_read_u64(&memcg->res, RES_LIMIT) < val)
- enlarge = 1;
-
- ret = res_counter_set_limit(&memcg->res, val);
- mutex_unlock(&set_limit_mutex);
+ if (limit > memcg->memory.limit)
+ enlarge = true;
+ ret = page_counter_limit(&memcg->memory, limit);
+ mutex_unlock(&memcg_limit_mutex);
if (!ret)
break;
try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true);
- curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
+ curusage = page_counter_read(&memcg->memory);
/* Usage is reduced ? */
if (curusage >= oldusage)
retry_count--;
else
oldusage = curusage;
- }
+ } while (retry_count);
+
if (!ret && enlarge)
memcg_oom_recover(memcg);
@@ -3642,52 +3621,53 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
}
static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
- unsigned long long val)
+ unsigned long limit)
{
+ unsigned long curusage;
+ unsigned long oldusage;
+ bool enlarge = false;
int retry_count;
- u64 oldusage, curusage;
- int children = mem_cgroup_count_children(memcg);
- int ret = -EBUSY;
- int enlarge = 0;
+ int ret;
/* see mem_cgroup_resize_res_limit */
- retry_count = children * MEM_CGROUP_RECLAIM_RETRIES;
- oldusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
- while (retry_count) {
+ retry_count = MEM_CGROUP_RECLAIM_RETRIES *
+ mem_cgroup_count_children(memcg);
+
+ oldusage = page_counter_read(&memcg->memsw);
+
+ do {
if (signal_pending(current)) {
ret = -EINTR;
break;
}
- /*
- * Rather than hide all in some function, I do this in
- * open coded manner. You see what this really does.
- * We have to guarantee memcg->res.limit <= memcg->memsw.limit.
- */
- mutex_lock(&set_limit_mutex);
- if (res_counter_read_u64(&memcg->res, RES_LIMIT) > val) {
+
+ mutex_lock(&memcg_limit_mutex);
+ if (limit < memcg->memory.limit) {
+ mutex_unlock(&memcg_limit_mutex);
ret = -EINVAL;
- mutex_unlock(&set_limit_mutex);
break;
}
- if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val)
- enlarge = 1;
- ret = res_counter_set_limit(&memcg->memsw, val);
- mutex_unlock(&set_limit_mutex);
+ if (limit > memcg->memsw.limit)
+ enlarge = true;
+ ret = page_counter_limit(&memcg->memsw, limit);
+ mutex_unlock(&memcg_limit_mutex);
if (!ret)
break;
try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false);
- curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
+ curusage = page_counter_read(&memcg->memsw);
/* Usage is reduced ? */
if (curusage >= oldusage)
retry_count--;
else
oldusage = curusage;
- }
+ } while (retry_count);
+
if (!ret && enlarge)
memcg_oom_recover(memcg);
+
return ret;
}
@@ -3700,7 +3680,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
unsigned long reclaimed;
int loop = 0;
struct mem_cgroup_tree_per_zone *mctz;
- unsigned long long excess;
+ unsigned long excess;
unsigned long nr_scanned;
if (order > 0)
@@ -3754,7 +3734,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
} while (1);
}
__mem_cgroup_remove_exceeded(mz, mctz);
- excess = res_counter_soft_limit_excess(&mz->memcg->res);
+ excess = soft_limit_excess(mz->memcg);
/*
* One school of thought says that we should not add
* back the node to the tree if reclaim returns 0.
@@ -3847,7 +3827,6 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
{
int node, zid;
- u64 usage;
do {
/* This is for making all *used* pages to be on LRU. */
@@ -3879,9 +3858,8 @@ static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
* right after the check. RES_USAGE should be safe as we always
* charge before adding to the LRU.
*/
- usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
- res_counter_read_u64(&memcg->kmem, RES_USAGE);
- } while (usage > 0);
+ } while (page_counter_read(&memcg->memory) -
+ page_counter_read(&memcg->kmem) > 0);
}
/*
@@ -3921,7 +3899,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
/* we call try-to-free pages for make this cgroup empty */
lru_add_drain_all();
/* try to free all pages in this cgroup */
- while (nr_retries && res_counter_read_u64(&memcg->res, RES_USAGE) > 0) {
+ while (nr_retries && page_counter_read(&memcg->memory)) {
int progress;
if (signal_pending(current))
return retval;
}
-static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *memcg,
- enum mem_cgroup_stat_index idx)
+static unsigned long tree_stat(struct mem_cgroup *memcg,
+ enum mem_cgroup_stat_index idx)
{
struct mem_cgroup *iter;
long val = 0;
@@ -4011,55 +3989,72 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
{
u64 val;
- if (!mem_cgroup_is_root(memcg)) {
+ if (mem_cgroup_is_root(memcg)) {
+ val = tree_stat(memcg, MEM_CGROUP_STAT_CACHE);
+ val += tree_stat(memcg, MEM_CGROUP_STAT_RSS);
+ if (swap)
+ val += tree_stat(memcg, MEM_CGROUP_STAT_SWAP);
+ } else {
if (!swap)
- return res_counter_read_u64(&memcg->res, RES_USAGE);
+ val = page_counter_read(&memcg->memory);
else
- return res_counter_read_u64(&memcg->memsw, RES_USAGE);
+ val = page_counter_read(&memcg->memsw);
}
-
- /*
- * Transparent hugepages are still accounted for in MEM_CGROUP_STAT_RSS
- * as well as in MEM_CGROUP_STAT_RSS_HUGE.
- */
- val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE);
- val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS);
-
- if (swap)
- val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SWAP);
-
return val << PAGE_SHIFT;
}
+enum {
+ RES_USAGE,
+ RES_LIMIT,
+ RES_MAX_USAGE,
+ RES_FAILCNT,
+ RES_SOFT_LIMIT,
+};
static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
- enum res_type type = MEMFILE_TYPE(cft->private);
- int name = MEMFILE_ATTR(cft->private);
+ struct page_counter *counter;
- switch (type) {
+ switch (MEMFILE_TYPE(cft->private)) {
- if (name == RES_USAGE)
- return mem_cgroup_usage(memcg, false);
- return res_counter_read_u64(&memcg->res, name);
+ counter = &memcg->memory;
+ break;
- if (name == RES_USAGE)
- return mem_cgroup_usage(memcg, true);
- return res_counter_read_u64(&memcg->memsw, name);
+ counter = &memcg->memsw;
+ break;
- return res_counter_read_u64(&memcg->kmem, name);
+ counter = &memcg->kmem;
break;
BUG();
}
+
+ switch (MEMFILE_ATTR(cft->private)) {
+ if (counter == &memcg->memory)
+ return mem_cgroup_usage(memcg, false);
+ if (counter == &memcg->memsw)
+ return mem_cgroup_usage(memcg, true);
+ return (u64)page_counter_read(counter) * PAGE_SIZE;
+ return (u64)counter->limit * PAGE_SIZE;
+ return (u64)counter->watermark * PAGE_SIZE;
+ return counter->failcnt;
+ return (u64)memcg->soft_limit * PAGE_SIZE;
+ BUG();
+ }
}
#ifdef CONFIG_MEMCG_KMEM
/* should be called with activate_kmem_mutex held */
static int __memcg_activate_kmem(struct mem_cgroup *memcg,
- unsigned long long limit)
+ unsigned long nr_pages)
{
int err = 0;
int memcg_id;
@@ -4106,7 +4101,7 @@ static int __memcg_activate_kmem(struct mem_cgroup *memcg,
* We couldn't have accounted to this cgroup, because it hasn't got the
* active bit set yet, so this should succeed.
*/
- err = res_counter_set_limit(&memcg->kmem, limit);
+ err = page_counter_limit(&memcg->kmem, nr_pages);
VM_BUG_ON(err);
static_key_slow_inc(&memcg_kmem_enabled_key);
}
static int memcg_activate_kmem(struct mem_cgroup *memcg,
- unsigned long long limit)
+ unsigned long nr_pages)
{
int ret;
mutex_lock(&activate_kmem_mutex);
- ret = __memcg_activate_kmem(memcg, limit);
+ ret = __memcg_activate_kmem(memcg, nr_pages);
mutex_unlock(&activate_kmem_mutex);
return ret;
}
static int memcg_update_kmem_limit(struct mem_cgroup *memcg,
- unsigned long long val)
+ unsigned long limit)
{
int ret;
+ mutex_lock(&memcg_limit_mutex);
if (!memcg_kmem_is_active(memcg))
- ret = memcg_activate_kmem(memcg, val);
+ ret = memcg_activate_kmem(memcg, limit);
else
- ret = res_counter_set_limit(&memcg->kmem, val);
+ ret = page_counter_limit(&memcg->kmem, limit);
+ mutex_unlock(&memcg_limit_mutex);
return ret;
}
@@ -4158,13 +4155,13 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
* after this point, because it has at least one child already.
*/
if (memcg_kmem_is_active(parent))
- ret = __memcg_activate_kmem(memcg, RES_COUNTER_MAX);
+ ret = __memcg_activate_kmem(memcg, PAGE_COUNTER_MAX);
mutex_unlock(&activate_kmem_mutex);
return ret;
}
#else
static int memcg_update_kmem_limit(struct mem_cgroup *memcg,
- unsigned long long val)
+ unsigned long limit)
{
return -EINVAL;
}
@@ -4178,110 +4175,69 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
- enum res_type type;
- int name;
- unsigned long long val;
+ unsigned long nr_pages;
int ret;
buf = strstrip(buf);
- type = MEMFILE_TYPE(of_cft(of)->private);
- name = MEMFILE_ATTR(of_cft(of)->private);
+ ret = page_counter_memparse(buf, &nr_pages);
+ if (ret)
+ return ret;
- switch (name) {
+ switch (MEMFILE_ATTR(of_cft(of)->private)) {
if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */
ret = -EINVAL;
break;
}
- /* This function does all necessary parse...reuse it */
- ret = res_counter_memparse_write_strategy(buf, &val);
- if (ret)
+ switch (MEMFILE_TYPE(of_cft(of)->private)) {
+ ret = mem_cgroup_resize_limit(memcg, nr_pages);
break;
- if (type == _MEM)
- ret = mem_cgroup_resize_limit(memcg, val);
- else if (type == _MEMSWAP)
- ret = mem_cgroup_resize_memsw_limit(memcg, val);
- else if (type == _KMEM)
- ret = memcg_update_kmem_limit(memcg, val);
- else
- return -EINVAL;
- break;
- ret = res_counter_memparse_write_strategy(buf, &val);
- if (ret)
+ ret = mem_cgroup_resize_memsw_limit(memcg, nr_pages);
break;
- /*
- * For memsw, soft limits are hard to implement in terms
- * of semantics, for now, we support soft limits for
- * control without swap
- */
- if (type == _MEM)
- ret = res_counter_set_soft_limit(&memcg->res, val);
- else
- ret = -EINVAL;
+ ret = memcg_update_kmem_limit(memcg, nr_pages);
+ break;
+ }
break;
- ret = -EINVAL; /* should be BUG() ? */
+ memcg->soft_limit = nr_pages;
+ ret = 0;
break;
}
return ret ?: nbytes;
}
-static void memcg_get_hierarchical_limit(struct mem_cgroup *memcg,
- unsigned long long *mem_limit, unsigned long long *memsw_limit)
-{
- unsigned long long min_limit, min_memsw_limit, tmp;
-
- min_limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
- min_memsw_limit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
- if (!memcg->use_hierarchy)
- goto out;
-
- while (memcg->css.parent) {
- memcg = mem_cgroup_from_css(memcg->css.parent);
- if (!memcg->use_hierarchy)
- break;
- tmp = res_counter_read_u64(&memcg->res, RES_LIMIT);
- min_limit = min(min_limit, tmp);
- tmp = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
- min_memsw_limit = min(min_memsw_limit, tmp);
- }
- *mem_limit = min_limit;
- *memsw_limit = min_memsw_limit;
-}
-
static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
- int name;
- enum res_type type;
+ struct page_counter *counter;
- type = MEMFILE_TYPE(of_cft(of)->private);
- name = MEMFILE_ATTR(of_cft(of)->private);
+ switch (MEMFILE_TYPE(of_cft(of)->private)) {
+ counter = &memcg->memory;
+ break;
+ counter = &memcg->memsw;
+ break;
+ counter = &memcg->kmem;
+ break;
+ BUG();
+ }
- switch (name) {
+ switch (MEMFILE_ATTR(of_cft(of)->private)) {
- if (type == _MEM)
- res_counter_reset_max(&memcg->res);
- else if (type == _MEMSWAP)
- res_counter_reset_max(&memcg->memsw);
- else if (type == _KMEM)
- res_counter_reset_max(&memcg->kmem);
- else
- return -EINVAL;
+ page_counter_reset_watermark(counter);
break;
- if (type == _MEM)
- res_counter_reset_failcnt(&memcg->res);
- else if (type == _MEMSWAP)
- res_counter_reset_failcnt(&memcg->memsw);
- else if (type == _KMEM)
- res_counter_reset_failcnt(&memcg->kmem);
- else
- return -EINVAL;
+ counter->failcnt = 0;
break;
+ BUG();
}
return nbytes;
@@ -4378,6 +4334,7 @@ static inline void mem_cgroup_lru_names_not_uptodate(void)
static int memcg_stat_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+ unsigned long memory, memsw;
struct mem_cgroup *mi;
unsigned int i;
@@ -4397,14 +4354,16 @@ static int memcg_stat_show(struct seq_file *m, void *v)
mem_cgroup_nr_lru_pages(memcg, BIT(i)) * PAGE_SIZE);
/* Hierarchical information */
- {
- unsigned long long limit, memsw_limit;
- memcg_get_hierarchical_limit(memcg, &limit, &memsw_limit);
- seq_printf(m, "hierarchical_memory_limit %llu\n", limit);
- if (do_swap_account)
- seq_printf(m, "hierarchical_memsw_limit %llu\n",
- memsw_limit);
+ memory = memsw = PAGE_COUNTER_MAX;
+ for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) {
+ memory = min(memory, mi->memory.limit);
+ memsw = min(memsw, mi->memsw.limit);
}
+ seq_printf(m, "hierarchical_memory_limit %llu\n",
+ (u64)memory * PAGE_SIZE);
+ if (do_swap_account)
+ seq_printf(m, "hierarchical_memsw_limit %llu\n",
+ (u64)memsw * PAGE_SIZE);
for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
long long val = 0;
@@ -4488,7 +4447,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
{
struct mem_cgroup_threshold_ary *t;
- u64 usage;
+ unsigned long usage;
int i;
rcu_read_lock();
@@ -4587,10 +4546,11 @@ static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
{
struct mem_cgroup_thresholds *thresholds;
struct mem_cgroup_threshold_ary *new;
- u64 threshold, usage;
+ unsigned long threshold;
+ unsigned long usage;
int i, size, ret;
- ret = res_counter_memparse_write_strategy(args, &threshold);
+ ret = page_counter_memparse(args, &threshold);
if (ret)
return ret;
@@ -4680,7 +4640,7 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
{
struct mem_cgroup_thresholds *thresholds;
struct mem_cgroup_threshold_ary *new;
- u64 usage;
+ unsigned long usage;
int i, j, size;
mutex_lock(&memcg->thresholds_lock);
@@ -4874,7 +4834,7 @@ static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
memcg_kmem_mark_dead(memcg);
- if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
+ if (page_counter_read(&memcg->kmem))
return;
if (memcg_kmem_test_and_clear_dead(memcg))
@@ -5354,9 +5314,9 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
*/
struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
{
- if (!memcg->res.parent)
+ if (!memcg->memory.parent)
return NULL;
- return mem_cgroup_from_res_counter(memcg->res.parent, res);
+ return mem_cgroup_from_counter(memcg->memory.parent, memory);
}
EXPORT_SYMBOL(parent_mem_cgroup);
@@ -5401,9 +5361,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
/* root ? */
if (parent_css == NULL) {
root_mem_cgroup = memcg;
- res_counter_init(&memcg->res, NULL);
- res_counter_init(&memcg->memsw, NULL);
- res_counter_init(&memcg->kmem, NULL);
+ page_counter_init(&memcg->memory, NULL);
+ page_counter_init(&memcg->memsw, NULL);
+ page_counter_init(&memcg->kmem, NULL);
}
memcg->last_scanned_node = MAX_NUMNODES;
@@ -5442,18 +5402,18 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
memcg->swappiness = mem_cgroup_swappiness(parent);
if (parent->use_hierarchy) {
- res_counter_init(&memcg->res, &parent->res);
- res_counter_init(&memcg->memsw, &parent->memsw);
- res_counter_init(&memcg->kmem, &parent->kmem);
+ page_counter_init(&memcg->memory, &parent->memory);
+ page_counter_init(&memcg->memsw, &parent->memsw);
+ page_counter_init(&memcg->kmem, &parent->kmem);
/*
* No need to take a reference to the parent because cgroup
* core guarantees its existence.
*/
} else {
- res_counter_init(&memcg->res, NULL);
- res_counter_init(&memcg->memsw, NULL);
- res_counter_init(&memcg->kmem, NULL);
+ page_counter_init(&memcg->memory, NULL);
+ page_counter_init(&memcg->memsw, NULL);
+ page_counter_init(&memcg->kmem, NULL);
/*
* Deeper hierachy with use_hierarchy == false doesn't make
* much sense so let cgroup subsystem know about this
@@ -5535,7 +5495,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
/*
* XXX: css_offline() would be where we should reparent all
* memory to prepare the cgroup for destruction. However,
- * memcg does not do css_tryget_online() and res_counter charging
+ * memcg does not do css_tryget_online() and page_counter charging
* under the same RCU lock region, which means that charging
* could race with offlining. Offlining only happens to
* cgroups with no tasks in them but charges can show up
@@ -5555,7 +5515,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
* call_rcu()
* offline_css()
* reparent_charges()
- * res_counter_charge()
+ * page_counter_try_charge()
* css_put()
* css_free()
* pc->mem_cgroup = dead memcg
@@ -5590,10 +5550,10 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
- mem_cgroup_resize_limit(memcg, ULLONG_MAX);
- mem_cgroup_resize_memsw_limit(memcg, ULLONG_MAX);
- memcg_update_kmem_limit(memcg, ULLONG_MAX);
- res_counter_set_soft_limit(&memcg->res, ULLONG_MAX);
+ mem_cgroup_resize_limit(memcg, PAGE_COUNTER_MAX);
+ mem_cgroup_resize_memsw_limit(memcg, PAGE_COUNTER_MAX);
+ memcg_update_kmem_limit(memcg, PAGE_COUNTER_MAX);
+ memcg->soft_limit = 0;
}
#ifdef CONFIG_MMU
@@ -5907,19 +5867,18 @@ static void __mem_cgroup_clear_mc(void)
if (mc.moved_swap) {
/* uncharge swap account from the old cgroup */
if (!mem_cgroup_is_root(mc.from))
- res_counter_uncharge(&mc.from->memsw,
- PAGE_SIZE * mc.moved_swap);
-
- for (i = 0; i < mc.moved_swap; i++)
- css_put(&mc.from->css);
+ page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
/*
- * we charged both to->res and to->memsw, so we should
- * uncharge to->res.
+ * we charged both to->memory and to->memsw, so we
+ * should uncharge to->memory.
*/
if (!mem_cgroup_is_root(mc.to))
- res_counter_uncharge(&mc.to->res,
- PAGE_SIZE * mc.moved_swap);
+ page_counter_uncharge(&mc.to->memory, mc.moved_swap);
+
+ for (i = 0; i < mc.moved_swap; i++)
+ css_put(&mc.from->css);
+
/* we've already done css_get(mc.to) */
mc.moved_swap = 0;
}
@@ -6285,7 +6244,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry)
memcg = mem_cgroup_lookup(id);
if (memcg) {
if (!mem_cgroup_is_root(memcg))
- res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+ page_counter_uncharge(&memcg->memsw, 1);
mem_cgroup_swap_statistics(memcg, false);
css_put(&memcg->css);
}
@@ -6451,11 +6410,9 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
if (!mem_cgroup_is_root(memcg)) {
if (nr_mem)
- res_counter_uncharge(&memcg->res,
- nr_mem * PAGE_SIZE);
+ page_counter_uncharge(&memcg->memory, nr_mem);
if (nr_memsw)
- res_counter_uncharge(&memcg->memsw,
- nr_memsw * PAGE_SIZE);
+ page_counter_uncharge(&memcg->memsw, nr_memsw);
memcg_oom_recover(memcg);
}
diff --git a/mm/page_counter.c b/mm/page_counter.c
new file mode 100644
index 000000000000..fc4990c6bb5b
--- /dev/null
+++ b/mm/page_counter.c
@@ -0,0 +1,203 @@
+/*
+ * Lockless hierarchical page accounting & limiting
+ *
+ * Copyright (C) 2014 Red Hat, Inc., Johannes Weiner
+ */
+
+#include <linux/page_counter.h>
+#include <linux/atomic.h>
+#include <linux/sched.h>
+
+/**
+ * page_counter_cancel - take pages out of the local counter
+ *
+ * Returns whether there are remaining pages in the counter.
+ */
+int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
+{
+ long new;
+
+ new = atomic_long_sub_return(nr_pages, &counter->count);
+
+ /* More uncharges than charges? */
+ WARN_ON_ONCE(new < 0);
+
+ return new > 0;
+}
+
+/**
+ * page_counter_charge - hierarchically charge pages
+ *
+ * NOTE: This does not consider any configured counter limits.
+ */
+void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
+{
+ struct page_counter *c;
+
+ for (c = counter; c; c = c->parent) {
+ long new;
+
+ new = atomic_long_add_return(nr_pages, &c->count);
+ /*
+ * This is indeed racy, but we can live with some
+ * inaccuracy in the watermark.
+ */
+ if (new > c->watermark)
+ c->watermark = new;
+ }
+}
+
+/**
+ * page_counter_try_charge - try to hierarchically charge pages
+ *
+ * its ancestors has hit its configured limit.
+ */
+int page_counter_try_charge(struct page_counter *counter,
+ unsigned long nr_pages,
+ struct page_counter **fail)
+{
+ struct page_counter *c;
+
+ for (c = counter; c; c = c->parent) {
+ long new;
+ /*
+ * Charge speculatively to avoid an expensive CAS. If
+ * a bigger charge fails, it might falsely lock out a
+ * racing smaller charge and send it into reclaim
+ * early, but the error is limited to the difference
+ * between the two sizes, which is less than 2M/4M in
+ * case of a THP locking out a regular page charge.
+ *
+ * The atomic_long_add_return() implies a full memory
+ * barrier between incrementing the count and reading
+ * the limit. When racing with page_counter_limit(),
+ * we either see the new limit or the setter sees the
+ * counter has changed and retries.
+ */
+ new = atomic_long_add_return(nr_pages, &c->count);
+ if (new > c->limit) {
+ atomic_long_sub(nr_pages, &c->count);
+ /*
+ * This is racy, but we can live with some
+ * inaccuracy in the failcnt.
+ */
+ c->failcnt++;
+ *fail = c;
+ goto failed;
+ }
+ /*
+ * Just like with failcnt, we can live with some
+ * inaccuracy in the watermark.
+ */
+ if (new > c->watermark)
+ c->watermark = new;
+ }
+ return 0;
+
+ for (c = counter; c != *fail; c = c->parent)
+ page_counter_cancel(c, nr_pages);
+
+ return -ENOMEM;
+}
+
+/**
+ * page_counter_uncharge - hierarchically uncharge pages
+ *
+ */
+int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
+{
+ struct page_counter *c;
+ int ret = 1;
+
+ for (c = counter; c; c = c->parent) {
+ int remainder;
+
+ remainder = page_counter_cancel(c, nr_pages);
+ if (c == counter && !remainder)
+ ret = 0;
+ }
+
+ return ret;
+}
+
+/**
+ * page_counter_limit - limit the number of pages allowed
+ *
+ * Returns 0 on success, -EBUSY if the current number of pages on the
+ * counter already exceeds the specified limit.
+ *
+ * The caller must serialize invocations on the same counter.
+ */
+int page_counter_limit(struct page_counter *counter, unsigned long limit)
+{
+ for (;;) {
+ unsigned long old;
+ long count;
+
+ /*
+ * Update the limit while making sure that it's not
+ * below the concurrently-changing counter value.
+ *
+ * The xchg implies two full memory barriers before
+ * and after, so the read-swap-read is ordered and
+ * that function modifies the count before checking
+ * the limit, so if it sees the old limit, we see the
+ * modified counter and retry.
+ */
+ count = atomic_long_read(&counter->count);
+
+ if (count > limit)
+ return -EBUSY;
+
+ old = xchg(&counter->limit, limit);
+
+ if (atomic_long_read(&counter->count) <= count)
+ return 0;
+
+ counter->limit = old;
+ cond_resched();
+ }
+}
+
+/**
+ * page_counter_memparse - memparse() for page counter limits
+ *
+ * limited to %PAGE_COUNTER_MAX.
+ */
+int page_counter_memparse(const char *buf, unsigned long *nr_pages)
+{
+ char unlimited[] = "-1";
+ char *end;
+ u64 bytes;
+
+ if (!strncmp(buf, unlimited, sizeof(unlimited))) {
+ *nr_pages = PAGE_COUNTER_MAX;
+ return 0;
+ }
+
+ bytes = memparse(buf, &end);
+ if (*end != '\0')
+ return -EINVAL;
+
+ *nr_pages = min(bytes / PAGE_SIZE, (u64)PAGE_COUNTER_MAX);
+
+ return 0;
+}
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 1d191357bf88..272327134a1b 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -9,13 +9,13 @@
int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
/*
- * The root cgroup does not use res_counters, but rather,
+ * The root cgroup does not use page_counters, but rather,
* rely on the data already collected by the network
* subsystem
*/
- struct res_counter *res_parent = NULL;
- struct cg_proto *cg_proto, *parent_cg;
struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+ struct page_counter *counter_parent = NULL;
+ struct cg_proto *cg_proto, *parent_cg;
cg_proto = tcp_prot.proto_cgroup(memcg);
if (!cg_proto)
@@ -29,9 +29,9 @@ int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
parent_cg = tcp_prot.proto_cgroup(parent);
if (parent_cg)
- res_parent = &parent_cg->memory_allocated;
+ counter_parent = &parent_cg->memory_allocated;
- res_counter_init(&cg_proto->memory_allocated, res_parent);
+ page_counter_init(&cg_proto->memory_allocated, counter_parent);
percpu_counter_init(&cg_proto->sockets_allocated, 0, GFP_KERNEL);
return 0;
@@ -50,7 +50,7 @@ void tcp_destroy_cgroup(struct mem_cgroup *memcg)
}
EXPORT_SYMBOL(tcp_destroy_cgroup);
-static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
+static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
{
struct cg_proto *cg_proto;
int i;
@@ -60,20 +60,17 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
if (!cg_proto)
return -EINVAL;
- if (val > RES_COUNTER_MAX)
- val = RES_COUNTER_MAX;
-
- ret = res_counter_set_limit(&cg_proto->memory_allocated, val);
+ ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages);
if (ret)
return ret;
for (i = 0; i < 3; i++)
- cg_proto->sysctl_mem[i] = min_t(long, val >> PAGE_SHIFT,
+ cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
sysctl_tcp_mem[i]);
- if (val == RES_COUNTER_MAX)
+ if (nr_pages == PAGE_COUNTER_MAX)
clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
- else if (val != RES_COUNTER_MAX) {
+ else {
/*
* The active bit needs to be written after the static_key
* update. This is what guarantees that the socket activation
@@ -102,11 +99,20 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
return 0;
}
+enum {
+ RES_USAGE,
+ RES_LIMIT,
+ RES_MAX_USAGE,
+ RES_FAILCNT,
+};
+
+static DEFINE_MUTEX(tcp_limit_mutex);
+
static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
- unsigned long long val;
+ unsigned long nr_pages;
int ret = 0;
buf = strstrip(buf);
@@ -114,10 +120,12 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
switch (of_cft(of)->private) {
/* see memcontrol.c */
- ret = res_counter_memparse_write_strategy(buf, &val);
+ ret = page_counter_memparse(buf, &nr_pages);
if (ret)
break;
- ret = tcp_update_limit(memcg, val);
+ mutex_lock(&tcp_limit_mutex);
+ ret = tcp_update_limit(memcg, nr_pages);
+ mutex_unlock(&tcp_limit_mutex);
break;
ret = -EINVAL;
@@ -126,43 +134,36 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
return ret ?: nbytes;
}
-static u64 tcp_read_stat(struct mem_cgroup *memcg, int type, u64 default_val)
-{
- struct cg_proto *cg_proto;
-
- cg_proto = tcp_prot.proto_cgroup(memcg);
- if (!cg_proto)
- return default_val;
-
- return res_counter_read_u64(&cg_proto->memory_allocated, type);
-}
-
-static u64 tcp_read_usage(struct mem_cgroup *memcg)
-{
- struct cg_proto *cg_proto;
-
- cg_proto = tcp_prot.proto_cgroup(memcg);
- if (!cg_proto)
- return atomic_long_read(&tcp_memory_allocated) << PAGE_SHIFT;
-
- return res_counter_read_u64(&cg_proto->memory_allocated, RES_USAGE);
-}
-
static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+ struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg);
u64 val;
switch (cft->private) {
- val = tcp_read_stat(memcg, RES_LIMIT, RES_COUNTER_MAX);
+ if (!cg_proto)
+ return PAGE_COUNTER_MAX;
+ val = cg_proto->memory_allocated.limit;
+ val *= PAGE_SIZE;
break;
- val = tcp_read_usage(memcg);
+ if (!cg_proto)
+ val = atomic_long_read(&tcp_memory_allocated);
+ else
+ val = page_counter_read(&cg_proto->memory_allocated);
+ val *= PAGE_SIZE;
break;
+ if (!cg_proto)
+ return 0;
+ val = cg_proto->memory_allocated.failcnt;
+ break;
- val = tcp_read_stat(memcg, cft->private, 0);
+ if (!cg_proto)
+ return 0;
+ val = cg_proto->memory_allocated.watermark;
+ val *= PAGE_SIZE;
break;
BUG();
@@ -183,10 +184,10 @@ static ssize_t tcp_cgroup_reset(struct kernfs_open_file *of,
switch (of_cft(of)->private) {
- res_counter_reset_max(&cg_proto->memory_allocated);
+ page_counter_reset_watermark(&cg_proto->memory_allocated);
break;
- res_counter_reset_failcnt(&cg_proto->memory_allocated);
+ cg_proto->memory_allocated.failcnt = 0;
break;
}
--
2.1.2
--
Michal Hocko
SUSE Labs
Johannes Weiner
2014-10-14 16:33:54 UTC
Permalink
Post by Michal Hocko
Post by Johannes Weiner
Memory is internally accounted in bytes, using spinlock-protected
64-bit counters, even though the smallest accounting delta is a page.
The counter interface is also convoluted and does too many things.
Introduce a new lockless word-sized page counter API, then change all
memory accounting over to it. The translation from and to bytes then
only happens when interfacing with userspace.
The removed locking overhead is noticable when scaling beyond the
per-cpu charge caches - on a 4-socket machine with 144-threads, the
following test shows the performance differences of 288 memcgs
I assume you had root.use_hierarchy = 1, right? Processes wouldn't bounce
on the same lock otherwise.
Yep. That's already the default on most distros, and will be in the
kernel in unified hierarchy.
Post by Michal Hocko
You have only missed MAINTAINERS...
Hm, we can add it, but then again scripts/get_maintainer.pl should
already do the right thing. I find myself using it with --git all the
time to find the people that actually worked on the code recently, not
just the ones listed in there - which might be stale information.
Thank you, Michal!
Michal Hocko
2014-10-15 09:40:31 UTC
Permalink
[...]
Post by Johannes Weiner
Post by Michal Hocko
You have only missed MAINTAINERS...
Hm, we can add it, but then again scripts/get_maintainer.pl should
already do the right thing.
OK, that would work as well.
--
Michal Hocko
SUSE Labs
Vladimir Davydov
2014-10-17 07:47:18 UTC
Permalink
Post by Johannes Weiner
Memory is internally accounted in bytes, using spinlock-protected
64-bit counters, even though the smallest accounting delta is a page.
The counter interface is also convoluted and does too many things.
Introduce a new lockless word-sized page counter API, then change all
memory accounting over to it. The translation from and to bytes then
only happens when interfacing with userspace.
The removed locking overhead is noticable when scaling beyond the
per-cpu charge caches - on a 4-socket machine with 144-threads, the
following test shows the performance differences of 288 memcgs
18631648.500498 task-clock (msec) # 140.643 CPUs utilized ( +- 0.33% )
1,380,638 context-switches # 0.074 K/sec ( +- 0.75% )
24,390 cpu-migrations # 0.001 K/sec ( +- 8.44% )
1,843,305,768 page-faults # 0.099 M/sec ( +- 0.00% )
50,134,994,088,218 cycles # 2.691 GHz ( +- 0.33% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
8,049,712,224,651 instructions # 0.16 insns per cycle ( +- 0.04% )
1,586,970,584,979 branches # 85.176 M/sec ( +- 0.05% )
1,724,989,949 branch-misses # 0.11% of all branches ( +- 0.48% )
132.474343877 seconds time elapsed ( +- 0.21% )
12195979.037525 task-clock (msec) # 133.480 CPUs utilized ( +- 0.18% )
832,850 context-switches # 0.068 K/sec ( +- 0.54% )
15,624 cpu-migrations # 0.001 K/sec ( +- 10.17% )
1,843,304,774 page-faults # 0.151 M/sec ( +- 0.00% )
32,811,216,801,141 cycles # 2.690 GHz ( +- 0.18% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
9,999,265,091,727 instructions # 0.30 insns per cycle ( +- 0.10% )
2,076,759,325,203 branches # 170.282 M/sec ( +- 0.12% )
1,656,917,214 branch-misses # 0.08% of all branches ( +- 0.55% )
91.369330729 seconds time elapsed ( +- 0.45% )
On top of improved scalability, this also gets rid of the icky long
long types in the very heart of memcg, which is great for 32 bit and
also makes the code a lot more readable.
- res_counter_charge() and res_counter_charge_nofail() become
page_counter_try_charge() and page_counter_charge() resp. to match
the more common kernel naming scheme of try_do()/do()
- res_counter_uncharge_until() is only ever used to cancel a local
counter and never to uncharge bigger segments of a hierarchy, so
it's replaced by the simpler page_counter_cancel()
- res_counter_set_limit() is replaced by page_counter_limit(), which
expects its callers to serialize against themselves
- res_counter_memparse_write_strategy() is replaced by
page_counter_limit(), which rounds down to the nearest page size -
rather than up. This is more reasonable for explicitely requested
hard upper limits.
- to keep charging light-weight, page_counter_try_charge() charges
speculatively, only to roll back if the result exceeds the limit.
Because of this, a failing bigger charge can temporarily lock out
smaller charges that would otherwise succeed. The error is bounded
to the difference between the smallest and the biggest possible
charge size, so for memcg, this means that a failing THP charge can
send base page charges into reclaim upto 2MB (4MB) before the limit
would have been reached. This should be acceptable.
Definitely better than it was.

Acked-by: Vladimir Davydov <***@parallels.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Johannes Weiner
2014-10-14 01:46:02 UTC
Permalink
Abandon the spinlock-protected byte counters in favor of the unlocked
page counters in the hugetlb controller as well.

Signed-off-by: Johannes Weiner <***@cmpxchg.org>
Reviewed-by: Vladimir Davydov <***@parallels.com>
Acked-by: Michal Hocko <***@suse.cz>
---
Documentation/cgroups/hugetlb.txt | 2 +-
include/linux/hugetlb_cgroup.h | 1 -
init/Kconfig | 3 +-
mm/hugetlb_cgroup.c | 103 +++++++++++++++++++++-----------------
4 files changed, 61 insertions(+), 48 deletions(-)

diff --git a/Documentation/cgroups/hugetlb.txt b/Documentation/cgroups/hugetlb.txt
index a9faaca1f029..106245c3aecc 100644
--- a/Documentation/cgroups/hugetlb.txt
+++ b/Documentation/cgroups/hugetlb.txt
@@ -29,7 +29,7 @@ Brief summary of control files

hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage
hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
- hugetlb.<hugepagesize>.usage_in_bytes # show current res_counter usage for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb
hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB limit

For a system supporting two hugepage size (16M and 16G) the control
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 0129f89cf98d..bcc853eccc85 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -16,7 +16,6 @@
#define _LINUX_HUGETLB_CGROUP_H

#include <linux/mmdebug.h>
-#include <linux/res_counter.h>

struct hugetlb_cgroup;
/*
diff --git a/init/Kconfig b/init/Kconfig
index 3fc577e9de75..d07a1c78d4e7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1058,7 +1058,8 @@ config MEMCG_KMEM

config CGROUP_HUGETLB
bool "HugeTLB Resource Controller for Control Groups"
- depends on RESOURCE_COUNTERS && HUGETLB_PAGE
+ depends on HUGETLB_PAGE
+ select PAGE_COUNTER
default n
help
Provides a cgroup Resource Controller for HugeTLB pages.
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index a67c26e0f360..037e1c00a5b7 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -14,6 +14,7 @@
*/

#include <linux/cgroup.h>
+#include <linux/page_counter.h>
#include <linux/slab.h>
#include <linux/hugetlb.h>
#include <linux/hugetlb_cgroup.h>
@@ -23,7 +24,7 @@ struct hugetlb_cgroup {
/*
* the counter to account for hugepages from hugetlb.
*/
- struct res_counter hugepage[HUGE_MAX_HSTATE];
+ struct page_counter hugepage[HUGE_MAX_HSTATE];
};

#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
@@ -60,7 +61,7 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
int idx;

for (idx = 0; idx < hugetlb_max_hstate; idx++) {
- if ((res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE)) > 0)
+ if (page_counter_read(&h_cg->hugepage[idx]))
return true;
}
return false;
@@ -79,12 +80,12 @@ hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)

if (parent_h_cgroup) {
for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
- res_counter_init(&h_cgroup->hugepage[idx],
- &parent_h_cgroup->hugepage[idx]);
+ page_counter_init(&h_cgroup->hugepage[idx],
+ &parent_h_cgroup->hugepage[idx]);
} else {
root_h_cgroup = h_cgroup;
for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
- res_counter_init(&h_cgroup->hugepage[idx], NULL);
+ page_counter_init(&h_cgroup->hugepage[idx], NULL);
}
return &h_cgroup->css;
}
@@ -108,9 +109,8 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css)
static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg,
struct page *page)
{
- int csize;
- struct res_counter *counter;
- struct res_counter *fail_res;
+ unsigned int nr_pages;
+ struct page_counter *counter;
struct hugetlb_cgroup *page_hcg;
struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);

@@ -123,15 +123,15 @@ static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg,
if (!page_hcg || page_hcg != h_cg)
goto out;

- csize = PAGE_SIZE << compound_order(page);
+ nr_pages = 1 << compound_order(page);
if (!parent) {
parent = root_h_cgroup;
/* root has no limit */
- res_counter_charge_nofail(&parent->hugepage[idx],
- csize, &fail_res);
+ page_counter_charge(&parent->hugepage[idx], nr_pages);
}
counter = &h_cg->hugepage[idx];
- res_counter_uncharge_until(counter, counter->parent, csize);
+ /* Take the pages off the local counter */
+ page_counter_cancel(counter, nr_pages);

set_hugetlb_cgroup(page, parent);
out:
@@ -166,9 +166,8 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup **ptr)
{
int ret = 0;
- struct res_counter *fail_res;
+ struct page_counter *counter;
struct hugetlb_cgroup *h_cg = NULL;
- unsigned long csize = nr_pages * PAGE_SIZE;

if (hugetlb_cgroup_disabled())
goto done;
@@ -187,7 +186,7 @@ again:
}
rcu_read_unlock();

- ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
+ ret = page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter);
css_put(&h_cg->css);
done:
*ptr = h_cg;
@@ -213,7 +212,6 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
struct page *page)
{
struct hugetlb_cgroup *h_cg;
- unsigned long csize = nr_pages * PAGE_SIZE;

if (hugetlb_cgroup_disabled())
return;
@@ -222,61 +220,76 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
if (unlikely(!h_cg))
return;
set_hugetlb_cgroup(page, NULL);
- res_counter_uncharge(&h_cg->hugepage[idx], csize);
+ page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
return;
}

void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg)
{
- unsigned long csize = nr_pages * PAGE_SIZE;
-
if (hugetlb_cgroup_disabled() || !h_cg)
return;

if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
return;

- res_counter_uncharge(&h_cg->hugepage[idx], csize);
+ page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
return;
}

+enum {
+ RES_USAGE,
+ RES_LIMIT,
+ RES_MAX_USAGE,
+ RES_FAILCNT,
+};
+
static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- int idx, name;
+ struct page_counter *counter;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);

- idx = MEMFILE_IDX(cft->private);
- name = MEMFILE_ATTR(cft->private);
+ counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];

- return res_counter_read_u64(&h_cg->hugepage[idx], name);
+ switch (MEMFILE_ATTR(cft->private)) {
+ case RES_USAGE:
+ return (u64)page_counter_read(counter) * PAGE_SIZE;
+ case RES_LIMIT:
+ return (u64)counter->limit * PAGE_SIZE;
+ case RES_MAX_USAGE:
+ return (u64)counter->watermark * PAGE_SIZE;
+ case RES_FAILCNT:
+ return counter->failcnt;
+ default:
+ BUG();
+ }
}

+static DEFINE_MUTEX(hugetlb_limit_mutex);
+
static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
- int idx, name, ret;
- unsigned long long val;
+ int ret, idx;
+ unsigned long nr_pages;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));

+ if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
+ return -EINVAL;
+
buf = strstrip(buf);
+ ret = page_counter_memparse(buf, &nr_pages);
+ if (ret)
+ return ret;
+
idx = MEMFILE_IDX(of_cft(of)->private);
- name = MEMFILE_ATTR(of_cft(of)->private);

- switch (name) {
+ switch (MEMFILE_ATTR(of_cft(of)->private)) {
case RES_LIMIT:
- if (hugetlb_cgroup_is_root(h_cg)) {
- /* Can't set limit on root */
- ret = -EINVAL;
- break;
- }
- /* This function does all necessary parse...reuse it */
- ret = res_counter_memparse_write_strategy(buf, &val);
- if (ret)
- break;
- val = ALIGN(val, 1ULL << huge_page_shift(&hstates[idx]));
- ret = res_counter_set_limit(&h_cg->hugepage[idx], val);
+ mutex_lock(&hugetlb_limit_mutex);
+ ret = page_counter_limit(&h_cg->hugepage[idx], nr_pages);
+ mutex_unlock(&hugetlb_limit_mutex);
break;
default:
ret = -EINVAL;
@@ -288,18 +301,18 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
- int idx, name, ret = 0;
+ int ret = 0;
+ struct page_counter *counter;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));

- idx = MEMFILE_IDX(of_cft(of)->private);
- name = MEMFILE_ATTR(of_cft(of)->private);
+ counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];

- switch (name) {
+ switch (MEMFILE_ATTR(of_cft(of)->private)) {
case RES_MAX_USAGE:
- res_counter_reset_max(&h_cg->hugepage[idx]);
+ page_counter_reset_watermark(counter);
break;
case RES_FAILCNT:
- res_counter_reset_failcnt(&h_cg->hugepage[idx]);
+ counter->failcnt = 0;
break;
default:
ret = -EINVAL;
--
2.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Johannes Weiner
2014-10-14 01:46:03 UTC
Permalink
All memory accounting and limiting has been switched over to the
lockless page counters. Bye, res_counter!

Signed-off-by: Johannes Weiner <***@cmpxchg.org>
Acked-by: Vladimir Davydov <***@parallels.com>
Acked-by: Michal Hocko <***@suse.cz>
---
Documentation/cgroups/resource_counter.txt | 197 -------------------------
include/linux/res_counter.h | 223 -----------------------------
init/Kconfig | 6 -
kernel/Makefile | 1 -
kernel/res_counter.c | 211 ---------------------------
5 files changed, 638 deletions(-)
delete mode 100644 Documentation/cgroups/resource_counter.txt
delete mode 100644 include/linux/res_counter.h
delete mode 100644 kernel/res_counter.c

diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
deleted file mode 100644
index 762ca54eb929..000000000000
--- a/Documentation/cgroups/resource_counter.txt
+++ /dev/null
@@ -1,197 +0,0 @@
-
- The Resource Counter
-
-The resource counter, declared at include/linux/res_counter.h,
-is supposed to facilitate the resource management by controllers
-by providing common stuff for accounting.
-
-This "stuff" includes the res_counter structure and routines
-to work with it.
-
-
-
-1. Crucial parts of the res_counter structure
-
- a. unsigned long long usage
-
- The usage value shows the amount of a resource that is consumed
- by a group at a given time. The units of measurement should be
- determined by the controller that uses this counter. E.g. it can
- be bytes, items or any other unit the controller operates on.
-
- b. unsigned long long max_usage
-
- The maximal value of the usage over time.
-
- This value is useful when gathering statistical information about
- the particular group, as it shows the actual resource requirements
- for a particular group, not just some usage snapshot.
-
- c. unsigned long long limit
-
- The maximal allowed amount of resource to consume by the group. In
- case the group requests for more resources, so that the usage value
- would exceed the limit, the resource allocation is rejected (see
- the next section).
-
- d. unsigned long long failcnt
-
- The failcnt stands for "failures counter". This is the number of
- resource allocation attempts that failed.
-
- c. spinlock_t lock
-
- Protects changes of the above values.
-
-
-
-2. Basic accounting routines
-
- a. void res_counter_init(struct res_counter *rc,
- struct res_counter *rc_parent)
-
- Initializes the resource counter. As usual, should be the first
- routine called for a new counter.
-
- The struct res_counter *parent can be used to define a hierarchical
- child -> parent relationship directly in the res_counter structure,
- NULL can be used to define no relationship.
-
- c. int res_counter_charge(struct res_counter *rc, unsigned long val,
- struct res_counter **limit_fail_at)
-
- When a resource is about to be allocated it has to be accounted
- with the appropriate resource counter (controller should determine
- which one to use on its own). This operation is called "charging".
-
- This is not very important which operation - resource allocation
- or charging - is performed first, but
- * if the allocation is performed first, this may create a
- temporary resource over-usage by the time resource counter is
- charged;
- * if the charging is performed first, then it should be uncharged
- on error path (if the one is called).
-
- If the charging fails and a hierarchical dependency exists, the
- limit_fail_at parameter is set to the particular res_counter element
- where the charging failed.
-
- d. u64 res_counter_uncharge(struct res_counter *rc, unsigned long val)
-
- When a resource is released (freed) it should be de-accounted
- from the resource counter it was accounted to. This is called
- "uncharging". The return value of this function indicate the amount
- of charges still present in the counter.
-
- The _locked routines imply that the res_counter->lock is taken.
-
- e. u64 res_counter_uncharge_until
- (struct res_counter *rc, struct res_counter *top,
- unsigned long val)
-
- Almost same as res_counter_uncharge() but propagation of uncharge
- stops when rc == top. This is useful when kill a res_counter in
- child cgroup.
-
- 2.1 Other accounting routines
-
- There are more routines that may help you with common needs, like
- checking whether the limit is reached or resetting the max_usage
- value. They are all declared in include/linux/res_counter.h.
-
-
-
-3. Analyzing the resource counter registrations
-
- a. If the failcnt value constantly grows, this means that the counter's
- limit is too tight. Either the group is misbehaving and consumes too
- many resources, or the configuration is not suitable for the group
- and the limit should be increased.
-
- b. The max_usage value can be used to quickly tune the group. One may
- set the limits to maximal values and either load the container with
- a common pattern or leave one for a while. After this the max_usage
- value shows the amount of memory the container would require during
- its common activity.
-
- Setting the limit a bit above this value gives a pretty good
- configuration that works in most of the cases.
-
- c. If the max_usage is much less than the limit, but the failcnt value
- is growing, then the group tries to allocate a big chunk of resource
- at once.
-
- d. If the max_usage is much less than the limit, but the failcnt value
- is 0, then this group is given too high limit, that it does not
- require. It is better to lower the limit a bit leaving more resource
- for other groups.
-
-
-
-4. Communication with the control groups subsystem (cgroups)
-
-All the resource controllers that are using cgroups and resource counters
-should provide files (in the cgroup filesystem) to work with the resource
-counter fields. They are recommended to adhere to the following rules:
-
- a. File names
-
- Field name File name
- ---------------------------------------------------
- usage usage_in_<unit_of_measurement>
- max_usage max_usage_in_<unit_of_measurement>
- limit limit_in_<unit_of_measurement>
- failcnt failcnt
- lock no file :)
-
- b. Reading from file should show the corresponding field value in the
- appropriate format.
-
- c. Writing to file
-
- Field Expected behavior
- ----------------------------------
- usage prohibited
- max_usage reset to usage
- limit set the limit
- failcnt reset to zero
-
-
-
-5. Usage example
-
- a. Declare a task group (take a look at cgroups subsystem for this) and
- fold a res_counter into it
-
- struct my_group {
- struct res_counter res;
-
- <other fields>
- }
-
- b. Put hooks in resource allocation/release paths
-
- int alloc_something(...)
- {
- if (res_counter_charge(res_counter_ptr, amount) < 0)
- return -ENOMEM;
-
- <allocate the resource and return to the caller>
- }
-
- void release_something(...)
- {
- res_counter_uncharge(res_counter_ptr, amount);
-
- <release the resource>
- }
-
- In order to keep the usage value self-consistent, both the
- "res_counter_ptr" and the "amount" in release_something() should be
- the same as they were in the alloc_something() when the releasing
- resource was allocated.
-
- c. Provide the way to read res_counter values and set them (the cgroups
- still can help with it).
-
- c. Compile and run :)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
deleted file mode 100644
index 56b7bc32db4f..000000000000
--- a/include/linux/res_counter.h
+++ /dev/null
@@ -1,223 +0,0 @@
-#ifndef __RES_COUNTER_H__
-#define __RES_COUNTER_H__
-
-/*
- * Resource Counters
- * Contain common data types and routines for resource accounting
- *
- * Copyright 2007 OpenVZ SWsoft Inc
- *
- * Author: Pavel Emelianov <***@openvz.org>
- *
- * See Documentation/cgroups/resource_counter.txt for more
- * info about what this counter is.
- */
-
-#include <linux/spinlock.h>
-#include <linux/errno.h>
-
-/*
- * The core object. the cgroup that wishes to account for some
- * resource may include this counter into its structures and use
- * the helpers described beyond
- */
-
-struct res_counter {
- /*
- * the current resource consumption level
- */
- unsigned long long usage;
- /*
- * the maximal value of the usage from the counter creation
- */
- unsigned long long max_usage;
- /*
- * the limit that usage cannot exceed
- */
- unsigned long long limit;
- /*
- * the limit that usage can be exceed
- */
- unsigned long long soft_limit;
- /*
- * the number of unsuccessful attempts to consume the resource
- */
- unsigned long long failcnt;
- /*
- * the lock to protect all of the above.
- * the routines below consider this to be IRQ-safe
- */
- spinlock_t lock;
- /*
- * Parent counter, used for hierarchial resource accounting
- */
- struct res_counter *parent;
-};
-
-#define RES_COUNTER_MAX ULLONG_MAX
-
-/**
- * Helpers to interact with userspace
- * res_counter_read_u64() - returns the value of the specified member.
- * res_counter_read/_write - put/get the specified fields from the
- * res_counter struct to/from the user
- *
- * @counter: the counter in question
- * @member: the field to work with (see RES_xxx below)
- * @buf: the buffer to opeate on,...
- * @nbytes: its size...
- * @pos: and the offset.
- */
-
-u64 res_counter_read_u64(struct res_counter *counter, int member);
-
-ssize_t res_counter_read(struct res_counter *counter, int member,
- const char __user *buf, size_t nbytes, loff_t *pos,
- int (*read_strategy)(unsigned long long val, char *s));
-
-int res_counter_memparse_write_strategy(const char *buf,
- unsigned long long *res);
-
-/*
- * the field descriptors. one for each member of res_counter
- */
-
-enum {
- RES_USAGE,
- RES_MAX_USAGE,
- RES_LIMIT,
- RES_FAILCNT,
- RES_SOFT_LIMIT,
-};
-
-/*
- * helpers for accounting
- */
-
-void res_counter_init(struct res_counter *counter, struct res_counter *parent);
-
-/*
- * charge - try to consume more resource.
- *
- * @counter: the counter
- * @val: the amount of the resource. each controller defines its own
- * units, e.g. numbers, bytes, Kbytes, etc
- *
- * returns 0 on success and <0 if the counter->usage will exceed the
- * counter->limit
- *
- * charge_nofail works the same, except that it charges the resource
- * counter unconditionally, and returns < 0 if the after the current
- * charge we are over limit.
- */
-
-int __must_check res_counter_charge(struct res_counter *counter,
- unsigned long val, struct res_counter **limit_fail_at);
-int res_counter_charge_nofail(struct res_counter *counter,
- unsigned long val, struct res_counter **limit_fail_at);
-
-/*
- * uncharge - tell that some portion of the resource is released
- *
- * @counter: the counter
- * @val: the amount of the resource
- *
- * these calls check for usage underflow and show a warning on the console
- *
- * returns the total charges still present in @counter.
- */
-
-u64 res_counter_uncharge(struct res_counter *counter, unsigned long val);
-
-u64 res_counter_uncharge_until(struct res_counter *counter,
- struct res_counter *top,
- unsigned long val);
-/**
- * res_counter_margin - calculate chargeable space of a counter
- * @cnt: the counter
- *
- * Returns the difference between the hard limit and the current usage
- * of resource counter @cnt.
- */
-static inline unsigned long long res_counter_margin(struct res_counter *cnt)
-{
- unsigned long long margin;
- unsigned long flags;
-
- spin_lock_irqsave(&cnt->lock, flags);
- if (cnt->limit > cnt->usage)
- margin = cnt->limit - cnt->usage;
- else
- margin = 0;
- spin_unlock_irqrestore(&cnt->lock, flags);
- return margin;
-}
-
-/**
- * Get the difference between the usage and the soft limit
- * @cnt: The counter
- *
- * Returns 0 if usage is less than or equal to soft limit
- * The difference between usage and soft limit, otherwise.
- */
-static inline unsigned long long
-res_counter_soft_limit_excess(struct res_counter *cnt)
-{
- unsigned long long excess;
- unsigned long flags;
-
- spin_lock_irqsave(&cnt->lock, flags);
- if (cnt->usage <= cnt->soft_limit)
- excess = 0;
- else
- excess = cnt->usage - cnt->soft_limit;
- spin_unlock_irqrestore(&cnt->lock, flags);
- return excess;
-}
-
-static inline void res_counter_reset_max(struct res_counter *cnt)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&cnt->lock, flags);
- cnt->max_usage = cnt->usage;
- spin_unlock_irqrestore(&cnt->lock, flags);
-}
-
-static inline void res_counter_reset_failcnt(struct res_counter *cnt)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&cnt->lock, flags);
- cnt->failcnt = 0;
- spin_unlock_irqrestore(&cnt->lock, flags);
-}
-
-static inline int res_counter_set_limit(struct res_counter *cnt,
- unsigned long long limit)
-{
- unsigned long flags;
- int ret = -EBUSY;
-
- spin_lock_irqsave(&cnt->lock, flags);
- if (cnt->usage <= limit) {
- cnt->limit = limit;
- ret = 0;
- }
- spin_unlock_irqrestore(&cnt->lock, flags);
- return ret;
-}
-
-static inline int
-res_counter_set_soft_limit(struct res_counter *cnt,
- unsigned long long soft_limit)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&cnt->lock, flags);
- cnt->soft_limit = soft_limit;
- spin_unlock_irqrestore(&cnt->lock, flags);
- return 0;
-}
-
-#endif
diff --git a/init/Kconfig b/init/Kconfig
index d07a1c78d4e7..2229611528db 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -979,12 +979,6 @@ config CGROUP_CPUACCT
Provides a simple Resource Controller for monitoring the
total CPU consumed by the tasks in a cgroup.

-config RESOURCE_COUNTERS
- bool "Resource counters"
- help
- This option enables controller independent resource accounting
- infrastructure that works with cgroups.
-
config PAGE_COUNTER
bool

diff --git a/kernel/Makefile b/kernel/Makefile
index 726e18443da0..245953354974 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -58,7 +58,6 @@ obj-$(CONFIG_USER_NS) += user_namespace.o
obj-$(CONFIG_PID_NS) += pid_namespace.o
obj-$(CONFIG_DEBUG_SYNCHRO_TEST) += synchro-test.o
obj-$(CONFIG_IKCONFIG) += configs.o
-obj-$(CONFIG_RESOURCE_COUNTERS) += res_counter.o
obj-$(CONFIG_SMP) += stop_machine.o
obj-$(CONFIG_KPROBES_SANITY_TEST) += test_kprobes.o
obj-$(CONFIG_AUDIT) += audit.o auditfilter.o
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
deleted file mode 100644
index e791130f85a7..000000000000
--- a/kernel/res_counter.c
+++ /dev/null
@@ -1,211 +0,0 @@
-/*
- * resource cgroups
- *
- * Copyright 2007 OpenVZ SWsoft Inc
- *
- * Author: Pavel Emelianov <***@openvz.org>
- *
- */
-
-#include <linux/types.h>
-#include <linux/parser.h>
-#include <linux/fs.h>
-#include <linux/res_counter.h>
-#include <linux/uaccess.h>
-#include <linux/mm.h>
-
-void res_counter_init(struct res_counter *counter, struct res_counter *parent)
-{
- spin_lock_init(&counter->lock);
- counter->limit = RES_COUNTER_MAX;
- counter->soft_limit = RES_COUNTER_MAX;
- counter->parent = parent;
-}
-
-static u64 res_counter_uncharge_locked(struct res_counter *counter,
- unsigned long val)
-{
- if (WARN_ON(counter->usage < val))
- val = counter->usage;
-
- counter->usage -= val;
- return counter->usage;
-}
-
-static int res_counter_charge_locked(struct res_counter *counter,
- unsigned long val, bool force)
-{
- int ret = 0;
-
- if (counter->usage + val > counter->limit) {
- counter->failcnt++;
- ret = -ENOMEM;
- if (!force)
- return ret;
- }
-
- counter->usage += val;
- if (counter->usage > counter->max_usage)
- counter->max_usage = counter->usage;
- return ret;
-}
-
-static int __res_counter_charge(struct res_counter *counter, unsigned long val,
- struct res_counter **limit_fail_at, bool force)
-{
- int ret, r;
- unsigned long flags;
- struct res_counter *c, *u;
-
- r = ret = 0;
- *limit_fail_at = NULL;
- local_irq_save(flags);
- for (c = counter; c != NULL; c = c->parent) {
- spin_lock(&c->lock);
- r = res_counter_charge_locked(c, val, force);
- spin_unlock(&c->lock);
- if (r < 0 && !ret) {
- ret = r;
- *limit_fail_at = c;
- if (!force)
- break;
- }
- }
-
- if (ret < 0 && !force) {
- for (u = counter; u != c; u = u->parent) {
- spin_lock(&u->lock);
- res_counter_uncharge_locked(u, val);
- spin_unlock(&u->lock);
- }
- }
- local_irq_restore(flags);
-
- return ret;
-}
-
-int res_counter_charge(struct res_counter *counter, unsigned long val,
- struct res_counter **limit_fail_at)
-{
- return __res_counter_charge(counter, val, limit_fail_at, false);
-}
-
-int res_counter_charge_nofail(struct res_counter *counter, unsigned long val,
- struct res_counter **limit_fail_at)
-{
- return __res_counter_charge(counter, val, limit_fail_at, true);
-}
-
-u64 res_counter_uncharge_until(struct res_counter *counter,
- struct res_counter *top,
- unsigned long val)
-{
- unsigned long flags;
- struct res_counter *c;
- u64 ret = 0;
-
- local_irq_save(flags);
- for (c = counter; c != top; c = c->parent) {
- u64 r;
- spin_lock(&c->lock);
- r = res_counter_uncharge_locked(c, val);
- if (c == counter)
- ret = r;
- spin_unlock(&c->lock);
- }
- local_irq_restore(flags);
- return ret;
-}
-
-u64 res_counter_uncharge(struct res_counter *counter, unsigned long val)
-{
- return res_counter_uncharge_until(counter, NULL, val);
-}
-
-static inline unsigned long long *
-res_counter_member(struct res_counter *counter, int member)
-{
- switch (member) {
- case RES_USAGE:
- return &counter->usage;
- case RES_MAX_USAGE:
- return &counter->max_usage;
- case RES_LIMIT:
- return &counter->limit;
- case RES_FAILCNT:
- return &counter->failcnt;
- case RES_SOFT_LIMIT:
- return &counter->soft_limit;
- };
-
- BUG();
- return NULL;
-}
-
-ssize_t res_counter_read(struct res_counter *counter, int member,
- const char __user *userbuf, size_t nbytes, loff_t *pos,
- int (*read_strategy)(unsigned long long val, char *st_buf))
-{
- unsigned long long *val;
- char buf[64], *s;
-
- s = buf;
- val = res_counter_member(counter, member);
- if (read_strategy)
- s += read_strategy(*val, s);
- else
- s += sprintf(s, "%llu\n", *val);
- return simple_read_from_buffer((void __user *)userbuf, nbytes,
- pos, buf, s - buf);
-}
-
-#if BITS_PER_LONG == 32
-u64 res_counter_read_u64(struct res_counter *counter, int member)
-{
- unsigned long flags;
- u64 ret;
-
- spin_lock_irqsave(&counter->lock, flags);
- ret = *res_counter_member(counter, member);
- spin_unlock_irqrestore(&counter->lock, flags);
-
- return ret;
-}
-#else
-u64 res_counter_read_u64(struct res_counter *counter, int member)
-{
- return *res_counter_member(counter, member);
-}
-#endif
-
-int res_counter_memparse_write_strategy(const char *buf,
- unsigned long long *resp)
-{
- char *end;
- unsigned long long res;
-
- /* return RES_COUNTER_MAX(unlimited) if "-1" is specified */
- if (*buf == '-') {
- int rc = kstrtoull(buf + 1, 10, &res);
-
- if (rc)
- return rc;
- if (res != 1)
- return -EINVAL;
- *resp = RES_COUNTER_MAX;
- return 0;
- }
-
- res = memparse(buf, &end);
- if (*end != '\0')
- return -EINVAL;
-
- if (PAGE_ALIGN(res) >= res)
- res = PAGE_ALIGN(res);
- else
- res = RES_COUNTER_MAX;
-
- *resp = res;
-
- return 0;
-}
--
2.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Paul Bolle
2014-10-16 07:20:34 UTC
Permalink
Post by Johannes Weiner
All memory accounting and limiting has been switched over to the
lockless page counters. Bye, res_counter!
This patch landed in today's linux-next (ie, next 20141016).
Post by Johannes Weiner
Documentation/cgroups/resource_counter.txt | 197 -------------------------
include/linux/res_counter.h | 223 -----------------------------
init/Kconfig | 6 -
kernel/Makefile | 1 -
kernel/res_counter.c | 211 ---------------------------
5 files changed, 638 deletions(-)
delete mode 100644 Documentation/cgroups/resource_counter.txt
delete mode 100644 include/linux/res_counter.h
delete mode 100644 kernel/res_counter.c
There's a last reference to CONFIG_RESOURCE_COUNTERS in
Documentation/cgroups/memory.txt. That reference could be dropped too,
couldn't it?


Paul Bolle

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Michal Hocko
2014-10-16 11:20:21 UTC
Permalink
Post by Paul Bolle
Post by Johannes Weiner
All memory accounting and limiting has been switched over to the
lockless page counters. Bye, res_counter!
This patch landed in today's linux-next (ie, next 20141016).
Post by Johannes Weiner
Documentation/cgroups/resource_counter.txt | 197 -------------------------
include/linux/res_counter.h | 223 -----------------------------
init/Kconfig | 6 -
kernel/Makefile | 1 -
kernel/res_counter.c | 211 ---------------------------
5 files changed, 638 deletions(-)
delete mode 100644 Documentation/cgroups/resource_counter.txt
delete mode 100644 include/linux/res_counter.h
delete mode 100644 kernel/res_counter.c
There's a last reference to CONFIG_RESOURCE_COUNTERS in
Documentation/cgroups/memory.txt. That reference could be dropped too,
couldn't it?
---
From a54e375e85c814199f480cb4ee7a133a395c5a00 Mon Sep 17 00:00:00 2001
From: Michal Hocko <***@suse.cz>
Date: Thu, 16 Oct 2014 13:15:24 +0200
Subject: [PATCH] kernel-res_counter-remove-the-unused-api-fix

ditch the last remainings of res_counter

Reported-by: Paul Bolle <***@tiscali.nl>
Signed-off-by: Michal Hocko <***@suse.cz>
---
Documentation/cgroups/memory.txt | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index f624727ab404..67613ff0270c 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -116,16 +116,16 @@ The memory controller is the first controller developed.

2.1. Design

-The core of the design is a counter called the res_counter. The res_counter
-tracks the current memory usage and limit of the group of processes associated
-with the controller. Each cgroup has a memory controller specific data
-structure (mem_cgroup) associated with it.
+The core of the design is a counter called the page_counter. The
+page_counter tracks the current memory usage and limit of the group of
+processes associated with the controller. Each cgroup has a memory controller
+specific data structure (mem_cgroup) associated with it.

2.2. Accounting

+--------------------+
- | mem_cgroup |
- | (res_counter) |
+ | mem_cgroup |
+ | (page_counter) |
+--------------------+
/ ^ \
/ | \
@@ -352,9 +352,8 @@ set:
0. Configuration

a. Enable CONFIG_CGROUPS
-b. Enable CONFIG_RESOURCE_COUNTERS
-c. Enable CONFIG_MEMCG
-d. Enable CONFIG_MEMCG_SWAP (to use swap extension)
+b. Enable CONFIG_MEMCG
+c. Enable CONFIG_MEMCG_SWAP (to use swap extension)
d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)

1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
--
2.1.1
--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Johannes Weiner
2014-10-16 14:46:41 UTC
Permalink
Post by Michal Hocko
Post by Paul Bolle
Post by Johannes Weiner
All memory accounting and limiting has been switched over to the
lockless page counters. Bye, res_counter!
This patch landed in today's linux-next (ie, next 20141016).
Post by Johannes Weiner
Documentation/cgroups/resource_counter.txt | 197 -------------------------
include/linux/res_counter.h | 223 -----------------------------
init/Kconfig | 6 -
kernel/Makefile | 1 -
kernel/res_counter.c | 211 ---------------------------
5 files changed, 638 deletions(-)
delete mode 100644 Documentation/cgroups/resource_counter.txt
delete mode 100644 include/linux/res_counter.h
delete mode 100644 kernel/res_counter.c
There's a last reference to CONFIG_RESOURCE_COUNTERS in
Documentation/cgroups/memory.txt. That reference could be dropped too,
couldn't it?
---
From a54e375e85c814199f480cb4ee7a133a395c5a00 Mon Sep 17 00:00:00 2001
Date: Thu, 16 Oct 2014 13:15:24 +0200
Subject: [PATCH] kernel-res_counter-remove-the-unused-api-fix
ditch the last remainings of res_counter
That makes sense, although that document is still littered with
out-of-date and seemingly irrelevant information, which is why I
didn't bother to update it. But your patch is still an improvement,
we should merge it.

Thanks!
Michal Hocko
2014-10-16 15:05:53 UTC
Permalink
Post by Johannes Weiner
Post by Michal Hocko
Post by Paul Bolle
Post by Johannes Weiner
All memory accounting and limiting has been switched over to the
lockless page counters. Bye, res_counter!
This patch landed in today's linux-next (ie, next 20141016).
Post by Johannes Weiner
Documentation/cgroups/resource_counter.txt | 197 -------------------------
include/linux/res_counter.h | 223 -----------------------------
init/Kconfig | 6 -
kernel/Makefile | 1 -
kernel/res_counter.c | 211 ---------------------------
5 files changed, 638 deletions(-)
delete mode 100644 Documentation/cgroups/resource_counter.txt
delete mode 100644 include/linux/res_counter.h
delete mode 100644 kernel/res_counter.c
There's a last reference to CONFIG_RESOURCE_COUNTERS in
Documentation/cgroups/memory.txt. That reference could be dropped too,
couldn't it?
---
From a54e375e85c814199f480cb4ee7a133a395c5a00 Mon Sep 17 00:00:00 2001
Date: Thu, 16 Oct 2014 13:15:24 +0200
Subject: [PATCH] kernel-res_counter-remove-the-unused-api-fix
ditch the last remainings of res_counter
That makes sense, although that document is still littered with
out-of-date and seemingly irrelevant information, which is why I
didn't bother to update it.
I would also prefer to re-write or remove it long term.
--
Michal Hocko
SUSE Labs
Loading...