author    Danny Mitzel <mitzel@google.com>    2019-08-01 11:47:01 -0700
committer chrome-bot <chrome-bot@chromium.org>    2019-08-19 18:05:41 -0700
commit    bf81b38a6a104561aebe84b4a7b30502a4e68b8f (patch)
tree      755f312a2843e670eb52009f264a02e12633f780
parent    8861537f2f8a7cf070d6b9d9eba82140ed507a38 (diff)
CHROMIUM: net: sched: Adaptive Rate Limiting Qdisc
Adaptive Rate Limiting Qdisc (ARL) is designed for home routers to eliminate bufferbloat at the upstream CPE (Cable/DSL) modem. It prevents a standing queue from forming at the upstream node by limiting egress throughput to match the available upstream bandwidth. ARL does not use a statically preconfigured limit for its rate limiting. Instead, it passively measures the latency of the TCP connections on its egress link in real time and throttles the rate when the minimum latency across all connections is seen to increase, which indicates the presence of an excessive queue at the upstream device. ARL can be applied as the root qdisc of the WAN interface to prevent upstream bufferbloat at the CPE modem. The queue is then managed locally at the router by applying another qdisc, such as fq_codel, as a child qdisc.

This is a port from chromeos-3.18 to chromeos-4.14-gw of CL:1121778 (ported from commit cee41c3dd7f50e0605368402c1625863ba278216).

Note changes introduced as part of the Linux 3.18 -> 4.14 porting...

* Removed usage of nla_put_u64() [deprecated by Linux commit 5022524308c6 ("netlink: kill nla_put_u64()")], replaced with nla_put_u64_64bit() [introduced by Linux commit 73520786b079 ("libnl: add nla_put_u64_64bit() helper")], which forces 64-bit alignment of the attribute. This also required definition of the TCA_ARL_PAD attribute used as the alignment pad.
* Removed usage of the out-of-line struct ewma [deprecated by Linux commit f4e774f55fe0 ("average: remove out-of-line implementation")], replaced with the static/inline macro [introduced by Linux commit 2377799c084d ("average: provide macro to create static EWMA")] and adapted to the change of the 'factor' parameter to 'precision' [introduced by Linux commit eb1e011a1474 ("average: change to declare precision, not factor")]. The old 'weight' parameter of 8 remains unchanged, just renamed to 'weight_rcp'. The old 'factor' parameter of 8 translates to 3 bits of 'precision'.
* Adapted to changes in the kernel nanosecond timer APIs: ktime (and ktime_t) is now a simple signed 64-bit value rather than a union type [Linux commit 2456e8553544 ("ktime: Get rid of the union")].
* The 'zone' parameter passed into nf_conntrack_find_get() is no longer an integer identifier [Linux commit 308ac9143ee2 ("netfilter: nf_conntrack: push zone object into functions")]; a pointer to the 'nf_ct_zone_dflt' global const object is now required.
* Removed all references to the NF_CT_ASSERT() macro. The nf_conn 'timeout' field it checked was never actually referenced within the ARL code, the 'timeout.data' pointer was eliminated by the conntrack change that switched from a kernel timer to a simple timestamp [Linux commit f330a7fdbe16 ("netfilter: conntrack: get rid of conntrack timer")], and the macro itself has now been removed [Linux commit 9efdb14f76f4 ("net: Remove CONFIG_NETFILTER_DEBUG and _ASSERT() macros.")].
* Removed the nf_ct_is_untracked() check; the UNTRACKED state was deprecated [Linux commit cc41c84b7e7f ("netfilter: kill the fake untracked conntrack objects")] and the API was then completely removed [Linux commit ab8bc7ed864b ("netfilter: remove nf_ct_is_untracked")].
* Adapted to changes in Qdisc_class_ops:
  - graft(): Replaced the function body with a call to qdisc_replace(); this appears to be what all the supported qdiscs have migrated to.
  - get()/put(): These APIs were deprecated [Linux commit 143976ce992f ("net_sched: remove tc class reference counting")]; the new find() API was introduced and is basically identical to the old get().
* Adapted to changes in Qdisc_ops:
  - enqueue(): Accepts the new 'to_free' packet list, and uses qdisc_drop() rather than qdisc_reshape_fail() when a packet cannot be forwarded. The gso_segment() helper function now also utilizes the 'to_free' packet list and qdisc_tree_reduce_backlog() rather than qdisc_tree_decrease_qlen() to update queue length accounting.
  - dequeue(): Removed the call to qdisc_unthrottled(); generic throttled management was deprecated [Linux commit 45f50bed1d80 ("net_sched: remove generic throttled management")].
  - drop(): Deprecated [Linux commit a09ceb0e0814 ("sched: remove qdisc->drop")].
  - change(): Added a call to qdisc_hash_add() after the default FIFO child qdisc is created, and when adjusting queue length accounting utilize qdisc_tree_reduce_backlog() rather than qdisc_tree_decrease_qlen(). This is based on coding practices observed in other supported qdiscs (e.g. sch_tbf, sch_prio).

Note changes introduced as part of 64-bit porting...

* Added an explicit (unsigned long) cast for the first parameter of clamp() when 'lo'/'hi' are 'UL' constants. clamp() strongly enforces type equality at compile time. On a 64-bit architecture the first parameter (2 * minmax_get() / 1000L) (i.e. int * u32 * long) results in 'signed long', which doesn't match the 'unsigned long' type of the second parameter.

BUG=b:138685581
TEST=Latency test with heavy upstream traffic.

Cq-Depend: chromium:1727770, chromium:1730603
Change-Id: I727e4737ba45b8136e70643f159534cf7d06739e
Signed-off-by: Danny J. Mitzel <mitzel@google.com>
Reviewed-on: https://chromium-review.googlesource.com/1731583
Tested-by: Danny Mitzel <mitzel@google.com>
Commit-Ready: Danny Mitzel <mitzel@google.com>
Legacy-Commit-Queue: Commit Bot <commit-bot@chromium.org>
Reviewed-by: Kan Yan <kyan@chromium.org>
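Illustration (not part of the patch): the adaptive throttling described above is driven per phase by the formula rate = base_rate + rate_delta * arl_rate_tbl[mode][phase], with a four-phase cycle per mode. The following minimal userspace C sketch reproduces that schedule using the mode/phase table copied from sch_arl.c in the diff below; the base_rate and rate_delta values are hypothetical and chosen only to show the resulting rates.

/* Illustrative userspace sketch, not part of the patch.
 * The table mirrors arl_rate_tbl[] in sch_arl.c; the rates are made up and
 * only demonstrate the per-phase schedule each mode produces.
 */
#include <stdio.h>

#define ARL_CYCLE_LEN 4 /* number of phases in a cycle, as in sch_arl.h */

enum arl_mode {
        ARL_STABLE = 0,
        ARL_DRAIN,
        ARL_BW_PROBE,
        ARL_LATENCY_PROBE,
        ARL_UNTHROTTLED
};

static const int arl_rate_tbl[][ARL_CYCLE_LEN] = {
        {0, 0, 0, 0},    /* STABLE: hold the base rate */
        {-1, -1, -1, 0}, /* DRAIN: run below base rate to drain the queue */
        {1, 0, 1, 0},    /* BW_PROBE: probe above the base rate */
        {-1, -1, -1, 0}, /* LATENCY_PROBE: probe below the base rate */
        {0, 0, 0, 0},    /* UNTHROTTLED: shaper bypassed */
};

int main(void)
{
        static const char *names[] = {
                "STABLE", "DRAIN", "BW_PROBE", "LATENCY_PROBE", "UNTHROTTLED"
        };
        long base_rate = 12500;          /* hypothetical, in KB/s */
        long rate_delta = base_rate / 3; /* DRAIN uses ~33% of base_rate */
        int mode, phase;

        for (mode = ARL_STABLE; mode <= ARL_UNTHROTTLED; mode++) {
                printf("%-14s:", names[mode]);
                for (phase = 0; phase < ARL_CYCLE_LEN; phase++)
                        printf(" %ld", base_rate +
                               rate_delta * arl_rate_tbl[mode][phase]);
                printf("   (KB/s per phase)\n");
        }
        return 0;
}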
-rw-r--r--  include/linux/netfilter/nf_conntrack_tcp.h    12
-rw-r--r--  include/uapi/linux/pkt_sched.h                 21
-rw-r--r--  net/sched/Kconfig                              11
-rw-r--r--  net/sched/Makefile                              1
-rw-r--r--  net/sched/sch_arl.c                          1343
-rw-r--r--  net/sched/sch_arl.h                           121
6 files changed, 1509 insertions, 0 deletions
diff --git a/include/linux/netfilter/nf_conntrack_tcp.h b/include/linux/netfilter/nf_conntrack_tcp.h
index f9e3a663037b..cffc8ed9ba6e 100644
--- a/include/linux/netfilter/nf_conntrack_tcp.h
+++ b/include/linux/netfilter/nf_conntrack_tcp.h
@@ -14,6 +14,14 @@ struct ip_ct_tcp_state {
u_int8_t flags; /* per direction options */
};
+struct tcp_latency_sample {
+ u_int64_t send_ts;
+ atomic_t sampling_state;
+ u_int32_t last_seq;
+ u_int32_t last_hrtt;
+ u_int32_t s_hrtt_us; /* smoothed hrtt in us */
+};
+
struct ip_ct_tcp {
struct ip_ct_tcp_state seen[2]; /* connection parameters per direction */
u_int8_t state; /* state of the connection (enum tcp_conntrack) */
@@ -28,6 +36,10 @@ struct ip_ct_tcp {
/* For SYN packets while we may be out-of-sync */
u_int8_t last_wscale; /* Last window scaling factor seen */
u_int8_t last_flags; /* Last flags set */
+#if IS_ENABLED(CONFIG_NET_SCH_ARL)
+ /* Track TCP stream's latency */
+ struct tcp_latency_sample latency_sample;
+#endif
};
#endif /* _NF_CONNTRACK_TCP_H */
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 703cd9df6cef..ce1958f4dd64 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -186,6 +186,27 @@ enum {
#define TCA_TBF_MAX (__TCA_TBF_MAX - 1)
+/* ARL section */
+struct tc_arl_xstats {
+ __u32 max_bw; /* The maximum bw measured */
+ __u32 min_rate; /* The lowest base rate */
+ __u32 current_rate; /* The current base rate */
+ __u32 latency; /* The current latency */
+};
+
+enum {
+ TCA_ARL_UNSPEC,
+ TCA_ARL_BUFFER,
+ TCA_ARL_MIN_RATE,
+ TCA_ARL_MAX_BW,
+ TCA_ARL_LIMIT,
+ TCA_ARL_MAX_LATENCY,
+ TCA_ARL_LATENCY_HYSTERESIS,
+ TCA_ARL_PAD,
+ __TCA_ARL_MAX,
+};
+#define TCA_ARL_MAX (__TCA_ARL_MAX - 1)
+
/* TEQL section */
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index e70ed26485a2..5a1edaf2c11d 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -352,6 +352,17 @@ config NET_SCH_PLUG
To compile this code as a module, choose M here: the
module will be called sch_plug.
+config NET_SCH_ARL
+ tristate "Adaptive Rate Limiting (ARL) scheduler"
+ help
+ Say Y here if you want to use the Adaptive Rate Limiting (ARL)
+ packet scheduling algorithm.
+
+ To compile this driver as a module, choose M here: the module
+ will be called sch_arl.
+
+ If unsure, say N.
+
menuconfig NET_SCH_DEFAULT
bool "Allow override default queue discipline"
---help---
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 9e43a4721ef8..9dbd47a59ae8 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_NET_SCH_FQ_CODEL) += sch_fq_codel.o
obj-$(CONFIG_NET_SCH_FQ) += sch_fq.o
obj-$(CONFIG_NET_SCH_HHF) += sch_hhf.o
obj-$(CONFIG_NET_SCH_PIE) += sch_pie.o
+obj-$(CONFIG_NET_SCH_ARL) += sch_arl.o
obj-$(CONFIG_NET_CLS_U32) += cls_u32.o
obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o
diff --git a/net/sched/sch_arl.c b/net/sched/sch_arl.c
new file mode 100644
index 000000000000..f4e753dd638f
--- /dev/null
+++ b/net/sched/sch_arl.c
@@ -0,0 +1,1343 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Adaptive Rate Limiting Qdisc (ARL) is designed for home routers to eliminate
+ * bufferbloat at the upstream CPE (Cable/DSL) modem. It prevents a bloated
+ * queue from forming at the upstream CPE modem by rate limiting egress
+ * throughput to match the available bandwidth. Instead of using a
+ * preconfigured static rate limit, it automatically figures out the available
+ * upstream bandwidth and adjusts the rate limit in real time by continuously
+ * and passively monitoring latency. There are two sources of latency
+ * measurement: the RTT from the kernel's TCP/IP stack, and the half-path RTT
+ * measured from routed TCP streams. The minimum latency across all flows is
+ * used as the indication of bufferbloat at the upstream CPE modem, because
+ * that is the common path for all flows. ARL adjusts the rate limit
+ * dynamically based on the minimum latency. If the throughput is less than
+ * the available bandwidth, there will be no queue buildup at the CPE device,
+ * so the minimum latency should stay flat. On the other hand, a spike in the
+ * minimum latency suggests a bloated queue in the upstream CPE modem,
+ * indicating the current rate limit exceeds the available bandwidth. In that
+ * case, ARL drains the queue and reduces the rate limit. ARL can be applied
+ * as the root qdisc of the WAN interface to prevent upstream bufferbloat at
+ * the CPE modem. The queue is then managed locally at the router by applying
+ * another qdisc, such as fq_codel, as the child qdisc.
+ *
+ * The passive latency measurement method for routed TCP stream is inspired by:
+ * Kathleen Nichols, "Listening to Networks",
+ * http://netseminar.stanford.edu/seminars/02_02_17.pdf
+ *
+ * The rate shaping and some utility functions are copied from:
+ * net/sched/sch_tbf.c
+ * Author: Kan Yan <kyan@google.com>
+ */
+
+#include <linux/average.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/string.h>
+#include <linux/tcp.h>
+#include <linux/types.h>
+#include <linux/win_minmax.h>
+#include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_zones.h>
+#include <net/netfilter/nf_conntrack_core.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <net/sch_generic.h>
+#include <net/tcp.h>
+
+#include "sch_arl.h"
+
+#define ARL_SCALE 7 /* 128B per sec, approximately 1kbps */
+#define ARL_BW_UNIT BIT(7) /* 128B per sec, approximately 1kbps */
+
+/* High gain to exponentially increase bw. Double the BW in 20 cycles */
+static const int ARL_HIGH_GAIN = ARL_BW_UNIT * 1035 / 1000;
+/* Drain gain: halve the rate in two cycles */
+static const int ARL_DRAIN_GAIN = ARL_BW_UNIT * 707 / 1000;
+
+static bool arl_latency_sampling_enanbled;
+static int arl_dev_index = -1;
+
+/* The rate for each phase is:
+ * base_rate + rate_delta * arl_rate_tbl[mode][phase]
+ */
+static const int arl_rate_tbl[][ARL_CYCLE_LEN] = {
+ {0, 0, 0, 0}, /* STABLE */
+ {-1, -1, -1, 0}, /* DRAIN */
+ {1, 0, 1, 0}, /* BW_PROBE */
+ {-1, -1, -1, 0}, /* LATENCY_PROBE */
+ {0, 0, 0, 0}, /* UNTHROTTLED */
+ };
+
+static void arl_bw_est_reset(struct arl_sched_data *q)
+{
+ q->vars.bw_est_start_t = jiffies;
+ q->vars.bw_est_bytes_sent = 0;
+}
+
+static void arl_update_bw_estimate(struct arl_sched_data *q)
+{
+ struct arl_vars *vars = &q->vars;
+ unsigned long now = jiffies, bw_avg;
+
+ if (!time_after(now, (vars->bw_est_start_t
+ + msecs_to_jiffies(q->vars.phase_dur) - 1)))
+ return;
+
+ vars->last_bw_est = vars->bw_est;
+ vars->bw_est = div_u64(vars->bw_est_bytes_sent * HZ,
+ (now - vars->bw_est_start_t)*1000);
+
+ ewma_arl_bw_avg_add(&vars->bw_avg, vars->bw_est);
+ bw_avg = ewma_arl_bw_avg_read(&vars->bw_avg);
+
+ minmax_running_max(&vars->max_bw, msecs_to_jiffies(ARL_LT_WIN), now,
+ bw_avg);
+ if (bw_avg > q->stats.max_bw)
+ q->stats.max_bw = bw_avg;
+ arl_bw_est_reset(q);
+}
+
+static bool arl_is_latency_high(struct arl_sched_data *q)
+{
+ u32 lt_min_hrtt = minmax_get(&q->vars.lt_min_hrtt);
+
+ if ((minmax_get(&q->vars.st_min_hrtt) < (lt_min_hrtt +
+ q->params.latency_hysteresis)) &&
+ (minmax_get(&q->vars.min_hrtt) < q->params.max_latency))
+ return false;
+ else
+ return true;
+}
+
+/* Check if the bandwidth is fully used.
+ * Return true if the measured throughput is above ~92% of the configured rate.
+ */
+static bool arl_is_bw_full(struct arl_sched_data *q)
+{
+ u32 rate = q->vars.base_rate;
+
+ return (q->vars.bw_est > (rate - ((rate * 10) >> ARL_SCALE))) ? true :
+ false;
+}
+
+static bool arl_check_drain(struct arl_sched_data *q)
+{
+ if (q->vars.mode != ARL_LATENCY_PROBE)
+ return false;
+
+ if (minmax_get(&q->vars.st_min_hrtt) < ARL_LOW_LATENCY)
+ return false;
+
+ if (!arl_is_bw_full(q))
+ return false;
+
+ if (minmax_get(&q->vars.min_hrtt) > q->params.max_latency)
+ return true;
+
+ if (ktime_ms_delta(q->vars.phase_start_t, q->vars.last_drain_t)
+ > ARL_DRAIN_INTERVAL)
+ return true;
+ else
+ return false;
+}
+
+static void arl_apply_new_rate(struct arl_sched_data *q, u64 next_rate)
+{
+ u32 buffer;
+
+ next_rate *= 1000;
+ psched_ratecfg_precompute(&q->vars.rate, &q->vars.cfg_rate, next_rate);
+ /* The buffer is burst size in ns, ensure it is large enough to
+ * transmit max_size packet.
+ */
+ buffer = psched_l2t_ns(&q->vars.rate, q->params.max_size);
+ q->vars.buffer = max(buffer, q->params.buffer);
+}
+
+static void arl_change_mode(struct arl_sched_data *q, int mode)
+{
+ struct arl_vars *vars = &q->vars;
+ u64 next_rate;
+ u32 bw;
+
+ vars->phase = 0;
+ vars->cycle_cnt = 0;
+ vars->phase_start_t = ktime_get();
+ vars->latency_trend = 0;
+ vars->phase_dur = ARL_PHASE_DUR_MAX;
+ vars->rate_factor = ARL_BW_UNIT;
+
+ if (vars->mode == mode)
+ return;
+
+ vars->phase_dur =
+ clamp((unsigned long)(2 * minmax_get(&vars->st_min_hrtt) /
+ USEC_PER_MSEC),
+ ARL_PHASE_DUR_MIN, ARL_PHASE_DUR_MAX);
+
+ if (mode == ARL_UNTHROTTLED && vars->bw_est > q->params.max_bw)
+ arl_latency_sampling_enanbled = false;
+ else if (vars->mode == ARL_UNTHROTTLED)
+ arl_latency_sampling_enanbled = true;
+
+ /* Set up the new base rate for the next state. In general, the base rate
+ * should be set to the measured available bandwidth if the link is fully
+ * used. Use the instant BW value when entering DRAIN mode or exiting
+ * UNTHROTTLED; use the moving average of BW for the other states.
+ */
+ if (mode == ARL_DRAIN || vars->mode == ARL_UNTHROTTLED)
+ bw = vars->bw_est;
+ else
+ bw = ewma_arl_bw_avg_read(&vars->bw_avg);
+
+ /* Reduce BW to offset the overshoot due to the BW increase when exiting
+ * UNTHROTTLED or BW_PROBE mode
+ */
+ if (vars->mode == ARL_UNTHROTTLED)
+ bw -= (bw >> ARL_SCALE) * 12;
+ else if (vars->mode == ARL_BW_PROBE)
+ bw -= (bw >> ARL_SCALE);
+
+ /* Cap bw to the previous base_rate when exiting LATENCY_PROBE or entering
+ * DRAIN mode
+ */
+ if (vars->mode == ARL_LATENCY_PROBE || mode == ARL_DRAIN)
+ bw = min(bw, vars->base_rate);
+
+ /* New base rate for next mode */
+ vars->base_rate = max(bw, vars->base_rate);
+ /* rate adjustment for next cycle */
+ vars->probe_rate = ((vars->base_rate * ARL_HIGH_GAIN) >> ARL_SCALE);
+ vars->rate_delta = vars->probe_rate - vars->base_rate;
+
+ /* entering DRAIN mode */
+ if (mode == ARL_DRAIN) {
+ vars->last_drain_t = ktime_get();
+ vars->phase_dur = minmax_get(&vars->st_min_hrtt) /
+ USEC_PER_MSEC;
+ vars->phase_dur = clamp(vars->phase_dur, ARL_DRAIN_DUR_MIN,
+ ARL_DRAIN_DUR_MAX);
+
+ /* If latency is high, reduce the base rate to ~70%. */
+ if (arl_is_latency_high(q)) {
+ vars->base_rate = (vars->base_rate >> ARL_SCALE)
+ * ARL_DRAIN_GAIN;
+ } else if (arl_is_bw_full(q)) {
+ vars->base_rate -= vars->rate_delta;
+ }
+ vars->base_rate = max(bw, vars->base_rate);
+ /* Set rate_delta to ~33% of base_rate, so a [-1, -1, -1, 0]
+ * cycle can eliminate an RTT's worth of queue if the base rate
+ * is approximately equal to the BW.
+ */
+ vars->rate_delta = vars->base_rate / 3;
+ }
+
+ vars->last_min_hrtt = minmax_get(&vars->st_min_hrtt);
+ vars->min_hrtt_last_cycle = vars->last_min_hrtt;
+
+ vars->mode = mode;
+ vars->base_rate = max_t(u32, vars->base_rate, q->params.min_rate);
+
+ next_rate = vars->rate_delta * arl_rate_tbl[vars->mode][vars->phase]
+ + vars->base_rate;
+ arl_apply_new_rate(q, next_rate);
+ arl_bw_est_reset(q);
+}
+
+static void arl_update_phase(struct arl_sched_data *q)
+{
+ struct arl_vars *vars = &q->vars;
+ u64 next_rate;
+ int latency;
+ bool is_bw_full, is_latency_high;
+
+ if (time_after(jiffies, vars->last_latency_upd +
+ msecs_to_jiffies(ARL_LT_WIN * 2)) &&
+ vars->mode != ARL_STABLE) {
+ arl_change_mode(q, ARL_UNTHROTTLED);
+ return;
+ }
+
+ if (arl_check_drain(q)) {
+ arl_change_mode(q, ARL_DRAIN);
+ return;
+ }
+
+ /* update the latency_trend at the end of each phase */
+ latency = minmax_get(&vars->st_min_hrtt);
+ if ((latency + q->params.latency_hysteresis / 2) <
+ vars->min_hrtt_last_cycle)
+ vars->latency_trend--;
+ else if (latency > (vars->min_hrtt_last_cycle +
+ q->params.latency_hysteresis / 2))
+ vars->latency_trend++;
+
+ if (latency < ARL_LOW_LATENCY)
+ vars->latency_trend = 0;
+
+ vars->phase = (vars->phase == (ARL_CYCLE_LEN - 1)) ? 0 :
+ vars->phase + 1;
+ vars->phase_start_t = ktime_get();
+
+ next_rate = vars->rate_delta * arl_rate_tbl[vars->mode][vars->phase]
+ + vars->base_rate;
+
+ arl_apply_new_rate(q, next_rate);
+
+ if (vars->phase != 0)
+ return;
+
+ /* If there is no updated latency measurement, skip state update */
+ if (time_after(jiffies, vars->last_latency_upd + vars->phase_dur * 4))
+ return;
+
+ arl_update_bw_estimate(q);
+ is_bw_full = arl_is_bw_full(q);
+
+ /* Is latency high compared to long term minimum? */
+ is_latency_high = arl_is_latency_high(q);
+
+ if ((minmax_get(&q->vars.max_bw) > q->params.max_bw) &&
+ !is_latency_high) {
+ /* The available BW is too high to worry about bufferbloat,
+ * so disengage the rate limiter to avoid overhead.
+ */
+ arl_change_mode(q, ARL_UNTHROTTLED);
+ return;
+ }
+
+ switch (vars->mode) {
+ case ARL_STABLE:
+ if (is_latency_high && vars->bw_est > q->params.min_rate) {
+ arl_change_mode(q, ARL_LATENCY_PROBE);
+ return;
+ }
+ if (is_bw_full) {
+ if (vars->latency_trend > 1) {
+ arl_change_mode(q, ARL_LATENCY_PROBE);
+ return;
+ } else if (vars->cycle_cnt > 3) {
+ /* If there is sufficient traffic load, go to BW_PROBE
+ * after delaying for a few cycles.
+ */
+ arl_change_mode(q, ARL_BW_PROBE);
+ return;
+ }
+ } else {
+ vars->last_drain_t = ktime_get();
+ }
+ break;
+
+ case ARL_BW_PROBE:
+ if (is_latency_high) {
+ arl_change_mode(q, ARL_LATENCY_PROBE);
+ return;
+ }
+ if (!is_bw_full && vars->cycle_cnt > 3) {
+ arl_change_mode(q, ARL_STABLE);
+ return;
+ } else if (vars->latency_trend >= 2) {
+ arl_change_mode(q, ARL_LATENCY_PROBE);
+ return;
+ }
+
+ /* Switch to high gain if latency is stable. Pause the
+ * rate_delta increase for one in every 4 phases to observe
+ * the latency change. There could be some lag between a rate
+ * change and the latency change.
+ */
+ if (vars->latency_trend == 0 && vars->cycle_cnt % 4)
+ vars->rate_factor = ARL_HIGH_GAIN;
+ else
+ vars->rate_factor = ARL_BW_UNIT;
+
+ vars->probe_rate = ((vars->probe_rate * vars->rate_factor) >>
+ ARL_SCALE);
+ vars->rate_delta = vars->probe_rate - vars->base_rate;
+ /* If BW has increased significantly (>15%) without a latency
+ * increase, switch to UNTHROTTLED mode to quickly figure out
+ * the available BW.
+ */
+ if (vars->rate_delta >= vars->base_rate / 4) {
+ vars->rate_delta = max_t(u32, vars->rate_delta,
+ vars->base_rate / 2);
+ if (vars->bw_est > vars->base_rate * 115 / 100) {
+ arl_change_mode(q, ARL_UNTHROTTLED);
+ return;
+ }
+ }
+ if (vars->cycle_cnt > 20) {
+ arl_change_mode(q, ARL_STABLE);
+ return;
+ }
+ break;
+
+ case ARL_LATENCY_PROBE:
+ if (!is_latency_high || vars->bw_est < q->params.min_rate) {
+ /* If latency is no longer high or cannot be further reduced,
+ * go back to stable mode.
+ */
+ if (is_bw_full)
+ vars->base_rate -= vars->rate_delta / 4;
+ arl_change_mode(q, ARL_STABLE);
+ return;
+ }
+
+ /* If it is not just a short-term minor latency increase,
+ * then the previous minor rate adjustment is not sufficient.
+ * The base_rate likely exceeds the available bandwidth, so go
+ * to the DRAIN state.
+ */
+ if (vars->bw_est > q->params.min_rate &&
+ ((minmax_get(&q->vars.min_hrtt) > q->params.max_latency) ||
+ ((minmax_get(&q->vars.st_min_hrtt) >
+ q->params.max_latency) && vars->cycle_cnt > 2))) {
+ arl_change_mode(q, ARL_DRAIN);
+ return;
+ }
+
+ if (vars->latency_trend >= 0)
+ vars->rate_factor = ARL_HIGH_GAIN;
+ else
+ vars->rate_factor = ARL_BW_UNIT;
+
+ vars->rate_delta = ((vars->rate_delta * vars->rate_factor)
+ >> ARL_SCALE);
+ if (vars->rate_delta > vars->base_rate / 4)
+ vars->rate_delta = vars->base_rate / 4;
+ break;
+
+ case ARL_DRAIN:
+ if (!is_latency_high || vars->bw_est < q->params.min_rate) {
+ arl_change_mode(q, ARL_STABLE);
+ return;
+ }
+ arl_change_mode(q, ARL_LATENCY_PROBE);
+ return;
+
+ case ARL_UNTHROTTLED:
+ if (is_latency_high) {
+ arl_change_mode(q, ARL_LATENCY_PROBE);
+ return;
+ } else if ((vars->latency_trend > 1) ||
+ ((minmax_get(&vars->max_bw) < q->params.max_bw) &&
+ (vars->cycle_cnt > 10))) {
+ if (vars->latency_trend >= 2 && is_bw_full)
+ arl_change_mode(q, ARL_LATENCY_PROBE);
+ else
+ arl_change_mode(q, ARL_STABLE);
+ return;
+ }
+ break;
+ }
+
+ /* state unchanged */
+ vars->cycle_cnt++;
+ vars->min_hrtt_last_cycle = minmax_get(&vars->st_min_hrtt);
+ vars->latency_trend = 0;
+ if (vars->base_rate < q->stats.min_rate || q->stats.min_rate == 0)
+ q->stats.min_rate = vars->base_rate;
+ next_rate = vars->rate_delta * arl_rate_tbl[vars->mode][vars->phase]
+ + vars->base_rate;
+
+ arl_apply_new_rate(q, next_rate);
+}
+
+static void arl_update(struct Qdisc *sch)
+{
+ struct arl_sched_data *q = qdisc_priv(sch);
+
+ if (ktime_ms_delta(ktime_get(), q->vars.phase_start_t) <
+ q->vars.phase_dur)
+ return;
+
+ arl_update_phase(q);
+}
+
+static void arl_params_init(struct arl_params *params)
+{
+ params->max_size = 5000;
+ params->buffer = ARL_BUFFER_SIZE_DEFAULT * NSEC_PER_USEC;
+ params->max_bw = ARL_MAX_BW_DEFAULT;
+ params->min_rate = ARL_MIN_RATE_DEFAULT;
+ params->limit = 1000;
+ params->max_latency = ARL_MAX_LATENCY_DEFAULT;
+ params->latency_hysteresis = ARL_LAT_HYSTERESIS_DEFAULT;
+}
+
+static void arl_vars_init(struct arl_sched_data *q)
+{
+ struct arl_vars *vars = &q->vars;
+
+ vars->ts = ktime_get_ns();
+ q->vars.last_drain_t = ktime_get();
+ minmax_reset(&vars->lt_min_hrtt, jiffies, 5 * 1000);
+ minmax_reset(&vars->st_min_hrtt, jiffies, 5 * 1000);
+ minmax_reset(&vars->max_bw, jiffies, 0);
+ ewma_arl_bw_avg_init(&vars->bw_avg);
+ vars->cfg_rate.linklayer = TC_LINKLAYER_ETHERNET;
+ vars->base_rate = q->params.rate;
+ vars->tokens = q->vars.buffer;
+ vars->buffer = q->params.buffer;
+ arl_bw_est_reset(q);
+ arl_change_mode(q, ARL_STABLE);
+}
+
+static void arl_update_latency_ct(struct arl_sched_data *q,
+ struct tcp_latency_sample *lat, u32 latency)
+{
+ u32 s_hrtt, hrtt_last = lat->s_hrtt_us;
+
+ if (hrtt_last > ARL_LATENCY_SAMPLE_TIMEOUT_US ||
+ latency > ARL_LATENCY_SAMPLE_TIMEOUT_US)
+ hrtt_last = latency;
+
+ /* s_hrtt_us = 3/4 old s_hrtt_us + 1/4 new sample */
+ if (hrtt_last)
+ s_hrtt = hrtt_last * 4 + latency - hrtt_last;
+ else
+ s_hrtt = latency * 4;
+
+ s_hrtt = s_hrtt / 4;
+ if (s_hrtt > ARL_LATENCY_SAMPLE_TIMEOUT_US)
+ s_hrtt = latency;
+ lat->s_hrtt_us = s_hrtt;
+
+ minmax_running_min(&q->vars.st_min_hrtt,
+ (msecs_to_jiffies(q->vars.phase_dur)), jiffies,
+ latency);
+ minmax_running_min(&q->vars.min_hrtt, msecs_to_jiffies(ARL_MT_WIN),
+ jiffies, s_hrtt);
+ minmax_running_min(&q->vars.lt_min_hrtt, msecs_to_jiffies(ARL_LT_WIN),
+ jiffies, s_hrtt);
+ q->vars.last_latency_upd = jiffies;
+}
+
+static void arl_update_latency(struct arl_sched_data *q, u32 latency)
+{
+ minmax_running_min(&q->vars.st_min_hrtt,
+ (msecs_to_jiffies(q->vars.phase_dur)), jiffies,
+ latency);
+ minmax_running_min(&q->vars.min_hrtt, msecs_to_jiffies(ARL_MT_WIN),
+ jiffies, latency);
+ minmax_running_min(&q->vars.lt_min_hrtt, msecs_to_jiffies(ARL_LT_WIN),
+ jiffies, latency);
+ q->vars.last_latency_upd = jiffies;
+}
+
+/* Latency measurement related utilities.
+ * There are two sources of the latency measurement:
+ * 1) The kernel's RTT measurement for TCP sockets bound to the qdisc's interface.
+ * 2) The half-path RTT measured by ARL for routed TCP sessions. The half path
+ * measured is router -> internet -> ACKs back to the router.
+ *
+ * To measure the half-path RTT for routed TCP sessions:
+ * For each routed TCP flow, one egress packet is sampled for latency
+ * measurement. The sequence number extracted from the TCP header and the
+ * dequeue time are stored in the TCP stream's conntrack entry. The latency is
+ * measured from the time the packet is dequeued on the egress path to the
+ * time the TCP ACK for that segment is received on the ingress path.
+ */
+static void arl_egress_mark_pkt(struct sk_buff *skb, u32 seq,
+ struct nf_conn *ct)
+{
+ struct tcp_latency_sample *tcp_lat;
+
+ tcp_lat = &ct->proto.tcp.latency_sample;
+ tcp_lat->send_ts = ktime_get();
+ tcp_lat->last_seq = seq;
+ tcp_lat->last_hrtt = 0;
+}
+
+static struct tcphdr *arl_get_tcp_header_ipv4(struct sk_buff *skb,
+ void *buffer)
+{
+ const struct iphdr *iph;
+ struct tcphdr *tcph;
+ u32 tcph_offset;
+
+ if (unlikely(!pskb_may_pull(skb, sizeof(*iph))))
+ return NULL;
+
+ iph = ip_hdr(skb);
+ if (iph->protocol != IPPROTO_TCP)
+ return NULL;
+
+ tcph_offset = skb_network_offset(skb) + iph->ihl * 4;
+ if (tcph_offset > skb->len)
+ return NULL;
+
+ tcph = skb_header_pointer(skb, tcph_offset, sizeof(struct tcphdr),
+ buffer);
+ return tcph;
+}
+
+static struct tcphdr *arl_get_tcp_header_ipv6(struct sk_buff *skb,
+ void *buffer)
+{
+ const struct ipv6hdr *ipv6h;
+ struct tcphdr *tcphdr;
+ u8 proto;
+ __be16 frag_off;
+ int tcphoff;
+
+ if (unlikely(!pskb_may_pull(skb, sizeof(*ipv6h))))
+ return NULL;
+
+ ipv6h = ipv6_hdr(skb);
+ if (ipv6h->version != 6)
+ return NULL;
+
+ if (ipv6_addr_is_multicast(&ipv6h->daddr) ||
+ ipv6_addr_is_multicast(&ipv6h->saddr))
+ return NULL;
+
+ proto = ipv6h->nexthdr;
+ tcphoff = ipv6_skip_exthdr(skb, skb_network_offset(skb) +
+ sizeof(*ipv6h), &proto, &frag_off);
+
+ if (tcphoff < 0 || proto != IPPROTO_TCP ||
+ ((tcphoff + sizeof(struct tcphdr)) > skb->len))
+ return NULL;
+
+ tcphdr = skb_header_pointer(skb, tcphoff, sizeof(struct tcphdr),
+ buffer);
+ return tcphdr;
+}
+
+/* Find the conntrack entry for a packet that takes the shortcut path and has
+ * no ct entry set in its skb.
+ */
+struct nf_conn *arl_egress_find_ct_v4(struct sk_buff *skb, struct iphdr *iph,
+ struct tcphdr *tcph)
+{
+ struct nf_conntrack_tuple_hash *h;
+ struct nf_conntrack_tuple tuple;
+ struct nf_conn *ct = NULL;
+
+ if (!arl_latency_sampling_enanbled)
+ return ct;
+
+ /* construct a tuple to lookup nf_conn. */
+ memset(&tuple, 0, sizeof(tuple));
+ tuple.dst.protonum = iph->protocol;
+
+ /* The routed packet is transformed by NAPT, so use the CT entry from
+ * the reverse direction.
+ */
+ tuple.dst.dir = IP_CT_DIR_REPLY;
+ tuple.src.u3.ip = iph->daddr;
+ tuple.dst.u3.ip = iph->saddr;
+ tuple.src.l3num = AF_INET;
+
+ tuple.src.u.tcp.port = tcph->dest;
+ tuple.dst.u.tcp.port = tcph->source;
+
+ h = nf_conntrack_find_get(&init_net, &nf_ct_zone_dflt, &tuple);
+ if (unlikely(!h))
+ return ct;
+
+ ct = nf_ct_tuplehash_to_ctrack(h);
+ return ct;
+}
+
+struct nf_conn *arl_egress_find_ct_v6(struct sk_buff *skb,
+ struct ipv6hdr *ipv6h,
+ struct tcphdr *tcph)
+{
+ struct nf_conntrack_tuple_hash *h;
+ struct nf_conntrack_tuple tuple;
+ struct nf_conn *ct = NULL;
+
+ if (!arl_latency_sampling_enanbled)
+ return ct;
+
+ /* construct a tuple to lookup nf_conn. */
+ memset(&tuple, 0, sizeof(tuple));
+
+ tuple.dst.dir = IP_CT_DIR_REPLY;
+ tuple.dst.protonum = IPPROTO_TCP;
+
+ tuple.src.u3.in6 = ipv6h->daddr;
+ tuple.dst.u3.in6 = ipv6h->saddr;
+ tuple.src.l3num = AF_INET6;
+
+ tuple.dst.u.tcp.port = tcph->source;
+ tuple.src.u.tcp.port = tcph->dest;
+
+ h = nf_conntrack_find_get(&init_net, &nf_ct_zone_dflt, &tuple);
+ if (unlikely(!h))
+ return ct;
+
+ ct = nf_ct_tuplehash_to_ctrack(h);
+ return ct;
+}
+
+static void arl_latency_sample_egress(struct arl_sched_data *q,
+ struct sk_buff *skb)
+{
+ struct tcphdr *tcph, tcphdr;
+ struct nf_conn *ct = NULL; /* stays NULL for non-IP traffic */
+ struct tcp_latency_sample *tcp_lat;
+ u32 latency_sampling;
+ struct iphdr *iph;
+ struct ipv6hdr *ipv6h;
+
+ if (!arl_latency_sampling_enanbled)
+ return;
+
+ /* skip small packets */
+ if (!skb || skb->len < 256)
+ return;
+
+ /* Skip bc/mc packets. */
+ if (unlikely(skb->pkt_type == PACKET_BROADCAST ||
+ skb->pkt_type == PACKET_MULTICAST))
+ return;
+
+ /* Only process TCP packets */
+ if (likely(htons(ETH_P_IP) == skb->protocol)) {
+ iph = ip_hdr(skb);
+ tcph = arl_get_tcp_header_ipv4(skb, &tcphdr);
+ if (!tcph)
+ return;
+ ct = arl_egress_find_ct_v4(skb, iph, tcph);
+ } else if (likely(htons(ETH_P_IPV6) == skb->protocol)) {
+ ipv6h = ipv6_hdr(skb);
+ tcph = arl_get_tcp_header_ipv6(skb, &tcphdr);
+ if (!tcph)
+ return;
+ ct = arl_egress_find_ct_v6(skb, ipv6h, tcph);
+ }
+
+ if (unlikely(!ct))
+ return;
+
+ if (!nf_ct_is_confirmed(ct))
+ goto exit;
+
+ tcp_lat = &ct->proto.tcp.latency_sample;
+ latency_sampling = atomic_read(&tcp_lat->sampling_state);
+
+ if (unlikely(latency_sampling == ARL_SAMPLE_STATE_DONE)) {
+ u32 latency = tcp_lat->last_hrtt;
+
+ if (atomic_cmpxchg(&tcp_lat->sampling_state,
+ ARL_SAMPLE_STATE_DONE,
+ ARL_SAMPLE_STATE_UPDATING)
+ != ARL_SAMPLE_STATE_DONE)
+ goto exit;
+ if (latency) {
+ tcp_lat->last_hrtt = 0;
+ arl_update_latency_ct(q, tcp_lat, latency);
+ }
+ atomic_set(&tcp_lat->sampling_state,
+ ARL_SAMPLE_STATE_IDLE);
+ } else if (latency_sampling > ARL_SAMPLE_STATE_IDLE) {
+ goto exit;
+ }
+
+ if (atomic_cmpxchg(&tcp_lat->sampling_state, ARL_SAMPLE_STATE_IDLE,
+ ARL_SAMPLE_STATE_UPDATING) != ARL_SAMPLE_STATE_IDLE)
+ goto exit;
+
+ atomic_set(&tcp_lat->sampling_state,
+ ARL_SAMPLE_STATE_SAMPLING);
+ arl_egress_mark_pkt(skb, ntohl(tcph->seq), ct);
+
+exit:
+ nf_ct_put(ct);
+}
+
+/* Extract the half round-trip time from routed TCP packets.
+ * Return 0 if a valid hrtt sample (in us) was stored in tcp_lat, -1 otherwise.
+ */
+static int arl_get_hrtt(struct sk_buff *skb, u32 ack_seq,
+ struct tcp_latency_sample *tcp_lat)
+{
+ s64 time_delta;
+
+ if (ack_seq < tcp_lat->last_seq)
+ return -1;
+
+ time_delta = ktime_us_delta(ktime_get(), tcp_lat->send_ts);
+ if (time_delta > ARL_LATENCY_SAMPLE_TIMEOUT_US) {
+ atomic_set(&tcp_lat->sampling_state,
+ ARL_SAMPLE_STATE_IDLE);
+ return -1;
+ }
+
+ if (atomic_cmpxchg(&tcp_lat->sampling_state,
+ ARL_SAMPLE_STATE_SAMPLING,
+ ARL_SAMPLE_STATE_UPDATING) !=
+ ARL_SAMPLE_STATE_SAMPLING)
+ return -1;
+
+ tcp_lat->last_hrtt = time_delta;
+ atomic_set(&tcp_lat->sampling_state,
+ ARL_SAMPLE_STATE_DONE);
+ return 0;
+}
+
+void arl_latency_sample_ingress_v4(struct sk_buff *skb,
+ struct iphdr *iph,
+ struct tcphdr *tcph)
+{
+ struct nf_conn *ct;
+ struct nf_conntrack_tuple tuple;
+ struct nf_conntrack_tuple_hash *h;
+ struct tcp_latency_sample *tcp_lat;
+
+ if (!skb || !skb->dev || skb->dev->ifindex != arl_dev_index)
+ return;
+
+ if (!arl_latency_sampling_enanbled)
+ return;
+
+ /* construct a tuple to lookup nf_conn. */
+ memset(&tuple, 0, sizeof(tuple));
+ tuple.dst.dir = IP_CT_DIR_REPLY;
+ tuple.dst.protonum = IPPROTO_TCP;
+
+ iph = ip_hdr(skb);
+ tuple.src.u3.ip = iph->saddr;
+ tuple.dst.u3.ip = iph->daddr;
+ tuple.src.l3num = AF_INET;
+
+ tuple.dst.u.tcp.port = tcph->dest;
+ tuple.src.u.tcp.port = tcph->source;
+ h = nf_conntrack_find_get(&init_net, &nf_ct_zone_dflt, &tuple);
+ if (unlikely(!h))
+ return;
+
+ ct = nf_ct_tuplehash_to_ctrack(h);
+ if (!ct)
+ goto exit;
+
+ tcp_lat = &ct->proto.tcp.latency_sample;
+ if (atomic_read(&tcp_lat->sampling_state) != ARL_SAMPLE_STATE_SAMPLING)
+ goto exit;
+
+ if (arl_get_hrtt(skb, ntohl(tcph->ack_seq), tcp_lat))
+ goto exit;
+
+exit:
+ nf_ct_put(ct);
+}
+EXPORT_SYMBOL(arl_latency_sample_ingress_v4);
+
+void arl_latency_sample_ingress_v6(struct sk_buff *skb, struct ipv6hdr *ipv6h,
+ struct tcphdr *tcph)
+
+{
+ struct nf_conn *ct;
+ struct nf_conntrack_tuple tuple;
+ struct nf_conntrack_tuple_hash *h;
+ struct tcp_latency_sample *tcp_lat;
+
+ if (!skb || !skb->dev || skb->dev->ifindex != arl_dev_index)
+ return;
+
+ if (!arl_latency_sampling_enanbled)
+ return;
+
+ /* construct a tuple to lookup nf_conn. */
+ memset(&tuple, 0, sizeof(tuple));
+ tuple.dst.dir = IP_CT_DIR_REPLY;
+ tuple.dst.protonum = IPPROTO_TCP;
+
+ tuple.src.u3.in6 = ipv6h->saddr;
+ tuple.dst.u3.in6 = ipv6h->daddr;
+ tuple.src.l3num = AF_INET6;
+
+ tuple.dst.u.tcp.port = tcph->dest;
+ tuple.src.u.tcp.port = tcph->source;
+ h = nf_conntrack_find_get(&init_net, &nf_ct_zone_dflt, &tuple);
+ if (unlikely(!h))
+ return;
+
+ ct = nf_ct_tuplehash_to_ctrack(h);
+ if (!ct)
+ goto exit;
+
+ tcp_lat = &ct->proto.tcp.latency_sample;
+ if (atomic_read(&tcp_lat->sampling_state) != ARL_SAMPLE_STATE_SAMPLING)
+ goto exit;
+
+ if (arl_get_hrtt(skb, ntohl(tcph->ack_seq), tcp_lat))
+ goto exit;
+
+exit:
+ nf_ct_put(ct);
+}
+EXPORT_SYMBOL(arl_latency_sample_ingress_v6);
+
+void arl_latency_sample_ingress(struct sk_buff *skb)
+{
+ struct iphdr *iph;
+ struct ipv6hdr *ipv6h;
+ struct tcphdr *tcph, tcphdr;
+
+ if (htons(ETH_P_IP) == skb->protocol) {
+ tcph = arl_get_tcp_header_ipv4(skb, &tcphdr);
+ if (!tcph)
+ return;
+ iph = ip_hdr(skb);
+ arl_latency_sample_ingress_v4(skb, iph, tcph);
+ } else if (htons(ETH_P_IPV6) == skb->protocol) {
+ tcph = arl_get_tcp_header_ipv6(skb, &tcphdr);
+ if (!tcph)
+ return;
+ ipv6h = ipv6_hdr(skb);
+ arl_latency_sample_ingress_v6(skb, ipv6h, tcph);
+ }
+}
+EXPORT_SYMBOL(arl_latency_sample_ingress);
+
+/* GSO packets may be too big and take more than the maximum number of tokens
+ * to transmit. Segment GSO packets that are larger than max_size.
+ */
+static int gso_segment(struct sk_buff *skb, struct Qdisc *sch,
+ struct sk_buff **to_free)
+{
+ struct arl_sched_data *q = qdisc_priv(sch);
+ struct sk_buff *segs, *nskb;
+ netdev_features_t features = netif_skb_features(skb);
+ unsigned int len = 0, prev_len = qdisc_pkt_len(skb);
+ int ret, nb;
+
+ segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
+
+ if (IS_ERR_OR_NULL(segs))
+ return qdisc_drop(skb, sch, to_free);
+
+ nb = 0;
+ while (segs) {
+ nskb = segs->next;
+ segs->next = NULL;
+ qdisc_skb_cb(segs)->pkt_len = segs->len;
+ len += segs->len;
+ ret = qdisc_enqueue(segs, q->qdisc, to_free);
+ if (ret != NET_XMIT_SUCCESS) {
+ if (net_xmit_drop_count(ret))
+ qdisc_qstats_drop(sch);
+ } else {
+ nb++;
+ }
+ segs = nskb;
+ }
+ sch->q.qlen += nb;
+ if (nb > 1)
+ qdisc_tree_reduce_backlog(sch, 1 - nb, prev_len - len);
+ consume_skb(skb);
+
+ return nb > 0 ? NET_XMIT_SUCCESS : NET_XMIT_DROP;
+}
+
+static int arl_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+ struct sk_buff **to_free)
+{
+ struct arl_sched_data *q = qdisc_priv(sch);
+ int ret;
+
+ if (qdisc_pkt_len(skb) > q->params.max_size) {
+ if (skb_is_gso(skb))
+ return gso_segment(skb, sch, to_free);
+ return qdisc_drop(skb, sch, to_free);
+ }
+
+ ret = qdisc_enqueue(skb, q->qdisc, to_free);
+ if (unlikely(ret != NET_XMIT_SUCCESS)) {
+ if (net_xmit_drop_count(ret))
+ qdisc_qstats_drop(sch);
+ return ret;
+ }
+ return NET_XMIT_SUCCESS;
+}
+
+static struct sk_buff *arl_dequeue(struct Qdisc *sch)
+{
+ struct arl_sched_data *q = qdisc_priv(sch);
+ struct sk_buff *skb;
+
+ arl_update(sch);
+ skb = q->qdisc->ops->peek(q->qdisc);
+
+ if (skb) {
+ s64 now;
+ s64 toks;
+ unsigned int len = qdisc_pkt_len(skb);
+
+ if (len > q->params.max_size) {
+ pr_err("%s: Oversized pkt! %u Bytes, max:%u\n",
+ __func__, len, q->params.max_size);
+ len = q->params.max_size - 1;
+ }
+
+ if (q->vars.mode == ARL_UNTHROTTLED) {
+ skb = qdisc_dequeue_peeked(q->qdisc);
+ if (unlikely(!skb))
+ return NULL;
+ qdisc_bstats_update(sch, skb);
+ q->vars.bw_est_bytes_sent += len;
+ arl_latency_sample_egress(q, skb);
+ return skb;
+ }
+
+ now = ktime_get_ns();
+ toks = min(now - q->vars.ts, q->vars.buffer);
+
+ toks += q->vars.tokens;
+ if (toks > q->vars.buffer)
+ toks = q->vars.buffer;
+ toks -= psched_l2t_ns(&q->vars.rate, len);
+
+ if (toks >= 0) {
+ skb = qdisc_dequeue_peeked(q->qdisc);
+ if (unlikely(!skb))
+ return NULL;
+
+ q->vars.ts = now;
+ q->vars.tokens = toks;
+ qdisc_bstats_update(sch, skb);
+ q->vars.bw_est_bytes_sent += len;
+ arl_latency_sample_egress(q, skb);
+ return skb;
+ }
+ qdisc_watchdog_schedule_ns(&q->wtd, now + (-toks));
+
+ qdisc_qstats_overlimit(sch);
+ }
+ return NULL;
+}
+
+static void arl_reset(struct Qdisc *sch)
+{
+ struct arl_sched_data *q = qdisc_priv(sch);
+
+ qdisc_reset(q->qdisc);
+ q->vars.ts = ktime_get_ns();
+ q->vars.tokens = q->vars.buffer;
+ qdisc_watchdog_cancel(&q->wtd);
+}
+
+static const struct nla_policy arl_policy[TCA_ARL_MAX + 1] = {
+ [TCA_ARL_BUFFER] = { .type = NLA_U32 },
+ [TCA_ARL_MIN_RATE] = { .type = NLA_U64 },
+ [TCA_ARL_MAX_BW] = { .type = NLA_U64 },
+ [TCA_ARL_LIMIT] = { .type = NLA_U32 },
+ [TCA_ARL_MAX_LATENCY] = { .type = NLA_U32 },
+ [TCA_ARL_LATENCY_HYSTERESIS] = { .type = NLA_U32 },
+};
+
+static int arl_change(struct Qdisc *sch, struct nlattr *opt)
+{
+ int err;
+ struct arl_sched_data *q = qdisc_priv(sch);
+ struct nlattr *tb[TCA_ARL_MAX + 1];
+ struct Qdisc *child = NULL;
+ struct psched_ratecfg rate;
+ struct tc_ratespec rate_conf;
+
+ err = nla_parse_nested(tb, TCA_ARL_MAX, opt, arl_policy, NULL);
+ if (err < 0)
+ return err;
+
+ if (tb[TCA_ARL_BUFFER])
+ q->params.buffer = nla_get_u32(tb[TCA_ARL_BUFFER])
+ * NSEC_PER_USEC;
+
+ if (tb[TCA_ARL_MAX_BW])
+ q->params.max_bw = nla_get_u64(tb[TCA_ARL_MAX_BW]);
+
+ if (tb[TCA_ARL_MIN_RATE])
+ q->params.min_rate = div_u64(nla_get_u64(tb[TCA_ARL_MIN_RATE]),
+ 1000);
+
+ /* Start ARL with min_rate * 2 */
+ q->params.rate = q->params.min_rate * 2;
+
+ if (tb[TCA_ARL_LIMIT])
+ q->params.limit = nla_get_u32(tb[TCA_ARL_LIMIT]);
+
+ if (tb[TCA_ARL_MAX_LATENCY])
+ q->params.max_latency = nla_get_u32(tb[TCA_ARL_MAX_LATENCY]);
+ if (tb[TCA_ARL_LATENCY_HYSTERESIS])
+ q->params.latency_hysteresis =
+ nla_get_u32(tb[TCA_ARL_LATENCY_HYSTERESIS]);
+ if (q->params.max_latency < ARL_MAX_LATENCY_DEFAULT / 2)
+ q->params.max_latency = ARL_MAX_LATENCY_DEFAULT;
+
+ arl_vars_init(q);
+ memset(&rate_conf, 0, sizeof(rate_conf));
+ rate_conf.linklayer = TC_LINKLAYER_ETHERNET;
+
+ psched_ratecfg_precompute(&rate, &rate_conf, q->params.rate * 1000);
+ memcpy(&q->vars.rate, &rate, sizeof(struct psched_ratecfg));
+
+ if (q->qdisc != &noop_qdisc) {
+ err = fifo_set_limit(q->qdisc, q->params.limit);
+ if (err)
+ goto done;
+ } else if (q->params.limit > 0) {
+ child = fifo_create_dflt(sch, &bfifo_qdisc_ops,
+ q->params.limit);
+ if (IS_ERR(child)) {
+ err = PTR_ERR(child);
+ goto done;
+ }
+
+ /* child is fifo, no need to check for noop_qdisc */
+ qdisc_hash_add(child, true);
+ }
+
+ sch_tree_lock(sch);
+ if (child) {
+ qdisc_tree_reduce_backlog(q->qdisc, q->qdisc->q.qlen,
+ q->qdisc->qstats.backlog);
+ qdisc_destroy(q->qdisc);
+ q->qdisc = child;
+ }
+
+ sch_tree_unlock(sch);
+done:
+ return err;
+}
+
+static u32 arl_get_rtt_from_sk(struct sock *sk)
+{
+ const struct tcp_sock *tp = tcp_sk(sk);
+ u32 rtt = U32_MAX, last_ack;
+
+ if (sk->sk_state != TCP_ESTABLISHED)
+ return rtt;
+
+ last_ack = jiffies_to_msecs(jiffies - tp->rcv_tstamp);
+ if (last_ack > ARL_ST_WIN) /* Discard stale data */
+ return rtt;
+
+ rtt = tp->srtt_us >> 3;
+ return rtt;
+}
+
+u32 arl_get_rtt(struct Qdisc *sch)
+{
+ int i;
+ struct inet_hashinfo *hashinfo = &tcp_hashinfo;
+ u32 rtt, rtt_min = U32_MAX;
+ struct net_device *dev = qdisc_dev(sch);
+
+ for (i = 0; i <= hashinfo->ehash_mask; i++) {
+ struct inet_ehash_bucket *head = &hashinfo->ehash[i];
+ spinlock_t *lock = inet_ehash_lockp(hashinfo, i);
+ struct sock *sk;
+ struct hlist_nulls_node *node;
+
+ if (hlist_nulls_empty(&head->chain))
+ continue;
+
+ spin_lock_bh(lock);
+ sk_nulls_for_each(sk, node, &head->chain) {
+ if (sk->sk_family != AF_INET && sk->sk_family !=
+ AF_INET6)
+ continue;
+ if (inet_sk(sk)->rx_dst_ifindex != dev->ifindex)
+ continue;
+ rtt = arl_get_rtt_from_sk(sk);
+ if (rtt == U32_MAX)
+ continue;
+
+ if (rtt < rtt_min)
+ rtt_min = rtt;
+ }
+ spin_unlock_bh(lock);
+ }
+ return rtt_min;
+}
+
+static void arl_timer_func(unsigned long data)
+{
+ struct Qdisc *sch = (struct Qdisc *)data;
+ struct arl_sched_data *q = qdisc_priv(sch);
+ u32 rtt;
+
+ mod_timer(&q->arl_timer, jiffies + ARL_TIMER_INTERVAL);
+ rtt = arl_get_rtt(sch);
+
+ if (rtt != U32_MAX)
+ arl_update_latency(q, rtt);
+}
+
+static int arl_init(struct Qdisc *sch, struct nlattr *opt)
+{
+ struct arl_sched_data *q = qdisc_priv(sch);
+ struct net_device *dev = qdisc_dev(sch);
+
+ arl_dev_index = dev->ifindex;
+
+ arl_params_init(&q->params);
+ qdisc_watchdog_init(&q->wtd, sch);
+ q->qdisc = &noop_qdisc;
+
+ init_timer(&q->arl_timer);
+ q->arl_timer.expires = jiffies + ARL_TIMER_INTERVAL;
+ q->arl_timer.data = (unsigned long)sch;
+ q->arl_timer.function = arl_timer_func;
+ add_timer(&q->arl_timer);
+
+ if (opt) {
+ int err = arl_change(sch, opt);
+
+ if (err)
+ return err;
+ }
+ arl_latency_sampling_enanbled = true;
+
+ return 0;
+}
+
+static void arl_destroy(struct Qdisc *sch)
+{
+ struct arl_sched_data *q = qdisc_priv(sch);
+
+ arl_dev_index = -1;
+ qdisc_watchdog_cancel(&q->wtd);
+ del_timer_sync(&q->arl_timer);
+ qdisc_destroy(q->qdisc);
+}
+
+static int arl_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+ struct arl_sched_data *q = qdisc_priv(sch);
+ struct nlattr *nest;
+
+ nest = nla_nest_start(skb, TCA_OPTIONS);
+ if (!nest)
+ goto nla_put_failure;
+
+ if ((nla_put_u32(skb, TCA_ARL_BUFFER,
+ q->params.buffer / NSEC_PER_USEC)) ||
+ (nla_put_u64_64bit(skb, TCA_ARL_MIN_RATE,
+ q->params.min_rate * 1000, TCA_ARL_PAD)) ||
+ (nla_put_u32(skb, TCA_ARL_LIMIT, q->params.limit)) ||
+ (nla_put_u64_64bit(skb, TCA_ARL_MAX_BW, q->params.max_bw,
+ TCA_ARL_PAD)) ||
+ (nla_put_u32(skb, TCA_ARL_LATENCY_HYSTERESIS,
+ q->params.latency_hysteresis)) ||
+ (nla_put_u32(skb, TCA_ARL_MAX_LATENCY, q->params.max_latency)))
+ goto nla_put_failure;
+
+ return nla_nest_end(skb, nest);
+
+nla_put_failure:
+ nla_nest_cancel(skb, nest);
+ return -1;
+}
+
+static int arl_dump_class(struct Qdisc *sch, unsigned long cl,
+ struct sk_buff *skb, struct tcmsg *tcm)
+{
+ struct arl_sched_data *q = qdisc_priv(sch);
+
+ tcm->tcm_handle |= TC_H_MIN(1);
+ tcm->tcm_info = q->qdisc->handle;
+
+ return 0;
+}
+
+static int arl_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+ struct arl_sched_data *q = qdisc_priv(sch);
+ struct tc_arl_xstats st = { 0 };
+
+ /* convert bw and rate from KBps to Kbps */
+ st.max_bw = q->stats.max_bw * 8;
+ st.min_rate = q->stats.min_rate * 8;
+ st.current_rate = q->vars.base_rate * 8;
+ st.latency = minmax_get(&q->vars.min_hrtt);
+
+ /* clear max_bw and min_rate after each stats dump */
+ q->stats.max_bw = 0;
+ q->stats.min_rate = 0;
+
+ return gnet_stats_copy_app(d, &st, sizeof(st));
+}
+
+static int arl_graft(struct Qdisc *sch, unsigned long arg, struct Qdisc *new,
+ struct Qdisc **old)
+{
+ struct arl_sched_data *q = qdisc_priv(sch);
+
+ if (!new)
+ new = &noop_qdisc;
+
+ *old = qdisc_replace(sch, new, &q->qdisc);
+ return 0;
+}
+
+static struct Qdisc *arl_leaf(struct Qdisc *sch, unsigned long arg)
+{
+ struct arl_sched_data *q = qdisc_priv(sch);
+
+ return q->qdisc;
+}
+
+static unsigned long arl_find(struct Qdisc *sch, u32 classid)
+{
+ return 1;
+}
+
+static void arl_walk(struct Qdisc *sch, struct qdisc_walker *walker)
+{
+ if (!walker->stop) {
+ if (walker->count >= walker->skip)
+ if (walker->fn(sch, 1, walker) < 0) {
+ walker->stop = 1;
+ return;
+ }
+ walker->count++;
+ }
+}
+
+static const struct Qdisc_class_ops arl_class_ops = {
+ .graft = arl_graft,
+ .leaf = arl_leaf,
+ .find = arl_find,
+ .walk = arl_walk,
+ .dump = arl_dump_class,
+};
+
+static struct Qdisc_ops arl_qdisc_ops __read_mostly = {
+ .next = NULL,
+ .cl_ops = &arl_class_ops,
+ .id = "arl",
+ .priv_size = sizeof(struct arl_sched_data),
+ .enqueue = arl_enqueue,
+ .dequeue = arl_dequeue,
+ .peek = qdisc_peek_dequeued,
+ .init = arl_init,
+ .reset = arl_reset,
+ .destroy = arl_destroy,
+ .change = arl_change,
+ .dump = arl_dump,
+ .dump_stats = arl_dump_stats,
+ .owner = THIS_MODULE,
+};
+
+static int __init arl_module_init(void)
+{
+ return register_qdisc(&arl_qdisc_ops);
+}
+
+static void __exit arl_module_exit(void)
+{
+ unregister_qdisc(&arl_qdisc_ops);
+}
+
+module_init(arl_module_init)
+module_exit(arl_module_exit)
+
+MODULE_DESCRIPTION("Adaptive Rate Limiting (ARL) queue discipline");
+MODULE_AUTHOR("Kan Yan <kyan@google.com>");
+MODULE_LICENSE("GPL");
+
diff --git a/net/sched/sch_arl.h b/net/sched/sch_arl.h
new file mode 100644
index 000000000000..3a4643e064d0
--- /dev/null
+++ b/net/sched/sch_arl.h
@@ -0,0 +1,121 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SCH_ARL
+#define _SCH_ARL
+
+enum arl_mode {
+ /* STABLE: Maintaining current rate */
+ ARL_STABLE = 0,
+ /* DRAIN: Drop rate sharply to reduce latency */
+ ARL_DRAIN,
+ /* BW_PROBE: Increase rate gradually to probe if more BW is available */
+ ARL_BW_PROBE,
+ /* LATENCY_PROBE: Decrease rate gradually to probe if latency can be
+ * reduced
+ */
+ ARL_LATENCY_PROBE,
+ /* UNTHROTTLED: Not enforcing rate shaping */
+ ARL_UNTHROTTLED
+};
+
+enum arl_latency_sampling_state {
+ ARL_SAMPLE_STATE_IDLE = 0,
+ ARL_SAMPLE_STATE_UPDATING,
+ ARL_SAMPLE_STATE_SAMPLING,
+ ARL_SAMPLE_STATE_DONE
+};
+
+struct arl_params {
+ u64 min_rate; /* The lowest rate for the rate limiter */
+ u64 rate; /* Initial rate */
+ u32 buffer; /* Burst size, in ns */
+ /* The maximum rate up to which rate limiting will be enforced. Above
+ * max_bw, ARL enters UNTHROTTLED mode and bypasses the rate shaper to
+ * avoid CPU overhead.
+ */
+ u64 max_bw;
+ u32 limit; /* Maximum number of packets allowed in the queue */
+ u32 max_latency; /* The upper limit for latency */
+ u32 max_size; /* Max packet size */
+ u32 latency_hysteresis;
+};
+
+DECLARE_EWMA(arl_bw_avg, 3, 8)
+
+struct arl_vars {
+ s64 tokens; /* token count*/
+ s64 buffer; /* Token bucket depth/rate */
+ s64 ts; /* Last de-queue time */
+ ktime_t phase_start_t; /* phase start time */
+ ktime_t last_drain_t; /* Timestamp of last DRAIN */
+ struct psched_ratecfg rate; /* The current rate limit */
+ struct tc_ratespec cfg_rate;
+ u64 bw_est_bytes_sent;
+ u32 bw_est; /* Current estimate of throughput, in KBps */
+ u32 last_bw_est; /* bw_est from previous cycle */
+ /* Exponentially Weighted Moving Avg. of bw */
+ struct ewma_arl_bw_avg bw_avg;
+ unsigned long bw_est_start_t; /* Timestamp of the start */
+ struct minmax max_bw; /* windowed max of the bw measured */
+ /* Short term (last cycle) windowed min of half-path RTT */
+ struct minmax st_min_hrtt;
+ struct minmax min_hrtt; /* Medium term (~2s) windowed min HRTT */
+ struct minmax lt_min_hrtt; /* Longer term (~10s) win. min HRTT */
+ u32 min_hrtt_last_cycle; /* min_hrtt at the end of prev. cycle */
+ u32 last_min_hrtt; /* min_hrtt at the end of prev. state */
+ u32 mode; /* Current state */
+ u32 phase; /* Current phase */
+ u32 phase_dur; /* In ms */
+ u32 cycle_cnt; /* Cycles spent in the current state */
+ u32 base_rate; /* The base rate for current mode */
+ u32 probe_rate; /* Adjusted rate for current cycle */
+ u32 last_rate; /* Previous rate before enter DRAIN state */
+ s32 rate_factor; /* Factor for increase/decrease rate */
+ s32 rate_delta; /* The adjust of rate from base_rate */
+ /* Compared to the prev. cycle, is latency increasing or decreasing? */
+ s32 latency_trend;
+ ulong last_latency_upd; /* Timestamp of last valid latency measurement */
+};
+
+#define ARL_TIMER_INTERVAL (150 * HZ / MSEC_PER_SEC) /* 150 ms */
+#define ARL_LAT_HYSTERESIS_DEFAULT (40 * USEC_PER_MSEC) /* 40ms in us */
+
+#define ARL_CYCLE_LEN 4 /* # of phases in a cycle */
+#define ARL_PHASE_DUR_MAX 250UL /* in ms */
+#define ARL_PHASE_DUR_MIN 100UL /* in ms */
+#define ARL_DRAIN_DUR_MAX 250U /* in ms */
+#define ARL_DRAIN_DUR_MIN 20U /* in ms */
+
+#define ARL_DRAIN_INTERVAL (10 * MSEC_PER_SEC) /* 10s in ms */
+/* Longer term window size in ms, for the windowed min/max of bandwidth
+ * and latency measurement. Should include a DRAIN cycle.
+ */
+#define ARL_LT_WIN (11 * MSEC_PER_SEC)
+#define ARL_MT_WIN (3 * MSEC_PER_SEC) /* Medium term window size */
+#define ARL_ST_WIN 150 /* Short term window size in ms */
+
+#define ARL_LATENCY_SAMPLE_TIMEOUT_US (2 * USEC_PER_SEC)
+/* Low latency threshold to skip DRAIN cycle */
+#define ARL_LOW_LATENCY (40 * USEC_PER_MSEC)
+
+/* The maximum BW/throughput, above it ARL will not enforce rate limit */
+#define ARL_MAX_BW_DEFAULT (300 * 1000000 / 8) /* In Bytes per sec */
+/* The minimum rate limit. ARL will not go below this limit */
+#define ARL_MIN_RATE_DEFAULT (2 * 1000 / 8) /* In KBytes per sec */
+/* The default burst size for the Token Buffer rate limiter */
+#define ARL_BUFFER_SIZE_DEFAULT (5 * USEC_PER_MSEC) /* in us */
+#define ARL_MAX_LATENCY_DEFAULT (150 * USEC_PER_MSEC) /* in us */
+
+struct arl_stats {
+ u32 max_bw; /* max bw detected */
+ u32 min_rate; /* the min. of base rate */
+};
+
+struct arl_sched_data {
+ struct arl_params params;
+ struct arl_vars vars;
+ struct arl_stats stats;
+ struct qdisc_watchdog wtd;
+ struct Qdisc *qdisc;
+ struct timer_list arl_timer;
+};
+#endif /* _SCH_ARL */