奔跑的蚂蚁

内核选项tcp_tw_recycle原理

2016-10-14 #Linux #TCP

why TIME_WAIT state ？

在TCP状态机中，TIME_WAIT状态主要用来避免延迟的旧连接的数据被相同的新连接(连接相同指SRC IP:PORT DST IP:PORT相同)所接收。

time_wait

如上图所示:

连接<(IP a : Port p) - (IP b : Port q)>断开
新连接<(IP a : Port p) - (IP b : Port q)>建立
上次断开连接的数据在网络中延迟并在新连接建立后到来，这样的数据应该是错误并被避免的

TIME_WAIT状态就是来解决上面这样的问题的，一般TIME_WAIT状态等待的时间等于2MSL(最大报文存活时间的2倍)，这样就可以确保旧的延迟的报文不被新连接所接收到。

在Linux具体实现协议栈时，TIME_WAIT状态是保存在每个连接相关的数据(结构TCB即TCP control block，保存该连接所有的状态数据)中的，因此在一些高性能的服务器上，就希望这样的TIME_WAIT状态可以尽量的少(节省服务器内存，并且可以快速的端口重用以建立新连接)，所以RFC6191提出了一种基于TCP timestamp的方法来进行快速的TIME_WAIT状态回收。

RFC6191在思路上借鉴了RFC1122，对比RFC1122通过序列号的递增性，RFC6191则通过TCP timestamp来进行快速的TIME_WAIT状态回收，具体思路大致是当一个新连接(SRC IP, SRC PORT, DST IP, DST PORT)到来时，如果有相同的连接处于TIME_WAIT状态，且新连接的SYN报文的timestamp要大于最近该连接上的timestamp时，可以接受该连接建立的请求。即要求了来自同一个连接(4元组)的timestamp是单调递增的。

tcp_tw_recycle内核实现

在内核中，当开启了tcp_tw_recycle选项时，即:

$ echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle

内核TCP连接建立时函数(未贴完):

1440 int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
1441 {
1442         struct tcp_options_received tmp_opt;
1443         struct request_sock *req;
1444         struct inet_request_sock *ireq;
1445         struct tcp_sock *tp = tcp_sk(sk);
1446         struct dst_entry *dst = NULL;
1447         __be32 saddr = ip_hdr(skb)->saddr;
1448         __be32 daddr = ip_hdr(skb)->daddr;
1449         __u32 isn = TCP_SKB_CB(skb)->when;
1450         bool want_cookie = false;
1451         struct flowi4 fl4;
1452         struct tcp_fastopen_cookie foc = { .len = -1 };
1453         struct tcp_fastopen_cookie valid_foc = { .len = -1 };
1454         struct sk_buff *skb_synack;
1455         int do_fastopen;
1456
1457         /* Never answer to SYNs send to broadcast or multicast */
1458         if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
1459                 goto drop;
1460
1461         /* TW buckets are converted to open requests without
1462  * limitations, they conserve resources and peer is
1463  * evidently real one.
1464  */
1465         if (inet_csk_reqsk_queue_is_full(sk) && !isn) {
1466                 want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
1467                 if (!want_cookie)
1468                         goto drop;
1469         }
1470
1471         /* Accept backlog is full. If we have already queued enough
1472  * of warm entries in syn queue, drop request. It is better than
1473  * clogging syn queue with openreqs with exponentially increasing
1474  * timeout.
1475  */
1476         if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
1477                 NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
1478                 goto drop;
1479         }
1480
1481         req = inet_reqsk_alloc(&tcp_request_sock_ops);
1482         if (!req)
1483                 goto drop;
1484
1485 #ifdef CONFIG_TCP_MD5SIG
1486         tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
1487 #endif
1488
1489         tcp_clear_options(&tmp_opt);
1490         tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
1491         tmp_opt.user_mss  = tp->rx_opt.user_mss;
1492         tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
1493
1494         if (want_cookie && !tmp_opt.saw_tstamp)
1495                 tcp_clear_options(&tmp_opt);
1496
1497         tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
1498         tcp_openreq_init(req, &tmp_opt, skb);
1499
1500         ireq = inet_rsk(req);
1501         ireq->loc_addr = daddr;
1502         ireq->rmt_addr = saddr;
1503         ireq->no_srccheck = inet_sk(sk)->transparent;
1504         ireq->opt = tcp_v4_save_options(skb);
1505
1506         if (security_inet_conn_request(sk, skb, req))
1507                 goto drop_and_free;
1508
1509         if (!want_cookie || tmp_opt.tstamp_ok)
1510                 TCP_ECN_create_request(req, skb, sock_net(sk));
1511
1512         if (want_cookie) {
1513                 isn = cookie_v4_init_sequence(sk, skb, &req->mss);
1514                 req->cookie_ts = tmp_opt.tstamp_ok;
1515         } else if (!isn) {
1516                 /* VJ's idea. We save last timestamp seen
1517  * from the destination in peer table, when entering
1518  * state TIME-WAIT, and check against it before
1519  * accepting new connection request.
1520  *
1521  * If "isn" is not zero, this request hit alive
1522  * timewait bucket, so that all the necessary checks
1523  * are made in the function processing timewait state.
1524  */
1525                 if (tmp_opt.saw_tstamp &&
1526                     tcp_death_row.sysctl_tw_recycle &&
1527                     (dst = inet_csk_route_req(sk, &fl4, req)) != NULL &&
1528                     fl4.daddr == saddr) {
1529                         if (!tcp_peer_is_proven(req, dst, true)) {
1530                                 NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
1531                                 goto drop_and_release;
1532                         }
1533                 }

可以看出在1525行条件判断处，开启TCP_TW_RECYCLE时，会进入函数tcp_peer_is_proven，即判断时间戳的递增性，该函数代码如下:

536 bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst, bool paws_check)
537 {
538         struct tcp_metrics_block *tm;
539         bool ret;
540
541         if (!dst)
542                 return false;
543
544         rcu_read_lock();
545         tm = __tcp_get_metrics_req(req, dst);
546         if (paws_check) {
547                 if (tm &&
548                     (u32)get_seconds() - tm->tcpm_ts_stamp < TCP_PAWS_MSL &&
549                     (s32)(tm->tcpm_ts - req->ts_recent) > TCP_PAWS_WINDOW)
550                         ret = false;
551                 else
552                         ret = true;
553         } else {
554                 if (tm && tcp_metric_get(tm, TCP_METRIC_RTT) && tm->tcpm_ts_stamp)
555                         ret = true;
556                 else
557                         ret = false;
558         }
559         rcu_read_unlock();
560
561         return ret;
562 }
563 EXPORT_SYMBOL_GPL(tcp_peer_is_proven);

在549行 (s32)(tm->tcpm_ts - req->ts_recent) > TCP_PAWS_WINDOW) 即为当前时间小于最近该连接上的时间戳(这个时间戳内核会记录的)，然后返回了false，连接建立就会失败。

NAT 与 tcp_tw_recycle

当连接通过NAT机器转换到达开启了TCP_TW_RECYCLE的机器上时，由于NAT机器在做转换时并不会修改协议中的时间戳，因此来自不同机器的请求的时间戳不一定是单调递增的，根据上述原理，所以就会出现连接建立失败的情况。

所以一般机器都是关闭了该选项的，以避免出现连接建立失败的情况。

参考资料

http://www.isi.edu/touch/pubs/infocomm99/infocomm99-web/

https://tools.ietf.org/html/rfc1122#section-4.2.2.13

https://tools.ietf.org/html/rfc6191#page-3

http://www.cnblogs.com/lulu/p/4149312.html