MTU:
- A Layer 2 Ethernet frame has the following components:
- Destination MAC (6 bytes) + Source MAC (6 bytes) + EtherType (2 bytes) + encapsulated data + FCS (4 bytes)
- The encapsulated data is limited to 1500 bytes in standard Ethernet. This limit on the encapsulated data is called the Ethernet MTU.
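- The hard upper bound for an IPv4 packet is 65,535 bytes (its Total Length field is 16 bits). On Ethernet, jumbo frames are commonly configured at around 9000 bytes.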
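The frame layout above can be turned into a quick size check. This is a minimal sketch that assumes an untagged Ethernet II frame (no 802.1Q VLAN tag) and ignores the preamble and minimum-payload padding; `frame_size` is a hypothetical helper name.

```python
# Field sizes in bytes for an untagged Ethernet II frame (assumption: no VLAN tag).
DST_MAC = 6
SRC_MAC = 6
ETHERTYPE = 2
FCS = 4


def frame_size(payload_len: int) -> int:
    """Return the frame size for a given length of encapsulated data."""
    return DST_MAC + SRC_MAC + ETHERTYPE + payload_len + FCS


print(frame_size(1500))  # 1518
```

So a full 1500-byte MTU payload produces a 1518-byte frame.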
MSS:
- MSS (Maximum Segment Size) is the largest payload a single TCP segment can carry
- How we calculate MSS:
- For IPv4 and TCP:
- The IPv4 header is 20 bytes and the TCP header is 20 bytes (both without optional fields), and our MTU limit is 1500 bytes:
- 1500 bytes - IP header (20 bytes) - TCP header (20 bytes) = 1460 bytes
- So the largest segment of data TCP can send is 1460 bytes. This counts only the data from the upper application layer.
- For IPv6 and TCP:
- The IPv6 header is 40 bytes and the TCP header is 20 bytes (without optional fields):
- 1500 bytes - 40 bytes - 20 bytes = 1440 bytes
- 1440 bytes is the MSS for IPv6.
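The MSS arithmetic can be written as a small function. This is a sketch that assumes headers without options or extension headers; `mss` is a hypothetical helper name.

```python
IPV4_HEADER = 20  # bytes, no IP options
IPV6_HEADER = 40  # bytes, fixed header only, no extension headers
TCP_HEADER = 20   # bytes, no TCP options


def mss(mtu: int, ip_version: int = 4) -> int:
    """Return the TCP MSS that fits into one packet of the given MTU."""
    ip_header = IPV4_HEADER if ip_version == 4 else IPV6_HEADER
    return mtu - ip_header - TCP_HEADER


print(mss(1500, 4))  # 1460
print(mss(1500, 6))  # 1440
print(mss(9000, 4))  # 8960
```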
How does MTU affect GRE, VXLAN, or any other tunnelling?
- Keep in mind that the MTU is still 1500 bytes. Any protocol that adds its own header takes its share out of those 1500 bytes, which means the MSS shrinks.
- GRE/IPSec and MTU: https://www.cisco.com/c/en/us/support/docs/ip/generic-routing-encapsulation-gre/25885-pmtud-ipfrag.html
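To see how much the MSS shrinks, here is a rough sketch. It assumes IPv4 everywhere, a base GRE header of 4 bytes (no key or checksum fields), and VXLAN over IPv4/UDP carrying an untagged inner Ethernet frame; real deployments (IPSec in particular) add more overhead. `TUNNEL_OVERHEAD` and `tunnelled_mss` are hypothetical names.

```python
# Extra bytes each encapsulation takes out of the 1500-byte MTU (assumed values, see lead-in).
TUNNEL_OVERHEAD = {
    "none": 0,
    "gre": 20 + 4,             # outer IPv4 header + base GRE header
    "vxlan": 20 + 8 + 8 + 14,  # outer IPv4 + UDP + VXLAN + inner Ethernet header
}


def tunnelled_mss(mtu: int, tunnel: str) -> int:
    """Return the MSS left for TCP after the tunnel and the inner IPv4/TCP headers."""
    inner_mtu = mtu - TUNNEL_OVERHEAD[tunnel]
    return inner_mtu - 20 - 20  # inner IPv4 header and TCP header, no options


for name in TUNNEL_OVERHEAD:
    print(name, tunnelled_mss(1500, name))
```

With these assumptions GRE leaves an MSS of 1436 bytes and VXLAN leaves 1410 bytes.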
TCP and MTU:
TCP knows the Ethernet MTU of the local interface and calculates its MSS from it. This means that if we set the Ethernet MTU to 9000, TCP will calculate the MSS as 8960 bytes. In a datacenter we usually set the MTU to 9000 bytes, because we control all the devices in the DC and want the switches to carry the maximum data in each frame.
During the TCP 3-way handshake:
- The client calculates its MSS (say 1460 bytes) and sends it in the SYN packet. The client is telling the server: when you send me a segment, put at most 1460 bytes in it.
- The server does the same in its SYN/ACK and tells the client the largest segment it can accept.
- The MSS value is not negotiated between hosts; each side simply announces its own value.
- Both sides now know the peer's MSS and their own locally supported MSS. The sender chooses the minimum of the locally supported MSS and the one received from the receiver. Why?
- Let’s say the locally supported MSS of the sender is 500 bytes and the receiver can accept 1500 bytes.
- The sender cannot build segments larger than 500 bytes, so it chooses 500 bytes.
- A packet can still be fragmented (or dropped, if the DF bit is set) when a device between client and server uses a lower MTU.
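The "minimum of both values" rule from the handshake steps above fits in one line. A small sketch, with `effective_send_mss` as a hypothetical helper name:

```python
def effective_send_mss(local_mss: int, peer_advertised_mss: int) -> int:
    """Return the segment size a sender actually uses towards this peer."""
    return min(local_mss, peer_advertised_mss)


print(effective_send_mss(500, 1500))   # 500: limited by the sender itself
print(effective_send_mss(8960, 1460))  # 1460: limited by what the peer announced
```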
TCP header:
TCP 3-WAY HANDSHAKE
SYN:
- Let’s say an application A wants to communicate with another application on port 80 on a remote server and send/receive data. Application A asks the operating system to open the connection.
- The OS creates a socket (source-ip, destination-ip, source-port, destination-port)
- The OS creates a TCB (Transmission Control Block) in memory to store the state of the TCP connection.
- The TCB stores the following information:
- Socket
- Initial sequence number
- MSS and receive window size
- The OS then invokes TCP, which sends the SYN
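As an illustration of what such a TCB holds, here is a simplified sketch. A real kernel TCB has many more fields, and the class name, field names, default values and example addresses are all assumptions made for this example.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class TCB:
    """Toy Transmission Control Block with only the fields listed above."""
    # The socket: source and destination address/port.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    # Initial sequence number: a random 32-bit value.
    isn: int = field(default_factory=lambda: secrets.randbits(32))
    mss: int = 1460       # assumed: 1500-byte MTU, IPv4, no options
    rcv_wnd: int = 65535  # assumed receive window in bytes


tcb = TCB("192.0.2.10", 49152, "198.51.100.20", 80)
print(tcb)
```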
SYN details:
Source Port: Client's port
Destination Port: Server's port
Sequence Number: X (randomly chosen by the client)
Acknowledgment Number: 0 (not yet meaningful, as no data has been received from the server)
Data Offset: 5 when no options are present (the size of the TCP header in 32-bit words); larger when options such as MSS are included
Reserved: 0 (always)
Flags: 0x02 (SYN)
Window Size: Defined by the client (the number of bytes that the client is willing to accept)
Checksum: Calculated over a pseudo-header, the TCP header, and the data
Urgent Pointer: 0 (not used)
Options: May include Maximum Segment Size (MSS), Window Scale, Selective Acknowledgment (SACK) permitted, etc.
Padding: Added to ensure that the header ends on a 32-bit boundary
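The field list above can be packed into bytes to see how it looks on the wire. This is only a sketch: `build_syn` is a hypothetical helper, the checksum is left at 0 instead of being computed, and the only option included is MSS.

```python
import struct


def build_syn(src_port: int, dst_port: int, seq: int, window: int, mss: int) -> bytes:
    """Pack a TCP SYN header carrying a single MSS option (checksum left at 0)."""
    options = struct.pack("!BBH", 2, 4, mss)   # MSS option: kind=2, length=4, value
    header_len = 20 + len(options)             # 24 bytes, already 32-bit aligned
    data_offset = header_len // 4              # header size in 32-bit words
    offset_flags = (data_offset << 12) | 0x02  # reserved bits 0, only SYN set
    fixed = struct.pack(
        "!HHIIHHHH",
        src_port, dst_port,
        seq,           # sequence number X
        0,             # acknowledgment number, not meaningful yet
        offset_flags,
        window,
        0,             # checksum placeholder (not computed in this sketch)
        0,             # urgent pointer, not used
    )
    return fixed + options


syn = build_syn(49152, 80, 1000, 65535, 1460)
offset_flags = struct.unpack("!H", syn[12:14])[0]
print(len(syn), offset_flags >> 12, hex(offset_flags & 0x1FF))  # 24 6 0x2
```

Note that the Data Offset comes out as 6 here, not 5, because the MSS option adds one 32-bit word to the header.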
SYN/ACK:
Please note that the Ack number will be X+1. The SYN flag consumes one sequence number, as if TCP had added a phantom byte.
ACK:
TCP Acknowledgment numbers and Selective Ack(SACK)
The TCP acknowledgment number acknowledges the received segments.
Let’s say the server is sending data to the client. If the first data byte has sequence number X and the server sends 1460 bytes of data (this must be less than or equal to the MSS), the client replies with acknowledgment number X+1460, the sequence number of the next byte it expects.
Initially, when the sender sends segments, the TCP receiver sends an ack for each segment.
As the communication continues, the TCP receiver sends one ack for a group of segments. For example, for segments 1, 2, 3 -> the receiver sends only one ack, for segment 3.
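The acknowledgment arithmetic is simple enough to sketch. `ack_for` is a hypothetical helper; the modulo is there because sequence numbers are 32-bit and wrap around.

```python
def ack_for(seq: int, payload_len: int) -> int:
    """Return the ack number for a segment: the next byte the receiver expects."""
    return (seq + payload_len) % 2**32


print(ack_for(1000, 1460))  # 2460

# One ack for a group of three back-to-back segments: only the last one matters.
seq = 1000
for _ in range(3):
    seq = ack_for(seq, 1460)
print(seq)  # 5380
```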
Loss of segment:
Without SACK:
Suppose a sender sends 4 segments (segments 1, 2, 3, 4). What happens when a segment is lost?
- Initially, the TCP receiver acks each segment
- Suppose segment #2 is lost
- When the receiver gets segment #3, it sends a Duplicate Ack for segment #1
- When the receiver gets segment #4, it again sends a Duplicate Ack for segment #1
- The sender keeps all sent segments in a queue and waits for the retransmission timeout (RTO, which is calculated from the RTT).
- Once the RTO expires, the sender resends segment #2
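The duplicate-ack behaviour can be simulated with a toy receiver. This sketch assumes 100-byte segments with the first byte at sequence number 1, and that the receiver acks every arrival; `receiver_acks` is a hypothetical helper.

```python
SEGMENT_LEN = 100  # assumed segment size in bytes


def receiver_acks(arrivals):
    """Return the ack number sent after each arriving segment (1-based segment numbers)."""
    received = set()
    next_expected = 1  # sequence number of the next in-order byte
    acks = []
    for seg in arrivals:
        received.add(seg)
        # Advance over every segment that is now available in order.
        while (next_expected - 1) // SEGMENT_LEN + 1 in received:
            next_expected += SEGMENT_LEN
        acks.append(next_expected)
    return acks


print(receiver_acks([1, 3, 4]))     # [101, 101, 101]
print(receiver_acks([1, 3, 4, 2]))  # [101, 101, 101, 401]
```

The repeated 101 values are the Duplicate Acks. Once the retransmitted segment #2 arrives, the ack jumps straight to 401.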
With SACK:
- If segment #2 is lost and the receiver has received segments #3 and #4, the receiver still acks only up to segment #1.
- The receiver sends a duplicate Ack carrying a SACK option that acknowledges segments #3 and #4.
- Please note that after a loss the receiver sends a Duplicate Ack for each arriving segment. When the TCP session is stable, the receiver usually acks many segments together:
- Initially, the receiver sends an ack for each segment
- Later it usually sends one ack for a group of segments (for example every second or third segment, depending on the OS TCP algo)
seq(1)----segment 1----seq(101)
seq(101)---segment 2---seq(201)
seq(201)---segment 3---seq(301)
seq(301)---segment 4---seq(401)
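Using the four 100-byte segments above, here is a sketch of what the receiver reports when segment 2 is missing. `ack_with_sack` is a hypothetical helper; each SACK block is written as (left edge, right edge), where the right edge is the sequence number just after the last byte of the block.

```python
SEGMENT_LEN = 100  # matches the 100-byte segments in the diagram


def ack_with_sack(received_segments):
    """Return (cumulative_ack, sack_blocks) for a set of received segment numbers."""
    # Cumulative ack: count the segments received in order from segment 1.
    in_order = 0
    while in_order + 1 in received_segments:
        in_order += 1
    cumulative_ack = in_order * SEGMENT_LEN + 1

    # SACK blocks: contiguous runs of segments received beyond the gap.
    blocks = []
    for seg in sorted(s for s in received_segments if s > in_order):
        left = (seg - 1) * SEGMENT_LEN + 1
        right = seg * SEGMENT_LEN + 1
        if blocks and blocks[-1][1] == left:
            blocks[-1] = (blocks[-1][0], right)  # extend the previous block
        else:
            blocks.append((left, right))
    return cumulative_ack, blocks


print(ack_with_sack({1, 3, 4}))  # (101, [(201, 401)])
```

The ack number stays at 101, while the SACK block 201-401 tells the sender that only segment 2 needs to be resent.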
TCP Congestion Control
RFC: https://datatracker.ietf.org/doc/html/rfc5681
We have 4 algorithms which work together for TCP congestion control:
1) Slow start and congestion avoidance:
The congestion window (cwnd) is a sender-side limit on the amount of data the sender can transmit into the network before receiving an acknowledgment (ACK), while the receiver’s advertised window (rwnd) is a receiver-side limit on the amount of outstanding data. The minimum of cwnd and rwnd governs data transmission.
Beginning transmission into a network with unknown conditions requires TCP to slowly probe the network to determine the available capacity, in order to avoid congesting the network with an inappropriately large burst of data. The slow start algorithm is used for this purpose at the beginning of a transfer, or after repairing loss detected by the retransmission timer.
https://www.youtube.com/watch?v=IRXP1vJ6-vM
Basically:
- TCP starts slow and sends 2, 4, 8 segments as data in flight
- Data in flight is defined as data which has been sent but not yet acknowledged.
- TCP tries to exponentially increase the data in flight as it gets Acks for the previously sent segments. TCP tracks this in the congestion window.
- TCP won’t send more data than the receive window allows.
- The minimum of the receive window and the congestion window determines how much data is sent
- TCP grows the congestion window exponentially until it reaches the slow start threshold (SSThresh).
- The slow start threshold (SSThresh) is determined internally by the TCP stack
- After reaching SSThresh, TCP increases the data in flight very slowly (roughly one segment per round trip; this is congestion avoidance).
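The growth pattern in the list above can be sketched per round trip. This is a simplification: real stacks grow cwnd per received Ack and count in bytes, while this toy model counts in segments, assumes no loss, and caps slow start exactly at SSThresh. The function name and the default values are assumptions.

```python
def cwnd_growth(rounds, initial_cwnd=2, ssthresh=16, rwnd=64):
    """Return the segments in flight for each round trip, assuming no loss."""
    cwnd = initial_cwnd
    history = []
    for _ in range(rounds):
        # The sender is limited by the smaller of cwnd and the receive window.
        history.append(min(cwnd, rwnd))
        if cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)  # slow start: double every round trip
        else:
            cwnd += 1                       # congestion avoidance: +1 segment per round trip
    return history


print(cwnd_growth(8))  # [2, 4, 8, 16, 17, 18, 19, 20]
```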
What if the network cannot carry that throughput and segments are dropped?
2) Fast re-transmit and fast recovery:
Let’s say a few of the segments in flight are lost.
Fewer than 3 Duplicate Acks:
- The sender keeps a send buffer of the segments which were sent
- If the sender receives Duplicate Acks from the receiver, it knows some segments may have been lost in flight
- The sender waits until the RTO (re-transmission timeout) expires and then resends the lost segment
- Once it gets the ack for the lost segments, the sender removes those segments from the send buffer
- The sender moves back to slow start and reduces the congestion window.
- Keep in mind that once a segment has been lost, the receiver acks every segment that arrives.
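3 or more Duplicate Acks:
- If the sender gets 3 duplicate Acks for the lost segment
- the sender kicks off the fast re-transmit algo
- the sender won’t wait for the RTO and resends the missing segment immediately
- the sender reduces the congestion window and enters fast recovery. Under RFC 5681 it does not go back to slow start in this case: it sets SSThresh to about half the data in flight and continues in congestion avoidance once the loss is repaired.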
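The two loss reactions can be compared side by side. This sketch follows the window adjustments described in RFC 5681, but counts in segments rather than bytes; the function names are hypothetical.

```python
def on_rto(flight_size, smss=1):
    """Timeout: halve ssthresh and restart slow start from one segment."""
    ssthresh = max(flight_size // 2, 2 * smss)
    cwnd = smss
    return cwnd, ssthresh


def on_third_dup_ack(flight_size, smss=1):
    """Fast re-transmit: halve ssthresh, inflate cwnd by the 3 segments that left the network."""
    ssthresh = max(flight_size // 2, 2 * smss)
    cwnd = ssthresh + 3 * smss
    return cwnd, ssthresh


def on_recovery_ack(ssthresh):
    """Ack for new data ends fast recovery: deflate cwnd back to ssthresh."""
    return ssthresh


print(on_rto(20))            # (1, 10)
print(on_third_dup_ack(20))  # (13, 10)
print(on_recovery_ack(10))   # 10
```

With 20 segments in flight, a timeout drops the window to 1 segment, while fast recovery continues from 10.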
TCP Zero Window:
The receive window is the buffer maintained by the receiver. When the sender sends segments, the receiver acks them and advertises how much space is left in its receive buffer. It is possible that the upper-layer application is not reading data out of the receive buffer. In that case the buffer keeps filling up, and eventually the receiver advertises a receive window of 0 in its ack, which tells the sender not to send any more data.
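A toy receive buffer shows how the advertised window reaches zero and opens again. The class name, method names and the 4000-byte capacity are assumptions made for this sketch.

```python
class ReceiveBuffer:
    """Toy receive buffer that reports the window it would advertise."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def on_segment(self, length):
        """Store an arriving segment and return the window advertised in the ack."""
        accepted = min(length, self.capacity - self.used)
        self.used += accepted
        return self.capacity - self.used

    def app_read(self, length):
        """The application reads data, freeing space; return the new window."""
        read = min(length, self.used)
        self.used -= read
        return self.capacity - self.used


buf = ReceiveBuffer(capacity=4000)
print([buf.on_segment(1000) for _ in range(4)])  # [3000, 2000, 1000, 0]
print(buf.app_read(1500))                        # 1500
```

The window hits 0 after four unread 1000-byte segments and reopens only when the application reads from the buffer.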