I know we have some graybeards on here who might be able to help me narrow it down. You're my last hope to try to figure out this bug that's been driving me nuts. Linux support forums and bug trackers have failed me.
It seems to me that there is a serious and fairly widespread bug that's mostly gone unnoticed. It affects Ethernet performance and nothing else. Seems like this is the only other guy who's noticed what's going on:
http://ubuntuforums.org/showthread.php?t=2245203
I'm not sure if it's distro-specific, but so far only Ubuntu and Mint seem to be affected.
Background info, here are the computers involved:
- Gaming PC (Win7, 64-bit) - unaffected, of course
- Server (Xubuntu 14.04, kernel 3.13, 32-bit) - unaffected
- HTPC (Mint 17.1, kernel 3.13, 32-bit) - affected
- Laptop (Mint 17.2, kernel 3.16, 64-bit) - affected
The laptop didn't have this problem a few weeks ago, back when it was still running some ancient Ubuntu LTS with a 2.6-series 64-bit kernel. I only recently got a wired connection to the HTPC and noticed the problem there.
All are attached to a gigabit LAN with a very diverse set of adapters of different models and manufacturers. The only thing the affected computers have in common is the distro. Speeds were confirmed using iperf between all Linux computers, to rule out any service-specific problems. For the Windows PC I simply pulled a big file from the Samba share on the server.
Between the gaming PC and the server I can move files as fast as the hard drives will allow, at over 80 Mbyte/s. But between the server and the laptop or HTPC, transfers are under 10 Mbit/s (~1.2 Mbyte/s). Negotiated link speed is a full gigabit in all cases. Where do I start in trying to narrow down what's causing this?
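For reference, the iperf runs were nothing fancy -- just the stock iperf 2 server/client invocations, roughly like this (192.168.0.10 is only a placeholder for the server's LAN address):

  # on the receiving machine
  iperf -s
  # on the sending machine, run a 20-second test
  iperf -c 192.168.0.10 -t 20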
Only idea I have right now is to retest at the lowest runlevel possible.
Edit: Turns out Mint doesn't really do runlevels. Runlevel 2 acts like 5, and runlevel 1 is supposed to be a minimal single-user mode but never actually gives me a terminal.
codrus
Dork
8/16/15 11:28 a.m.
No beard here, but I've been writing software for Ethernet switches for a couple of decades now, so I know a few things. :)
What is the network topology? Are they all plugged into the same switch, or are there additional switches in the network? What kind of switch are you using?
Have you tried swapping ports and cables around? Just because it auto-negotiates 1G doesn't mean it will actually sustain traffic at that speed. Are these all the built-in NICs on the motherboards? It could theoretically be a problem with the NIC hardware. Have you looked at the port statistics to see if they show retransmissions or other network errors? You could also try forcing the links in question to 100. That won't get you the 640 Mbps you're hoping for, but if running at that speed helps, it tends to point to a hardware problem rather than a software one.
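If you want to try the 100 Mbit experiment, something along these lines should do it, assuming ethtool is installed and eth0 stands in for the right interface name:

  # force 100 Mbit full duplex
  sudo ethtool -s eth0 speed 100 duplex full autoneg off
  # put it back to auto-negotiation afterwards
  sudo ethtool -s eth0 autoneg on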
If the hardware checks out, then the next step is to run tcpdump and look at the packet trace. If it's a switched network with a cheap "dumb" switch, you'll have to run it on one of the two endpoints, because the switch is only going to forward the traffic out of those two ports. (A smarter switch supports "SPAN" or "sniffer" ports, where it can be configured to copy traffic between certain ports out another port for debugging problems exactly like this.) I dunno how familiar you are with TCP, but you can learn a lot by looking at the acks coming back from the receiver.
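A live capture on one endpoint is just something like the following -- eth0 and the peer address are placeholders, and port 5001 assumes you're watching an iperf test:

  # -nn skips name/port lookups so the output is easier to read
  sudo tcpdump -i eth0 -nn host 192.168.0.20 and tcp port 5001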
peter
Dork
8/16/15 11:49 a.m.
Run netstat in one terminal window while you're doing the file transfer. That can help figure out where the bottleneck is - if you've got a big Recv-Q, your problem is getting the data to disk. I'd do this before bothering with tcpdump; it's fast and easily eliminates what I think is the most likely culprit.
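Something like this will keep an eye on the queues once a second while the transfer runs (the address is just a placeholder for the other machine):

  watch -n 1 'netstat -tn | grep 192.168.0.20'
  # the Recv-Q and Send-Q columns are the ones to watch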
If your input queues are mostly empty, then I second swapping around cables and ports, followed by the tcpdump gymnastics.
(Also a no-beard. The days of Linux knowledge being confined to the Richard Stallman types are over)
Good ideas. The network is made up of a couple of 4-port gigabit home routers. I know I can get a good working gigabit link from any of the points I've been using. All the computers are using onboard adapters. No collisions or errors in ifconfig. I'll try everything suggested and report back.
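For the record, this is roughly what I checked -- eth0 stands in for whatever each box calls its adapter:

  ifconfig eth0 | grep -E 'errors|dropped|collisions'
  # per-driver counters too, where the NIC driver supports them:
  sudo ethtool -S eth0 | grep -iE 'err|drop'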
Edit: Literal beards not required BTW
OK, I've been running tests between the HTPC and the server. Queues for the iperf connection on the HTPC (listening side) are zero; on the sending side, the receive queue is zero and the send queue is in the six digits (which is normal, right?). Forcing the link speed on the HTPC to 100 Mbit/s didn't change anything. Just tcpdump and hardware swapping left to try now.
Here's a bit of the tcpdump output, filtered by IP for traffic between the server and HTPC:
http://pastebin.com/T3xrXFxf
Whoa, breakthrough! Well, it's not all good, because one of my costly gigabit routers is probably bad, but at least this is making some sense now. Just because you guys both suggested it, I plugged my laptop into the port the gaming PC uses and ran a test, and BOOM, full speed. The tests I'd been running before with the HTPC and laptop both went through another router that the gaming PC doesn't connect through. When I upgraded this laptop a few weeks ago, I used that same connection to back it up at full gigabit speed, so something's gone wrong since then.
Gonna try to test the HTPC through the same connection the gaming PC uses and see what happens.
GameboyRMH wrote:
Here's a bit of the tcpdump output, filtered by IP for traffic between the server and HTPC:
http://pastebin.com/T3xrXFxf
Is that the beginning of the data? The end? Something in the middle? The window size looks suspiciously small, but this is a dynamic value so it's hard to say with just some packets out of the middle.
You really need to capture an entire TCP session. I'd recommend using a file transfer that takes at least 15-20 seconds to complete to give it time to stabilize, and try to minimize the other data flowing between those two hosts to cut down on confusion.
TCP interprets packet loss as indicating network congestion, and it backs down the transfer rate to avoid clogging things up. One unintentional side effect of this is that a data link with what seems like a very low packet drop rate can have a very low useful data throughput rate. 1-2% packet drops will significantly affect your data rate.
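As a rough back-of-the-envelope (the classic Mathis-style estimate, throughput roughly MSS / (RTT * sqrt(loss rate)), assuming a typical 1460-byte MSS and about a 1 ms LAN round trip): 1% loss caps you somewhere around 115 Mbit/s and 2% loss around 80 Mbit/s, no matter how fast the link itself is.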
To look for packet drops, you need to look at sequence numbers. TCP uses sequence numbers to indicate the order packets go in; if you look at the tcpdump output, you'll see "seq A:B" in the packets going to the receiver. A and B indicate the beginning and end sequence numbers of the data in that packet (the bytes are numbered, not the packets, so that you can send packets of different sizes). Packets also have an "ack C" number in them, which is an acknowledgement indicating that the sender of that packet has successfully received the data up through C flowing in the other direction. Looking for this in tcpdump output can be somewhat tedious; there are other tools out there that can automate this to a degree, such as Wireshark.
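If you'd rather not eyeball it, one approach is to capture the whole session to a file and let Wireshark's analysis filters count the retransmissions for you -- tshark is Wireshark's command-line half, and eth0, the address, and the filename here are all placeholders:

  sudo tcpdump -i eth0 -s 0 -w session.pcap host 192.168.0.20 and tcp port 5001
  # after the transfer finishes:
  tshark -r session.pcap -Y tcp.analysis.retransmission | wc -l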
[edit I see where you found a bad port -- no need to do the packet trace if that pans out] :)
More info on the routers since it looks like a network equipment problem: Two Buffalo G300NH's, the "primary" one running OpenWRT, the "secondary" one (which is now giving trouble) running DD-WRT because it was getting the job done.
Oddly, DD-WRT comes with iperf while it's not available for OpenWRT (Edit: Whoops, yes it is -- I just had to update the package list). Just ran a test between the server and the troublesome router and the speed is below 10 Mbit/s again.
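For anyone else on OpenWRT, getting iperf onto the router was just the usual package steps over SSH, roughly:

  opkg update
  opkg install iperf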
Edit2: iperf between the two routers gives the same <10 Mbit/s speed; gonna try to find out the negotiated link speed on both ends.
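In case it's useful to anyone, checking negotiated speed on the PCs is just ethtool; on the OpenWRT router the switch ports go through the switch driver instead, so treat this as a rough sketch (device names vary by build):

  # on a normal Linux box
  sudo ethtool eth0 | grep Speed
  # on OpenWRT, per-port link status (if the swconfig tool is present)
  swconfig dev switch0 show | grep link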
Plugged in my laptop in place of the secondary router and ran an iperf test to the primary. Got a speed of 136 Mbit/s, which is only possible with a gigabit link but is still about 1/6th of what it should be, so that suggests it could be a bad cable.
BUT direct connections from both the HTPC and the laptop to the secondary router are under 10 Mbit/s. What are the odds of the cable and the router both going bad?
Whoops I've been working so fast I missed a reply. That tcpdump was from the middle of a capture.
I've narrowed it down to the secondary router; it's causing all of this. I plugged my laptop in place of it, ran an iperf test to the server, and got full speed. That means the cable is good -- the routers just aren't powerful enough to max out a gigabit link themselves in an iperf test.
The router broke at exactly the right time to make me think there was some problem with Mint 17.
I'm gonna first try restoring settings on the router from a backup, then if that doesn't work I'll flash it over to OpenWRT and see if that makes a difference before I replace it.
Did you make any changes to the router config lately? It's unlikely that it's physically broken (digital solid-state devices don't tend to break, and when they do they're usually completely dead rather than just degraded), but it's possible that if you turned on some additional features, that could wind up sending all of the packets down the slow forwarding path.
Didn't make any settings changes, but it turned out to be a settings problem of some kind. I found that the router was doing some bizarre stuff, like not letting me log into the web interface with the same credentials I could SSH in with. Turned out I didn't have a backup of the settings, but I did a full reset (had to erase the nvram from the terminal) and put the settings back manually, and it looks like it's working properly now -- I have full speed to the HTPC.
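For anyone who finds this thread later: the reset-from-the-terminal part was roughly the standard DD-WRT nvram wipe over SSH -- exact syntax varies between builds (some want 'nvram erase' instead), so double-check for your version first:

  erase nvram
  reboot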