
RAFT orderers SERVICE_UNAVAILABLE


Details

  • Type: Bug
  • Fix Version/s: v2.0.0
  • Component/s: fabric-orderer
  • Labels:
  • Steps to Reproduce:

      1. Create a blockchain network with 3 raft orderers
      2. We created 100 channels, but only used 3 for the traffic: testorgchannel1-testorgchannel3. All channels have the same 3 orgs, 1 peer per org.
      3. Using 3*m (m=1 … 10) threads per channel with 3 channels, the total thread count is 9*m. Each thread targets one orderer in round-robin fashion, i.e., thread 1 to orderer1, thread 2 to orderer2, etc. In our example we used PTE in Constant Mode (each thread sends the next transaction as soon as the TX response ACK for the previous one is received; see the sketch after these steps). We increased the number of channels. Each thread sends 10,000 transactions.
      batchsize = 500; batchtimeout = 2s; SnapshotInterval = 100 MB
      Two orderers show "SERVICE_UNAVAILABLE" after sending 3000 transactions in the case with 3 channels, 6 threads each (total 18 threads).
      Note that the test with 3 channels, 3 threads each (total 9 threads) succeeds.
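
      For illustration, here is a minimal Go sketch of the step-3 traffic pattern (not PTE itself, which is a Node.js tool; the orderer addresses and the `sendAndWaitForAck` helper are hypothetical stand-ins): each worker thread is pinned to one orderer round-robin, and in Constant Mode it sends the next transaction as soon as the previous ACK arrives.

      // Illustrative only: models the per-channel traffic pattern from step 3, not the actual PTE client.
      package main

      import "fmt"

      // sendAndWaitForAck is a hypothetical stand-in for broadcasting one transaction
      // to the given orderer and blocking until its ACK (or NACK) comes back.
      func sendAndWaitForAck(orderer string, txNum int) error {
          // ... broadcast the transaction and wait for the BroadcastResponse ...
          return nil
      }

      func worker(id, txCount int, orderers []string) {
          target := orderers[id%len(orderers)] // round-robin: thread 1 -> orderer1, thread 2 -> orderer2, ...
          for tx := 0; tx < txCount; tx++ {
              // Constant Mode: the next transaction goes out as soon as the previous ACK is received.
              if err := sendAndWaitForAck(target, tx); err != nil {
                  fmt.Printf("worker %d: tx %d failed: %v\n", id, tx, err)
              }
          }
      }

      func main() {
          orderers := []string{"orderer1:7050", "orderer2:7050", "orderer3:7050"} // placeholder addresses
          const threadsPerChannel = 6 // the failing case: 3 channels x 6 threads = 18 threads total
          done := make(chan struct{})
          for id := 0; id < threadsPerChannel; id++ {
              go func(id int) { worker(id, 10000, orderers); done <- struct{}{} }(id)
          }
          for i := 0; i < threadsPerChannel; i++ {
              <-done
          }
      }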

    Description

    The orderers report no Raft leader; see the attached orderer logs (orderer2nd-ordererorg-7.log and orderer3rd-ordererorg-7.log) and the PTE log (FAB-13760-ch3-th2-move-0213170638.log).

    2019-02-13 22:09:29.990 UTC [policies] Evaluate -> DEBU 4d724d == Done Evaluating *policies.implicitMetaPolicy Policy /Channel/Application/Writers
    2019-02-13 22:09:29.990 UTC [policies] Evaluate -> DEBU 4d724e Signature set satisfies policy /Channel/Writers
    2019-02-13 22:09:29.990 UTC [policies] Evaluate -> DEBU 4d724f == Done Evaluating *policies.implicitMetaPolicy Policy /Channel/Writers
    2019-02-13 22:09:29.990 UTC [orderer.common.broadcast] ProcessMessage -> WARN 4d7250 [channel: testorgschannel3] Rejecting broadcast of normal message from 10.188.208.4:13293 with SERVICE_UNAVAILABLE: rejected by Order: no Raft leader
    2019-02-13 22:09:29.990 UTC [orderer.common.server] func1 -> DEBU 4d7251 Closing Broadcast stream
    2019-02-13 22:09:29.990 UTC [comm.grpc.server] 1 -> INFO 4d7252 streaming call completed grpc.service=orderer.AtomicBroadcast grpc.method=Broadcast grpc.peer_address=10.188.208.4:13293 grpc.code=OK grpc.call_duration=1.111798ms
    2019-02-13 22:09:29.992 UTC [orderer.consensus.etcdraft] consensusSent -> DEBU 4d7253 Sending msg of 28 bytes to 1 on channel testorgschannel15 took 3.666µs
    2019-02-13 22:09:29.992 UTC [orderer.consensus.etcdraft] consensusSent -> DEBU 4d7254 Sending msg of 28 bytes to 3 on channel testorgschannel15 took 3.106µs

    PTE log shows:
    info: [PTE 2 main]: stdout: info: [PTE 2 exec][1550095769214 Nid:chan:org:id=0:testorgschannel3:org2:0 invoke_move_dist_evtBlock] failed to sendTransaction status:

    {"status":"SERVICE_UNAVAILABLE","info":"no Raft leader"}

    info: [PTE 2 main]: stdout: info: [PTE 2 exec][1550095769227 Nid:chan:org:id=0:testorgschannel3:org2:1 invoke_move_dist_evtBlock] failed to sendTransaction status:

    {"status":"SERVICE_UNAVAILABLE","info":"no Raft leader"}

    Issue Links

    Activity

    Scott Zwierzynski added a comment – edited

    These logs were also seen.

    2019-02-13 22:09:31.551014 W | wal: sync duration of 2.087949482s, expected less than 1s
    2019-02-13 22:09:31.552566 W | wal: sync duration of 7.750080345s, expected less than 1s
    2019-02-13 22:09:31.558517 W | wal: sync duration of 7.743392857s, expected less than 1s

    I understand that any orderer in a channel would start a new election term if it does not receive any heartbeats from the leader of the channel. With transactions in flight, an orderer leader change can cause transactions to be dropped, and also other transactions to be NACKed.
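
    For reference, a minimal Go sketch of what that NACK looks like from the client side (assuming the fabric-protos-go bindings and an already-built, signed *common.Envelope; addresses, TLS handling, and the back-off are illustrative only): the orderer answers the Broadcast call with status SERVICE_UNAVAILABLE and info "no Raft leader", and the sender can fail over to another orderer while a new leader is elected.

    // Sketch only: client-side view of the SERVICE_UNAVAILABLE rejection described above.
    package ordererclient

    import (
        "context"
        "fmt"
        "log"
        "time"

        cb "github.com/hyperledger/fabric-protos-go/common"
        ab "github.com/hyperledger/fabric-protos-go/orderer"
        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
    )

    // broadcastOnce sends one signed envelope to a single orderer and returns its response.
    func broadcastOnce(ctx context.Context, addr string, env *cb.Envelope) (*ab.BroadcastResponse, error) {
        conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials())) // TLS omitted for brevity
        if err != nil {
            return nil, err
        }
        defer conn.Close()

        stream, err := ab.NewAtomicBroadcastClient(conn).Broadcast(ctx)
        if err != nil {
            return nil, err
        }
        if err := stream.Send(env); err != nil {
            return nil, err
        }
        return stream.Recv()
    }

    // broadcastWithFailover moves on to the next orderer whenever the current one answers
    // SERVICE_UNAVAILABLE ("no Raft leader"), i.e. the rejection shown in the PTE log.
    func broadcastWithFailover(ctx context.Context, env *cb.Envelope, addrs []string) error {
        var lastErr error
        for _, addr := range addrs {
            resp, err := broadcastOnce(ctx, addr, env)
            if err != nil {
                lastErr = err
                continue
            }
            switch resp.Status {
            case cb.Status_SUCCESS:
                return nil
            case cb.Status_SERVICE_UNAVAILABLE:
                log.Printf("%s rejected tx: %s (%s)", addr, resp.Status, resp.Info)
                lastErr = fmt.Errorf("%s: %s", addr, resp.Info)
                time.Sleep(500 * time.Millisecond) // give the channel a moment to elect a new leader
            default:
                return fmt.Errorf("broadcast to %s failed: %s (%s)", addr, resp.Status, resp.Info)
            }
        }
        return lastErr
    }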

    We have some questions. Please help us document what happened:

    1. Why was a leader election triggered at all? What caused the delays? Jay mentioned that we were waiting for disk IO to complete. Can we dig any deeper?
    2. Why was a leader election triggered in only one of the channels?
    3. How can an operator tune their network configuration, node hardware, or environment variables to prevent this from happening as much as possible?
    Jay Guo added a comment – 

    1. Why was a leader election triggered at all? What caused the delays? Jay mentioned that we were waiting for disk IO to complete. Can we dig any deeper?

    On Linux, it is this call (etcd's WAL syncs each log segment with fdatasync, and Raft cannot make progress on that channel until it returns):

    syscall.Fdatasync(int(f.Fd()))
    

    2. Why was a leader election triggered in only one of the channels?

    Because channels are independent of each other, a delayed file sync on the leader of channel A does not affect the leaders of other channels.

    3. How can an operator tune their network configuration, node hardware, or environment variables to prevent this from happening as much as possible?

    • perhaps find out WAL sync performance by running the wal benchmark test (a rough standalone probe is sketched below)
    • set `ElectionTimeout` accordingly
    • trace syscalls to spot the delay
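
    As a rough starting point for the first suggestion, a standalone Go sketch (Linux only; the file path, write size, and iteration count are arbitrary assumptions, and this only approximates the etcd wal benchmark) that measures fdatasync latency on the disk behind the orderer's WAL directory:

    // Crude fdatasync latency probe for the volume that holds the orderer WAL (Linux only).
    package main

    import (
        "fmt"
        "os"
        "syscall"
        "time"
    )

    func main() {
        // Assumption: replace with a path on the same volume as the orderer's etcdraft WAL directory.
        f, err := os.OpenFile("/var/hyperledger/production/orderer/etcdraft/wal/synctest.tmp",
            os.O_CREATE|os.O_WRONLY, 0o600)
        if err != nil {
            panic(err)
        }
        defer os.Remove(f.Name())
        defer f.Close()

        buf := make([]byte, 8*1024) // 8 KiB per append; roughly a small batch of WAL entries
        var worst time.Duration
        for i := 0; i < 100; i++ {
            if _, err := f.Write(buf); err != nil {
                panic(err)
            }
            start := time.Now()
            // The same syscall the WAL uses on Linux; etcd warns when this takes more than 1s.
            if err := syscall.Fdatasync(int(f.Fd())); err != nil {
                panic(err)
            }
            if d := time.Since(start); d > worst {
                worst = d
            }
        }
        fmt.Printf("worst fdatasync latency over 100 writes: %v\n", worst)
    }
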
    Scott Zwierzynski added a comment – 

    Answers make sense. However, upon further analysis, we noticed that something affected transactions in two (and maybe all 3) channels at the same time. It seems we should be looking for the root cause of a specific interruption that caused disk write delays (and lost transactions) across all the channels. This has happened twice out of many high-volume traffic tests.

    We clarified the timeline and added our detailed observations in the attachment `lost_transactions.txt`.

    Scott Zwierzynski added a comment – 

    We are rerunning the testcase, FAB-13760, tonight with logs enabled as discussed directly with the developers; if the problem occurs, we should have data available for review in about 16 hours.

    Dongming Hwang added a comment – 

    Re-ran the same test twice with the latest image 5937ae4 from 2/18/19 with logging level
    FABRIC_LOGGING_SPEC=info:orderer.consensus.etcdraft=debug:orderer.common.cluster=debug.
    The issue was not reproduced, though it was observed three times with the same test in the past. However, TPS degraded by about 1/4 (from 496 to 369) compared to the orderer image 55869bc59 from Jay on 2/12/19. We do not know whether this is caused by the debug logging or by something else. We will run with logging level "info" today to see if the logging level impacts the TPS.

    Scott Zwierzynski added a comment – 

    Using INFO logs instead of DEBUG logs with yesterday's images did not get the TPS back up to the rates we were seeing last week with Jay's private images. It looks like the additional time being consumed shows up as increased endorsement response time as well as increased broadcast transaction response time. Investigation is ongoing; we are considering network latency, the test client software, and the test machine as possible causes, since each would impact both of those.

    Kostas Christidis added a comment – 

    The observation of TPS going down is useful. Let's keep this in the back of our minds as/when we do throughput tests.

    The suspicion behind this one is a slow disk. If WAL syncing takes more than 5 seconds (the election timeout here: ElectionTick 10 x TickInterval 500ms), the FSM is blocked, hence the leader re-election.

    If we want to confirm this suspicion, we'll need to re-run and take note of all the warning-level logs that pop up whenever WAL syncing takes more than 1 second.
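
    If it helps with that re-run, here is a small Go sketch (the log file name is just one of the attachments above; the regex matches the "wal: sync duration of ..." warning format quoted earlier) that pulls every slow-sync warning out of an orderer log and flags the ones long enough to exhaust the 5-second election timeout:

    // Scans an orderer log for etcd WAL slow-sync warnings such as:
    //   2019-02-13 22:09:31.551014 W | wal: sync duration of 7.750080345s, expected less than 1s
    package main

    import (
        "bufio"
        "fmt"
        "os"
        "regexp"
        "time"
    )

    func main() {
        f, err := os.Open("orderer2nd-ordererorg-7.log") // any orderer log file
        if err != nil {
            panic(err)
        }
        defer f.Close()

        re := regexp.MustCompile(`wal: sync duration of ([^,]+), expected less than 1s`)
        electionTimeout := 5 * time.Second // ElectionTick (10) x TickInterval (500ms) in this test

        scanner := bufio.NewScanner(f)
        scanner.Buffer(make([]byte, 1024*1024), 1024*1024) // allow for long DEBUG lines
        for scanner.Scan() {
            line := scanner.Text()
            m := re.FindStringSubmatch(line)
            if m == nil {
                continue
            }
            d, err := time.ParseDuration(m[1])
            if err != nil {
                continue
            }
            if d > electionTimeout {
                fmt.Printf("sync slow enough to trigger re-election (%v): %s\n", d, line)
            } else {
                fmt.Printf("slow sync (%v): %s\n", d, line)
            }
        }
        if err := scanner.Err(); err != nil {
            panic(err)
        }
    }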

    Jason Yellick added a comment – 

    Any updates on this?

    Scott Zwierzynski added a comment – 

    Yes, in past runs we have seen WAL sync warnings (delays over 5 secs) whenever leadership changes occurred; both the warnings and the leadership changes disappeared when we increased our hosts' IOPS per GB from 2 to 10. Therefore we agree this is the likely cause of the leadership changes (and it is consistent with the analysis provided in the attachment lost-transactions.txt), even though it is difficult to obtain comprehensive and conclusive data showing that it is the only cause, or that those warning logs occurred every time. We will hold this bug open for now and keep an eye open during other testing.

    Scott Zwierzynski added a comment – edited

    Regarding suggestions from Jay Guo:

    •  perhaps find out wal sync performance by running wal benchmark test  <-- Dongming Hwang: let's try this in hosts with 2 IOPS and again in hosts with 10 IOPS
    •  set `ElectionTimeout` accordingly  <-- Do you mean increasing ElectionTick (10) and/or TickInterval (500ms) to allow more than 5 secs before a follower times out waiting for heartbeats and calls for a new election? (The arithmetic is sketched after this list.) OK, we previously discussed this and think it might help the network accommodate these occasional delays. However, we would prefer to look for another solution, because it comes with the drawback that it would take longer for the network to recognize a true outage of the channel leader orderer.
    • trace syscall to spot delay <-- please provide more specifics of what/when/how to do this
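
    To make the trade-off in the second bullet concrete, a small sketch of the arithmetic (the 10 and 500ms values are the ones quoted above; raising either raises both the tolerance for slow fsyncs and the time needed to notice a genuinely dead leader):

    // Back-of-the-envelope: how long a follower waits for heartbeats before calling an election.
    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        tickInterval := 500 * time.Millisecond // TickInterval from this test's channel config
        electionTick := 10                     // ElectionTick from this test's channel config

        electionTimeout := time.Duration(electionTick) * tickInterval
        fmt.Printf("election timeout = %d x %v = %v\n", electionTick, tickInterval, electionTimeout) // 5s

        // A WAL fdatasync stalling longer than this blocks the leader long enough for
        // followers to time out and start a new term, which surfaces to clients as the
        // SERVICE_UNAVAILABLE "no Raft leader" rejections above.
        fmt.Printf("doubling ElectionTick would raise the timeout to %v\n", 2*electionTimeout)
    }
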
    Dongming Hwang added a comment – 

    We tested with IOPS=10 and did not see the issue.

    Dongming Hwang added a comment – 

    Closing the issue.

