HBase: summary of common errors and tuning notes

Reference links:

1. Installation

2. Problems

2.1 A standalone HBase deployment fails to start; hbase.log shows:

2021-05-14 11:10:18,086 ERROR [main] master.HMasterCommandLine: Master exiting
java.lang.RuntimeException: Master not initialized after 200000ms
at org.apache.hadoop.hbase.util.JVMClusterUtil.waitForEvent(JVMClusterUtil.java:232)
at org.apache.hadoop.hbase.util.JVMClusterUtil.startup(JVMClusterUtil.java:200)
at org.apache.hadoop.hbase.LocalHBaseCluster.startup(LocalHBaseCluster.java:430)
at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:249)
at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:149)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:3085)

A Baidu search suggests that the HBCK2 tool can repair this kind of problem.

Official guide: https://hbase.apache.org/book.html#HBCK2

Step 1 Clone the repository

GitHub repository:

apache/hbase-operator-tools: Apache HBase Operator Tools (github.com)

Step 2 Build hbase-operator-tools:

cd \hbase-hbck2\
mvn package -DskipTests

Step 3 Copy hbase-hbck2-1.2.0-SNAPSHOT.jar from the target directory
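Steps 1 to 3 can be sketched as one shell function. This is a minimal sketch, not a tested procedure: build_hbck2 is a hypothetical helper name, the clone URL is the public apache/hbase-operator-tools repository, and the default destination directory is assumed from the jar path used in the repair command in this section.

```shell
# Sketch of Steps 1-3: clone, build the HBCK2 module, copy the jar.
# build_hbck2 is a hypothetical helper name, not part of HBase.
build_hbck2() {
    # Destination directory; the default is assumed from the jar path
    # used in the repair command of this section.
    dest=${1:-/opt/hbase-data/hbase/yyp}

    git clone https://github.com/apache/hbase-operator-tools.git || return 1

    # Build only the hbase-hbck2 module and skip the tests.
    (cd hbase-operator-tools/hbase-hbck2 && mvn package -DskipTests) || return 1

    # The jar version depends on the checkout (1.2.0-SNAPSHOT in these notes).
    cp hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-*.jar "$dest"/
}

# Usage:
#   build_hbck2                 # copy to the default directory
#   build_hbck2 /some/other/dir # copy elsewhere
```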

Run the following command to recover from the unknown state:

/opt/hbase-2.3.5/bin/hbase hbck -j /opt/hbase-data/hbase/yyp/hbase-hbck2-1.2.0-SNAPSHOT.jar recoverUnknown

Tip: run the command without recoverUnknown to list all available repair commands.
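Since the same long command line is reused for every HBCK2 sub-command, a small wrapper can save typing. The sketch below assumes the install and jar paths from this section; hbck2 is a hypothetical function name, not something HBase provides.

```shell
# Sketch: wrap the HBCK2 invocation so the jar path is written only once.
# Both defaults are the paths used in this section; override them as needed.
HBASE_BIN="${HBASE_BIN:-/opt/hbase-2.3.5/bin/hbase}"
HBCK2_JAR="${HBCK2_JAR:-/opt/hbase-data/hbase/yyp/hbase-hbck2-1.2.0-SNAPSHOT.jar}"

# hbck2 is a hypothetical helper name. All arguments are passed through.
hbck2() {
    "$HBASE_BIN" hbck -j "$HBCK2_JAR" "$@"
}

# Usage:
#   hbck2                  # no sub-command: lists the available commands
#   hbck2 recoverUnknown   # the repair used in this section
```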

Step 4 Restart HBase to complete the repair.

Problem 2: HBase reports an error and crashes after running for ten-plus hours

WARN [RS:0;iios-hbase:16020] util.Sleeper: We slept 14148ms instead of 3000ms, this is likely due to a long garbage collecting pause and it’s usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired,

Reference for the fix: http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
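To judge how often and how long such pauses occur, the Sleeper warnings can be pulled out of the log. The following is a minimal sketch; sleeper_pauses is a hypothetical helper name and the log path in the usage comment is an assumption.

```shell
# Sketch: print "actual expected" sleep durations (in ms) for every
# util.Sleeper warning found in an HBase log file.
# sleeper_pauses is a hypothetical helper name.
sleeper_pauses() {
    sed -n 's/.*We slept \([0-9][0-9]*\)ms instead of \([0-9][0-9]*\)ms.*/\1 \2/p' "$1"
}

# Usage (the log path is an assumption):
#   sleeper_pauses /opt/hbase-2.3.5/logs/hbase.log
```

Each output line holds the measured sleep followed by the intended sleep, so large gaps between the two point at long pauses.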

Fix:

It is not yet certain that this works; it still needs 24/7 validation in the prod environment.

Step 1 Add to hbase-site.xml:

<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
<property>
  <name>hbase.zookeeper.property.tickTime</name>
  <value>6000</value>
</property>
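The two values are set together for a reason: a ZooKeeper server caps the negotiated session timeout at its maxSessionTimeout, which defaults to 20 times tickTime. With tickTime at 6000 ms the cap is exactly 120000 ms; with a tickTime of 2000 ms it would be only 40000 ms and the larger timeout would be cut down. A minimal sketch of that check, assuming maxSessionTimeout is not overridden (session_timeout_fits is a hypothetical helper name):

```shell
# Sketch: check that a requested session timeout fits under ZooKeeper's
# default cap of 20 * tickTime. Assumes maxSessionTimeout is left at its
# default. session_timeout_fits is a hypothetical helper name.
session_timeout_fits() {
    timeout_ms=$1
    tick_ms=$2
    max_ms=$((tick_ms * 20))
    [ "$timeout_ms" -le "$max_ms" ]
}

# Values from the hbase-site.xml snippet: 120000 against 20 * 6000.
if session_timeout_fits 120000 6000; then
    echo "session timeout fits under the cap"
else
    echo "session timeout exceeds the cap and would be reduced"
fi
```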

Step 2 Add to hbase-env.sh:

export HBASE_HEAPSIZE=2G

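Step 3 Modify the HBase docker-compose file

Double the maximum memory and set the minimum CPU allocation to 30%.

Problem 3: HBase went down after an unexpected power outage at the customer site, and data could no longer be written

Scenario: HBase is deployed in standalone mode

State 1: the code that connects to HBase reports the following error: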

---2021-09-03 15:10:54.931 - INFO - --- [shared-pool3-t2] o.a.h.h.client.AsyncRequestFutureImpl : id=1, table=realtime_data_object, attempt=11/16, failed=4ops, last exception=org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: realtime_data_object,00000795\x00\x00\x01y\xFC\xCE\xD5\x14,1624794165063.9d2f38da9f615d3ff76e670db2e27f22. is not online on iios-hbase,16020,1630652847466,
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3358),
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3335),
at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1492),
at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2774),
at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:44870),
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:393),
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133),
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338),
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318),
on iios-hbase,16020,1630384166013, tracking started null, retrying after=10072ms, replay=4ops,

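State 2: HBase starts without any obvious error

State 3:

Ran the command; no obvious error was found.

State 4:

 

In short, none of the fixes tried so far resolved the problem.

What worked:

Enter the hbase shell and run the command that lists the tables; everything looks normal.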

Took 0.3641 seconds 
hbase(main):006:0> list
TABLE 
energy_realtime_datapoint 
energy_timing_datapoint

Query the first 10 rows of the failing table:

hbase(main):007:0> scan 'realtime_data_object',{LIMIT=>10}
ROW COLUMN+CELL 
org.apache.hadoop.hbase.TableNotEnabledException: realtime_data_object is disabled.
at org.apache.hadoop.hbase.client.ConnectionImplementation.relocateRegion(ConnectionImplementation.java:773)
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:330)
at org.apache.hadoop.hbase.client.ScannerCallable.prepare(ScannerCallable.java:139)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:408)
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
ERROR: Table realtime_data_object is disabled!
For usage try 'help "scan"'
Took 0.4266 seconds

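Good! So that was the problem: the table was disabled.

After enabling the table manually, the table state is normal and HBase throughput is back to normal.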

hbase(main):008:0> enable "realtime_data_object"
Took 4.1879 seconds