Because the community edition of CDH cannot use Navigator, we need to integrate Apache Atlas ourselves.
## Versions

Updated 2020-08-18: the latest release is now 2.1.0. If your Hive version is below 3.1, there may be incompatibility issues; patch the source before compiling (see this article).
- Atlas: 2.0 (Download)
- CDH: 6.3.1 (Parcels)
Other notes: Atlas depends on Solr (or ES), HBase, and Kafka to work. Make sure these three services are already running in CDH, or bundle them into the Atlas build (not recommended for production).
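A quick way to confirm the three dependencies are reachable before going further; a hedged sketch, assuming CDH default ports and the node1..node3 hostnames used throughout this post:

```bash
# Solr: list collections through the admin API (CDH default port 8983)
curl -s "http://node1:8983/solr/admin/collections?action=LIST"
# Kafka: list topics (CDH 6.3 ships Kafka 2.2, which accepts --bootstrap-server)
kafka-topics --bootstrap-server node1:9092 --list
# HBase: ask for cluster status via the shell
echo "status" | hbase shell
```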
## Build

Download the Atlas 2.0 source from the official site and compile it; see the official docs for details (there are basically no pitfalls, unless your Maven version is too old, in which case just install the latest):
```bash
tar xvfz apache-atlas-2.0.0-sources.tar.gz
cd apache-atlas-sources-2.0.0/
export MAVEN_OPTS="-Xms2g -Xmx2g"
mvn clean -DskipTests install
mvn clean -DskipTests package -Pdist
```
The build output lands under distro/target/:
```
[root@node1 target]
total 738M
drwxr-xr-x. 3 root root   32 Mar 31 10:08 apache-atlas-2.0.0-bin
-rw-r--r--. 1 root root 360M Mar 31 10:08 apache-atlas-2.0.0-bin.tar.gz
drwxr-xr-x. 3 root root   44 Mar 31 10:08 apache-atlas-2.0.0-falcon-hook
-rw-r--r--. 1 root root 8.8M Mar 31 10:08 apache-atlas-2.0.0-falcon-hook.tar.gz
drwxr-xr-x. 3 root root   43 Mar 31 10:08 apache-atlas-2.0.0-hbase-hook
-rw-r--r--. 1 root root  11M Mar 31 10:08 apache-atlas-2.0.0-hbase-hook.tar.gz
drwxr-xr-x. 3 root root   42 Mar 31 10:08 apache-atlas-2.0.0-hive-hook
-rw-r--r--. 1 root root  16M Mar 31 10:08 apache-atlas-2.0.0-hive-hook.tar.gz
drwxr-xr-x. 3 root root   43 Mar 31 10:08 apache-atlas-2.0.0-kafka-hook
-rw-r--r--. 1 root root 8.8M Mar 31 10:08 apache-atlas-2.0.0-kafka-hook.tar.gz
drwxr-xr-x. 3 root root   32 Mar 31 10:08 apache-atlas-2.0.0-server
-rw-r--r--. 1 root root 260M Mar 31 10:08 apache-atlas-2.0.0-server.tar.gz
-rw-r--r--. 1 root root  11M Mar 31 10:08 apache-atlas-2.0.0-sources.tar.gz
drwxr-xr-x. 3 root root   43 Mar 31 10:08 apache-atlas-2.0.0-sqoop-hook
-rw-r--r--. 1 root root 8.8M Mar 31 10:08 apache-atlas-2.0.0-sqoop-hook.tar.gz
drwxr-xr-x. 3 root root   43 Mar 31 10:08 apache-atlas-2.0.0-storm-hook
-rw-r--r--. 1 root root  57M Mar 31 10:08 apache-atlas-2.0.0-storm-hook.tar.gz
drwxr-xr-x. 2 root root    6 Mar 31 10:08 archive-tmp
-rw-r--r--. 1 root root  94K Mar 31 10:08 atlas-distro-2.0.0.jar
drwxr-xr-x. 2 root root 4.0K Mar 31 10:08 bin
drwxr-xr-x. 5 root root  231 Mar 31 10:08 conf
drwxr-xr-x. 2 root root   28 Mar 31 10:08 maven-archiver
drwxr-xr-x. 3 root root   22 Mar 31 10:08 maven-shared-archive-resources
drwxr-xr-x. 2 root root   55 Mar 31 10:08 META-INF
-rw-r--r--. 1 root root 3.9K Mar 31 10:08 rat.txt
drwxr-xr-x. 3 root root   22 Mar 31 10:08 test-classes
```
We use apache-atlas-2.0.0-bin.tar.gz directly, plus the hook packages for the components we want to track, e.g. apache-atlas-2.0.0-hive-hook.tar.gz and apache-atlas-2.0.0-sqoop-hook.tar.gz.
## Configuration

Extract the required tar.gz files into a target directory such as /opt/atlas, as sketched below.
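A minimal sketch of the unpacking step, assuming /opt/atlas as the install root; the directory names are what the 2.0.0 tarballs typically unpack to, so adjust if yours differ:

```bash
mkdir -p /opt/atlas
tar xvfz apache-atlas-2.0.0-bin.tar.gz
cp -r apache-atlas-2.0.0/* /opt/atlas/

# Merge the hook packages into the same root (adds hook/ and hook-bin/)
tar xvfz apache-atlas-2.0.0-hive-hook.tar.gz
cp -r apache-atlas-hive-hook-2.0.0/* /opt/atlas/
tar xvfz apache-atlas-2.0.0-sqoop-hook.tar.gz
cp -r apache-atlas-sqoop-hook-2.0.0/* /opt/atlas/
```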
Then change a few key settings: vim conf/atlas-application.properties
```properties
atlas.graph.storage.hostname=node1:2181,node2:2181,node3:2181

atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.zookeeper-url=node1:2181/solr,node2:2181/solr,node3:2181/solr
atlas.graph.index.search.solr.zookeeper-connect-timeout=60000
atlas.graph.index.search.solr.zookeeper-session-timeout=60000
atlas.graph.index.search.solr.wait-searcher=true

atlas.notification.embedded=false
atlas.kafka.data=${sys:atlas.home}/data/kafka
atlas.kafka.zookeeper.connect=node1:2181,node2:2181,node3:2181
atlas.kafka.bootstrap.servers=node1:9092,node2:9092,node3:9092
atlas.kafka.zookeeper.session.timeout.ms=4000
atlas.kafka.zookeeper.connection.timeout.ms=2000
atlas.kafka.zookeeper.sync.time.ms=20
atlas.kafka.auto.commit.interval.ms=1000
atlas.kafka.hook.group.id=atlas

atlas.rest.address=http://0.0.0.0:21001
atlas.audit.hbase.tablename=apache_atlas_entity_audit
atlas.audit.zookeeper.session.timeout.ms=1000
atlas.audit.hbase.zookeeper.quorum=node1:2181,node2:2181,node3:2181
atlas.server.http.port=21001

atlas.hook.hive.synchronous=false
atlas.hook.hive.numRetries=3
atlas.hook.hive.queueSize=10000
atlas.cluster.name=primary

atlas.hook.sqoop.synchronous=false
atlas.hook.sqoop.numRetries=3
atlas.hook.sqoop.queueSize=10000
```
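Since atlas.notification.embedded=false, Atlas relies on the external Kafka and uses the topics ATLAS_HOOK and ATLAS_ENTITIES. A hedged sketch for pre-creating them (the broker may auto-create them anyway, and the partition/replication counts here are assumptions to tune for your cluster):

```bash
kafka-topics --bootstrap-server node1:9092 --create --topic ATLAS_HOOK \
  --partitions 3 --replication-factor 3
kafka-topics --bootstrap-server node1:9092 --create --topic ATLAS_ENTITIES \
  --partitions 3 --replication-factor 3
```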
At this point you can try starting Atlas with ./bin/atlas_start.py and watch the log with tail -f logs/application.log. I ran into a problem here:
```
2020-03-31 11:42:57,437 INFO  - [main:] ~ Creating indexes for graph. (GraphBackedSearchIndexer:248)
2020-03-31 11:42:58,605 INFO  - [main:] ~ Created index : vertex_index (GraphBackedSearchIndexer:253)
2020-03-31 11:42:58,692 INFO  - [main:] ~ Created index : edge_index (GraphBackedSearchIndexer:259)
2020-03-31 11:42:58,700 INFO  - [main:] ~ Created index : fulltext_index (GraphBackedSearchIndexer:265)
2020-03-31 11:42:58,824 ERROR - [main:] ~ GraphBackedSearchIndexer.initialize() failed (GraphBackedSearchIndexer:307)
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://BD-Cal-Pro-02:8983/solr: Can not find the specified config set: vertex_index
	at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:627)
	at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:253)
```
For some reason Atlas failed to create the collections it needs in Solr, so I had to create them manually:
```bash
solrctl instancedir --create atlas conf/solr/
solrctl collection --create vertex_index -s 1 -c atlas -r 1
solrctl collection --create edge_index -s 1 -c atlas -r 1
solrctl collection --create fulltext_index -s 1 -c atlas -r 1
```
After that, Atlas starts up normally.
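A quick sanity check that the server is actually serving; a hedged example, assuming the port configured above (21001) and the default admin/admin credentials:

```bash
curl -u admin:admin http://node1:21001/api/atlas/admin/version
```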
## Jar configuration

Copy the configuration into hook/hive, then pack it into atlas-plugin-classloader-2.0.0.jar:
```bash
zip -u atlas-plugin-classloader-2.0.0.jar atlas-application.properties
```
Some tutorials zip it in directly from another directory, but that fails at runtime: zip then stores the entry with a path prefix, so the plugin classloader cannot find atlas-application.properties at the jar root. The full sequence is sketched below.
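A minimal sketch, assuming /opt/atlas as the install root; note that zip -u is run from inside hook/hive so the entry lands at the jar root without a path prefix:

```bash
cp /opt/atlas/conf/atlas-application.properties /opt/atlas/hook/hive/
cd /opt/atlas/hook/hive/
zip -u atlas-plugin-classloader-2.0.0.jar atlas-application.properties
# Verify the entry carries no directory prefix:
unzip -l atlas-plugin-classloader-2.0.0.jar | grep atlas-application.properties
```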
## Hive Hook

The following changes are made on the Hive configuration page in CM:
- Hive Auxiliary JARs Directory: ${ATLAS_HOME}/hook/hive
- Gateway Client Environment Advanced Configuration Snippet (Safety Valve) for hive-env.sh: HIVE_AUX_JARS_PATH=${ATLAS_HOME}/hook/hive
- HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml:

```xml
<property>
  <name>hive.exec.post.hooks</name>
  <value>org.apache.atlas.hive.hook.HiveHook</value>
</property>
<property>
  <name>hive.reloadable.aux.jars.path</name>
  <value>${ATLAS_HOME}/hook/hive/</value>
</property>
<property>
  <name>atlas.cluster.name</name>
  <value>primary</value>
</property>
```

Here ${ATLAS_HOME} stands for the Atlas install path (e.g. /opt/atlas in this setup); substitute the real path if the variable is not defined in the relevant environment.
Once this is configured, restart Hive. Create a test table and you should see a record for that table appear in Atlas; a smoke test is sketched below.
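A hedged smoke test: the table name and the HiveServer2 JDBC URL are made-up examples, and admin/admin is the default Atlas login:

```bash
beeline -u "jdbc:hive2://node1:10000" \
  -e "CREATE TABLE atlas_smoke_test (id INT, name STRING);"
# The hook publishes to Kafka asynchronously, so give it a moment, then
# search the Atlas UI at http://node1:21001 or query the REST API:
curl -u admin:admin \
  "http://node1:21001/api/atlas/v2/search/basic?typeName=hive_table&query=atlas_smoke_test"
```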
## Importing historical data

If you need the historical metadata as well, import it with import-hive.sh under hook-bin:
```bash
export HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive/
cp conf/atlas-application.properties /opt/cloudera/parcels/CDH/lib/hive/conf/
./hook-bin/import-hive.sh
```
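Run without arguments it walks every database and table, which can take a while on a large warehouse. Per the Atlas docs the script also accepts scoping flags; the database/table names below are made-up examples, and it prompts for Atlas credentials (admin/admin by default):

```bash
./hook-bin/import-hive.sh -d sales             # import one database
./hook-bin/import-hive.sh -d sales -t orders   # import one table
```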