Sqoop Usage Guide
Sqoop Overview
When a sqoop command runs, it generates MapReduce code: a Java file is written to the directory from which the sqoop command was launched. This file is the generated MapReduce source and can be inspected to understand the logic Sqoop executes.
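You can also produce this generated code without running an import, via the codegen tool. A minimal sketch, reusing the placeholder connection string, user, and table from the examples later in this article:
sqoop codegen \
--connect jdbc:oracle:thin:@localhost:1521:db \
--username user --password pwd \
--table oracle_table
# oracle_table.java is written to the current directory by default;
# the compiled class/jar lands under /tmp/sqoop-<user>/compile/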
Sqoop has two main commands, listed below; this article focuses on import:
- import: imports table data from a relational database into HDFS/Hive/HBase, etc.
- export: exports data from Hadoop (HDFS/Hive/HBase, etc.) into a relational database
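Both are tools of the sqoop CLI; the full list of available tools can be printed with:
sqoop help
# lists the available tools, including import, export, codegen, eval, list-tables, ...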
Configuration
- Place the JDBC driver jar for your relational database (e.g. Oracle's driver ojdbc7.jar) on the server where the sqoop client runs, under the path: ../CDH/lib/sqoop/lib/
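For example (the source path of the jar is illustrative):
# copy the Oracle JDBC driver onto the sqoop client host
cp /path/to/ojdbc7.jar ../CDH/lib/sqoop/lib/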
export command
sqoop export \
--hcatalog-database temp \
--hcatalog-table js_pianyuan_orc \
--hcatalog-partition-keys curdate \
--hcatalog-partition-values 20180122 \
--connect jdbc:mysql://ip:3306/test \
--username username \
--password passwd \
-m 10 \
--table js_pianyuan
refer to: exporting an ORC table in Hive to MySQL with sqoop
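Note that sqoop export requires the target table to already exist in MySQL. A hypothetical schema for js_pianyuan (the column names here are assumptions, not from the source):
CREATE TABLE js_pianyuan (
  id BIGINT,          -- hypothetical column
  name VARCHAR(255),  -- hypothetical column
  curdate VARCHAR(8)  -- hypothetical column matching the partition key above
);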
import command
- The --hive-drop-import-delims parameter removes the special characters \n, \r and \001 from the source (e.g. Oracle) table data, which cleans up dirty source records. For example, if a column value contains a newline \n, and newline is Sqoop's default row delimiter for the exported HDFS file, then one Oracle record becomes two rows in HDFS.
- The --hive-overwrite parameter: when specified, the target Hive table is overwritten; when omitted, rows are appended to the target Hive table.
- For --split-by column, pick a column whose values are evenly distributed, so that no mapper instance runs nearly empty while others are overloaded; it is used together with -m num (see the sketch after this list).
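To illustrate why the split column matters: Sqoop first runs a boundary query on the split column, then divides that range evenly among the mappers, so a skewed column yields skewed splits. A rough sketch (queries shown as comments; id and the row counts are illustrative):
# boundary query Sqoop issues first:
#   SELECT MIN(id), MAX(id) FROM oracle_table
# with -m 4 and id ranging over [1, 100], each mapper roughly gets:
#   mapper 0: WHERE id >= 1  AND id < 26
#   mapper 1: WHERE id >= 26 AND id < 51
#   mapper 2: WHERE id >= 51 AND id < 76
#   mapper 3: WHERE id >= 76 AND id <= 100
# if 90% of the rows have id < 26, mapper 0 does 90% of the work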
Import into HDFS
sqoop import \
--connect jdbc:oracle:thin:@localhost:1521:db \
--username user --password pwd \
--table oracle_table \
--target-dir /user/.../hdfs_dir \
--split-by column_name \
-m num_mappers \
--hive-drop-import-delims \
--fields-terminated-by '\001' \
--append
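To verify the result after the import (path elided as above; the part file name follows Sqoop's map-only output convention):
hadoop fs -ls /user/.../hdfs_dir
hadoop fs -tail /user/.../hdfs_dir/part-m-00000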
Import into a Hive table
- Without partitions
sqoop import \
--connect jdbc:oracle:thin:@localhost:1521:db \
--username user --password pwd \
--table oracle_table \
--hive-import --hive-overwrite \
--hive-table hivedb.hivetable \
--hive-drop-import-delims \
--fields-terminated-by '\001'
- With a single partition & ORC format
sqoop import \
--connect jdbc:oracle:thin:@localhost:1521:db \
--username user --password pwd \
--table oracle_table \
--hcatalog-database odsa \
--hcatalog-table odsa_table \
--hive-drop-import-delims \
--fields-terminated-by '\001' \
-m 1 \
--hive-partition-key rdate \
--hive-partition-value '20180909' \
--hcatalog-storage-stanza 'stored as orcfile'
- With multiple partitions & ORC format
sqoop import \
--connect jdbc:oracle:thin:@localhost:1521:db \
--username user --password pwd \
--table oracle_table \
--hcatalog-database odsa \
--hcatalog-table odsa_table \
--hive-drop-import-delims \
--fields-terminated-by '\001' \
-m 1 \
--hcatalog-partition-keys rdate,rtime \
--hcatalog-partition-values '20180909','120101' \
--hcatalog-storage-stanza 'stored as orcfile tblproperties ("orc.compress"="SNAPPY")'
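For reference, the target HCatalog table for the multi-partition import above might be defined as follows. This is a hypothetical sketch (the data columns are assumptions, since the source does not show the DDL):
CREATE TABLE odsa.odsa_table (
  col1 string,  -- hypothetical columns
  col2 string
)
PARTITIONED BY (rdate string, rtime string)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");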
Importing data as CSV
If some fields in the source table contain the column delimiter, exporting the data as a CSV file can work around it; but if the data also contains carriage returns or newlines, you must additionally use --hive-drop-import-delims to strip those useless characters from the dirty source data.
Usage and example
- Usage
sqoop import \
--connect jdbc:oracle:thin:@localhost:1521:db \
--username user --password pwd \
--table oracle_table \
--hive-import --hive-overwrite \
--hive-table hivedb.hivetable \
--hive-drop-import-delims \
--fields-terminated-by ',' \
--escaped-by '\\' \
--enclosed-by '\"'
The resulting file is shown below. Using the command:
hadoop fs -tail /user/..../hadoop-map-001
the file contents look like:
"aaa","\"nihao","null","30"
"bbb","world","null","50"
As shown above, NULLs in the relational table are written to the file as the string "null". The following options convert NULLs in the relational table to a value of your choice, for example:
--null-string '\\N'
--null-non-string '\\N'
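For example, adding these two options to the CSV import above (a sketch reusing the same placeholders):
sqoop import \
--connect jdbc:oracle:thin:@localhost:1521:db \
--username user --password pwd \
--table oracle_table \
--hive-import --hive-overwrite \
--hive-table hivedb.hivetable \
--hive-drop-import-delims \
--fields-terminated-by ',' \
--escaped-by '\\' \
--enclosed-by '\"' \
--null-string '\\N' \
--null-non-string '\\N'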
- Corresponding Hive setup
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ',',
"quoteChar" = '\"',
"escapeChar" = '\\'
)
STORED AS TEXTFILE;
At the same time, set the following serde property:
alter table ${table_name} set serdeproperties ('serialization.null.format'='\\N');
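A quick check that the NULLs round-trip after the property is set (assuming column a held NULLs in the source table):
-- rows whose source value was NULL should now read back as NULL in Hive
SELECT count(*) FROM my_table WHERE a IS NULL;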
Problems
Problem 1
- Symptom: Caused by: java.sql.SQLRecoverableException: IO Error: Connection reset
- Cause: the Java heap of the Sqoop 2 Server (Java Heap Size of Sqoop 2 Server in Bytes) is set too small.
- Fix: increase it.
- refer to: Java Heap Size of Sqoop 2 Server in Bytes
Problem 2
- When NULL values from an Oracle table are exported to a CSV file, the nulls cannot be recognized (see the --null-string / --null-non-string options above).
Problem 3
- Problem: When using the default Sqoop connector for Oracle, some data does get transferred, but during the map-reduce job a lot of errors are reported, as below:
11/05/26 16:23:47 INFO mapred.JobClient: Task Id : attempt_201105261333_0002_m_000002_0, Status : FAILED
java.lang.RuntimeException: java.lang.RuntimeException: java.sql.SQLRecoverableException: IO Error: Connection reset
    at com.cloudera.sqoop.mapreduce.db.DBInputFormat.setConf(DBInputFormat.java:164)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:605)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.RuntimeException: java.sql.SQLRecoverableException: IO Error: Connection reset
    at com.cloudera.sqoop.mapreduce.db.DBInputFormat.getConnection(DBInputFormat.java:190)
    at com.cloudera.sqoop.mapreduce.db.DBInputFormat.setConf(DBInputFormat.java:159)
    ... 9 more
Caused by: java.sql.SQLRecoverableException: IO Error: Connection reset
    at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:428)
    at oracle.jdbc.driver.PhysicalConnection.<init>(PhysicalConnection.java:536)
    at oracle.jdbc.driver.T4CConnection.<init>(T4CConnection.java:228)
    at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:32)
    at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:521)
    at java.sql.DriverManager.getConnection(DriverManager.java:582)
    at java.sql.DriverManager.getConnection(DriverManager.java:185)
    at com.cloudera.sqoop.mapreduce.db.DBConfiguration.getConnection(DBConfiguration.java:152)
    at com.cloudera.sqoop.mapreduce.db.DBInputFormat.getConnection(DBInputFormat.java:184)
    ... 10 more
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
    at oracle.net.ns.DataPacket.send(DataPacket.java:199)
    at oracle.net.ns.NetOutputStream.flush(NetOutputStream.java:211)
    at oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:227)
    at oracle.net.ns.NetInputStream.read(NetInputStream.java:175)
    at oracle.net.ns.NetInputStream.read(NetInputStream.java:100)
    at oracle.net.ns.NetInputStream.read(NetInputStream.java:85)
    at oracle.jdbc.driver.T4CSocketInputStreamWrapper.readNextPacket(T4CSocketInputStreamWrapper.java:123)
    at oracle.jdbc.driver.T4CSocketInputStreamWrapper.read(T4CSocketInputStreamWrapper.java:79)
    at oracle.jdbc.driver.T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1122)
    at oracle.jdbc.driver.T4CMAREngine.unmarshalSB1(T4CMAREngine.java:1099)
    at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:288)
    at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:191)
    at oracle.jdbc.driver.T4CTTIoauthenticate.doOAUTH(T4CTTIoauthenticate.java:366)
    at oracle.jdbc.driver.T4CTTIoauthenticate.doOAUTH(T4CTTIoauthenticate.java:752)
    at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:366)
    ... 18 more
- Solution: This problem occurs primarily due to the lack of a fast random number generation device on the host where the map tasks execute. On typical Linux systems this can be addressed by setting the following property in the java.security file:
securerandom.source=file:/dev/../dev/urandom
refer to: Oracle: Connection Reset Errors
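As an alternative to editing the java.security file, the same property can be passed to the mapper JVMs on the command line; a sketch reusing the placeholders from the import examples above:
sqoop import \
-D mapred.child.java.opts="-Djava.security.egd=file:/dev/../dev/urandom" \
--connect jdbc:oracle:thin:@localhost:1521:db \
--username user --password pwd \
--table oracle_table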