 

Hadoop Tutorial 2: Hello World, the Hadoop Ecosystem, and Sqoop with Informix, DB2, and MySQL



Continuing from the previous part:


Hadoop includes a browser interface for inspecting the state of HDFS.

Figure 7 shows the output of the word count job.
Figure 7. Viewing HDFS with a browser


The Cloudera website offers a more sophisticated console for free. It provides a number of capabilities beyond the standard Hadoop web interfaces. Notice that the HDFS health status in Figure 8 is shown as Bad.
Figure 8. Hadoop services managed by Cloudera Manager


Why Bad? Because in a single virtual machine, HDFS cannot make three copies of each data block. When blocks are under-replicated there is a risk of data loss, so the health of the system is reported as bad. It is a good thing you are not trying to run production Hadoop jobs on a single node.
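If you are curious how under-replicated your own single-node VM is, the sketch below is one quick way to check. It is not part of the original article: it only uses the standard hadoop fsck command and the dfs.replication setting, and the count shown is an illustrative placeholder, not output captured from this VM.

# sketch: check block replication on the pseudo-distributed VM
$ hadoop fsck /
...
 Under-replicated blocks:   12    <- placeholder value; yours will differ

# a single-node setup is usually told to keep one copy instead of three,
# via dfs.replication in hdfs-site.xml:
#   <property>
#     <name>dfs.replication</name>
#     <value>1</value>
#   </property>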
Your MapReduce jobs are not limited to Java. This last MapReduce example uses Hadoop Streaming to support a mapper written in Python and a reducer written in AWK. No, you do not have to be a Java guru to write map-reduce!
Mark Twain was not a big fan of Cooper. In this use case, Hadoop will provide some simple literary criticism comparing Twain and Cooper. The Flesch-Kincaid test computes the reading level of a given text. One factor in that analysis is average sentence length. Parsing sentences turns out to be more complicated than just looking for the period character; the openNLP package and the Python NLTK package both have excellent sentence parsers. To keep things simple, the example in Listing 8 uses word length as a stand-in for the number of syllables in a word. If you want to take this to the next level, implement the Flesch-Kincaid test in MapReduce, crawl the web, and compute the reading level of your favorite news sites; a sketch of the formula itself follows.
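If you do take the Flesch-Kincaid idea further, the grade-level formula itself is just arithmetic. The one-liner below is a minimal sketch and is not one of the article's listings: it assumes you have already computed total words (W), total sentences (S), and total syllables (Y) elsewhere, and the values passed in via -v are made-up placeholders.

# sketch: Flesch-Kincaid grade level
#   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
# W, S, Y are placeholder totals, not real measurements
$ awk -v W=215107 -v S=9000 -v Y=290000 \
    'BEGIN { print 0.39 * (W / S) + 11.8 * (Y / W) - 15.59 }'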

Listing 8. Python-based mapper for literary criticism

# here is the mapper we'll connect to the streaming hadoop interface

# the mapper is reading the text in the file - not really appreciating Twain's humor
#

# modified from
# http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
$ cat mapper.py
#!/usr/bin/env python
import sys

# read stdin
for linein in sys.stdin:
    # strip blanks
    linein = linein.strip()
    # split into words
    mywords = linein.split()
    # loop on mywords, output the length of each word
    for word in mywords:
        # the reducer just cares about the first column,
        # normally there is a key - value pair
        print '%s %s' % (len(word), 0)


The mapper output for the word "Twain" would be 5 0. The word lengths are sorted in numeric order and presented to the reducer in sorted order. In the examples in Listings 9 and 10 the data does not need to be sorted to produce the correct output, but the sort is built into the MapReduce infrastructure and happens anyway.

Listing 9. AWK reducer for literary criticism

# the awk code is modified from http://www.commandlinefu.com

# awk is calculating
#  NR - the number of words in total
#  sum/NR - the average word length
#  sqrt(mean2/NR) - the standard deviation

$ cat statsreducer.awk
awk '{delta = $1 - avg; avg += delta / NR; \
mean2 += delta * ($1 - avg); sum=$1+sum } \
END { print NR, sum/NR, sqrt(mean2 / NR); }'


Listing 10. Running the Python mapper and AWK reducer with Hadoop Streaming

# test locally

# because we're using Hadoop Streaming, we can test the
# mapper and reducer with simple pipes

# the "sort" phase is a reminder the keys are sorted
# before presentation to the reducer
# in this example it doesn't matter what order the
# word length values are presented for calculating the std deviation

$ zcat ../DS.txt.gz | ./mapper.py | sort | ./statsreducer.awk
215107 4.56068 2.50734

# now run in hadoop with streaming

# CDH4
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input HF.txt -output HFstats -file ./mapper.py -file \
./statsreducer.awk -mapper ./mapper.py -reducer ./statsreducer.awk

# CDH3
$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u4.jar \
-input HF.txt -output HFstats -file ./mapper.py -file ./statsreducer.awk \
-mapper ./mapper.py -reducer ./statsreducer.awk

$ hls HFstats
Found 3 items
-rw-r--r--   1 cloudera supergroup   0 2012-08-12 15:38 /user/cloudera/HFstats/_SUCCESS
drwxr-xr-x   - cloudera supergroup   0 2012-08-12 15:37 /user/cloudera/HFstats/_logs
-rw-r--r--   1 cloudera ...  24 2012-08-12 15:37 /user/cloudera/HFstats/part-00000

$ hcat /user/cloudera/HFstats/part-00000
113365 4.11227 2.17086

# now for cooper

$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u4.jar \
-input DS.txt.gz -output DSstats -file ./mapper.py -file ./statsreducer.awk \
-mapper ./mapper.py -reducer ./statsreducer.awk

$ hcat /user/cloudera/DSstats/part-00000
215107 4.56068 2.50734


Mark Twain fans can relax, knowing that Hadoop finds Cooper to use longer words, and with a shocking standard deviation (humor intended). That does, of course, assume that shorter words are better. Let's move on: the next step is to write the data in HDFS out to Informix and DB2.
Using Sqoop to write data from HDFS into Informix, DB2, or MySQL via JDBC
The Apache Sqoop project is an open-source, JDBC-based utility for moving data between Hadoop and databases. Sqoop was originally created at a Cloudera hackathon and later open-sourced.
Moving data from HDFS into a relational database is a common use case. HDFS and map-reduce are great at the heavy lifting, but for simple queries or as the back-end store for a web site, caching the map-reduce output in a relational store is a good design pattern. You can avoid re-running the map-reduce word count by simply Sqooping the results into Informix and DB2. You have already generated data about Twain and Cooper, so now let's move that data into a database, as shown in Listing 11.
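As a quick aside (this is not from the original listings), once the word counts are sitting in a relational table, the kind of lookup a web page would make becomes a single, cheap SQL query. The sketch below assumes the MySQL mydemo database and wordcount table that are created later in Listing 15.

# sketch: the sort of quick query a web back end can run against the cached results
$ mysql -u root mydemo -e \
    "select word, n from wordcount order by n desc limit 10;"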

Listing 11. JDBC driver installation

# Sqoop needs access to the JDBC driver for every
# database that it will access

# please copy the driver for each database you plan to use for these exercises
# the MySQL database and driver are already installed in the virtual image
# but you still need to copy the driver to the sqoop/lib directory

# one time copy of jdbc driver to sqoop lib directory
$ sudo cp Informix_JDBC_Driver/lib/ifxjdbc*.jar /usr/lib/sqoop/lib/
$ sudo cp db2jdbc/db2jcc*.jar /usr/lib/sqoop/lib/
$ sudo cp /usr/lib/hive/lib/mysql-connector-java-5.1.15-bin.jar /usr/lib/sqoop/lib/


The examples in Listings 12 through 15 cover each database in turn. Skip ahead to the example you care about: Informix, DB2, or MySQL. For the database polyglots, have fun working through every one. If your preferred database is not included here, making these samples work elsewhere should not be a huge challenge; a hedged sketch for one more database follows.
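For example, a PostgreSQL export would look very much like the Informix and DB2 ones. The sketch below is an untested illustration, not from the original article: the host and database name are made up, and it uses only Sqoop options already shown in the listings in this article.

# sketch: the same word count export aimed at a hypothetical PostgreSQL database
$ sqoop export \
--connect "jdbc:postgresql://myhost:5432/mydemo" \
--username me -P \
--table wordcount --export-dir /user/cloudera/HF.out \
--fields-terminated-by '\t'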

Listing 12. Informix users: Sqoop writing the word count results to Informix

# create a target table to put the data
# fire up dbaccess and use this sql
# create table wordcount ( word char(36) primary key, n int);

# now run the sqoop command
# this is best put in a shell script to help avoid typos...

$ sqoop export -D sqoop.export.records.per.statement=1 \
--fields-terminated-by '\t' --driver com.informix.jdbc.IfxDriver \
--connect \
"jdbc:informix-sqli://myhost:54321/stores_demo:informixserver=i7;user=me;password=mypw" \
--table wordcount --export-dir /user/cloudera/HF.out


Listing 13. Informix users: output of the Sqoop export to Informix

12/08/08 21:39:42 INFO manager.SqlManager: Using default fetchSize of 1000
12/08/08 21:39:42 INFO tool.CodeGenTool: Beginning code generation
12/08/08 21:39:43 INFO manager.SqlManager: Executing SQL statement: SELECT t.*
FROM wordcount AS t WHERE 1=0
12/08/08 21:39:43 INFO manager.SqlManager: Executing SQL statement: SELECT t.*
FROM wordcount AS t WHERE 1=0
12/08/08 21:39:43 INFO orm.CompilationManager: HADOOP_HOME is /usr/lib/hadoop
12/08/08 21:39:43 INFO orm.CompilationManager: Found hadoop core jar at:
/usr/lib/hadoop/hadoop-0.20.2-cdh3u4-core.jar
12/08/08 21:39:45 INFO orm.CompilationManager: Writing jar file:
/tmp/sqoop-cloudera/compile/248b77c05740f863a15e0136accf32cf/wordcount.jar
12/08/08 21:39:45 INFO mapreduce.ExportJobBase: Beginning export of wordcount
12/08/08 21:39:45 INFO manager.SqlManager: Executing SQL statement: SELECT t.*
FROM wordcount AS t WHERE 1=0
12/08/08 21:39:46 INFO input.FileInputFormat: Total input paths to process : 1
12/08/08 21:39:46 INFO input.FileInputFormat: Total input paths to process : 1
12/08/08 21:39:46 INFO mapred.JobClient: Running job: job_201208081900_0012
12/08/08 21:39:47 INFO mapred.JobClient:  map 0% reduce 0%
12/08/08 21:39:58 INFO mapred.JobClient:  map 38% reduce 0%
12/08/08 21:40:00 INFO mapred.JobClient:  map 64% reduce 0%
12/08/08 21:40:04 INFO mapred.JobClient:  map 82% reduce 0%
12/08/08 21:40:07 INFO mapred.JobClient:  map 98% reduce 0%
12/08/08 21:40:09 INFO mapred.JobClient: Task Id :
attempt_201208081900_0012_m_000000_0, Status : FAILED
java.io.IOException: java.sql.SQLException:
    Encoding or code set not supported.
at ...SqlRecordWriter.close(AsyncSqlRecordWriter.java:187)
at ...$NewDirectOutputCollector.close(MapTask.java:540)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at ....doAs(UserGroupInformation.java:1177)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.sql.SQLException: Encoding or code set not supported.
at com.informix.util.IfxErrMsg.getSQLException(IfxErrMsg.java:413)
at com.informix.jdbc.IfxChar.toIfx(IfxChar.java:135)
at com.informix.jdbc.IfxSqli.a(IfxSqli.java:1304)
at com.informix.jdbc.IfxSqli.d(IfxSqli.java:1605)
at com.informix.jdbc.IfxS
12/08/08 21:40:11 INFO mapred.JobClient:  map 0% reduce 0%
12/08/08 21:40:15 INFO mapred.JobClient: Task Id :
attempt_201208081900_0012_m_000000_1, Status : FAILED
java.io.IOException: java.sql.SQLException:
    Unique constraint (informix.u169_821) violated.
at .mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:223)
at .mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:49)
at .mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:531)
at .mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at com.cloudera.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:82)
at com.cloudera.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:40)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at .mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:189)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.a
12/08/08 21:40:20 INFO mapred.JobClient:
Task Id : attempt_201208081900_0012_m_000000_2, Status : FAILED
java.sql.SQLException: Unique constraint (informix.u169_821) violated.
at .mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:223)
at .mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:49)
at .mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:531)
at .mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at com.cloudera.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:82)
at com.cloudera.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:40)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at .mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:189)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.a
12/08/08 21:40:27 INFO mapred.JobClient: Job complete: job_201208081900_0012
12/08/08 21:40:27 INFO mapred.JobClient: Counters: 7
12/08/08 21:40:27 INFO mapred.JobClient:   Job Counters
12/08/08 21:40:27 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=38479
12/08/08 21:40:27 INFO mapred.JobClient:
Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/08 21:40:27 INFO mapred.JobClient:
Total time spent by all maps waiting after reserving slots (ms)=0
12/08/08 21:40:27 INFO mapred.JobClient:     Launched map tasks=4
12/08/08 21:40:27 INFO mapred.JobClient:     Data-local map tasks=4
12/08/08 21:40:27 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/08/08 21:40:27 INFO mapred.JobClient:     Failed map tasks=1
12/08/08 21:40:27 INFO mapreduce.ExportJobBase:
Transferred 0 bytes in 41.5758 seconds (0 bytes/sec)
12/08/08 21:40:27 INFO mapreduce.ExportJobBase: Exported 0 records.
12/08/08 21:40:27 ERROR tool.ExportTool: Error during export: Export job failed!

# despite the errors above, rows are inserted into the wordcount table
# one row is missing
# the retry and duplicate key exception are most likely related, but
# troubleshooting will be saved for a later article

# check how we did
# nothing like a "here document" shell script

$ dbaccess stores_demo - <<eoj
> select count(*) from wordcount;
> eoj

Database selected.
(count(*))
13837
1 row(s) retrieved.
Database closed.


Listing 14. DB2 users: Sqoop writing the word count results to DB2

# here is the db2 syntax
# create a destination table for db2
#
# db2 => connect to sample
#
#   Database Connection Information
#
# Database server        = DB2/LINUXX8664 10.1.0
# SQL authorization ID   = DB2INST1
# Local database alias   = SAMPLE
#
# db2 => create table wordcount ( word char(36) not null primary key , n int)
# DB20000I  The SQL command completed successfully.
#

sqoop export -D sqoop.export.records.per.statement=1 \
--fields-terminated-by '\t' \
--driver com.ibm.db2.jcc.DB2Driver \
--connect "jdbc:db2://192.168.1.131:50001/sample" \
--username db2inst1 --password db2inst1 \
--table wordcount --export-dir /user/cloudera/HF.out

12/08/09 12:32:59 WARN tool.BaseSqoopTool: Setting your password on the
command-line is insecure. Consider using -P instead.
12/08/09 12:32:59 INFO manager.SqlManager: Using default fetchSize of 1000
12/08/09 12:32:59 INFO tool.CodeGenTool: Beginning code generation
12/08/09 12:32:59 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM wordcount AS t WHERE 1=0
12/08/09 12:32:59 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM wordcount AS t WHERE 1=0
12/08/09 12:32:59 INFO orm.CompilationManager: HADOOP_HOME is /usr/lib/hadoop
12/08/09 12:32:59 INFO orm.CompilationManager: Found hadoop core jar
at: /usr/lib/hadoop/hadoop-0.20.2-cdh3u4-core.jar
12/08/09 12:33:00 INFO orm.CompilationManager: Writing jar
file: /tmp/sqoop-cloudera/compile/5532984df6e28e5a45884a21bab245ba/wordcount.jar
12/08/09 12:33:00 INFO mapreduce.ExportJobBase: Beginning export of wordcount
12/08/09 12:33:01 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM wordcount AS t WHERE 1=0
12/08/09 12:33:02 INFO input.FileInputFormat: Total input paths to process : 1
12/08/09 12:33:02 INFO input.FileInputFormat: Total input paths to process : 1
12/08/09 12:33:02 INFO mapred.JobClient: Running job: job_201208091208_0002
12/08/09 12:33:03 INFO mapred.JobClient:  map 0% reduce 0%
12/08/09 12:33:14 INFO mapred.JobClient:  map 24% reduce 0%
12/08/09 12:33:17 INFO mapred.JobClient:  map 44% reduce 0%
12/08/09 12:33:20 INFO mapred.JobClient:  map 67% reduce 0%
12/08/09 12:33:23 INFO mapred.JobClient:  map 86% reduce 0%
12/08/09 12:33:24 INFO mapred.JobClient:  map 100% reduce 0%
12/08/09 12:33:25 INFO mapred.JobClient: Job complete: job_201208091208_0002
12/08/09 12:33:25 INFO mapred.JobClient: Counters: 16
12/08/09 12:33:25 INFO mapred.JobClient:   Job Counters
12/08/09 12:33:25 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=21648
12/08/09 12:33:25 INFO mapred.JobClient:     Total time spent by all
reduces waiting after reserving slots (ms)=0
12/08/09 12:33:25 INFO mapred.JobClient:     Total time spent by all
maps waiting after reserving slots (ms)=0
12/08/09 12:33:25 INFO mapred.JobClient:     Launched map tasks=1
12/08/09 12:33:25 INFO mapred.JobClient:     Data-local map tasks=1
12/08/09 12:33:25 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/08/09 12:33:25 INFO mapred.JobClient:   FileSystemCounters
12/08/09 12:33:25 INFO mapred.JobClient:     HDFS_BYTES_READ=138350
12/08/09 12:33:25 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=69425
12/08/09 12:33:25 INFO mapred.JobClient:   Map-Reduce Framework
12/08/09 12:33:25 INFO mapred.JobClient:     Map input records=13838
12/08/09 12:33:25 INFO mapred.JobClient:     Physical memory (bytes) snapshot=105148416
12/08/09 12:33:25 INFO mapred.JobClient:     Spilled Records=0
12/08/09 12:33:25 INFO mapred.JobClient:     CPU time spent (ms)=9250
12/08/09 12:33:25 INFO mapred.JobClient:     Total committed heap usage (bytes)=42008576
12/08/09 12:33:25 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=596447232
12/08/09 12:33:25 INFO mapred.JobClient:     Map output records=13838
12/08/09 12:33:25 INFO mapred.JobClient:     SPLIT_RAW_BYTES=126
12/08/09 12:33:25 INFO mapreduce.ExportJobBase: Transferred 135.1074 KB
in 24.4977 seconds (5.5151 KB/sec)
12/08/09 12:33:25 INFO mapreduce.ExportJobBase: Exported 13838 records.

# check on the results...
#
# db2 => select count(*) from wordcount
#
# 1
# -----------
#       13838
#
#   1 record(s) selected.
#
#


Listing 15. MySQL users: Sqoop writing the word count results to MySQL

# if you don't have Informix or DB2 you can still do this example
# mysql - it is already installed in the VM, here is how to access

# one time copy of the JDBC driver

sudo cp /usr/lib/hive/lib/mysql-connector-java-5.1.15-bin.jar /usr/lib/sqoop/lib/

# now create the database and table

$ mysql -u root
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 45
Server version: 5.0.95 Source distribution

Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> create database mydemo;
Query OK, 1 row affected (0.00 sec)

mysql> use mydemo
Database changed
mysql> create table wordcount ( word char(36) not null primary key, n int);
Query OK, 0 rows affected (0.00 sec)

mysql> exit
Bye

# now export

$ sqoop export --connect jdbc:mysql://localhost/mydemo \
--table wordcount --export-dir /user/cloudera/HF.out \
--fields-terminated-by '\t' --username root


Importing data from Informix and DB2 into HDFS with Sqoop
Inserting data into Hadoop HDFS can also be accomplished with Sqoop; the direction of this two-way capability is controlled via the import argument.
The sample databases that ship with both products include some simple data sets you can use for this purpose. Listing 16 shows the Sqoop syntax and results for each server.
For MySQL users, adapt the syntax of the Informix or DB2 examples that follow; a hedged MySQL sketch is shown below.
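For instance, pulling the MySQL wordcount table created in Listing 15 back into HDFS might look like the sketch below. It is not from the original article: the --target-dir value is just an illustrative name, and -m 1 keeps it to a single mapper for simplicity.

# sketch: importing the wordcount table from MySQL back into HDFS
$ sqoop import --connect jdbc:mysql://localhost/mydemo \
--username root \
--table wordcount \
--target-dir /user/cloudera/wordcount_from_mysql \
-m 1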

Listing 16. Sqoop import from the Informix sample database to HDFS

$ sqoop import --driver com.informix.jdbc.IfxDriver \
--connect \
"jdbc:informix-sqli://192.168.1.143:54321/stores_demo:informixserver=ifx117" \
--table orders \
--username informix --password useyours

12/08/09 14:39:18 WARN tool.BaseSqoopTool: Setting your password on the command-line
is insecure. Consider using -P instead.
12/08/09 14:39:18 INFO manager.SqlManager: Using default fetchSize of 1000
12/08/09 14:39:18 INFO tool.CodeGenTool: Beginning code generation
12/08/09 14:39:19 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM orders AS t WHERE 1=0
12/08/09 14:39:19 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM orders AS t WHERE 1=0
12/08/09 14:39:19 INFO orm.CompilationManager: HADOOP_HOME is /usr/lib/hadoop
12/08/09 14:39:19 INFO orm.CompilationManager: Found hadoop core jar
at: /usr/lib/hadoop/hadoop-0.20.2-cdh3u4-core.jar
12/08/09 14:39:21 INFO orm.CompilationManager: Writing jar
file: /tmp/sqoop-cloudera/compile/0b59eec7007d3cff1fc0ae446ced3637/orders.jar
12/08/09 14:39:21 INFO mapreduce.ImportJobBase: Beginning import of orders
12/08/09 14:39:21 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM orders AS t WHERE 1=0
12/08/09 14:39:22 INFO db.DataDrivenDBInputFormat: BoundingValsQuery:
SELECT MIN(order_num), MAX(order_num) FROM orders
12/08/09 14:39:22 INFO mapred.JobClient: Running job: job_201208091208_0003
12/08/09 14:39:23 INFO mapred.JobClient:  map 0% reduce 0%
12/08/09 14:39:31 INFO mapred.JobClient:  map 25% reduce 0%
12/08/09 14:39:32 INFO mapred.JobClient:  map 50% reduce 0%
12/08/09 14:39:36 INFO mapred.JobClient:  map 100% reduce 0%
12/08/09 14:39:37 INFO mapred.JobClient: Job complete: job_201208091208_0003
12/08/09 14:39:37 INFO mapred.JobClient: Counters: 16
12/08/09 14:39:37 INFO mapred.JobClient:   Job Counters
12/08/09 14:39:37 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=22529
12/08/09 14:39:37 INFO mapred.JobClient:     Total time spent by all reduces
waiting after reserving slots (ms)=0
12/08/09 14:39:37 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
12/08/09 14:39:37 INFO mapred.JobClient:     Launched map tasks=4
12/08/09 14:39:37 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/08/09 14:39:37 INFO mapred.JobClient:   FileSystemCounters
12/08/09 14:39:37 INFO mapred.JobClient:     HDFS_BYTES_READ=457
12/08/09 14:39:37 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=278928
12/08/09 14:39:37 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2368
12/08/09 14:39:37 INFO mapred.JobClient:   Map-Reduce Framework
12/08/09 14:39:37 INFO mapred.JobClient:     Map input records=23
12/08/09 14:39:37 INFO mapred.JobClient:     Physical memory (bytes) snapshot=291364864
12/08/09 14:39:37 INFO mapred.JobClient:     Spilled Records=0
12/08/09 14:39:37 INFO mapred.JobClient:     CPU time spent (ms)=1610
12/08/09 14:39:37 INFO mapred.JobClient:     Total committed heap usage (bytes)=168034304
12/08/09 14:39:37 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2074587136
12/08/09 14:39:37 INFO mapred.JobClient:     Map output records=23
12/08/09 14:39:37 INFO mapred.JobClient:     SPLIT_RAW_BYTES=457
12/08/09 14:39:37 INFO mapreduce.ImportJobBase: Transferred 2.3125 KB in 16.7045
seconds (141.7585 bytes/sec)
12/08/09 14:39:37 INFO mapreduce.ImportJobBase: Retrieved 23 records.

# now look at the results

$ hls
Found 4 items
-rw-r--r--   1 cloudera supergroup     459386 2012-08-08 19:34 /user/cloudera/DS.txt.gz
drwxr-xr-x   - cloudera supergroup          0 2012-08-08 19:38 /user/cloudera/HF.out
-rw-r--r--   1 cloudera supergroup     597587 2012-08-08 19:35 /user/cloudera/HF.txt
drwxr-xr-x   - cloudera supergroup          0 2012-08-09 14:39 /user/cloudera/orders
$ hls orders
Found 6 items
-rw-r--r--   1 cloudera supergroup 0 2012-08-09 14:39 /user/cloudera/orders/_SUCCESS
drwxr-xr-x   - cloudera supergroup 0 2012-08-09 14:39 /user/cloudera/orders/_logs
-rw-r--r--   1 cloudera ...roup 630 2012-08-09 14:39 /user/cloudera/orders/part-m-00000
-rw-r--r--   1 cloudera supergroup 564 2012-08-09 14:39 /user/cloudera/orders/part-m-00001
-rw-r--r--   1 cloudera supergroup 527 2012-08-09 14:39 /user/cloudera/orders/part-m-00002
-rw-r--r--   1 cloudera supergroup 647 2012-08-09 14:39 /user/cloudera/orders/part-m-00003

# wow  there are four files part-m-0000x
# look inside one

# some of the lines are edited to fit on the screen
$ hcat /user/cloudera/orders/part-m-00002
1013,2008-06-22,104,express ,n,B77930    ,2008-07-10,60.80,12.20,2008-07-31
1014,2008-06-25,106,ring bell,  ,n,8052      ,2008-07-03,40.60,12.30,2008-07-10
1015,2008-06-27,110,        ,n,MA003     ,2008-07-16,20.60,6.30,2008-08-31
1016,2008-06-29,119, St.          ,n,PC6782    ,2008-07-12,35.00,11.80,null
1017,2008-07-09,120,use                 ,n,DM354331  ,2008-07-13,60.00,18.00,null


Why are there four different files, each containing only part of the data? Sqoop is a highly parallelized utility. If a 4,000-node cluster running Sqoop imported at full throttle from a database, those 4,000 connections would look a lot like a denial-of-service attack against the database. Sqoop's default limit is four JDBC connections. Each connection generates a data file in HDFS, hence the four files. Don't worry; you will see that Hadoop works across these files without any difficulty.
The next step is to import a DB2 table. As shown in Listing 17, specifying the -m 1 option lets you import a table that has no primary key, and the result is a single file.
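Listing 17 itself is not reproduced in this part, but a hedged sketch of what such a single-mapper DB2 import might look like is shown below. The table name staff is an assumption (a table from the DB2 SAMPLE database); the connection details reuse the ones from Listing 14, and nothing here is copied from the original listing.

# sketch: importing a DB2 table with one mapper, so no primary key is required
$ sqoop import --driver com.ibm.db2.jcc.DB2Driver \
--connect "jdbc:db2://192.168.1.131:50001/sample" \
--username db2inst1 -P \
--table staff -m 1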

To be continued; see the next part.
