# HDFS Reader

HDFS Reader provides the ability to read data stored on the Hadoop Distributed File System (HDFS).
Currently HdfsReader supports the following file formats:
- textfile(text)
- orcfile(orc)
- rcfile(rc)
- sequence file(seq)
- csv(csv)
- parquet
## Features and Limitations
- Supports textfile, orcfile, parquet, rcfile, sequence file and csv format files, and requires that the file content stores a logically two-dimensional table.
- Supports reading multiple types of data (represented using String), supports column pruning, supports column constants
- Supports recursive reading and regular expressions (`*` and `?`).
- Supports common compression algorithms, including GZIP, SNAPPY, ZLIB, etc.
- Supports concurrent reading of multiple files.
- Supports sequence file compression; currently only the lzo method is supported.
- csv type supports compression formats: gzip, bz2, zip, lzo, lzo_deflate, snappy.
- The plugin currently builds against Hive 3.1.1 and Hadoop 3.1.1, and works normally in Hadoop 2.7.x, Hadoop 3.1.x, Hive 2.x, and Hive 3.1.x test environments; other versions are theoretically supported, but please test further before using them in production.
- Supports Kerberos authentication.
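As an illustration of the wildcard semantics mentioned above (a sketch using Python's `fnmatch`, not the plugin's actual implementation): `*` matches any run of characters, while `?` matches exactly one character.

```python
from fnmatch import fnmatch

# Hypothetical file listing, for illustration only; HdfsReader itself
# expands wildcards against the real HDFS namespace.
files = [
    "/user/hive/warehouse/mytable01/part-00000",
    "/user/hive/warehouse/mytable01/part-00001",
    "/user/hive/warehouse/mytable02/part-00000",
]

# '*' matches any run of characters.
matched = [f for f in files if fnmatch(f, "/user/hive/warehouse/mytable01/*")]
print(matched)  # the two part files under mytable01

# '?' matches exactly one character, so this selects both tables' part-00000.
first_parts = [f for f in files if fnmatch(f, "/user/hive/warehouse/mytable0?/part-00000")]
print(first_parts)
```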
## Configuration Example

```json
{
  "job": {
    "setting": {
      "speed": {
        "channel": 3,
        "bytes": -1
      }
    },
    "content": {
      "reader": {
        "name": "hdfsreader",
        "parameter": {
          "path": "/user/hive/warehouse/mytable01/*",
          "defaultFS": "hdfs://xxx:port",
          "column": [
            {
              "index": 0,
              "type": "long"
            },
            {
              "index": 1,
              "type": "boolean"
            },
            {
              "type": "string",
              "value": "hello"
            },
            {
              "index": 2,
              "type": "double"
            }
          ],
          "fileType": "orc",
          "encoding": "UTF-8",
          "fieldDelimiter": ","
        }
      },
      "writer": {
        "name": "streamwriter",
        "parameter": {
          "print": true
        }
      }
    }
  }
}
```

## Configuration Parameters
| Configuration | Required | Data Type | Default Value | Description |
|---|---|---|---|---|
| path | Yes | string | None | File path to read |
| defaultFS | Yes | string | None | HDFS NameNode address; if HA mode is configured, this is the value of `fs.defaultFS` |
| fileType | Yes | string | None | File type |
| column | Yes | list<map> | None | List of fields to read |
| fieldDelimiter | No | char | , | Field delimiter for text files; binary file types do not need this |
| encoding | No | string | utf-8 | File encoding configuration, currently only supports utf-8 |
| nullFormat | No | string | None | String that represents null; e.g. if configured as `"\\N"`, source data `"\N"` is treated as a null field |
| haveKerberos | No | boolean | None | Whether to enable Kerberos authentication; if enabled, the following two items must also be configured |
| kerberosKeytabFilePath | No | string | None | Kerberos authentication credential file path, e.g. /your/path/addax.service.keytab |
| kerberosPrincipal | No | string | None | Kerberos authentication credential principal, e.g. addax/[email protected] |
| compress | No | string | None | Specify compression format of files to read |
| hadoopConfig | No | map | None | Can configure some advanced parameters related to Hadoop, such as HA configuration |
| hdfsSitePath | No | string | None | Path to hdfs-site.xml, detailed explanation below |
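When `defaultFS` points at an HA nameservice, the client-side HA settings can be passed through `hadoopConfig`. A sketch using the standard HDFS client properties, assuming a nameservice named `cluster1` and two placeholder NameNode hosts (the names are illustrative, not defaults):

```json
"hadoopConfig": {
  "dfs.nameservices": "cluster1",
  "dfs.ha.namenodes.cluster1": "nn1,nn2",
  "dfs.namenode.rpc-address.cluster1.nn1": "namenode1.example.com:8020",
  "dfs.namenode.rpc-address.cluster1.nn2": "namenode2.example.com:8020",
  "dfs.client.failover.proxy.provider.cluster1": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
}
```

Alternatively, pointing `hdfsSitePath` at an `hdfs-site.xml` that already contains these properties can serve the same purpose.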