Skip to content

Data Reader

DataReader plugin is specifically designed to generate data that meets certain rule requirements in development and testing environments.

In actual development and testing, we need to generate test data according to certain business rules, not just random content, such as ID card numbers, bank account numbers, stock codes, etc.

Why Reinvent the Wheel

Indeed, there are quite a few specialized data generation tools on the Internet, many of which are powerful and performant. However, most of these tools consider the data generation part but ignore the problem of writing data to the target end, or some consider it but only consider one or a limited number of databases.

It happens that the Addax tool can provide enough target-end writing capabilities, plus the existing Stream Reader is already a simple version of data generation tool. Therefore, adding some specific rules to this functionality and leveraging the diversity of the writing end naturally makes it a good data generation tool.

Configuration Example

Here I list all the rules currently supported by the plugin in the example below

json
{
  "job": {
    "setting": {
      "speed": {
        "byte": -1,
        "channel": 1
      },
      "errorLimit": {
        "record": 0,
        "percentage": 0.02
      }
    },
    "content": {
      "reader": {
        "name": "datareader",
        "parameter": {
          "column": [
            {
              "value": "1,100,",
              "rule": "random",
              "type": "double"
            },
            {
              "value": "DataX",
              "type": "string"
            },
            {
              "value": "1",
              "rule": "incr",
              "type": "long"
            },
            {
              "value": "1989/06/04 00:00:01,-1",
              "rule": "incr",
              "type": "date",
              "dateFormat": "yyyy/MM/dd hh:mm:ss"
            },
            {
              "value": "test",
              "type": "bytes"
            },
            {
              "rule": "address"
            },
            {
              "rule": "bank"
            },
            {
              "rule": "company"
            },
            {
              "rule": "creditCard"
            },
            {
              "rule": "debitCard"
            },
            {
              "rule": "idCard"
            },
            {
              "rule": "lat"
            },
            {
              "rule": "lng"
            },
            {
              "rule": "name"
            },
            {
              "rule": "job"
            },
            {
              "rule": "phone"
            },
            {
              "rule": "stockCode"
            },
            {
              "rule": "stockAccount"
            }
          ],
          "sliceRecordCount": 10
        }
      },
      "writer": {
        "name": "streamwriter",
        "parameter": {
          "print": true,
          "encoding": "UTF-8"
        }
      }
    }
  }
}

Save the above content to job/datareader2stream.json

Then execute this task, the output result is similar to the following:

Details
txt
$ bin/addax.sh job/datareader2stream.json
  ___      _     _
 / _ \    | |   | |
/ /_\ \ __| | __| | __ ___  __
|  _  |/ _` |/ _` |/ _` \ \/ /
| | | | (_| | (_| | (_| |>  <
\_| |_/\__,_|\__,_|\__,_/_/\_\

:: Addax version ::    (v4.0.2-SNAPSHOT)

2021-08-13 17:02:00.888 [        main] INFO  VMInfo               - VMInfo# operatingSystem class => com.sun.management.internal.OperatingSystemImpl
2021-08-13 17:02:00.910 [        main] INFO  Engine               -
{
	"content":
		{
			"reader":{
				"parameter":{
					"column":[
						{
							"rule":"random",
							"type":"double",
							"scale": "2",
							"value":"1,100,"
						},
						{
							"type":"string",
							"value":"DataX"
						},
						{
							"rule":"incr",
							"type":"long",
							"value":"1"
						},
						{
							"dateFormat":"yyyy/MM/dd hh:mm:ss",
							"rule":"incr",
							"type":"date",
							"value":"1989/06/04 00:00:01,-1"
						},
						{
							"type":"bytes",
							"value":"test"
						},
						{
							"rule":"address"
						},
						{
							"rule":"bank"
						},
						{
							"rule":"company"
						},
						{
							"rule":"creditCard"
						},
						{
							"rule":"debitCard"
						},
						{
							"rule":"idCard"
						},
						{
							"rule":"lat"
						},
						{
							"rule":"lng"
						},
						{
							"rule":"name"
						},
						{
							"rule":"job"
						},
						{
							"rule":"phone"
						},
						{
							"rule":"stockCode"
						},
						{
							"rule":"stockAccount"
						}
					],
					"sliceRecordCount":10
				},
				"name":"datareader"
			},
			"writer":{
				"parameter":{
					"print":true,
					"encoding":"UTF-8"
				},
				"name":"streamwriter"
			}
	},
	"setting":{
		"errorLimit":{
			"record":0,
			"percentage":0.02
		},
		"speed":{
			"byte":-1,
			"channel":1
		}
	}
}

2021-08-13 17:02:00.937 [        main] INFO  PerfTrace            - PerfTrace traceId=job_-1, isEnable=false, priority=0
2021-08-13 17:02:00.938 [        main] INFO  JobContainer         - Addax jobContainer starts job.
2021-08-13 17:02:00.940 [        main] INFO  JobContainer         - Set jobId = 0
2021-08-13 17:02:00.976 [       job-0] INFO  JobContainer         - Addax Reader.Job [datareader] do prepare work .
2021-08-13 17:02:00.977 [       job-0] INFO  JobContainer         - Addax Writer.Job [streamwriter] do prepare work .
2021-08-13 17:02:00.978 [       job-0] INFO  JobContainer         - Job set Channel-Number to 1 channels.
2021-08-13 17:02:00.979 [       job-0] INFO  JobContainer         - Addax Reader.Job [datareader] splits to [1] tasks.
2021-08-13 17:02:00.980 [       job-0] INFO  JobContainer         - Addax Writer.Job [streamwriter] splits to [1] tasks.
2021-08-13 17:02:01.002 [       job-0] INFO  JobContainer         - Scheduler starts [1] taskGroups.
2021-08-13 17:02:01.009 [ taskGroup-0] INFO  TaskGroupContainer   - taskGroupId=[0] start [1] channels for [1] tasks.
2021-08-13 17:02:01.017 [ taskGroup-0] INFO  Channel              - Channel set byte_speed_limit to -1, No bps activated.
2021-08-13 17:02:01.017 [ taskGroup-0] INFO  Channel              - Channel set record_speed_limit to -1, No tps activated.

7.65	DataX	1	1989-06-04 00:00:01	test	天津市南京县长寿区光明路263号	交通银行	易动力信息有限公司	6227894836568607	6235712610856305437	450304194808316766	31.3732613	-125.3507716	龚军	机电工程师	13438631667	726929	8741848665
18.58	DataX	2	1989-06-03 00:00:01	test	江苏省太原市浔阳区东山路33号	中国银行	时空盒数字信息有限公司	4096666711928233	6217419359154239015	220301200008188547	48.6648764	104.8567048	匡飞	化妆师	18093137306	006845	1815787371
16.16	DataX	3	1989-06-02 00:00:01	test	台湾省邯郸市清河区万顺路10号	大同商行	开发区世创科技有限公司	4096713966912225	6212977716107080594	150223196408276322	29.0134395	142.6426842	支波	审核员	13013458079	020695	3545552026
63.89	DataX	4	1989-06-01 00:00:01	test	上海市辛集县六枝特区甘园路119号	中国农业银行	泰麒麟传媒有限公司	6227893481508780	6215686558778997167	220822196208286838	-71.6484635	111.8181273	敬坤	房地产客服	13384928291	174445	0799668655
79.18	DataX	5	1989-05-31 00:00:01	test	陕西省南京市朝阳区大胜路170号	内蒙古银行	晖来计算机信息有限公司	6227535683896707	6217255315590053833	350600198508222018	-24.9783587	78.017024	蒋杨	固定资产会计	18766298716	402188	9633759917
14.97	DataX	6	1989-05-30 00:00:01	test	海南省长春县璧山区碧海街147号	华夏银行	浙大万朋科技有限公司	6224797475369912	6215680436662199846	220122199608190275	-3.5088667	-40.2634359	边杨	督导/巡店	13278765923	092780	2408887582
45.49	DataX	7	1989-05-29 00:00:01	test	台湾省潜江县梁平区七星街201号	晋城商行	开发区世创信息有限公司	5257468530819766	6213336008535546044	141082197908244004	-72.9200596	120.6018163	桑明	系统工程师	13853379719	175864	8303448618
8.45	DataX	8	1989-05-28 00:00:01	test	海南省杭州县城北区天兴路11号	大同商行	万迅电脑科技有限公司	6227639043120062	6270259717880740332	430405198908214042	-16.5115338	-39.336119	覃健	人事总监	13950216061	687461	0216734574
15.01	DataX	9	1989-05-27 00:00:01	test	云南省惠州市和平区海鸥街201号	内蒙古银行	黄石金承信息有限公司	6200358843233005	6235730928871528500	130300195008312067	-61.646097	163.0882369	卫建华	电话采编	15292600492	001658	1045093445
55.14	DataX	10	1989-05-26 00:00:01	test	辽宁省兰州市徐汇区东山街176号	廊坊银行	创汇科技有限公司	6227605280751588	6270262330691012025	341822200908168063	77.2165746	139.5431377	池浩	多媒体设计	18693948216	201678	0692522928

2021-08-13 17:02:04.020 [       job-0] INFO  AbstractScheduler    - Scheduler accomplished all tasks.
2021-08-13 17:02:04.021 [       job-0] INFO  JobContainer         - Addax Writer.Job [streamwriter] do post work.
2021-08-13 17:02:04.022 [       job-0] INFO  JobContainer         - Addax Reader.Job [datareader] do post work.
2021-08-13 17:02:04.025 [       job-0] INFO  JobContainer         - PerfTrace not enable!
2021-08-13 17:02:04.028 [       job-0] INFO  StandAloneJobContainerCommunicator - Total 10 records, 1817 bytes | Speed 605B/s, 3 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 100.00%
2021-08-13 17:02:04.030 [       job-0] INFO  JobContainer         -
任务启动时刻                    : 2021-08-13 17:02:00
任务结束时刻                    : 2021-08-13 17:02:04
任务总计耗时                    :                  3s
任务平均流量                    :              605B/s
记录写入速度                    :              3rec/s
读出记录总数                    :                  10
读写失败总数                    :                   0

Configuration Description

The configuration of column is slightly different from other plugins. A field consists of the following configuration items:

ConfigurationRequiredDefault ValueExampleDescription
valueNoNoneAddaxData value, required in some cases
ruleNoconstantidCardData generation rule, detailed description below
typeNostringdoubleData value type
dateFormatNoyyyy-MM-dd HH:mm:ssyyyy/MM/dd HH:mm:ssDate format, only valid when type is date

rule Description

The core of this plugin's field configuration is the rule field, which is used to indicate what kind of data should be generated, and according to different rules, combined with other configuration options to generate data that meets expectations. Currently, the configurations of rule are all built-in supported rules, custom rules are not supported yet. Detailed description follows:

constant

constant is the default configuration of rule. This rule means that the data value to be generated is determined by the value configuration item, with no changes. For example:

json
{
  "value": "Addax",
  "type": "string",
  "rule": "constant"
}

This means the data value generated by this field is all Addax

incr

The meaning of incr configuration item is consistent with the incr meaning in the streamreader plugin, indicating this is an incremental data generation rule. For example:

json
{
  "value": "1,2",
  "rule": "incr",
  "type": "long"
}

This means the field data is a long integer, starting from 1, incrementing by 2 each time, forming an incremental sequence starting from 1 with step size 2.

For more detailed configuration rules and precautions for this field, refer to the incr description in streamreader.

random

The meaning of random configuration item is consistent with the random meaning in the streamreader plugin, indicating this is an incremental data generation rule. For example:

json
{
  "value": "1,10",
  "rule": "random",
  "type": "string"
}

This means the field data is a random string with length 1 to 10 (both 1 and 10 included).

For more detailed configuration rules and precautions for this field, refer to the random description in streamreader.

Rule NameMeaningExampleData TypeDescription
addressRandomly generate address information that basically meets domestic actual conditions辽宁省兰州市徐汇区东山街176号string
bankRandomly generate a domestic bank name华夏银行string
companyRandomly generate a company name万迅电脑科技有限公司string
creditCardRandomly generate a credit card number430405198908214042string16 digits
debitCardRandomly generate a debit card number6227894836568607string19 digits
emailRandomly generate an email address[email protected]string
idCardRandomly generate a domestic ID card number350600198508222018string18 digits, follows validation rules, first 6 digits meet administrative division requirements
latRandomly generate latitude data48.6648764doubleFixed 7 decimal places, can also use latitude
lngRandomly generate longitude data120.6018163doubleFixed 7 decimal places, can also use longitude
nameRandomly generate a domestic name池浩stringCurrently doesn't consider surname distribution in China
jobRandomly generate a domestic job title系统工程师stringData source from recruitment websites
phoneRandomly generate a domestic mobile phone number15292600492stringCurrently doesn't consider virtual phone numbers
stockCodeRandomly generate a 6-digit stock code687461stringFirst two digits meet domestic stock code standards
stockAccountRandomly generate a 10-digit stock trading account0692522928stringCompletely random, doesn't meet account standards
uuidRandomly generate a UUID stringbc1cf125-929b-43b7-b324-d7c4cc5a75d2stringCompletely random
zipCodeRandomly generate a domestic postal code411105longDoesn't fully meet domestic postal code standards

Note: The data types returned by the rules in the above table are fixed and cannot be modified, so type doesn't need to be configured, and the configured type will be ignored. Since data generation comes from internal rules, value also doesn't need to be configured, and the configured content will be ignored.