CSV and Parquet

While CSV is great for readability, when working within Spark, Parquet is the format of choice to speed things up. More details on Apache Parquet can be found here. Essentially, Parquet provides columnar storage that enables complex data to be encoded efficiently in bulk. The difference between a columnar structure and a traditional row-oriented database structure lies in how the data is fundamentally organized: values for each column are stored together, so scans and reads across large data sets can be optimized. Parquet offers a clear performance advantage over CSV, especially when dealing with large data sets. Here is an excellent article that elegantly articulates the benefits.
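To make the Spark angle concrete, here is a minimal PySpark sketch that selects a single column from the same data stored as CSV and as Parquet. The file paths and column name are placeholders (not from the data set used below); the point is that with Parquet, Spark can prune down to just the requested column chunks instead of scanning whole rows.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-vs-parquet").getOrCreate()

# CSV is row-oriented: even a single-column query forces a scan of every full row
csv_df = spark.read.option("header", "true").csv("/path/to/data.csv")
csv_df.select("some_column").show(5)

# Parquet is columnar: Spark reads only the column chunks it needs
parquet_df = spark.read.parquet("/path/to/data.parquet")
parquet_df.select("some_column").show(5)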

To convert CSV data to Parquet, first get Apache Drill, which is pretty straightforward:

wget http://mirrors.sonic.net/apache/drill/drill-1.5.0/apache-drill-1.5.0.tar.gz
tar -xvf apache-drill-1.5.0.tar.gz
cd apache-drill-1.5.0/bin
./drill-embedded

which starts up Apache Drill.

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Mar 09, 2016 9:46:33 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.5.0
"start your sql engine"
0: jdbc:drill:zk=local>

Next, get the CSV data that needs to be converted. But first, let's check out the CSV, then set the storage format to Parquet. Note that the character used in the queries below is a backtick ` (on the key below ~), not a single quote ‘ (on the key below “).

0: jdbc:drill:zk=local> select columns[3] from dfs.`/home/xxx/Development/Data/Education/output/MERGED2013_PP.csv`;
0: jdbc:drill:zk=local> alter session set `store.format`='parquet';
+-------+------------------------+
|  ok   |        summary         |
+-------+------------------------+
| true  | store.format updated.  |
+-------+------------------------+
1 row selected (0.103 seconds)

0: jdbc:drill:zk=local> create table dfs.tmp.`/output/pData/` as select * from dfs.`/home/xxx/Development/Data/Education/output/MERGED2013_PP.csv`;
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 7805                       |
+-----------+----------------------------+
1 row selected (4.445 seconds)

0: jdbc:drill:zk=local> select columns[3] from dfs.tmp.`/output/pData/*`;


The output goes to /tmp/output/pData, since dfs.tmp maps to the /tmp workspace in the dfs storage plugin configuration (shown below).
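Once converted, the Parquet output can be read directly from Spark. Here is a minimal sketch, assuming a local Spark session and the default /tmp/output/pData location from above; as suggested by the columns[3] query earlier, a CSV converted via select * with no header extraction should appear as a single array column named columns.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-drill-parquet").getOrCreate()

# Load the Parquet files that Drill wrote out
df = spark.read.parquet("/tmp/output/pData")
df.printSchema()

# Equivalent of Drill's `select columns[3] ...`: take the fourth element of the array column
df.selectExpr("columns[3]").show(5)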

If the above output location needs to be changed, navigate to http://localhost:8047/storage/dfs and update the /tmp location in the dfs storage plugin configuration:

{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json"
    },
    "avro": {
      "type": "avro"
    },
    "sequencefile": {
      "type": "sequencefile",
      "extensions": [
        "seq"
      ]
    },
    "csvh": {
      "type": "text",
      "extensions": [
        "csvh"
      ],
      "extractHeader": true,
      "delimiter": ","
    }
  }
}
