image

Querying Avro Files With Presto. It is possible!

Presto’s Hive connector proudly declares that it supports AVRO. However, if you’ve ever tried to query a Hive built AVRO table from Presto, you may have gotten a pretty cryptic error message. cannot find field id from [0:error_error_error_error_error_error_error, 1:cannot_determine_schema, 2:check, 3:schema, 4:url, 5:and, 6:literal]

If you try searching for solutions to this problem, or Avro Presto issues in general, it’s not particularly promising. I found this: https://groups.google.com/forum/#!topic/presto-users/uPApa6ih2B0

"If you can, I would simply avoid this format entirely."

In this instance, the engineering team had given up on the problem 6 months ago, and lots of forum posts about how Presto's AVRO support seems fundamentally broken. It turns out that there’s a pretty simple fix.

Presto can’t deal with schema’s stored in referenced external files. Instead, you can place the schema in the create table statement:

CREATE EXTERNAL TABLE people ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' WITH SERDEPROPERTIES ('avro.schema.literal'=' { "type" : "record", "name" : “people", "namespace" : "testing.hive.avro.serde", "fields" : [ { "name" : “first", "type" : "string", "doc" : “first name" }, { "name" : “last", "type" : "string", "doc" : “last name" }, { "name" : “record_id", "type" : "int", "doc" : “record id" } ] } ‘)

If you need to extract the schema from your AVRO files, it’s pretty simple. Just use avro_tools.jar: https://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/