DRILL-5546: Handle schema change exception failure caused by empty input or empty batch.

1. Modify ScanBatch's logic when it iterates over its list of RecordReaders.
   1) Skip a RecordReader if it returns 0 rows and presents the same schema. A new schema (detected by calling Mutator.isNewSchema()) means that a new top-level field was added, a field was added inside a nested field, or an existing field's type changed.
   2) Implicit columns are presumed to have a constant schema and are added to the outgoing container before any regular column.
   3) ScanBatch returns NONE directly (called "fast NONE") if all of its RecordReaders have empty input and are therefore skipped, instead of returning OK_NEW_SCHEMA first.

2. Modify IteratorValidatorBatchIterator to allow
   1) fast NONE (before seeing an OK_NEW_SCHEMA)
   2) a batch with an empty list of columns.

3. Modify JsonRecordReader for the case where it gets 0 rows. Do not insert a nullable-int column for 0-row input. Together with the ScanBatch change, Drill will skip empty JSON files.

4. Modify binary operators such as join and union to handle fast NONE on either one side or both sides. Abstract the logic into AbstractBinaryRecordBatch, except for MergeJoin, whose implementation is quite different from the others.

5. Fix and refactor the union all operator.
   1) Correct the union operator's handling of 0 input rows. Previously, it ignored inputs with 0 rows and put nullable-int into the output schema, which caused various schema change issues in downstream operators. The new behavior is to take the schema of an input with 0 rows into account when determining the output schema, in the same way as for an input with > 0 rows. By doing that, we ensure the Union operator does not behave like a schema-lossy operator.
   2) Add a UnionInputIterator to simplify the logic that iterates over the left/right inputs, removing a significant chunk of duplicate code from the previous implementation. The new union all operator cuts the code size in half compared with the old one.

6. Introduce UntypedNullVector to handle the convertFromJSON() function when the input batch contains 0 rows.
   Problem: The function convertFromJSON() differs from regular functions in that it only knows the output schema after evaluation is performed. When the input has 0 rows, Drill essentially has no way to know the output type, and previously assumed Map type. That worked under the assumption that other operators such as Union would ignore a batch with 0 rows, which is no longer the case in the current implementation.
   Solution: Use MinorType.NULL as the output type for convertFromJSON() when the input contains 0 rows. The new UntypedNullVector is used to represent a column with MinorType.NULL.

7. HBaseGroupScan converts the star column into a list of row_key and column families. HBaseRecordReader should reject the star column, since it expects the star to have been converted somewhere else. In HBase, a column family always has map type, and a non-rowkey column always has nullable varbinary type. This ensures that HBaseRecordReaders across different HBase regions have the same top-level schema, even if a region is empty or all its rows are pruned by filter pushdown optimization. In other words, we will not see different top-level schemas from different HBaseRecordReaders for the same table. However, this change cannot handle a hard schema change: c1 exists in cf1 in one region, but not in another region. Further work is required to handle hard schema changes.

8. Modify scan cost estimation when the query involves the * column. This removes planning randomness, since previously two different operators could have the same cost.

9. Add a new flag 'outputProj' to the Project operator to indicate whether the Project is for the query's final output. Such a Project is added by TopProjectVisitor to handle fast NONE when all the inputs to the query are empty and are skipped.
   1) A star column is replaced with an empty list.
   2) A regular column reference is replaced with a nullable-int column.
   3) An expression goes through ExpressionTreeMaterializer, and the type of the materialized expression is used as the output type.
   4) Return an OK_NEW_SCHEMA with the schema built using the above logic, then return a NONE to the downstream operator.

10. Add unit tests for operators handling empty input.

11. Add unit tests for queries whose inputs are all empty.

DRILL-5546: Revise code based on review comments.

Handle implicit columns in scan batch. Change the interface of ScanBatch's constructor.
 1) Ensure either the implicit column list is empty, or all the readers have the same set of implicit columns.
 2) We can skip the implicit columns when checking whether there is a schema change coming from a record reader.
 3) ScanBatch accepts a list instead of an iterator, since we may need to go through the implicit column list multiple times and verify that the sizes of the two lists are the same.

ScanBatch code review comments. Add more unit tests.

Share the code path in ProjectBatch to handle normal setupNewSchema() and handleNullInput().
 - Move SimpleRecordBatch out of TopNBatch to make it sharable across different places.
 - Add a unit test to verify the schema for a star column query against multilevel tables.

Unit test framework change
 - Fix memory leak in unit test framework.
 - Allow SchemaTestBuilder to pass in BatchSchema.

close #906
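
The ScanBatch behavior described in item 1 (skip readers with 0 rows and an unchanged schema, and emit NONE directly when every reader is skipped) can be illustrated with a minimal sketch. The class FastNoneSketch, the ReaderResult type, and the scan() method are hypothetical stand-ins invented for this illustration; they are not Drill's actual ScanBatch or RecordReader API, and the real implementation also handles implicit columns and vector allocation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; none of these types are Drill classes. */
class FastNoneSketch {
  /** Outcome names mirror the ones used in the commit message. */
  enum Outcome { OK_NEW_SCHEMA, OK, NONE }

  /** What one reader produced: its row count and whether it changed the schema. */
  static final class ReaderResult {
    final int rowCount;
    final boolean newSchema;

    ReaderResult(int rowCount, boolean newSchema) {
      this.rowCount = rowCount;
      this.newSchema = newSchema;
    }
  }

  /** Returns the sequence of outcomes a downstream operator would observe. */
  static List<Outcome> scan(List<ReaderResult> readers) {
    List<Outcome> outcomes = new ArrayList<>();
    boolean schemaSent = false;
    for (ReaderResult r : readers) {
      // Skip a reader that returned 0 rows and presented the same schema.
      if (r.rowCount == 0 && !r.newSchema) {
        continue;
      }
      // The first emitted batch, or any batch with a changed schema, announces the schema.
      if (!schemaSent || r.newSchema) {
        outcomes.add(Outcome.OK_NEW_SCHEMA);
        schemaSent = true;
      } else {
        outcomes.add(Outcome.OK);
      }
    }
    // If every reader was skipped, NONE is the only outcome ("fast NONE").
    outcomes.add(Outcome.NONE);
    return outcomes;
  }
}
```

In this sketch, two readers that both return 0 rows without a schema change yield only NONE, while an empty reader followed by a non-empty one yields OK_NEW_SCHEMA and then NONE.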
author: Jinfeng Ni <jni@apache.org> 2017-05-17 16:08:00 -0700
committer: Jinfeng Ni <jni@apache.org> 2017-09-05 12:07:23 -0700
commit: fde0a1df1734e0742b49aabdd28b02202ee2b044
tree: f5d408914895d1b9bea8cdc86bab26365ed8c81d /exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillScanRel.java
parent: e1649dd7d9fb2c30632f4df6ea17c483379c9775
1 files changed, 2 insertions, 8 deletions
diff --git a/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillScanRel.java b/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillScanRel.java
index 7e4483bcf..df80a10fd 100644
--- a/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillScanRel.java
+++ b/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillScanRel.java
@@ -42,8 +42,7 @@ import org.apache.calcite.plan.RelTraitSet;
 import org.apache.calcite.rel.type.RelDataType;
 
 import com.google.common.base.Preconditions;
-import com.google.common.base.Predicate;
-import com.google.common.collect.Iterables;
+import org.apache.drill.exec.util.Utilities;
 
 /**
  * GroupScan of a Drill table.
@@ -160,12 +159,7 @@ public class DrillScanRel extends DrillScanRelBase implements DrillRel {
     final ScanStats stats = groupScan.getScanStats(settings);
     int columnCount = getRowType().getFieldCount();
     double ioCost = 0;
-    boolean isStarQuery = Iterables.tryFind(getRowType().getFieldNames(), new Predicate<String>() {
-      @Override
-      public boolean apply(String input) {
-        return Preconditions.checkNotNull(input).equals("*");
-      }
-    }).isPresent();
+    boolean isStarQuery = Utilities.isStarQuery(columns);
 
     if (isStarQuery) {
       columnCount = STAR_COLUMN_COST;
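
The body of Utilities.isStarQuery is not part of this diff. As a rough sketch, assuming it performs the same check as the removed anonymous Predicate (whether any projected column equals "*"), the logic could look like the following. The class StarQuerySketch and its use of plain String names are assumptions made for illustration; note also that the removed code inspected getRowType().getFieldNames(), whereas the new call passes `columns`.

```java
import java.util.Collection;
import java.util.Objects;

/** Hypothetical stand-in; not the actual org.apache.drill.exec.util.Utilities class. */
class StarQuerySketch {
  static final String STAR = "*";

  /** Returns true if any projected column name is the star column. */
  static boolean isStarQuery(Collection<String> projected) {
    // Mirrors the removed Predicate: null names are rejected, and the search stops at the first "*".
    return projected.stream()
        .anyMatch(name -> Objects.requireNonNull(name).equals(STAR));
  }
}
```

Moving the check into a shared helper lets the cost estimation and other callers agree on what counts as a star query, instead of each repeating an anonymous Predicate.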