Method 1:
Reference: https://blog.csdn.net/GCR8949/article/details/80155064

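import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.ZipInputStream

import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.SparkSession

// Assumption: a local session. The original called getLocalSparkSession(), which is not defined here.
val spark = SparkSession.builder().master("local[*]").appName("ReadZip").getOrCreate()
import spark.implicits._

// binaryFiles yields one (path, stream) pair per file, so each zip is read by a single task.
val binaryRDD = spark.sparkContext.binaryFiles("XXX.zip")
val dataRDD = binaryRDD.flatMap {
  case (name: String, content: PortableDataStream) =>
    val zis = new ZipInputStream(content.open())
    // Walk the zip entries; getNextEntry returns null after the last one.
    Iterator.continually(zis.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { _ =>
        // Read the current entry line by line (assumes UTF-8 text); readLine returns null at the end of the entry.
        val br = new BufferedReader(new InputStreamReader(zis, StandardCharsets.UTF_8))
        Iterator.continually(br.readLine()).takeWhile(_ != null)
      }
}
dataRDD.take(10).foreach(println)

// Each line is expected to hold one JSON document. json(RDD[String]) is deprecated since Spark 2.2, so pass a Dataset[String].
spark.read.json(dataRDD.toDS()).show(100)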
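
The entry-walking logic in Method 1 does not depend on Spark, so it can be pulled out and checked locally before running a job. The following is a minimal sketch under that assumption; `ZipLines`, `zipLines`, and `makeZip` are hypothetical names introduced here, and the entries are assumed to be UTF-8 text.

```scala
import java.io.{BufferedReader, ByteArrayInputStream, ByteArrayOutputStream, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

object ZipLines {
  // Hypothetical helper: lazily yields every text line of every entry in a zip stream.
  def zipLines(in: InputStream): Iterator[String] = {
    val zis = new ZipInputStream(in)
    Iterator.continually(zis.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { _ =>
        // A fresh reader per entry; ZipInputStream reports end-of-stream at the entry boundary.
        val br = new BufferedReader(new InputStreamReader(zis, StandardCharsets.UTF_8))
        Iterator.continually(br.readLine()).takeWhile(_ != null)
      }
  }

  // Hypothetical helper: builds a small in-memory zip from (entry name, text) pairs.
  def makeZip(entries: Seq[(String, String)]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val zos = new ZipOutputStream(bos)
    entries.foreach { case (name, text) =>
      zos.putNextEntry(new ZipEntry(name))
      zos.write(text.getBytes(StandardCharsets.UTF_8))
      zos.closeEntry()
    }
    zos.close()
    bos.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val bytes = makeZip(Seq(
      "a.json" -> "{\"id\":1}\n{\"id\":2}\n",
      "b.json" -> "{\"id\":3}\n"))
    // Prints the lines of a.json followed by the lines of b.json.
    zipLines(new ByteArrayInputStream(bytes)).foreach(println)
  }
}
```

Inside the Spark job, the same function could be applied as `binaryRDD.flatMap { case (_, content) => ZipLines.zipLines(content.open()) }`.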

Method 2:
Use spark.sparkContext.newAPIHadoopRDD.
Reference: https://www.thinbug.com/q/28569788

newAPIHadoopRDD
https://blog.csdn.net/zpf_940810653842/article/details/104815533
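
The notes for Method 2 give only links, so here is one possible shape of it, as a sketch rather than the referenced authors' code. It assumes a custom Hadoop InputFormat that emits one record per zip entry; `ZipFileInputFormat`, `ZipFileRecordReader`, `ZipViaHadoop`, and `readZipLines` are hypothetical names introduced for illustration. It also assumes each entry is UTF-8 text small enough to hold in memory.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets
import java.util.zip.ZipInputStream

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, Job, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical input format: one record per zip entry (entry name -> entry bytes).
class ZipFileInputFormat extends FileInputFormat[Text, BytesWritable] {
  // A zip archive cannot be read from an arbitrary offset, so never split it.
  override protected def isSplitable(context: JobContext, filename: Path): Boolean = false

  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[Text, BytesWritable] =
    new ZipFileRecordReader
}

class ZipFileRecordReader extends RecordReader[Text, BytesWritable] {
  private var zis: ZipInputStream = null
  private var key: Text = null
  private var value: BytesWritable = null
  private var finished = false

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
    val path = split.asInstanceOf[FileSplit].getPath
    val fs = path.getFileSystem(context.getConfiguration)
    zis = new ZipInputStream(fs.open(path))
  }

  override def nextKeyValue(): Boolean = {
    // Skip directory entries; they carry no data.
    var entry = zis.getNextEntry
    while (entry != null && entry.isDirectory) entry = zis.getNextEntry
    if (entry == null) {
      finished = true
      false
    } else {
      // Copy the whole entry into memory (assumption: entries are small).
      val out = new ByteArrayOutputStream()
      val buf = new Array[Byte](8192)
      var n = zis.read(buf)
      while (n != -1) { out.write(buf, 0, n); n = zis.read(buf) }
      key = new Text(entry.getName)
      value = new BytesWritable(out.toByteArray)
      true
    }
  }

  override def getCurrentKey(): Text = key
  override def getCurrentValue(): BytesWritable = value
  override def getProgress(): Float = if (finished) 1.0f else 0.0f
  override def close(): Unit = if (zis != null) zis.close()
}

object ZipViaHadoop {
  // Reads every line of every entry in the zip file(s) at `path`.
  def readZipLines(sc: SparkContext, path: String): RDD[String] = {
    // newAPIHadoopRDD takes its input path from the Hadoop configuration.
    val job = Job.getInstance(sc.hadoopConfiguration)
    FileInputFormat.setInputPaths(job, new Path(path))
    sc.newAPIHadoopRDD(job.getConfiguration,
        classOf[ZipFileInputFormat], classOf[Text], classOf[BytesWritable])
      // Writables are not serializable, so convert to String right away.
      .flatMap { case (_, bytes) =>
        new String(bytes.copyBytes(), StandardCharsets.UTF_8).split("\\r?\\n")
      }
      .filter(_.nonEmpty)
  }
}
```

With a session named `spark` and `import spark.implicits._` in scope, `spark.read.json(ZipViaHadoop.readZipLines(spark.sparkContext, "XXX.zip").toDS())` would mirror the last step of Method 1. Compared with Method 1, this keeps the zip handling in a reusable input format, at the cost of loading each entry fully into memory in this sketch.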
