24 Oct 25

This blog post shares the key professional and technical lessons the author learned over four years of working in data engineering, covering topics such as infrastructure, pipelines, and career growth.

by tmfnk 2 months ago

The Data Engineering Handbook is an open-source GitHub repo with links to everything you’d ever want to learn about data engineering. It serves as a comprehensive guide and curriculum to help aspiring and current professionals master the skills and tools a Data Engineer needs.

by tmfnk 2 months ago saved 2 times

This blog post offers a tutorial on mastering PySpark SQL, guiding readers through the core concepts from basic data manipulation to advanced querying techniques for large-scale data processing.

by tmfnk 2 months ago

08 Oct 15

- “Spark as a Service”: Simple REST interface (including HTTPS) for all aspects of job, context management
- Support for Spark SQL, Hive, Streaming Contexts/jobs and custom job contexts! See Contexts.
- LDAP Auth support via Apache Shiro integration
- Supports sub-second low-latency jobs via long-running job contexts
- Start and stop job contexts for RDD sharing and low-latency jobs; change resources on restart
- Kill running jobs via stop context and delete job
- Separate jar uploading step for faster job startup
- Asynchronous and synchronous job API. Synchronous API is great for low latency jobs!
- Preliminary support for Java (see JavaSparkJob)
- Works with Standalone Spark as well as Mesos and yarn-client
- Job and jar info is persisted via a pluggable DAO interface
- Named RDDs to cache and retrieve RDDs by name, improving RDD sharing and reuse among jobs.
- Supports Scala 2.10 and 2.11

by wheresalice 10 years ago