A data source implementation for Apache Spark that enables reading XLSX (Excel) files directly into Spark DataFrames.
Spark XLSX is a library that extends Apache Spark's data source API to support reading Microsoft Excel XLSX files. It allows you to seamlessly integrate Excel data into your Spark data processing pipelines.
- Read XLSX files directly into Spark DataFrames
- Support for schema definition
- Simple integration with Spark's standard read API
- Compatible with Apache Spark 3.5.0+
Add the following dependency to your project:
<dependency>
    <groupId>dev.victoryil</groupId>
    <artifactId>spark-xlsx</artifactId>
    <version>0.1.0</version>
</dependency>

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;

// Assumes an existing SparkSession named spark

// Define your schema
StructType schema = StructType.fromDDL("id string, name string");

// Read an XLSX file
Dataset<Row> df = spark.read()
        .schema(schema)
        .format("dev.victoryil.spark.Xlsx")
        .load("path/to/your/file.xlsx");

// Show the data
df.show();

// Alternatively, pass the file path as an option
Dataset<Row> df = spark.read()
        .schema(schema)
        .format("dev.victoryil.spark.Xlsx")
        .option("path", "path/to/your/file.xlsx")
        .load();

Requirements:
- Apache Spark 3.5.0+
- Java 11+
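
The usage example passes a DDL string such as "id string, name string" to StructType.fromDDL. When every column should be read as a string, that DDL can be generated from a list of column names. The following is a minimal sketch using only the Java standard library; `SchemaDdl` and `allStringColumns` are hypothetical names for illustration and are not part of this library.

```java
import java.util.List;
import java.util.stream.Collectors;

final class SchemaDdl {

    // Hypothetical helper (not part of spark-xlsx): builds a DDL string
    // in which every given column is typed as string.
    static String allStringColumns(List<String> columnNames) {
        if (columnNames.isEmpty()) {
            throw new IllegalArgumentException("at least one column name is required");
        }
        return columnNames.stream()
                .map(name -> name + " string")
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        String ddl = allStringColumns(List.of("id", "name"));
        System.out.println(ddl); // prints: id string, name string
        // The result could then be passed to StructType.fromDDL(ddl).
    }
}
```

Column names that contain spaces or other special characters, which are common in spreadsheet headers, would need backtick quoting in the DDL; this sketch does not handle that case.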
For more detailed information on usage and configuration options, please see the Usage Documentation.
If you're interested in contributing to the project, check out our Contribution Guidelines.
This project is configured for deployment to Maven Central via Central Sonatype. Before deploying a new version, complete the following setup:
- Create a Sonatype account
- Request access to the dev.victoryil group ID in Central Sonatype
- Set up GPG signing keys on your machine
- Configure your Maven settings.xml with your Sonatype token
  - Copy the template settings file from .mvn/settings.xml to ~/.m2/settings.xml (or merge it with your existing settings)
  - Update the file with your Sonatype token
  - Configure your GPG key information if needed
The project includes a Bash script that automates the entire release process:
./release.sh -r RELEASE_VERSION -d DEVELOPMENT_VERSION -u MAVEN_USERNAME -p MAVEN_PASSWORD \
    -k "$(cat your_gpg_key.asc)" -i GPG_KEY_ID -s GPG_PASSPHRASE [-g GITHUB_TOKEN]

Arguments:
- -r RELEASE_VERSION: The version to release (e.g., 0.1.0)
- -d DEVELOPMENT_VERSION: The next development version with -SNAPSHOT suffix (e.g., 0.2.0-SNAPSHOT)
- -u MAVEN_USERNAME: Maven Central (Sonatype) username
- -p MAVEN_PASSWORD: Maven Central (Sonatype) password
- -k GPG_PRIVATE_KEY: GPG private key (exported with gpg --export-secret-keys --armor KEY_ID)
- -i GPG_KEY_ID: GPG key ID
- -s GPG_PASSPHRASE: GPG passphrase
- -g GITHUB_TOKEN: (Optional) GitHub token for creating GitHub releases
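
The release script takes the release version and the next development version as separate arguments. As an illustration of how the two relate in the examples given (0.1.0 followed by 0.2.0-SNAPSHOT), here is a minimal sketch that derives one from the other. It assumes a MAJOR.MINOR.PATCH version and a policy of bumping the minor number; `ReleaseVersions` and `nextDevelopmentVersion` are hypothetical names and are not part of release.sh.

```java
final class ReleaseVersions {

    // Hypothetical helper (not part of release.sh): bumps the minor version,
    // resets the patch version to 0, and appends the -SNAPSHOT suffix.
    static String nextDevelopmentVersion(String releaseVersion) {
        String[] parts = releaseVersion.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                    "expected MAJOR.MINOR.PATCH, got: " + releaseVersion);
        }
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        Integer.parseInt(parts[2]); // validates that the patch part is numeric
        return major + "." + (minor + 1) + ".0-SNAPSHOT";
    }

    public static void main(String[] args) {
        System.out.println(nextDevelopmentVersion("0.1.0")); // prints: 0.2.0-SNAPSHOT
    }
}
```

Other policies, such as bumping only the patch number, are equally valid; the script accepts whatever value is passed with -d.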
This script will:
- Update the version to the release version
- Build and deploy to Maven Central
- Create a Git tag for the release
- Update to the next development version
- Push changes to main branch and tags
- Handle the develop branch
- Create a GitHub release (if GitHub token is provided)
To deploy a snapshot version:

mvn clean deploy

To deploy a release version:

mvn clean deploy -P release

This will:
- Compile the code
- Run tests
- Generate Javadocs and source JARs
- Sign all artifacts with GPG
- Deploy to Central Sonatype
- Automatically release to Maven Central (if using the release profile)
This project is licensed under the MIT License - see the LICENSE file for details.
- Victor Yil (https://victoryil.dev)