CSV Export in Spring Boot

When it comes to handling CSV export processes in Spring Boot, choosing the right approach is crucial for ensuring optimal performance. After evaluating several common methods, I have identified the most efficient options to consider without hesitation. Let’s take a closer look at the performance metrics of these selected techniques.

Here are the general specifications I used for this experiment:

  • Total rows: 1000019
  • Related entities: Jobs, Companies, CompanyLocations, Occupations
  • Spring Boot 3
  • JDK 17
  • commons-csv: 1.10.0

Useful techniques

1.1. Spring Batch

To prepare the batch processing, we will need to understand the batch structure. In general, this is the process you should remember.

In order to create a batch process, it is necessary to have some batch tables in the current database. To allow the application to do it, remember to add the following lines to your YAML configuration file.

spring.batch.job.enabled=false
spring.batch.jdbc.initialize-schema=always

And from Spring Boot 3, we don’t need the annotation @EnableBatchProcessing. If we add it, the batch related configurations in the YAML property file will be back off.

I might write another article to introduce Spring Batch, but in this section, I would like to tell you how long it will take to export around 1000000 rows if I use the batch process.

Here is the result:

Finished after: 8163.085(s)

1.2. ByteArrayOutputStream

  • BufferedWriter

In Spring Boot, the BufferedWriter class is a subclass of Writer that efficiently writes text to an output stream by using a buffer. This buffering mechanism enhances the performance of I/O operations by minimizing the number of interactions with the underlying output destination. By accumulating characters in a buffer, BufferedWriter reduces the frequency of physical write operations, thereby improving overall efficiency. This class is particularly useful when dealing with large amounts of data, as it writes data in bulk to the output stream only when the buffer is full or when explicitly instructed to do so.

In this experiment, I also applied pagination to see if it can make the process quicker.

public void exportDataToCSVFile(OutputStream outputStream) throws Exception {
    try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(outputStream))) {
        int pageNumber = 0;
        Page<Job> page;

        // Write CSV header
        writer.write("JobId,CompanyName,BusinessOverview,Location,Title,Occupation,JobDescription,PRText,Status,Specialty,PostingStartDate,PostingEndDate,PVTotal");

        do {
            PageRequest pageRequest = PageRequest.of(pageNumber, 100);
            page = jobsRepository.findAll(pageRequest);

            for (Job entity : page.getContent()) {
                writer.newLine();
                writer.write(formatEntityToCSVRow(entity));
            }
            pageNumber++;
        } while (page.hasNext());
        writer.flush();
    }
}

This is the controller layer.

@GetMapping("/jobs/buffer/download")
ResponseEntity<byte[]> exportDataToByte() throws Exception {
    long start = System.currentTimeMillis();
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    exportService.exportDataToCSVFile(outputStream);

    byte[] csvBytes = outputStream.toByteArray();
    HttpHeaders headers = new HttpHeaders();
    headers.setContentType(MediaType.parseMediaType("text/csv"));
    headers.set(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=data.csv");
    headers.setContentLength(csvBytes.length);
    long end = System.currentTimeMillis();
    System.out.println("Finished after: "+ ((double)(end - start)/1000));

    return ResponseEntity.ok()
            .headers(headers)
            .body(csvBytes);
}

Here is the result:

Finished after: 18527.395(s)

  • Apache Commons CSV

Apache Commons CSV is a robust and flexible library provided by the Apache Commons project for reading and writing CSV (Comma-Separated Values) files in Java.

Here’s how I implemented it in my experiment.

private byte[] generateJobListCsv(List<Job> data) throws IOException {
    try {
        CSVFormat csvFormat = CSVFormat.DEFAULT.builder().setHeader("JobId", "CompanyName", "BusinessOverview", "Location", "Title", "Occupation", "JobDescription", "PRText", "Status", "Specialty", "PostingStartDate", "PostingEndDate", "PVTotal").build();
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        CSVPrinter csvPrinter = initialCsvPrinter(csvFormat, outputStream);

        for (Job record : data) {
            csvPrinter.printRecord(
                    record.getId(),
                    record.getLocation().getCompany().getCompanyName(),
                    record.getLocation().getCompany().getBusinessOverview(),
                    record.getLocation().getLocationName(),
                    record.getTitle(),
                    record.getOccupation().getOccupationName(),
                    record.getJobDescription(),
                    record.getPrText(),
                    record.getPublicStatus(),
                    record.getSpecialty(),
                    record.getPostingStartDate(),
                    record.getPostingEndDate(),
                    record.getTotalPv()
            );
            csvPrinter.println();
        }
        csvPrinter.flush();
        return outputStream.toByteArray();
    } catch (Exception e) {
        e.printStackTrace();
        throw e;
    }
}

public ResponseEntity<Resource> downloadJobCSV() throws IOException {
    List<Job> data = jobsRepository.findAll();
    byte[] byteData = generateJobListCsv(data);

    Resource csvFile = new ByteArrayResource(byteData);
    HttpHeaders headers = new HttpHeaders();
    headers.setContentType(MediaType.parseMediaType("text/csv"));
    headers.setContentDispositionFormData("attachment",
            "jobs.csv");
    return ResponseEntity.ok()
            .headers(headers)
            .body(csvFile);
}

The response time is quite impressive.

Finished after: 45.824(s)

  • InputStreamResource

In Spring Boot, the InputStreamResource class is a subclass of AbstractResource that provides a straightforward way to wrap an InputStream as a resource. This is particularly useful for handling large files or streams of data that need to be processed or transferred without fully loading them into memory. By leveraging the InputStreamResource, you can efficiently manage data streams, such as reading from a file or network socket, and pass them to other components, like REST controllers, for further processing or direct response to clients. This class simplifies the integration of input streams within the Spring framework, ensuring seamless resource management and optimal performance.

This is the implementation to download CSV files.

public File exportDataToCSVFile() throws IOException {
    List<Job> dataList = jobsRepository.findAll();
    File csvFile = File.createTempFile("data", ".csv");

    try (BufferedWriter writer = new BufferedWriter(new FileWriter(csvFile))) {
        // Write CSV header
        writer.write("JobId,CompanyName,BusinessOverview,Location,Title,Occupation,JobDescription,PRText,Status,Specialty,PostingStartDate,PostingEndDate,PVTotal");

        // Write CSV data
        for (Job data : dataList) {
            writer.newLine();
            writer.write(formatEntityToCSVRow(data));
        }
    }

    return csvFile;
}

@GetMapping("/jobs/inputStream/download")
public ResponseEntity<InputStreamResource> exportDataByInputStream() throws IOException {
    long start = System.currentTimeMillis();
    File csvFile = exportService.exportDataToCSVFile();

    InputStreamResource resource = new InputStreamResource(new FileInputStream(csvFile));

    HttpHeaders headers = new HttpHeaders();
    headers.setContentType(MediaType.parseMediaType("text/csv"));
    headers.set(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=data.csv");
    headers.setContentLength(csvFile.length());
    long end = System.currentTimeMillis();
    System.out.println("Finished after: "+ ((double)(end - start)/1000));

    return ResponseEntity.ok()
            .headers(headers)
            .body(resource);
}

The response time is significantly more impressive compared to the techniques mentioned above. Honestly, I didn’t expect such an outstanding result. Here it is:

Finished after: 37.491(s)

Combinated techniques

In this section, I will explore the combination of two data types, List and Stream, with the previously mentioned techniques to determine which is the best option for CSV export.

2.1. Stream vs List

2.1.1. Differences between Stream and List

Feature List Stream
Origin Java Collections Framework Java Stream API (Java 8)
Nature Ordered collection of elements Sequence of elements supporting functional operations
Storage Stores elements Does not store elements; processes them on-the-fly
Mutability Mutable (supports add, remove operations) Immutable (operations produce new streams)
Access Allows indexed access to elements No direct access to elements
Traversal No direct access to elements Can be traversed only once
Operations Supports CRUD operations Supports functional operations (map, filter, reduce, etc.)
Parallelism Requires manual handling for parallel operations Easily supports parallel processing
Use case Suitable for storing and manipulating collections Suitable for processing data sequences in a functional manner
Pagination Yes No
Annotation No @Transaction required

2.1.2 How is its performance?

(These results were measured in seconds)

Techniques List Stream
Batch 8163.085 465.023
BufferedWriter 18527.395 54.302
Apache Commons CSV 45.824 60.289
InputStreamResource 37.491 63.738

2.2. Custom DTOs

So, what if I use a custom DTO to retrieve only the necessary data? Will the performance change significantly as well?

@Query(value = """
            select job.id as id,
            c.companyName as companyName,
            c.businessOverview as businessOverview,
            loc.locationName as locationName,
            job.title as title,
            o.occupationName as occupationName,
            job.jobDescription as jobDescription,
            job.prText as prText,
            job.publicStatus as publicStatus,
            job.specialty as specialty,
            job.postingStartDate as postingStartDate,
            job.postingEndDate as postingEndDate,
            job.totalPv as totalPv from jobs job join company_locations loc on job.location.id = loc.id
            join companies c on c.id = loc.company.id
            join occupations o on o.id = job.occupation.id""")
    Stream<IJobExportItem> findAllByStreamByDto();

Here is a comparison table of response times measured in seconds after 2 times.

First time:

Techniques List Stream Dto
Batch 8163.085 465.023 129.655
BufferedWriter 18527.395 54.302 102.978
Apache Commons CSV 45.824 60.289 106.527
InputStreamResource 37.491 63.738 99.591

Second time:

Techniques List Stream Dto
Batch 6109.746 596.127 125.961
BufferedWriter 14446.221 51.409 93.24
Apache Commons CSV 44.447 56.177 102.944
InputStreamResource 40.872 49.621 97.948

Summary

In summary, InputStreamResource has proven to be the best option for now. The second option is Apache Commons CSV. However, it’s important to note that using List may not be ideal for CSV export, as this article only addresses response time. Achieving optimal application performance often requires balancing response time with memory usage. While I plan to delve deeper into this topic in the next article, I can reveal that Stream is a more suitable choice for CSV processing. We will also examine memory consumption during the process. Stay tuned for more insights!

この情報は役に立ちましたか?


フィードバックをいただき、ありがとうございました!

関連記事

カテゴリー:

ブログ

情シス求人

  1. チームメンバーで作字やってみた#1

ページ上部へ戻る