How to append items to a CSV file without a header row?


Figure: Scrapy architecture (https://docs.scrapy.org/en/latest/_images/scrapy_architecture_02.png)

Scrapy ships with several item exporters for common file formats such as CSV, JSON, and XML. I usually export items to CSV because it is convenient. From the command line there are two modes:

  • appending mode, for example, scrapy crawl foo -o test.csv

  • overwriting mode with -O option, like scrapy crawl foo -O test.csv

In appending mode, however, it's a bit annoying that every run writes the header row again before the newly scraped items. The file ends up with header rows scattered through the data, which breaks the usual CSV convention of a single header row at the top.
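To see why the repeated header matters, here is a small sketch that does not need Scrapy at all. The sample rows and column names are made up for illustration; the string only imitates what a file might look like after two appending runs. Reading it with the standard csv.DictReader turns the second header line into an ordinary data row.

```python
import csv
import io

# Simulated content of a CSV file after two appending runs.
# The columns and values are made-up sample data.
appended = (
    "name,price\r\n"
    "apple,1\r\n"
    "name,price\r\n"   # header written again by the second run
    "banana,2\r\n"
)

# DictReader takes the first line as the header and treats
# every later line as data, including the repeated header.
rows = list(csv.DictReader(io.StringIO(appended, newline="")))

# Rows whose values merely repeat the column names are bogus.
bogus = [row for row in rows if row["name"] == "name"]

print(len(rows), bogus)
```

Any consumer of the file has to filter such rows out, which is the cleanup the rest of this post tries to avoid.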

So how can we resolve this issue?

In case you didn't know, the item exporter is the Scrapy component in charge of saving scraped items to files. So a natural fix is to customize the CSV item exporter.

There is a Stack Overflow question about this, and the accepted answer from @furas takes exactly this approach. It subclasses the default CsvItemExporter and sets include_headers_line to False in the constructor when the output file is not empty:

from scrapy.exporters import CsvItemExporter


class HeadlessCsvItemExporter(CsvItemExporter):

    def __init__(self, *args, **kwargs):

        # args[0] is (opened) file handler
        # if file is not empty then skip headers
        if args[0].tell() > 0:
            kwargs['include_headers_line'] = False

        super(HeadlessCsvItemExporter, self).__init__(*args, **kwargs)
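For Scrapy to use the custom exporter, it has to be registered for the csv format through the FEED_EXPORTERS setting. The fragment below is a sketch of what that could look like in settings.py. The module path myproject.exporters is an assumption: replace it with wherever the class actually lives in your project.

```python
# settings.py (fragment)
# 'myproject.exporters' is a hypothetical module path; point it at the
# module that defines HeadlessCsvItemExporter in your own project.
FEED_EXPORTERS = {
    'csv': 'myproject.exporters.HeadlessCsvItemExporter',
}
```

With this in place, scrapy crawl foo -o test.csv should go through the subclass instead of the default CSV exporter.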

That's one way to do it, using Scrapy's own architecture.

Here I want to share another approach that is independent of Scrapy. Looked at on its own, the resulting CSV file is merely a text file, so removing the redundant header rows amounts to removing every later line that is identical to the first line.

With that in mind, here is the snippet that runs after the scrapy command finishes:

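import os


def dedup_csv_header(fname, fname_new):
    """Copy fname to fname_new, dropping repeats of the first (header) line."""
    if not os.path.exists(fname):
        print('csv file does not exist:', fname)
        return

    print(f'dedup csv headers, file {fname} to {fname_new}')

    # newline='' keeps the original line endings untouched;
    # the with statement closes both files even if an error occurs
    with open(fname, 'rt', newline='') as f, \
            open(fname_new, 'wt', newline='') as fnew:
        header = None
        for line in f:
            if header is None:
                # the first line is the header; keep it once
                header = line
            elif line == header:
                # a later copy of the header, written by an appending run
                continue
            fnew.write(line)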
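If you would rather clean the file in place than write a second file, the same idea can be wrapped as below. This is only a sketch: dedup_csv_header_inplace is a hypothetical helper name, it assumes the file is UTF-8 encoded, that the first line is the header, and that no real data row is identical to it.

```python
import os
import tempfile


def dedup_csv_header_inplace(fname):
    """Drop repeated header lines from fname, rewriting the file in place.

    Hypothetical helper, not part of Scrapy. Assumes UTF-8 content, that
    the first line is the header, and that no data row is identical to it.
    """
    dirname = os.path.dirname(os.path.abspath(fname))
    # Create the temp file next to the original so os.replace stays on
    # the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix='.csv')
    try:
        # newline='' leaves the original line endings untouched.
        with os.fdopen(fd, 'wt', newline='', encoding='utf-8') as dst, \
                open(fname, 'rt', newline='', encoding='utf-8') as src:
            header = None
            for line in src:
                if header is None:
                    header = line      # first line: remember and keep it
                elif line == header:
                    continue           # later copy of the header: skip it
                dst.write(line)
        # Swap the cleaned file in only after it was written completely.
        os.replace(tmp_path, fname)
    except BaseException:
        # Leave the original file alone and clean up the temp file.
        os.remove(tmp_path)
        raise
```

A typical use would be to call dedup_csv_header_inplace('test.csv') right after scrapy crawl foo -o test.csv finishes. Writing to a temporary file first means an interrupted run should leave the original file as it was.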
