Scrapy provides a few item exporters by default to export items in commonly used file formats like CSV/JSON/XML. I usually use CSV to export items; it is pretty convenient, and it comes in two modes:
- appending mode, for example, `scrapy crawl foo -o test.csv`
- overwriting mode with the `-O` option, like `scrapy crawl foo -O test.csv`
But in appending mode, it is a bit annoying that it always appends the header row before the newly scraped items, which is not correct in terms of the CSV format.
So how can we resolve this issue?
In case you didn't know, the item exporter is the component in charge of saving scraped items from spiders to files. So a natural fix is to customize the CSV item exporter.
There was also a Stack Overflow question regarding this, and the accepted answer from @furas took this approach. It enhances the default `CsvItemExporter` by adjusting `include_headers_line` in the constructor:
```python
from scrapy.exporters import CsvItemExporter


class HeadlessCsvItemExporter(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        # args[0] is the (already opened) file handle:
        # if the file is not empty, skip the header row
        if args[0].tell() > 0:
            kwargs['include_headers_line'] = False
        super(HeadlessCsvItemExporter, self).__init__(*args, **kwargs)
```
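For Scrapy to actually pick up this exporter, it has to be registered for the `csv` feed format via the `FEED_EXPORTERS` setting. A minimal sketch, assuming the class lives in a hypothetical `myproject/exporters.py` module (adjust the dotted path to your project layout):

```python
# settings.py -- register the custom exporter for the csv feed format.
# The 'myproject.exporters' module path below is an assumption; point it
# at wherever HeadlessCsvItemExporter is actually defined.
FEED_EXPORTERS = {
    'csv': 'myproject.exporters.HeadlessCsvItemExporter',
}
```

With this in place, `scrapy crawl foo -o test.csv` uses the custom class instead of the stock `CsvItemExporter`.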
That is one way, utilizing Scrapy's own architecture.
Here I want to share another approach that is independent of Scrapy. If we look at the resulting CSV file alone, it is merely a text file. So removing the redundant header rows is equivalent to removing the duplicates of one specific row.
With that in mind, here is the snippet that runs after the scrapy command finishes:
```python
import os


def dedup_csv_header(fname, fname_new):
    if not os.path.exists(fname):
        print('csv file does not exist:', fname)
        return
    print(f'dedup csv headers, file {fname} to {fname_new}')
    with open(fname, 'rt') as f, open(fname_new, 'wt') as fnew:
        # the first line is the header; drop any later line that repeats it
        header = None
        first = True
        for line in f:
            if header is None:
                header = line
            if not first and header == line:
                continue
            fnew.write(line)
            first = False
```
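As a quick sanity check, here is a self-contained run of the same idea on a hypothetical file that looks like the output of two appended crawl runs (the file names and rows below are made up for illustration; the dedup logic is a condensed version of the function above):

```python
import os
import tempfile


def dedup_csv_header(fname, fname_new):
    # condensed version of the logic above: keep the first line
    # (the header) and drop any later line that repeats it verbatim
    with open(fname, 'rt') as f, open(fname_new, 'wt') as fnew:
        header = None
        for line in f:
            if header is None:
                header = line
            elif line == header:
                continue
            fnew.write(line)


# build a sample file resembling two appended crawl runs
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'test.csv')
dst = os.path.join(tmpdir, 'test_clean.csv')
with open(src, 'wt') as f:
    f.write('name,price\napple,1\nname,price\npear,2\n')

dedup_csv_header(src, dst)
with open(dst) as f:
    print(f.read(), end='')
# name,price
# apple,1
# pear,2
```

One caveat of this text-level approach: a data row that happens to be byte-identical to the header would also be dropped, which is usually impossible for real scraped items but worth keeping in mind.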