Monday, 3 June 2019

Reading specific chunks pandas / not reading all chunks in pandas

Following this question and answer, I am trying to read a large CSV file in chunks and process it. Since I'm not a native Python user, I have run into an optimization problem and am looking for a better solution here.

What my code does:

I read in the line count of my csv with

with open(file) as f:
    row_count = sum(1 for line in f)

Afterwards I "slice" my data into 30 equally sized chunks and process them, following the linked answer, with a for loop and pd.read_csv(file, chunksize=chunksize). Since plotting 30 graphs in one figure is pretty unclear, I plot only every 5th chunk using modulo (the step may be varied). For this I use an external counter.

import pandas as pd
import matplotlib.pyplot as plt

chunksize = row_count // 30
counter = 0
for chunk in pd.read_csv(file, chunksize=chunksize):
    df = chunk
    print(counter)
    # counter == 0 already satisfies counter % 5 == 0, so one test is enough
    if counter % 5 == 0:
        plt.plot(df["Variable"])
    counter = counter + 1
plt.show()

Now to my question:

It seems that this loop reads each chunk in before the loop body runs, which is reasonable. I can see this because the print(counter) steps are also fairly slow. Since I read a few million rows of a CSV, every step takes some time. Is there a way to skip the unwanted chunks in the for loop before they are read in? I tried something like:

wanted_plts <- [1,5,10,15,20,25,30]
for i in wanted_plts:
   for chunk[i] in pd.read_csv(file, chunksize=chunksize):
   .
   .

I think I have trouble understanding how to manipulate the range of this for loop syntax. There should be an elegant way to fix this.

Also: I found pandas' .get_chunk(x), but this seems to create just one chunk of size x.

Another attempt of mine was to subset the reader object of pd.read_csv, like pd.read_csv()[0,1,2], but it seems that's not possible either.
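
One direction that might avoid parsing the unwanted chunks is to call pd.read_csv once per wanted chunk with its skiprows and nrows parameters, instead of iterating over every chunk. The following is only a minimal sketch under some assumptions: chunk indices are zero-based, the file has a single header row, and read_selected_chunks is a hypothetical helper name, not a pandas function.

```python
import pandas as pd


def read_selected_chunks(source, chunksize, wanted):
    """Yield (chunk_index, DataFrame) for each zero-based index in `wanted`.

    Hypothetical helper: `source` is assumed to be a file path, so that
    every pd.read_csv call starts again from the top of the file.
    """
    for i in sorted(wanted):
        # Row 0 is the header and must be kept; skip only the data rows
        # that belong to the chunks before chunk i.
        skip = range(1, 1 + i * chunksize)
        yield i, pd.read_csv(source, skiprows=skip, nrows=chunksize)
```

It could then be used as `for i, df in read_selected_chunks(file, chunksize, [0, 5, 10, 15, 20, 25]): plt.plot(df["Variable"])`. Note that the skipped lines still have to be scanned to find the line breaks, so this is not free, and each returned DataFrame gets a fresh index starting at 0 rather than the running row number.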


Amendment: I'm aware that plotting a lot of data in matplotlib is really slow. I preprocess it earlier, but to keep this code readable I removed all unnecessary parts.

Also, the answer here could be trivial. Maybe I'm overthinking this.



from Reading specific chunks pandas / not reading all chunks in pandas
