I am running a Django app with a PostgreSQL database and I am trying to send a very large dictionary (consisting of time-series data) to the database.
My goal is to write my data into the DB as fast as possible. I am using the requests library to send the data via an API call (built with Django REST framework):
My API view is simple:
from rest_framework import status
from rest_framework.decorators import api_view
from rest_framework.response import Response

from .models import TimeSeries


@api_view(["POST"])
def CreateDummy(request):
    # one INSERT per element of the outer dict
    for elem, ts in request.data['time_series'].items():
        TimeSeries.objects.create(data_json=ts)
    msg = {"detail": "Created successfully"}
    return Response(msg, status=status.HTTP_201_CREATED)
request.data['time_series'] is a huge dictionary structured like this:
{"Building1": {1: 123, 2: 345, 4: 567, ..., 31536000: 2345}, ..., "Building30": {...}}
That means I have 30 keys, and each value is itself a dict with 31,536,000 entries.
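(Just for illustration, a much smaller dict with the same shape can be built like this for local testing — the values are random, only the structure matches my real data:)

import random

# Toy payload with the same nesting as the real data:
# {"Building1": {1: value, 2: value, ...}, ..., "BuildingN": {...}}
def make_dummy_payload(n_buildings=3, n_points=1000):
    return {
        f"Building{i}": {t: random.random() for t in range(1, n_points + 1)}
        for i in range(1, n_buildings + 1)
    }

data = make_dummy_payload()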
My API request looks like this (where data is my dictionary described above):
import requests

payload = {
    "time_series": data,
}
requests.request(
    "post", url=endpoint, json=payload
)
The code saves the time-series data to a JSONB field in the backend. This works if I only loop over the first 4 elements of the dictionary; I can get that data in in about 1 minute. But when I loop over the whole dict, my development server shuts down. I guess it's because the memory is insufficient. I get a requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')). Is the whole dict loaded into memory before it starts iterating? I doubt it, because I read that in Python 3 looping with .items() returns an iterator and is the preferred way to do this.
Is there a better way to deal with massive dicts in Django/Python? Should I loop through half of it and then through the other half? Or is there a faster way? Maybe using pandas? Or maybe sending the data differently? I guess I am looking for the most performant way to do this.
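One idea I had for the write side is to use bulk_create instead of one create() per element, so Django batches the INSERTs. A rough, untested sketch of what I mean (same TimeSeries model as above, batch size picked arbitrarily):

from rest_framework import status
from rest_framework.decorators import api_view
from rest_framework.response import Response

from .models import TimeSeries


@api_view(["POST"])
def CreateDummyBulk(request):
    # Build the objects in memory first, then insert them in batches
    # instead of issuing one INSERT per time-series.
    objs = [
        TimeSeries(data_json=ts)
        for ts in request.data["time_series"].values()
    ]
    TimeSeries.objects.bulk_create(objs, batch_size=10)
    return Response({"detail": "Created successfully"},
                    status=status.HTTP_201_CREATED)

I have no idea yet whether that helps at all when every single data_json value is this big, though.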
Happy to provide more code if needed.
Any help, hints or guides are very much appreciated! Thanks in advance.
EDIT2: I don't think it is my RAM usage or the size of the dict. I still have 5 GiB of RAM left when the server shuts down. ~~And the size of the dict is 1176 bytes~~ The dict is much larger, see comments.
EDIT3: I can't even print the huge dict; the server also shuts down then.
EDIT4: When I split the data up and don't send it all at once, the server can handle it. But when I try to query it back, the server breaks again. It breaks on my production server (nginx / AWS RDS setup) and it breaks on my local dev server. I am pretty sure it's because Django can't handle queries that big with my current setup. But how could I solve this?
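For reference, splitting it on the client side looks roughly like this — one building per request instead of the whole dict (a simplified sketch, not my exact code; data and endpoint as above):

import requests

# Send one building per request so each request body (and each
# transaction on the server) stays comparatively small.
for building, ts in data.items():
    payload = {"time_series": {building: ts}}
    response = requests.post(endpoint, json=payload)
    response.raise_for_status()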
EDIT5: So what I am looking for is a two-part solution: one for the creation of the data and one for the querying of the data. The creation I described above. But even if I get all that data into the database, I will still have problems getting it out again.
I worked around the creation by not sending everything together but creating every time-series on its own. So let's assume I have this huge data in my DB and I try to query it back. All time-series objects belong to a network, so I tried this:
from rest_framework.exceptions import NotFound
from rest_framework.generics import ListAPIView

from .models import TimeSeries  # app-local imports
from .serializers import TimeSeriesSerializer


class TimeSeriesByTypeAndCreationMethod(ListAPIView):
    """Query time-series in a specific network."""

    serializer_class = TimeSeriesSerializer

    def get_queryset(self):
        """Query time-series by name of network, type of data,
        creation method and source.
        """
        network = self.kwargs["name_network"]
        # The same filter is evaluated twice: once for .exists(),
        # once for the queryset that is returned.
        if TimeSeries.objects.filter(
            network_element__network__name=network,
        ).exists():
            time_series = TimeSeries.objects.filter(
                network_element__network__name=network,
            )
            return time_series
        else:
            raise NotFound()
But the query breaks the server just like the data creation before. I think this is also too much data load. I thought I could use raw SQL to avoid breaking the server... Or is there a better way?
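The only other idea I have for the read side is to not return everything in one response but page it, e.g. with DRF's PageNumberPagination. A minimal sketch of what I mean (class names and page size are placeholders, serializer and model as above):

from rest_framework.generics import ListAPIView
from rest_framework.pagination import PageNumberPagination

from .models import TimeSeries
from .serializers import TimeSeriesSerializer


class SmallPagePagination(PageNumberPagination):
    # Return the result set in small pages instead of one giant response.
    page_size = 10


class TimeSeriesByNetworkPaginated(ListAPIView):
    serializer_class = TimeSeriesSerializer
    pagination_class = SmallPagePagination

    def get_queryset(self):
        network = self.kwargs["name_network"]
        return TimeSeries.objects.filter(
            network_element__network__name=network,
        )

But I am not sure pagination alone solves it when every single data_json field is already that large.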
from Sending large dictionary via API call breaks development server