I'm trying to build a web service (in Python) that can accept potentially tens of gigabytes of data and process it. I don't want the request to be fully received and built into an in-memory object before it's handed to my logic, because a) that would use a ton of memory, and b) the processing will be pretty slow, and I'd love to have a processing thread working on chunks of the data while the rest is still being received asynchronously.
I believe I need some sort of streaming solution for this, but I'm having trouble finding anything in Python that handles this case. Most of what I've found is about streaming the output (not an issue for me). It also seems like WSGI has design issues with streaming request data.
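From what I can tell, WSGI itself does expose the request body as a file-like wsgi.input stream, so reading in chunks looks possible at that level (though I gather some servers buffer the entire body before invoking the app). A minimal sketch of the kind of read loop I mean:

CHUNK_SIZE = 64 * 1024  # 64 KiB per read

def application(environ, start_response):
    body = environ['wsgi.input']
    # Per PEP 3333, an app shouldn't read past CONTENT_LENGTH
    remaining = int(environ.get('CONTENT_LENGTH') or 0)
    received = 0
    while remaining > 0:
        chunk = body.read(min(CHUNK_SIZE, remaining))
        if not chunk:
            break
        remaining -= len(chunk)
        received += len(chunk)
        # hand `chunk` off to processing here instead of accumulating it
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [('received %d bytes\n' % received).encode('utf-8')]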
Is there a best practice for this sort of problem that I'm missing? And/or is there a solution I haven't found?
Edit: Since a couple of people asked, here's an example of the sort of data I'd be looking at. Basically I'm working with lists of sentences, which may be millions of sentences long, but each sentence (or group of sentences, for convenience) is a separate processing task. Originally I had planned on receiving this as a JSON array like:
{"sentences: [
"here's a sentence",
"here's another sentence",
"I'm also a sentence"
]
}
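(If I did keep the JSON wrapper, it looks like a streaming parser such as ijson could yield the array elements as they arrive instead of building the whole document in memory. A rough sketch, assuming the request body is exposed as a file-like object:)

import ijson  # third-party streaming JSON parser (pip install ijson)

def iter_sentences(body_stream):
    """Yield sentences one at a time from {"sentences": [...]}
    without loading the whole document into memory."""
    # 'sentences.item' addresses each element of the "sentences" array
    for sentence in ijson.items(body_stream, 'sentences.item'):
        yield sentence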
For this modification I'm thinking it would just be newline-delimited sentences, since I don't really need the JSON structure. So in my head, my solution would be: I get a constant stream of characters, and whenever I hit a newline, I split off the previous sentence and pass it to a worker thread or thread pool for processing. I could also do this in groups of many sentences to avoid having a ton of threads going at once. But the main thing is that while the main thread is reading this character stream, it is splitting off tasks periodically so other threads can start the processing.
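Here's a rough sketch of that loop, assuming body is the file-like request stream and process_batch is a stand-in for my actual (proprietary) processing:

from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 64 * 1024   # bytes per read
BATCH_SIZE = 1000        # sentences per task, so it's not one thread per sentence

def handle_stream(body, process_batch):
    """Read the body in chunks, split on newlines, and farm complete
    sentences out to a thread pool in batches while still reading."""
    futures = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        buf = b''
        batch = []
        while True:
            chunk = body.read(CHUNK_SIZE)
            if not chunk:
                break
            buf += chunk
            *lines, buf = buf.split(b'\n')  # keep any partial sentence in buf
            for line in lines:
                if not line:
                    continue
                batch.append(line.decode('utf-8'))
                if len(batch) >= BATCH_SIZE:
                    futures.append(pool.submit(process_batch, batch))
                    batch = []
        if buf:                 # trailing sentence with no final newline
            batch.append(buf.decode('utf-8'))
        if batch:
            futures.append(pool.submit(process_batch, batch))
    return futures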
Second Edit: I've had a few thoughts on how to process the data. I can't give tons of specific details as it's proprietary, but I could either store the sentences in Elasticsearch or some other database as they come in and have an async process work on that data, or (ideally) I'd just work with the sentences in batches, in memory. Order is important, and so is not dropping any sentences. The inputs will be coming from customers over the internet, though, which is why I'm trying to avoid a message-queue-like process: I don't want the overhead of a new call for each sentence.
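On the ordering point: if the batches are submitted to the pool in arrival order and the futures are kept in a list (as in the hypothetical handle_stream() sketch above), the results can still be consumed in that same order even though the batches ran concurrently:

def collect_in_order(futures):
    """Yield batch results in submission (i.e. arrival) order, even
    though the batches were processed concurrently."""
    for future in futures:
        yield future.result()  # blocks until that particular batch is done

# e.g. for batch_result in collect_in_order(handle_stream(body, process_batch)):
#          index_batch(batch_result)  # hypothetical sink: Elasticsearch, etc.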
Ideally, the customer of the web service doesn't have to do anything particularly special other than make a normal POST request with a gigantic body, with all this special logic being server-side. My customers won't be expert software engineers, so while a web service call is perfectly within their wheelhouse, handling a more complex message-queue process or something along those lines isn't something I want to impose on them.
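So the client side could stay as plain as this (requests streams a file-like body instead of reading it all into memory first; the URL is just a placeholder):

import requests

# Passing a file object as `data` makes requests stream the upload
# rather than loading the whole file into memory.
with open('sentences.txt', 'rb') as f:
    resp = requests.post('https://example.com/api/sentences', data=f)
resp.raise_for_status()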