How to serve a fastai model in the browser
My intention was clear, and to my mind quite simple: how can I (easily) train a model with the fastai library and deploy it in the browser? That way I would need no servers (and hence no server costs), and I would also guarantee the privacy of my users.
For those who do not know it, the fastai library simplifies training fast and accurate neural nets using modern best practices. From a high-level perspective, fastai is a Python wrapper around the open-source PyTorch framework, but it bakes those best practices in and easily delivers state-of-the-art results in the standard deep learning domains. Another important advantage is that you can experiment with it in Jupyter notebooks without installing anything locally or owning a dedicated GPU for training. Last but most importantly, you can follow the brilliant lessons of Jeremy Howard, which really accelerate your deep learning journey.
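To give a flavour of how little code this takes, here is a rough sketch of training and exporting such a learner with fastai; the dataset folder, the number of epochs, and the export filename below are placeholders of mine, not the exact ones I used:
from fastai.vision.all import *

# Build dataloaders from a folder of labelled images (placeholder path)
dls = ImageDataLoaders.from_folder('gdrive/MyDrive/behind', valid_pct=0.2, item_tfms=Resize(224))

# Transfer-learn on top of a ResNet34 backbone
learn = cnn_learner(dls, resnet34, metrics=accuracy)
learn.fine_tune(4)

# Export the trained learner for inference; the resulting .pkl is saved under the dataloaders' path
learn.export('export.pkl')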
By training on top of a ResNet34 backbone, I created my own learner model in Jupyter. When I saved (exported) the model, I saw that it needed about 50MB of disk space. The options I now have for deploying it in production are the following:
- Serve from a dedicated PyTorch server. PyTorch is a big framework, so you need to spin up a VM, install PyTorch, and use a Flask server to serve your model (a minimal sketch of such an endpoint is shown right below). Although this solution seems the most direct, it has drawbacks around cost, maintainability, and scalability, and all of those should be thought through before committing to it. In my opinion, it is a good approach when we need to serve really large models (say >50MB), but we still have to decide beforehand how to scale up (maybe using Kubernetes or ECS), what type of VMs we are going to use, and so on.
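To make the first option concrete, here is a minimal sketch of such a Flask endpoint, assuming the learner was exported with learn.export(); the route name, file path, and response format are placeholders of mine:
import io

from flask import Flask, request, jsonify
from fastai.vision.all import load_learner, PILImage

app = Flask(__name__)
learn = load_learner('models/export.pkl')   # loaded once at startup

@app.route('/predict', methods=['POST'])
def predict():
    # Expect the image as a multipart file upload under the key 'image'
    img = PILImage.create(io.BytesIO(request.files['image'].read()))
    pred, pred_idx, probs = learn.predict(img)
    return jsonify({'class': str(pred), 'confidence': float(probs[pred_idx])})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)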
- To avoid all those scaling issues, one may suggest a serverless architecture. Indeed this is a viable solution that has already been tested: we can use API Gateway as a proxy that calls our Lambda inference endpoint and serves clients (a hedged sketch of such a handler follows this bullet). The main issue with this approach is: “Because the plain language model is already around 250 MB, the initial function run can take up to 25 seconds and may even exceed the maximum API timeout of 29 seconds. That time can also be reached when the function wasn’t called for some time and therefore is in a cold start mode. When the Lambda function is in a hot state, one inference run takes about 150 milliseconds.”
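For illustration only, here is a sketch of such a Lambda handler. It keeps the loaded model in module scope so that warm invocations skip the loading cost; the model path, the base64-encoded request body, and the use of ONNX Runtime instead of full PyTorch are all my own assumptions, not part of the original setup:
import base64
import io
import json

import numpy as np
import onnxruntime as ort          # shipped in a Lambda layer or container image
from PIL import Image

session = None                     # module scope: reused across warm invocations

def handler(event, context):
    global session
    if session is None:            # cold start: pay the model-loading cost once
        session = ort.InferenceSession('/opt/model/behind_resnet.onnx')
    img = Image.open(io.BytesIO(base64.b64decode(event['body']))).convert('RGB').resize((224, 224))
    x = (np.asarray(img, dtype=np.float32) / 255.0).transpose(2, 0, 1)[None, ...]   # [1, 3, 224, 224]
    scores = session.run(None, {session.get_inputs()[0].name: x})[0]
    return {'statusCode': 200, 'body': json.dumps({'class': int(np.argmax(scores))})}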
- Lastly, AWS offers Amazon SageMaker, but the disadvantage here is the high cost of the managed service. If it were not for the cost, it would be the easiest of all the previous alternatives.
- To avoid the PyTorch size bottleneck, I thought of moving away from it. The major alternative ML framework is TensorFlow, offered by Google. But before jumping to TensorFlow, I considered the open standard for machine learning interoperability: ONNX. ONNX does not support each and every operator, but it is pretty stable and backed by all the major companies (with Microsoft leading the effort). ONNX has also introduced ONNX Runtime Web (ORT Web), a feature of ONNX Runtime that enables JavaScript developers to run and deploy machine learning models in browsers. This really made me super excited. But first I had to convert my existing PyTorch model to the ONNX format, which was pretty simple to do directly in my Jupyter notebook:
!pip install onnx

torch.onnx.export(learn.model,
                  torch.randn(1, 3, 224, 224).cuda(),
                  'gdrive/MyDrive/models/behind_resnet.onnx')
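As a quick sanity check (my own addition, not part of the original workflow), we can load the exported file with onnxruntime and push a dummy input through it:
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession('gdrive/MyDrive/models/behind_resnet.onnx')
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
out = sess.run(None, {sess.get_inputs()[0].name: dummy})
print(out[0].shape)   # should be (1, number_of_classes)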
The exported model file was saved in Google Drive, but its size exceeded 40MB. I tried various methods to decrease the ONNX model file size, without great success. That is why I was forced to abandon the ONNX approach: 40MB is a big file to download to the browser, especially on mobile phones.
One of the methods I used to decrease the ONNX model size is the following, where we try to remove duplicated weights:
!pip install onnxruntime
!pip install onnx

import onnx
from onnxruntime.transformers.onnx_model import OnnxModel

model = onnx.load('gdrive/MyDrive/models/behind_resnet.onnx')
onnx_model = OnnxModel(model)

count = len(model.graph.initializer)
same = [-1] * count
print(count)   # number of initializers (weight tensors) in the graph

def has_same_value(a, b):
    # Two initializers are duplicates when their raw data and shapes match
    return a.raw_data == b.raw_data and a.dims == b.dims

# For every initializer, remember the first earlier initializer with the same value
for i in range(count - 1):
    if same[i] >= 0:
        continue
    for j in range(i + 1, count):
        if has_same_value(model.graph.initializer[i], model.graph.initializer[j]):
            same[j] = i

# Point all nodes that use a duplicate to the original initializer
for i in range(count):
    if same[i] >= 0:
        onnx_model.replace_input_of_all_nodes(model.graph.initializer[i].name,
                                              model.graph.initializer[same[i]].name)

onnx_model.update_graph()
onnx_model.save_model_to_file('gdrive/MyDrive/models/behind_resnet_reduced.onnx')
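To see whether the pass actually helps, a simple check is to compare the on-disk sizes of the two files; this snippet is my own, assuming both files sit at the paths used above:
import os

for p in ['gdrive/MyDrive/models/behind_resnet.onnx',
          'gdrive/MyDrive/models/behind_resnet_reduced.onnx']:
    print(p, round(os.path.getsize(p) / 1024 / 1024, 1), 'MB')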
- This is where I decided to try the TensorFlow approach. TensorFlow is arguably the most mature of the three frameworks and offers a great variety of libraries and tools. We can divide our options into the TensorFlow library itself (which has the same size pitfalls as PyTorch), tensorflow.js (a library for machine learning in JavaScript, in the browser and in Node.js), and TensorFlow Lite (an open-source deep learning framework for on-device inference). The plan now is to use tensorflow.js to serve our model from the browser and to decrease the model size by using TensorFlow Lite. This is the main documentation page for our approach.
Since we have the model in ONNX format, it is easy to convert it to the TensorFlow Lite format. The TensorFlow Lite Model Optimization page is a gem and contains a lot of very helpful insights. The conversion goes as follows:
!pip install onnx-tf   # onnx_tf comes from the onnx-tensorflow project

import onnx
import tensorflow as tf
from onnx_tf.backend import prepare

# Declare paths
saved_model_dir = 'gdrive/MyDrive/models/behind_tfmodel'
full_tflite_model_dir = 'gdrive/MyDrive/tflite/behind_model.tflite'
optimized_tflite_model_dir = 'gdrive/MyDrive/tflite/behind_optimized_model.tflite'
size_tflite_model_dir = 'gdrive/MyDrive/tflite/behind_size_model.tflite'

# Load the ONNX model and export it as a TensorFlow SavedModel
onnx_model = onnx.load('gdrive/MyDrive/models/behind_resnet.onnx')
tf_model = prepare(onnx_model)
tf_model.export_graph(saved_model_dir)

# Convert the SavedModel to TensorFlow Lite
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)

# 1) Plain conversion, no optimizations (our full-sized baseline)
full_tflite_model = converter.convert()

# 2) Default optimizations with float16 weights
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
optimized_model = converter.convert()

# 3) Optimize for size
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
size_model = converter.convert()

# Save the three variants
with open(full_tflite_model_dir, 'wb') as f:
    f.write(full_tflite_model)
with open(optimized_tflite_model_dir, 'wb') as f:
    f.write(optimized_model)
with open(size_tflite_model_dir, 'wb') as f:
    f.write(size_model)
This seems to have done the trick: in our case we saved a model of 11MB (with the OPTIMIZE_FOR_SIZE flag). Of course, the real trade-off when we reduce the model size is whether we still get acceptable accuracy. To check this, we can run the following code in the Jupyter notebook, which compares the quantized model against the full-sized model.
# Test TFLite inference: compare the full and the quantized models on a few images
import numpy as np
import tensorflow as tf

filenames = ['gdrive/MyDrive/behind/1_001.jpeg',
             'gdrive/MyDrive/behind/6_12.jpg',
             'gdrive/MyDrive/behind/4_27.jpeg',
             'gdrive/MyDrive/behind/6_13.jpg',
             'gdrive/MyDrive/behind/7_15.jpg',
             'gdrive/MyDrive/behind/9_15.jpeg']

# Load the full and the quantized TFLite models
tflite_interpreter = tf.compat.v1.lite.Interpreter(model_path=full_tflite_model_dir)
tflite_interpreter_quant = tf.compat.v1.lite.Interpreter(model_path=optimized_tflite_model_dir)

# Learn about the input and output details
input_details = tflite_interpreter_quant.get_input_details()
output_details = tflite_interpreter_quant.get_output_details()
print("== Input details ==")
print("name:", input_details[0]['name'])
print("shape:", input_details[0]['shape'])
print("type:", input_details[0]['dtype'])
print("\n== Output details ==")
print("name:", output_details[0]['name'])
print("shape:", output_details[0]['shape'])
print("type:", output_details[0]['dtype'])

tflite_interpreter_quant.allocate_tensors()
tflite_interpreter.allocate_tensors()
tensor_index_quant = tflite_interpreter_quant.get_input_details()[0]['index']
tensor_index = tflite_interpreter.get_input_details()[0]['index']
print('Tensor Index = {}'.format(tensor_index))

for f in filenames:
    print("For file: {}".format(f.split('gdrive/MyDrive/behind/')[1]))
    img = tf.keras.preprocessing.image.load_img(f, target_size=(224, 224))
    image_array = np.array(img) / 255                                       # [224, 224, 3], scaled to [0, 1]
    input_tensor_z = np.expand_dims(tf.convert_to_tensor(image_array, np.float32), axis=0)
    # Reorder from [1, 224, 224, 3] to [1, 3, 224, 224], which is what the model expects
    n_image = np.swapaxes(input_tensor_z, 3, 1)
    n_image = np.swapaxes(n_image, 2, 3)
    tflite_interpreter_quant.set_tensor(tensor_index_quant, n_image)
    tflite_interpreter.set_tensor(tensor_index, n_image)

    tflite_interpreter_quant.invoke()
    tflite_interpreter.invoke()
    output_details_quant = tflite_interpreter_quant.get_output_details()[0]
    output_details = tflite_interpreter.get_output_details()[0]
    output_quant = np.squeeze(tflite_interpreter_quant.get_tensor(output_details_quant['index']))
    output = np.squeeze(tflite_interpreter.get_tensor(output_details['index']))
    print(output)
    print(output_quant)
    for i, o in enumerate(output):
        if o == max(output):
            print("Max: {}: {}".format(i + 2, o))
    for i, o in enumerate(output_quant):
        if o == max(output_quant):
            print("Max Quant: {}: {}".format(i + 2, o))
We are almost there. Since we have finished exploring all the available deployment options, it is worth also testing our model in a browser. The code to load and invoke a TensorFlow Lite model through the tensorflow.js library is shown below:
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs/dist/tf.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-tflite@0.0.1-alpha.5/dist/tf-tflite.min.js"></script>
<script>
(async () => {
  const model = await tflite.loadTFLiteModel("https://models_in_s3.s3.amazonaws.com/back/model.tflite");
  const raw_img = tf.browser.fromPixels(document.getElementById('img'), 3);
  // Convert the input shape to [1, 3, 224, 224], which is what our model expects
  const inputTensor = tf.transpose(tf.image.resizeNearestNeighbor(raw_img, [224, 224]).div(tf.scalar(255)).expandDims(0), [0, 3, 1, 2]);
  let result = model.predict(inputTensor);
  console.log(result.dataSync());
})();
</script>
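From there, the index of the largest value in result.dataSync() gives the predicted class, exactly as in the notebook comparison above, so the whole inference now runs in the browser with no inference server involved.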