You want Tensorflow 2.3, PyTorch, GPU computing with Float16 Tensor-Core support and a built-in Edge-TPU — here we go…
A hardware/software guide based on our setup.

Getting a GPU machine running with recent versions of CUDA, Tensorflow and PyTorch takes quite a few steps. There are many guides out there, but it is still a gamble to get the right combination of hardware together with the right combination of software. This guide documents one combination which currently works at our office and does the job. If you want to replicate it, or want clues on how we got things running, read on.
We start with the hardware we got and the decisions behind choosing these components. The second part is about getting the software installed: operating system, drivers and libraries to start crunching numbers.
Hardware:
CPU
We went for a Ryzen 5 3400G, a CPU with 4 cores and a built-in GPU. Speed and core count are not the bottleneck for most AI applications, so we went with something in the 100$ range. More important than speed, we checked that the CPU supports the common virtualization features and instruction set extensions. Some budget CPUs are restricted here, and you do not want to end up installing some obscure virtualization extension in the future only to figure out that your CPU does not support a certain instruction set. I experienced this with our mini-server, where we were lucky to have the right extensions to set up nested KVM virtualization.
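A quick way to check what a CPU you already have access to supports is the flags list in /proc/cpuinfo; a small sketch (the extension names checked here are just common examples):

```shell
# svm = AMD-V, vmx = Intel VT-x; prints a verdict either way
if grep -q -E 'svm|vmx' /proc/cpuinfo; then
    echo "hardware virtualization supported"
else
    echo "no hardware virtualization flags found"
fi
# list further extensions numeric libraries like to use (AVX, AVX2, FMA)
grep -o -w -E 'avx|avx2|fma' /proc/cpuinfo | sort -u
```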
There are many discussions on whether an integrated GPU is a hassle when also fielding a dedicated GPU. For us, it felt natural to have a dedicated GPU for AI with no display attached and another one for the display (although we typically run this machine headless). We experienced no major problems with this setup during driver installation.
Memory
A lot. We went with 32GB in two modules, so we can add another 32GB later. Speed should not matter that much, but size does, as plenty of RAM enables you to cache a lot of your training data from batch to batch.
Persistent storage
Speed > Size, or go with a hybrid approach of a fast SSD and a big HDD. Watch out with SSDs, as there are fast and slow models out there; >2GB/s for read and write is standard for a 'fast' SSD. We use a Samsung 970 Evo Plus M.2 SSD. M.2 is handy and very fast: a small stick you plug directly onto your mainboard, with no wiring and no case space needed.
Mainboard
We picked a Gigabyte B450 AORUS Elite. It has enough space to work with two GPUs and slots for two M.2 cards (we later show why we needed a second one). Available PCIe lanes are an issue: as the CPU we use also features an integrated GPU, only 16 PCIe lanes are available externally (8 of which are already used for the integrated GPU). Is this a bottleneck? No, it should not be: training e.g. a YOLO works with images of size 448x448x3 in float32, and transferring 32 images in a batch means moving roughly 73MiB over the available bandwidth of 8GB/s. That adds around 5ms per batch to your training time compared to an x16 PCIe transfer to the GPU.
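The back-of-the-envelope math above is easy to verify; a quick sketch (the bandwidth figures are the nominal PCIe 3.0 values, not measurements):

```python
# one batch: 32 images of 448x448x3 float32 (4 bytes per value)
bytes_per_batch = 32 * 448 * 448 * 3 * 4
mib_per_batch = bytes_per_batch / 2**20          # ~73.5 MiB

# nominal PCIe 3.0 bandwidth: ~8 GB/s on x8 lanes, ~16 GB/s on x16
ms_x8 = bytes_per_batch / 8e9 * 1000
ms_x16 = bytes_per_batch / 16e9 * 1000
print(f"{mib_per_batch:.1f} MiB per batch, "
      f"x8 adds ~{ms_x8 - ms_x16:.1f} ms over x16")
```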
I also wondered how the SSD performs and whether there are restrictions from missing or overloaded PCIe lanes: the model we use should be able to read with up to 3.5GB/s. A test with 'dd' (with cleared page cache) showed 3GB/s.
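Our 'dd' test looked roughly like the sketch below; /tmp/ddtest.bin is a throwaway file, and the commented cache-drop line needs root:

```shell
# write a 256 MiB test file to the SSD
dd if=/dev/zero of=/tmp/ddtest.bin bs=1M count=256 conv=fsync 2>/dev/null
# drop the page cache so the read hits the disk, not RAM (needs root):
#   sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
# time a sequential read; dd prints the achieved throughput
dd if=/tmp/ddtest.bin of=/dev/null bs=1M
rm /tmp/ddtest.bin
```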
Also check whether your mainboard is able to boot and run headless. The mainboard we use boots and runs without display or keyboard for us.
Power Supply
More is more, so opt for plenty of Watts to power up to two GPUs. Efficiency matters too: you will really pay the price on your electricity bill if you did not opt for the more efficient model. Noise is relevant if this machine resides in your office. We opted for a BeQuiet Straight Power 11 750W; we wanted an even bigger one, but those were sold out everywhere when we bought our machine.
GPU
This could fill pages of discussion. We went with a Gigabyte RTX 2070 Windforce 2x with 8GB. Nowadays, go for a 30xx model and better spend more money on memory than on speed (there are sometimes 'super' or 'founders' versions which are a little faster; better go for more memory). GPU RAM is the upper bound on the size of the models you can train and on the number of concurrent trainings, which will occur if the machine is used by different people at once (you really want that, and we later show how it works). Another good point for the RTX is the availability of Tensor-Cores, which compute half-precision floating point. This is a little faster, but more importantly it helps to reduce the memory footprint your model takes on the GPU.
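To get a feeling for the memory argument, here is a rough sketch of the activation memory of one hypothetical conv feature map in both precisions (the shape is made up for illustration):

```python
# hypothetical feature map: batch 32, 112x112 spatial, 64 channels
batch, height, width, channels = 32, 112, 112, 64
values = batch * height * width * channels

mib_fp32 = values * 4 / 2**20   # float32: 4 bytes per value
mib_fp16 = values * 2 / 2**20   # float16: 2 bytes per value
print(f"{mib_fp32:.0f} MiB in float32 vs {mib_fp16:.0f} MiB in float16")
```

Halving the bytes per value halves this footprint, which is exactly the headroom you want for bigger models or concurrent trainings.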
CUDA is heavily tied to NVidia. There are examples of setting up a working stack on AMD hardware, but we did not want to try this, as we feared the smaller user community, which comes with a potentially lesser availability of guides, workarounds and code samples.
Edge-TPU
As a special feature, we also install a Google Coral Edge-TPU. These are available as M.2 cards for roughly 20$. Check the keying of the M.2 card, as it has to fit your mainboard slot. Our B+M-key module did fit.
Case, Monitor, Mouse etc…
Take your spare parts. Display, keyboard and mouse are only needed once for setup; after that you can boot and run headless and configure things over SSH if needed.
For roughly 1000$ you get decent hardware which can be upgraded if needed.
Software:
Debian Linux has emerged as our base for headless development and experimentation machines. You can start with a slim install and add additional packages as needed. Even if you have never worked much with Linux, there are just a handful of key concepts to understand and then you are ready to go. Debian does not enable many out-of-the-box security features like a firewall or SELinux, but as our machine is not Internet-facing (and should not be with our setup), we skip this to focus on getting the AI things running.
This main goal of getting things running also guides our installation process: you should end up able to experiment with AI without hassling too much with compiling custom libraries.
A word of caution: we experienced quite some problems, as drivers and libraries will not fit together if you just go with the latest versions of everything. Especially when following online guides written some months ago, it turns out that current 'latest' versions sometimes do not fit other 'latest' versions on your system. You can often fix this by identifying the 'latest' version that was available when the guide was written and manually stepping back to exactly that version. We try to circumvent this by using commands which select specific versions, and we do not opt for 'installing latest updates' when not needed. This is not a secure setup, as you might miss security fixes to libraries, but it leads to something which runs well on your local network. We suggest trying updates later; we will show you how to roll back your whole machine if things go wrong.
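As a sketch of this version-pinning workflow with apt (the 'package=version' syntax is standard apt; the version string is the one from our driver install below, so treat it as an example):

```shell
# list every candidate version the configured repositories offer
apt list -a nvidia-driver 2>/dev/null || true
# install one exact version instead of the latest:
#   sudo apt-get install nvidia-driver=440.100-1~bpo10+1
# and keep a later 'apt upgrade' from silently replacing it:
#   sudo apt-mark hold nvidia-driver
```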
Installing Debian
For Debian, we used the image 'debian-10.5.0-amd64-netinst.iso' and put it on a USB stick with the tool 'rufus'. Making a USB drive bootable is still not that easy and the tool of choice seems to shift over time; here, 'rufus' worked well.
Attach display, keyboard, mouse and the USB stick to the machine and power on.
Now work through the detailed setup. Select the system option 'Debian-4.19…' when asked. Use the 'generic' install with all drivers and allow all additional repositories. We chose not to allow automatic updates, as we wanted to make snapshots of our system before installing anything that could break our working setup. When asked for software, install only the SSH server and standard system utilities, but no X server.
Boot your system.
Installing Timeshift
Timeshift is a simple tool to make snapshots of your system while you experiment with driver installs. It will serve you well for making system snapshots and rolling back when things go wrong.
After login, this was the first thing we installed and it really saved us hours:
sudo add-apt-repository -y ppa:teejee2008/timeshift
sudo apt-get update
sudo apt-get install timeshift
sudo timeshift --create --comments "blank system"
You can always roll back to the created snapshot in a matter of seconds. Timeshift does not back up/restore the contents of your home directory (by default), which has its own pros and cons; know this and act accordingly.
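Rolling back later works from the same command-line interface; a sketch (the snapshot name is a placeholder, use one from your own '--list' output):

```shell
if command -v timeshift >/dev/null; then
    # show all snapshots Timeshift knows about
    sudo timeshift --list
    # restore the whole system to one of them (asks for confirmation):
    #   sudo timeshift --restore --snapshot '2020-08-15_10-00-01'
else
    echo "timeshift not installed"
fi
```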
Prepare installation of NVIDIA-Drivers
First, disable the 'nouveau' driver. In our first try without disabling it, we got an error stating that the "free nouveau kernel module is currently loaded and conflicts with the non-free nvidia kernel module"; this belongs to the xserver-xorg packages.
The 'nouveau' driver is an open-source driver able to handle NVidia cards, and it is installed by default.
Execute these steps to disable the driver. Create a new config file:
sudo nano /etc/modprobe.d/blacklist-nouveau.conf
The contents of this file should be:
blacklist nouveau
options nouveau modeset=0
Now execute these steps to enable the config and reboot:
sudo update-initramfs -u
sudo reboot
Install NVIDIA-Drivers
(This step might be omitted, as we later update to a newer driver; but this is the exact way it worked for us, so you might just do this step as well, OR give me a hint if it worked for you without it.)
First check your available versions:
sudo apt list -a nvidia-driver
For us, this gave the following output:
nvidia-driver/buster-backports 440.100-1~bpo10+1 amd64
nvidia-driver/stable 418.152.00-1 amd64
We installed the default ones from the ‘stable’ branch:
sudo apt-get -t stable install nvidia-driver
If you get a newer version in this step and run into errors later, try the exact version we used (the syntax is 'nvidia-driver=VERSION' for selecting a version).
Install newer kernel
This step installs some newer modules and drivers which especially enabled us later to get Tensorflow 2.3:
sudo apt install linux-headers-5.4.0-0.bpo.2-amd64
sudo apt install linux-image-5.4.0-0.bpo.2-amd64-unsigned
sudo reboot
Install CUDA
You can install CUDA from the backports, but we need the 'force-overwrite' option, as some existing libraries are not compatible and need to be overwritten:
sudo apt -t buster-backports -o Dpkg::Options::="--force-overwrite" install nvidia-cuda-toolkit
Now install cuDNN. You have to log in to NVidia to download it. Search through their archives for the version we used (cudnn-10.2, v7.6.5.32) and copy the archive to the machine. Some libraries also need to be copied after unpacking, which is done with the following commands:
tar -xzvf cudnn-10.2-linux-x64-v7.6.5.32.ga.tgz
sudo cp cuda/include/cudnn*.h /usr/include
sudo cp cuda/lib64/libcudnn* /usr/lib/x86_64-linux-gnu
sudo chmod a+r /usr/include/cudnn*.h /usr/lib/x86_64-linux-gnu/libcudnn*
Install Tensorflow
This is now as simple as one line. We install with ‘sudo’ as we later want to use these libraries with different users:
sudo python3 -m pip install tensorflow-gpu
Test
Now, everything should be running. Test for installed drivers and CUDA:
nvidia-smi
This should show you driver version 440.100 and CUDA version 10.2, which support compute capability 7.5, so you can run float16 GPU computations with Tensor-Core support.
Check if it computes on the GPU:
python3 -c "import tensorflow as tf; print(tf.reduce_sum([1000, 1000]))"
The last lines of the output should include something like '…physical GPU…'. That's it.
Installing PyTorch and FastAi
These are straightforward now. If you do not need them, you can skip this step:
sudo python3 -m pip install torch torchvision
sudo python3 -m pip install fastai
Install Edge-TPU drivers
We use the Edge-TPU card to run uint8-compiled models, which are extremely fast: a human pose detection within an image, for example, has an inference time of ~13ms.
After installing the card in the second M2-Slot of the mainboard, the device needs to be registered in Linux.
Have a look at the latest Google Coral instructions, as they add new features and new versions regularly. We used the steps described in the following section:
First, check whether an APEX kernel driver is already installed:
lsmod | grep apex
There should not be any.
Now, the PCIe drivers can be installed safely. Execute the following steps (install curl, if not yet on the system):
echo "deb https://packages.cloud.google.com/apt coral-edgetpu-stable main" | sudo tee /etc/apt/sources.list.d/coral-edgetpu.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt update
Install the PCIe drivers and the Edge-TPU runtime package:
sudo apt-get install gasket-dkms libedgetpu1-std
The next step is setting up an 'apex' group for the device and registering the group:
sudo sh -c "echo 'SUBSYSTEM==\"apex\", MODE=\"0660\", GROUP=\"apex\"' >> /etc/udev/rules.d/65-apex.rules"
sudo groupadd apex
Add all relevant users to the newly created group (use a user-name instead of $USER to add other users as well):
sudo adduser $USER apex
Reboot.
After restart, check if the device is recognized by the system:
lspci -x | grep 089a
There should be some output for a device which includes '089a'. Now check if the driver for the APEX device is loaded:
ls /dev/apex_0
If this device node exists, the device is correctly installed and we can install the Python modules for TFLite and the Edge-TPU.
First, check the Python version on the machine to be able to select the matching installer for TFLite:
python3 --version
Now install the package from the Coral repository accordingly. This should be the latest version, so check on
https://www.tensorflow.org/lite/guide/python
which one is the latest for your platform and Python version, and change the command below accordingly (I added the #, so you can't just copy-paste it by mistake):
#sudo python3 -m pip install https://github.com/google-coral/pycoral/releases/download/release-frogfish/tflite_runtime-2.5.0-cp37-cp37m-linux_x86_64.whl
Now we can get the Edge-TPU library:
sudo apt-get install python3-pycoral
The M.2 card does not come with a heatsink but relies on automatic thermal throttling. The current temperature can be checked with:
cat /sys/class/apex/apex_0/temp
We got ourselves a heatsink for an M.2 SSD and glued it onto the TPU card.
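With the runtime and Python packages in place, inference follows the standard tflite_runtime pattern from Google's Python quickstart. A minimal sketch; the model path is hypothetical, and any uint8 model compiled for the Edge-TPU should work the same way:

```python
# Minimal inference sketch for a model compiled with the Edge-TPU compiler.
# The model path passed in is hypothetical; the tflite_runtime calls are the
# standard ones for delegating a compiled model to the Edge-TPU.

def run_on_edgetpu(model_path, input_array):
    # imports live here so the sketch parses without the Coral packages
    import numpy as np
    from tflite_runtime.interpreter import Interpreter, load_delegate

    interpreter = Interpreter(
        model_path=model_path,
        experimental_delegates=[load_delegate("libedgetpu.so.1")])
    interpreter.allocate_tensors()

    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp["index"],
                           np.asarray(input_array, dtype=np.uint8))
    interpreter.invoke()
    return interpreter.get_tensor(out["index"])
```

Call it as e.g. run_on_edgetpu('model_edgetpu.tflite', frame) with a uint8 array shaped like the model's input.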
We also install the Edge-TPU compiler. Execute the following step:
sudo apt-get install edgetpu-compiler
Useful additional code-snippets for Tensorflow
Run multiple processes on the GPU
The default configuration does not allow working with multiple processes (and/or different users) concurrently on one GPU. Use the following Python snippets to enable this. We copy them into the top cells of the Jupyter notebooks we use, so that multiple notebooks can run on one GPU.
Enable memory growth, so not every process takes the full memory:
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)
Enable float16 mixed precision
Float16 calculation can be enabled by setting a policy in Tensorflow:
import tensorflow as tf

policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16')
tf.keras.mixed_precision.experimental.set_policy(policy)
Watch out: every layer will now run in float16 by default. Layers which need to run in high precision have to be set back to float32 manually, e.g. (assuming you add some layers to your sequential 'classmodel'):
classmodel.add(layers.Dense(4, activation='softmax', dtype='float32'))
That's it!
Have fun. This guide is far from perfect, but I hope it saves you some time figuring out your own way.






