Developing deep learning models for resource-constrained Internet-of-Things (IoT) devices is challenging, as it is difficult to achieve both good quality of results (QoR), such as DNN model inference accuracy, and quality of service (QoS), such as inference latency, throughput, and power consumption. Existing approaches typically separate the DNN model development step from its deployment on IoT devices, resulting in suboptimal solutions. In this paper, we first introduce a few interesting but counterintuitive observations about such a separate design approach, and empirically show why it may lead to suboptimal designs. Motivated by these observations, we then propose a novel and practical bi-directional co-design approach: a bottom-up DNN model design strategy together with a top-down flow for DNN accelerator design. It enables a joint optimization of both DNN models and their deployment configurations on IoT devices as represented as FPGAs. We demonstrate the effectiveness of the proposed co-design approach on a real-life object detection application using Pynq-Z1 embedded FPGA. Our method obtains the state-of-the-art results on both QoR with high accuracy (IoU) and QoS with high throughput (FPS) and high energy efficiency.