So for this defocus to work, do I need a render layer instead of an image layer? Is there no workaround for this?
Great question. You need a Render Layers node only if you're NOT rendering frames to external files. If you're rendering an animation on a render farm, for example, you will receive an image file for every frame in your animation. In that situation, use an Image node to bring those frames into the compositor.
In the latter case, the z-depth pass has to be embedded in the rendered frame files themselves, because the defocus needs the depth information alongside the image. EXR is the only multi-layer image format that supports embedding the depth channel this way.