This study presents a framework that combines Bayesian inference with reinforcement learning to guide drone-based sampling for methane source estimation. Synthetic gas concentration and wind observations are generated using a calibrated model derived from real-world drone measurements, providing a more representative testbed that captures atmospheric boundary layer variability. We compare three path-planning strategies: preplanned, myopic (short-sighted), and non-myopic (long-term). Non-myopic policies trained via deep reinforcement learning consistently yield more precise and accurate estimates of both source location and emission rate than the preplanned and myopic alternatives. We further investigate centralized multi-agent collaboration and observe performance comparable to that of independent agents in the tested single-source scenario. Our results suggest that effective source term estimation depends on correctly identifying the plume and obtaining low-noise concentration measurements within it. Precise localization further requires sampling in close proximity to the source, including slightly upwind of it. In more complex environments with multiple emission sources, multi-agent systems may offer advantages by enabling individual drones to specialize in tracking distinct plumes. These findings support the development of intelligent, data-driven sampling strategies for drone-based environmental monitoring, with potential applications in climate monitoring, emission inventories, and regulatory compliance.