Quantopian platform limitations: comments & questions

This post is an attempt to capture Quantopian platform limitations that end up scattered about the forum. This is meant to be a constructive approach, and my hope is that there will be a high degree of transparency on the part of Quantopian to shed light on the rationale behind the current state of the platform, and their plans for improving it. My thought is that users may have general recommendations or specific technical solutions that would be helpful.

My initial set:

  • Why is memory limited, for the separate cases of the research platform and the backtester? At its root, is it simply the economics of the business? Or are there other factors involved? Is there any plan to relieve the limitations? Would there be a way to allocate it on a per user basis (perhaps with the user paying the incremental cost above a base free level)?
  • At one point, there was talk of github-like integration for user content? This would seem to have numerous benefits. What is the status?
  • The transfer of generic data from the backtester to the research platform is limited to recorded variables. It would seem beneficial to be able to store and transfer data in a more general way (e.g. compressed or binary file). Is this something that could be added?
  • There is no way to store the results of analysis done in the research platform, making them available to the backtester/live trading platform. Is this feasible?

1) The limits to memory are driven by economics; we cannot afford unlimited memory. There is no magic number of dollars or memory size; the limit changes over time with the API, community needs, and our costs. You ask "Is there a plan to relieve the limitations?" That's like asking if we have a plan to end hunger, poverty, and achieve world peace! There will always be a limit of some sort. (Does unlimited memory even exist?) But yes, we plan to increase the power of the platform. Sometimes we increase the amount of memory available to algorithms. More commonly, we enable you to do more with the memory that you already have. Off the top of my head, here are a few things we've done:

  • The Pipeline API is one of the features we released to relieve the memory limits. Pipeline enables a huge amount of computation that wouldn't otherwise be possible on the platform (see the sketch after this list).
  • The Quantopian 2 API update earlier this year dramatically improved memory efficiency and increased the tradeable universe sizes.
  • The Q500US/Q1500US are pre-computed, freeing up computation.
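
To make that concrete, here's a minimal Pipeline sketch (the imports are the real ones; the factor choice is just illustrative):

    # A 20-day moving average factor, screened to the Q1500US, computed
    # daily before the open - the kind of cross-sectional work that used
    # to blow through memory when done by hand in handle_data.
    from quantopian.pipeline import Pipeline
    from quantopian.pipeline.data.builtin import USEquityPricing
    from quantopian.pipeline.factors import SimpleMovingAverage
    from quantopian.pipeline.filters import Q1500US
    from quantopian.algorithm import attach_pipeline, pipeline_output

    def initialize(context):
        sma = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=20)
        attach_pipeline(Pipeline(columns={'sma_20': sma}, screen=Q1500US()), 'example')

    def before_trading_start(context, data):
        context.pipe = pipeline_output('example')  # one row per screened security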

You can expect to continue to see improvements in the weeks, months, and years ahead. Platform performance is a high priority and is under active development in several areas.

As you suggest, one possible feature we could build is a more granular, per-person or per-algorithm memory limit. And, we could also presumably implement a payment system to defray the costs of the computation. We don't have any immediate plans to add either of these features, but they might become priorities in the future. As you know, feature development is not free. We have other features that we judge to be higher value, particularly because they benefit everyone in the community. That certainly might change in the future.

2) We have considered, and had suggested to us, many different implementations of code management, including a GitHub integration. We don't have any ongoing development in this area, and we haven't settled on any particular paradigm or solution, either. I don't have a timeline for when we will work on this area.

3) Transfer from backtester to research is a new request to me. Research has access to your backtests and live trading, including positions, orders, transactions, and recorded variables. It also has access to the same raw data that the backtester does. I can't recall a request for additional data transfer in that direction. What data are you trying to transfer?

4) It definitely is feasible, but we obviously haven't built it yet. One of the challenges would be doing it in a way that protects you from look-ahead bias. The research environment isn't event driven, and if you're not careful you can pass the future into your backtest and start fooling yourself.

Workflow
Your questions, particularly 2 and 4, fall into a category we think of as "workflow." Quantopian started as a backtester; later we added live trading, and then the research environment. They are not as tightly integrated as we'd like. When we watch quants work on the platform, they spend a lot of time in research trying to find some alpha. After they get something they like, they move to the IDE for backtesting and live trading. And then they move back to research to use pyfolio, generate tearsheets, and analyze performance. That flow today is far clunkier than I'd like. Concepts like saved code, shared code, version management, etc. are all possible parts of the workflow solution. I don't personally have a clear vision yet on how to put that all together. It will be a big project when we finally tackle it.

Priorities
I've said in a few places here "that's not a current priority." Implicit in your post is the question, "what is your current priority?" Things we're working on today:

  • Futures backtesting is in private alpha. Work on futures is ongoing.
  • Data. We're always working on adding more data. We're also doing a project on making the corporate fundamentals queries faster.
  • Test Harness. Before we ship code to production, we put it through an extensive test harness run to make sure we're not affecting any algorithms trading real money. The test harness needs care and feeding, and it's getting an upgrade.
  • Prime broker. We're getting ready to trade external money next year, and we're working on the plumbing necessary to make that happen.
  • We always have some number of people working on bugs, stability improvements, security patches, and other smaller-scale projects.

That list is just for our development team. We have other lists for our education, research, and investment teams, but they aren't as relevant to your questions.


Thanks Dan - by "relieve" I meant increase but not to infinity. In some sense, memory is unlimited for the backtester, since there is no limit on the number of backtests that can be running in parallel at any given time by a user, but the memory is spread across those N backtests. This, of course, is not the case for the research platform, where 4 GB is the limit; no "parallel processing" as with the backtester.

More feedback later, but I'm wondering for the ML stuff if being able to read from/write to disk as a backtest is running would help? Disk space is almost free, I'd think, but maybe the latency is a killer, if it is not local, as it is on a desktop workstation.

Transfer from backtester to research is a new request to me. Research has access to your backtests and live trading, including positions, orders, transactions, and recorded variables. It also has access to the same raw data that the backtester does. I can't recall a request for additional data transfer in that direction. What data are you trying to transfer?

One application would be to store intermediate results for post-processing in the research platform. For example, one of my algos calculates a set of parameters over an expanding trailing window of data. I could write the data to disk, and then pull them up in the research platform to better understand the behavior. Another application would be to capture algo settings. For example, say I were to run 10 backtests, with 10 different sets of settings. Having those settings within the get_backtest data set would be handy. Recorded variables could be used, but they are a kludge and very limited.
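
To make the kludge concrete, here's roughly what I mean (a sketch; the parameter names are made up):

    # In the algorithm: stash the run settings as recorded variables.
    # 'lookback' and 'n_positions' are made-up example parameters.
    def initialize(context):
        context.lookback = 20
        context.n_positions = 100

    def handle_data(context, data):
        record(lookback=context.lookback, n_positions=context.n_positions)

    # Later, in the research platform:
    #   bt = get_backtest('<backtest id>')
    #   settings = bt.recorded_vars[['lookback', 'n_positions']].iloc[-1]

It works, but you burn recorded-variable slots on constants, and anything that isn't a scalar time series is out.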

Perhaps other users can think of use cases, as well.

Another angle on this might be to make it easy to run a backtest from within the research platform, but with the added feature of retaining context after the backtest ends (which I guess you know how to persist, at least for live trading). If one could simply copy backtest code into the research platform and run it there (without changes), it could be a nice, integrated approach to the problem (versus having to copy-paste the backtest ID and run get_backtest).

Regarding your "Workflow" comments, if you have access, I'd have a look at what quants do and the tools used by them at hedge funds and related financial institutions. One difference might be that you are looking for everything in one person, versus the division of labor that I suspect exists at larger operations (i.e. the workflow is spread across many people/departments). Not everyone needs an expensive workstation on their desk, and one can probably sort out where to deploy the high-end resources based on roles and abilities. Whereas for Quantopian, your present paradigm is that each "manager" researches, writes, and deploys the entire strategy, soup-to-nuts (and then presumably reaps a larger chunk of the benefit, than at a traditional institution).

The limits to memory are driven by economics; we cannot afford unlimited memory.

I guess part of the problem is that your approach is to tightly integrate the backtester with the pretend-/real-money trading platform, which is a good thing in some respects. If my backtest runs, then I can just click a button to deploy, with a high degree of assurance that the algo will run live. And there is no limit to the number of live algos I could launch. So, if a significant fraction of your 100,000 users start launching live algos, you'll have a lot of memory tied up that you'll have to pay for. Since very few of them will make money for your fund, and it is not clear if you are profiting otherwise, the economics may not work out. It seems that part of the calculus here depends critically on how you plan to make money and/or pump up your valuation. For the limited few who have a decent shot at getting allocations, it would seem like a slam dunk to give them all the resources they can consume. But if you provide a turbo-charged platform to the masses, without a way to pay for it, then you won't have a very good story to tell your investors. In the end, you are probably approaching things correctly, but without additional contextual details, it is hard to draw a definitive conclusion.

  • Having the ML process write to disk is definitely a path we could pursue. I'll make sure it's under consideration as we work on expanding our ML capabilities.
  • The idea of running backtests with different parameters and evaluating them falls squarely in the "workflow" domain. There are a lot of different use cases in there, and I don't personally have a vision yet on how to put it all together. It's quite a product design challenge.
  • Your most recent post can be restated in a way that matches our business proposition. Quantopian has an innovative approach to finding quant talent. Most of the existing talent searches try to guess who has talent, and then hire them. Quantopian is teaching thousands of people to become quants, and the talent proves itself. It means that some of our community members using resources will never actually provide a profitable algorithm; we embrace that, and we embrace them.

Thanks Dan -

Some more feedback:

  • Not sure how you'd categorize it (workflow?), but a lot of the comments I and others have made over the years basically say "Why can't we have something that more closely emulates a desktop computing experience?" It ranges from organizational and navigational stuff like folders, notes, searches, etc., to software engineering best practices like revision control and libraries, to flexibility in software tools (language/editors/debuggers), to wanting more computing resources (e.g. memory, GPU), and so on. Other sites solve this by simply putting the problem out there, with downloadable data (e.g. Numerai, Kaggle). As a general approach, this is a non-starter for Quantopian, since, as I understand it, it would go against some of your licensing agreements (but you've cracked the problem in some respects, by offering Zipline). I don't have the answer, but I'm not sure your present platform is long-term workable given the scope of work you are asking the crowd to do (e.g. A Professional Quant Equity Workflow). What if you could virtualize a desktop workstation experience for users?
  • It would be nice to be able to call the backtesting engine from the research platform. I'm imagining something like backtest = run_backtest(algo_id) (see the sketch after this list). One use case would be the ability to launch multiple backtests in parallel, and then pull the results into the research platform seamlessly. One could then perform sensitivity analysis (or do a really good job over-fitting!), for example, with the algo picking a different set of random parameters each time it is called (if only a few parameters are studied, the existing record could be used, but it would be better to allow more generic data to flow to the research platform, as we discuss above). To keep things from getting out of hand and breaking the bank (or crashing your system), you'd probably have to put a regulator on the total number of backtests that could be launched/run at any given time.
  • Another angle on supporting high performance computing would be to do it for users en masse, and then supply the results. For example, users could submit proposals (and perhaps code), you'd evaluate them, run the interesting jobs, and supply the results to everyone (I'm thinking in terms of data sets, such as specialized point-in-time universes that could then be used in factors/algos). It sounds pretty high-touch, but I have to think that there are plenty of candidate egg-headed college interns wandering the streets of Cambridge/Boston who would jump at the opportunity. Or maybe you could convince a professor to give it a go (who would assign it to a graduate student, no doubt), if publications might result.
  • One thought for a new, unique data set would be to use the trading data from your own users. This is a very fuzzy thought at this point, and obviously you would have to stay on the right side of regulations, and not undermine their profitability. There may be some information in the data, though, that could be supplied to users, along with your more conventional data sets.
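
On the run_backtest idea above, the kind of thing I'm imagining (purely hypothetical; no such API exists on Quantopian today):

    # Hypothetical parameter sweep from the research platform.
    # run_backtest() and its params argument are made up for illustration.
    results = {}
    for lookback in [10, 20, 40, 80]:
        bt = run_backtest(algo_id, params={'lookback': lookback})  # hypothetical
        results[lookback] = bt.cumulative_performance

One could then compare the equity curves across settings, or feed each result to pyfolio.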

Just to add my 2 cents, I usually face the platform limitations (memory or timeout errors) when one of my algorithms has all the following characteristics:
- a large universe ( >300 securities)
- a large portfolio ( >200 positions)
- many pipeline factors ( >10 morningstar fields) or few fundamentals and some technical factors with a large window_length

This is unfortunate as an algorithm that satisfies Q hedge fund requirements probably needs all the above, so those limits affect the Q hedge fund in some way.

I'd add to Luca's comments above that the browser-based backtester doesn't scale very well. As extensive backtests are run (or loaded after they are run), the amount of local memory grows, and there can be glitchiness in the browser/pc performance. One approach would be to give users the option of rendering backtest output in a non-interactive image/video format (probably some standard streaming thingy out there that could be used). I have to think, though, that the Q folks are looking to a Jupyter notebook integration of the backtester, versus continuing to polish up the fully custom backtesting interface.

@ Dan -

Regarding your "business proposition" it is not so clear, particularly with regard to the retail side of the your business (the "free" trading platform). You recently secured $25M in additional funding, so I'm guessing this was part of the discussion. On Quantopian Milestones November 2016, Fawce alludes to 1M members, but the idea that your plan is to teach all of them to be quants for the sole purpose of building up the crowd-sourced hedge fund sounds a little too innovative to me (maybe the venture capital world bought it?). Could you elaborate a bit on this topic? It is germane to the intent of this thread, since obviously your business trajectory is the base driver.

Luca - your feedback is on point, thank you. We completely agree that we need to keep advancing the platform to better support algorithms of the type you describe. When you have a problematic example you can share with us, please let us know at [email protected]. We like building a library of test cases that we can use during development.

Grant - Starting with your post yesterday:

  • Your first bullet touches more than one idea, and I'd like to tease them apart some.
    • The desktop computing experience is one design metaphor that we could move toward. One could imagine an online desktop, apps that you click on and interact with, a file system to store and pass things around. It's not a slam-dunk choice, though. It leaves a lot of questions unanswered: how does version control actually work? How broad, or narrow, is the file system support? There also isn't a great track record for websites that have adopted this metaphor. Browser-based interaction is different from desktop interaction, and neither of them generally feels "right" when you force one on the other. I won't rule it out as an option, but I'd like to find a better one.
    • I think that Kaggle and Numerai are solving a different problem. You note that Quantopian makes far more data available than they do, but that is just the tip of the iceberg. Those sites aren't providing education, or a research environment, or helper functions like pipelines and optimizers, or market execution, etc. I'm sure there are things to learn from what they do, but their implementation isn't a solution for the broad education, platform, and execution that we're offering.
    • I agree that our platform today can't do everything we want it to. We're not done! We have so many improvements and advancements coming that sometimes it feels like we just started. I don't share your fear that it's not "long-term workable." It's a versatile and extensible platform that will continue to support what we hope to do.
  • Running backtests in research is also in the wheelhouse of "workflow." As noted previously, we agree we need to make the tools smoother and more tightly integrated.
  • Your idea of having submitted proposals is interesting. It could be used to cut costs. There are two downsides, though, that I see. First, it would immensely lengthen the feedback cycle to authors. Think how much slower the learning cycle would be if you had to wait days or weeks to see how your idea looked - it would be like going back to punchcards and a mainframe. There's a lot of value in a fast feedback loop. Second, it creates a new gatekeeper. If someone is researching something innovative, the gatekeeper might not be able to see it. If we were to ever hit a brick wall in our progress on the platform, perhaps we'd have to fall back to something like this. But for now I'd much rather keep improving the platform that we have.

Having read through this thread again, the point that I'd like to emphasize is that the platform is getting better all the time. We have a proven track record of making the platform stronger, and we continue to invest and improve every day. The platform limitations that you find today are different than the ones you found last year, and the limits will be different again next year. I know the limits today are periodically frustrating - to you, to me, to everyone else. But I take comfort and pride in knowing that we knock down limitations frequently - and then we find the next limit.

On the browser-backed backtester, I think the evolution will be towards pyfolio and tearsheets. We want to provide powerful analytics tools that assist authors as much as possible. It's another aspect of the workflow we need to integrate better.

On the business question, the key is to understand the reinforcing cycle that we're building. New quants write new strategies. New strategies mean the creation of additional investing capacity. New investing capacity permits the addition of external capital from investors. New investor capital (and successful strategies) leads to more and higher payments to quants. New and higher payments to quants leads to recruiting more quants. New quants write new strategies, and we start the cycle all over. That cycle is what convinced our venture investors to invest in Quantopian, and is one of the reasons why Steve Cohen wants Quantopian to manage some of his capital.

When you look at it that way, it makes perfect sense that we're aiming to 10X our community of authors, and more. There is a lot of capital in the world that needs investment advice. There are a lot of smart people out there who haven't given Quantopian a try yet. Our search for talent, our education of authors, is ongoing. We're going to continue to invest in that, too, and that means we'll be paying for the computation for a few crappy backtests along the way. We will gladly make that investment in the future.

Thanks Dan,

Your responses have been thoughtful and thorough. I'll see what I can add later; it would be interesting to hear from other users, as well.

I don't know if there's anything to be had, but my thought on the high-performance computing (HPC) would be to generate data sets that would be shared across all users. I don't know if it has any merit, but it would be a natural progression of the Q500US/Q1500US universes to offer universes that have been derived from more extensive number crunching. For example, see http://scikit-learn.org/stable/auto_examples/applications/plot_stock_market.html for a clustering example. One could imagine running something like it point-in-time on the entire Q500US/Q1500US. Again, I'm out of my league here, but there might be some generic HPC stuff that would just require grabbing your existing data sets and letting a computer do the work to tease out relationships/categories.
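
Very roughly, along the lines of the linked example (a sketch; assumes a prices DataFrame of daily prices for the universe, pulled in the research environment):

    # Point-in-time clustering sketch, following the scikit-learn stock
    # market example: sparse inverse covariance + affinity propagation.
    from sklearn import cluster, covariance

    returns = prices.pct_change().dropna()
    X = returns.values / returns.values.std(axis=0)  # variance-normalize
    edge_model = covariance.GraphLassoCV()
    edge_model.fit(X)
    _, labels = cluster.affinity_propagation(edge_model.covariance_)
    # labels[i] is the cluster id of the i-th security for this window

Run over a rolling window, that would give the kind of point-in-time category data set I have in mind.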

By the way, speaking of punch cards (which I thankfully avoided), I'm finishing The Innovators by Walter Isaacson. Really good book to put your platform into a technological and historical perspective. Thank goodness for the invention of the transistor! Imagine running your platform with vacuum tubes.

Thanks for this post and responses!
Are convenient features like adding folders to the algorithms list (so that I can more easily group my stuff) or adding a clone button to the list of algorithms so that I can fork them a bit more conveniently, even considered at this point?
I would imagine these and similar things are relatively easy to develop and deploy, and although they don't directly enable me to do more stuff, they improve productivity and thus support our work here...

On the browser-backed backtester, I think the evolution will be towards pyfolio and tearsheets. We want to provide powerful analytics tools that assist authors as much as possible. It's another aspect of the workflow we need to integrate better.

One comment here is that I would avoid going to a workflow that runs backtests entirely in the background. It is very helpful to be able to see the equity curve (and some of the statistics, e.g. beta) versus time. I can see how you would enable a backtest as a callable function within the research platform (I guess this can be done kinda/sorta today), but if it runs in the background, it won't provide the kind of immediate feedback that is helpful in a variety of ways. So, you'd need to understand how to adapt the Jupyter/IPython notebook to provide updates as computations are carried out (some sort of lightweight buffered image/video streaming format, perhaps?). And of course there would need to be a way to kill a backtest to free up resources (assuming that you don't keep the "run as many as you'd like in parallel" paradigm you have now).

Regarding your global million-strong quant army vision, if it is solely geared toward your envisioned $10B hedge fund concept, I'd continue to do some head-scratching on that point (even if your funders/customers are drinking the Kool-Aid...nice job if you actually convinced them, by the way). Assuming you have no other means to profit off of the user base, it is not obvious that the platform scale and support required will justify the profit. Perhaps you have fit to a recent trend and can portend the future, but being able to harness a million users, asking them all to plug into A Professional Quant Equity Workflow seems a bit too hopeful. That said, there may be innovative ways that a large crowd can participate and can be compensated for their micro-contributions, versus the idea that a large fraction of them will be signing contracts as "managers" for the fund. Trying to apply what works within a traditional hedge fund may not scale to the masses. The innovative thing would be to sort out what would scale, that nobody has ever tried. That's not to say you won't get somewhere with the current approach, but maybe there are parallel approaches that could be considered, with which you could achieve 100% crowd participation.

I think that Kaggle and Numerai are solving a different problem. You note that Quantopian makes far more data available than they do, but that is just the tip of the iceberg. Those sites aren't providing education, or a research environment, or helper functions like pipelines and optimizers, or market execution, etc. I'm sure there are things to learn from what they do, but their implementation isn't a solution for the broad education, platform, and execution that we're offering.

Not sure what the answer might be, but you could have your cake and eat it too if you could de-couple, in some fashion, access to the data from the platform on which it can be processed (which could still include the option of using yours). Apparently, Numerai takes advantage of progress in encryption technology to be able to share data that otherwise would be protected under copyright/licensing. Again, out of my intellectual league, but the claim is "you can run machine learning algorithms on encrypted data" (see https://medium.com/numerai/encrypted-data-for-efficient-markets-fffbe9743ba8#.xj4uljmop ). So, you could imagine a Quantopian API that would allow access to such data, along with a way to plug the results back into your platform. Unless the Numerai folks have some secret sauce or have everything locked up in patents, it might be something to consider for the future.

On the browser-backed backtester, I think the evolution will be towards pyfolio and tearsheets.

On this point, pyfolio/tearsheets are great, and a lot of information can be captured, but I find it challenging to answer the question "Am I done? Would this have any shot at all in getting an allocation, supposing I haven't 'over-fit' (committed one or more quant sins) and I can show N months of consistent out-of-sample performance?" You could say, "Enter the contest and see how you rank!" but there is strong evidence that over-fitting is pervasive (see paper), and that the final out-of-sample ranking may not be strongly indicative of the applicability of a given algo to your fund.

What's missing, I think, is more specific, immediate, summary feedback, relative to what you are looking to add to the fund in the next 6-12 months. There is a lot of information, but nothing that says "Dude, this looks reasonable. Paper trade it for 6 months, and if the trend continues, you have a decent shot at an allocation."


Hi Dan -

Not sure if you saw it, but Luca published a way of running pipeline in chunks:

https://www.quantopian.com/posts/run-pipeline-in-chunks-and-2-bugs

I haven't tried it yet, but it seems pretty useful (my understanding is that it overcomes some limitations of alphalens, for example).

What happened on your end? Scott S. took a look at the idea, but will it be re-visited? My narrow interest is that I'm thinking of finally using alphalens, but my understanding is that one can't do extensive work (larger portfolios, long timeframes) unless Luca's work-around is used. Could a supported version of the chunking be implemented?
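
As I understand the gist, it amounts to something like this in research (untested sketch):

    # Untested sketch of chunked pipeline runs in the research environment,
    # stitched together with pandas; the chunk length is arbitrary.
    import pandas as pd
    from quantopian.research import run_pipeline

    def run_pipeline_chunked(pipe, start, end, freq='126D'):
        bounds = pd.date_range(start, end, freq=freq).tolist() + [pd.Timestamp(end)]
        pieces = [run_pipeline(pipe, s, e) for s, e in zip(bounds[:-1], bounds[1:])]
        result = pd.concat(pieces)
        return result[~result.index.duplicated()]  # boundary dates appear twice

A supported version could presumably manage memory between chunks much better than this.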

Another thought for the idea hopper would be to add a daily VWAP field to pipeline, computed from your minute bar database. For example, see:

https://www.quantopian.com/posts/vwap-are-there-any-plans-to-fix-this

Note that this is not a request to run pipeline on minute bars, but rather simply to include a new point-in-time data set, which you (or your data vendor) can compute from existing minutely price and volume data. The idea is to have a more representative price for the day than the close.

My impression is that the equity long-short algos that you are requesting from the crowd are kinda slow-moving large portfolio asset allocation type strategies, and so intuitively, it makes sense to beat down the noise with something like a daily VWAP, versus using a single set of OHLC trades per day.

Hey guys, wishing you the best for 2017!

As I am pretty new here, I was wondering if there has ever been any talk of creating some sort of chatroom / live discussions to encourage collaboration amongst the members?

@ Nicolas -

Yeah, that's a tricky one, since my sense is that very few, if any, serious users (trading their own money, or trying to get an allocation, or whatever) will collaborate, beyond perhaps sharing "tips and tricks". Also, there are already full-featured sites like Slack that are geared toward this sort of thing. It is probably a disadvantage of the crowd-sourced hedge fund concept, though, since a lot of innovation comes from collaboration, which I expect occurs naturally when you have traditional hedge fund R&D teams all under the same roof, within earshot of one another. Quantopian is kinda relying on the lone genius (or someone who picked up ideas from work/consulting, in collaboration with others, and is free from any NDAs/conflicts of interest, etc.).

Hi Dan,

Another potential improvement (or perhaps just a matter of clarification/documentation) would be to lay out the process for getting a given algo evaluated for the Q fund and the feedback one might expect. Perhaps you aren't there yet, but if you are aiming for 1M members and $10B in capital (multiplied times 6X leverage, so $60B total?), you'll need something automated, I would think. Even at 100K members, you could get overwhelmed. Personally, I have a set of algos that have been running on Quantopian paper trading for over 6 months. If I look on https://www.quantopian.com/fund , it clearly says:

We evaluate your algorithms, and selected authors receive an offer to license their algorithm.

But how does one know if a given algo has been evaluated? And the outcome of the evaluation? Is it simply a matter of sending in the links to the live-trading algos to Quantopian support, and you'll take it from there?

Hi Dan -

Could you comment on the idea of adding daily VWAP to pipeline (see above and other comments on the forum)? It would seem very straightforward and beneficial. Are there specific technical impediments? Or is it just a matter of resources and priorities at this point?

I think the idea would be to add V*(H+L+C)/3 on a daily basis, where the HLCV values would be from the minute bar database. At first glance, the required software "plumbing" would appear trivial.
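
For concreteness, I'm imagining something like this inside an algo (a sketch of the estimate, not a true trade-level VWAP):

    # Volume-weighted "typical price" over the day's 390 minute bars.
    def minute_bar_vwap(data, asset):
        bars = data.history(asset, ['high', 'low', 'close', 'volume'], 390, '1m')
        typical = (bars['high'] + bars['low'] + bars['close']) / 3.0
        return (typical * bars['volume']).sum() / bars['volume'].sum()

The ask is just to have that pre-computed daily, as a pipeline data set.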

Thanks,

Grant

On VWAP - there are no technical impediments, it's a priority thing. We definitely think it's a good idea, and we want to add it. Digging a layer deeper, our research and interviews are clear that we don't want to do a shortcut calculation like V*(H+L+C)/3, either on a daily bar or even on minutely bars. The results you get using those methods are too often significantly different from a true VWAP. The good news is that our data source, NxCore, gives us every trade. That means we can calculate a VWAP for every minute, and those minutely VWAPs can be aggregated into a daily bar. We will add the feature at some point. It isn't currently being worked on, but it's moderately high on our list.

Looking through some of the other comments I haven't gotten to:

You make a point about making sure that you can see backtests in progress in some way. I think that is one of the considerations of what will make a good workflow, integrating the research activities, the code authorship of the IDE, and the backtest analysis of pyfolio. It's quite a design challenge.

On the question of the million-quant army, you describe our thesis as "a bit too hopeful." I'm quite confident that we can execute on it. Time will tell, of course.

On the question of "am I done?", I agree, we need to do a better job at giving guidance about algo performance. As I mentioned in a previous conversation we had, it's actually a very difficult thing for us to convey. What we're looking for is a range of things. An algorithm might be outside the norm on criteria X, but be amazing at Z and Y, and that is interesting to us. There are some criteria we can't bend on, but other than that, we need to keep the variety flowing. We need uncorrelated returns; if we over-specify what we want, the correlation will become prohibitively high. I don't have an automated way yet to give you this information.

There is a little-used Slack channel created by some community members. It's not owned by Quantopian. We haven't invested a lot in chat, but maybe we should: quantopianusers.slack.com

Hello Dan -

Thanks for the feedback regarding daily VWAP. I knew about your use of NxCore for live trading: as I understand it, you take in trades from NxCore and output minute bars in real time. I was not aware that you have the NxCore trades back to 2002 - correct? If so, I see that you could do a true VWAP calculation using those data, and update it in real time by modifying your "ingestor" code (the real-time code that takes in trades and spits out OHLCV minute bars). For intraday trading, I can understand that you'd want to do the VWAP computation in this fashion. However, for pipeline, which operates on trailing daily data, it would seem that sum(V*(H+L+C)/3) over minute bars (normalized by total volume) would suffice to provide a decent representation of the price for the day, and would be easy to do (with no modification to your real-time ingestor). It just seems like a much more tractable project than computing a true VWAP supplied as a minutely feed. And naively, I would think that for the kind of pipeline-based long-short workflow you are promoting, from a practical standpoint, the VWAP estimated from minute bars would be equivalent to a true VWAP computed from individual trades.

The issue of the requirements for an allocation, and providing feedback to users, is important, I think. The broad requirements are fine, I suppose, and probably pretty vanilla for the industry. The workflow makes sense. As you've explained, all algos with backtests in your system are evaluated, and then passed on to your R&D team if certain criteria are met. Then, they are looked at more closely, including their correlation with other algos you are considering. However, in all of this, no feedback flows back to the author, unless more information is requested from the author to incorporate into your decision-making process (which was standard practice a while back), which would give the author a clue that his algo may have some merit. From a user perspective, the process is a total black hole. Basically, all one can do is wait 6 months for an e-mail from Quantopian, and if one doesn't come, assume that the algo was passed over--there is no feedback.

While I appreciate that you need to cobble together a fund that is more than the sum of its parts (by selecting uncorrelated return streams), from an author perspective, the "Low Correlation to Peers" requirement is difficult, if not impossible, to assess. Perhaps for seasoned industry insiders, it is known where to look (and not to look) for sources of untapped, uncorrelated alpha, but for novices, it is kinda ill-defined and daunting. This requirement is coupled with the "Strategic Intent" one, which suggests that an author will be expected to explain his strategy, providing a kind of theoretical basis (assuming that a scalable, uncorrelated strategy was found that meets the risk management criteria, as well). You are asking for a lot, and without a process for definite, specific feedback to users, it feels like more of a game of chance than an R&D effort.

The other issue is that the requirements have been changing, pretty much constantly, since the fund concept was introduced in fall 2014 (I see that you've fleshed them out a bit recently). Presumably, if one writes an algo today, it will be judged on the requirements 6 months from now, and not on the requirements when it was written. Presumably the requirements are stabilizing, but history suggests that they could be a bit of a moving target.

I have to think that for quants working within a traditional hedge fund, there is some sort of feedback within the R&D cycle. One would not just put up a bunch of requirements, provide some tools, and not provide feedback to the R&D team.

Perhaps my sentiment on the fund requirements and feedback is unique. It would be interesting to hear other constructive perspectives.

Hi Dan,

If you do have the complete historical tick data from NxCore, then another thought is that you could offer an API that would allow users to derive their own custom daily signals for pipeline (which could include VWAP). All of the processing would be done outside of the trading day, and pipeline wouldn't require any substantive changes, since the signals would be daily.

Digging a layer deeper, our research and interviews are clear that we don't want to do a shortcut calculation like V*(H+L+C)/3, either on a daily bar or even on minutely bars. The results you get using those methods are too often significantly different from a true VWAP.

It would be interesting to understand better why it is considered a no-no to use the shortcut calculation. Was this only in the context of intraday trading (where I can see it might make a difference)?

I'd like to highlight what appears to be a significant limitation of the fetcher function for a seemingly simple use case.

I have a custom universe of approximately 450-550 symbols. I ran my calculations outside of the Q platform, so what I have is a CSV file with the index as dates and the columns as c0, c1, c2...cN. Each row represents a portfolio of securities to be traded that day.

I figured it should be straightforward to test the performance of the strategy using fetcher. However, this is not the case. Here is what I have tried thus far:

  1. Hard coding the symbol universe in Initialize using the symbols() function. This does not work because of the max 255 symbol limit imposed by Quantopian.
  2. Building a list of raw ticker symbols retrieved from my CSV using fetcher, and then dynamically creating a symbol() instance from the tickers (sketched below). This does not work because both symbols() and symbol() require string literals as arguments.
  3. Same as number 2 except I tried both *list_obj and *tuple_obj unpacking into the symbols() function and that failed too.
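
In code, attempt 2 looked roughly like this (a sketch; the CSV URL and helper are made up):

    # Sketch of attempt 2: build symbols dynamically from fetched tickers.
    # todays_tickers() is a made-up helper standing in for the CSV parsing.
    def initialize(context):
        fetch_csv('https://example.com/universe.csv',  # placeholder URL
                  date_column='date')

    def handle_data(context, data):
        tickers = todays_tickers(context)       # e.g. ['SPY', 'QQQ', ...]
        assets = [symbol(t) for t in tickers]   # fails: symbol() wants a string literal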

Maybe I'm missing a simple workaround but I'm kind of shocked that this use case seems so convoluted/constrained. Anyone who has solved this issue please advise. Thank you.

Grant - On VWAP, the advice we've received is that estimates like sum(V*(H+L+C)/3) (over minute bars) do not suffice. My experiments with the data bear this out. I find it helpful to think about this in physical terms. VWAP is about the "center of gravity." High/Low/Close are dimensions on the object. The object is not even close to homogeneous - it has clusters and masses throughout the space. The middle of the dimensions is generally quite different from the center of mass.

I will agree with you, yet again, that providing feedback to authors about their algorithms' performance relative to allocation selection is something that we need to be better at.

Brian - thanks for the feedback. Fetcher does have some real limitations. I think Ernesto helped you out with a workaround in the support system.

Dan - Ernesto is trying to find a workaround but so far we have not identified a solution.

To reiterate the problem: after I have imported my list of symbols from fetcher, how do I create orders using the imported list of symbols when both the symbol() and symbols() methods require string literals?

Hard coding the symbols is not an option because my ETF universe is ~450 symbols when the limit is ~250. Another possibility is that there may be some fundamental ETF flag that I could use to filter the default Quantopian universe but I didn't find anything searching the community forums.

Hi Dan,

Regarding the VWAP question, I'm just not getting it, yet. Say I want the average price for the day. Well, I would just grab 390 samples, acquired at 1 Sa/minute, and average them (or I could use 2X390 points, by using OC, or 4x390 points, by using OHLC). Now, say I wanted to incorporate volume. I don't actually have the volume for each of those price samples, but I do have the smoothed volume over the day, since I have the sum of the volume every minute. Doing a volume-weighted average, using the smoothed volume, intuitively, should work reasonably well, if I'm simply trying to assign a representative price for the day. But maybe I'm thinking in terms of some ideal case; perhaps for thinly traded stocks, or whacky/skewy/volatile situations, it falls apart.

Regarding your comment "we don't want to do a shortcut calculation like V*(H+L+C)/3, either on a daily bar or even on minutely bars" I would point out that you already do it with the AverageDollarVolume pipeline factor on daily bars (using daily closing prices, presumably). I guess you are saying that it is a flawed approach, since the volume is the sum over the whole day, but you are using the closing price? But you still released it, implying it is useful? Why would it be o.k. in the case of daily bars, but not minute bars? I'm confused.

I guess you are saying that since you have, or can get, the tick data, your long-term plan is to do the ideal VWAP computation (perhaps both on daily, and real-time minutely time scales), rather than an intermediate kludge. I suppose that makes sense.

I also have come across a limitation and would appreciate any comments/suggestions:

In a backtest it's possible to have an order_target_percent of 1.0 when switching between two securities (market orders) at 8:31 am, but this is not possible when live trading (IB). From my discussion with IB, there are any number of reasons why an order_target_percent < 1.0 is required for market orders. IB have told me that the credit system works to add an additional % requirement for market orders, and the actual value also depends on the liquidity, movement, and bid/ask spread of the two securities that are being switched.

So although an order_target_percent of 1.0 works in the backtest, this is not the case for live trading. The problem is IB can't tell you what ratio to practically use. Your order will get rejected at 1.0, so what ratio do you use? 0.90 or 0.60? This choice can dramatically impact long-term returns. It comes down to a process of trial and error - but this becomes a problem when your algo trades infrequently. So it would be nice if these limitations were actually highlighted in the backtest. It would be best if you could only backtest what was practically possible in live trading.

Of course, I am only a beginner - so if there is some code which actually allows you in practice to switch securities at 1.0 using market orders at 8:31 without having these problems, please let me know! :-)

Stephanie -

Do you have a margin account at IB?

Stephanie,

I have found that order_target_percent should not be used in any sort of live trading because it doesn't respect the leverage and cash sequence of sell order -> filled/cleared -> buy order -> filled/cleared. I would advise using only the order() function if you're planning on live trading. You will need to execute sells, wait for no open orders, then calculate how many shares you can buy of your new stock using available cash, then execute the order.

Hi Luke. That sounds very helpful! Do you have any sample code that would represent a 100% switch from one security to another? Tx
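
Roughly this pattern (an untested sketch; context.old / context.new are placeholders, and you'd gate or schedule it as appropriate):

    # Sell -> wait for fills -> buy from settled cash.
    def handle_data(context, data):
        if get_open_orders():
            return  # wait until prior orders have filled/cleared

        held = context.portfolio.positions[context.old].amount
        if held != 0:
            order(context.old, -held)          # step 1: liquidate the old position
        elif context.portfolio.positions[context.new].amount == 0:
            price = data.current(context.new, 'price')
            shares = int(context.portfolio.cash / price)
            if shares > 0:
                order(context.new, shares)     # step 2: size the buy from cash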

Hi Dan,

Memory management is very tricky within the backtester. I'm wondering if, upon termination of a backtest (either error or full completion), a report could be output, showing the memory usage, at some helpful level of detail? Something like:

set_system_report()  

I understand you can't whitelist the usual tools to sort out things like memory problems, but perhaps you could support just spitting it all out at the end?

Unfortunately, it's not that easy. Memory usage is highly variable over the duration of a backtest as some data is loaded and released (like prices) while other data is kept more persistently (like orders and transactions).

I restate that we will continue to increase the power of the platform. That takes many forms. Bigger machine limits, more efficient code, improved APIs, and memory management tools are all avenues we'll exploit. I refer you to my previous comments on this theme, higher in the thread.

Thanks Dan...just a thought I personally hadn't had before: you could allow requests for system info, reported at the end (or even as an algo runs), while still blocking direct access by users to keep things under wraps from a security standpoint.

Hi Dan -

Rob's post on short-term mean reversion reminded me of a topic related to our daily VWAP discussion above. Rob suggests that relatively short-term mean reversion strategies may be valuable to Quantopian. I'm wondering, though, if using daily OHLCV bars is optimal, or if some form of smoothed data might be better, given the short-term nature of the strategy. To this end, would it make sense for Quantopian to create a data set and offer it on https://www.quantopian.com/data (perhaps initially on an experimental basis) derived from your minutely OHLCV bar database, but compatible with the daily pre-market-open updates provided by pipeline?

As a crude specific example:

    # Pull N days of minutely prices; drop securities with any missing data
    prices = data.history(context.stocks, 'price', 390*context.N, '1m').dropna(axis=1)
    # Keep only the securities that survived the dropna
    context.stocks = list(prices.columns.values)
    # Exponentially weighted smoothing with a ~1-day (390-minute) center of mass
    prices = prices.ewm(com=390).mean().as_matrix(context.stocks)

Then, prices would be down-sampled to a single value every day.

The general concept here is that it should be a matter of data crunching and data-basing to offer various indicators derived from your minute bar database on https://www.quantopian.com/data that would be compatible with pipeline (e.g. daily values). Users might even be willing to pay a premium for the data (even though it would simply be derived from your free minute bar data). The advantage would be compatibility with pipeline, and a lessening of the computational load and complication of doing such indicator computations within the algo.

Hi Dan -

I've seen the request for Tensorflow numerous times on the forum, and it sounds like it is somewhere on the to-do list. It would be interesting to hear from your engineering team what it would take to get something like this implemented in a useful fashion on Quantopian. What chunks of work/changes would be necessary, just to get a feel for the scope/feasibility? And how would it fit with your existing ML efforts (perhaps an apples-to-oranges thing--I'm not so familiar with ML)?

Perhaps not feasible on the current Quantopian platform, but one could imagine allowing users to set a priority on computations. For example, a low-priority backtest would be allocated more computing resources but run at a time of lower load on the system, whereas a high-priority backtest would get the minimum set of computing resources but run on demand. This could be coupled with an e-mail (or other) notification system. But maybe this kind of model doesn't work out well with the cloud-based little sandboxed virtual thingys you are presumably using, and the mechanics and economics don't work out: you don't own a server farm in the basement whose utilization you have to manage, and in a world of globalized cloud computing with lots of demand, there's no problem keeping the systems busy making money.

Another thought is that you seem to be consuming precious RAM with things like storing transactions. In a desktop computing environment, it would be a matter of writing them to disk, but maybe they are needed by the algo? Or to display the running backtest stats? Or in your computing environment, it would be problematic to write to disk as a backtest runs?

I'm just curious here, mostly, and understand that it may not be a priority to discuss this stuff and/or there is too much overlap with proprietary details you don't want to share.

If Quantopian were to limit the number of parallel/concurrent backtests per user, it would go a long way toward solving the memory problem.

Pravin -

I'm not sure I understand, unless you are thinking that there is a memory pool, and that the amount of memory available per backtest is something like (total memory pool)/(number of backtests currently running). I've gotten the sense that the amount of memory per backtest is fixed, but perhaps not.

EDIT - I guess the idea would be that if Quantopian limited the running of backtests to N at a given time per user, then potentially money could be saved to bump up the memory per backtest. But my hunch is that most users aren't running parallel backtests.

Just interjecting a way to overcome the memory limitation a lot of people voiced. If you have a threshold something needs to hit over some large window, you could precompute it at home (I'm working on building an rpi farm for precomputation on loads of things) and then feed in a CSV with empty sets, or the same fixed low int, to drop memory usage. Python references small integers (and some other objects) instead of allocating new ones, so reusing one of those values instead of a new float should drop your memory usage HARD. If you're computing ON Quantopian, you could go back through your pandas frame, replace uninteresting values with low ints, and accomplish the same. A lot more legwork, but it's a memory optimization to consider.
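
For the on-platform version, something like this (a rough sketch; caveat that ordinary float64 columns store values rather than references, so the interning trick mainly helps object-dtype frames, and sparse frames or downcast dtypes are the surer savings):

    # Blank out "uninteresting" values below a threshold, then go sparse
    # so the blanks stop costing memory.
    def squash(df, threshold):
        return df.where(df.abs() >= threshold).to_sparse()

    # usage: small = squash(big_frame, threshold=0.25)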

EDIT - As Quantopian goes through growing pains, I suspect we'll need to see running more than X live algorithms (paper or otherwise) become a paid service, with running a single algo (or fewer than X) at a time still free, to farm talent. There's no reason anyone should want to live paper trade 15 algos at once, or live trade 15 algos at once on real money, and not be able to afford paying some fee per algo after the first X algos.

Reading the whole thread now, and doing one big response on how I would solve each individual issue, personally hoping to talk to people out of band.

If I'm overly verbose, sorry. Just providing the feedback I think the thread needs. @Dan specifically please skip to the end if you think some of this is a non-event.

@Eyal Netanel - problem - organizing code - wants - folders - possible solution - adding the codefolds extension to notebooks
@Luca - problem - memory usage - wants - presumably more memory - possible solution - replace data with a type that references instead of allocates or use null sets/empty sets so there is no memory consumption for a field
@Nicolas - problem - area to discuss - wants - same - possible solution - the forum here tries to be a discussion platform, but it's not chat. I'm with you on that one. The reddit flopped, but I post there sometimes anyway. Talking to non-quant pennies traders on discord is less than helpful, but 2 out of 30 are pretty sharp, so I'm playing the numbers. Hopefully we find a unified platform for that without people just screaming all day "give me your alpha". A valid problem I want to fix, but I don't have a real solution for...

@Grant - many:
not addressing memory at all since I sort of addressed it as did the Q people
problem - workflow - wants - more seamless integration of the algo and research pane - possible solution - they (Quantopian people) want it too. it's a hard problem. working towards a code pattern that solves this.
problem - ML in Quantopian - wants - more memory/power in quantopian, but it's not economically viable - possible solution - Zipline is super nice and free, but I personally don't think ML hosted on quantopian is the way to solve it. If you could have fetches only on some accounts (approval process) to feed an AI and get a response... it could be a solution. See: fetch CSV only available on init, some folks would/could build an AI at home, then give responses on a per diem and sidestep the memory limitations.
problem - browser based back tester and kneejerk feedback - wants - same - possible solution - would feedback for tapout and stop the test be enough? what if the current backtesting thing said "if we hit X% drawdown the test stops" but it could be parameterized to say "these are failure conditions. just stop."
oddball comment - backtest priorities shouldn't be necessary. that's making quantopian responsible for something you should handle yourself - possibly relevant to @Dan

@Brian Christopher - I agree. That came up recently, and the problems with solving it are non-trivial. I'm sorry. I don't see a solution yet.
@Luke Izlar @Stephanie P. - the order API is sort of broken. It's workable though. I don't see a reason to not be able to say .order_with_relative() and maintain current portfolio weights at lower speed. The only thing I can suspect is they're trying to make it faster in order to get execution and not error check at all. If you want some more in depth portfolio balancing code AND you wouldn't mind using it later - hit me via email. I started on it the other day...

@Dan - I really respect that stakeholders in Quantopian are so involved. I also like that you looked directly at Luca's points about long window memory problems. I know nothing about the Quantopian infrastructure, but is there a possibility for shared memory that stores common precomputes, even if it's only on the Q500 or Q1500? That being the "target set", you'ld think that common precomputes wouldn't count against memory usage. So, maybe a way to share that between users? Definitely a complex technical problem though. VERY complex. An example of something I spent HOURS writing myself at home was throwing the whole of the market at every default talib setting. If they're the defaults, people probably use them. So, could these me used as fixed/shared referential memory at least for the QXX00? Just a thought.