A Hacky Guide to Hive (part 2.2.1: blocks)
Context
In the previous post, I made a special transaction: I broadcast a custom_json transaction of the type YO.
This information will forever be stored in block 89040473 of Hive's blockchain.
To get to this information again, I could query a Hive node's:
- block_api.get_block, by blocknumber
- transaction_status_api.find_transaction, by transaction ID
If I don't know those 2 parameters, but want to find my move, I could use:
- account_history_api.get_account_history, by account name...
...you can access blockchain data in many different ways; use the above endpoints with Beem or lighthive...
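For example, block_api.get_block as a raw JSON-RPC call (a sketch using requests; no library needed):

import requests

# Fetch block 89040473 by number, straight from a public node
payload = {
    "jsonrpc": "2.0",
    "method": "block_api.get_block",
    "params": {"block_num": 89040473},
    "id": 1,
}
response = requests.post("https://api.hive.blog", json=payload)
print(response.json()["result"]["block"])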
I demonstrated how anyone can YO; now I want to show a method to get to all YOs.
It could be any custom_json, or a different event entirely; YO is just an example. It could be a move in a blockchain game, or you could go as far as trying to build your own little Hive Engine.
You might want to observe votes or comments as they come in, and store some, so you don't have to look them up again later, maybe for a notification system...
A Better Stream
In another post, I explained how the Hive blockchain is really just a very long list.
block_api
The block_api gives you access to all blocks.
You can access the block_api on all public nodes.
If you want to use your own node, having only the block_api should be one of the cheapest options.
stream()
Basically you could build most things around just looking at all blocks as they are written.
That will not include all information for everything (virtual operations and such), but a lot.
This might not be the best approach to build everything, but once you've got a stable block stream going, you can build good stuff around it...
Beem
Beem's stream() method still works and you could use it as is.
The main logic behind Beem's stream is hidden in the blocks() method. That part alone is 278 lines long and does a lot of things.
In the background, Beem can handle:
- node switching
- threading
- syncing
- private keys
... and more.
I could not build it better. I don't have to.
Procedure
The main procedure to get to a block is still just a query.
The speed and reliability of that query depend mostly on the source (the node), not on the Python code.
Python isn't particularly fast to begin with.
But all we need it to do during this procedure is:
- Query next block
- Filter the block for YO
- Store YO
That's a job done.
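As a sketch, the whole procedure fits in one function. get_block_range and get_yos follow below; store_yos is a hypothetical stand-in for the storage part:

def process_block(block_num, url):
    # One pass: query the block, filter it for YO, store the result
    for block in get_block_range(block_num, 1, url):
        store_yos(get_yos(block))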
At the moment, querying the latest block from api.hive.blog takes about 1 second.
Maximum block size is a witness parameter:
The value must not be more than 2MB (2097152).
...so there are 2 seconds left to handle 2MB at most (current max: 65536 bytes).
To just filter and store a block takes only milliseconds, even in Python...
Which means this thing can idle for almost 2 seconds and then repeat the procedure.
Beem actually does that too 😅:
# Sleep for one block
time.sleep(self.block_interval)
Storage
It doesn't really matter how I build the stream; without storage, I'll lose all progress when the stream ends or crashes.
I'll use SQL. I could use Redis, or Mongo...
There are many different storage solutions and I could never build anything better.
This stuff handles sessions and serialization. It comes with built-in backup solutions.
It's fast. It's scalable: I'll use SQLite, but you could plug in a giant cluster of whatever.
I am trying to move the responsibility of storage handling where it belongs: the database level.
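Just to make that concrete, a minimal sketch with Python's built-in sqlite3; the schema here is an assumption, the real tables are the subject of the next post:

import sqlite3

# Hypothetical minimal table: one row per YO
con = sqlite3.connect('yo.db')
con.execute('CREATE TABLE IF NOT EXISTS yos (block_num INTEGER, op_json TEXT)')
con.commit()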
threading and node switching
Beem can switch through nodes from a list and even manage worker threads.
But why manage that inside Python in the first place?
I will just build one single procedure and run it as a background service.
If I need another thread, I can just run another instance of the same procedure.
I could run one thread for every node, or even use separate machines.
Anyhow, the procedure does not need to know which thread it's in.
As long as I funnel the data to the same database in the end, all synchronization and serialization and whatnot is taken care of automatically.
I am trying to move the responsibility of concurrency where it belongs: the operating system and database layers.
Live Stream
block_api.get_block_range
import requests

def get_block_range(start, count, url):
    # Raw JSON-RPC call: fetch `count` blocks, starting at block number `start`
    params = {"starting_block_num": start, "count": count}
    data = {"jsonrpc": "2.0", "method": "block_api.get_block_range", "params": params, "id": 1}
    response = requests.post(url=url, json=data)
    return response.json()['result']['blocks']
The only function you really need.
I am not even joking.
- Usage:
url = 'https://api.hive.blog'
for block in get_block_range(89040473, 1, url):
    print(block)
Loop
For a stream, you only need to loop this: pick a start block, then increment.
Repeat every 3 seconds and it's basically Beem's stream(), without all the fluff.
But that's an infinite loop.
For the final service, that's what I'd want; for a code snippet, I feel like avoiding it.
In the early days, nodes accepted websockets. I don't know why that got turned off. Maybe it was too expensive. Maybe you can still do something like that on your own node.
Anyways, if you test this on the public nodes, you are stuck with this 3-second query loop. It seems crude, but it seems that's how it's done.
The documentation recommends Beem's stream.
time.sleep(self.get_approx_sleep_until_block(throttle, config, status['time']))
So yeah... I also wait 3 seconds.
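Put together, the infinite version I'd run as the final service could look like this; a sketch built on get_block_range from above, assuming a request past the head block returns an empty list:

import time

def stream_blocks(start, url):
    # Poll one block at a time; at the head, wait one block interval
    block_num = start
    while True:
        blocks = get_block_range(block_num, 1, url)
        for block in blocks:
            yield block
            block_num += 1
        if not blocks:
            time.sleep(3)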
Interrupt
Best case would be: I start the loop once and it runs infinitely (fire & forget).
In reality I have to prepare for what happens should it stop.
Maybe I need to resync the whole service...
The above is all it takes to rebuild Beem's stream or any other.
Wrap some try/excepts around it and it can't really break down.
But for something useful, storage is necessary.
So that I at least know where the last stream stopped, and where to begin...
For YO, I could ignore all 89040473 blocks before the first YO.
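A sketch of that wrapper, using stream_blocks from above; last_block() and store_yos() are hypothetical stand-ins for the storage layer, get_yos follows below:

def run_forever(url):
    while True:
        try:
            # Resume from the block after the last one stored
            for block in stream_blocks(last_block() + 1, url):
                store_yos(get_yos(block))
        except Exception:
            time.sleep(3)  # node hiccup or bad response: wait, then resume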
Traffic
That 3-second query thing may seem like a lot of traffic.
But if it's planned well, and stored well, it only has to be done once for any block.
From that point on, it can feed a whole network of other things, which don't have to make any queries outside of my own database.
Again: For things like posts and author balance, the standard apis can be enough.
Also: Posts, votes, account balance, can change, blocks can't.
Sending one request every 3 seconds, receiving 64KB max data...
I don't know how annoying this is for node providers.
I guess it's ok...
Syncing might be different. In the docs, there's a get_block_range example with count=1000.
The response could be 64MB. But that could also sync 50 minutes of blocks in a single call...
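Batched syncing could then look something like this (a sketch; get_yos follows in the next section, store_yos is still hypothetical):

def sync(start, head, url):
    # Fetch blocks in batches of 1000 until the head block is reached
    for batch_start in range(start, head, 1000):
        count = min(1000, head - batch_start)
        for block in get_block_range(batch_start, count, url):
            store_yos(get_yos(block))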
Filter YO
def get_yos(block):
    # Collect all custom_json operations with id 'YO' from one block
    yos = []
    for transaction in block['transactions']:
        for operation in transaction['operations']:
            if operation['type'] == 'custom_json_operation':
                if operation['value']['id'] == 'YO':
                    yos.append(operation)
    return yos
Returns all YOs in a block, but loses the information about which transaction and which block each YO was in.
I'll try to avoid data manipulation in this part of the service.
This part is the stream and shouldn't be involved in anything else.
However, I do want to store the block num, which already got lost along the way.
I also want block id and previous. This just demonstrates how to filter data.
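The block num itself isn't a field of the block object, but it doesn't need extra bookkeeping: a Hive block_id begins with the block number, encoded in its first 4 bytes.

def block_num_from_id(block_id):
    # The first 8 hex characters (4 bytes) of a block_id are the block number
    return int(block_id[:8], 16)

So block_num_from_id(block['block_id']) recovers the number, and previous is just another field of the block.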
It's best to start by building the tables first, though.
Conclusion
It might not look like much, but the part that needs to connect to a Hive node is done.
This is the absolute minimum necessary and can only fail at very few points so far.
Most possible problems can be caught outside of this core logic.
All that's missing is persistent storage, which I will cover in the next post.
Anyways, threading, concurrency, data manipulation, whatever... everything else can and should happen later, upstream.
What I keep trying to point out: All extra logic should be avoided.
I am looking at a Hive query as a single step - a procedure. It should be a single function.
Next post, storage will be wrapped in as few procedures as possible, and that will conclude in a YO crawler/watcher that feeds a db you could plug anything into. It will probably be short and include only minimal logic. That's a feature.
Naming
I think the hardest question in programming is naming.
'YO crawler' isn't good. I should give this thing a name before it's finished.
custom_jacksn, or custom_YOson maybe? Or YOmind...