Python pandas
Python and pandas overview
This is more of an investigation into how to use pandas.
I need to start somewhere. :)
First of all, the most essential question: what is pandas?
- Pandas is a Python library used for working with data sets.
In essence, you can read, analyze, manipulate, clean and explore data sets. So now we have full control, right? Well, maybe not, but at least we know what to expect.
Data source
First things first, we need something to work with: some data. So let's dive into open-meteo.com, which seems to be quite a cool tool (API) for getting weather data.
OK, I'm going to use Lat: 57.70761164715305 Long: 11.952310770887784, which is in Gothenburg, Sweden.
Here is a link to available data, there is a huge number of different variables. I won't go through them all, but some data that is understandable would be nice.
There is a curl command on the website that can be used to get some data. Let's bring in some data so we can inspect it:
    curl "https://api.open-meteo.com/v1/forecast?latitude=57.707&longitude=11.95&hourly=temperature_2m,apparent_temperature,rain,showers,wind_speed_10m,wind_direction_180m&wind_speed_unit=ms&forecast_days=1" | jq
    {
      "latitude": 57.70479,
      "longitude": 11.956985,
      "generationtime_ms": 0.10097026824951172,
      "utc_offset_seconds": 0,
      "timezone": "GMT",
      "timezone_abbreviation": "GMT",
      "elevation": 2.0,
      "hourly_units": {
        "time": "iso8601",
        "temperature_2m": "°C",
        "apparent_temperature": "°C",
        "rain": "mm",
        "showers": "mm",
        "wind_speed_10m": "m/s",
        "wind_direction_180m": "°"
      },
      "hourly": {
        "time": [ "2024-10-05T00:00", .... "2024-10-05T23:00" ],
        "temperature_2m": [ 12.0, ... 11.4 ],
        "apparent_temperature": [ 9.1, ... 9.7 ],
        "rain": [ 0.00, ... 0.00 ],
        "showers": [ 0.00, ... 0.00 ],
        "wind_speed_10m": [ 4.10, ... 2.60 ],
        "wind_direction_180m": [ 245, ... 205 ]
      }
    }
Nice! We have some data for a weather forecast for the coming 24 hours, but maybe we want to fetch it with Python to start with, and then use pandas to extract what we want.
It seems that openmeteo already has a Python package, but that's somewhat cheating (actually not, but for this exploration it's better to use the requests module for Python to make it clearer what happens).
Once we have retrieved the data, we can construct pandas Series and DataFrames from the values, but let's not get ahead of ourselves. Let's first explore pandas a bit.
Basic Structure
Let's define the basic structure of pandas. There are two types of data structure involved: the Series and the DataFrame. The main difference is that a Series is a one-dimensional labeled array and a DataFrame is two-dimensional labeled data. To be fair, a DataFrame consists of columns that are each a Series, but let's dig into each of them in the following sections.
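To make the difference concrete before going into detail, here is a minimal sketch. The values are hypothetical and chosen only for illustration; the point is only that a Series has one dimension, a DataFrame has two, and picking a single column out of a DataFrame gives a Series back.

```python
import pandas

# Hypothetical sample values, chosen only for illustration.
temperatures = pandas.Series([12.0, 11.4, 10.9], index=["00:00", "01:00", "02:00"])

# A DataFrame built from two equally long columns that share one index.
frame = pandas.DataFrame(
    {"temperature": [12.0, 11.4, 10.9], "rain": [0.0, 0.2, 0.0]},
    index=["00:00", "01:00", "02:00"],
)

print(temperatures.ndim)  # number of dimensions of the Series
print(frame.ndim)         # number of dimensions of the DataFrame

# Selecting one column of the DataFrame gives back a Series.
column = frame["temperature"]
print(type(column).__name__)
```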
Series
One can see a Series as an array, or maybe a column of a database table (with an index, that is a row name). All elements of the array share one data type, for example int, float, string or object. A Series is a fundamental data structure in pandas. How is a Series constructed? A Series can be made from lists, dictionaries, iterables or scalar values.
Let's show some code. First we create a table with just one column from a list of numbers.
    import pandas

    # Series from a list
    list_of_vals = [1, 200, 300, 400]
    series = pandas.Series(list_of_vals)

    # Series with an index
    series_vals = [500, 600, 700, 800]
    series_index = ['five', 'six', 'seven', 'eight']
    series_with_label = pandas.Series(data=series_vals, index=series_index)
    print(f"{series}\n----------\n{series_with_label}")
    0      1
    1    200
    2    300
    3    400
    dtype: int64
    ----------
    five     500
    six      600
    seven    700
    eight    800
    dtype: int64
The output above shows that each row is either named by the index or defaults to a running number 0, 1, 2, 3… Obviously, this also matches dictionaries, where the key becomes the row name.
    import pandas

    dicker = {'a': 2.1, 'b': 3.2, 'c': 7.7}
    serie = pandas.Series(dicker)
    print(serie)
    a    2.1
    b    3.2
    c    7.7
    dtype: float64
Another thing to note is that if the data is a dictionary, its order is maintained.
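Whether the index is given or defaulted, a value can be read either by its label or by its position. The sketch below reuses the labeled Series from above and shows both, plus the order-preserving behavior of a dictionary; the unsorted dictionary is a made-up example.

```python
import pandas

series_with_label = pandas.Series(
    data=[500, 600, 700, 800],
    index=["five", "six", "seven", "eight"],
)

by_label = series_with_label.loc["six"]   # look up by index label
by_position = series_with_label.iloc[1]   # look up by integer position
print(by_label, by_position)

# A Series built from a dictionary keeps the insertion order of the keys.
from_dict = pandas.Series({"c": 7.7, "a": 2.1, "b": 3.2})
print(list(from_dict.index))
```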
I guess we could also use a generator function. A generator function is in essence an iterator, which yields a value each time it is advanced. It keeps its state between calls (that is, num is increased each time the next value is requested).
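    from collections.abc import Iterator

    import pandas


    def calc_num_times(n: int) -> Iterator[int]:
        # Yields 0, 2, 4, ... for n values; num survives between yields.
        num = 0
        while num < n:
            yield num * 2
            num += 1


    series = pandas.Series(calc_num_times(8))
    print(series)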
    0     0
    1     2
    2     4
    3     6
    4     8
    5    10
    6    12
    7    14
    dtype: int64
We could even use a Series to store objects. But now we need to be careful, since a pandas Series stores references. That means that any change to an object in the original list will be reflected in the pandas Series.
The following code explains the problem.
     1: import pandas
     2:
     3:
     4: class Person:
     5:     def __init__(self, name, age, weight):
     6:         self.name = name
     7:         self.age = age
     8:         self.weight = weight
     9:
    10:     def __str__(self):
    11:         return f"Person(name={self.name}, age={self.age}, weight={self.weight})"
    12:
    13:     def __eq__(self, other):
    14:         return (self.name, self.age, self.weight) == (other.name, other.age, other.weight)
    15:
    16:
    17:
    18:
    19: persons = [
    20:     Person("Alice", 30, 60),
    21:     Person("Bob", 25, 70),
    22:     Person("Charlie", 35, 80),
    23:     Person("David", 40, 75),
    24:     Person("Eve", 28, 65)
    25: ]
    26:
    27: index = [person.name for person in persons]
    28: series = pandas.Series(data=persons, index=index)
    29: print(f"{series}\n----------")
    30:
    31: series['Bob'].age=4 #Change age of Bob!
    32:
    33: print(f"{series}\n----------")
    34: [print(f"{person}") for person in persons]
    Alice      Person(name=Alice, age=30, weight=60)
    Bob        Person(name=Bob, age=25, weight=70)
    Charlie    Person(name=Charlie, age=35, weight=80)
    David      Person(name=David, age=40, weight=75)
    Eve        Person(name=Eve, age=28, weight=65)
    dtype: object
    ----------
    Alice      Person(name=Alice, age=30, weight=60)
    Bob        Person(name=Bob, age=4, weight=70)
    Charlie    Person(name=Charlie, age=35, weight=80)
    David      Person(name=David, age=40, weight=75)
    Eve        Person(name=Eve, age=28, weight=65)
    dtype: object
    ----------
    Person(name=Alice, age=30, weight=60)
    Person(name=Bob, age=4, weight=70)
    Person(name=Charlie, age=35, weight=80)
    Person(name=David, age=40, weight=75)
    Person(name=Eve, age=28, weight=65)
- Line 29 prints out the Series, where Bob.age=25
- Line 31 changes Bob in the Series, now Bob.age=4
- Line 33 prints out the Series (Bob's age is now 4, as expected)
- Line 34 prints out the initial list of persons. Bob has been changed here too, which means any change made through the Series is reflected in the persons list too
What I wanted to highlight is the fact that when I change the age of Bob in the Series, it also changes the Bob in the persons list. This is because pandas.Series only stores references.
One way to deal with this is to copy all the values from persons into the Series; basically copying each of the items into a new list before we create the Series.
     1: import pandas
     2: import copy
     3:
     4: class Person:
     5:     def __init__(self, name, age, weight):
     6:         self.name = name
     7:         self.age = age
     8:         self.weight = weight
     9:
    10:     def __str__(self):
    11:         return f"Person(name={self.name}, age={self.age}, weight={self.weight})"
    12:
    13:     def __eq__(self, other):
    14:         return (self.name, self.age, self.weight) == (other.name, other.age, other.weight)
    15:
    16:
    17: persons = [
    18:     Person("Alice", 30, 60),
    19:     Person("Bob", 25, 70),
    20:     Person("Charlie", 35, 80),
    21:     Person("David", 40, 75),
    22:     Person("Eve", 28, 65)
    23: ]
    24:
    25: index = [person.name for person in persons]
    26: series = pandas.Series(data=[copy.copy(person) for person in persons],
    27:                        index=index, copy=True)
    28: print(f"{series}\n----------")
    29:
    30: series['Bob'].age=4 #Change age of Bob in series
    31:
    32: print(f"{series}\n----------")
    33: [print(f"{person}") for person in persons]
    34:
    Alice      Person(name=Alice, age=30, weight=60)
    Bob        Person(name=Bob, age=25, weight=70)
    Charlie    Person(name=Charlie, age=35, weight=80)
    David      Person(name=David, age=40, weight=75)
    Eve        Person(name=Eve, age=28, weight=65)
    dtype: object
    ----------
    Alice      Person(name=Alice, age=30, weight=60)
    Bob        Person(name=Bob, age=4, weight=70)
    Charlie    Person(name=Charlie, age=35, weight=80)
    David      Person(name=David, age=40, weight=75)
    Eve        Person(name=Eve, age=28, weight=65)
    dtype: object
    ----------
    Person(name=Alice, age=30, weight=60)
    Person(name=Bob, age=25, weight=70)
    Person(name=Charlie, age=35, weight=80)
    Person(name=David, age=40, weight=75)
    Person(name=Eve, age=28, weight=65)
- Line 26 copies each item in the persons list and creates a new list using a list comprehension; this list is then used in the Series constructor.
In this case we made a copy of each object using a shallow copy. This worked great because the attributes (name is a str, age and weight are int) are immutable. But if we add another attribute, called tags, then we have the same problem at another level.
     1: import pandas
     2: import copy
     3: from pprint import pprint
     4:
     5: class Person:
     6:     def __init__(self, name, age, tags):
     7:         self.name = name
     8:         self.age = age
     9:         self.tags = tags
    10:
    11:     def __str__(self):
    12:         return f"Person(name={self.name}, age={self.age}, tags={', '.join(self.tags)})"
    13:
    14:     def __eq__(self, other):
    15:         return (self.name, self.age, self.tags) == (other.name, other.age, other.tags)
    16:
    17:
    18: Alice_tags = ["friendly", "creative"]
    19: # generate a list of 5 persons with different names and tags , a tag is a list of attribute for a person
    20: persons = [
    21:     Person("Alice", 30, Alice_tags),
    22:     Person("Bob", 25, ["analytical", "quiet"]),
    23:     Person("Charlie", 35, ["outgoing", "adventurous"]),
    24:     Person("Diana", 28, ["organized", "detail-oriented"]),
    25:     Person("Eve", 32, ["curious", "independent"])
    26: ]
    27:
    28: #Make a shallow copy
    29: series_copy = pandas.Series(data=[copy.copy(person) for person in persons],
    30:                             index=[person.name for person in persons], copy=True)
    31:
    32: Alice_tags.append('fishy')
    33:
    34: series_copy['Alice'].age = 12
    35: pprint([f"{person.name}: {person.tags}, {person.age}" for person in series_copy])
    36: pprint("----------")
    37: pprint([f"{person.name}: {person.tags}, {person.age}" for person in persons])
    ["Alice: ['friendly', 'creative', 'fishy'], 12",
     "Bob: ['analytical', 'quiet'], 25",
     "Charlie: ['outgoing', 'adventurous'], 35",
     "Diana: ['organized', 'detail-oriented'], 28",
     "Eve: ['curious', 'independent'], 32"]
    '----------'
    ["Alice: ['friendly', 'creative', 'fishy'], 30",
     "Bob: ['analytical', 'quiet'], 25",
     "Charlie: ['outgoing', 'adventurous'], 35",
     "Diana: ['organized', 'detail-oriented'], 28",
     "Eve: ['curious', 'independent'], 32"]
This example has some highlights I wanted to show.
- Line 18: Creates a list of tags, which is used to create the Alice Person object. We know that this is a reference.
- Line 21: Here we create the Person object Alice and provide the tag list that we created on line 18.
- Line 29: At this point we make a shallow copy of all the items in the persons list, which means we should be safe changing data in the Series, and any changes to Alice_tags should only be reflected in the Alice in the persons list. Or?
- Line 32: Now we add 'fishy' to Alice_tags; our intention is that this should only be reflected in the persons list.
- Line 34: We also change the age of Alice in the Series to 12. Again, since the persons list was copied into the Series, the Alice in persons should be unaffected.
But if we look at the output, we spot a small but significant error in our hypothesis. The age change to 12 is in fact only visible in the Series. But any change to the Alice_tags list is also reflected in the Series! This is not what we wanted. What happened? A shallow copy creates a new Person object, but its attributes still refer to the same objects as the original. Assigning a new age rebinds the attribute on the copy only, so the original is untouched. The tags list, however, is mutable and shared: append changes the list in place, and therefore the change shows up in all three structures (Alice_tags, persons, Series). This could be a good idea in some circumstances, but it could also be devastating in others. The idea for this test was to have the Series and persons completely separated: a change to one should not be reflected in the other. So how do we deal with this situation? Of course, if there is a shallow copy there has to be a deep copy.
A deep copy looks at every structure and tries to copy it. In the case of the persons list, age and name are immutable, so nothing can go wrong with them. When it comes to the tags list, it creates a new list and copies each of the items in it. If the tags list contained other mutable objects, it would recursively go down the path until everything is copied. Let's look at an example:
    import pandas
    import copy


    class Person:
        def __init__(self, name, age, tags):
            self.name = name
            self.age = age
            self.tags = tags

        def __str__(self):
            return f"Person(name={self.name}, age={self.age}, tags={', '.join(self.tags)})"

        def __eq__(self, other):
            return (self.name, self.age, self.tags) == (other.name, other.age, other.tags)


    Alice_tags = ["friendly", "creative"]
    # generate a list of 5 persons with different names and tags , a tag is a list of attribute for a person
    persons = [
        Person("Alice", 30, Alice_tags),
        Person("Bob", 25, ["analytical", "quiet"]),
        Person("Charlie", 35, ["outgoing", "adventurous"]),
        Person("Diana", 28, ["organized", "detail-oriented"]),
        Person("Eve", 32, ["curious", "independent"])
    ]

    index = [person.name for person in persons]
    series = pandas.Series(data=copy.deepcopy(persons), index=index)
    print(f"{series}\n----------")

    series['Alice'].tags = ['angry']
    Alice_tags.append("sad")
    print(f"{series}\n----------")
    [print(f"{person}") for person in persons]
    Alice      Person(name=Alice, age=30, tags=friendly, crea...
    Bob        Person(name=Bob, age=25, tags=analytical, quiet)
    Charlie    Person(name=Charlie, age=35, tags=outgoing, ad...
    Diana      Person(name=Diana, age=28, tags=organized, det...
    Eve        Person(name=Eve, age=32, tags=curious, indepen...
    dtype: object
    ----------
    Alice      Person(name=Alice, age=30, tags=angry)
    Bob        Person(name=Bob, age=25, tags=analytical, quiet)
    Charlie    Person(name=Charlie, age=35, tags=outgoing, ad...
    Diana      Person(name=Diana, age=28, tags=organized, det...
    Eve        Person(name=Eve, age=32, tags=curious, indepen...
    dtype: object
    ----------
    Person(name=Alice, age=30, tags=friendly, creative, sad)
    Person(name=Bob, age=25, tags=analytical, quiet)
    Person(name=Charlie, age=35, tags=outgoing, adventurous)
    Person(name=Diana, age=28, tags=organized, detail-oriented)
    Person(name=Eve, age=32, tags=curious, independent)
In this example we made a deep copy of the persons list. A deep copy means that all underlying structures are copied too, even if they are mutable (for example a list). So instead of copying each item and then creating a new list of the copied objects, deepcopy does the same thing and also makes sure that the tags in each Person object get copied.
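The difference between the two kinds of copy can be checked directly with identity comparisons. This is a minimal sketch without pandas, using a stripped-down Person class (an assumption made for brevity, it only has name and tags):

```python
import copy


class Person:
    def __init__(self, name, tags):
        self.name = name
        self.tags = tags


original = Person("Alice", ["friendly", "creative"])
shallow = copy.copy(original)
deep = copy.deepcopy(original)

# The shallow copy is a new Person, but it shares the very same tags list.
print(shallow is original, shallow.tags is original.tags)

# The deep copy got its own tags list as well.
print(deep is original, deep.tags is original.tags)

original.tags.append("fishy")
print(shallow.tags)  # follows the original, since the list is shared
print(deep.tags)     # unaffected by the append
```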
Using Series
Series are extremely useful and have many related functions and methods. I won't go through them all, but let's check out one.
    import pandas


    def transform_to_str(element: int) -> str:
        return f"String Element: {element*2}"


    series = pandas.Series(data=[1, 2, 3, 4, 5],
                           index=["one", "two", "three", "four", "five"])
    print(f"{series}")
    transformed = series.transform(transform_to_str)
    print(f"{transformed}")
    one      1
    two      2
    three    3
    four     4
    five     5
    dtype: int64
    one       String Element: 2
    two       String Element: 4
    three     String Element: 6
    four      String Element: 8
    five     String Element: 10
    dtype: object
This produced a new Series with the values transformed to strings.
Another useful function is reduce. It is not actually part of pandas but lives in functools, and it works on any iterable, including a Series.
Let's see how this works.
    import functools

    import pandas


    def agg(init_val: dict, ele: int) -> dict:
        init_val[f"element_{ele}"] = ele * 3
        return init_val


    series = pandas.Series(data=range(1, 6),
                           index=["one", "two", "three", "four", "five"])
    print(f"{series}")
    # Creating a dictionary from the series using reduce.
    result = functools.reduce(agg, series, {})
    print(f"{result}")
    one      1
    two      2
    three    3
    four     4
    five     5
    dtype: int64
    {'element_1': 3, 'element_2': 6, 'element_3': 9, 'element_4': 12, 'element_5': 15}
So let's leave it at this for the moment. But I strongly recommend checking out functools and reading Grokking Simplicity. But that's another story for another day.
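For comparison, the same dictionary can be built without reduce. This is only an alternative sketch, not a replacement for the functools approach: iterating over a Series yields its values, and Series.items() yields (label, value) pairs when the index labels are needed.

```python
import pandas

series = pandas.Series(data=range(1, 6),
                       index=["one", "two", "three", "four", "five"])

# Same result as the reduce example, written as a dict comprehension.
tripled = {f"element_{value}": value * 3 for value in series}
print(tripled)

# Series.items() gives (label, value) pairs, so the index can be the key.
by_label = {label: value * 3 for label, value in series.items()}
print(by_label)
```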
Dataframe
Let's move over to DataFrame (DF). A DF is a two-dimensional, size-mutable, heterogeneous tabular data structure.
What does this mean? First, it's two-dimensional. That kind of makes sense.
Let's make a table.
A | B | C | D | E |
---|---|---|---|---|
a1 | b1 | c1 | d1 | e1 |
a2 | b2 | c2 | d2 | e2 |
a3 | b3 | c3 | d3 | e3 |
a4 | b4 | c4 | d4 | e4 |
Each row is an inner list in the outer list (and each column will become a Series, if you want pandas). For the table above we would get something like
    [
      ["a1", "b1", "c1", "d1", "e1"],
      ["a2", "b2", "c2", "d2", "e2"],
      ["a3", "b3", "c3", "d3", "e3"],
      ["a4", "b4", "c4", "d4", "e4"]
    ]
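    import pandas
    from tabulate import tabulate

    # headers and table hold the example table shown above
    headers = ["A", "B", "C", "D", "E"]
    table = [[f"{col.lower()}{row}" for col in headers] for row in range(1, 5)]

    df = pandas.DataFrame(data=table, columns=headers,
                          index=['row1', 'row2', 'row3', 'row4'])
    org_table = tabulate(df, headers='keys', tablefmt='orgtbl', showindex=True)
    print(org_table)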
 | A | B | C | D | E |
---|---|---|---|---|---|
row1 | a1 | b1 | c1 | d1 | e1 |
row2 | a2 | b2 | c2 | d2 | e2 |
row3 | a3 | b3 | c3 | d3 | e3 |
row4 | a4 | b4 | c4 | d4 | e4 |
To summarize:
- data: the table of items
- columns: the column headers; there has to be one for each column in data
- index: the row index (by default it's 0, 1, 2, 3…); there has to be one for each row in data
Each of the columns becomes a Series in the DataFrame.
DataFrame to dictionary (and other)
DataFrames and Series are useful, but to be able to use them in different contexts it's necessary to be able to convert them to other types and containers. Why? There are several reasons why you might want to transform to something else. Maybe you want to use some algorithm that doesn't know anything other than the standard containers, which is sensible enough. How do we convert to a standard container?
Let's say we have our DataFrame as before, and want to convert it to a dictionary. The question becomes: how should this dictionary be laid out?
Default
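    from pprint import pprint

    import pandas

    # same example table as before
    headers = ["A", "B", "C", "D", "E"]
    table = [[f"{col.lower()}{row}" for col in headers] for row in range(1, 5)]

    df = pandas.DataFrame(data=table, columns=headers,
                          index=['row1', 'row2', 'row3', 'row4'])
    pprint(df.to_dict())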
    {'A': {'row1': 'a1', 'row2': 'a2', 'row3': 'a3', 'row4': 'a4'},
     'B': {'row1': 'b1', 'row2': 'b2', 'row3': 'b3', 'row4': 'b4'},
     'C': {'row1': 'c1', 'row2': 'c2', 'row3': 'c3', 'row4': 'c4'},
     'D': {'row1': 'd1', 'row2': 'd2', 'row3': 'd3', 'row4': 'd4'},
     'E': {'row1': 'e1', 'row2': 'e2', 'row3': 'e3', 'row4': 'e4'}}
This pretty much resembles the table we looked at before. There is a dictionary entry for each column name, which holds another dictionary for the rows. Example
\(df['C']['row3'] \rightarrow c3\)
But that's not always what you want.
By adding a string argument to to_dict(<arg>) we can change how the DataFrame is represented in the output dictionary.
The following sections show the different options.
Series
Let's try series instead.
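    from pprint import pprint

    import pandas

    # same example table as before
    headers = ["A", "B", "C", "D", "E"]
    table = [[f"{col.lower()}{row}" for col in headers] for row in range(1, 5)]

    df = pandas.DataFrame(data=table, columns=headers,
                          index=['row1', 'row2', 'row3', 'row4'])
    series = df.to_dict('series')
    pprint(series)
    pprint(f"A={series['A'].to_dict()}")
    pprint(df['C']['row3'])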
    {'A': row1    a1
    row2    a2
    row3    a3
    row4    a4
    Name: A, dtype: object,
     'B': row1    b1
    row2    b2
    row3    b3
    row4    b4
    Name: B, dtype: object,
     'C': row1    c1
    row2    c2
    row3    c3
    row4    c4
    Name: C, dtype: object,
     'D': row1    d1
    row2    d2
    row3    d3
    row4    d4
    Name: D, dtype: object,
     'E': row1    e1
    row2    e2
    row3    e3
    row4    e4
    Name: E, dtype: object}
    "A={'row1': 'a1', 'row2': 'a2', 'row3': 'a3', 'row4': 'a4'}"
    'c3'
This is almost the same as the default: each column name maps to the rows of that column, but here the value is a Series rather than a dictionary.
Split
split means the DataFrame is split up into three different keys in the dictionary.
- columns: the column names.
- data: a two-dimensional array where each inner array represents a row.
- index: an array with the row names.
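    from pprint import pprint

    import pandas

    # same example table as before
    headers = ["A", "B", "C", "D", "E"]
    table = [[f"{col.lower()}{row}" for col in headers] for row in range(1, 5)]

    df = pandas.DataFrame(data=table, columns=headers,
                          index=['row1', 'row2', 'row3', 'row4'])
    series = df.to_dict('split')
    pprint(series)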
    {'columns': ['A', 'B', 'C', 'D', 'E'],
     'data': [['a1', 'b1', 'c1', 'd1', 'e1'],
              ['a2', 'b2', 'c2', 'd2', 'e2'],
              ['a3', 'b3', 'c3', 'd3', 'e3'],
              ['a4', 'b4', 'c4', 'd4', 'e4']],
     'index': ['row1', 'row2', 'row3', 'row4']}
The split orientation probably got its name from the fact that column names, index names and data are split up under different keys. Example (third row, third column, counting from 0)
\(series['data'][2][2]\rightarrow c3\)
index
    {'row1': {'A': 'a1', 'B': 'b1', 'C': 'c1', 'D': 'd1', 'E': 'e1'},
     'row2': {'A': 'a2', 'B': 'b2', 'C': 'c2', 'D': 'd2', 'E': 'e2'},
     'row3': {'A': 'a3', 'B': 'b3', 'C': 'c3', 'D': 'd3', 'E': 'e3'},
     'row4': {'A': 'a4', 'B': 'b4', 'C': 'c4', 'D': 'd4', 'E': 'e4'}}
This resembles the default output we saw before; the difference is that the outer dictionary is keyed by row instead of by column. The first key is the row name and the second is the column name. So for example \(series['row2']['D'] \rightarrow "d2"\) (row "row2" col "D")
tight
    {'column_names': [None],
     'columns': ['A', 'B', 'C', 'D', 'E'],
     'data': [['a1', 'b1', 'c1', 'd1', 'e1'],
              ['a2', 'b2', 'c2', 'd2', 'e2'],
              ['a3', 'b3', 'c3', 'd3', 'e3'],
              ['a4', 'b4', 'c4', 'd4', 'e4']],
     'index': ['row1', 'row2', 'row3', 'row4'],
     'index_names': [None]}
This closely resembles the split version: data holds each of the rows, columns is an array of column names, and index is an array of row names. In addition there are the index_names and column_names keys.
example: \(series['data'][1][2] \rightarrow c2\) (second row, third column, counting from 0)
records
    [{'A': 'a1', 'B': 'b1', 'C': 'c1', 'D': 'd1', 'E': 'e1'},
     {'A': 'a2', 'B': 'b2', 'C': 'c2', 'D': 'd2', 'E': 'e2'},
     {'A': 'a3', 'B': 'b3', 'C': 'c3', 'D': 'd3', 'E': 'e3'},
     {'A': 'a4', 'B': 'b4', 'C': 'c4', 'D': 'd4', 'E': 'e4'}]
Records gives you a list of dictionaries, where each row is represented by its position in the list and each column by the column name. For example: \(series[1]['D'] \rightarrow d2\) (second row, counting from 0, col 'D')
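The lookups for the different orientations can be put side by side. This is a small sketch using the same example table; it leaves out tight, which has the same data layout as split.

```python
import pandas

headers = ["A", "B", "C", "D", "E"]
table = [[f"{col.lower()}{row}" for col in headers] for row in range(1, 5)]
df = pandas.DataFrame(data=table, columns=headers,
                      index=["row1", "row2", "row3", "row4"])

by_column = df.to_dict()          # default: column -> row -> value
by_row = df.to_dict("index")      # row -> column -> value
records = df.to_dict("records")   # list with one dictionary per row
split = df.to_dict("split")       # separate 'index', 'columns', 'data' keys

print(by_column["C"]["row3"])
print(by_row["row2"]["D"])
print(records[1]["D"])        # second row, positions start at 0
print(split["data"][2][2])    # third row, third column
```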
Series vs DataFrame
The key difference is of course that a DataFrame works with two-dimensional data, while a Series works in one dimension, and for that reason a DataFrame also has a columns attribute that names the columns.
Another difference is that a Series is a homogeneous container, meaning all its elements need to be of the same data type, while a DataFrame is designed to be a heterogeneous container, allowing different data types in different columns. Each column in a DataFrame is essentially a Series, and each Series can hold a different data type.
Let's make an example
Name | Age | Score | Hair | Length |
---|---|---|---|---|
Calle | 51 | 100 | Blond | 1.64 |
Lars | 47 | 81 | Dark | 1.81 |
Trump | 78 | 25 | Yellow | 1.90 |
Harris | 56 | 67 | Dark | 1.67 |
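To see the heterogeneous columns in practice, the table above can be loaded into a DataFrame and its data types inspected. A minimal sketch; the exact dtype names printed depend on the pandas version, so none are stated here.

```python
import pandas

headers = ["Name", "Age", "Score", "Hair", "Length"]
table = [
    ["Calle", 51, 100, "Blond", 1.64],
    ["Lars", 47, 81, "Dark", 1.81],
    ["Trump", 78, 25, "Yellow", 1.90],
    ["Harris", 56, 67, "Dark", 1.67],
]
df = pandas.DataFrame(data=table, columns=headers)

# One dtype per column: text, integers and floats live side by side.
print(df.dtypes)

# Each column on its own is a Series with a single dtype.
age = df["Age"]
print(type(age).__name__, age.dtype)
```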
There are different ways we can filter and sort out fields. The following example shows two ways of doing it.
The example (Line 6) shows how to filter on a specific column with a specific value: in this case, score > 50 gives a boolean Series, which is then used on Line 8 to create a new DataFrame with the matching rows.
The next example (Line 10) shows how to create a mask by hand; rows where the mask is False are filtered out. In this case we created the list [True, False, False, True]:
Mask | Name | Age | Score | Hair | Length |
---|---|---|---|---|---|
True | Calle | 51 | 100 | Blond | 1.64 |
False | Lars | 47 | 81 | Dark | 1.81 |
False | Trump | 78 | 25 | Yellow | 1.90 |
True | Harris | 56 | 67 | Dark | 1.67 |
The outcome of this would be a new DF with
Name | Age | Score | Hair | Length |
---|---|---|---|---|
Calle | 51 | 100 | Blond | 1.64 |
Harris | 56 | 67 | Dark | 1.67 |
As can be seen in the result.
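     1: import pandas
     2: headers = ["Name", "Age", "Score", "Hair", "Length"]
     3: table = [["Calle", 51, 100, "Blond", 1.64], ["Lars", 47, 81, "Dark", 1.81], ["Trump", 78, 25, "Yellow", 1.90], ["Harris", 56, 67, "Dark", 1.67]]
     4: df = pandas.DataFrame(data=table, columns=headers)
     5:
     6: gt_50 = df['Score'] > 50
     7: print(f"{gt_50}\n----------")
     8: print(f">50\n {df[gt_50]}\n----------")
     9:
    10: test = [True, False, False, True]
    11: print(df[test])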
    0     True
    1     True
    2    False
    3     True
    Name: Score, dtype: bool
    ----------
    >50
          Name  Age  Score   Hair  Length
    0   Calle   51    100  Blond    1.64
    1    Lars   47     81   Dark    1.81
    3  Harris   56     67   Dark    1.67
    ----------
         Name  Age  Score   Hair  Length
    0   Calle   51    100  Blond    1.64
    3  Harris   56     67   Dark    1.67
But to be able to work with this we also need to be able to search, filter and calculate on columns, for example.
We are not restricted to pandas methods and functions; there are also other tools that can be used, for example functools and itertools. Any algorithm that works on iterators should be possible to use.
    import functools
    import itertools
    import operator

    import pandas

    # headers and table hold the example table shown above
    headers = ["Name", "Age", "Score", "Hair", "Length"]
    table = [["Calle", 51, 100, "Blond", 1.64], ["Lars", 47, 81, "Dark", 1.81],
             ["Trump", 78, 25, "Yellow", 1.90], ["Harris", 56, 67, "Dark", 1.67]]
    df = pandas.DataFrame(data=table, columns=headers)


    def f(lst, elem):
        lst.append(elem)
        return lst


    # Keep lengths > 1.7
    def filter_length(length: float) -> bool:
        return length > 1.7


    # create a list of all lengths.
    length_lst = functools.reduce(f, df['Length'], [])
    print(f"Lengths {length_lst}")

    # Filter out all that has length > 1.7
    length_filtered = filter(filter_length, length_lst)


    # add 20 to the length_filtered items.
    def add_20(elem):
        return elem + 20


    add_20_res = map(add_20, length_filtered)

    # Running sum of all the ages.
    it = itertools.accumulate(df['Age'], operator.add)
    print(f"Sum of ages (step wise) {list(it)}")
    print(list(add_20_res))
    Lengths [1.64, 1.81, 1.9, 1.67]
    Sum of ages (step wise) [51, 98, 176, 232]
    [21.81, 21.9]
This shows how we can use itertools and functools together with DataFrames. OK, we have dwelt enough on all this, let's go back to the focus area.
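For comparison, the same two calculations can also be done with pandas' own operations. This is just a sketch of the equivalent, not a claim that one way is better: boolean indexing replaces filter and map, and cumsum replaces accumulate.

```python
import pandas

headers = ["Name", "Age", "Score", "Hair", "Length"]
table = [["Calle", 51, 100, "Blond", 1.64], ["Lars", 47, 81, "Dark", 1.81],
         ["Trump", 78, 25, "Yellow", 1.90], ["Harris", 56, 67, "Dark", 1.67]]
df = pandas.DataFrame(data=table, columns=headers)

# Boolean indexing keeps the rows where Length > 1.7, then 20 is added.
tall_plus_20 = df.loc[df["Length"] > 1.7, "Length"] + 20
print(tall_plus_20.round(2).tolist())

# Running total of the ages, the same numbers itertools.accumulate gave.
running_age = df["Age"].cumsum()
print(running_age.tolist())
```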
Reading in values
So far we have just constructed DataFrames from inline tables. But now we need to focus on getting the data from Open-Meteo. We saw previously how that can be done with curl. Let's do this with requests instead.
So let's dig into the requests
Long | Lat |
---|---|
11.95 | 57.707 |
    import requests
    import pandas

    url = "https://api.open-meteo.com/v1/forecast"
    params = {
        "latitude": 57.707,
        "longitude": 11.95,
        "hourly": "temperature_2m,apparent_temperature,rain,showers,wind_speed_10m,wind_direction_180m",
        "wind_speed_unit": "ms",
        "forecast_days": 1
    }
    response = requests.get(url, params=params)
    df = pandas.DataFrame(response.json())
    print(df)
    print(df.loc['temperature_2m'])
                          latitude  ...                                             hourly
    time                  57.70479  ...  [2024-12-04T00:00, 2024-12-04T01:00, 2024-12-0...
    temperature_2m        57.70479  ...  [-3.2, -3.5, -3.6, -3.8, -3.6, -3.7, -3.1, -2....
    apparent_temperature  57.70479  ...  [-8.2, -8.7, -8.7, -8.9, -8.8, -8.9, -8.1, -7....
    rain                  57.70479  ...  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
    showers               57.70479  ...  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
    wind_speed_10m        57.70479  ...  [4.0, 4.2, 3.9, 4.1, 4.2, 4.1, 3.9, 3.7, 3.4, ...
    wind_direction_180m   57.70479  ...  [64, 68, 75, 80, 86, 87, 89, 85, 93, 96, 99, 9...

    [7 rows x 9 columns]
    latitude                                                          57.70479
    longitude                                                        11.956985
    generationtime_ms                                                 0.049949
    utc_offset_seconds                                                       0
    timezone                                                               GMT
    timezone_abbreviation                                                  GMT
    elevation                                                              2.0
    hourly_units                                                            °C
    hourly                   [-3.2, -3.5, -3.6, -3.8, -3.6, -3.7, -3.1, -2....
    Name: temperature_2m, dtype: object
This needs some explanation. First of all, we used the Python module requests to get the data, which, as we saw earlier in Data source, arrives as JSON. This is fed into the DataFrame constructor. Now we can start using it by extracting data.
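Feeding the whole JSON document to the constructor gives one row per weather variable, with a whole list packed into the hourly cell. As a minimal sketch of an alternative, the hourly part alone can be turned into a table with one row per hour. The payload below is a hand-shortened stand-in for response.json(), trimmed from the output above to three hours and two variables so the example runs without network access.

```python
import pandas

# Hand-shortened stand-in for response.json(); not a complete API response.
payload = {
    "latitude": 57.70479,
    "longitude": 11.956985,
    "hourly_units": {"time": "iso8601", "temperature_2m": "°C", "rain": "mm"},
    "hourly": {
        "time": ["2024-12-04T00:00", "2024-12-04T01:00", "2024-12-04T02:00"],
        "temperature_2m": [-3.2, -3.5, -3.6],
        "rain": [0.0, 0.0, 0.0],
    },
}

# Every key under "hourly" is an equally long list, so each becomes a column.
hourly = pandas.DataFrame(payload["hourly"])

# Parse the ISO 8601 strings and use them as the row index.
hourly["time"] = pandas.to_datetime(hourly["time"])
hourly = hourly.set_index("time")

print(hourly)
print(hourly["temperature_2m"].min())
```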
Obviously we don't have any error checking. So let's clean it up somewhat. Thinking in terms of action, data and computation, the data for the request is:
- Data:
- The URL, obviously
- Position
- Weather variables… Maybe a dictionary lookup
- Unit types (wind speed in m/s for example, temperature unit, time format)
- forecast_days
Let's construct this data as preliminary work.
The outline is quite simple
    +--------+        +----------+       +---------+       +----------+       +----------+
    | Weather|------->|Send data |------>|Read     |------>|Transform |------>|Transform |
    | Data   |        | to meteo |       |Response |       |DataFrame |       | Graph    |
    +--------+        +----------+       +---------+       +----------+       +----------+
Using Matplotlib
    from enum import Enum
    from typing import Callable

    import matplotlib.pyplot as plt
    import pandas
    import requests
    from pymonad.maybe import Maybe, Just, Nothing


    class PositionData:
        def __init__(self, latitude, longitude):
            self.lat = latitude
            self.lon = longitude

        def __eq__(self, other):
            return isinstance(other, PositionData) and self.lat == other.lat and self.lon == other.lon

        def __repr__(self):
            return f"PositionData(lat={self.lat}, lon={self.lon})"


    class WindSpeed(Enum):
        MS = 'ms'
        KMH = 'kmh'
        MPH = 'mph'
        KNOTS = 'knots'


    class Temperature(Enum):
        FAHRENHEIT = 'fahrenheit'
        CELSIUS = 'celsius'


    class TimeFormat(Enum):
        UNIX = 'unixtime'
        ISO8601 = 'ISO8601'


    class PrecipitationType(Enum):
        MM = 'mm'
        INCH = 'inch'


    class UnitTypes:
        def __init__(self,
                     wind_speed: WindSpeed = WindSpeed.MS,
                     temperature: Temperature = Temperature.CELSIUS,
                     time_format: TimeFormat = TimeFormat.UNIX,
                     precipitation: PrecipitationType = PrecipitationType.MM):
            self.wind_speed = wind_speed
            self.temperature = temperature
            self.time_format = time_format
            self.precipitation = precipitation

        def __str__(self):
            return (f'Wind Speed: {self.wind_speed.value}, '
                    f'Temperature: {self.temperature.value}, '
                    f'Time Format: {self.time_format.value}, '
                    f'Precipitation: {self.precipitation.value}')


    def create_unit_params(unit_types: UnitTypes):
        return {
            "wind_speed_unit": unit_types.wind_speed.value,
            #"timeformat": unit_types.time_format.value,
            "precipitation_unit": unit_types.precipitation.value,
            "temperature_unit": unit_types.temperature.value}


    class WeatherData:
        def __init__(self, position: PositionData,
                     parameters: list[str],
                     types: UnitTypes,
                     forcast_days: int = 1):
            self.variables = parameters
            self.types = types
            self.position = position
            self.forcast_days = forcast_days

        def __repr__(self):
            return (f"WeatherData(variables={self.variables}, "
                    f"types={self.types})")

        def __str__(self):
            return (f"Weather Data:\n"
                    f"Position {self.position}\n"
                    f"Variables: {', '.join(self.variables)}\n"
                    f"Types: {self.types}\n"
                    f"Forecast days: {self.forcast_days}")


    def create_weather_params(weather: WeatherData):
        return {
            "latitude": weather.position.lat,
            "longitude": weather.position.lon,
            "hourly": weather.variables,
            "forecast_days": weather.forcast_days,
        }


    def create_request_params(weather: WeatherData):
        weather_params = create_weather_params(weather)
        unit_params = create_unit_params(weather.types)
        return {**weather_params, **unit_params}


    def send_request(url: str) -> Callable[[WeatherData], Maybe]:
        # Returns a function that performs the request for a given WeatherData.
        def send_fn(weather: WeatherData) -> Maybe:
            params = create_request_params(weather)
            response = requests.get(url, params=params, timeout=10)
            if response.status_code == 200:
                return Just(response.json())
            else:
                return Nothing
        return send_fn


    def transform_to_df(json_data: dict) -> Maybe:
        # Wraps the DataFrame in Just so it can be chained with bind.
        return Just(pandas.DataFrame(json_data))


    weather_data = WeatherData(PositionData(57.707, 11.95),
                               ["temperature_2m",
                                "apparent_temperature",
                                "rain",
                                "showers",
                                "wind_speed_10m",
                                "wind_direction_180m"],
                               UnitTypes(),
                               2)

    api_ret_json = Just(weather_data).bind(send_request("https://api.open-meteo.com/v1/forecast")) \
                                     .bind(transform_to_df)

    df = api_ret_json.value

    x_axis = df.loc['time']['hourly']
    y_axis = df.loc['temperature_2m']['hourly']
    rain = df.loc['rain']['hourly']
    wind = df.loc['wind_speed_10m']['hourly']

    plt.figure(figsize=(10, 5))
    plt.plot(x_axis, y_axis, marker='o', color='blue', label='Temperature 2m')
    plt.bar(x_axis, rain, alpha=0.5, color='orange', label='rain')
    plt.bar(x_axis, wind, alpha=0.5, color='red', label='Wind')
    plt.title('Temperature and rain over Time')
    plt.xlabel('Time')
    plt.ylabel('Values')
    plt.xticks(x_axis, rotation=45)
    plt.legend()
    plt.tight_layout()

    plot_path = 'temperature_rain_plot.png'
    plt.savefig(plot_path)
    plt.close()

    plot_link = f'[[file:{plot_path}]]'
    print(plot_link)
So we managed to get a nice-looking graph of the temperature. But we want more! Let's think about it: we can see the graph of the temperature. We could get more information, and add graphs and whatnot here. Graphs are a beautiful way of getting a lot of detailed information, but maybe we just want an overview. What will the weather be today, in summary? Is it going to rain? What clothes should I wear? Something else is needed here.
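Before reaching for anything fancier, a plain numeric overview is already within reach with pandas. The sketch below is only an illustration: the hourly values are made up, and the choice of summary fields (min/max temperature, total rain, number of rainy hours, max wind) is an assumption about what an overview could contain.

```python
import pandas

# Made-up hourly values standing in for one day of forecast data.
hourly = pandas.DataFrame(
    {
        "temperature_2m": [-3.2, -1.0, 2.5, 1.0],
        "rain": [0.0, 0.4, 1.1, 0.0],
        "wind_speed_10m": [4.0, 4.2, 3.9, 3.4],
    },
    index=pandas.to_datetime(
        ["2024-12-04T00:00", "2024-12-04T06:00",
         "2024-12-04T12:00", "2024-12-04T18:00"]
    ),
)

summary = {
    "min_temp": hourly["temperature_2m"].min(),
    "max_temp": hourly["temperature_2m"].max(),
    "total_rain": hourly["rain"].sum(),
    "rainy_hours": int((hourly["rain"] > 0).sum()),  # rows with any rain
    "max_wind": hourly["wind_speed_10m"].max(),
}
print(summary)
```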
Using AI
AI is the word of the present and the future (I guess). I "need" to use it. And what better way than trying to make this graph into a nice summary of the weather today, at least to start with.
So where do we start? First we need to break it down: what do we want?
        +-----------------------------+
  Pos   | Transform Pos, Date         |  summary
 ------>| To a 24h summary and graph  |------------>
  Date  |                             |
 ------>|                             |
        +-----------------------------+
I'll stick to what I already have, but let's first focus on reaching an AI host.
Simple AI query
There are numerous ways of talking to a Large Language Model. Some libraries, such as langchain or llamaindex, are well known, but there are many others. These libraries are easy to use and good in many ways, but the core is the same: in some way they make a request to an API. The API may differ in places and between models, but let's not go there in this document. For now I want to make clear how this can be done with just the requests module.
import json
import os

import requests

# Assumption: the OpenRouter API key is stored in this environment variable.
API_KEY = os.environ["OPENROUTER_API_KEY"]

response = requests.post(
    url="https://openrouter.ai/api/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {API_KEY}",
    },
    data=json.dumps({
        "model": "mistralai/ministral-8b",
        "messages": [
            {
                "role": "system",
                "content": "Ha Ha Ha! Step right up, folks! "
                           "The Clown Prince of Crime turns... "
                           "Meteorologist? 😜 ",
            },
            {
                "role": "user",
                "content": "Why does it rain?",
            },
        ],
    }),
)

# Check if the request was successful
if response.status_code == 200:
    data = response.json()
    print("ID:", data['id'])
    print("Content:")
    for choice in data['choices']:
        print(choice['message']['content'])
else:
    print("Failed to retrieve data. Status code:", response.status_code)
ID: gen-1733348452-mmv3W7JcICL0o1pdYHNc Content: Ah, the humor! So, in my other life (when I'm not juggling boleholopes, skateboarding on a unicycle, or causing general mayhem), I'm guessing you're wondering about the science behind why it rains. Well, folks, it all begins with a simple process called evaporation. The sun heats up the oceans, lakes, and rivers, causing water to turn into water vapor - this is like the water dancing and twirling under the sun's warm gaze! These tiny water drops and ice crystals are lighter than air and ride the winds high into the sky, creating clouds. Sometimes, these clouds gather and bump into each other, like a thousand tiny dancers at a celestial ballet under the gaslit sky! Eventually, they bump into each other so much that they collide and stick together, becoming bigger and heavier. These become the raindrops (or sometimes, they're like poetic snowflakes, melting when they hit the ground). When they become too heavy for the clouds to hold, they tumble down, like tiny dancers singing "it's raining men" as they join the world below. So there you have it, folks! Rain may seem silly or annoying at times, but it's really just nature's way of giving us a big ol' water party, and dancing on the ground and in the oceans! 💧🐳
OK, somewhat overdramatic, but we get the idea, and for this exercise it does not really matter. We just wanted some answers and we got them.
The idea now is to try to merge the two ideas. But again, let's first clean this up with some more abstractions.
First, what data are we using?
- Url
- Header
- model
- messages
  - system
  - user
- Other (we might have more)
If we group these, Url and Header go into the communication data, ComData (the code below also keeps the model name there), while the rest relates to the actual conversation, ConversationData. That leaves the data we want summarized, which in this case is some JSON. The AI knows how to interpret JSON, so there is no need to transform it into something else.
The idea is to take the data from Open-Meteo and send it to the AI, which produces a summary, and then publish that summary. We haven't discussed the publishing part yet; that is a separate step.
#!/usr/bin/env python
import json
import os
from enum import Enum
from typing import Callable

import pandas
import requests
from pymonad.maybe import Maybe, Just, Nothing

# Assumption: the OpenRouter API key is stored in this environment variable.
API_KEY = os.environ["OPENROUTER_API_KEY"]


class Headers:
    def __init__(self, headers_dict: dict):
        if not isinstance(headers_dict, dict):
            raise TypeError("Headers must be initialized with a dictionary")
        self.fields = headers_dict

    def __repr__(self):
        return f"Headers({self.fields})"

    @classmethod
    def from_dict(cls, headers_dict: dict):
        return cls(headers_dict)


class ComData:
    """Everything needed to reach the AI host: url, headers and model name."""

    def __init__(self, *, url: str, model: str, headers: Headers):
        if not isinstance(headers, Headers):
            raise TypeError("Headers must be an instance of Headers class")
        self.url = url
        self.headers = headers
        self.model = model


class ConversationData:
    """The conversation itself: system prompt, user query and weather data."""

    def __init__(self, *, query: str, system: str, data: pandas.DataFrame):
        self.user = query
        self.system = system
        self.data = data

    def __repr__(self) -> str:
        return f"ConversationData(user='{self.user}', system='{self.system}')"

    def make_message(self) -> list[dict[str, str]]:
        return [
            {
                "role": "system",
                "content": self.system + "\n\nHere's the weather data for analysis:"
            },
            {
                "role": "system",
                "content": "The following message contains weather forecast data. "
                           "Each dictionary in the list represents a different weather "
                           "parameter, with the 'hourly' key containing a list of 24 values, "
                           "one for each hour of the day."
            },
            {
                "role": "user",
                # Keep the question and the JSON payload visibly separated.
                "content": self.user + "\n\n" + self.data.to_json()
            }]


def conversation_to_json(conv_data: ConversationData):
    return json.dumps(conv_data.make_message())


def make_data(com_data: ComData, conv_data: ConversationData):
    msg = {
        "model": com_data.model,
        "messages": conv_data.make_message()
    }
    return msg


def ai_send_fn(com_data: ComData) -> Callable[[ConversationData], Maybe]:
    def send_fn(conv_data: ConversationData):
        # Use the arguments, not the module-level globals of the same name.
        data = make_data(com_data, conv_data)
        req_obj = requests.Request('POST',
                                   com_data.url,
                                   json=data,
                                   headers=com_data.headers.fields)
        with requests.Session() as session:
            req = req_obj.prepare()
            # verify=False switches off TLS certificate checking; drop it if you can.
            response = session.send(req, verify=False)
            if response.status_code == 200:
                return Just(response)
            else:
                # A "Nothing" that still carries the failed response for inspection.
                return Maybe(value=response, monoid=False)
    return send_fn


class PositionData:
    def __init__(self, latitude, longitude):
        self.lat = latitude
        self.lon = longitude

    def __eq__(self, other):
        return isinstance(other, PositionData) and self.lat == other.lat and self.lon == other.lon

    def __repr__(self):
        return f"PositionData(lat={self.lat}, lon={self.lon})"


class WindSpeed(Enum):
    MS = 'ms'
    KMH = 'kmh'
    MPH = 'mph'
    KNOTS = 'knots'


class Temperature(Enum):
    FAHRENHEIT = 'fahrenheit'
    CELSIUS = 'celsius'


class TimeFormat(Enum):
    UNIX = 'unixtime'
    ISO8601 = 'ISO8601'


class PrecipitationType(Enum):
    MM = 'mm'
    INCH = 'inch'


class UnitTypes:
    def __init__(self,
                 wind_speed: WindSpeed = WindSpeed.MS,
                 temperature: Temperature = Temperature.CELSIUS,
                 time_format: TimeFormat = TimeFormat.UNIX,
                 precipitation: PrecipitationType = PrecipitationType.MM):
        self.wind_speed = wind_speed
        self.temperature = temperature
        self.time_format = time_format
        self.precipitation = precipitation

    def __str__(self):
        return (f'Wind Speed: {self.wind_speed.value}, '
                f'Temperature: {self.temperature.value}, '
                f'Time Format: {self.time_format.value}, '
                f'Precipitation: {self.precipitation.value}')


def create_unit_params(unit_types: UnitTypes):
    return {
        "wind_speed_unit": unit_types.wind_speed.value,
        #"timeformat": unit_types.time_format.value,
        "precipitation_unit": unit_types.precipitation.value,
        "temperature_unit": unit_types.temperature.value}


class WeatherData:
    def __init__(self,
                 position: PositionData,
                 parameters: list[str],
                 types: UnitTypes,
                 forcast_days: int = 1):
        self.variables = parameters
        self.types = types
        self.position = position
        self.forcast_days = forcast_days

    def __repr__(self):
        return (f"WeatherData(variables={self.variables}, "
                f"types={self.types})")

    def __str__(self):
        return (f"Weather Data:\n"
                f"Position {self.position}\n"
                f"Variables: {', '.join(self.variables)}\n"
                f"Types: {self.types}"
                f"Forcast days: {self.forcast_days}")


def create_weather_params(weather: WeatherData):
    return {
        "latitude": weather.position.lat,
        "longitude": weather.position.lon,
        "hourly": weather.variables,
        "forecast_days": weather.forcast_days,
    }


def create_request_params(weather: WeatherData):
    weather_params = create_weather_params(weather)
    unit_params = create_unit_params(weather.types)
    return {**weather_params, **unit_params}


def send_request(url: str) -> Callable[[WeatherData], Maybe]:
    def send_fn(weather: WeatherData):
        params = create_request_params(weather)
        response = requests.get(url, params=params, verify=False, timeout=10)
        if response.status_code == 200:
            return Just(response.json())
        else:
            return Nothing
    return send_fn


def transform_to_df(payload: dict) -> Maybe:
    # The decoded JSON is a dict; wrap the DataFrame in Just so it chains with bind.
    return Just(pandas.DataFrame(payload))


###############################################################################
# Here we set the values we want.                                             #
###############################################################################
headers = Headers({
    "Authorization": f"Bearer {API_KEY}",
    # "Content-Type": "application/json"
})

comdata = ComData(
    url="https://openrouter.ai/api/v1/chat/completions",
    headers=headers,
    model="google/gemini-flash-1.5"
)

weather_data = WeatherData(
    PositionData(57.707, 11.95),
    ["temperature_2m",
     "apparent_temperature",
     "rain",
     "showers",
     "wind_speed_10m",
     "wind_direction_180m"],
    UnitTypes(),
    forcast_days=3
)

api_ret_json = Just(weather_data).bind(send_request("https://api.open-meteo.com/v1/forecast")) \
    .bind(transform_to_df)
# Note: .value is None if the request failed.
df = api_ret_json.value

context = """
This weather data contains the following information:
1. Time: Hourly timestamps
2. Temperature (2m above ground): in °C
3. Apparent temperature: in °C
4. Rain: precipitation in mm
5. Showers: precipitation from showers in mm
6. Wind speed (10m above ground): in m/s
7. Wind direction (180m above ground): in degrees

The 'hourly' key in each dictionary contains a list of values corresponding to
these measurements for each hour of the day.
The role weather_data will contain a json with 24h weather data.
Analyse the data and answer the user question.
"""

conv_data = ConversationData(
    query="My name is Carl. I take my bike every morning between 08:00-09:00 "
          "and go home every evening between 17:00-18:00. "
          "Give me a summary of the weather forecast. "
          "I head west in the mornings and east in the evenings, and I want to "
          "know whether I am heading into the wind or have a tailwind. "
          "Also give me suggestions for clothing during my ride, and tell me "
          "if there are any significant changes during the period.",
    system=context,
    data=df
)

send_fn = ai_send_fn(comdata)
maybe_response = send_fn(conv_data)

if maybe_response.is_just():
    data = maybe_response.value.json()
    content = data['choices'][0]['message']['content']
    print(content)
else:
    print("Request failed. Status code:", maybe_response.value.status_code)
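The query asks the model to work out headwind versus tailwind, which is plain trigonometry and can be double-checked without any AI. The sketch below is an assumption-laden illustration: `headwind_component` and `describe` are hypothetical helpers, the 0.5 m/s threshold is arbitrary, and it assumes the wind direction follows the usual meteorological convention (the direction the wind blows from, in degrees clockwise from north). The numbers are taken from the sample reply shown in the TTS section.

```python
import math


def headwind_component(wind_speed: float, wind_from_deg: float, heading_deg: float) -> float:
    """Wind speed along the direction of travel.

    Positive means headwind, negative means tailwind.
    wind_from_deg: direction the wind blows FROM (assumed convention).
    heading_deg:   direction of travel. Both in degrees clockwise from north.
    """
    angle = math.radians(wind_from_deg - heading_deg)
    return wind_speed * math.cos(angle)


def describe(component: float, threshold: float = 0.5) -> str:
    """Label the component; the threshold (m/s) is an arbitrary choice."""
    if component > threshold:
        return "headwind"
    if component < -threshold:
        return "tailwind"
    return "crosswind"


WEST, EAST = 270.0, 90.0

# Morning: about 5.2 m/s from 226 degrees, riding west.
morning = headwind_component(5.2, 226.0, WEST)
# Evening: about 6.4 m/s from 251 degrees, riding east.
evening = headwind_component(6.4, 251.0, EAST)

print(f"morning: {morning:.1f} m/s ({describe(morning)})")
print(f"evening: {evening:.1f} m/s ({describe(evening)})")
```

With these inputs the check agrees with the model's answer: a headwind heading west in the morning and a tailwind heading east in the evening.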
TTS
We have generated some text, which is good, and we could also print some graphs, which is nice. The problem is that with all of these one has to open some kind of web page to retrieve the information, read it, and work out what to do. I for one don't have time for that in the morning; I just want someone to tell me what will happen today. A brief, nice summary of today's weather so I know what to expect. I guess I'm talking about TTS (text-to-speech).
cat << EOF > /tmp/test.txt
Hi Carl, here's a summary of your bike ride weather forecast for October 22nd and 23rd, considering your morning and evening commutes between 8:00-9:00 am and 5:00-6:00 pm, and your orientation to wind direction.

**Morning Commute (8:00-9:00 am):**

* **Temperature:** Around 12.5°C (average of 12.4°C and 12.7°C on the 22nd and 23rd respectively). Apparent temperature will be slightly lower around 10°C.
* **Wind:** The wind speed will be approximately 5.2 m/s to 5.4 m/s on both days. Wind direction is around 226° to 231°(Morning). Since you're heading west, this means you'll experience a **headwind** in the mornings.

**Evening Commute (5:00-6:00 pm):**

* **Temperature:** Temperatures will be around 12.9°C and 12.7°C (average of 12.9°C and 12.7°C on the 22nd and 23rd respectively). Apparent temperature will be around 9.9°C and 9.5°C.
* **Wind:** The wind speed will be around 6.4 m/s and 5.8 m/s in the evenings. The wind direction is around 251° and 246°. As your heading is east, you'll have a **tailwind** in the evenings.

**Overall:**

* **Temperature:** Expect mild temperatures throughout your commutes, but it might feel a tad cooler due to the wind chill.
* **Precipitation:** No rain is predicted during your commute hours.
* **Clothing Suggestions:** Layers are your friend! Start with a base layer (thermal top and bottom if it feels particularly cold), add a mid-layer (fleece or light jacket) and a light windbreaker or waterproof shell for protection against the wind(especially in the morning.

**Significant Changes During the Period:**

There is a moderate change in temperature in the evenings of both days. The wind will increase in speed between your morning and evening commutes, although it will be a tailwind in the evenings. There's a light rain of 0.2 mm at 4pm on the 22nd.

Remember to check the specific forecast closer to your ride time for the most up-to-date information. Have a pleasant ride, Carl!
EOF
gtts-cli -f /tmp/test.txt -t com.au | play -t mp3 -
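The model's reply is written in Markdown, and depending on the TTS engine the asterisks may be read aloud or cause odd pauses. A minimal sketch of cleaning the text first, using only the standard library, could look like this. `plain_text_for_tts` is a hypothetical helper and only handles the bold markers and bullets seen in the reply above, nothing more general.

```python
import re


def plain_text_for_tts(markdown: str) -> str:
    """Strip the Markdown seen in the reply: **bold** markers and list bullets."""
    # **bold** -> bold
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", markdown)
    # Drop a leading "* " or "- " bullet on each line.
    text = re.sub(r"^[ \t]*[*-][ \t]+", "", text, flags=re.MULTILINE)
    # Collapse runs of spaces or tabs, keep line breaks as pauses.
    return re.sub(r"[ \t]+", " ", text).strip()


sample = (
    "**Morning Commute:**\n"
    "* **Wind:** 5.2 m/s, a **headwind**.\n"
    "* **Temperature:** 12.5°C"
)

cleaned = plain_text_for_tts(sample)
print(cleaned)
```

The cleaned text can then be written to a file and passed to gtts-cli in the same way as above.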
I will stop here for now. There are, however, better ways of getting text-to-speech, but I will leave that for another exercise.