Python pandas

Python and pandas overview

This is more of an investigation into how to use pandas. I need to start somewhere. :)

First of all, the most essential, what is pandas?

  • pandas is a Python library used for working with data sets.

In essence, you can read, analyze, manipulate, clean and explore data sets. So now we have full control, right? Well, maybe not, but at least we know what to expect.

Data source

First things first, we need something to work with: some data. So let's dive into open-meteo.com, which seems to be quite a cool tool (API) for getting weather data.

OK, I'm going to use Lat: 57.70761164715305, Long: 11.952310770887784, which is in Gothenburg, Sweden.

Here is a link to available data. There is a huge number of different variables; I won't go through them all, but some data that is easy to understand would be nice.

There is a curl command on the website that can be used to get some data. Let's bring in some data so we can inspect it:

curl "https://api.open-meteo.com/v1/forecast?\
latitude=57.707&\
longitude=11.95&\
hourly=temperature_2m,\
apparent_temperature,\
rain,\
showers,\
wind_speed_10m,\
wind_direction_180m&\
wind_speed_unit=ms&\
forecast_days=1" | jq
{
  "latitude": 57.70479,
  "longitude": 11.956985,
  "generationtime_ms": 0.10097026824951172,
  "utc_offset_seconds": 0,
  "timezone": "GMT",
  "timezone_abbreviation": "GMT",
  "elevation": 2.0,
  "hourly_units": {
    "time": "iso8601",
    "temperature_2m": "°C",
    "apparent_temperature": "°C",
    "rain": "mm",
    "showers": "mm",
    "wind_speed_10m": "m/s",
    "wind_direction_180m": "°"
  },
  "hourly": {
    "time": [
      "2024-10-05T00:00",
        ....
      "2024-10-05T23:00"
    ],
    "temperature_2m": [
      12.0,
        ...
      11.4
    ],
    "apparent_temperature": [
      9.1,
      ...
      9.7
    ],
    "rain": [
      0.00,
      ...
      0.00
    ],
    "showers": [
      0.00,
      ...
      0.00
    ],
    "wind_speed_10m": [
      4.10,
      ...
      2.60
    ],
    "wind_direction_180m": [
      245,
      ...
      205
    ]
  }
}

Nice! We have some data for a weather forecast for the coming 24 hours, but maybe we want to fetch it with Python to start with, and then use pandas to extract what we want.

It seems that Open-Meteo already has a Python package, but that's somewhat cheating (actually not, but for this exploration it's better to use the requests module for Python, to make it clearer what happens).

Once we have retrieved the data, we can construct pandas Series and DataFrames from the values, but let's not get ahead of ourselves. Let's first explore pandas a bit.

Basic Structure

Let's define the basic structure of pandas. There are two data structures involved: the Series and the DataFrame. The main difference is that a Series is a one-dimensional labeled array, while a DataFrame is two-dimensional labeled data. To be fair, a DataFrame consists of columns that are each a Series, but let's dig into each of them in the following sections.
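
A minimal sketch of the difference, assuming a small made-up table (the names temps and frame and all values are only there for illustration):

```python
import pandas

# A Series: one-dimensional, one label per value.
temps = pandas.Series([12.0, 11.4, 10.9], index=["00:00", "01:00", "02:00"])

# A DataFrame: two-dimensional, labeled rows and labeled columns.
frame = pandas.DataFrame(
    {"temperature": [12.0, 11.4, 10.9], "rain": [0.0, 0.2, 0.0]},
    index=["00:00", "01:00", "02:00"],
)

print(temps.ndim, temps.shape)   # 1 (3,)
print(frame.ndim, frame.shape)   # 2 (3, 2)

# Selecting a single column from the DataFrame gives back a Series.
print(type(frame["rain"]).__name__)  # Series
```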

Series

One can see a Series as an array, or maybe as a column in a database table (with an index, i.e. a row name). All elements of the array share one data type (dtype), for example int, float, string or object; if the values are of mixed types, pandas falls back to the generic object dtype. A Series is a fundamental data structure in pandas. How is a Series constructed? A Series can be made from lists, dictionaries, iterables or scalar values.

Let's show some code. First we create a table with just one column from a list of numbers.

import pandas

#Table from a list
list_of_vals = [1,200,300,400]
series = pandas.Series(list_of_vals)


# Table with index
series_vals = [500,600,700,800]
series_index = ['five','six','seven','eight']
series_with_label = pandas.Series(data=series_vals,index=series_index)

print(f"{series}\n----------\n{series_with_label}")
0      1
1    200
2    300
3    400
dtype: int64
----------
five     500
six      600
seven    700
eight    800
dtype: int64

The output above shows that each row is named either by the given index or by a default running number 0, 1, 2, 3… Obviously, this also maps well onto dictionaries, where the key becomes the row name.

import pandas

dicker = {'a': 2.1,'b': 3.2, 'c': 7.7}

serie =pandas.Series(dicker)

print(serie)

a    2.1
b    3.2
c    7.7
dtype: float64

Another thing to note is that if data is a dictionary the order is maintained.
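
The order claim, and the scalar construction mentioned earlier, can be checked with a short sketch. The variable names and values here are made up for the illustration:

```python
import pandas

# Keys are deliberately not in alphabetical order.
unsorted = {"c": 7.7, "a": 2.1, "b": 3.2}
from_dict = pandas.Series(unsorted)
print(list(from_dict.index))  # ['c', 'a', 'b']

# A scalar is repeated once for every index label.
from_scalar = pandas.Series(5, index=["x", "y", "z"])
print(from_scalar.tolist())   # [5, 5, 5]

# Mixing types is possible, but the dtype falls back to the generic object.
mixed = pandas.Series([1, "two", 3.0])
print(mixed.dtype)            # object
```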

I guess we could also use a generator function. A generator function returns an iterator that yields one value each time it is asked for the next one. It keeps its state between those calls (that is, num keeps increasing from one value to the next).

import pandas
from collections.abc import Iterator

def calc_num_times(n: int) -> Iterator[int]:
    """Yield the first n even numbers: 0, 2, 4, ..."""
    num = 0
    while num < n:
        yield num * 2
        num += 1


series = pandas.Series(calc_num_times(8))
print(series)

0     0
1     2
2     4
3     6
4     8
5    10
6    12
7    14
dtype: int64

We could even use a Series for storing arbitrary objects. But now we need to be careful, since a pandas Series of objects only stores references to them. That means that any change to an object in the original list will also be visible through the pandas Series, and vice versa.

The following code demonstrates the problem.

 1: import pandas
 2: 
 3: 
 4: class Person:
 5:     def __init__(self, name, age, weight):
 6:         self.name = name
 7:         self.age = age
 8:         self.weight = weight
 9:      
10:     def __str__(self):
11:         return f"Person(name={self.name}, age={self.age}, weight={self.weight})"
12:      
13:     def __eq__(self, other):
14:         return (self.name, self.age, self.weight) == (other.name, other.age, other.weight)
15:      
16:      
17:      
18:      
19: persons = [
20:     Person("Alice", 30, 60),
21:     Person("Bob", 25, 70),
22:     Person("Charlie", 35, 80),
23:     Person("David", 40, 75),
24:     Person("Eve", 28, 65)
25: ]
26: 
27: index = [person.name for person in persons]
28: series = pandas.Series(data=persons, index=index)
29: print(f"{series}\n----------")
30: 
31: series['Bob'].age=4 #Change age of Bob!
32: 
33: print(f"{series}\n----------")
34: [print(f"{person}") for person in persons]
Alice        Person(name=Alice, age=30, weight=60)
Bob            Person(name=Bob, age=25, weight=70)
Charlie    Person(name=Charlie, age=35, weight=80)
David        Person(name=David, age=40, weight=75)
Eve            Person(name=Eve, age=28, weight=65)
dtype: object
----------
Alice        Person(name=Alice, age=30, weight=60)
Bob             Person(name=Bob, age=4, weight=70)
Charlie    Person(name=Charlie, age=35, weight=80)
David        Person(name=David, age=40, weight=75)
Eve            Person(name=Eve, age=28, weight=65)
dtype: object
----------
Person(name=Alice, age=30, weight=60)
Person(name=Bob, age=4, weight=70)
Person(name=Charlie, age=35, weight=80)
Person(name=David, age=40, weight=75)
Person(name=Eve, age=28, weight=65)
  • Line 29 prints the Series, where Bob.age=25
  • Line 31 changes Bob through the Series, now Bob.age=4
  • Line 33 prints the Series again (Bob's age is now 4, as expected)
  • Line 34 prints the initial list of persons. Bob has been changed here too, which means any change made through the Series is reflected in the persons list as well

What I wanted to highlight is that when I change the age of Bob in the Series, the change also shows up in the Bob in the persons list. This is because pandas.Series only stores references to the objects.
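
The shared reference can also be checked directly with the is operator. This is only a sketch; the Box class is a hypothetical stand-in for Person:

```python
import pandas

class Box:
    """Hypothetical tiny mutable object, used only for this illustration."""
    def __init__(self, value):
        self.value = value

boxes = [Box(1), Box(2)]
series = pandas.Series(data=boxes, index=["first", "second"])

# The Series element and the list element are the very same object.
print(series["first"] is boxes[0])   # True

# So a change made through the Series is visible through the list.
series["second"].value = 99
print(boxes[1].value)                # 99
```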

One way to deal with this is to copy all the values from persons into the Series; basically, copy each of the items and build a new list from the copies before we create the Series.

 1: import pandas
 2: import copy
 3: 
 4: class Person:
 5:     def __init__(self, name, age, weight):
 6:         self.name = name
 7:         self.age = age
 8:         self.weight = weight
 9:      
10:     def __str__(self):
11:         return f"Person(name={self.name}, age={self.age}, weight={self.weight})"
12:      
13:     def __eq__(self, other):
14:         return (self.name, self.age, self.weight) == (other.name, other.age, other.weight)
15:      
16:      
17: persons = [
18:     Person("Alice", 30, 60),
19:     Person("Bob", 25, 70),
20:     Person("Charlie", 35, 80),
21:     Person("David", 40, 75),
22:     Person("Eve", 28, 65)
23: ]
24: 
25: index = [person.name for person in persons]
26: series = pandas.Series(data=[copy.copy(person) for person in persons]
27:                        , index=index, copy=True)
28: print(f"{series}\n----------")
29: 
30: series['Bob'].age = 4  # Change the age of Bob in the Series
31: 
32: print(f"{series}\n----------")
33: [print(f"{person}") for person in persons]
34: 
Alice        Person(name=Alice, age=30, weight=60)
Bob            Person(name=Bob, age=25, weight=70)
Charlie    Person(name=Charlie, age=35, weight=80)
David        Person(name=David, age=40, weight=75)
Eve            Person(name=Eve, age=28, weight=65)
dtype: object
----------
Alice        Person(name=Alice, age=30, weight=60)
Bob             Person(name=Bob, age=4, weight=70)
Charlie    Person(name=Charlie, age=35, weight=80)
David        Person(name=David, age=40, weight=75)
Eve            Person(name=Eve, age=28, weight=65)
dtype: object
----------
Person(name=Alice, age=30, weight=60)
Person(name=Bob, age=25, weight=70)
Person(name=Charlie, age=35, weight=80)
Person(name=David, age=40, weight=75)
Person(name=Eve, age=28, weight=65)
  • Line 26 copies every item in the persons list into a new list by using a list comprehension; this list is then used in the Series constructor.

In this case we made a copy of each object by using a shallow copy. This worked great because name (str), age (int) and weight (int) are all immutable. But if we add another attribute, called tags, which is a list, then we have the same problem again, one level down.

 1: import pandas
 2: import copy
 3: from pprint import pprint
 4: 
 5: class Person:
 6:     def __init__(self, name, age, tags):
 7:         self.name = name
 8:         self.age = age
 9:         self.tags = tags
10:      
11:     def __str__(self):
12:         return f"Person(name={self.name}, age={self.age}, tags={', '.join(self.tags)})"
13:      
14:     def __eq__(self, other):
15:         return (self.name, self.age, self.tags) == (other.name, other.age, other.tags)
16:      
17:      
18: Alice_tags = ["friendly", "creative"]
19: # generate a list of 5 persons with different names and tags , a tag is a list of attribute for a person
20: persons = [
21:     Person("Alice", 30, Alice_tags),
22:     Person("Bob", 25, ["analytical", "quiet"]),
23:     Person("Charlie", 35, ["outgoing", "adventurous"]),
24:     Person("Diana", 28, ["organized", "detail-oriented"]),
25:     Person("Eve", 32, ["curious", "independent"])
26: ]
27: 
28: #Make a shallow copy
29: series_copy = pandas.Series(data=[copy.copy(person) for person in persons]
30:                             ,index=[person.name for person in persons], copy=True)
31:      
32: Alice_tags.append('fishy')
33: 
34: series_copy['Alice'].age = 12
35: pprint([f"{person.name}: {person.tags}, {person.age}" for person in series_copy])
36: pprint("----------")
37: pprint([f"{person.name}: {person.tags}, {person.age}" for person in persons])
["Alice: ['friendly', 'creative', 'fishy'], 12",
 "Bob: ['analytical', 'quiet'], 25",
 "Charlie: ['outgoing', 'adventurous'], 35",
 "Diana: ['organized', 'detail-oriented'], 28",
 "Eve: ['curious', 'independent'], 32"]
'----------'
["Alice: ['friendly', 'creative', 'fishy'], 30",
 "Bob: ['analytical', 'quiet'], 25",
 "Charlie: ['outgoing', 'adventurous'], 35",
 "Diana: ['organized', 'detail-oriented'], 28",
 "Eve: ['curious', 'independent'], 32"]

This example has some highlights I wanted to show.

Line 18
Creates a list of tags, which is used to create the Alice Person object. We know that this is passed as a reference.
Line 21
Here we create the Person object Alice and pass in the tag list that we created on line 18.
Line 29
At this point we make a shallow copy of every item in the persons list, which means we should be safe changing data in the Series, and any change to Alice_tags should only be reflected in the Alice in the persons list. Or should it?
Line 32
Now we add 'fishy' to Alice_tags. Our intention is that this should only be reflected in the persons list.
Line 34
We also change the age of Alice in the Series to 12. Again, since the persons were copied into the Series, the Alice in the persons list should be unaffected.

But if we look at the output, we spot a small but significant flaw in our hypothesis. The age change to 12 is in fact only visible in the Series. But any change to Alice's tags list is also reflected in the Series! This is not what we wanted. What happened?

A shallow copy creates a new Person object, but its attributes still refer to the same objects as the original. Assigning a new age simply rebinds the age attribute on the copy, so the original is untouched. The tags list, however, is mutable and is shared by reference between the original and the copy. Therefore any in-place change to the tags list is visible in all three places (Alice_tags, persons and the Series). This can be a good idea in some circumstances, but it can be devastating in others. The idea for this test was to have the Series and persons completely separated: a change to one should not be reflected in the other. So how do we deal with this situation? Of course, if there is a shallow copy there has to be a deep copy.

A deep copy looks at every structure and copies it recursively. In the case of the persons list, name and age are immutable, so they can safely be shared. When it comes to the tags list, it creates a new list and copies each of its items. If the tags list contained other mutable objects, deep copy would copy those too and recursively go down the path until everything is copied. Let's have an example:

import pandas
import copy


class Person:
    def __init__(self, name, age, tags):
        self.name = name
        self.age = age
        self.tags = tags
     
    def __str__(self):
        return f"Person(name={self.name}, age={self.age}, tags={', '.join(self.tags)})"
     
    def __eq__(self, other):
        return (self.name, self.age, self.tags) == (other.name, other.age, other.tags)
     
     
Alice_tags = ["friendly", "creative"]
# generate a list of 5 persons with different names and tags , a tag is a list of attribute for a person
persons = [
    Person("Alice", 30, Alice_tags),
    Person("Bob", 25, ["analytical", "quiet"]),
    Person("Charlie", 35, ["outgoing", "adventurous"]),
    Person("Diana", 28, ["organized", "detail-oriented"]),
    Person("Eve", 32, ["curious", "independent"])
]

index = [person.name for person in persons]
series = pandas.Series(data=copy.deepcopy(persons)
                       , index=index)
print(f"{series}\n----------")

series['Alice'].tags = ['angry']
Alice_tags.append("sad")

print(f"{series}\n----------")
[print(f"{person}") for person in persons]

Alice      Person(name=Alice, age=30, tags=friendly, crea...
Bob         Person(name=Bob, age=25, tags=analytical, quiet)
Charlie    Person(name=Charlie, age=35, tags=outgoing, ad...
Diana      Person(name=Diana, age=28, tags=organized, det...
Eve        Person(name=Eve, age=32, tags=curious, indepen...
dtype: object
----------
Alice                 Person(name=Alice, age=30, tags=angry)
Bob         Person(name=Bob, age=25, tags=analytical, quiet)
Charlie    Person(name=Charlie, age=35, tags=outgoing, ad...
Diana      Person(name=Diana, age=28, tags=organized, det...
Eve        Person(name=Eve, age=32, tags=curious, indepen...
dtype: object
----------
Person(name=Alice, age=30, tags=friendly, creative, sad)
Person(name=Bob, age=25, tags=analytical, quiet)
Person(name=Charlie, age=35, tags=outgoing, adventurous)
Person(name=Diana, age=28, tags=organized, detail-oriented)
Person(name=Eve, age=32, tags=curious, independent)

In this example we made a deep copy of the persons list. A deep copy also copies all underlying structures, including mutable ones (for example a list). So instead of copying each item and building a new list from the copied objects, deepcopy does the same thing in one call, and it also makes sure that the tags in each Person object get copied.
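
The difference between the two kinds of copy can be boiled down to a few lines. This sketch uses a plain dictionary instead of the Person class, purely to keep it short:

```python
import copy

original = {"name": "Alice", "tags": ["friendly", "creative"]}

shallow = copy.copy(original)      # new dict, but the same inner list
deep = copy.deepcopy(original)     # new dict and a new inner list

print(shallow["tags"] is original["tags"])  # True
print(deep["tags"] is original["tags"])     # False

# An in-place change to the original list shows up in the shallow copy only.
original["tags"].append("sad")
print(shallow["tags"])  # ['friendly', 'creative', 'sad']
print(deep["tags"])     # ['friendly', 'creative']
```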

Using Series

Series are extremely useful and come with many related functions and methods. I won't go through them all, but let's check out one.

import pandas

def transform_to_str(element:int)->str:
    return f"String Element: {element*2}"

series = pandas.Series(data=[1,2,3,4,5], index=["one","two","three","four","five"])

print(f"{series}")
pd = series.transform(transform_to_str)
print(f"{pd}")



one      1
two      2
three    3
four     4
five     5
dtype: int64
one       String Element: 2
two       String Element: 4
three     String Element: 6
four      String Element: 8
five     String Element: 10
dtype: object

This produced a new Series with each value transformed into a string.
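
For comparison, here is a sketch of two other ways to get the same strings: element-wise with Series.map, and vectorised on the whole Series at once. The variable names are only illustrative:

```python
import pandas

series = pandas.Series(data=[1, 2, 3, 4, 5],
                       index=["one", "two", "three", "four", "five"])

# Element-wise, with a function applied to every value.
mapped = series.map(lambda element: f"String Element: {element * 2}")

# Vectorised: arithmetic and string concatenation on the whole Series at once.
vectorised = "String Element: " + (series * 2).astype(str)

print(mapped.tolist() == vectorised.tolist())  # True
print(vectorised["five"])                      # String Element: 10
```

The vectorised form avoids calling a Python function per element, which tends to matter once the Series gets large.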

Another useful function is reduce. It is not actually part of pandas; it lives in functools and works on any iterable, which includes a Series.

Let's see how this works.

import pandas
import functools


def agg(init_val: dict, ele: int)->dict:
    init_val[f"element_{ele}"] = ele*3
    return init_val



series = pandas.Series(data=range(1,6), index=["one","two","three","four","five"])

print(f"{series}")
# Creating a dictionary from the series using reduce.
pd = functools.reduce(agg,series,{})
print(f"{pd}")

one      1
two      2
three    3
four     4
five     5
dtype: int64
{'element_1': 3, 'element_2': 6, 'element_3': 9, 'element_4': 12, 'element_5': 15}

So let's leave it at this for the moment. But I strongly recommend checking out functools and reading Grokking Simplicity. But that's another story for another day.

DataFrame

Let's move over to the DataFrame (DF); a DF is two-dimensional, size-mutable, heterogeneous tabular data.

What does this mean? First, it's two-dimensional. That kind of makes sense.

Let's make a table

A   B   C   D   E
a1  b1  c1  d1  e1
a2  b2  c2  d2  e2
a3  b3  c3  d3  e3
a4  b4  c4  d4  e4

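Each row becomes an inner array in the outer array (and each column becomes a Series, if you want pandas). For the table above we would get something like

[
    ["a1", "b1", "c1", "d1", "e1"],
    ["a2", "b2", "c2", "d2", "e2"],
    ["a3", "b3", "c3", "d3", "e3"],
    ["a4", "b4", "c4", "d4", "e4"]
]
import pandas
from tabulate import tabulate

# Column names and rows of the table above (generated instead of typed out).
headers = ["A", "B", "C", "D", "E"]
table = [[f"{col.lower()}{row}" for col in headers] for row in range(1, 5)]

df = pandas.DataFrame(data=table, columns=headers, index=['row1','row2','row3','row4'])
# Render the DataFrame as an org-mode table.
org_table = tabulate(df, headers='keys', tablefmt='orgtbl', showindex=True)

print(org_table)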
      A   B   C   D   E
row1  a1  b1  c1  d1  e1
row2  a2  b2  c2  d2  e2
row3  a3  b3  c3  d3  e3
row4  a4  b4  c4  d4  e4

To summarize:

data
the table of items
columns
the column headers; there must be as many headers as there are columns in the table
index
the row index (by default it is 0, 1, 2, 3…); there must be as many labels as there are rows in the table

Each of the columns becomes a Series in the DataFrame.
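
A short sketch of what that means in practice, rebuilding the same table (the variable column_c is only an illustrative name):

```python
import pandas

headers = ["A", "B", "C", "D", "E"]
# The same 4x5 table as above, generated instead of typed out.
table = [[f"{col.lower()}{row}" for col in headers] for row in range(1, 5)]

df = pandas.DataFrame(data=table, columns=headers,
                      index=["row1", "row2", "row3", "row4"])

column_c = df["C"]                      # one column, as a Series
print(type(column_c).__name__)          # Series
print(column_c.tolist())                # ['c1', 'c2', 'c3', 'c4']

# .loc addresses a single cell by row label and column label.
print(df.loc["row3", "C"])              # c3
```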

DataFrame to dictionary (and other)

DataFrames and Series are useful, but to be able to use them in different contexts it is necessary to be able to convert them to other types and containers. Why? There are several reasons why you might want to transform them into something else. Maybe you want to use some algorithm that doesn't know anything other than the standard containers, which is sensible enough. How do we convert to a standard container?

Let's say we have our DataFrame as before, and want to convert it to a dictionary. The question becomes: how should this dictionary be laid out?

Default
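from pprint import pprint
import pandas

# Same 4x5 table as in the DataFrame section (a1..e1 up to a4..e4).
headers = ["A", "B", "C", "D", "E"]
table = [[f"{col.lower()}{row}" for col in headers] for row in range(1, 5)]

df = pandas.DataFrame(data=table, columns=headers, index=['row1','row2','row3','row4'])

pprint(df.to_dict())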
{'A': {'row1': 'a1', 'row2': 'a2', 'row3': 'a3', 'row4': 'a4'},
 'B': {'row1': 'b1', 'row2': 'b2', 'row3': 'b3', 'row4': 'b4'},
 'C': {'row1': 'c1', 'row2': 'c2', 'row3': 'c3', 'row4': 'c4'},
 'D': {'row1': 'd1', 'row2': 'd2', 'row3': 'd3', 'row4': 'd4'},
 'E': {'row1': 'e1', 'row2': 'e2', 'row3': 'e3', 'row4': 'e4'}}

This pretty much resembles the table we looked at before. There is a dictionary for each column name, which holds another dictionary for the rows. Example:

\(df['C']['row3'] \rightarrow c3\)

But that's not always what you want. By passing a string argument, to_dict(<arg>), we can change how the DataFrame is represented in the output dictionary. The following sections show the different options.

Series

Let's try series instead.

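from pprint import pprint
import pandas

# Same 4x5 table as in the DataFrame section (a1..e1 up to a4..e4).
headers = ["A", "B", "C", "D", "E"]
table = [[f"{col.lower()}{row}" for col in headers] for row in range(1, 5)]

df = pandas.DataFrame(data=table, columns=headers, index=['row1','row2','row3','row4'])


# Each column name maps to a pandas Series.
series=df.to_dict('series')
pprint(series)
pprint(f"A={series['A'].to_dict()}")
pprint(series['C']['row3'])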
{'A': row1    a1
row2    a2
row3    a3
row4    a4
Name: A, dtype: object,
 'B': row1    b1
row2    b2
row3    b3
row4    b4
Name: B, dtype: object,
 'C': row1    c1
row2    c2
row3    c3
row4    c4
Name: C, dtype: object,
 'D': row1    d1
row2    d2
row3    d3
row4    d4
Name: D, dtype: object,
 'E': row1    e1
row2    e2
row3    e3
row4    e4
Name: E, dtype: object}
"A={'row1': 'a1', 'row2': 'a2', 'row3': 'a3', 'row4': 'a4'}"
'c3'

This is almost the same as the default: there is one entry per column, but each column maps to a pandas Series instead of a dictionary of rows.

Split

split means the DataFrame is split up into three different keys in the dictionary.

columns
an array with the column names.
data
a two-dimensional array where each inner array represents a row.
index
an array with the row names.
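from pprint import pprint
import pandas

# Same 4x5 table as in the DataFrame section (a1..e1 up to a4..e4).
headers = ["A", "B", "C", "D", "E"]
table = [[f"{col.lower()}{row}" for col in headers] for row in range(1, 5)]

df = pandas.DataFrame(data=table, columns=headers, index=['row1','row2','row3','row4'])

# 'split' separates column names, row names and data.
series=df.to_dict('split')
pprint(series)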
{'columns': ['A', 'B', 'C', 'D', 'E'],
 'data': [['a1', 'b1', 'c1', 'd1', 'e1'],
          ['a2', 'b2', 'c2', 'd2', 'e2'],
          ['a3', 'b3', 'c3', 'd3', 'e3'],
          ['a4', 'b4', 'c4', 'd4', 'e4']],
 'index': ['row1', 'row2', 'row3', 'row4']}

The split orientation probably got its name from the fact that the column names, index names and data are split up under different keys. Example (the indices are zero-based):

\(series['data'][2][2] \rightarrow c3\)

index
{'row1': {'A': 'a1', 'B': 'b1', 'C': 'c1', 'D': 'd1', 'E': 'e1'},
 'row2': {'A': 'a2', 'B': 'b2', 'C': 'c2', 'D': 'd2', 'E': 'e2'},
 'row3': {'A': 'a3', 'B': 'b3', 'C': 'c3', 'D': 'd3', 'E': 'e3'},
 'row4': {'A': 'a4', 'B': 'b4', 'C': 'c4', 'D': 'd4', 'E': 'e4'}}

This resembles the default dictionary we saw before; the difference is that the outer dictionary is keyed by row instead of by column. The first key is the row name and the second is the column name. So for example \(series['row2']['D'] \rightarrow "d2"\) (row "row2", col "D").

tight
{'column_names': [None],
 'columns': ['A', 'B', 'C', 'D', 'E'],
 'data': [['a1', 'b1', 'c1', 'd1', 'e1'],
          ['a2', 'b2', 'c2', 'd2', 'e2'],
          ['a3', 'b3', 'c3', 'd3', 'e3'],
          ['a4', 'b4', 'c4', 'd4', 'e4']],
 'index': ['row1', 'row2', 'row3', 'row4'],
 'index_names': [None]}

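This closely resembles the split version: the data key holds each of the rows, columns is an array with the column names, and index is an array with the row names. In addition there are index_names and column_names, which hold the names of the index and column levels (None here).

example: \(series['data'][1][2] \rightarrow c2\) (second row, third column; the indices are zero-based)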

records
[{'A': 'a1', 'B': 'b1', 'C': 'c1', 'D': 'd1', 'E': 'e1'},
 {'A': 'a2', 'B': 'b2', 'C': 'c2', 'D': 'd2', 'E': 'e2'},
 {'A': 'a3', 'B': 'b3', 'C': 'c3', 'D': 'd3', 'E': 'e3'},
 {'A': 'a4', 'B': 'b4', 'C': 'c4', 'D': 'd4', 'E': 'e4'}]

Records gives you a list of dictionaries, where each row is represented by its position in the list, and each column by the column name. For example: \(series[1]['D'] \rightarrow d2\) (second row, col 'D').
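
To compare the orientations side by side, here is a small sketch on a made-up 2x2 DataFrame. The tight orientation is left out on purpose, since it needs a fairly recent pandas (1.4 or later, as far as I know):

```python
import pandas

df = pandas.DataFrame(data=[["a1", "b1"], ["a2", "b2"]],
                      columns=["A", "B"], index=["row1", "row2"])

# Print the same DataFrame in each orientation.
for orient in ("dict", "list", "series", "split", "records", "index"):
    print(f"{orient:8} -> {df.to_dict(orient)}")

# The same cell (second row, column B) reached through three layouts.
records = df.to_dict("records")
print(records[1]["B"])                    # b2
by_index = df.to_dict("index")
print(by_index["row2"]["B"])              # b2
split = df.to_dict("split")
print(split["data"][1][1])                # b2
```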

Series vs DataFrame

The key difference is of course that a DataFrame works with two-dimensional data, while a Series works with one-dimensional data, and for that reason a DataFrame also has a columns field which names the columns.

Another difference is that a Series is a homogeneous container, meaning all its elements share the same data type, while a DataFrame is designed to be a heterogeneous container, allowing different data types in different columns. Each column in a DataFrame is essentially a Series, and each Series can hold a different data type.

Let's make an example

Name    Age  Score  Hair    Length
Calle   51   100    Blond   1.64
Lars    47   81     Dark    1.81
Trump   78   25     Yellow  1.90
Harris  56   67     Dark    1.67
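
A sketch of how this table looks as a DataFrame, and how the per-column data types show up. The table is typed in by hand here so the snippet stands on its own:

```python
import pandas

headers = ["Name", "Age", "Score", "Hair", "Length"]
table = [
    ["Calle", 51, 100, "Blond", 1.64],
    ["Lars", 47, 81, "Dark", 1.81],
    ["Trump", 78, 25, "Yellow", 1.90],
    ["Harris", 56, 67, "Dark", 1.67],
]
df = pandas.DataFrame(data=table, columns=headers)

# Each column has its own dtype: text, integers and floats side by side.
print(df.dtypes)

# A single row cuts across the columns, so it has to fall back to object.
print(df.loc[0].dtype)   # object
```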

There are different ways we can filter and sort out fields. The following example shows two ways of doing it.

The first example (Line 6) shows how to filter on a specific column with a specific value: Score > 50 gives a boolean Series, which is then used to create a new DataFrame with the matching rows.

The next example (Line 10) shows how to create a mask by hand: only the rows where the mask is True are kept. In this case we created the list

Mask   Name    Age  Score  Hair    Length
True   Calle   51   100    Blond   1.64
False  Lars    47   81     Dark    1.81
False  Trump   78   25     Yellow  1.90
True   Harris  56   67     Dark    1.67

The outcome of this is a new DF with

Name    Age  Score  Hair   Length
Calle   51   100    Blond  1.64
Harris  56   67     Dark   1.67

as can be seen in the result.

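 1: import pandas
 2: headers = ["Name", "Age", "Score", "Hair", "Length"]
 3: table = [["Calle", 51, 100, "Blond", 1.64], ["Lars", 47, 81, "Dark", 1.81], ["Trump", 78, 25, "Yellow", 1.90], ["Harris", 56, 67, "Dark", 1.67]]
 4: df = pandas.DataFrame(data=table, columns=headers)
 5: 
 6: gt_50 = df['Score'] > 50
 7: print(f"{gt_50}\n----------")
 8: print(f">50\n {df[gt_50]}\n----------")
 9: 
10: test = [True, False, False, True]
11: print(df[test])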
0     True
1     True
2    False
3     True
Name: Score, dtype: bool
----------
>50
      Name  Age  Score   Hair  Length
0   Calle   51    100  Blond    1.64
1    Lars   47     81   Dark    1.81
3  Harris   56     67   Dark    1.67
----------
     Name  Age  Score   Hair  Length
0   Calle   51    100  Blond    1.64
3  Harris   56     67   Dark    1.67
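
Masks can also be combined. A minimal sketch, using the same table typed in by hand (the variable names are made up for the illustration):

```python
import pandas

headers = ["Name", "Age", "Score", "Hair", "Length"]
table = [
    ["Calle", 51, 100, "Blond", 1.64],
    ["Lars", 47, 81, "Dark", 1.81],
    ["Trump", 78, 25, "Yellow", 1.90],
    ["Harris", 56, 67, "Dark", 1.67],
]
df = pandas.DataFrame(data=table, columns=headers)

# Conditions are combined with & (and), | (or) and ~ (not);
# each condition needs its own parentheses.
dark_and_passed = df[(df["Score"] > 50) & (df["Hair"] == "Dark")]
print(dark_and_passed["Name"].tolist())        # ['Lars', 'Harris']

# .loc takes a row mask and a column selection in one step.
younger = df.loc[df["Age"] < 55, ["Name", "Age"]]
print(younger)
```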

But to be able to work with this we need to be able to search, filter and calculate on columns, for example.

We are not restricted to pandas methods and functions; other tools such as functools and itertools can be used as well. Any algorithm that works on iterators should be usable.

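import functools
import itertools
import operator
import pandas

# Same table as in the Series vs DataFrame example above.
headers = ["Name", "Age", "Score", "Hair", "Length"]
table = [
    ["Calle", 51, 100, "Blond", 1.64],
    ["Lars", 47, 81, "Dark", 1.81],
    ["Trump", 78, 25, "Yellow", 1.90],
    ["Harris", 56, 67, "Dark", 1.67],
]
df = pandas.DataFrame(data=table, columns=headers)


def f(lst, elem):
    """Reducer: append elem to lst and return the list."""
    lst.append(elem)
    return lst

# Predicate: keep lengths > 1.7
def filter_length(length: float) -> bool:
    return length > 1.7

# Create a list of all lengths.
length_lst = functools.reduce(f, df['Length'], [])
print(f"Lengths {length_lst}")
# Keep only the lengths > 1.7 (filter is lazy, nothing is computed yet).
length_filtered = filter(filter_length, length_lst)
# Add 20 to each of the filtered items (map is lazy as well).
def add_20(elem):
    return elem + 20
add_20_res = map(add_20, length_filtered)

# Running sum of all the ages.
it = itertools.accumulate(df['Age'], operator.add)
print(f"Sum of ages (step wise) {list(it)}")


print(list(add_20_res))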

Lengths [1.64, 1.81, 1.9, 1.67]
Sum of ages (step wise) [51, 98, 176, 232]
[21.81, 21.9]

This shows how we can use itertools and functools together with DataFrames. OK, we have dwelt enough on all this, let's go back to the focus area.
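
For comparison, the same three results can be sketched with pandas' own methods. Only the two columns that are needed are rebuilt here, and the variable names are made up:

```python
import pandas

df = pandas.DataFrame({
    "Age": [51, 47, 78, 56],
    "Length": [1.64, 1.81, 1.90, 1.67],
})

# All lengths as a plain list.
lengths = df["Length"].tolist()

# Filter on length > 1.7 and add 20, in one vectorised expression.
tall_plus_20 = df.loc[df["Length"] > 1.7, "Length"] + 20

# Running sum of the ages.
running_age = df["Age"].cumsum().tolist()

print(lengths)
print(tall_plus_20.tolist())
print(running_age)          # [51, 98, 176, 232]
```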

Reading in values

So far we have just constructed DataFrames from inline tables. But now we need to focus on getting the data from Open-Meteo. We saw previously how that can be done with curl. Let's do this with requests instead.

So let's dig into requests

Long   Lat
11.95  57.707
import requests
import pandas
from pprint import pprint
url = "https://api.open-meteo.com/v1/forecast"
params = {
    "latitude": 57.707,
    "longitude": 11.95,
    "hourly": "temperature_2m,apparent_temperature,rain,showers,wind_speed_10m,wind_direction_180m",
    "wind_speed_unit": "ms",
    "forecast_days": 1
}

response = requests.get(url, params=params)
df=pandas.DataFrame(response.json())
print(df)
print(df.loc['temperature_2m'])



                      latitude  ...                                             hourly
time                  57.70479  ...  [2024-12-04T00:00, 2024-12-04T01:00, 2024-12-0...
temperature_2m        57.70479  ...  [-3.2, -3.5, -3.6, -3.8, -3.6, -3.7, -3.1, -2....
apparent_temperature  57.70479  ...  [-8.2, -8.7, -8.7, -8.9, -8.8, -8.9, -8.1, -7....
rain                  57.70479  ...  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
showers               57.70479  ...  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
wind_speed_10m        57.70479  ...  [4.0, 4.2, 3.9, 4.1, 4.2, 4.1, 3.9, 3.7, 3.4, ...
wind_direction_180m   57.70479  ...  [64, 68, 75, 80, 86, 87, 89, 85, 93, 96, 99, 9...

[7 rows x 9 columns]
latitude                                                          57.70479
longitude                                                        11.956985
generationtime_ms                                                 0.049949
utc_offset_seconds                                                       0
timezone                                                               GMT
timezone_abbreviation                                                  GMT
elevation                                                              2.0
hourly_units                                                            °C
hourly                   [-3.2, -3.5, -3.6, -3.8, -3.6, -3.7, -3.1, -2....
Name: temperature_2m, dtype: object

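This needs some explanation. First of all, we used the Python module requests to get the data, which, as we saw earlier in Data source, arrives as JSON. This is fed into the DataFrame constructor. Because the top-level JSON keys become the columns, scalar fields such as latitude are repeated on every row, and the hourly column holds one complete list per variable. Now we can start using it by extracting data.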
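
A table with one row per hour is usually easier to work with. One way to get there is to build the DataFrame from the hourly part of the response only. The sketch below uses a hand-written sample shaped like the response, with invented values and only three hours, so it runs without network access:

```python
import pandas

# Hand-written sample shaped like the API response; the values are invented.
payload = {
    "latitude": 57.70479,
    "longitude": 11.956985,
    "hourly_units": {"time": "iso8601", "temperature_2m": "°C", "rain": "mm"},
    "hourly": {
        "time": ["2024-12-04T00:00", "2024-12-04T01:00", "2024-12-04T02:00"],
        "temperature_2m": [-3.2, -3.5, -3.6],
        "rain": [0.0, 0.0, 0.0],
    },
}

# One column per weather variable, one row per hour.
hourly = pandas.DataFrame(payload["hourly"])

# Turn the ISO 8601 strings into timestamps and use them as the row index.
hourly["time"] = pandas.to_datetime(hourly["time"])
hourly = hourly.set_index("time")

print(hourly)
print(hourly["temperature_2m"].min())   # -3.6
```

With a real response, payload would simply be response.json().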

Obviously we don't have any error checking or anything. So let's clean it up somewhat, thinking in terms of actions, data and computations. The request data:

  • Data:
    • We have the URL, obviously
    • Position
    • Weather variables… Maybe a dictionary lookup
    • Unit types (wind speed in m/s for example, temperature unit, time format)
    • forecast_days

Let's construct this data as preliminary work.

The outline is quite simple

+--------+        +----------+       +---------+       +----------+       +---------+
| Weather|------->|Send data |------>|Read     |------>|Transform |------>|Transform|
|  Data  |        | to meteo |       |Response |       |DataFrame |       | Graph   |
+--------+        +----------+       +---------+       +----------+       +---------+

Using Matplotlib

from enum import Enum
import pandas
from pymonad.maybe import Maybe, Just, Nothing
import requests
import pprint
import json
import matplotlib.pyplot as plt

class PositionData:
    def __init__(self, latitude, longitude):
        self.lat = latitude
        self.lon = longitude
     
    def __eq__(self, other):
        return isinstance(other, PositionData) and self.lat == other.lat and self.lon == other.lon
     
    def __repr__(self):
        return f"PositionData(lat={self.lat}, lon={self.lon})"
     
class WindSpeed(Enum):
    MS = 'ms'
    KMH = 'kmh'
    MPH = 'mph'
    KNOTS = 'knots'

class Temperature(Enum):
    FAHRENHEIT = 'fahrenheit'
    CELSIUS = 'celsius'

class TimeFormat(Enum):
    UNIX = 'unixtime'
    ISO8601 = 'ISO8601'

class PrecipitationType(Enum):
    MM = 'mm'
    INCH = 'inch'

class UnitTypes:
    def __init__(self,
                 wind_speed: WindSpeed = WindSpeed.MS,
                 temperature: Temperature = Temperature.CELSIUS,
                 time_format: TimeFormat = TimeFormat.UNIX,
                 precipitation: PrecipitationType = PrecipitationType.MM):
     
        self.wind_speed = wind_speed
        self.temperature = temperature
        self.time_format = time_format
        self.precipitation = precipitation
     
    def __str__(self):
        return (f'Wind Speed: {self.wind_speed.value}, '
                f'Temperature: {self.temperature.value}, '
                f'Time Format: {self.time_format.value}, '
                f'Precipitation: {self.precipitation.value}')
         
         
def create_unit_params(unit_types: UnitTypes):
    return {
        "wind_speed_unit": unit_types.wind_speed.value,
        #"timeformat": unit_types.time_format.value,
        "precipitation_unit": unit_types.precipitation.value,
        "temperature_unit": unit_types.temperature.value}
     
     
class WeatherData:
    def __init__(self, position: PositionData,
                 parameters: list[str],
                 types: UnitTypes,
                 forcast_days: int = 1):
        self.variables = parameters
        self.types = types
        self.position = position
        self.forcast_days = forcast_days
     
     
    def __repr__(self):
        return (f"WeatherData(variables={self.variables}, "
                f"types={self.types})")
         
    def __str__(self):
        return (f"Weather Data:\n"
                f"Position {self.position}\n"
                f"Variables: {', '.join(self.variables)}\n"
                f"Types: {self.types}\n"
                f"Forecast days: {self.forcast_days}")
         
         
def create_weather_params(weather: WeatherData):
    return {
        "latitude": weather.position.lat,
        "longitude": weather.position.lon,
        "hourly": weather.variables,
        "forecast_days": weather.forcast_days,
    }

def create_request_params(weather: WeatherData):
    weather_params = create_weather_params(weather)
    unit_params = create_unit_params(weather.types)
    return {**weather_params, **unit_params}

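def send_request(url: str):
    """Return a function that fetches the forecast for a WeatherData request."""
    def send_fn(weather: WeatherData) -> Maybe:
        params = create_request_params(weather)
        # TLS verification stays at its default (on); the timeout avoids hanging.
        response = requests.get(url, params=params, timeout=10)
        if response.status_code == 200:
            return Just(response.json())
        return Nothing

    return send_fn

def transform_to_df(payload: dict) -> Maybe:
    """Wrap the decoded JSON payload (a dict) in a DataFrame inside a Just."""
    return Just(pandas.DataFrame(payload))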

weather_data = WeatherData(PositionData(57.707,11.95),
                           ["temperature_2m",
                            "apparent_temperature",
                            "rain",
                            "showers",
                            "wind_speed_10m",
                            "wind_direction_180m"],
                           UnitTypes(),
                           2
                           )
     
api_ret_json = Just(weather_data).bind(send_request("https://api.open-meteo.com/v1/forecast")) \
                                 .bind(transform_to_df)
                                 
df = api_ret_json.value

x_axis = df.loc['time']['hourly']
y_axis = df.loc['temperature_2m']['hourly']
rain = df.loc['rain']['hourly']
wind = df.loc['wind_speed_10m']['hourly']

plt.figure(figsize=(10, 5))
plt.plot(x_axis, y_axis, marker='o', color='blue', label='Temperature 2m')
plt.bar(x_axis, rain, alpha=0.5, color='orange', label='rain')
plt.bar(x_axis, wind, alpha=0.5, color='red', label='Wind')

plt.title('Temperature and rain over Time')
plt.xlabel('Time')
plt.ylabel('Values')
plt.xticks(x_axis, rotation=45)
plt.legend()
plt.tight_layout()

plot_path = 'temperature_rain_plot.png'
plt.savefig(plot_path)
plt.close()

plot_link = f'[[file:{plot_path}]]'
print(plot_link)

So we managed to get a nice-looking graph of the temperature. But we want more! Let's think about it: we can see the temperature graph, and we could fetch more variables and add more graphs. Graphs are a beautiful way of conveying a lot of detailed information, but maybe we just want an overview. What will the weather be like today, in summary: is it going to rain, what clothes should I wear? Something else is needed here.
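Before reaching for anything fancier, a plain pandas overview is already possible. The following is a minimal sketch, assuming the data has the shape of the "hourly" part of the Open-Meteo response; the sample values and the name daily_summary are made up for illustration.

```python
import pandas as pd

# Hypothetical sample shaped like the "hourly" part of an Open-Meteo response.
sample_hourly = {
    "time": ["2024-10-05T00:00", "2024-10-05T01:00", "2024-10-05T02:00",
             "2024-10-06T00:00", "2024-10-06T01:00"],
    "temperature_2m": [12.0, 11.5, 11.0, 9.0, 8.5],
    "rain": [0.0, 0.2, 0.0, 0.0, 0.0],
    "wind_speed_10m": [4.1, 5.0, 6.2, 3.0, 2.5],
}


def daily_summary(hourly: dict) -> pd.DataFrame:
    """Collapse hourly values into one row per calendar day."""
    frame = pd.DataFrame(hourly)
    frame["time"] = pd.to_datetime(frame["time"])
    per_day = frame.groupby(frame["time"].dt.date).agg(
        temp_min=("temperature_2m", "min"),
        temp_max=("temperature_2m", "max"),
        rain_total=("rain", "sum"),
        wind_max=("wind_speed_10m", "max"),
    )
    # A day counts as rainy if any rain at all is forecast.
    per_day["rainy"] = per_day["rain_total"] > 0
    return per_day


summary = daily_summary(sample_hourly)
print(summary)
```

This answers "is it going to rain" with one boolean per day, but it says nothing about what to wear, which is where a language model might help.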

Using AI

AI is the word of the present and the future (I guess). I "need" to use it. And what better way than trying to turn this graph into a nice summary of today's weather, at least to start with.

So where do we start? First we need to break it down: what do we want?

       +-----------------------------+
 Pos   | Transform   Pos, Date       |     summary
------>| To a 24h summary and graph  |------------>
 Date  |                             |
------>|                             |
       +-----------------------------+

I'll stick to what I already have, but let's first focus on reaching an AI host.

Simple AI query

There are numerous ways of talking to a Large Language Model. Some libraries are well known, such as langchain or llamaindex, and there are many others. These libraries are easy to use and good in many ways, but the core is the same: somewhere a request is made to an API. The API may differ between providers and models, but let's not go there in this document. For now I want to make it clear how this can be done with just the requests module.

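import json
import os

import requests

# Assumption: the OpenRouter key is exported as OPENROUTER_API_KEY
# (the variable name is a placeholder, pick whatever fits your setup).
API_KEY = os.environ["OPENROUTER_API_KEY"]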

response = requests.post(
  url="https://openrouter.ai/api/v1/chat/completions",
  headers={
    "Authorization": f"Bearer {API_KEY}",
  },
  data=json.dumps({
    "model": "mistralai/ministral-8b",
    "messages": [
        {"role": "system", "content": "Ha Ha Ha! Step right up, folks!" \
         "The Clown Prince of Crime turns..." \
         "Meteorologist? 😜 "
         },
        {
            "role": "user",
            "content": "Why does it rain?"
        }
     
    ]
  })
)

# Check if the request was successful
if response.status_code == 200:
    data = response.json()
    print("ID:", data['id'])
    print("Content:")
    for choice in data['choices']:
        print(choice['message']['content'])
else:
    print("Failed to retrieve data. Status code:", response.status_code)


ID: gen-1733348452-mmv3W7JcICL0o1pdYHNc
Content:
Ah, the humor! So, in my other life (when I'm not juggling boleholopes, skateboarding on a unicycle, or causing general mayhem), I'm guessing you're wondering about the science behind why it rains.

Well, folks, it all begins with a simple process called evaporation. The sun heats up the oceans, lakes, and rivers, causing water to turn into water vapor - this is like the water dancing and twirling under the sun's warm gaze!

These tiny water drops and ice crystals are lighter than air and ride the winds high into the sky, creating clouds. Sometimes, these clouds gather and bump into each other, like a thousand tiny dancers at a celestial ballet under the gaslit sky!

Eventually, they bump into each other so much that they collide and stick together, becoming bigger and heavier. These become the raindrops (or sometimes, they're like poetic snowflakes, melting when they hit the ground). When they become too heavy for the clouds to hold, they tumble down, like tiny dancers singing "it's raining men" as they join the world below.

So there you have it, folks! Rain may seem silly or annoying at times, but it's really just nature's way of giving us a big ol' water party, and dancing on the ground and in the oceans! 💧🐳

OK, somewhat overdramatic, but we get the idea, and for this exercise it does not really matter. We just wanted an answer and we got one.

The idea now is to merge the two ideas together. But again, let's clean this up and introduce some more abstractions.

First, what data are we using?

  • Url
  • Header
  • model
  • messages
    • system
    • user
    • Other (we might have more)

If we now group them together, we have Url, Header and model in the communication data, ComData, and then we have the data that is more related to the actual conversation, ConvData. Now to the data that we need to summarize: in this case it's some JSON data. AI knows how to interpret JSON data, so that works; no need to transform it into something else.

The idea is to take the data from Open-Meteo, send it to the AI, which produces a summary, and then publish it. We haven't discussed the publishing part yet; that's another part.
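Since everything sent to the model costs tokens, it may be worth trimming the data first. A minimal sketch, assuming the payload has the same shape as the Open-Meteo response shown earlier; the function name hourly_to_prompt_json and the sample values are hypothetical.

```python
import json

import pandas as pd


def hourly_to_prompt_json(payload: dict, keep_hours: list[int]) -> str:
    """Keep only the rows for the given hours of the day, as JSON records."""
    frame = pd.DataFrame(payload["hourly"])
    hours = pd.to_datetime(frame["time"]).dt.hour
    return frame[hours.isin(keep_hours)].to_json(orient="records")


# Hypothetical payload with the same layout as the API response.
sample_payload = {
    "hourly": {
        "time": ["2024-10-05T07:00", "2024-10-05T08:00",
                 "2024-10-05T09:00", "2024-10-05T17:00"],
        "temperature_2m": [11.8, 12.4, 12.7, 12.9],
    }
}

prompt_json = hourly_to_prompt_json(sample_payload, keep_hours=[8, 17])
print(json.loads(prompt_json))
```

Each record then carries its own timestamp next to its values, which should be easier for a model to read than parallel lists.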

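#!/usr/bin/env python
import json
import os
import pprint
from enum import Enum

import matplotlib.pyplot as plt
import pandas
import requests
from pymonad.maybe import Maybe, Just, Nothing

# Assumption: the OpenRouter key is exported as OPENROUTER_API_KEY
# (the variable name is a placeholder, not something OpenRouter mandates).
API_KEY = os.environ["OPENROUTER_API_KEY"]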


class Headers:
    def __init__(self, headers_dict: dict):
        if not isinstance(headers_dict, dict):
            raise TypeError("Headers must be initialized with a dictionary")
        self.fields = headers_dict
     
    def __repr__(self):
        return f"Headers({self.fields})"
     
    @classmethod
    def from_dict(cls, headers_dict: dict):
        return cls(headers_dict)
     
     
class ComData:
    def __init__(self, *, url: str, model: str, headers: Headers):
        if not isinstance(headers, Headers):
            raise TypeError("Headers must be an instance of Headers class")
        self.url = url
        self.headers = headers
        self.model = model
     
     
class ConversationData:
    def __init__(self, *, query: str, system: str, data: pandas.DataFrame):
        self.user = query
        self.system = system
        self.data = data
     
    def __repr__(self) -> str:
        return f"ConversationData(user='{self.user}', system='{self.system}')"
     
    def make_message(self) -> list[dict[str, str]]:
        return [
            {
                "role": "system",
                "content": self.system + "\n\nHere's the weather data for analysis:"
            },
            {
                "role": "system",
                "content": "The following message contains weather forecast data. "
                "Each dictionary in the list represents a different weather "
                "parameter, with the 'hourly' key containing a list of values, "
                "one for each hour of the forecast."
            },
            {
                "role": "user",
                # Separate the question from the JSON so the two do not run together.
                "content": self.user + "\n\n" + self.data.to_json()
            }]
         
         
def conversation_to_json(conv_data: ConversationData):
    return json.dumps(conv_data.make_message())

def make_data(com_data, conv_data):
    msg = {
        "model": com_data.model,
        "messages": conv_data.make_message()
    }
    return msg



def ai_send_fn(com_data: ComData):
    """Return a function that posts a conversation to the configured model."""
    def send_fn(conv_data: ConversationData) -> Maybe:
        data = make_data(com_data, conv_data)
        response = requests.post(com_data.url, json=data,
                                 headers=com_data.headers.fields, timeout=60)
        if response.status_code == 200:
            return Just(response)
        # Not a Just, but the failed response is kept for inspection.
        return Maybe(value=response, monoid=False)

    return send_fn

class PositionData:
    def __init__(self, latitude, longitude):
        self.lat = latitude
        self.lon = longitude
     
    def __eq__(self, other):
        return isinstance(other, PositionData) and self.lat == other.lat and self.lon == other.lon
     
    def __repr__(self):
        return f"PositionData(lat={self.lat}, lon={self.lon})"
     
     
class WindSpeed(Enum):
    MS = 'ms'
    KMH = 'kmh'
    MPH = 'mph'
    KNOTS = 'knots'

class Temperature(Enum):
    FAHRENHEIT = 'fahrenheit'
    CELSIUS = 'celsius'

class TimeFormat(Enum):
    UNIX = 'unixtime'
    ISO8601 = 'ISO8601'

class PrecipitationType(Enum):
    MM = 'mm'
    INCH = 'inch'

class UnitTypes:
    def __init__(self,
                 wind_speed: WindSpeed = WindSpeed.MS,
                 temperature: Temperature = Temperature.CELSIUS,
                 time_format: TimeFormat = TimeFormat.UNIX,
                 precipitation: PrecipitationType = PrecipitationType.MM):
     
        self.wind_speed = wind_speed
        self.temperature = temperature
        self.time_format = time_format
        self.precipitation = precipitation
     
    def __str__(self):
        return (f'Wind Speed: {self.wind_speed.value}, '
                f'Temperature: {self.temperature.value}, '
                f'Time Format: {self.time_format.value}, '
                f'Precipitation: {self.precipitation.value}')
         
         
def create_unit_params(unit_types: UnitTypes):
    return {
        "wind_speed_unit": unit_types.wind_speed.value,
        #"timeformat": unit_types.time_format.value,
        "precipitation_unit": unit_types.precipitation.value,
        "temperature_unit": unit_types.temperature.value}
     
     
class WeatherData:
    def __init__(self, position: PositionData,
                 parameters: list[str],
                 types: UnitTypes,
                 forcast_days: int = 1):
        self.variables = parameters
        self.types = types
        self.position = position
        self.forcast_days = forcast_days
     
     
    def __repr__(self):
        return (f"WeatherData(variables={self.variables}, "
                f"types={self.types})")
         
    def __str__(self):
        return (f"Weather Data:\n"
                f"Position {self.position}\n"
                f"Variables: {', '.join(self.variables)}\n"
                f"Types: {self.types}\n"
                f"Forecast days: {self.forcast_days}")
         
         
         
         
         
def create_weather_params(weather: WeatherData):
    return {
        "latitude": weather.position.lat,
        "longitude": weather.position.lon,
        "hourly": weather.variables,
        "forecast_days": weather.forcast_days,
    }

def create_request_params(weather: WeatherData):
    weather_params = create_weather_params(weather)
    unit_params = create_unit_params(weather.types)
    return {**weather_params, **unit_params}





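def send_request(url: str):
    """Return a function that fetches the forecast for a WeatherData request."""
    def send_fn(weather: WeatherData) -> Maybe:
        params = create_request_params(weather)
        # TLS verification stays at its default (on); the timeout avoids hanging.
        response = requests.get(url, params=params, timeout=10)
        if response.status_code == 200:
            return Just(response.json())
        return Nothing

    return send_fn

def transform_to_df(payload: dict) -> Maybe:
    """Wrap the decoded JSON payload (a dict) in a DataFrame inside a Just."""
    return Just(pandas.DataFrame(payload))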

###############################################################################
#                       Here we set the values we want.                       #
###############################################################################
headers = Headers({
    "Authorization": f"Bearer {API_KEY}",
    # "Content-Type": "application/json"
})

comdata = ComData(
    url="https://openrouter.ai/api/v1/chat/completions",
    headers=headers,
    model="google/gemini-flash-1.5"
)

weather_data = WeatherData(PositionData(57.707,11.95),
                           ["temperature_2m",
                            "apparent_temperature",
                            "rain",
                            "showers",
                            "wind_speed_10m",
                            "wind_direction_180m"
                            ],
                           UnitTypes(),
                           forcast_days=3
                           )
api_ret_json = Just(weather_data).bind(send_request("https://api.open-meteo.com/v1/forecast")) \
                                 .bind(transform_to_df)
                                 
df = api_ret_json.value



context = """
This weather data contains the following information:
1. Time: Hourly timestamps
2. Temperature (2m above ground): in °C
3. Apparent temperature: in °C
4. Rain: precipitation in mm
5. Showers: precipitation from showers in mm
6. Wind speed (10m above ground): in m/s
7. Wind direction (180m above ground): in degrees

The 'hourly' key in each dictionary contains a list of values corresponding to these measurements for each hour of the day.
The user message will contain JSON with the hourly weather data. Analyse the data and answer the user's question.
"""

conv_data = ConversationData(
    query="My name is Carl. I will take my bike every morning and home every evening," \
    " between 08:00-09:00 and going home at 17:00-18:00. Give me a summary of the weather forecast." \
    " I head west in the mornings and east in the evenings; I want to know if I'm heading into the wind or have a tailwind." \
    " Also give me suggestions for clothing during my ride and tell me if there are any significant changes during the period. ",
    system=context,
    data=df
)


send_fn = ai_send_fn(comdata)
maybe_response = send_fn(conv_data)
if maybe_response.is_just():
    data = maybe_response.value.json()
    content = data['choices'][0]['message']['content']
    print(content)
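The headwind question does not strictly need a language model; it is plain trigonometry. A minimal sketch, assuming the wind direction follows the usual meteorological convention (the direction the wind blows from, in degrees clockwise from north). The function name and the sample values are illustrative only.

```python
import math


def headwind_component(wind_speed: float, wind_from_deg: float,
                       heading_deg: float) -> float:
    """Wind speed along the direction of travel.

    Positive means headwind, negative means tailwind; the unit is the same
    as wind_speed. Assumes wind_from_deg is where the wind comes FROM.
    """
    angle = math.radians(wind_from_deg - heading_deg)
    return wind_speed * math.cos(angle)


# Riding west (270 degrees) with wind from the south-west: positive, a headwind.
morning = headwind_component(5.2, 226, 270)
# Riding east (90 degrees) with wind from the west-south-west: negative, a tailwind.
evening = headwind_component(6.4, 251, 90)

print("morning:", "headwind" if morning > 0 else "tailwind")
print("evening:", "headwind" if evening > 0 else "tailwind")
```

A value like this could be computed up front and handed to the model as a fact, so it does not have to do the arithmetic itself. Note that the script above requests wind speed at 10 m but direction at 180 m, so the two may not match exactly.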


TTS

We have generated some text, which is good, and we could also print some graphs, which is nice. The problem is that for all these things one needs to open some kind of web page to retrieve the information, then read it and interpret what to do. I for one don't have the time in the morning; I just want someone to tell me what will happen today. A brief, nice summary of today's weather so I know what to expect. I guess I'm talking about TTS (text-to-speech).

   cat << EOF > /tmp/test.txt

  Hi Carl, here's a summary of your bike ride weather forecast for October 22nd and 23rd, considering your morning and evening commutes between 8:00-9:00 am and 5:00-6:00 pm, and your orientation to wind direction.

**Morning Commute (8:00-9:00 am):**

* **Temperature:**  Around 12.5°C (average of 12.4°C and 12.7°C on the 22nd and 23rd respectively).  Apparent temperature will be slightly lower around 10°C.
* **Wind:**  The wind speed will be approximately 5.2 m/s to 5.4 m/s on both days.  Wind direction is around 226° to 231°(Morning).  Since you're heading west, this means you'll experience a **headwind** in the mornings.

**Evening Commute (5:00-6:00 pm):**

* **Temperature:** Temperatures will be around 12.9°C and 12.7°C (average of 12.9°C and 12.7°C on the 22nd and 23rd respectively). Apparent temperature will be around 9.9°C and 9.5°C.
* **Wind:** The wind speed will be around 6.4 m/s and 5.8 m/s in the evenings. The wind direction is around 251° and 246°. As your heading is east, you'll have a **tailwind** in the evenings.

**Overall:**

* **Temperature:** Expect mild temperatures throughout your commutes, but it might feel a tad cooler due to the wind chill.
* **Precipitation:** No rain is predicted during your commute hours.
* **Clothing Suggestions:** Layers are your friend! Start with a base layer (thermal top and bottom if it feels particularly cold), add a mid-layer (fleece or light jacket) and a light windbreaker or waterproof shell for protection against the wind(especially in the morning.


**Significant Changes During the Period:**

There is a moderate change in temperature in the evenings of both
days. The wind will increase in speed between your morning and evening
commutes, although it will be a tailwind in the evenings.  There's a
light rain of 0.2 mm at 4pm on the 22nd.


Remember to check the specific forecast closer to your ride time for the most up-to-date information. Have a pleasant ride, Carl!


EOF

  gtts-cli -f /tmp/test.txt -t com.au  | play -t mp3 -
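The same step can be sketched from Python with the gtts package that provides gtts-cli. The summary contains markdown markers (the double asterisks and bullets) that are better removed before they are read aloud. This is only a sketch: strip_markdown and speak_to_file are names made up here, the output path is a placeholder, and the gTTS call needs network access.

```python
import re


def strip_markdown(text: str) -> str:
    """Remove bold markers and leading bullets so they are not spoken."""
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)                # **bold** -> bold
    text = re.sub(r"^\s*\*\s+", "", text, flags=re.MULTILINE)   # leading "* "
    return text.strip()


def speak_to_file(text: str, path: str = "/tmp/forecast.mp3") -> str:
    """Save the text as speech using gTTS (requires gtts and a network)."""
    from gtts import gTTS  # imported lazily so strip_markdown works without it
    gTTS(strip_markdown(text), lang="en", tld="com.au").save(path)
    return path


sample = "* **Precipitation:** No rain is predicted during your commute hours."
print(strip_markdown(sample))
```

The resulting file can then be played with the same play command as above.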


I will stop here for now. There are better ways of doing text-to-speech, but I will leave that for another exercise.

Links

Date: 2024-10-25 Fri 00:00

Author: Calle Olsen

Created: 2024-12-04 Wed 22:40
